Abstract
In this talk, we will discuss the Circonus telemetry platform’s architectural evolution from bare metal to Google Cloud and the lessons learned from various design failures. As SREs, we use Service Level Objectives (SLOs) to manage Circonus, and we have repeatedly relearned lessons about the complexities of handling data at scale. Along the way we’ve committed several ‘crimes against computing’, and in this session we walk through the system’s architecture from its inception to today. Our narrative focuses on significant design missteps along this path and the corrections that followed, concluding with insights that may seem straightforward yet are often overlooked by engineers. Join us as we explore challenges, solutions, and conclusions that may help you avoid similar pitfalls in engineering design.
Table of Contents
- Introduction
- Context and System Overview
- Message Queues and Architecture Evolution
- Operational Challenges with RabbitMQ
- Transition to Custom Message Queue and Cloud Considerations
- Storing and Managing Massive Data Volumes
- Database Development and System Reliability
- Reflections on Analytics Tooling
- Stream Analytics and Architectural Lessons
- Redesign and Integration Efficiency
- Observability and System Operations
- High Availability and Deployment
- Final Lessons and Takeaways
Transcript
[00:43] A real quick context: what do we do? Circonus is a metrics and telemetry platform for monitoring operations. We take in measurements of things and make pictures out of that data, like graphs and heat maps. We do offline analytics and stream-based analytics for millisecond-level real-time alerting. That sounds simple until you realize that billions of unique things are measured, and sometimes millions of measurements per second come in. You can already see that these systems end up handling on the order of 10 to the 18th measurements.
[01:23] On top of that, things can disconnect for long periods because store-and-forward networks are challenging, and we run it as a SaaS, so it’s multi-tenant. Now, that simple “I take data and make pictures” turns into “Oh my God, why did I take this job?” We want to run through four parts of the system that we designed wrong and the things we did to correct those. We’ll sum up with some conclusions that might seem obvious or controversial, but there are things we lose sight of as engineers because we’re always trying to get better or assume we won’t make the same mistakes.
[02:11] I’m going to dive right in. This is sort of what our architecture looks like, except a lot of these nodes actually exist as twenty, or two, or ten instances. That’s more or less what our architecture looks like today. Message queues—how many people here use message queues for something? Our architecture started in 2010, before microservices were really popular, so it’s more macro-service-oriented. It’s not one API endpoint per binary running in a Kubernetes container, because Go and Kubernetes didn’t exist then.
[03:02] We did componentize the architecture and have it communicate over the network, sometimes directly over things like REST and other times over message queues. Typically, in message queues, components learn of things through control messages. A component will say, “Hey, I just changed this user,” or “I created this user,” so anybody who cares should probably do what they need to do about this. The rest of the architecture subscribes to those events, learns about it, and reacts.
[03:55] The other thing you might do over a message queue is say, “I have this piece of data; I took this measurement.” When you put two things over the same substrate and one of them is very active, this happens. If anybody’s ever seen anybody use RabbitMQ, there we go. Anybody ever have it absolutely and utterly fail on you? I have only one problem with RabbitMQ: the pathology of its failure is incredibly hard to diagnose. It works beautifully until it doesn’t, and then when it doesn’t work, it’s incredibly difficult to understand why. [04:22] To triage, what do you do? You do the same thing that everyone has ever done in every network: you separate your control plane and your data plane. Then at least the control plane is not on fire anymore. If we overload the data plane, we might start dropping or delaying data from devices, but the consistency and state of the system remain intact. That’s because the low-volume control queue can be isolated and its quality of service controlled. It seems pretty obvious what you would do next.
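To make that separation concrete, here is a trivial sketch; the exchange names and the publish stub are hypothetical, not FQ’s API. The point is only that control messages and measurements take different routes with different quality-of-service expectations.

```c
/* Sketch only: routing control vs. data messages to separate exchanges.
 * The exchange names and publish() stub are hypothetical, not FQ's API. */
#include <stdio.h>

typedef enum { MSG_CONTROL, MSG_DATA } msg_kind_t;

/* Stand-in for a real message-queue publish call. */
static void publish(const char *exchange, const char *payload) {
  printf("-> [%s] %s\n", exchange, payload);
}

static void route(msg_kind_t kind, const char *payload) {
  /* Low-volume, state-changing events go to an isolated control exchange
   * so a flood of measurements can never starve them. */
  if (kind == MSG_CONTROL) publish("control", payload);
  else                     publish("data", payload);
}

int main(void) {
  route(MSG_CONTROL, "user 42 updated");
  route(MSG_DATA, "metric=cpu.idle value=97.2 ts=1510000000");
  return 0;
}
```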
[05:04] The next step is to perhaps extinguish the flames. To do that, we actually built our own message queue. You might ask why. We didn’t use Kafka because it was 2010, and Kafka didn’t really exist yet. The lesson here is that things you built in the past are almost always more optimized for your problem, yet still a poor choice for what you’re doing today. Our message queue works better than Kafka for our workload, but if I were building it today, I would use Kafka. It has a large community, it’s stable, and it benefits from being open source.
[05:59] FQ is also open source. Anyone want to join in and use our unpopular message queue? You’re welcome to. It’s fast and written in C, so 40-gigabit wire speed is not a problem. Moving away from that message queue is interesting: we moved into the cloud recently, and the idea of changing our architecture to use cloud-based pub/sub was nuts. We ran two FQ instances and were able to push around a million sustained messages per second through each one. These messages are complex, with hundreds or thousands of telemetry points each, so you’re talking about millions or billions of telemetry points per second.
[06:54] It costs somewhere between $15 and $25 a month to run that on our own hardware in a data center. When you look at pub/sub in Google or Amazon, it’s more like $80,000 a month, which is quite different from $15 to $25 a month. It’s important to know that all these design decisions are based on the software available and what’s good enough at the time. We replaced this, and now nothing’s on fire. The horrible design decision was putting the data plane and the control plane on the same substrate.
[07:31] We always track the latency of every message going through the system, even if it’s a million a second. We were able to see when we did our deployment launch that all of our message latencies went down. The long tail went up a little, but the average went down, showing a shift down, which is fantastic. Messages are flowing into the system, and we need to store them. We stored our messages in Postgres. Anybody here use Postgres? Almost everybody, that’s great. I’ve been part of the Postgres community for a long time, and it’s encouraging to see more hands go up at every conference.
[08:16] Postgres screwed us hard, but I still love it very much. We still use it for parts of the system, but we basically turned Postgres into a column store, similar to how TimescaleDB works. The problem was that we weren’t really pushing the boundaries of Postgres’s storage model or query model. [08:37] We were pushing the boundaries of Postgres’s operational model. In large systems, nodes are always failing, and in telemetry systems, ingest never stops. If you’re getting an average of a million messages per second and have five minutes of downtime, the data doesn’t just disappear. If we have a five-minute outage, it’s five minutes missing in every graph for every customer, which would be unacceptable. So, if you’re down for five minutes, you have to catch up.
[09:27] To improve operational and storage efficiency, we defined our own storage format. We used genetic algorithms to design a structured layout for aggregate data and optimized it for LZ4 compression. We used LMDB-backed storage for that data, which is a memory-mapped embedded database. We extensively used the filesystem, and it takes about eight bits for any given data point in the system, which is efficient. We also store histograms in the system and have been doing so for about seven years.
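As a loose sketch of that idea (the record layout below is invented for illustration and assumes liblz4 is available; it is not the actual on-disk format), a fixed, structured rollup layout compresses very well with LZ4:

```c
/* Sketch: a structured rollup record compressed with LZ4.
 * The fields below are illustrative; the real format differs. */
#include <stdio.h>
#include <stdint.h>
#include <lz4.h>

typedef struct {
  uint32_t period_start;  /* aligned rollup window start (epoch seconds) */
  float    min, max, avg; /* aggregates for the window */
  uint32_t count;         /* number of raw samples in the window */
} rollup_t;

int main(void) {
  rollup_t rows[512];
  for (int i = 0; i < 512; i++) {
    rows[i].period_start = 1500000000u + 60u * (uint32_t)i;
    rows[i].min = 1.0f; rows[i].max = 2.0f; rows[i].avg = 1.5f;
    rows[i].count = 60;
  }
  char dst[LZ4_compressBound(sizeof rows)];
  int clen = LZ4_compress_default((const char *)rows, dst, sizeof rows, sizeof dst);
  printf("raw=%zu bytes compressed=%d bytes\n", sizeof rows, clen);
  return 0;
}
```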
[10:31] We decided to build our own database to ensure zero points of failure and recovery without operational intervention. When systems crashed, we had people working to bring them back online, which was cumbersome. We could either automate that or build a system that didn’t suffer from those problems. So, why not use InfluxDB, Cassandra, or others? In 2010, Cassandra was an okay choice but had operational problems. We realized they solved problems we didn’t have.
[11:00] These databases solve consistency and coordination problems using algorithms like Paxos or Raft. However, when you store only telemetry data, every operation can be made commutative and idempotent, meaning you don’t need consensus to end up eventually consistent. Messages can arrive in order, out of order, or duplicated, and you’ll still get the same result. This led us to write our own database that doesn’t try to solve those other problems, which makes it really fast.
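A tiny sketch of what that property buys you (hypothetical code, not IronDB’s): if each write is resolved by the same deterministic rule regardless of arrival order or duplication, replicas converge without any consensus protocol.

```c
/* Sketch: an idempotent, commutative write rule for telemetry.
 * Applying the same writes in any order, any number of times,
 * converges to the same state -- so no Paxos/Raft is needed. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
  uint64_t ts;         /* measurement timestamp (key) */
  uint64_t generation; /* writer-supplied generation for deterministic ties */
  double   value;
  int      present;
} cell_t;

/* Keep the write with the highest generation; re-applying a write that is
 * already present changes nothing (idempotent), and order does not matter
 * (commutative). */
static void apply(cell_t *c, uint64_t ts, uint64_t gen, double value) {
  if (!c->present || gen > c->generation) {
    c->ts = ts; c->generation = gen; c->value = value; c->present = 1;
  }
}

int main(void) {
  cell_t a = {0}, b = {0};
  /* Node A sees writes in order; node B sees them duplicated and reversed. */
  apply(&a, 1000, 1, 4.2); apply(&a, 1000, 2, 4.7);
  apply(&b, 1000, 2, 4.7); apply(&b, 1000, 1, 4.2); apply(&b, 1000, 2, 4.7);
  printf("A=%.1f B=%.1f (identical)\n", a.value, b.value);
  return 0;
}
```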
[12:10] We were able to put nodes on different continents, in completely different failure regions, and remain uninterrupted. You can have an entire data center fail and still process a million messages per second per node. Our systems are failing all the time, often on purpose, as we deploy code and crash them intentionally to understand the system’s resilience. [12:39] Engineers have complete confidence in the system’s resilience because failures don’t affect users. This was crucial. Our system looks similar to Riak or Cassandra, with some differences. A graph of our Postgres usage versus what we have now with IronDB shows about eleven months of running the two systems redundantly. In January 2012, when we decided to switch over, we found it had already happened: we had been running the new system for new users without realizing it.
[13:53] We could toggle between systems using a user flag, with data going into both systems. This flexibility allowed engineers to develop faster and more confidently. As histogram enthusiasts, we track the latency of everything in the system, from disk I/O operations to database calls. We use our open-source histogram library, libcircllhist, and our C stats framework, libcircmetrics, both BSD licensed. Tracking latency is efficient: recording a sample into a histogram costs less than taking the measurement itself.
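To show why recording is so cheap, here is a deliberately simplified log-linear bucketing sketch in the spirit of (but not identical to) libcircllhist: inserting a latency sample is just a bucket-index computation and a counter bump, typically cheaper than the clock reads that produced the measurement.

```c
/* Sketch: simplified log-linear histogram insert (one significant digit
 * per power of ten). The real libcircllhist layout is finer-grained. */
#include <stdio.h>
#include <math.h>
#include <stdint.h>

#define EXP_MIN -9   /* ~1 nanosecond ... */
#define EXP_MAX  3   /* ... up to ~1000 seconds */

static uint64_t buckets[EXP_MAX - EXP_MIN + 1][9];

static void record(double seconds) {
  if (seconds <= 0) return;
  int e = (int)floor(log10(seconds));
  if (e < EXP_MIN) e = EXP_MIN;
  if (e > EXP_MAX) e = EXP_MAX;
  int d = (int)(seconds / pow(10, e)); /* leading digit, 1..9 */
  if (d < 1) d = 1;
  if (d > 9) d = 9;
  buckets[e - EXP_MIN][d - 1]++;       /* the "insert" is just this bump */
}

int main(void) {
  record(0.00042);  /* 420us */
  record(0.0013);   /* 1.3ms */
  record(0.0011);   /* 1.1ms */
  for (int e = 0; e <= EXP_MAX - EXP_MIN; e++)
    for (int d = 0; d < 9; d++)
      if (buckets[e][d])
        printf("~%de%+d s: %llu samples\n", d + 1, e + EXP_MIN,
               (unsigned long long)buckets[e][d]);
  return 0;
}
```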
[14:56] We also gained distributed tracing, becoming fond of Zipkin, which predates OpenTracing. We integrated this into our C application framework, libmtev, which is open source. This setup provides complete distributed tracing for every app. We view the application itself as a distributed system, tracing interactions between threads and asynchronous I/O operations. This detailed tracing is invaluable for debugging, allowing us to pinpoint latency issues.
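To give a flavor of intra-process tracing, here is a minimal sketch; the span structure and names are hypothetical, and the real system emits Zipkin-style traces through libmtev.

```c
/* Sketch: minimal in-process "span" records for tracing work handed
 * between threads. Names are hypothetical, not the actual tracing API. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

typedef struct {
  uint64_t trace_id, span_id, parent_id;
  const char *name;
  uint64_t start_ns, end_ns;
} span_t;

static uint64_t now_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static span_t span_start(uint64_t trace, uint64_t id, uint64_t parent,
                         const char *name) {
  span_t s = { trace, id, parent, name, now_ns(), 0 };
  return s;
}

static void span_finish(span_t *s) {
  s->end_ns = now_ns();
  /* In a real system this record would go to a trace collector. */
  printf("trace=%llu span=%llu parent=%llu %s took %lluns\n",
         (unsigned long long)s->trace_id, (unsigned long long)s->span_id,
         (unsigned long long)s->parent_id, s->name,
         (unsigned long long)(s->end_ns - s->start_ns));
}

int main(void) {
  span_t fetch  = span_start(1, 1, 0, "fetch_10k_points");
  span_t decode = span_start(1, 2, 1, "decode_on_io_thread");
  span_finish(&decode);
  span_finish(&fetch);
  return 0;
}
```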
[15:50] For instance, when retrieving 10,000 data points, we can generate thousands of traces to examine the system’s inner workings. This isn’t just about macro components; it involves tracing thread-to-thread interactions. It’s incredibly useful for debugging, such as identifying an extra millisecond of latency. When retrieving a year’s worth of data, we expect it to return in a millisecond or two. If it doesn’t, finding the cause can be challenging. We’ll discuss the lessons learned from this experience shortly. [16:44] I’m Heinrich, and I work for the company as well. I was hired in 2014 to manage the analytics tooling. My background is in mathematics, but I’ve been primarily coding for the past few years. When I joined, the architecture was already established, though I’ve organized it a bit more clearly here. We have brokers, which are like agents on your systems, capable of running their own checks. Data from these brokers is ingested by a component called Stratcon.
[17:51] From the start, our architecture had a dual data path. The alerting path runs through an FQ message queue to an alerting system that sends out messages. The storage path is different, with data ingested into Snowth, now called IronDB, and accessed via a web UI. Today, I want to focus on stream analytics. Initially, users wanted to alert on analytics transformations, like forecasting disk space usage to avoid running out in a few days. This required transforming native histograms into numeric values for alerting.
[18:56] We also had a query language to allow users to run their own queries. Our ambitious plans included automatic transformations on all metrics to detect anomalies. The first idea for analytics alerting was to poll the database, which is straightforward but has drawbacks. The data in IronDB was delayed compared to the alerting pathway, which had millisecond latency. In IronDB, data replication caused delays of 30 seconds to a minute, though this has improved recently.
[20:04] Polling the database put a constant load on it, especially with 10,000 alerts generating 10,000 requests every minute. This approach was inefficient, as it involved pulling months of data for analytics transformations, only to discard most of it. The process repeated with each new data point, leading to redundancy. This method also struggled with alerting on all metrics, highlighting the need for a more efficient solution. [20:40] Keeping all your data hot in the database for weeks is costly. We decided to build a stream analytics component, aiming to be minimally invasive to the architecture. We integrated it into FQ to listen to messages, perform transformations, and avoid loading IronDB. The component used C++ and Lua, and it worked well, with typical analytics queries taking around 217 milliseconds. However, processing messages this way was about a thousand times faster, as it involved simple load operations.
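To make the efficiency argument concrete, here is a minimal sketch (invented names, not the actual CAQL implementation): instead of re-querying weeks of history on every evaluation, each incoming measurement updates a small piece of running state, and the alert condition is checked against that state.

```c
/* Sketch: incremental (streaming) evaluation vs. re-polling history.
 * A rolling mean is maintained per incoming message; checking the alert
 * condition then touches O(1) state instead of months of stored data. */
#include <stdio.h>

typedef struct {
  double ewma;     /* exponentially weighted moving average */
  double alpha;    /* smoothing factor */
  int    primed;
} stream_state_t;

/* Called once per message arriving on the queue. */
static void on_measurement(stream_state_t *st, double value) {
  if (!st->primed) { st->ewma = value; st->primed = 1; }
  else st->ewma = st->alpha * value + (1.0 - st->alpha) * st->ewma;
}

static int should_alert(const stream_state_t *st, double threshold) {
  return st->primed && st->ewma > threshold;
}

int main(void) {
  stream_state_t disk_used = { 0, 0.2, 0 };
  double samples[] = { 70, 72, 75, 81, 88, 93 }; /* percent used */
  for (int i = 0; i < 6; i++) {
    on_measurement(&disk_used, samples[i]);
    printf("sample=%.0f%% ewma=%.1f%% alert=%s\n", samples[i], disk_used.ewma,
           should_alert(&disk_used, 85.0) ? "yes" : "no");
  }
  return 0;
}
```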
[22:15] Despite the initial success, we had regrets after a year with the first prototype. We learned that minimal architectural changes don’t equate to minimal product changes. While easy to build and integrate, it required new concepts, UI work, and user education, which was costly. Additionally, the results of analytics transformations weren’t persisted, leading to a lack of historical data for alerts, especially important for anomaly detection.
[23:40] We underestimated the operations tooling needed to make it work. For instance, setting up a service like Backtrace for debugging took significant effort. There were also challenges with logs configuration, instrumentation, packaging, and CI, consuming a large part of our development effort. The uptime requirements were high, as it was in the alerting path. Lastly, we overlooked a critical requirement: users wanted to aggregate metrics, like summing 50 metrics or merging histograms, which we hadn’t initially considered. [24:38] We initially overlooked the need for transformations to apply to multiple metrics, as they were tied to a single metric. Changing this wasn’t difficult, but we were already a year in and realized it wasn’t ideal. Reflecting on our architectural choices, we realized that in 2015, we had a new CEO from sales eager to sell the product and a new CTO focused on system stability. The team was new, and we were unfamiliar with OmniOS, which added to the challenge.
[26:06] In 2016, the original CEO returned and suggested redoing the project. We agreed, as we had hit a wall. A new CTO, Riley Burton, joined and quickly understood the code, suggesting we reuse existing technology. We already had a multi-threaded server framework, libmtev, but we had initially struggled to leverage it due to the team’s inexperience. With Riley’s insight and a more experienced team, we could reuse that technology effectively.
[27:27] The redo didn’t take long, as we moved logic from C to Lua, allowing for dynamic adjustments. We had already built the C++ code, so rearchitecting was swift. The new architecture made more sense, with the output of the broker being metrics. Transformations became just another metric, requiring minimal UI changes. This approach allowed us to reuse our existing operations tooling, which was significant given the effort it took to build initially. The broker, present from the start, had all the necessary tools, making the transition smoother. [28:29] We were able to utilize the component effectively, with observability being crucial for its operation. This component accumulates state over time, making it stateful and reliant on message updates. Restarting it would destroy the state, leading to previous pathological behaviors. Reproducing issues in development requires replaying a lot of data to reach the defective state. Observability allows us to understand the system without causing harm while debugging.
[29:28] We maintain extensive logs, capturing everything the system does. We retain the most recent log messages and can adjust the log granularity as needed. We instrument endpoints following standard methodologies, with a stats system that outputs statistics as JSON. Introspection capabilities are available via telnet and the web, inherited from the broker’s web inspection front-end. This setup allows us to focus on the metrics we’re interested in.
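A rough sketch of that logging idea (hypothetical, not the actual libmtev facilities): keep the most recent messages in a small in-memory ring and allow the verbosity to be raised at runtime, so the system can always show what it last did without a restart.

```c
/* Sketch: retain the last N log lines in a ring buffer, with a log level
 * that can be adjusted at runtime (e.g., from a telnet/web console). */
#include <stdio.h>

#define RING_SLOTS   4
#define LINE_MAX_LEN 128

enum level { LOG_ERROR = 0, LOG_INFO = 1, LOG_DEBUG = 2 };

static char ring[RING_SLOTS][LINE_MAX_LEN];
static int  ring_next = 0;
static enum level current_level = LOG_INFO;

static void log_line(enum level lvl, const char *msg) {
  if (lvl > current_level) return;              /* below current granularity */
  snprintf(ring[ring_next], LINE_MAX_LEN, "%s", msg);
  ring_next = (ring_next + 1) % RING_SLOTS;     /* overwrite oldest entry */
}

static void dump_recent(void) {
  for (int i = 0; i < RING_SLOTS; i++) {
    const char *line = ring[(ring_next + i) % RING_SLOTS];
    if (line[0]) printf("recent: %s\n", line);
  }
}

int main(void) {
  log_line(LOG_INFO, "ingest connected");
  log_line(LOG_DEBUG, "this is dropped at INFO level");
  current_level = LOG_DEBUG;                    /* raise granularity live */
  log_line(LOG_DEBUG, "per-message detail now visible");
  dump_recent();
  return 0;
}
```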
[30:18] Our metrics system includes service-level metrics like uptime, message flow, and CPU usage across a two-node cluster. We follow the USE methodology for metrics (utilization, saturation, errors), which helps identify bottlenecks. For example, it can predict when memory will run out. We emit metrics to test the system, graph them to ensure accuracy, and perform outlier detection and analytics.
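As a simple illustration of the "predict when memory will run out" point (a naive sketch with made-up numbers, not the production analytics): fit a straight line to recent free-memory samples and extrapolate to zero.

```c
/* Sketch: naive linear extrapolation of free memory to predict exhaustion.
 * A least-squares slope over recent samples estimates seconds until zero. */
#include <stdio.h>

/* Returns estimated seconds until free memory hits zero, or -1 if it is
 * not trending downward. */
static double seconds_to_exhaustion(const double *t, const double *free_mb, int n) {
  double st = 0, sf = 0, stt = 0, stf = 0;
  for (int i = 0; i < n; i++) {
    st += t[i]; sf += free_mb[i];
    stt += t[i] * t[i]; stf += t[i] * free_mb[i];
  }
  double slope = (n * stf - st * sf) / (n * stt - st * st); /* MB per second */
  if (slope >= 0) return -1.0;
  double intercept = (sf - slope * st) / n;
  double t_zero = -intercept / slope;      /* time at which the fit hits 0 MB */
  return t_zero - t[n - 1];                /* relative to the latest sample */
}

int main(void) {
  double t[] = { 0, 60, 120, 180, 240 };               /* seconds */
  double free_mb[] = { 4000, 3800, 3550, 3350, 3100 }; /* MB free */
  double eta = seconds_to_exhaustion(t, free_mb, 5);
  if (eta < 0) printf("memory not trending toward exhaustion\n");
  else printf("estimated exhaustion in about %.0f seconds\n", eta);
  return 0;
}
```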
[31:29] For log analytics, we use a simple system that converts logs into an SQLite database, allowing us to run select queries. It’s a basic version of tools like Splunk but sufficient for our needs. We use gnuplot to produce histograms, which aids data visualization. High availability was a late addition, initially missing in Beaker and only partially implemented in the broker tech. Anthea completed the high availability setup in half a day, enhancing system reliability. [32:33] After completing the high availability setup, I spent half a week on testing. The principle is simple: we run several CAQL brokers that communicate via heartbeats, sharing uptime information. If a node has been up longer, its state is likely better, so other nodes defer to it. We haven’t implemented message federation yet, but the system is effective at our current scale. Checks are replicated across brokers, enhancing reliability.
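A minimal sketch of the deferral rule described above (the structures and names are invented, not the actual broker code): each node shares its uptime via heartbeats, and a node only acts as the alert-emitting primary if no recently-heard-from peer has been up longer.

```c
/* Sketch: uptime-based deferral between redundant brokers.
 * Each node learns peer uptimes from heartbeats; the longest-running
 * node (whose accumulated state is most complete) acts as primary. */
#include <stdio.h>
#include <stdint.h>

typedef struct {
  const char *name;
  uint64_t uptime_s;      /* reported via heartbeat */
  uint64_t last_seen_s;   /* seconds since last heartbeat */
} peer_t;

#define HEARTBEAT_TIMEOUT_S 30

/* Returns 1 if this node should emit alerts, 0 if it should defer. */
static int i_am_primary(uint64_t my_uptime_s, const peer_t *peers, int n) {
  for (int i = 0; i < n; i++) {
    if (peers[i].last_seen_s > HEARTBEAT_TIMEOUT_S) continue; /* peer is gone */
    if (peers[i].uptime_s > my_uptime_s) return 0;            /* defer to it */
  }
  return 1;
}

int main(void) {
  peer_t peers[] = {
    { "broker-b", 86400, 5 },   /* up a day, heartbeat 5s ago */
    { "broker-c", 120,   45 },  /* recently restarted, and silent */
  };
  uint64_t my_uptime = 3600;    /* this node has been up an hour */
  printf("primary: %s\n", i_am_primary(my_uptime, peers, 2) ? "yes" : "no");
  return 0;
}
```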
[33:27] Once implemented, the high availability feature significantly improved service stability. Previously, deployments were stressful due to potential downtime, as metrics are emitted and any minute lost is visible on graphs. Deployments could result in a five-minute outage if issues arose. With high availability, we can deploy changes without noticeable impact on users, as the system maintains uptime through longer-running nodes.
[34:38] The service uptime is represented in black, showing the longest-running node, while individual node uptimes are in blue and green. This setup allows for error handling and debugging in production without affecting users, as long as other nodes remain operational. It was a revelation. A key takeaway: if you want to save CPU, avoid constantly forcing garbage collection. Our excessive garbage collection came from calling the collector too frequently, not from poor object management.
[35:52] Our current architecture includes the CAQL broker, IronDB with Lua extensions, and CAQL as a query language. This allows us to run the same queries on both the CAQL broker and our database, simplifying the learning process. Offline analytics present their own challenges, but that’s another story. With this, I hand it back to Theo. The garbage collection issue was intriguing, and resolving it was satisfying. [36:30] It’s disheartening when you make a small code change that reduces utilization by 90%, only to realize you caused the original issue. We’ve all accidentally committed debugging lines. We faced crashes due to a use-after-free problem, where garbage collection was collecting a resource that was still in use. Forcing a full GC while testing revealed the issue, but unfortunately that debugging line was accidentally committed to production.
[37:18] The key takeaway is that while our architecture might not apply universally, the lessons do. I’ve written on scalability and helped solve large-scale production issues. Mistakes are inevitable, but repeating them is avoidable. Early in my career, I conflated data and control planes, a mistake I made more than once. Real-world constraints, like time to market, often lead to repeated mistakes because they represent known technical debt.
[38:52] You understand the cost of past mistakes and can tolerate them until you can fix them later. This approach can be a sound technical decision once you’re aware of your technical debt. It’s controversial, but focusing on operability over correctness is crucial. You won’t achieve both perfectly, but ensuring operability helps you understand and fix issues.
[39:38] Instrumenting systems is vital; without it, you can’t understand or manage them effectively. You need historical data to comprehend system behavior. Lastly, building resilient and redundant systems is essential. Without this, your development and SRE practices will be hindered, as you won’t have the flexibility to manage and improve your systems effectively. [40:20] If your systems aren’t resilient and redundant, your development and SRE teams will be hesitant to make changes. Therefore, it’s crucial to build systems safely and robustly to encourage innovation and improvement.