Latency SLOs Done Right

October 2, 2019
Dublin, Ireland

Recording

Slides

Abstract

In this talk, we will explore the challenges of measuring and aggregating latency effectively for Service Level Objectives (SLOs). Latency is a crucial metric for assessing service quality, yet its measurement, particularly via percentile metrics, runs into trouble whenever extensive aggregation across time periods and nodes is required. We identify these challenges, explain the pitfalls of using percentile metrics for latency SLOs, and offer three practical solutions: using raw logs, counting metrics, and employing histogram metrics. Each method has its own benefits and limitations, with a focus on accurate latency aggregation.

We will also detail the implementation of histograms, particularly HDR histograms, which allow for precision in latency reporting. This discussion includes the methodology behind using histograms, the advantages of counter metrics for efficient aggregation, and the potential for future technologies in improving latency metric accuracy. Through this session, attendees will gain a comprehensive understanding of how to tackle latency SLO measurement and performance issues using advanced methodologies and technologies.

Transcript

[00:06] Yeah, hello, thanks for having me again. SREcon is really a special conference for me. My first talk at a conference was at SREcon, and I've come back a few times since. This talk is particularly special to SREcon because it's essentially a reply to, or a tangent from, a workshop I attended last year, which was run by Liz Fong-Jones, who is sitting here, Kristina Bennett, and Stephen Thorne from Google. They had a session about doing all kinds of SLOs and shared best practices. I remember it very vividly, and I came out of that session feeling like things clicked for me, especially the subtleties of speed and latency SLOs. I took a real lesson away from it.

[00:58] I had been sitting in my own corner, with my own point of view, for a long time, because we had our take on how to do latencies and we had some tools. I will describe them. If you are outside that bubble, the complications are different, particularly with latencies. It's subtle, and so this talk is basically an elaboration on that tangent from last year's SREcon. Because the topic is so relevant and in some sense so basic, I hope it's valuable for you.

[01:31] The question this whole talk is about is actually pretty basic. Here’s an example of this question: How many requests in August were served within 100 milliseconds? I’d like you to think of any API that you might monitor or care about. Can you answer that question? Raise your hands if you can. Okay, there are a few. What about within 150 milliseconds? That’s still one or two. How about June 16th, 12:09 to 12:35? How many requests were served within 30 seconds? I see one hand, another hand, very nice.

[02:19] For those of you who raised your hand, how are you actually doing it? Sorry, louder? An external monitoring service gives you that data? Okay, but the question is about all the requests that were served. Right, if you have set it up properly, yes. I will talk about how to do it, but this is surprisingly hard to do. Most of the time when I ask this question, I get the answer, "Well, we have Splunk running back to June, and it has all the logs," which is also a valid way to do it.

[03:08] This is the obligatory slide about myself. As was already mentioned, I am working as a data scientist at Circonus. I realize I might be one of the very few data scientists at this conference. My background is in mathematics, and as already mentioned, I used to talk about statistics and how to apply it to the SRE field. This is a picture of my pile of firewood. I recently moved to the countryside, and the main reason is so I can do more wood chopping. This is supposed to be funny.

[03:48] So, this leads to my main topic: latency SLOs. I'll state the problem briefly and then elaborate on three methods for how to do it. Why monitor latency? Latency is the key performance indicator for any API. If you look at the Google SRE book, they wrote the book on this, and they have the four golden signals in it: latency, traffic, errors, and saturation. Latency is the number one golden signal. People care about it.

[04:43] The four golden signals later got reshuffled into the RED method for monitoring APIs, which stands for requests, errors, and duration. Duration is just another name for latency, so you might have heard this before; latency made its way into the acronym under a different name. What is an SLO? That's also pretty much from the workshop, and you probably already know it, but it doesn't hurt to say it again.

[05:00] SLIs, SLOs, and SLAs are three related concepts. The first is the Service Level Indicator (SLI), which is typically a metric that quantifies the reliability of your service. Typically it has the form of good events divided by valid events, times 100. A good event might be something like "the service was up at a certain point in time." The Service Level Objective (SLO) is then measured against an SLI and sets an expectation about the service's performance.

[05:30] A Service Level Agreement (SLA) is what happens if an SLO is not met. To make that concrete: once a minute you SSH into a target host and report one if it's working and zero if it's not. An optimistic SLO would be 99.9% uptime. For every measurement you took over the last month, you check whether the metric was reporting one, and if it did so 99.9% of the time, then your SLO was met. An SLA would be: if you didn't meet the SLO, you will get exactly one cake. Maybe you have other kinds of punishments or other things that happen, maybe nothing happens, but it's still good to have the SLO at least.
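To make the arithmetic concrete, here is a minimal sketch of that uptime SLI and SLO check in Python. The function names and the probe data are hypothetical; the 99.9% objective is just the example value from above.

```python
# Minimal sketch of the once-a-minute uptime probe described above.
# probe_results holds one value per minute: 1 = probe succeeded, 0 = it failed.

def sli_percent(probe_results):
    """SLI = good events / valid events * 100."""
    good = sum(probe_results)
    valid = len(probe_results)
    return 100.0 * good / valid

def slo_met(probe_results, objective=99.9):
    return sli_percent(probe_results) >= objective

# Example: a 30-day month has 43,200 minutes; suppose 40 probes failed.
results = [1] * (43_200 - 40) + [0] * 40
print(f"SLI: {sli_percent(results):.3f}%")  # 99.907%
print("SLO met:", slo_met(results))         # True, so no cake is owed
```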

[06:24] Let’s look at the latency SLO example from last year’s workshop. The SLI we were looking at was the proportion of valid requests that were served within one second. The SLO was: 99% of valid requests in the past 28 days are served within one second. That seems to follow the pattern, but now let's look at the data one might try to use to answer the question. You see the 99th percentile at the very top, and the y-axis is in milliseconds. 1k is the threshold we are interested in.

[07:16] What does this graph tell us? It tells us that most of the time the SLO was met. On the 24th of June, you’re mostly below this 1k line, and it’s very tempting to just say… [07:33] Well, let’s smooth out that line, take a visual average, and I think it comes in just below the 1k line if I do a bit of visual averaging. So I’m led to believe that 99 percent of my requests in the past 28 days were indeed served within one second, so the SLO is probably met for that service. It looks like a very well-behaved service. My services usually look more like this when I put a p90 on them, because sometimes there’s just no load on the API, it’s nighttime and nobody is listening, so the values go everywhere.

[08:16] The important bit of information that is missing here is how much traffic was served at each point in time. If I told you that overall the traffic was very low while you had good percentiles, but that basically all of your data came in at this spike, then suddenly your judgment of whether the SLO was met or not would change. Actually, based on this data, we have no way to decide if the SLO was met or not, and that’s the issue.

[08:54] And that’s the other thing: with data that looks like this, it’s not even close; you wouldn’t even try to decide the question from a graph like that. So the upshot is that percentile metrics are not suitable for calculating latency SLOs. For latency SLOs, we need to aggregate data across multiple weeks and also across multiple nodes. You might have five web servers that serve requests, but your SLO covers the whole service. You don’t have an SLO for node 1, node 2, and node 3, so you need to aggregate across nodes and across long time periods, and both are problematic with percentiles.

[09:44] Then there’s another story I can tell in this context, from Monitorama a few years back, where I gave a different talk about latency. It was around the Statistics for Engineers class, and there was a guy called Dan Cinnamon who took that to Twitter and said, “Well, Heinrich said you can’t aggregate percentiles,” and he had probably had that argument a few times, so he put it out there. Then John Brower, a data scientist at Snapchat, came in and said this claim annoys him. I said, “Well, it’s complicated,” and he said you actually can average percentiles. He wrote a 10-page blog post about it, doing all kinds of sampling, and said it actually isn’t that bad.

[10:47] In many cases, you can actually get a good approximation. I don’t have a quote, but his basic take was, “Well, it’s possible.” I said, “Yeah, it might be, but if you look at it in practice, you can get three percent errors, and I don’t even have to construct data that is very artificial.” That somehow settled the argument. We agreed: okay, it’s actually quite easy to find examples where it completely breaks down.

[11:17] The point is, if your service is well-behaved and you have, say, five web nodes with the same load and the load balancer is working, then you can average the percentiles and you will get a very good approximation. This is basically what John was saying: if you’re sampling from the same distribution and then averaging carefully, you can get very precise approximations. The problem is that you’re most interested in your percentiles precisely when things are not going smoothly, like when one of your servers is down or is serving just 200s really fast, or the distributions are very different, and then the averaging just kills you.

[12:00] Here’s another example I prepared for this. I am averaging one-hour percentiles over 24 hours for one data set. It’s not actually super terrible, but there is a 68 percent error between the true aggregated percentile in red and the blue averaged percentile. The reason, as you can see, is that the load in the different sections is different. In some hours you have 35,000 requests, and here you have 268, so it would be pretty silly to just average those two; at the very least you have to do a weighted average. You can do that, and the error goes down to about 10 percent here.
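To illustrate the effect with numbers (these are synthetic, not the data on the slide), here is a small Python sketch comparing the naive average of per-hour p99s, the traffic-weighted average, and the true p99 over the pooled samples. The exponential latency distributions are an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two synthetic "hours" with very different load and latency profiles (ms).
busy_hour  = rng.exponential(scale=80,  size=35_000)  # lots of fast requests
quiet_hour = rng.exponential(scale=900, size=268)     # few, slow requests

p99_busy  = np.percentile(busy_hour, 99)
p99_quiet = np.percentile(quiet_hour, 99)

naive    = (p99_busy + p99_quiet) / 2
weighted = np.average([p99_busy, p99_quiet],
                      weights=[len(busy_hour), len(quiet_hour)])
true_p99 = np.percentile(np.concatenate([busy_hour, quiet_hour]), 99)

print(f"naive average of p99s:    {naive:8.1f} ms")    # way off
print(f"weighted average of p99s: {weighted:8.1f} ms")  # closer, still off
print(f"true pooled p99:          {true_p99:8.1f} ms")
```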

[12:55] But remember, this is not actually what we are doing. We are usually averaging one-minute bins. Even in a diagram like this, you typically take one-minute measurements, calculate the percentile, and then to graph it you have already averaged over an hour, and then you average again over very long time ranges, so this effect only gets worse. Now that I have stated the problem, I can talk about the solutions.

[13:23] There are three methods: one is to use logs, one is to use counter metrics, and one is to use histogram metrics. The log-based approach is really easy to explain. If you have the raw data for the time period you’re interested in, you can just say, “Give me everything that was faster than the latency threshold and count it.” It’s correct, it’s clean, it’s easy to do. The downside is that you have to keep your log data for, in this case, a month, because the reporting period for the latency SLO was one month, and SLOs are typically phrased over long time periods.
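As a sketch of the log-based approach, answering the opening question is just a scan and a count. The log line format, the field name latency_ms, and the file name access.log are assumptions made for the example.

```python
# Count requests in August 2019 served within 100 ms, straight from raw logs.
# Assumed line format: "2019-08-14T12:09:03Z GET /api/foo 200 latency_ms=87"
import re
from datetime import datetime, timezone

THRESHOLD_MS = 100
START = datetime(2019, 8, 1, tzinfo=timezone.utc)
END   = datetime(2019, 9, 1, tzinfo=timezone.utc)
LINE  = re.compile(r"^(\S+) .* latency_ms=(\d+)")

good = total = 0
with open("access.log") as f:
    for line in f:
        m = LINE.match(line)
        if not m:
            continue
        ts = datetime.fromisoformat(m.group(1).replace("Z", "+00:00"))
        if not (START <= ts < END):
            continue
        total += 1
        if int(m.group(2)) <= THRESHOLD_MS:
            good += 1

print(f"{good}/{total} requests ({100.0 * good / total:.2f}%) within {THRESHOLD_MS} ms")
```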

[14:04] The other problem is that it can get very, very expensive. Even a lightly loaded API might easily produce gigabytes, maybe even terabytes, of logs every day. A single metric is maybe 10 kilobytes a day of data, so it’s a whole different ballpark; you can store a ton of metrics for the cost of keeping those logs. [14:34] There are many vendors who do this, and you can use sampling to reduce the cost somewhat. I’ll talk about how to do that properly tomorrow; I have some experiments I will show in the class. But sampling doesn’t really let you do latency percentiles. If you are clever about it, it might be fixable, but I haven’t seen a silver bullet where sampling still gives you accurate latency percentiles. Maybe it will be possible in the future, but I haven’t seen it. If you have all the data, you can for sure do it, but if you don’t, it’s not so clear to me.

[15:14] The second approach is counter metrics. That was log data; the idea here is also very simple. We have to know the threshold for our SLO before setting up the metrics. We pick the latency threshold upfront, count the requests that are faster than the threshold, and store that count as a metric. Instead of calculating a percentile value that says "90% of my data was faster than this value," I just count how many requests were faster than the threshold I’m ultimately interested in. It’s kind of the reverse.

[15:59] When I have a question like "Were 99% of the requests served within one second?", I can either look at the 99% side or at the one-second mark. One way is really looking at the percentile; the other is just counting the number of requests that were faster than the threshold. I can count that every minute and put it in a counter metric. Looking at it this way allows you to do efficient and correct aggregation. You can take all the metrics you have for the different nodes, sum them up, integrate them over time, and then you get accurate latency SLOs.
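A minimal sketch of that counting, assuming we instrument the request handler ourselves. The names requests_total, requests_under_threshold, and observe are made up for the example; in practice these counters would be flushed to your metrics system once a minute.

```python
from collections import defaultdict

THRESHOLD_MS = 1000  # the SLO threshold, fixed up front

# (node, minute) -> count; only counters are kept, never raw latencies
requests_total           = defaultdict(int)
requests_under_threshold = defaultdict(int)

def observe(node, minute, latency_ms):
    """Call once per request served."""
    requests_total[(node, minute)] += 1
    if latency_ms <= THRESHOLD_MS:
        requests_under_threshold[(node, minute)] += 1
```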

[16:42] This is how it looks: you just count all requests that were faster than a second. In black are the total requests, and in red are the bad requests, the ones that were slow. This line is the integral, a cumulative count: you count how many total requests you had, and it goes up. At the very end, you read off the value. Here I have 100k requests and about 9k are bad, which comes out to 8.9%; that is below the 10% of slow requests we allowed ourselves, so the SLO is met.
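The read-off at the end of the reporting window is just summing those counters over all nodes and all minutes and taking the ratio, which is exact, unlike averaging percentiles. Here is a sketch that continues the hypothetical counters from the previous snippet; slo_attainment is a made-up helper name.

```python
def slo_attainment(requests_total, requests_under_threshold, nodes, minutes):
    """Percentage of requests under the threshold across all nodes and the
    whole window; summing counters is an exact aggregation."""
    total = sum(requests_total[(n, m)] for n in nodes for m in minutes)
    good  = sum(requests_under_threshold[(n, m)] for n in nodes for m in minutes)
    return 100.0 * good / total if total else None

# e.g. five web nodes over a 28-day window of one-minute buckets:
# attainment = slo_attainment(requests_total, requests_under_threshold,
#                             nodes=["web1", "web2", "web3", "web4", "web5"],
#                             minutes=last_28_days_of_minutes)
# The SLO is met if attainment >= 90.0, i.e. fewer than 10% slow requests.
```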

[17:23] It’s easy, it’s correct, and it’s cost-effective because you are not storing terabytes of data in a Hadoop cluster, and you have full flexibility in choosing aggregation levels: you can select arbitrary time slices and arbitrary nodes, so that’s good. But you have to choose the latency threshold upfront. If you have an established SLO, you might already know which thresholds interest you, but often you’re just guessing. You pick a bunch, and that’s okay, but you can’t then go back and change the threshold in the SLO afterwards. You really have to live with the thresholds you provided.

[18:16] Prometheus was already mentioned; they call this approach a histogram. If you have a Prometheus histogram, you have these "le" (less-than-or-equal) metrics, which is essentially the naming I used here, and Prometheus has means to process them as a histogram, which is not wrong. But when we talk about histograms, we mean something different, which is this thing here. So, this is about histograms, and in particular we are mainly interested in HDR histograms.
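To make the Prometheus case concrete: with the Python prometheus_client library, a histogram is declared with a fixed set of "le" thresholds and exported as cumulative counters, which is really the counting approach described above. The metric name and bucket boundaries here are just illustrative.

```python
from prometheus_client import Histogram

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5],  # thresholds fixed up front
)

def handle_request(latency_seconds):
    # Exported as http_request_duration_seconds_bucket{le="..."} series:
    # every bucket with le >= the observed value increases, plus _count and _sum.
    REQUEST_LATENCY.observe(latency_seconds)
```

The SLO ratio over a window is then the bucket counter for the one-second threshold divided by the total count. Real histograms, in the sense used from here on, store the full distribution rather than a handful of fixed thresholds.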

[18:48] Just to briefly recap: the raw data is listed below here. These are all the individual samples I care about, and I put them into bins and count how many requests fall into each bin. This is only a minor difference: I’m not counting how many requests were slower than 120; I’m counting how many requests fell into that particular range. The bin sizes are flexible here, and I will explain on the next slide how to choose them best. You can already see that in this range I have much smaller bins than in that range. The general idea is that this works like floating point precision: as I go way up, my bins get large, and close to zero my bins are very small, so I have a fixed number of significant digits.

[19:43] The basic idea is that you store your latency distribution in histograms that are tailored to the operations domain. The way the HDR histogram does this is that you choose a number of significant digits and do everything with base-ten floating point numbers. You have a total of about 46,000 bins covering the range from 10 to the minus 128 up to 10 to the plus 128, so that basically covers everything you might ever want to record: 10 to the minus 128 is smaller than anything you can measure, and 10 to the plus 128 is larger than anything I have encountered, at least. You space your bins so that you have a fixed 5 to 10 percent accuracy: you have bins at 10, 11, 12, then 100, 110, 120, and so on, so you always have a fixed number of significant digits.
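Here is a toy sketch of that log-linear binning idea in Python. This is not the actual HdrHistogram or Circonus implementation, just the bin-boundary arithmetic for two significant decimal digits; bin_bounds is a made-up helper name.

```python
import math

def bin_bounds(value, significant_digits=2):
    """Return the [lower, upper) log-linear bin containing a positive value.

    Within each power of ten the bins are linear (e.g. 10, 11, ..., 99 for
    two significant digits), so the relative bin width stays the same
    whether the value is 87 microseconds or 87 seconds.
    """
    exponent = math.floor(math.log10(value))
    step = 10.0 ** (exponent - significant_digits + 1)  # bin width at this magnitude
    lower = math.floor(value / step) * step
    return (lower, lower + step)

print(bin_bounds(87))      # (87.0, 88.0)
print(bin_bounds(117.3))   # (110.0, 120.0)
print(bin_bounds(4321))    # (4300.0, 4400.0)
```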

[20:43] Then you apply sparse encoding. Instead of recording 46,000 separate metrics, one count per bin, you treat the whole thing as a binary blob and store it in a database that supports that. Then, even if you have 100k entries in a histogram, you’re maybe only hitting 100 buckets, because usually your data is not spread across many orders of magnitude; it’s confined within one or two. [21:10] You might have values from 0.1 to 5,000, but that’s only a few orders of magnitude, so you’ll have roughly 300 bins, which is not a lot of data. You can use this for latency aggregation and accurate percentile calculations. The HDR histogram was developed by Gil Tene of Azul Systems. He has a magnificent talk about benchmarking and measuring latency, particularly GC latency for Java. We have a highly specialized, optimized C implementation, plus implementations in five other languages, that do exactly that.

[21:53] It’s called Circonus Hist, and it’s one of the things we based our Circonus monitoring product on. Here’s how it looks: this is latency data spread out over half a day, and for each minute you have a full histogram. We visualize it as a heat map, but this is the more traditional visualization. You can do latency SLOs with histograms very conveniently. You capture all latency information as histogram metrics and aggregate those latencies over nodes and endpoints, and because you are aggregating histograms, this is very easy: if you have two histograms with the same binning, you can just add up the counts.
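A minimal sketch of why that aggregation is so easy: with a shared binning scheme, a sparse histogram is just a map from bin to count, and merging histograms from different nodes or minutes is adding the counts. The bin keys and counts below are made-up example values.

```python
from collections import Counter

def merge(*histograms):
    """Merge sparse histograms that share the same binning by adding counts;
    the result is exact, unlike averaging per-node percentiles."""
    merged = Counter()
    for h in histograms:
        merged.update(h)
    return merged

# bin lower bound (seconds) -> sample count, one histogram per node
node_a = Counter({0.047: 120_000, 0.11: 9_000, 1.2: 400})
node_b = Counter({0.047: 80_000, 0.25: 3_000, 2.1: 150})
service = merge(node_a, node_b)
print(service.most_common(3))
```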

[22:38] You can then count how many samples were below the threshold to compute your SLO. I might be able to demo this. We can flip this into a histogram summary mode and say, "Give me the latency distribution for the last four weeks." You can see, for example, that 50 milliseconds is here, which is what I was interested in. Despite a UI glitch, in this case, for 47 milliseconds there would be 11 million samples below that, out of a total of 13 million samples, which would be 88.7 percent. I can even go back a year, which takes a little while to render, and then I can zoom in.
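In code, what that histogram summary mode computes boils down to counting the samples in bins at or below the threshold. Here is a sketch with made-up bin counts; percent_below is a hypothetical helper name.

```python
from collections import Counter

def percent_below(hist, threshold):
    """Share of samples in bins at or below the threshold, i.e. the SLO ratio."""
    total = sum(hist.values())
    good = sum(count for bound, count in hist.items() if bound <= threshold)
    return 100.0 * good / total if total else None

# an aggregated service-level histogram: bin lower bound (s) -> count
service = Counter({0.047: 200_000, 0.11: 9_000, 0.25: 3_000, 1.2: 400, 2.1: 150})
print(f"{percent_below(service, 1.0):.2f}% of requests under 1 s")
```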

[23:32] This is a year’s worth of data, and I’m now zooming in on some data we collected in March. I can flip this into histogram summary mode and evaluate the SLO here, saying, "Okay, below 30 seconds I had 96 percent of the data." This is how you do it with histograms, and it’s fairly convenient because you have full flexibility in thresholds, aggregation levels, and intervals, and it’s cost-effective. The downside is that you need HDR histogram instrumentation. We have libraries for that, and the Envoy proxy recently started using it. I’m seeing more and more uptake, with people building HDR histogram instrumentation in, and you need a histogram data store. You can obviously use our products, but there is maybe some uptake with other backends that also support this.

[24:49] For this talk, I’m particularly excited to see that there is actually movement in this area; there are more competing technologies doing very similar things. Our implementation dates back to 2013, then there is HDR Histogram in 2015, and now there’s DDSketch, for "distributed distribution sketch," in 2019. It’s a paper from a few months back by people working at Datadog, so I have some suspicion that the "DD" might have a double meaning. They call it a sketch, but what it really is, is a histogram. They are not using log-linear bins with base 10; they are using purely logarithmic bins with a base of 10 to the 0.1, so you get approximately the same resolution we had here.
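For comparison, purely logarithmic binning in the spirit of what is described here looks like this in Python. This is a toy sketch of the idea, not the DDSketch library itself, and log_bin_bounds is a made-up helper name.

```python
import math

BASE = 10 ** 0.1  # about 1.26, i.e. ten bins per decade

def log_bin_bounds(value):
    """Return the [lower, upper) logarithmic bin containing a positive value."""
    i = math.floor(math.log(value, BASE))
    return (BASE ** i, BASE ** (i + 1))

print(log_bin_bounds(117.3))  # roughly (100.0, 125.9)
```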

[25:57] If you’re a Datadog customer, you might be able to use similar technology as well. It’s just exciting to see people acknowledging the problem and building technologies that let you address it. I also did a little benchmark of all those tools: Circonus Hist, HDR Histogram, t-digest (another approach, based on clustering, that does a similar thing), and DDSketch. They all nail the accuracy part: this is not 10% error, this is 0.01% error or 0.25% error, so they all pretty much nailed it. Performance-wise, this was a very rough benchmark in which our implementation came out favorably, but it doesn’t really matter much; the performance is all pretty good, at least for DDSketch and Circonus Hist.

[26:37] That wraps it up. Thank you very much. Heinrich, this was awesome. Tomorrow we have three hours of a statistics workshop with Heinrich, so we will have an opportunity to follow up on this and ask a lot of questions. But we ran out of time, so if you have questions that can’t wait until tomorrow, just catch Heinrich in the corridor. We will start the next session in about four minutes. Thanks. [Applause]
