Latency SLOs done right

February 3, 2019
Brussels, Belgium

Recording

Slides

Abstract

In the talk, we will discuss the importance of latency as a key indicator of service quality, highlighting the challenges in measuring it accurately. Traditional methods like CPU utilization or request counts don’t capture the complexities of latency. We will explore the shortcomings of popular percentile metrics, especially for setting Service Level Objectives (SLOs) over extended periods. Our presentation will delve into the pitfalls of current practices and propose three practical methods for implementing effective latency SLOs. Our goal is to equip attendees with strategies and tools to measure and manage latency, ensuring their services meet defined performance standards.

Table of Contents

Transcript

[00:06] For the introduction, it's great to be here. Actually, FOSDEM was the very first conference I spoke at, back in 2013 in the Graph devroom. And now it's only the second time; I really wanted to come back. It's always an event I really enjoy, and it's very nice to be here.

[00:23] So, latency SLOs. To get you in the mood, I have a question for you all. You all have APIs that you manage or care about in some way. So if I ask you the following question, or your manager asks you the following question: of all the requests that came in in January, how many were served within 100 milliseconds? It seems like a fairly basic question, right? You're monitoring the API, so how would you actually do it? Could you answer this question right now?

[00:56] What about within 50 milliseconds? What about maybe 180 milliseconds? What if you had a problem on June 16th, 2018, between 9:12 and 9:35: how many requests were served within 100 milliseconds then? Who here could confidently answer that question? Okay, that's great. Can someone tell me how they would do it? Splunk? Yeah, excellent. So you keep Splunk data for more than half a year? Okay, I admire that.

[01:40] Okay, actually, these kinds of questions are examples of latency SLOs, and we at Circonus are able to answer these kinds of questions very convincingly. At the latency SLO workshop at SREcon last year, I realized that for many people it's next to impossible to answer them, and that was really shocking to me, how hard it actually is. So I thought this might be a good topic for a talk. What I want to do here is give you a few ways to do this with the tools everybody already uses for monitoring and log analytics, go over some pitfalls and misunderstandings that often arise and clear them up, and also show some of our tooling, which is largely open source and might give you more insight into your APIs and allow you to answer questions like that very convincingly.

[02:43] So, something more about me. My name is Heinrich. I'm a data scientist at Circonus. I originally come from mathematics, and recently I've talked a lot about statistics for engineers, kind of a mathematician entering the IT operations and monitoring domain; that's what I was talking about: what percentiles are, these kinds of things, and how you apply them correctly. I moved to the countryside in Germany so that I can chop more firewood myself, and you can find me online if you want to follow my work.

[03:14] Okay, here's the plan. First, I want to talk a little bit about why you want to monitor latency. Second, looking at the title, I should probably explain what an SLO actually is. Then I will give you three methods to calculate latency SLOs effectively, and we'll talk about each of those methods, and at the end there will be a conclusion. Isn't that wonderful?

[03:35] So, yeah, without further ado: latency is important. I will just say this much. I brought some props. This is the book that everybody who does SRE or DevOps reads, the Site Reliability Engineering book from Google authors Niall Murphy and others, and it has these four golden signals in it, on page 60. Here they are, the four golden signals: latency is actually the first one, and then there's traffic, errors, and saturation. Those four signals were later rearranged into the RED monitoring methodology. So if you want to monitor APIs, the recommended best practice is actually to do RED, which is Rate, Errors, and Duration; Duration is latency. So they took the four signals, got rid of one of them, rearranged them in a different way, and made an acronym out of it. But it's still the best thing, I recommend everyone to do it, and many people agree with me. And probably you all care about latency if you are in this talk, so I won't actually say much more about it.

[04:58] The second part I should explain is what an SLO is. Well, if you're monitoring something, then you want to have an idea of what that value should actually look like: you have some metrics, and what is the expectation? SLOs are a kind of quantification of the service quality that you expect the service to have. And there's a methodology, also proposed in this book, which you have also seen in the talk before mine. These are the three acronyms used in this context: SLI, SLO, and SLA. You start your service quality measurement by specifying certain service level indicators, which very concisely measure the reliability or the quality of your service in a specific way. Then you have service level objectives, which are basically target values or corridors that you expect those SLIs to stay within over time. The SLO is something that you might communicate to your users, or internally. So you say something like: I have a 99.9% uptime goal. That would be an example of an SLO. But it's important to realize that service level objectives play out over longer time spans. They are something you manage on, so you can make management decisions like: should I push out more features, or should I be more conservative and spend more time investing in the reliability of my service? These kinds of trade-offs are managed with an SLO, and it's really an art to specify good SLOs. And then an SLA is basically what happens if we [06:46] don't meet an SLO, and this is more of a legal question, so I'm not going to talk much about it. Here's a first example.

[06:53] That's an availability SLO. The SLI here is: I SSH into the target host and emit a 1 if it's working and a 0 if it's not. I do that every minute and put it into a metric. That's a very clear service level indicator, and I have an SLO of 99.9% uptime over the last month. I take a month of data and look at how many ones and how many zeroes I measured in that month. If I have ones more than 99.9% of the time, then the SLO is met. If I don't meet the SLO, then you will get exactly one cake; that might be the SLA. You can put other incentives there, but it's up to you what you do.
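To make the arithmetic concrete, here is a minimal sketch (mine, not from the talk) of how such an availability SLO check could look, assuming minute-level 0/1 samples; the function name and numbers are illustrative only.

```python
# Minimal sketch (not from the talk): checking a 99.9% availability SLO
# from minute-level 0/1 samples collected over a month.

def slo_met(samples, target=0.999):
    """samples: one 0/1 value per minute; True if the uptime ratio meets the target."""
    if not samples:
        return False  # no measurements at all: treat the SLO as not met
    return sum(samples) / len(samples) >= target

# Example: a 30-day month has 43,200 minutes; 40 minutes of downtime
# gives about 99.907% uptime, which still meets the 99.9% objective.
minutes = [1] * (43200 - 40) + [0] * 40
print(slo_met(minutes))  # True
```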

[07:43] So here's an example of a latency SLO, which was used in precisely this form in the SREcon workshop I attended last year. The SLI was the proportion of valid requests that were served within one second, a metric you record every minute, and the SLO was: 90% of the valid requests in the past 28 days were served within one second.

[08:15] Yes, so you will recognize that kind of question from the very start. This is basically the same thing I asked you, how many requests, and now we are asking for a percentage. The SLA was skipped here. Interestingly, they afterwards showed you data about the API: they showed you percentiles, and I guess most of you actually do percentile-based monitoring. It's very common, and it's actually what is usually recommended.

[08:45] The problem with latency monitoring is that the thing you are monitoring, the API, is inherently event-based. There are tons of events which come in, like many tens of thousands, and then you’re trying to store them in a metric, which is something you have just a single value for. So you basically have to compress a whole lot of information of tens of thousands of events and latencies in that period of time into a single number.

[09:11] The first thing everyone did was averages: let's just take the average latency that we have. Then Dogan Ugurlu from Optimizely wrote a very nice blog article which nailed the problem with this. He said: measuring the average latency is like measuring the average temperature of the patients in a hospital. You don't really care about that; you care about the sick patients most. So that's what you want to focus on.

[09:43] Then there was the Amazon Dynamo paper, which said: well, everyone knows that averages and standard deviations are not good enough for latency, so we do the 90th percentile, the 99th percentile, the 99.9th percentile. This is basically where we are now; everybody is doing this. They had it on the slide, and the question was: was the SLO met?

[10:02] You have 28 days of service metrics here; 90% of the valid requests in the past 28 days should have been served within one second. They asked the audience: was the SLO met? It looks about right. There's the 1,000-millisecond mark here, and, oh, sorry, this should actually say 99%, but never mind.

[10:35] So you look at the 99th percentile; you know that 99% of all requests were below the 99th percentile. The question is: were 99% of valid requests served within one second? It's very tempting to look at the 1,000-millisecond mark here, where the 99th percentile is, and say, well, we were below the one-second mark, so it should be okay, you should be fine. But is that really true?

[11:06] What if I told you that actually 99.9% of all the requests occurred right there? We don't have the request counts here; we just don't know. By the way, this is not how a real percentile metric looks. This is how a percentile metric looks, at least how our percentile metrics look; I don't know if you have data like this. This is a very, very well-behaved, constantly loaded service.

[11:28] This one is different: I have periods at night here where I don't have any requests; nobody is using my service, so it's a poor service. If I take the 99th percentile of no requests, it's missing data. Down here, I have maybe five requests, and the 90th percentile just doesn't tell you anything. Looking at this, I cannot plausibly tell you at all whether 90% of my data was below that threshold.

[12:00] So if you're only keeping percentiles, and maybe averages, then you actually have no way to determine this kind of SLO. That's the first realization, which is quite important, and it's actually not so easy to communicate. I try to make it clear, but I keep having this discussion. At its core, this is a percentile aggregation problem.

[12:18] The SLO asks you to compute a percentile over a month of data, or a week of data, and also across the whole service. It doesn't ask what www1 alone is doing with its handful of requests; it asks about all the web servers that serve the API together. So you need to aggregate across multiple nodes, you need to aggregate across multiple weeks, and you just cannot do that with percentiles.

[12:47] Then people say, well, maybe you can. This is a reaction to my 2016 Monitorama talk. John Rouse is a data scientist at Snapchat, and he said: well, when people like me say that you cannot meaningfully aggregate sample percentiles, he gets annoyed, because sometimes you actually can. It went back and forth; I wrote him a reply, [13:05] saying, "Yeah, John, blah blah blah," and then he said: well, actually, I wrote a ten-page blog post showing you that you can, in some cases, aggregate percentiles. I have the link here; you can look all of this up. It's actually a very beautiful post; he's really great at explaining. If you have multiple nodes sampling from the same distribution and you just average their percentiles, then you will get the true percentiles. This is a statistical phenomenon: if you're sampling data from the same distribution, the averaged statistics will converge. However, in practice, you are not sampling from the same distribution.

[13:41] I don't want to go into the details of what's on this chart; I will do that later. But I took some production data and compared a one-hour averaged percentile to a true 90th percentile over one hour: I either took all the data and computed the 90th percentile over the full hour, or I took one-minute percentiles and averaged them. The results showed errors of up to 300%. The tricky thing is that usually it's fine: if you have five nodes doing the same work and you average their p90s, you will roughly match the true p90. But say you have two nodes, one blue and one red. The blue one is doing a lot of work, so you have a nice, typical latency distribution. The second one doesn't do much work, and its p95 is down here.

[14:37] The true 95th percentile of the total will be pretty close to the blue one's, because the red one isn't doing much. That might be a failed service, or something that just started up, or you have a problem with a load balancer; your service isn't in a good state. And if you're doing the average, you might have a 30% error or something substantial. The real problem is that it works most of the time, but in the situations you really care about, when something goes wrong, like a disk starting to fail, your aggregated percentiles will be terribly off.
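To make the effect tangible, here is a small numerical illustration with made-up numbers (not the data behind the talk's chart): one busy node and one nearly idle node, where averaging the two per-node p95 values lands far away from the pooled p95.

```python
# A small numerical illustration (my own numbers, not the chart from the talk):
# averaging per-node percentiles vs. the true pooled percentile when one node
# handles almost all of the traffic.
import numpy as np

rng = np.random.default_rng(1)

busy = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)  # busy node, ~20-50 ms latencies
idle = rng.lognormal(mean=1.5, sigma=0.3, size=50)      # nearly idle node, ~4-7 ms

true_p95 = np.percentile(np.concatenate([busy, idle]), 95)
avg_p95 = np.mean([np.percentile(busy, 95), np.percentile(idle, 95)])

# The pooled p95 is dominated by the busy node, while the averaged value is
# dragged far away by the idle node's percentile.
print(f"true p95: {true_p95:6.1f} ms")
print(f"avg  p95: {avg_p95:6.1f} ms")
```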

[15:20] I hope I've convinced you that percentile metrics are not mergeable. To be completely fair, the Google folks didn't actually take a percentile metric as their SLI, but something else: they used the proportion of requests served within one second, which is precisely the first metric-based method for doing it right. I will talk about three ways to do it right: the first one is log data, the second one is counter metrics, and the third one is histogram metrics.

[15:55] If you have log data, if you have Splunk for half a year, you can answer these kinds of questions very easily. You can say: select everything from the logs where the time is in some time window and the latency is below a threshold. You might not have an SQL query interface for your logs, but you will have some other query language. Essentially, you put all your logs in a data store, have a field for the latency, and query on that. You can do that with log tools.
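Here is a sketch of what that query could look like, assuming the request logs have been loaded into a queryable store; sqlite3 and the table and column names are placeholders for whatever log tooling you actually use.

```python
# Sketch of method 1, assuming request logs with (ts, latency_ms) columns are
# available in a queryable store; sqlite3 here stands in for your log tooling.
import sqlite3

conn = sqlite3.connect("requests.db")  # hypothetical export of the access logs
fast, total = conn.execute(
    """
    SELECT
        SUM(CASE WHEN latency_ms < 100 THEN 1 ELSE 0 END),
        COUNT(*)
    FROM request_log
    WHERE ts >= '2019-01-01' AND ts < '2019-02-01'
    """
).fetchone()

print(f"{fast} of {total} requests ({100.0 * fast / total:.2f}%) served within 100 ms")
```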

[16:29] The original question was, "How many requests were served in January within 180 milliseconds?" Just go and count them. It's great if you can do that; it's correct, it's clean, it's easy. The problem is that you need to keep all your log data for a month, which can be very expensive, and every Splunk customer knows this. The problem is not really that Splunk is ripping you off, though that might be the case, but it's not the main reason. It's just a lot of data. For every request, you have a log line, which is maybe 80 or 100 bytes, and you have to store all of this for weeks. If it's a meaningful volume, like a thousand or 10,000 requests per minute, this is gigabytes a day, so it's just by design very expensive, and very few people can afford to keep log data for very long.

[17:22] The second thing is called counter metrics, which is also very simple. You’re interested in the one-second threshold. You want to know how many requests were served within one second, so you make a new metric. The common name for this is LT (less than) one second. You just count how many requests were served faster than one second. It’s pretty easy. You add a new metric and do that for each node. The beautiful thing is now it has become mergeable; it has become aggregatable. With percentiles, you couldn’t do the aggregation, but with counters, you can just sum them and integrate them over time.
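As a sketch of this idea, the following toy code keeps one "less than" counter per node next to a total counter; because plain counts sum cleanly, the SLO over any window is just the ratio of the summed counters. The names and numbers are illustrative, not from the talk.

```python
# Sketch of method 2: a per-node counter for one fixed latency threshold.
# The names (lt_1s, total, record_request) are illustrative, not from the talk.

THRESHOLD_MS = 1000
counters = {"lt_1s": 0, "total": 0}

def record_request(latency_ms):
    counters["total"] += 1
    if latency_ms < THRESHOLD_MS:
        counters["lt_1s"] += 1

# Plain counts are mergeable: sum the snapshots from every node and every
# minute of the SLO window, then divide.
def slo_fraction(snapshots):
    fast = sum(s["lt_1s"] for s in snapshots)
    total = sum(s["total"] for s in snapshots)
    return fast / total if total else None

node_a = {"lt_1s": 9_200, "total": 10_000}
node_b = {"lt_1s": 450, "total": 500}
print(f"{100 * slo_fraction([node_a, node_b]):.1f}% served within 1 s")  # ~91.9%
```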

[18:02] This is how it will look. In black, you have the total request count; in red, you have the slow requests. You can select a time frame and integrate the graph: you sum up each line and arrive at two numbers, which are where the two line graphs come up high at the very end. So, 8.9% of requests were slow; for this API, my SLO was met, since 90% should be fast. Latency SLOs from counter metrics are easy, correct, and cost-effective, because you're not storing a lot of log data, and they give you full flexibility in choosing aggregation intervals: you can freely select time ranges and the number of nodes you're aggregating over. But you need to choose your latency thresholds upfront. You have to hard-code the one-second threshold here, and many people who do this seriously do precisely that. [19:03] They just hard-code a bunch of latency thresholds and create metrics for them. Cloudflare has examples where they do 1,000 thresholds per API, not for all their APIs, but for those that they monitor like this. Afterwards, they can just select whichever latency threshold they're interested in. The other technology, called HDR histograms, allows you to do better. The basic idea is that instead of storing the individual durations, as in a log, we store a histogram representation of them.

[19:41] We put them into bins and then count how many samples were in each bin. We apply one more trick: we don't store bins that don't have any samples in them. So if there's a bin from 2100 to 2200 with nothing in it, we don't store it. We use a sparse encoding, and this way we get away with very low storage requirements and a broad range of data coverage: we can cover the whole float range, from 10 to the minus 128 up to 10 to the plus 128, with about 46,000 bins and still just around 300 bytes per histogram. It's a metric, but it stores much more than a single value; it stores the full distribution.
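A toy sketch of the binning idea, under the assumption of log-linear bins with roughly two significant digits per value (this is my illustration, not the actual libcircllhist implementation, which truncates values into bin edges rather than rounding):

```python
# A toy sketch of log-linear binning: keep roughly two significant digits of
# every value, so each power of ten gets on the order of 90 bins, and store
# only the bins that are actually hit (sparse).
from collections import Counter

def bin_key(latency_ms):
    # 1437 ms -> "1.4e+03" -> bin (exponent 3, mantissa 1.4)
    mant, exp = f"{latency_ms:.1e}".split("e")
    return (int(exp), float(mant))

hist = Counter()
for latency_ms in [0.23, 12.0, 12.4, 13.1, 980.0, 1437.0]:
    hist[bin_key(latency_ms)] += 1

print(dict(hist))
# {(-1, 2.3): 1, (1, 1.2): 2, (1, 1.3): 1, (2, 9.8): 1, (3, 1.4): 1}
```

With two signs, about 90 mantissa values per decade, and a couple hundred exponents, you end up in the tens of thousands of possible bins, yet only the handful that are hit ever get stored.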

[20:39] For each point in time, we not only have one percentile or one average, we have the full histogram, and we can aggregate it freely. I select a time frame and can view the complete distribution, and then I can do my latency SLO very easily. I just select a threshold here, over the roughly 40 million total requests in that month, hover over it, and see that 89.3% of my requests were faster. This is pretty eye-opening if you have that kind of technology available.
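Continuing the toy sketch from above: sparse histograms merge by bin-wise addition, and the SLO query is just the fraction of counts in bins below the threshold. The per-node counts here are made up for illustration.

```python
# Merging sparse histograms across nodes/time and reading off an SLO fraction.
# Bin keys are (exponent, mantissa) pairs as in the sketch above.
from collections import Counter

def merge(histograms):
    merged = Counter()
    for h in histograms:
        merged.update(h)  # bin-wise addition of counts
    return merged

def fraction_below(hist, threshold_ms):
    fast = sum(count for (exp, mant), count in hist.items()
               if mant * 10.0 ** exp < threshold_ms)
    return fast / sum(hist.values())

node_a = Counter({(1, 1.2): 9_000, (2, 3.5): 800, (3, 2.1): 200})  # 12 ms, 350 ms, 2.1 s
node_b = Counter({(1, 1.5): 400, (3, 4.0): 100})                   # 15 ms, 4.0 s
month = merge([node_a, node_b])
print(f"{100 * fraction_below(month, 1000):.1f}% of requests were faster than 1 s")
```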

[21:18] Here, I have a demo. I select a date range, like two weeks, and here's the updated view. I can check: okay, 69% of my latencies were below 27 milliseconds. I can do that on high-volume APIs. These here are actually block I/O latencies over a month, so it's not a highly loaded system, but you see three different disks. They each do I/O requests, and I can see the latency distribution of all of them. With this kind of technology, you can not only monitor APIs, you can also monitor system-level latencies like disk I/O and still aggregate SLOs over them.

[22:08] We actually have two commercial products: Circonus, a SaaS monitoring tool, and IRONdb, a time series database that fully supports histograms. Both are full systems, not open source, but they have a very generous free tier, so you're welcome to try them. The core technologies behind the histograms are open source. There are two open-source histogram libraries: one developed by Gil Tene from Azul Systems, an early proponent of measuring latency correctly, who has an excellent talk on this; if you're interested in the problem of properly benchmarking latency, you should definitely watch it. He came up with the HDR histogram name, and we have a very similar thing called libcircllhist, which we use in our products.

[23:00] I just want to tell you that percentile metrics are not suitable for SLOs. If you take one thing away from here: if you just have percentiles, that's not good enough to do SLOs. HDR histograms are the best answer that I know of, and you should try them out if you can. If you don't have HDR histograms, the following approach is effective and is not bad practice. You usually keep log data for three to five days anyway; with that available, you can apply the first method: freely choose your thresholds, experiment with them, and determine what sensible service levels are for you.

[23:44] Is it one second? Is it 100 milliseconds? What is my typical performance? This you can do on the first three to five days of data, and then you add the instrumentation for the counter metrics: less than 100 milliseconds, less than 200 milliseconds, less than one second, and then you aggregate those metrics over weeks and months as needed with the tools you have. It's a little bit of a pity that you have to make these upfront choices, but as I said, this is how it works. This is actually how Prometheus histograms work. Many vendors have started to use the term histogram, but what they really do is allow you to specify a bunch of thresholds.
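Here is a sketch of that "choose thresholds upfront" style using the Prometheus Python client; the bucket boundaries (0.1 s, 0.2 s, 1 s) are example values you would derive from a few days of log data, and the request handler is a stand-in, not anything from the talk.

```python
# Sketch of upfront-threshold instrumentation with the Prometheus Python client.
import random
import time

from prometheus_client import Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "http_request_duration_seconds",
    "HTTP request latency",
    buckets=(0.1, 0.2, 1.0),  # cumulative "le" counters, like the LT metrics above
)

def handle_request():
    with REQUEST_LATENCY.time():  # records the elapsed time into the buckets
        time.sleep(random.uniform(0.01, 0.3))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # expose the counters for scraping
    while True:
        handle_request()
```

On the query side, you would then sum the increase of the le-bucket counters over the whole SLO window and across all instances, and divide by the total count to get the fraction served within the threshold.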

[24:27] With HDR histogram technology, you don't need this upfront choice. You have around 46,000 bins, so you basically cover the whole range you could ever want without any upfront choices. In my opinion, that's the best way to do arbitrary latency SLOs for high-volume APIs or whatever you have. The other methods are also correct ways to do it, but please be careful with percentile metrics. They give you a good impression of how an API is doing, but actually doing math on them is very hard; they are just not a statistic that can be aggregated in a straightforward way.

[25:03] With this, I'm going to close. Thank you very much for your attention, and I think we have two minutes for questions or something, right? Please remain seated during questions. Hello? If you don't have fully stored logs or full histograms, can you use just the usual volume counter and a regular sampling of the latency? Can you approximate something correct by multiplying the latency [25:36] samples by the volume of requests during the period? I don't know if I understand the question correctly. The question was about subsampling, right? You have your usual metrics system that measures the number of requests, and then you have some kind of probe that samples the latency every once in a while.

[26:08] Okay, so like a probe, yeah, with external probes. So the question was: can I cook something up from the total request count plus externally probed latency? The first thing I'll say is that it's good to have external probe latency, because internally measured and externally measured latency can be quite different, so it's good to also have external probes. But the main problem with external probes is that they are usually not representative. You have a multimodal latency distribution, and you're just picking one mode out, and usually you're not picking a random sample, which would be required for this kind of sampling to work.

[26:44] For example, usually you just hit the homepage of your service. That will always be in the cache, so it will be a very fast request. So you have to be very careful to probe your API properly for this subsampling method to become effective. I have not seen that done, and usually external probing is simply under-sampling: you do one sample per minute, and at least in the examples I showed, we are talking about a hundred or a thousand requests a minute, so the errors will be pretty drastic in these cases.

[27:18] Yeah, I hope that answers the question. Hi, I feel like I haven't quite understood the difference between number two and number three, because in both cases you have a histogram with counters, and if your threshold falls in the middle of a bucket, you still can't tell precisely how many requests fall below it. HDR histograms predefine those buckets for you, but conceptually it seems similar to how histograms are implemented in Prometheus.

[27:49] So from a theoretical standpoint, they are very similar: you count how many things are in certain buckets, or less than a certain value. The difference is the cardinality and the cost economics behind it. For example, a thousand bins is not a lot, it will only cover a small range, and the cost you're paying for it is already very high: that's a thousand metrics per API you care about, and you probably have to multiply that by the number of nodes or endpoints you're monitoring, so it gets expensive very fast.

[28:22] With HDR histograms, you use a sparse encoding, so if a bin is not hit, we don't store it, and usually your latencies are bounded within a certain range; they are not in the microseconds one time and at five minutes the next. So usually you span one or two orders of magnitude, and the buckets that actually get recorded are chosen dynamically. It's the cost and the performance differential. With HDR histograms, you don't have to make any choices; it's pretty cheap, and it's as powerful as having 46,000 Prometheus-style metrics. But yes, in theory, they are exactly the same.

[29:04] You can view it as a clever compression, which makes it something like a factor of 10,000 more efficient. Right, any other questions? This is, by the way, my boss, so take that with a grain of salt. So, for people who are using arbitrary bins: in Prometheus you have to choose your bins, right? And everyone chooses wrong; every time I choose, I find out later that I chose wrong. How do you choose better?

[29:46] So if you don't have HDR histograms and you are using Prometheus, which most people are, how do you go about choosing a better set of bins to protect yourself going forward? I think that's a complicated question, but there are two hints. One, you have to think about where you want to aggregate your metrics. One mistake that even Google is known to have made is that different services chose different bins, and when in the end they wanted to aggregate across them, they could no longer do that. You can only aggregate the counts if they use the same thresholds.

[30:18] So you first have to get an idea, across the organization, of which teams and which services you will eventually want to aggregate across. Once you have that, the other question is: where should we all agree to put our bins? The only method I can come up with is to use log data for that. If you have log data, you can just say: select latency from logs for the last three days, or whatever, and draw the histogram yourself.

[30:52] You can see all the modes with any Python tooling; it's very easy to do. You just do a select, draw a histogram of all the values you get back, and then you have a picture like this, produced from log data. You can say, okay, where's my 90th percentile here? Where is my 95th percentile? You can take these latency thresholds and start working with them. If your API is evolving, you might want to change them later on, but then you have to change them and wait a certain amount of time until you can compute those SLOs again.
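A sketch of that workflow: pull a few days of latencies out of the logs, look at the distribution, and read off candidate thresholds. The file name and column are placeholders for however you export your log data.

```python
# Sketch: derive candidate latency thresholds from a few days of log data.
import pandas as pd
import matplotlib.pyplot as plt

latencies = pd.read_csv("latencies.csv")["latency_ms"]  # placeholder log export

for q in (0.50, 0.90, 0.95, 0.99):
    print(f"p{int(q * 100)}: {latencies.quantile(q):.1f} ms")

# A log-scaled histogram makes the different latency modes visible.
latencies.plot.hist(bins=200, logy=True)
plt.xlabel("latency (ms)")
plt.savefig("latency_histogram.png")
```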

[31:35] This is actually, from a practical standpoint, the only methodology I can think of that makes sense. More questions? I think that's it, then. Thank you again for your attention, and have a great conference. [Applause]
