Mathematics of SLOs

October 8, 2023
Dublin, Ireland

This video is a re-recording of the SLO section of the Statistics for Engineers talk at SRECon EMEA 2023.

Abstract

In the video, we will delve into the concept of Service Level Objectives (SLOs) within engineering, particularly focusing on their role in steering management and investment decisions through data-driven reliability assessments. Our discussion will cover the fundamental goal of SLOs in balancing reliability, productivity, and team health by defining clear objectives. We intend to highlight how SLOs can aid in managing deployment gating, alerting processes, and longer-term maintenance versus feature trade-offs. Additionally, we will outline methods for implementing SLOs effectively, including the calculation of Service Level Indicators (SLIs) and error budgets, to drive meaningful decisions and training efforts. Moreover, we will discuss efficient visualization techniques for SLOs, such as using burn rates for alerting, which focus on the real user experience rather than system health metrics like CPU usage. Our aim is to equip your team with a robust understanding and practical applications of SLOs for improved engineering management.

Transcript

[00:01] Hello, my name is Heinrich. I’ve been talking about statistics for engineers for a long time, and today I want to share my opinionated take on SLOs and how to approach them. Before we dive into the mathematical aspects of SLOs, I want to highlight the goal setting of SLOs. What are they, and what are we trying to achieve with them? Fundamentally, I think SLOs are about engineering management and steering engineering investments with data.

[00:34] We want to know if we should implement reliability patterns, increase resources on a certain service, and so on. Instead of focusing on the technical aspects, like when to alert or how to observe the systems, with SLOs, we take a step back to think about longer time ranges and understand whether we should spend engineering time on reliability or not. If we are successful with SLOs, we should be able to do this effectively.

[01:00] Thinking about SLOs as a profession and how we steer reliability, we are always navigating this triangle. This triangle basically says reliability is a property of the system we want to optimize or that is important for us. However, we also want to remain productive and keep our on-call teams healthy. We always have to balance between these three aspects, and everything we do, like improving alerting rules or testing systems, involves trade-offs within this triangle.

[01:43] For you and your team to understand where you stand, SLOs are a wonderful tool because they allow you to explicitly manage and understand where you are and where you want to be. If you use SLOs for steering decisions, you’re trading off between reliability and productivity, focusing on features versus reliability. If you use SLOs for deployment gating, you make a direct productivity trade-off. Lastly, you can use them for alerting, which trades off reliability against on-call health.

[02:18] Now that we have the goal setting out of the way, let’s look at the definitions. This is already something a bit controversial, or at least here’s my take on it. There are different versions of this around. SLI, for me, is basically a management KPI. We want to use it to steer management decisions, and management KPIs have two properties: they move slowly, usually over a four-week time horizon, and they are usually proxies.

You never expect to measure user behavior or the performance of a certain feature exactly. You can only measure proxies, like how many users used your tool, not how happy they were or what struggles they had. We always have to make some compromises, but it’s important that we have a correlation that gives us direction. An SLI, then, is a reliability KPI, calibrated between zero and one, where zero means unreliable and one means fully reliable. An SLO is basically a threshold on an SLI.

[03:41] We all understand that 100% reliability is just too expensive. We want to make a trade-off, and we need business and customer representatives in the room to make this trade-off effectively. The objective is a quantitative artifact that comes out of this conversation, essentially saying, “Here, I draw a line and say this is acceptable, this is not acceptable,” and we strive to have a certain level of reliability and not more. [04:10] Relatively simple, very simple, I would say, and in very general terms. Now, how are SLIs and SLOs typically implemented? Almost all of them are formulated as this kind of long-term average success rate. If you’re looking at a certain population of total events, like HTTP requests or something else, a part of those were good events, meaning successful interactions, and the rest were bad events, which were not satisfactory. We compute that average, good events over total events, and this becomes our SLI. If you only had good events, it would be one, and if you only had bad events, it would be zero, fulfilling the calibration properties from the definition.
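In formula form (my notation; the talk describes this verbally), the long-term average success rate is:

$$
\mathrm{SLI}_{\text{4w}} \;=\; \frac{\text{good events in the last four weeks}}{\text{total events in the last four weeks}}
$$

which is 1 if every event in the window was good and 0 if every event was bad.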

[04:59] If you’re starting your SLO journey, availability SLIs are really what you want to focus on, and this approach takes you a long way. Here are my top three ways to implement them, in increasing order of quality. In many cases, you can quickly set up some synthetic probing. A classical example is using a tool like Pingdom or just pinging your host to see if it’s physically available on the internet. You can do that every minute and then take the success rate of that operation over four weeks, where the total number of probe runs is around 40,000 over the full four weeks, and you count your successful probes.
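As a minimal sketch of this first approach, assuming one synthetic probe per minute and a boolean result per probe run (the function and variable names are mine, not from the talk):

```python
# Availability SLI from minutely synthetic probes over a trailing
# four-week window: 4 * 7 * 24 * 60 = 40,320 probe runs.
def availability_sli(probe_results: list[bool]) -> float:
    total = len(probe_results)   # total probe runs (~40,320 for four weeks)
    good = sum(probe_results)    # successful probe runs
    return good / total if total else 1.0
```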

[05:44] The next better version is if you have a server backend powering your website, you can count the number of successful operations on the server, which gives you a measure for good events. Then, you divide by the total number of requests you received. This is also very simple, but the problem is that this view is from inside your application and can be quite different from the user experience. If you have firewalls or proxy servers in the middle, you are blind to any degradation that happens there or even on the browser or mobile client that users are using. If you can put the measurement closer to the users, you will get a higher quality SLI, which is the third, and highest-quality, version of availability SLI that I regularly work with.

[06:39] Here’s another type of SLI that’s also quite common, which I call latency SLIs. If you’ve seen any of my talks, I always include a bit about histograms. This kind of measurement is aided by histograms because they aggregate latency over long time periods. For this SLI, you ask how many operations are fast enough. It could be website requests, ETL jobs, or job executions like a cron or job scheduling system. We want to know how fast they are, and we can encode the quality of service we provide in this kind of SLI in a very similar way.
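A minimal sketch of such a latency SLI, assuming a simple bucketed latency histogram accumulated over the SLO window (the bucket bounds, names, and example numbers are illustrative, not from the talk):

```python
# Latency SLI: fraction of operations that completed within a threshold.
# `bucket_counts` maps a bucket's upper latency bound (seconds) to the
# number of operations that fell into that bucket.
def latency_sli(bucket_counts: dict[float, int], threshold_s: float) -> float:
    total = sum(bucket_counts.values())
    fast = sum(n for bound, n in bucket_counts.items() if bound <= threshold_s)
    return fast / total if total else 1.0

# "How many requests completed in under 500 ms?"
# latency_sli({0.1: 800, 0.5: 150, 1.0: 40, 5.0: 10}, threshold_s=0.5) -> 0.95
```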

[07:22] Let’s come back to the formula we saw before, which is how we formulate most SLIs. If we remove the bit about the four weeks, it’s a familiar formula: good events over total events, measured over a minute. This is what we would call a success rate, and one minus that is the error rate, which is bad events over total events. Here’s a dataset I’ll use as an example for this talk, which has some requests coming in. In red, I’ve marked the errors we got; there is some organic error rate that manifests as a little sprinkling on top. I’ve also injected two larger outages, both with a 98% error rate, which is close to a complete outage. [08:11] One occurred in a low-traffic situation, and the other in a high-traffic situation. The second chart shows how this looks, with good events over total events represented by the green area. Now, how does the four-week version look? This graph shows total events over a moving window of four weeks, split into good and bad events. The request rates look somewhat similar, but you can see the artifacts of these red outages lingering.

[08:56] Let me share my screen to show how these bits are related. Here’s an animation showing the relationship. For each frame, I’m incrementing the range of aggregation, now at 10 or 11 days. The top graph shows total events, both good and bad, aggregated over longer rolling windows. Below, we see the average, and how the red areas spread out over longer periods. The first outage becomes a thin strip, which corresponds to the fact that very few requests actually failed despite the high error rate, because traffic was low. The other outage, during high traffic, produced many failures, which remain visible even on a four-week scale.

[10:08] Moving back, the area below is precisely the SLI. The green area represents good events over total events, averaged over a rolling four-week window. Notably, all values are crammed up near 100%. Even a loose SLO sits at 90%, and more typically at 98% or higher, so for all of these measurements we only care about the very top of the scale. The graph might look naive, but this phenomenon is common: everything ends up crammed near 100%. How can we visualize this better?

[11:01] There are two ways. One is using the nines scale, which you’re probably familiar with: “four nines” as a lower target, “five nines” as a high one. It’s a convenient scale for these values. The formula is minus log10 of 1 minus SLI, which you can put into Grafana or your tool of choice. It outputs one for 90%, two for 99%, and five for 99.999%, and it interpolates for values in between. In Grafana, you usually clamp this at five or six nines, as reliability typically falls in that range; if you approach 100%, the value runs off to infinity, which is not useful for visualization. Clamped at five or six nines, it makes a decent gauge or line chart.
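A sketch of that transform (clamp value as mentioned in the talk; the function name is mine):

```python
import math

# "Nines" transform: -log10(1 - SLI), clamped so that SLI = 1.0
# does not run off to infinity on the chart.
def nines(sli: float, clamp: float = 6.0) -> float:
    if sli >= 1.0:
        return clamp
    return min(-math.log10(1.0 - sli), clamp)

# nines(0.9)     -> 1.0
# nines(0.99)    -> 2.0
# nines(0.99999) -> 5.0
```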

[12:02] Here, we see request rates and error rates as before. The chart shows two and a half nines, dropping to one and a half nines after the second outage, indicating an unreliable user experience. The second idea is a linear rescale. [12:27] This is fundamentally what the error budget is. I have the requirements written here: 100% reliability should equate to a 100% error budget, and a 0% error budget should mean we are exactly at the SLO value. In other words, the SLO value maps to a 0% error budget, and we linearly interpolate between these points. Plotted on this scale, the SLI covers a meaningful range instead of being crammed into a small sliver at the top.
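Written out (my notation; the talk states the calibration requirements rather than this formula), those two points pin down the linear rescale:

$$
\text{error budget} \;=\; \frac{\mathrm{SLI} - \mathrm{SLO}}{1 - \mathrm{SLO}}
$$

This evaluates to 1 at perfect reliability, to 0 when the SLI sits exactly on the SLO, and it goes negative once the SLO is breached.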

[13:10] If you work out the math, there’s a simple formula to derive it. For event-based SLOs, it also shows why this is called an error budget: the numerator turns out to be the number of acceptable bad events you have left. For example, with a 99% SLO and, say, 100,000 total events over four weeks, 1% of them, that is 1,000 events, are acceptable to be bad. You subtract the bad events actually accumulated over the past four weeks, which gives you the number of acceptable bad events remaining. Normalize this by the total number of acceptable bad events over the period to scale the result between zero and one.
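In code, a minimal sketch of this event-based error budget (the function name and the example numbers beyond the talk’s 99%/100,000 case are mine):

```python
# Event-based error budget over a trailing four-week window.
# allowed_bad = (1 - SLO) * total events; the budget is the fraction
# of that allowance still left (negative once the SLO is breached).
def error_budget(total_events: int, bad_events: int, slo: float = 0.99) -> float:
    allowed_bad = (1.0 - slo) * total_events
    if allowed_bad == 0:
        return 1.0 if bad_events == 0 else 0.0
    return (allowed_bad - bad_events) / allowed_bad

# 100,000 events at a 99% SLO allow 1,000 bad events:
# error_budget(100_000, 250)   ->  0.75  (75% of the budget left)
# error_budget(100_000, 1_500) -> -0.50  (budget overspent)
```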

[14:07] Here, I’ve plotted the error budget. The graph is no longer clamped at 100%; it moves meaningfully. 0% represents the SLO, showing where we breached it. The red area on top of the error budget reminds you it’s a rescaling of the error rate graph, as seen in the animation. This visualization of an SLO is useful for managing, which often involves mapping to a traffic light symbol. Managers think in simple terms, and this method simplifies reliability.

[15:05] If you have more than 20% error budget, proceed as usual with no special investments. Below 20%, be cautious with deployments and consider implementing more postmortem action items to improve service reliability. If you’re in the red, reliability has been unacceptable, indicating a need to focus on reliability in the next period. Stop product work and invest in reliability. Implementing this methodology provides an effective feedback loop to maintain a certain reliability level, better than just counting incidents or support tickets.
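As a sketch, the traffic-light mapping described above could look like this (thresholds from the talk; the function itself is hypothetical):

```python
# Map the four-week error budget to a management-friendly traffic light.
def traffic_light(error_budget: float) -> str:
    if error_budget > 0.20:
        return "green"   # proceed as usual, no special investment
    if error_budget > 0.0:
        return "yellow"  # be cautious with deployments, work down postmortem items
    return "red"         # SLO breached: pause product work, invest in reliability
```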

[16:06] This part is new, focusing on alerting. Instead of managing, we’re now in a more technical domain. It might be surprising that SLOs also inform alerting, since alerting is tactical and targets much more real-time horizons. Still, SLOs are useful for alerting because they focus on user pain, on symptoms that users actually experience: they measure whether the user is affected. [16:46] User-experience issues are more critical than system health metrics like high CPU or low memory. Those are not user concerns and should be handled by your Kubernetes scaling or orchestration system. For a high-quality alert signal, start with your SLOs. However, SLOs won’t provide predictive alerts; they only show SLI degradation once the user experience is already affected. Sometimes you want to prevent degradation by renewing resources or upgrading infrastructure. SLOs are a great starting point for alerting, but not the endpoint; complementary alerts are necessary.

[17:50] When using SLOs for alerting, you need to understand how the SLI moves. Waiting until the error budget is fully depleted might be too late, so a natural idea is to alert whenever the error budget decreases. However, be aware of the subtleties. Advancing the four-week window by a minute introduces new data while old data exits. If the error budget increases, it could mean good events occurred or bad events left the window. Conversely, a decrease could mean bad events occurred or good events left. Differentiating these cases is crucial for effective alerting.

[19:16] To address this, consider the burn rate concept. A burn rate compares recent events, over a short window, against the error budget set for the full SLO period. The scale is calibrated similarly to the error budget: a burn rate of zero means no bad events, while a burn rate of one means behavior is exactly at the SLO level, so sustaining that behavior over the whole SLO period would deplete the error budget.

[20:19] A simple method is the naive burn rate, which many use. It compares the current error rate to the error rate allowed by the SLO. For example, if there are 10% errors over the past minute and the SLO is 99%, then 1% is the allowance; a 10% error rate against a 1% allowance gives a naive burn rate of 10. [20:53] A burn rate of 10 indicates you’re consuming your error budget 10 times faster than allowed. This measure focuses exclusively on the last minute, highlighting how quickly errors are accumulating right now. The challenge is that the total request rate varies over time. To better grasp burn rates, consider the longer time period as well. Instead of comparing error rates to the SLO, look at absolute counts: count the bad events from the last minute and compare them to the total allowable bad events over the SLO period, divided by the number of minutes in that period. This gives an absolute error budget per minute.
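A sketch of both burn-rate variants for a one-minute evaluation window, assuming a four-week SLO period (function and variable names are mine):

```python
MINUTES_4W = 4 * 7 * 24 * 60  # 40,320 minutes in the four-week SLO period

# Naive burn rate: the last minute's error rate relative to the SLO allowance.
# 10% errors against a 99% SLO -> 0.10 / 0.01 = 10.
def naive_burn_rate(bad_last_min: int, total_last_min: int, slo: float = 0.99) -> float:
    return (bad_last_min / total_last_min) / (1.0 - slo)

# Total burn rate: the last minute's bad events relative to the absolute
# per-minute allowance derived from the four-week total event count.
def total_burn_rate(bad_last_min: int, total_events_4w: int, slo: float = 0.99) -> float:
    allowed_bad_per_minute = (1.0 - slo) * total_events_4w / MINUTES_4W
    return bad_last_min / allowed_bad_per_minute
```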

[22:24] By comparing the bad events you count against the allowable bad events per minute over the four-week period, you align with established best practices. For example, alert if 2% of the error budget is consumed within one hour. These are sometimes called burn-rate alert conditions. You can translate this into a condition on the total burn rate: if the one-hour total burn rate exceeds 2% times the number of hours in four weeks (672), that is, a threshold of roughly 13.4, you alert. Implement this in your monitoring system by checking the total errors over the past hour against this constant.
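A sketch of that alert condition, evaluated over a one-hour window (the threshold works out to 0.02 × 672 ≈ 13.4; names are mine):

```python
HOURS_4W = 4 * 7 * 24  # 672 hours in the four-week SLO period

# Page if more than `budget_fraction` of the four-week error budget
# was consumed within the last hour (2% in 1 hour by default).
def should_page(bad_last_hour: int, total_events_4w: int,
                slo: float = 0.99, budget_fraction: float = 0.02) -> bool:
    allowed_bad_per_hour = (1.0 - slo) * total_events_4w / HOURS_4W
    hourly_burn_rate = bad_last_hour / allowed_bad_per_hour
    return hourly_burn_rate > budget_fraction * HOURS_4W  # threshold ~ 13.44
```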

[23:40] The total events over four weeks change slowly, so you might use a constant in your alerting system, updating it daily with a cron job. If your request rate is constant, the naive burn rate equals the total burn rate, which lets you use the same condition for alerting: by comparing the error rate over the past hour to a constant, you simplify the alerting rule. This method is documented in the literature and used in real implementations. [24:52] Note that this condition only looks at a rate observed over the past hour, without comparing it to the longer timeframe. When we visualize the example data, both burn rates are shown on the graphs. The naive burn rate is calculated by dividing the error rate by one minus the SLO. Both injected outages have the same error rate, so both result in a naive burn rate of 50. The scale is split into two areas, with the lower region zoomed in up to 10, so that values close to zero or one remain visible alongside the higher values.

[25:56] The naive burn rate of 50 suggests you’re burning through the error budget at the same pace in both outages. However, the actual depletion rate differs: one outage depletes the budget quickly, the other much more slowly. The naive burn rate doesn’t reflect this difference, even though the slopes of the error budget clearly differ. The total burn rate does show the difference, with values around 10 and around 120, which directly indicate how quickly the error budget is being depleted.

[27:03] Both burn rate variants have their merits, and both can be used for alerts. If you alert on a total burn rate above 13, you would not be paged in one of the two scenarios, while the naive burn rate would trigger alerts in both. Depending on your needs, you might choose one method over the other. The talk concludes with the key takeaways: the success of an SLO program is measured by its ability to guide reliability decisions; error budgets are just the SLI presented on a different, more useful scale; burn rates are useful for real-time alerting; and with the right calculations, SLOs themselves can drive alerting.

[28:30] Thank you for your attention. All the best, and I hope to see you soon. Bye-bye.
