Statistics for Engineers

October 21, 2022
Amsterdam, Netherlands

Recording

Slides

Abstract

In this talk, we will discuss the statistical methods that are most relevant to your daily work as an SRE. As SREs, we are constantly confronted with a wealth of telemetry data collected from our systems. Interpreting this data to extract operational information is a key part of our job. Statistics is here to help! Statistics is the art of extracting information from data. We will get up to speed with the basics and see how they apply to the operational domain. Furthermore, we will explore statistical pitfalls commonly found in telemetry systems. Specifically, we will cover subjects such as summarizing and visualizing data with mean values, percentiles, and histograms; implementing latency SLOs; and the impact of sampling on rate, error, and duration (RED) metrics.


Transcript

[00:07] Hello everyone. It's really a great pleasure to be here. SREcon, for me, is always something special. Exactly seven years ago, I gave my first talk ever at this conference, at SREcon in Europe. It was in Dublin back then, and the title was the same. I'm not sure if it's a good thing or a bad thing, but I'm still talking about the same topics. It's definitely a subject I care about a lot, and I hope I've gotten a little bit wiser over the years. I certainly got a lot bigger, so thanks for coming.

[00:46] I put some information about myself on this slide just to give you a little bit more background. The important takeaway is that I’m a mathematician by training. In my spare time, I like to think about mathematics and read papers. About ten years ago, I went into IT, working for a monitoring vendor, which is where I started going to conferences like this, talking about statistics. Right now, I’m leading the SRE department for a company called Zalando.

[01:17] I've talked about this topic, as I've alluded to, quite a bit over the past years, and I've included some recent talks here for you to follow up on. This presentation is basically a condensation of a lot of the material from those talks, compressed to the most interesting and entertaining parts. I actually removed a lot of the mathematical slides. Yesterday, when I was trying to squeeze out another minute, I removed the last formula. There will still be a lot of statistics and topics to cover, so without further ado, let me dive into it.

[02:00] For statistics, I will not be talking about machine learning or anything fancy. We will start with the very basics, and visualizations are always the first thing. If you are presented with a new data set, it’s important to know how to visualize it. I will use these slides to give you an idea about different visualizations so we are primed to recognize them further along. I’ll also present you with some of the data sets.

[02:28] If you're talking about statistics and don't have a slide with a normal distribution, I think you're doing something wrong. It mainly serves as an example of what operations data does not normally look like, but we can still use it to familiarize ourselves with these visualizations. The first one you see at the top is called a rug plot. For every data point in the data set, you place a line on the x-axis. It's a very fair representation, but one issue with this visualization is that it can get crowded. This is a thousand samples, which isn't much, and it's already quite dense.

[03:11] A second way to visualize raw data is a histogram. I think probably all of you have seen histograms before. To produce a histogram, you choose a binning of your x-axis, which is just a sequence of boundary values. Then, for each pair of adjacent boundaries, you count how many data points fall between them. The dense mess in the middle of the rug plot becomes legible when you look at it as a histogram. One interesting observation about histograms is that the area is [03:41] proportional to the count, and areas are something the eye is very good at recognizing. With histograms, you can quickly grasp how many samples there are. In telemetry data, you usually have a time dependence, or actually always have some kind of timestamps recorded with it. Putting that time on the x-axis is very natural. If you do that with normally distributed random samples, you get a chart that looks like noise, often referred to as white noise.
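
A minimal sketch of this binning-and-counting step (the sample data and bin edges here are invented for illustration):

```python
import numpy as np

# Build a histogram by hand: choose bin edges, then count how many
# samples fall between each pair of adjacent edges.
samples = np.random.normal(loc=50, scale=10, size=1000)  # stand-in data
edges = np.linspace(0, 100, 21)                          # 20 equal-width bins
counts, _ = np.histogram(samples, bins=edges)

for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:5.1f}, {hi:5.1f}): {'#' * (c // 5)} {c}")
```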

[04:15] Here’s a graph that you will be more familiar with. The last one is certainly something you see in all your dashboards—it’s a request rate chart. All these data sets are taken from production, and there’s also a GitHub repository where you can download and play around with it yourself. It’s a simple representation of three days of request rates. I always like to put the dots on the line charts because those are the actual measurements. The lines in between don’t really mean much. For this one, it’s kind of defensible; for others, not so much.

[04:50] How do you get the other two representations? You forget about the time dimension, get rid of the x-axis, and project everything onto the y-axis, leaving you with a rug plot. Those would be dots showing on this axis here, and if you rotate it, you get the rug plot. The histogram is, again, derived the same way. Now we come to my favorite data set ever, which is request latencies. I'll talk about latencies quite a bit, as this is, from a statistical perspective, the most exciting data we have in the monitoring space, at least for me.

[05:20] Again, we see how the rug plot looks at the top. There's a lot of crowded stuff going on. In the histogram, you get more of an idea. Looking at the time-dependent representation, you see the main mode corresponding to these values at the bottom, and there are quite a few outliers at the top. You see some structure in the histogram with multiple modes, i.e., local maxima. You can also see it in the rug plots, and then you have further outliers down here. Note that this is capped at 100 and this is capped at 1,200. When looking at latency charts, there's usually a long tail that's hidden, like when a query just took an hour.

[06:12] Here's a very similar data set. Latency not only comes as a global thing but also depends on time. If you chop a day into 24 hours and produce aggregates for each hour, this is what you arrive at. This will be a data set we look at quite a bit, so I want to draw your attention to the different crowdedness of these graphs. In this chart, you have 80 requests here, 100 there, 25 here, and close to 108k elsewhere. Your service is loaded very differently over time, resulting in varying numbers of samples.

[06:55] The same data set again; this is how it looks when you draw histograms. Histograms with lots of samples are very smooth, while those with very few samples are spiky and spotty, resembling a rug plot to a certain degree. The challenge is that you want to put time back on the x-axis; you usually don't look at charts like this in your monitoring. [07:15] In most systems, you don't have this kind of visualization, so you want to compress it and make time the x-axis again. But we have one dimension too few, as we've already used both dimensions on the full histogram. The rug plot isn't very effective for dealing with dense areas. The trick is to use color as the third dimension, encoding density in dark colors, which is what we call a heat map. A nice feature of a heat map is that you can use time as the x-axis. Here, I have the 24 hours, and you see the latency distribution. This is the same information as in the previous chart, just presented differently.
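
A quick sketch of this color-as-density idea, using matplotlib's 2D histogram (the latency distribution here is invented):

```python
import numpy as np
import matplotlib.pyplot as plt

# Heat map: time on the x-axis, latency on the y-axis, and the
# per-cell sample count encoded as color.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 24, 50_000)
latency_ms = rng.lognormal(mean=3.0, sigma=0.5, size=50_000)  # long-tailed

plt.hist2d(hours, latency_ms, bins=[24, 50], cmap="Blues")
plt.xlabel("hour of day")
plt.ylabel("latency (ms)")
plt.colorbar(label="sample count")
plt.show()
```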

[07:54] This visualization is something I've always liked, and it's great to see that over the past years, these kinds of visualizations of latency data have been arriving in our tools. This is a screenshot from the Grafana documentation, where they now have this heat map visualization of latency. The tool I worked on before, Circonus, pioneered a lot of this with its high-resolution histogram views, providing a nice overview. This is the same data set we're looking at here, with a 24-hour view and a minute-wise summary of what's happening.

[08:37] Now we come to the statistics part. Having just looked at the data, which is arguably already statistics, I want to talk about aggregates. Statistics and aggregates are extremely important for telemetry applications, where they are prevalent. You want to aggregate telemetry data across time in graphs, across hosts, or across endpoints. Every time you forget about some dimension, you're already talking about aggregations, and you need to know what different aggregates you have. The first one is the average, which I think everyone knows. I don't know how many of you could produce the formula, but I think a good part of you would be able to.
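
For reference, the formula in question: the average of samples $x_1, \dots, x_n$ is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$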

[09:18] Intuitively, we think about the average as a point of support. If I model the samples as rocks placed on a weightless bar, the point of support where the bar is in equilibrium would be the mean. You can see how that works out. I didn't calculate it, but it looks right by intuition. With a lot of heavy stuff here and some stuff over there, the average is over here. There's a phenomenon called leverage; if a point moves far to one side, it can balance a lot of things on the other side.
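
A quick numerical illustration of this leverage effect (the numbers are made up):

```python
import numpy as np

# One far-away sample exerts "leverage" on the mean: it can move
# the point of equilibrium arbitrarily far.
latencies_ms = np.array([10, 11, 9, 12, 10, 11, 10])

print(np.mean(latencies_ms))                        # ~10.4 ms
print(np.mean(np.append(latencies_ms, 3_600_000)))  # one 1h outlier: ~450,000 ms
```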

[09:54] The way to rephrase that is that the mean value is not robust. Without the extra sample, the mean value would be here, but adding just one sample, which may be arbitrarily far away, can change your mean value in unbounded ways. By adding one sample, I can move the mean value very far away. This non-robustness means a single sample can disturb your mean value. Looking at the data sets we discussed before: with a normal distribution, everything works out beautifully. The mean value is right in the middle, as designed. For request rates, mean values make sense; we've all heard of the average request rate over a day. For latency, we see some of this leverage effect, with the bulk of your data [10:48] in the distribution here, and the average shifted out because we have samples that are very far away, pressing the lever down with some weight. This is a kind of quiz; I don't have time for a real quiz, so I'm just doing this to demonstrate where averages are, because averages are literally everywhere. Every graph we look at has some averages in it. The question is: where is the actual average? Let me ask differently: what is the max RPS you see in that chart? Just looking at it, I'd say here it is, going to that axis: around 6.5. It looks like the max RPS is 6.5, but it turns out the max RPS is actually 14, and it was attained here.

[11:37] This is a full month of data. How many minutes are in a full month? It turns out it’s about 40,000 numbers collected over one month, and you only have 500 little dots painted. Every chart you’re looking at, if you’re zooming out, has a group by with some aggregation applied to it. Most tools use the average by default, so you just have average request rates here over hours, and you don’t have the max. Some tools, like Grafana, allow you to overlay different aggregations like max, average, and min. Be aware that when zooming or looking at graphs, there’s implicit aggregation going on. This phenomenon, where zooming into graphs makes spikes larger or zooming out makes them lower, is what I call “spike erosion.”
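
A minimal simulation of this spike-erosion effect (sizes and values are invented; real tools group by time window, but the mechanism is the same):

```python
import numpy as np

# ~40,000 minutely data points rendered onto 500 pixels: most tools
# average each group of points, so a one-minute spike shrinks.
rng = np.random.default_rng(1)
rps = rng.normal(5.0, 0.5, 40_000)
rps[20_000] = 14.0                # one-minute spike

pixels = rps.reshape(500, 80)     # 500 pixels, 80 minutes each
print(pixels.mean(axis=1).max())  # rendered max with AVG: ~5.1, spike gone
print(pixels.max(axis=1).max())   # rendered max with MAX: 14.0
```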

[12:42] With request latencies, the situation with mean values is terrible. We saw examples where it wasn't great, but look at this graph with 100,000 samples; the average is nowhere close to the body of the distribution. I overlaid it here as a graph. You may have seen mean latency graphs, but I haven't seen them in any tool for a long time. Nobody looks at mean values anymore. There's a quote I like from Dogen at Optimizely: "Looking at your average latency is like measuring the average temperature of your hospital." If your ward has an average temperature of 38.5 degrees, that's not helpful. With temperatures, there's at least a bound; you die if you're over 42. With latency, there's no such bound, so it's even worse.

[13:48] Wrapping up: most data we've seen is averaged, so be aware of the spike-erosion phenomenon. The average is easy to compute and mergeable, but don't rely on average latency; it's not good. The median is a concept similar to the average but offers robustness. The idea is similar: you try to find a representative sample, but in a different way. Here's a simple example: you have 178 values here and 178 values there. You can count later to verify whether this is really 178. [14:29] The median is the number that divides the dataset into equal halves. If you have one single value in the middle, that's the median. If you don't have that middle value, you don't really know where the median is, so you have to make a choice. Mathematically, any value between the two middle samples divides the dataset into equal halves, but practically, everyone chooses the midpoint of the two middle values, so there's no issue. Looking at the datasets again: request rates with normal noise look like the average. For larger datasets, the median stays in the middle, and outliers don't affect it much. This shows where the body of the distribution is, with 50% of the data below and 50% above.
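
In code, the even-count choice looks like this (NumPy, like most tools, picks the midpoint of the two central values):

```python
import numpy as np

print(np.median([1, 2, 3, 4, 5]))  # 3.0: a unique middle value exists
print(np.median([1, 2, 3, 4]))     # 2.5: no middle value; the midpoint
                                   #      of 2 and 3 is chosen by convention
```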

[15:32] You might care about more than just half of your request population, but the median tells you clearly where that half sits. There's an empty space in the middle of the graph, but there's data there; the color coding just lacks fidelity. It looks empty, but data is present, just not visible in the plot. Wrapping this up: medians give a central representative and are robust to outliers. If there's a single outlier, the median doesn't move much. However, medians are not easily mergeable. In operations, you often deal with multiple stages of aggregation, like calculating numbers on several nodes and further aggregating them, so it's important to know whether a statistic is mergeable. The average is mergeable, but median values are not, as the sketch below shows.
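
A small illustration of why medians don't merge (node names and values are invented):

```python
import numpy as np

# The median of per-node medians is generally NOT the median of
# the combined data, so medians cannot be aggregated across nodes.
node_a = [1, 2, 3]            # median: 2
node_b = [4, 100, 200, 300]   # median: 150

print(np.median([np.median(node_a), np.median(node_b)]))  # 76.0
print(np.median(node_a + node_b))                         # 4.0, the true median
```

The average, by contrast, merges exactly if each node also reports its sample count (a count-weighted mean).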

[17:12] Now, let's talk about percentiles. If you have a dataset with 90 values here, 10 values there, and one single value elsewhere, the P90 is where 90% of the values are below and 10% are above. The basic idea is that PX means X percent of the values are below, and 100 minus X percent are above. A P10 would have 10% below and 90% above, while P50 has 50% below and 50% above. You recognize that P0 is the minimum, P100 is the maximum, and P50 is the median from before. If you don't have that central representative, where does it go? [17:57] In this case, determining the correct percentile can be quite subtle. There's no single method that everyone agrees on. A paper from 1996 (Hyndman and Fan, "Sample Quantiles in Statistical Packages") catalogued the different versions in use, and Wikipedia lists nine. There's no consensus on the true percentile. Fortunately, you usually don't need to worry about this unless you're benchmarking and comparing percentiles across tools. There are some complexities, but hopefully you won't need to delve into them.
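
You can see this ambiguity directly in NumPy (version 1.22 or later), which exposes the Hyndman-and-Fan variants through the `method` argument:

```python
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Different, equally defensible definitions give different answers:
for m in ["linear", "lower", "higher", "nearest", "inverted_cdf"]:
    print(f"P90 ({m}): {np.percentile(data, 90, method=m)}")
```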

[18:32] Looking at the P90 for datasets, 90% of the data is here, and 10% is there. I made a mistake earlier; it’s actually the 99th percentile, not the 90th. This shows how the distribution looks, with 90% here and 10% there. It gives an idea about the tail of the distribution, which is often of interest. Overlaying this on a graph shows the full distribution, with individual red dots representing data points. With little data, percentiles can vary widely. When zooming out and averaging, it’s important to consider what you’re averaging.

[19:40] In tools, you often don't have a heat map or histogram in the background, but you do have percentile graphs. This is production data from our Zalando systems, and you're probably familiar with this type of data. Can you merge percentiles? That's a common question, and the answer is no. Naively aggregating percentiles with the average doesn't work, as shown by the 68% error in this example. The reason is that you retain too little information about the distribution.
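
A sketch of how badly this can go wrong (the two node distributions are invented):

```python
import numpy as np

# Averaging per-node P90s is not the global P90: a quiet, slow node
# pulls the naive average far away from the truth.
rng = np.random.default_rng(2)
node_a = rng.lognormal(3.0, 0.3, 100_000)  # busy node, fast requests
node_b = rng.lognormal(4.0, 0.8, 1_000)    # quiet node, slow requests

naive = np.mean([np.percentile(node_a, 90), np.percentile(node_b, 90)])
true = np.percentile(np.concatenate([node_a, node_b]), 90)
print(naive, true)  # the naive value lands far above the true P90
```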

[20:30] I provided datasets with the same P90, but they can vary significantly. You might have 90 values and 10, or 90 million and 10 million, all with the same P90. Merging these can lead to various outcomes, as the P90 alone gives little information about the distribution. To have mergeable versions, you need to store more information about the distribution. I've been involved in developing data structures for this purpose. Technologies like HDR Histogram, t-digest, DDSketch, and others allow for aggregation and merging of latency distributions. [21:36] I want to emphasize the importance of sparse histograms. Prometheus has a histogram data type that is quite basic, often with just 10 buckets, which can lead to significant errors in percentile accuracy, sometimes as high as 300%. However, there's hope with sparse histograms, which are becoming available in various tools. If you care about latency data and performance, this is something to watch out for.

[22:10] Percentiles are used to describe latencies, and they generalize min, max, and median values. However, percentiles should not be aggregated unless you have histograms. Latency SLOs involve massive time-wise aggregation, and everything we learned about percentile aggregation applies here. For example, an SLO might state that 99% of requests are served within 100 milliseconds. People often look at the P99, but this can be misleading if there's a large batch of slow requests.

[23:17] To address this, you can engineer thresholds into your data collection, such as counting slow requests. This allows you to determine directly whether your latency SLOs are met. With histograms, you can aggregate data over time and easily read off the distribution. This makes engineering latency SLOs straightforward, as you can accurately measure and experiment with different thresholds.
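
A minimal sketch of the threshold-counter approach (the 99%/100 ms targets come from the example above; the function names are illustrative):

```python
# Instead of storing every latency sample, increment two counters at
# request time. Counters merge trivially across hosts and time windows,
# unlike percentiles.
SLO_THRESHOLD_MS = 100
SLO_TARGET = 0.99

total_count = 0
fast_count = 0

def observe(latency_ms: float) -> None:
    global total_count, fast_count
    total_count += 1
    if latency_ms <= SLO_THRESHOLD_MS:
        fast_count += 1

def slo_met() -> bool:
    # Fraction of requests served within the threshold over the window.
    return total_count > 0 and fast_count / total_count >= SLO_TARGET
```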

[24:12] Now, I have five minutes to discuss sampling, a new topic in this presentation. At Zalando, we've been exploring sampling to save costs, as we sometimes push 10 million traces per second to our tracing systems, which is expensive. Sampling is a natural consideration, as it involves statistics. The trade-off is that by discarding data, you decrease accuracy but save costs. The basic game is managing this balance with your telemetry data. [25:07] Sampling means taking a small subset of the data to represent the broader dataset and performing statistics on it, such as request rates, error rates, and latencies. The trick with sampling is that you forget most of the data, creating a smaller, cheaper blob of telemetry data. However, this means you don't know the precise statistics and must rely on estimations.

[25:33] An example of this process: start with 100 values and use a 10% sampling rate. You flip a biased coin for each value, retaining about 10 values. To estimate the original count, you multiply the retained count by 10. By running this experiment many times, you can observe that the distribution of the estimates follows a binomial distribution, which resembles a discretized normal distribution.
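
Re-running that thought experiment in code (the counts and rate follow the example above):

```python
import random

# Bernoulli-sample 100 events at rate p = 0.1 and scale the retained
# count back up by 1/p to estimate the original count.
p = 0.1
estimates = []
for _ in range(10_000):
    kept = sum(random.random() < p for _ in range(100))
    estimates.append(kept / p)

print(sum(estimates) / len(estimates))  # ~100: the estimator is unbiased
# A histogram of `estimates` traces out a (scaled) binomial distribution.
```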

[26:59] When you apply sampling to request rate graphs, starting with 100 requests per second, you might see variations represented by green lines. The error margin is around 30%. As the request rate increases, the error decreases. However, if the request rate is low, you might only retain one value after sampling, leading to significant accuracy issues.

[27:52] The key takeaway is to monitor the number of retained values. If this number is low, accuracy problems arise. The formula for Bernoulli sampling shows that the expected value of the estimate is the original count, and the error is governed by the number of retained values. If you retain few values, the error increases, which is crucial when analyzing not just global request rates but more granular data. [28:44] When examining specific slices of data, such as a particular user ID, the counts can become very low, leading to high error rates. This is especially true for critical business operations, like monitoring the failure rate of password resets, which don't receive 100 requests per second. In these cases, the sampling rate is crucial, and you may need to retain more samples. Depending on your tools, you might be able to adjust the sampling rate accordingly.
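
One standard way to write this down: if $N$ events are Bernoulli-sampled at rate $p$ and $K$ of them are retained, the estimate is $\hat{N} = K/p$ with

$$\mathbb{E}[\hat{N}] = N, \qquad \frac{\sqrt{\operatorname{Var}(\hat{N})}}{N} = \sqrt{\frac{1-p}{Np}} \approx \frac{1}{\sqrt{\mathbb{E}[K]}} \quad \text{for small } p,$$

which matches the numbers above: at $N = 100$ requests and $p = 0.1$, the relative error is $\sqrt{0.9/10} = 30\%$.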

[29:26] I focused on counts, but there are other aggregates to consider, like percentiles. For instance, with 90% sampling, there’s a 36% error for the 90th percentile (P90). The error increases for the 95th percentile (P95) because you have fewer values at the higher end. If you go to the 99th percentile (P99), accuracy diminishes further. A practical tip is to ensure you have enough data points after sampling to maintain accuracy in high percentiles, especially when assessing latency improvements.
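
A simulation sketch of this effect (the latency distribution and rates are invented):

```python
import numpy as np

# Repeatedly Bernoulli-sample a latency population and compare the
# sampled percentile against the true one: the thinner the tail,
# the larger the relative error.
rng = np.random.default_rng(3)
population = rng.lognormal(3.0, 0.5, 100_000)

for q in (90, 95, 99):
    true_q = np.percentile(population, q)
    errs = []
    for _ in range(500):
        sample = population[rng.random(population.size) < 0.01]  # 1% rate
        errs.append(abs(np.percentile(sample, q) - true_q) / true_q)
    print(f"P{q}: mean relative error {np.mean(errs):.1%}")
```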

[30:24] The key takeaway is that sampling is an effective way to balance cost against accuracy. However, the accuracy loss depends on sample size and other factors, making it complex. While I provided a formula for counts, error rates and latency are more complicated. Simulations are a useful tool for studying these aspects. I developed a calculator app at heinrichhartmann.com/sampling, which allows you to experiment with sampling rates and observe theoretical and simulated values for error rates, request rates, and percentiles.

[31:36] The software isn’t perfect and contains bugs, but it’s functional enough for effective use. If you’re interested, you can explore the source code on GitHub and experiment with the Python version. For more statistical content, feel free to follow me on Twitter. Thank you for your attention.

