Abstract
In this presentation, we explore the vital role of statistics in engineering, particularly in monitoring and improving the performance of APIs. We begin with external monitoring techniques to measure availability and alert on outages, highlighting the benefits and limitations of synthetic checks. Subsequently, we delve into log analysis to gain insights into actual user requests and uncover patterns that affect system performance. As we progress, we emphasize the importance of balancing metrics like latency and user satisfaction, exploring the challenges of using averages versus percentiles to reflect accurate data. Finally, we propose innovative solutions using histogram data, allowing us to monitor meaningful metrics that directly impact business goals and enhance user experience. Throughout this session, we aim to equip engineers with statistical tools to better measure, analyze, and optimize their systems’ performance.
Table of Contents
- Introduction and Speaker Background
- Mathematical Identity and Conference Atmosphere
- Current Work and Past Articles
- Engaging with Mathematics through Stories
- External Monitoring Introduction
- Issues with External Monitoring Charts
- Log Analysis and Data Aggregation
- Visualizing API Performance
- Mathematical Insights into Requests
- Mean and Median Values in Monitoring
- Introduction to Percentiles
- Histograms as a Solution
- Conclusion: Deriving Meaningful Metrics
Transcript
[01:09] Apart from my national identity, I also have a cultural identity as a mathematician. I come from a town called Academia, and I am here as a refugee in the monitoring community. I want to thank you all for welcoming me here. It has been a very warm welcome, and this particular place at the conference underlines this a lot. It’s a very nice atmosphere with very interesting talks, and I can relate to many of the messages that were put out here. Remember, the sign PhD is actually the symbol for the refugees from Academia.
[01:56] At the moment, I’m working for a company called Suus. You might have heard of it; it’s a monitoring analytics platform, and I’m the lead on the analytics part. Statistics for engineers is a topic I’ve been asked to talk about a few times in the last year, and I actually did so. I also put a lot of things into writing. Most notably, there is a recent article of mine in ACM Queue, which is a great magazine available online for free. You can read that, and these slides will be on Twitter, so you’ll have the opportunity to find that article and read about all the topics covered here. I will use that shamelessly as an excuse to skip over topics and go fast. It’s really tough to distill a lot of statistics into a 30-minute presentation, so please look at those blog posts and articles to learn more. There will also be a three-hour version of this talk at SREcon in Dublin in a few weeks. If you like, attend that, but of course, as long as I’m here, feel free to come to me and talk about it. Maybe I will have time for questions later in this presentation as well.
[03:19] I’ve been told that mathematics can be a little bit dry from time to time, and to make it more engaging, you should always put it into a story and make it personal. So, what I want to do is tell you a tale of API monitoring, a problem that many of you will be able to relate to, in the very concrete form of a fictional web store which I call Etic. It has a great API, which is just a web interface it exposes. It has a catalog that users can see, and it has this funny property that the store loses money if requests take too long, because users get offended. They want to monitor this API, and I will tell you through the tale how they started, what they tried, what kind of mathematical difficulties they ran into, and what you need to know to make sense of your data. The main goals of monitoring are to measure the user experience and the quality of your service. Is the service up, and at the very least…
[04:33] Also, how do users feel about the service? With good monitoring, you’re able to determine the financial implications of service degradations. You can then use these metrics to define sensible service level targets for your Dev and Ops teams. As I said, I’m coming from the mathematical side, but I feel that failures happen, and it’s really about defining sensible measures of how much error is okay. Errors often have financial implications, and giving you a basis for that discussion is one key goal of monitoring APIs in particular.
[05:25] The very first step in the problem of monitoring a website or an API, in general, is the theme of external monitoring. What you might do is configure a synthetic check that reaches out to your API once every minute and checks if the request is successful or if there was an error. You might measure the latency of that request. If you do that with our tool or other monitoring tools, you arrive at charts that look like this. These kinds of charts are great if you want to measure pure availability, so is the service actually reachable? You can use it to alert on outages. However, it’s bad for measuring any kind of user experience because you just have a synthetic request, so the only thing you learn is how you serve the robot. There’s not a single user involved with that request.
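The synthetic check described above can be sketched as follows. This is a minimal illustration, not the speaker's tooling: the names `http_fetch` and `run_check`, the five-second timeout, and the rule that only 2xx responses count as success are all assumptions. The fetch function is injectable so the check can be exercised without a network.

```python
import time
import urllib.request


def http_fetch(url, timeout_s=5.0):
    """Fetch a URL and return the HTTP status code (raises on network errors)."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def run_check(url, fetch=http_fetch):
    """Run one synthetic check and return (ok, latency_in_seconds)."""
    started = time.perf_counter()
    try:
        # Assumption: any 2xx status counts as "the service is up".
        ok = 200 <= fetch(url) < 300
    except Exception:
        # Timeouts, DNS failures and HTTP error responses all count as failures.
        ok = False
    return ok, time.perf_counter() - started
```

A scheduler would call `run_check` once per minute and store the pair as one data point. Note that this only measures how the robot is served, which is exactly the limitation raised in the talk.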
[06:18] There are more problems with this kind of monitoring in that these request patterns are often very easy to cache or predict. With high probability, this one request you’re making, maybe to your homepage, will be served completely out of cache, and the only thing you’re measuring there is the performance of your network. These kinds of graphs are often seen, so you have these spikes and otherwise a measure here. Another thing to point out is that these kinds of graphics are always produced with Prometheus. I will make no excuse for that. We use these bar-like things, which I like compared to the line charts we’ve seen a lot. The point here is that line charts always suggest a kind of continuity between points.
[07:37] If you go from this point here to that point there, these are just two measurements. At that minute, you had that duration; at that minute, you had that duration. There’s absolutely no sense of continuity between those two measurements. I really have to move my mouse every other minute, and that’s a bit of a misleading visualization. If you are looking at line plots, always ask yourself if it’s really justified to draw a line that suggests continuity, or if it’s better served by dots, scatter plots, or other kinds of visualizations that show you the true data. You don’t want to be misled by your visualizations; you want to see the actual data points. There’s a problem more severe than just the drawing of lines, which is the phenomenon I call Spike Erosion.
[08:23] This is the same graph on a much larger scale, from April to May, so a little more than a month. There are many measurements taken in this whole month. You have to calculate the number of minutes in this month; it’s just a huge number, or a big number. It’s certainly more than the actual number of pixels you have available on that screen. That means every time you look at graphs which have more than…
[08:54] When you look at graphs over anything longer than a small time period, like an hour, you’ll see aggregation everywhere. What I did here is use what we call a histogram aggregation or a min-max aggregation to show how drastic the effects of aggregation are on your data. The usual view that many tools present will show you a graph, but I’ve added the actual measurements as a kind of heat map. It looks a bit like dirt, but we call it a heat map or histogram. These are your actual measurements, with very small points sprinkled all over the map. You have a lot of measurements up here, and this other line is the actual one-day maximum latency experienced.
[09:50] Each value in the line graph represents an hourly average of all measurements made, so you’re seeing 60 data points compressed into one. This is drastic because our clients tell us they can’t do capacity planning on averages. Whether the graph shows request rates or latencies, you have no idea of the maximum if the chart is aggregated by averages. The actual data distribution will be all over the place. Some metrics are okay to average, like disk utilization, but for many volatile metrics, it’s misleading to look at them on a large time scale. Many tools have this problem, and it’s not easy to address.
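The spike erosion effect described above is easy to reproduce. The sketch below uses made-up data (two hours of one-minute checks at 20 ms with a single 2000 ms spike) and a hypothetical helper `rollup`; it compares an hourly average rollup with an hourly maximum rollup.

```python
from statistics import mean


def rollup(samples, window, agg):
    """Aggregate consecutive windows of `window` samples with the function `agg`."""
    return [agg(samples[i:i + window]) for i in range(0, len(samples), window)]


# Hypothetical data: 120 one-minute latency measurements in milliseconds.
latencies_ms = [20.0] * 120
latencies_ms[37] = 2000.0  # one spike during the first hour

hourly_mean = rollup(latencies_ms, 60, mean)  # what an averaged chart shows
hourly_max = rollup(latencies_ms, 60, max)    # what actually happened at worst

print(hourly_mean)  # [53.0, 20.0]
print(hourly_max)   # [2000.0, 20.0]
```

The averaged series turns a two-second spike into a 53 ms bump. Keeping a maximum (or a histogram) per window alongside the average preserves the information that matters for capacity planning.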
[10:46] You can’t blame the vendors or tools because sending all the data points might involve sending 50 megabytes of JSON to your browser, potentially crashing it. Sometimes it’s better to get a rough idea, but keep in mind that if you’re viewing time scales of weeks, you’re seeing a very thin shadow of your actual data. After external monitoring, which is great, the next step is to understand the API better and to measure how actual users are served, not just robots. So the store starts with log analysis, which is also very beneficial.
[11:35] In log analysis, you record to a log file the time of completion of your request, the time the request went in, the duration, request latency, and other metadata like the document retrieved or session ID. It’s a rich source of information and pretty trivial to do instrumentation with log files. You can store high-dimensional data in a log file, providing a lot of rich information about the requests. This is the UML version of the API, the internal API. A user makes a request at a certain time, there’s request latency, and the response is sent out. All this information is contained in your log line, which is great because you’re measuring actual users. It’s the best kind of information you can have about the API. The problem, common to all log tools, is that it takes a long time to…
[13:00] Retrieving and storing data from log files can be inefficient, especially in the context of Big Data. It’s one of the rare ways you can produce petabytes of data. If you’re looking at data from a year or even a week, you have to process all the log files, which is slow and expensive. This is particularly true for high-volume APIs. APIs are everywhere, from web stores to microservices, and even in software architecture layers like system calls and block devices. Capturing latency and performance information for all these APIs is crucial, but using log files for something like system call latencies isn’t practical.
[14:20] However, if your API is low volume, logging accesses in log files is feasible. I’ve tried to digest the numerical information from logs and API usage in a more concise way, providing another mental model for reasoning about your API. Using a second dimension, I’ve rotated the latency and put it on the y-axis. Every request is represented by vertical lines, with the height indicating how long the request took. There’s a subtlety in monitoring: you can base the graph on the time the request went in or the time it went out. If a request is in flight, you can’t know how long it will take, affecting the timeliness and correctness of your data.
[16:00] By basing the chart on the request finish time, you have the full data available to plot at a certain point. This visualization of your API allows for various analyses. You can project data to the y-axis and create a histogram, which we call a marginal distribution, and derive a probability density function and a maximum likelihood estimator for certain probability distributions. As a mathematician, you can see an arrival rate process and consider a Poisson distribution. If you’re really fancy, you can remember that a request isn’t always on the CPU or being served by the machine, which might actually be…
[17:20] Requests can spend time in a queue, leading to fascinating problems related to balancing concurrency, request rates, and determining capacity. These issues can be analyzed through queuing theory, and there’s a great book by Baron Schwartz on this topic that I highly recommend. It’s another mathematical perspective you can apply. However, when discussing APIs with sales and marketing teams, the focus shifts. They see requests as actual people interacting with the system, not just percentiles or mean values.
[18:03] The way requests are handled affects user satisfaction. Some users will be happy, like when a Google bot indexes your site, while others may leave if they’re dissatisfied, taking their business elsewhere. Latency matters because it impacts user emotions and business outcomes. Simply using percentiles and mean values can obscure the fact that you’re dealing with real people and their reactions.
[18:59] To manage large log files, many people compress the information by calculating mean values. This involves summing all values and dividing by the number of samples. Physically, it’s like a lever with equal mass on both sides, finding an equilibrium point. However, outliers can skew the mean, making it easy to manipulate. For instance, if a store measures object fetches, many will be cached, resulting in low latencies and a misleading mean value.
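The mean and the lever picture above can be written down explicitly. For latency samples $x_1, \dots, x_n$ (notation introduced here for illustration), the mean is the point around which the deviations balance:

```latex
\mu = \frac{1}{n} \sum_{i=1}^{n} x_i ,
\qquad
\sum_{i=1}^{n} (x_i - \mu) = 0 .
```

As a made-up example of how a single outlier moves the balance point: nine cached fetches at 1 ms and one uncached fetch at 1000 ms give

```latex
\mu = \frac{9 \cdot 1 + 1000}{10} = 100.9 \ \text{ms},
```

a value that describes neither the fast requests nor the slow one.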
[20:57] In practice, you might see outliers affecting mean values, especially during low-traffic times like night when only a few requests, such as Google indexing, occur. These can have poor cache hit ratios, skewing the mean. For effective request latency monitoring, you must select the reporting period carefully. Calculating mean values can compress a lot of information into a single number, which can be problematic, especially when dealing with actual user interactions.
[21:34] To address the outlier problem, you can use median values instead of mean values. Another approach is using truncated means, where you discard the minimum and maximum values before calculating the mean. Collecting deviation measures is also an option, though it’s a delicate topic. People often turn to percentiles to avoid the issues with averages.
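The difference between the mean, the median, and a truncated mean can be seen on a small made-up sample. The helper `truncated_mean` below is a hypothetical name and implements only the simplest variant mentioned above: drop one minimum and one maximum value, then average the rest.

```python
from statistics import mean, median


def truncated_mean(samples):
    """Mean after discarding one minimum and one maximum value."""
    if len(samples) < 3:
        raise ValueError("need at least three samples")
    ordered = sorted(samples)
    return mean(ordered[1:-1])


# Hypothetical night-time latencies in ms: five cached hits, one slow crawler request.
night_latencies_ms = [12, 15, 11, 14, 13, 900]

print(mean(night_latencies_ms))            # about 160.8, dominated by the outlier
print(median(night_latencies_ms))          # 13.5
print(truncated_mean(night_latencies_ms))  # 13.5
```

Both robust measures ignore the single slow request. Whether that is desirable depends on whether the outlier is noise or a real user having a bad time.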
[22:12] A percentile, like the 50th percentile, divides your data set into two halves. The 99th percentile divides it into a top 1% and a bottom 99%. However, percentiles are not unique; for example, a whole range of values can qualify as the 90th percentile of the same data set. There are eight different ways to define percentiles uniquely, so it’s important to choose the right method for your needs.
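The non-uniqueness mentioned above can be checked directly. In this sketch, `is_percentile` is a hypothetical helper that encodes one common definition: a value is a p-th percentile if at least p% of the samples are at or below it and at least (100 − p)% are at or above it. The two conventions compared are the `inclusive` and `exclusive` methods of Python's `statistics.quantiles`; they are just two examples of the competing definitions, not necessarily the ones the speaker had in mind.

```python
from statistics import quantiles


def is_percentile(samples, p, y):
    """True if y satisfies the defining property of a p-th percentile of samples."""
    n = len(samples)
    at_or_below = sum(1 for x in samples if x <= y)
    at_or_above = sum(1 for x in samples if x >= y)
    # Integer arithmetic avoids floating-point rounding in the comparison.
    return at_or_below * 100 >= p * n and at_or_above * 100 >= (100 - p) * n


samples = list(range(1, 11))  # made-up latencies 1..10 ms

# Every value between 9 and 10 qualifies as a 90th percentile.
print([y for y in (8.9, 9, 9.5, 10, 10.1) if is_percentile(samples, 90, y)])

# Two standard-library conventions pick different points in that range.
p90_inclusive = quantiles(samples, n=10, method="inclusive")[-1]
p90_exclusive = quantiles(samples, n=10, method="exclusive")[-1]
print(p90_inclusive, p90_exclusive)
```

When comparing percentile values across tools, it is worth checking which convention each tool uses, since the numbers can legitimately differ on the same data.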
[23:18] Percentile monitoring involves selecting a reporting period to determine how many values go into a bucket. You then sort the data to compute latency percentiles, such as the 50th, 90th, and 99th percentiles, and store these as a time series. Alerts can be set when percentiles exceed certain thresholds, providing a robust way to assess API performance and user satisfaction. However, choosing the right percentiles upfront is crucial, and percentiles cannot be aggregated.
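The percentile monitoring procedure above (pick a reporting period, bucket the requests, sort, compute percentiles, alert on a threshold) can be sketched as follows. The function names, the nearest-rank convention, the 60-second period, and the 100 ms alert threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def nearest_rank(sorted_values, p):
    """p-th percentile (0 < p <= 100) of a sorted list, nearest-rank convention."""
    rank = (p * len(sorted_values) + 99) // 100  # integer ceil(p * n / 100)
    return sorted_values[max(rank, 1) - 1]


def percentile_series(requests, period_s, percentiles=(50, 90, 99)):
    """Group (finish_time_s, latency_ms) pairs by period and compute percentiles."""
    buckets = defaultdict(list)
    for finish_time_s, latency_ms in requests:
        period_start = int(finish_time_s // period_s) * period_s
        buckets[period_start].append(latency_ms)
    return {
        start: {p: nearest_rank(sorted(values), p) for p in percentiles}
        for start, values in sorted(buckets.items())
    }


# Hypothetical log: a healthy first minute, then one very slow request.
requests = [(t, t + 1) for t in range(10)]           # latencies 1..10 ms
requests += [(60 + t, 10) for t in range(9)] + [(69, 500)]

series = percentile_series(requests, period_s=60)
alerts = [start for start, ps in series.items() if ps[99] > 100]

print(series)  # {0: {50: 5, 90: 9, 99: 10}, 60: {50: 10, 90: 10, 99: 500}}
print(alerts)  # [60]
```

Note that the set of percentiles and the period length are fixed before any data arrives, which is the upfront choice the talk warns about.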
[24:49] In practice, the 50th percentile competes with the mean value, and the median often provides a clearer picture. The 90th percentile can be a good alert point, indicating that most users are satisfied. However, percentiles have limitations, such as not being intuitive in graphs and not being aggregatable. The median of two median values is not the median of the total sample set, which is a significant issue when dealing with multiple nodes.
[25:59] When computing medians on different nodes serving the same API, you can’t determine the total median from those individual medians. Similarly, if you calculate percentiles over multiple one-minute periods, you can’t derive the percentile for the entire five-minute period. To use percentiles correctly, you must know the aggregation levels, periods, and percentiles needed upfront. Percentiles should not be rolled up because averaging them results in meaningless values.
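A two-node counterexample makes the aggregation problem above concrete. The data is made up: one node served three fast requests, the other served seven slower ones.

```python
from statistics import median

# Hypothetical latencies in ms from two nodes serving the same API.
node_a = [1, 2, 3]
node_b = [10, 11, 12, 13, 14, 15, 16]

median_of_medians = median([median(node_a), median(node_b)])
true_median = median(node_a + node_b)

print(median_of_medians)  # 7.5
print(true_median)        # 11.5
```

The rolled-up value of 7.5 ms is not close to any request that was actually served. The per-node medians discard how many samples each node saw and how they were distributed, so no function of the two medians alone can recover the true value.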
[27:01] It’s crucial to keep percentiles non-aggregated and store them all to maintain meaningful values. When tuning SLAs, it’s challenging to decide which percentiles to focus on, such as whether you can serve 99.9% of users well. Percentiles come with significant costs, as discussed on our blog.
[27:47] To address these issues, we propose using histograms. Many use histograms internally to compute percentiles. By dividing request latency into bands and counting samples, you create a histogram. Histograms can be stored compactly and aggregated over time, unlike logs. They allow for various data analyses and can derive percentiles and averages with slightly reduced precision.
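The histogram approach above can be sketched in a few lines. This is a simplified illustration under stated assumptions: fixed 10 ms bands (real systems may choose band boundaries differently), a `Counter` keyed by the lower edge of each band, and hypothetical function names. It shows the two properties the talk relies on: histograms from different nodes merge by adding counts, and percentiles can be read off the merged histogram up to band precision.

```python
from collections import Counter

BAND_MS = 10  # assumed fixed band width, chosen only for illustration


def to_histogram(latencies_ms):
    """Count samples per band; keys are the lower edge of each band in ms."""
    return Counter(int(x // BAND_MS) * BAND_MS for x in latencies_ms)


def merge(*histograms):
    """Aggregate histograms (across nodes or time periods) by adding counts."""
    total = Counter()
    for histogram in histograms:
        total.update(histogram)
    return total


def percentile_upper_bound(histogram, p):
    """Upper edge of the band that contains the nearest-rank p-th percentile."""
    n = sum(histogram.values())
    rank = (p * n + 99) // 100  # integer ceil(p * n / 100)
    seen = 0
    for lower_ms in sorted(histogram):
        seen += histogram[lower_ms]
        if seen >= rank:
            return lower_ms + BAND_MS
    raise ValueError("empty histogram")


node_a = [1, 2, 3]
node_b = [10, 11, 12, 13, 14, 15, 16]
merged = merge(to_histogram(node_a), to_histogram(node_b))

print(dict(merged))                        # {0: 3, 10: 7}
print(percentile_upper_bound(merged, 50))  # 20
```

Merging the per-node histograms gives exactly the histogram of the combined samples, so the median is correctly located in the 10 to 20 ms band. The price is precision: the result is only known to within one band width.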
[29:01] Visualizing API performance with histogram heat maps is intuitive. For example, a heat map can show data from multiple nodes over 10 minutes, revealing issues like requests slowing over time due to a bug. This method allows for deriving other measures, such as means and percentiles, and provides a more intuitive understanding of API performance.
[30:09] This is the moment when users become frustrated if the service is too slow. Their phones might prompt them to go faster, and you can count the number of dissatisfied users as a meaningful metric. This is a quest for the monitoring community: to find metrics that directly reflect business impact.
[30:43] For example, the 90th percentile might not seem bad, but if the volume served is higher, the financial implications of the 99th percentile exceeding a threshold are significant. It’s crucial to correlate service volume with quality to derive actionable insights that matter to your boss.
[31:19] Another approach is to count the total number of users who were upset, which can be insightful. At the end of the day, knowing that 4,000 people are unhappy with your service can be quite revealing.
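Counting upset users, as suggested above, is a simple sum once latencies are stored as a histogram. The sketch below assumes a histogram keyed by the lower edge of each latency band and a frustration threshold of 500 ms; the band boundaries, the counts, and the threshold are all invented for illustration.

```python
def count_slow_requests(histogram, threshold_ms):
    """Count samples in bands whose lower edge is at or above threshold_ms."""
    return sum(
        count for lower_ms, count in histogram.items() if lower_ms >= threshold_ms
    )


# Hypothetical daily histogram: lower band edge in ms -> number of requests.
daily_histogram = {0: 90_000, 100: 5_000, 500: 3_000, 1000: 1_000}

unhappy = count_slow_requests(daily_histogram, threshold_ms=500)
print(unhappy)  # 4000
```

Unlike a percentile, this count scales with traffic: the same 99th percentile value means more unhappy people on a busy day, which is the correlation between volume and quality that the talk asks for. The threshold should coincide with a band boundary, otherwise the count is only approximate.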
[31:32] I’ll conclude with a few takeaways: strive to derive meaningful metrics from your data. Thank you very much. [Applause]