Abstract
Measuring latency for monitoring and benchmarking purposes is notoriously difficult. There are many pitfalls in collecting, aggregating, and analyzing latency data.
In this talk, we approach the topic from a top-down perspective and compile known complications along with best-practice approaches for avoiding them. This will include:
- Measurement Overhead
- Queuing effects
- Coordinated omission
- Histograms for Aggregation and Visualization
- Percentile aggregation
- Latency bands and burn-down charts
- Latency comparison methods (QQ Plots, KS-Distance)
Table of Contents
- Introduction and Speaker Background
- Statistics and Operational Problems
- Inspiration for the Talk
- Coordinated Omission and Recent Trends
- Challenges in Debugging Latency
- Defining Latency and Measurement Basics
- Considerations in Time Measurement
- Clock Selection and Measurement Overhead
- Monitoring and Reporting Challenges
- Tool Selection in Latency Measurement
- Distributed Systems Latency
- Understanding Queuing Systems
- Client-Side vs. Server-Side Measurement
- Simulation of Latency in a Queuing System
- Analyzing System Load Effects
- Key Takeaways and Limitations
- Conclusion and Future Work
Transcript
[00:16] Hello and welcome to “How to Measure Latency” at the P99 Conference 2021. Thank you very much for showing up, and thanks to the organizers for letting me speak at this beautiful event. My name is Heinrich Hartmann, and I am a Principal Engineer at Zalando, working in the SRE department. I’m going to start off with some motivation and background for this talk.
[00:38] I have been discussing statistics in the operations domain for the better part of five years. I previously worked for a monitoring vendor called Circonus, which has a very latency-focused product. I took it upon myself to explore the mathematical aspects of monitoring and operational problems, publishing several blog posts, articles, and even a paper about the histogram data structure.
[01:08] The inspiration for this talk comes from a series of talks by Gil Tene, dating back to 2013-2015, in which he discussed how not to measure latency. Those talks are incredibly insightful; I’ve watched them multiple times and still haven’t grasped all the details. This conference was the perfect excuse to delve into the material, update it to a 2021 version, and share it with this audience.
[01:48] Coincidentally, I came across a recent article on the P99 Conference blog about coordinated omission, a concept introduced by Gil in his talks. Seeing it published just two days ago made me smile. It’s a very good blog post that goes into much more depth than I can cover in this 20-minute session, so a shout-out to that post.
[02:22] Latency is the hardest problem you will ever debug, as stated by Theo Schlossnagle from Circonus, my former boss. This resonates with me and sets the tone for this presentation. Latency problems are rarely reproducible, and the telemetry information needed to understand the source of latency is hard to obtain. Sometimes, you might even have conceptual mistakes, thinking you’re looking at one thing when it’s actually something slightly different, which can be crucial for the problem you’re debugging.
[03:03] This talk focuses on the difficulties in measuring latency, not on the best tools to measure it. It’s about the concepts behind it and what you need to do right to have proper measurements for methodological debugging. Let’s dive right in. We should probably first talk about what latency is before discussing how to measure it. You have a certain operation in a computer system with a start time and an end time, and the difference between those timestamps is the latency.
[03:42] There are some subtleties with clocks in computer systems that I will briefly discuss. The most important thing is to measure latency on a single computer system. Measuring the start on one system and the end on another can lead to inconsistencies that are hard to resolve. From a theoretical perspective, that’s really all there is to it.
[04:07] In practice, it looks similar in code. When an operation starts, you record the current time, and when it ends, you record the current time again. The difference gives you a latency measurement. A few things to watch out for: ensure you’re actually measuring the timestamp when the operation ends. If your code has early returns or exceptions, you need to catch them, and some languages provide facilities for this.
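A minimal sketch of this pattern in Python, assuming hypothetical `process` and `record_latency` helpers; a `try/finally` block is one such language facility and guarantees the end timestamp is captured even on early returns or exceptions:

```python
import time

def handle_request(request):
    # time.monotonic() is a monotonic clock, unaffected by NTP adjustments.
    start = time.monotonic()
    try:
        return process(request)          # hypothetical operation being timed
    finally:
        # The finally block runs on normal returns, early returns, and exceptions,
        # so the end timestamp is always recorded.
        latency = time.monotonic() - start
        record_latency(latency)          # hypothetical reporting hook
```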
[04:40] Another aspect is which clock to use. You want a high-resolution, monotonic system clock for these applications. Monotonic time prevents backward jumps when NTP adjusts the clock, and system time is preferred over thread-local CPU time, which can be interesting but requires caution. Python has a time.monotonic function, which is well suited for this. PEP 418, which introduced this function, does a great job explaining the different clocks available on various systems.
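As a quick way to see what PEP 418 is talking about, Python lets you inspect the properties of the clocks it exposes on your platform; a small sketch (the reported resolutions and flags vary by operating system):

```python
import time

# Inspect the clocks Python exposes on this platform (see PEP 418).
for name in ("monotonic", "perf_counter", "process_time", "time"):
    info = time.get_clock_info(name)
    print(f"{name:13s} monotonic={info.monotonic} "
          f"adjustable={info.adjustable} resolution={info.resolution}")
```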
[05:29] Measurement overhead is also something to be wary of. The execution of code lines takes time, and you must be mindful of how much additional latency you’re introducing. For measuring I/O latencies like network or disk accesses, this usually isn’t a problem. However, for CPU-bound workloads, you need to care about this a lot. Microbenchmarking often involves running functions hundreds of thousands of times to avoid skewing the measurements due to overhead.
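A small illustration of the amortization idea, using Python's standard timeit module; the function being timed and the iteration counts here are only for illustration:

```python
import timeit

# Timing a fast, CPU-bound function once would mostly measure clock overhead.
# timeit amortizes that overhead by running the statement many times per sample.
samples = timeit.repeat(
    stmt="sorted(data)",
    setup="data = list(range(1000, 0, -1))",
    repeat=5,        # five independent samples
    number=100_000,  # each sample runs the statement 100,000 times
)
per_call = min(samples) / 100_000
print(f"~{per_call * 1e6:.2f} µs per call")
```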
[06:07] It’s always good to abstract complex things in software engineering. We’ve had good experiences with using decorators or tracing libraries. Measuring latency is often best done with traces, as they provide more information. If you can use tracing, most of this is solved for you.
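A sketch of the decorator approach; the `record` callback here is a stand-in for whatever your metrics or tracing library actually provides:

```python
import functools
import time

def measure_latency(record):
    """Decorator that reports the latency of every call to `record`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                record(func.__name__, time.monotonic() - start)
        return wrapper
    return decorator

# Usage: latencies end up wherever `record` sends them, e.g. a histogram or a tracer.
@measure_latency(lambda name, seconds: print(f"{name}: {seconds * 1000:.2f} ms"))
def fetch_user(user_id):
    time.sleep(0.01)   # stand-in for real I/O
    return {"id": user_id}

fetch_user(42)
```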
[06:34] There’s a subtlety when monitoring latency over longer periods: two timings interfere, the operation timing and the reporting-window timing. The reporting window might be 10 seconds or a minute. The question is, for operations that overlap with this window, where do you attribute them? There are no silver bullets here, and you have to make decisions based on your specific context.

[07:05] When choosing tools for measuring latency, it’s important to explore and understand the choices available. Consider how operations that intersect with the reporting-window boundary are attributed. If you have long-running requests, understand how they affect your measurements. This understanding will help you interpret your signals effectively.
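To make the attribution question concrete, here is a minimal sketch that follows one common convention, attributing each sample to the window in which the operation ends; this is an assumption for illustration, not a statement about any particular tool, and other tools attribute by start time or split long operations instead:

```python
import time
from collections import defaultdict

WINDOW_SECONDS = 10  # reporting window, e.g. 10 s or 60 s

# Convention assumed here: attribute each latency sample to the reporting
# window in which the operation *ended*.
windows = defaultdict(list)

def report(start, end):
    latency = end - start
    window_key = int(end // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_key].append(latency)

# A request that started in one window but finished in the next is counted
# entirely in the later window; its latency still reflects the full duration.
now = time.time()
report(now - 12.0, now)   # long-running request spanning a window boundary
```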
[07:36] That’s all I have for measuring latency. The basics are straightforward, but there’s more to discuss, particularly about where to measure latency. In distributed systems, we’re interested in client-server interactions. You need to decide where to measure latency, which involves some complexity beyond the simple request-reply model.
[08:22] The process involves application code making a request, passing through the runtime, operating system, TCP/IP stack, hardware buffers, and network gear, with queues introducing latency at each step. Observing these latency sources is challenging. I’ve tried to understand latency in Apache buffers and operating system queues, but tracing an exact HTTP call through the system is nearly impossible.
[09:16] A practical model is to view the system as having a hidden queue. The client makes a request, and there are many unseen queues before the server processes it and sends a reply. While there is queuing on the return path, it’s usually less significant than on the way to the server. For this talk, I’m ignoring it, but further investigation might be worthwhile.
[09:59] In the abstract model for client-server interaction, there are two spans: the client-side response time and the server-side service time. The client span usually encompasses the server span, but this isn’t always the case. The service time can vary, but this model provides a good starting point for understanding the differences.

[10:32] In real-world queuing systems, the difference between response time and service time is significant. For example, when waiting in line for a bus, the service time is how long the driver takes to process your request, while the response time is the total time from entering the queue to receiving your ticket. This distinction is crucial in computer systems as well, where we must consider the engineering trade-offs between measuring on the server or the client side.
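In the simplest model, the relationship between the two spans is just an addition; the numbers below are made up for illustration:

```python
# The client-side response time decomposes (in the simplest model) into the
# time spent waiting in the hidden queues plus the time the server works on it.
queue_wait    = 0.090   # 90 ms standing in line
service_time  = 0.010   # 10 ms of actual work on the server
response_time = queue_wait + service_time   # 100 ms as seen by the client
```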
[11:09] Measuring on the server provides only the service time, which is easier to implement but less comprehensive. Measuring on the client side, especially with diverse clients like mobile devices, is more complex but offers a more accurate picture of the latency experienced by users. Network intermediaries like API gateways and load balancers can also provide valuable latency data, depending on their proximity to the client or server.
[12:24] The key takeaway is that you cannot measure response time on the server, only service time. Being on the server is like being inside the bus, unable to see the end of the queue. This limitation underlies many challenges in latency measurement. Gil’s talk highlights the differences between server-side and client-side latency data and how to report them effectively.
[13:05] To illustrate these concepts, I created a simulation. It models a theoretical queuing system with a client sending requests at a varying rate and a server with 10 workers, each taking 10 milliseconds of service time per request. This setup gives a total capacity of 1000 requests per second (RPS). We monitor both the client and the server side to understand the dynamics of latency in this system.

[13:46] We are examining best practices by looking at request rate, concurrency, and latency. On the client side, we measure response time, while on the server side, we measure service time. The server’s request rate, also known as the arrival rate, and concurrency, which indicates active requests, are key metrics. Service time is constant at 10 milliseconds per request, though in practice an overloaded system might experience longer service times.
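The sketch below is not the speaker's actual simulation code, just a minimal discrete-event version of the same setup under some extra assumptions (Poisson arrivals, FIFO assignment to the earliest-free worker): 10 workers at a deterministic 10 ms each, i.e. 1000 RPS of capacity, with client-side response time recorded per request.

```python
import heapq
import random
import statistics

WORKERS = 10
SERVICE_TIME = 0.010   # 10 ms per request -> 10 * 100 = 1000 RPS capacity

def simulate(request_rate, duration=30.0, seed=1):
    random.seed(seed)
    # Each worker is represented by the time at which it next becomes free.
    workers = [0.0] * WORKERS
    heapq.heapify(workers)
    t, response_times = 0.0, []
    while t < duration:
        t += random.expovariate(request_rate)   # Poisson arrivals
        free_at = heapq.heappop(workers)
        start = max(t, free_at)                 # wait until a worker is free
        finish = start + SERVICE_TIME
        heapq.heappush(workers, finish)
        response_times.append(finish - t)       # client-side response time
    return response_times

for rate in (500, 900, 990):
    rt = sorted(simulate(rate))
    p50 = statistics.median(rt)
    p99 = rt[int(0.99 * len(rt))]
    print(f"{rate:4d} rps: median={p50*1000:5.1f} ms  p99={p99*1000:6.1f} ms")
```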
[14:37] Under low load conditions, with a request rate of 500 requests per second, the system shows no difference between request and arrival rates. Concurrency is stable with five active requests on both client and server sides. Server latency remains constant at 10 milliseconds, while client latency slightly exceeds this, ranging from 12 to 15 milliseconds. We use percentile metrics, focusing on the median and maximum latencies, to understand latency behavior.
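The stable concurrency of five is exactly what Little's law predicts from the arrival rate and the service time; a quick sanity check with the numbers from this scenario:

```python
# Little's law: concurrency = arrival_rate * time_in_system
arrival_rate = 500        # requests per second
service_time = 0.010      # 10 ms per request, essentially no queueing at low load
print(arrival_rate * service_time)   # -> 5.0 requests in flight, matching the
                                     #    five active requests mentioned above
```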
[15:41] As we increase the load to 90% capacity, the request and arrival rates remain similar, but we start to see some queuing. Utilization climbs toward the ten-worker limit, and concurrency increases slightly. Latency percentiles begin to grow, indicating more queuing delay. At 99% capacity, queuing delays become more pronounced, with requests taking longer to clear the queue.
[16:36] At 100% capacity, the system shows significant differences. The arrival rate and utilization are maxed out, while service time remains flat. The request rate is adjusted to 2000, leading to increased pending requests and growing latency. Pushing the system beyond its limits results in unbounded growth in concurrency and latency percentiles, indicating complex behavior and potential system instability.

[17:17] When pushing a system dynamically against or over its limit, you might observe a hockey stick curve in the arrival rate. This is where queuing starts to become significant, and latency profiles become more pronounced. Requests that have waited in the queue for a long time eventually arrive, resulting in extremely high latency for those requests. This theoretical behavior is typical in such systems.
[17:57] If the server stalls for a period, you might see no new requests arriving, constant utilization, and no latency samples being reported. However, a few requests may take an exceptionally long time, visible only if you look at the maximum latency; other percentiles like p99 might remain unaffected. This is common in situations like garbage collection (GC) pauses, where the client observes increased queuing and a growing backlog of requests.
[18:49] Latency metrics often miss server-side stalling, and understanding this is crucial in benchmarking. There can be hidden coordination between the load generator and the system under test, which ends up confusing service time with response time: if the load generator automatically backs off when the server falls behind, the queue never grows, but the measured load and latency profile underestimate what uncoordinated clients would experience.
[19:46] If both the server and the client stall, especially if the load generator runs in the same process, this hidden coordination can mask issues entirely. This is why micro-benchmarking might not reveal these problems. For a deeper understanding, refer to Gil’s talk and Ivan’s blog, which explore these concepts in detail.
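One way to avoid the hidden coordination described above is to generate load on a fixed schedule and measure latency from the intended send time rather than the actual one, so a stalled server shows up as high latency instead of silently reducing the load. This is only an illustrative sketch, not Gil's or any particular tool's implementation; `send_request` is a hypothetical request function.

```python
import time

def run_load(send_request, rate_per_second, duration_seconds):
    interval = 1.0 / rate_per_second
    start = time.monotonic()
    latencies = []
    n = 0
    while True:
        intended = start + n * interval      # when this request *should* go out
        if intended - start > duration_seconds:
            break
        now = time.monotonic()
        if intended > now:
            time.sleep(intended - now)       # keep to the fixed schedule
        send_request()
        # Latency is measured against the intended start, not the actual one,
        # so time spent stuck behind a stalled server is not silently dropped.
        latencies.append(time.monotonic() - intended)
        n += 1
    return latencies

if __name__ == "__main__":
    import random
    lat = run_load(lambda: time.sleep(random.uniform(0.001, 0.005)),
                   rate_per_second=100, duration_seconds=2)
    print(f"max latency: {max(lat) * 1000:.1f} ms over {len(lat)} requests")
```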
[20:17] Analyzing latency involves using tools like histograms to capture every request’s latency over time. This provides the necessary data to analyze latency effectively. For more insights, refer to my talks at two conferences, which are linked for further exploration.

[20:42] While raw data storage is ideal, using histograms is a great compromise for analyzing latency. More systems, including those in the open-source community, are adopting histograms for this purpose.
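A minimal sketch of the histogram idea, not the HDR histogram or Circonus log-linear implementations the talk alludes to: bucket latencies on a logarithmic scale so the data structure stays small while preserving enough resolution to estimate percentiles. The bucket granularity chosen here is arbitrary.

```python
import math
from collections import Counter

class LatencyHistogram:
    def __init__(self, buckets_per_decade=18):
        self.k = buckets_per_decade
        self.counts = Counter()

    def record(self, seconds):
        # Map the value (must be > 0) to a log-scale bucket index.
        self.counts[math.floor(self.k * math.log10(seconds))] += 1

    def percentile(self, p):
        target = p / 100 * sum(self.counts.values())
        seen = 0
        for bucket in sorted(self.counts):
            seen += self.counts[bucket]
            if seen >= target:
                # Return the upper bound of the bucket as the estimate.
                return 10 ** ((bucket + 1) / self.k)

h = LatencyHistogram()
for latency in (0.010, 0.011, 0.012, 0.250):   # seconds
    h.record(latency)
print(f"p99 estimate: {h.percentile(99) * 1000:.1f} ms")
```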
[20:55] Thank you for your attention. I plan to write a blog post on this topic, which will be available at heinrichhartmann.com/latency. If you’re interested in more statistics and latency information, follow me on Twitter for updates. Thank you again, and enjoy the rest of the conference. Bye.