The state of the Histogram

May 17, 2021
Remote


Abstract

In this talk, we are going to survey different available technologies to capture (latency) distributions and store them in time-series databases. This includes the theoretical underpinnings, accuracy and performance, operational aspects, and adoption. Our aim is to provide an overview of related work in the histogram space and share evaluation results. We also discuss the importance of histograms for latency SLOs and the recent developments in open-source technology that are embracing logarithmic histogram structures.

Transcript

[00:00] Hello and welcome to "The State of the Histogram" at the SLO Conference in May 2021. Thank you very much for showing up, and thanks a lot to the organizers for giving me the opportunity to speak. My name is Heinrich Hartmann. I currently work for a company called Zalando, which sells socks and all kinds of other fashion articles in Europe.

[00:28] The main reason for this talk is actually my former employment. I used to be with a company called Circonus, which built a monitoring platform. As part of my work there, I coauthored a paper with Theo Schlossnagle called the Circllhist. The Circllhist is a histogram data structure for IT operations applications. In the paper, we look at a variety of other histogram implementations that are available and benchmark them against each other.

[01:00] We evaluate how they perform and what their size characteristics are. The idea of this talk is to give you an overview of the related work that exists in the histogram space and share some of the evaluation results. This talk is brief, so if you’re interested in more details, please look up the paper. It’s easy to find on arXiv, and you’ll get the full details.

[01:29] Let me mention that the source code and the datasets we used for the evaluations are on GitHub, so you can try everything yourself. Of course, everything I say should be taken with a grain of salt, as I used to work for a vendor and have some stakes in this game. Let’s get right to it and start with a little introduction to histograms before we dive in further.

[01:59] Why are we talking about this topic at an SLO Conference? Histograms are a data structure that allows you to summarize data in the form of statistical distributions. Here, I’ve painted a bunch of float values in the form of little Xs. You can insert them into a histogram data structure, which I’ve represented as a black box, meaning any data structure you can insert this data into.

[02:28] We call it a histogram if it has three critical properties. First, it should be small, offering a much smaller memory footprint than storing raw data. Second, despite the reduced size, it should allow for accurate percentile calculations. The third property, which is very critical but more subtle, is that it should be mergeable. If you have not one but maybe a thousand or a million histograms, you should be able to compress them into a single histogram without losing accuracy for percentile calculations, [03:20] or at least with a bounded error. This is important for the SLO Conference because latency SLOs are a critical application here. I would argue that latency SLOs are best done with histograms, or are maybe even only possible if you have histogram data structures at hand. I’ve talked about this topic at a different conference, for example at FOSDEM 2019. There’s also a blog post on the Circonus blog called “Latency SLOs Done Right.” If you’re not familiar with this, look it up.
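To make those three properties concrete, here is a minimal sketch of a log-bucketed histogram in Python. This is an illustration only, not the Circllhist, HDR Histogram, or DDSketch implementation; the class name, the base of 1.1, and the bucket layout are assumptions chosen for brevity. It shows why such a structure is small (one counter per occupied bucket), gives percentiles with bounded relative error, merges losslessly by bucket-wise addition, and answers latency-SLO-style queries:

```python
import math

class LogHistogram:
    """Minimal log-bucketed histogram sketch (illustrative, not a real library).

    Values map to exponentially sized buckets, so each bucket bounds the
    relative error, and merging two histograms is lossless.
    """

    def __init__(self, base=1.1):
        self.base = base
        self.buckets = {}  # bucket index -> count

    def insert(self, value):
        # Bucket index is floor(log_base(value)): every value in bucket i
        # lies in [base**i, base**(i+1)), bounding relative error by `base`.
        idx = math.floor(math.log(value, self.base))
        self.buckets[idx] = self.buckets.get(idx, 0) + 1

    def merge(self, other):
        # Merging is bucket-wise addition -- no accuracy is lost.
        for idx, count in other.buckets.items():
            self.buckets[idx] = self.buckets.get(idx, 0) + count

    def quantile(self, q):
        # Walk buckets in order until q of the total count is covered.
        total = sum(self.buckets.values())
        target = q * total
        seen = 0
        for idx in sorted(self.buckets):
            seen += self.buckets[idx]
            if seen >= target:
                return self.base ** idx  # lower bound of the bucket
        return None

    def fraction_below(self, threshold):
        # Latency-SLO style query: what fraction of samples is faster than
        # the threshold? Exact for buckets entirely below the threshold.
        total = sum(self.buckets.values())
        limit = math.floor(math.log(threshold, self.base))
        below = sum(c for i, c in self.buckets.items() if i < limit)
        return below / total if total else 0.0
```

The key design point is that the bucket boundaries are fixed by the formula, not by user configuration, so any two such histograms are always compatible for merging.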

[03:54] We have a slightly different agenda for this talk here, so let’s continue with a brief history. Here is a picture of Karl Pearson, who created the first histogram in 1895. Fast-forward a hundred years, and Brendan Gregg, shouting in the data center, is using histograms for operational purposes. This is the first operational use of histograms that I’m aware of; there was probably prior art at Sun, but here we clearly see I/O latencies over time being captured in histograms and then plotted as a heat map. By 2009, Brendan was already doing this.

[04:39] Three years later, in 2012, histograms first appeared in monitoring products. My former boss Theo at Circonus demoed this in a talk called “Monitoring and Observability,” using histograms for API latencies. You can see the throughput of an API and the latency distribution of these requests over time. In 2014, Gil Tene from Azul Systems gave a talk called “How Not to Measure Latency,” focusing on JVM profiling and GC latencies. It was a very insightful talk that highlighted systematic difficulties with measuring latency and how to best address them.

[05:31] As part of this effort, Gil Tene also published a data structure called the HDR Histogram, which is very relevant, capable, open source, and still in use. Around the same time, the t-digest came onto the scene. It’s not a traditional histogram as we know it, but the applications are quite similar, especially quantile calculations on compressed data. Ted Dunning and Otmar Ertl are the authors of that paper and data structure.

[06:04] More recently, in 2019, a paper at VLDB by Masson, Rim, and Lee from Datadog introduced DDSketch, which is essentially a histogram serving the same purpose. In 2020, Theo and I decided to write a paper about the data structure we had been using all along, so that people could find it, look it up, and make sense of it. This paper is also the main source for this talk. Another recent development, in 2021, is that the Circllhist underwent a licensing change and is now available as OpenHistogram (openhistogram.io). So if you want to check it out and adopt it in any kind of open-source product, this should now be possible [06:55] with this new license. Now that we’ve gone through the history of histograms, you have the luxury of a variety of histogram implementations at your disposal. If you start looking into them, you might be confused and ask, “Which one is better? Which one should I use?” Here are the desirable properties I think histograms should have: they should be small, insertion should be fast, you should get accurate percentiles, and they should be mergeable. We’ve talked about these properties before, and I also think zero configuration is a big benefit.

[07:32] Zero configuration is beneficial for usability, so users don’t have to make choices about how to configure a histogram, which usually requires understanding how the histogram works—a high barrier to entry. The second benefit of zero configuration is compatibility. If you have two histograms from different sources, they will always be compatible and mergeable if you didn’t have to configure them. Configuration often leads to problems with merging differently configured histograms, so we see this as an advantage.
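To illustrate the compatibility point, here is a small sketch of why merging works only when two histograms share a bucket configuration. The bucket boundaries and counts are made up for illustration; they do not come from any real system:

```python
# Two instances of the same service, both using the same hand-picked
# upper bucket boundaries (illustrative numbers, not a real configuration).
bounds = [0.1, 0.5, 1.0]    # seconds
counts_a = [120, 30, 5]     # requests per bucket, instance A
counts_b = [200, 45, 12]    # requests per bucket, instance B

# With identical boundaries, merging is just element-wise addition:
merged = [a + b for a, b in zip(counts_a, counts_b)]
print(merged)  # [320, 75, 17]

# A differently configured service (say, boundaries 0.25, 0.75, 1.5) cannot
# be merged losslessly into `bounds`: a request counted in its (0, 0.25]
# bucket could fall on either side of the 0.1 boundary above. Fixing the
# bucket layout once, for everyone, removes this failure mode entirely.
```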

[08:05] Here are the contenders we looked at in the evaluation: HDR Histogram, DDSketch, t-digest, Circllhist, and an exact histogram, which is just a NumPy array holding the raw data, to see how that performs. We also included the Prometheus histogram, a simplistic data structure with a low number of latency thresholds that need to be configured. Since Prometheus is popular, we thought to include it here as well.

[08:45] Since I have very little time, I’ll jump right to the conclusion of the evaluation. The array method is out because it’s way too large, so that doesn’t work. The Prometheus Histogram has drawbacks with accuracy due to the low number of configured buckets and the need for a lot of configuration for thresholds. The t-digest is very capable with good accuracy, but we put a question mark for mergeability. With the t-digest, you don’t get guarantees for the accuracy of percentiles after merging multiple histograms. Usually, it’s okay, but we found certain configurations where the percentile accuracy becomes quite poor.

[09:48] The rest, HDR Histogram, DDSketch, and Circllhist, use similar methods to lay out buckets and store data, so they perform quite similarly. We argue that zero configuration gives a slight advantage to the Circllhist, since it’s a one-size-fits-all solution. Now for the final little section of this talk: [10:11] adoption in open-source technology, which is a very recent development. First, Björn Rabenstein, who is also speaking at this conference, has a proposal for the inclusion of sparse high-resolution histograms in Prometheus. We’re quite excited about this, as it closely follows the observations we’ve made. He also suggested a logarithmic histogram data structure, and it’s nice to see this gaining traction.

[10:35] Similarly, the OpenTelemetry project, which involves various companies and stakeholders, is adding a logarithmic histogram data structure to their protocol specification and libraries. We are very happy to see this development, and I personally try to participate in the discussions there.

[10:58] So that is it. Thank you very much for your attention. This is my Twitter, and if you want to work with me or other folks at Zalando, have a look at our job openings. We currently have two open positions in SRE. Well, that’s it. Thank you very much and have a good day. Bye-bye.

