Explanation

We model the error introduce by Bernuoulli sampling to a number of statistics that are relevant for IT operations. This sampling model used by OpenTelemetry to sample on trace-level (although the spec is not clear on this.)

For each point in the population we make independent sampling decisions with probability given by the sampling rate.

With sampling rate 100% all the elements in the population are retained.
With sampling rate 50%, there is a 50% chance that any given point will be retained.
With sampling rate 0% the sample is empty.

In the following we denote by $N$ the number of requests in the population, by $K$ the number of errors, and by $n$ the size of the sample.

Estimation

For some statistics, we have theoretical models that allow us to estimate the expected values and variances.

In the Bernoulli model, the sample size $n$ follows a Binomial distribution, $n \sim B(N,p)$.s where with $p$ is the sampling rate $p$, and $N = \#X$ is population size. Hence the expected sample size is $E[n] = N \cdot p$, and $Var[n] = N \cdot p \cdot (1-p)$. These formulas are used to calculate the Request Count and Request Rate statistics.

For the error rate calculation, we denote by $k \leq n$ the number of errors in the sample set. The error count also follows a binomial distribution $k \sim B(K,p)$.

The rate of errors in the sample is denoted by $r = k/n$ (for $n>0$). Using the observation $k$ conditioned on $n=const$ follows a Hypergeometric distribution $Hyp(N,K,n)$, and filling in $r = K/N$ for $n=0$, one derives the formula $E[r] = K/N$, and

$$ Var[r] = \frac{K (N-K)}{N^2} \frac{1}{N-1} \sum_{n=1}^N { N \choose n } p^n (1-p)^{N-n} \frac{N-n}{n} $$

We use these formula for estimating the standard-deviation of the Error Rate above.

This can be seen as follows:

$$ \begin{aligned} E[r] &= \sum_{S \subset [N]} P\{S\} \cdot r(S) \\ &= \sum_{S \subset [N]} p^{\# S}(1-p)^{N - \#S} \cdot r(S) \\ &= \frac{K}{N} q^N + \sum_{n \geq 1,k \leq n} p^n q^m \cdot \frac{k}{n} \cdot \# \{ S \subset [N] | \#S=n, \#S\cap[K] = k \} \\ &= \frac{K}{N} q^N + \sum_{n \geq 1,k \leq n} { N \choose n } p^n q^m \frac{ { K \choose k } { N-K \choose n-k } }{ { N \choose n } } \cdot \frac{k}{n} \\ &= \frac{K}{N} q^N + \sum_{n \geq 1} B(N,p)[n] \cdot \frac{1}{n} \cdot \sum_{k \leq n} Hyp(N,K,n)[k] \cdot k \\ &= \frac{K}{N} q^N + \sum_{n \geq 1} B(N,p)[n] \cdot \frac{1}{n} \cdot \frac{n K}{N} \\ &= \frac{K}{N} q^N + \frac{K}{N} (1 - q^N) \\ &= \frac{K}{N} \\ E[r^2] &= \frac{K^2}{N^2} q^N + \sum_{n \geq 1} B(N,p)[n] \cdot \sum_{k \leq n} Hyp(N,K,n)[k] \cdot \frac{k^2}{n^2} \\ &= \frac{K^2}{N^2} q^N + \sum_{n \geq 1} B(N,p)[n] \cdot \frac{1}{n^2} \cdot [ n \frac{K (N-K) (N-n)}{N^2(N-1)} + n^2 \frac{K^2}{N^2} ] \\ &= \frac{K^2}{N^2} q^N + \sum_{n \geq 1} B(N,p)[n] \cdot \frac{K^2}{N^2} [ \frac{1}{n} \frac{L m}{K(N-1)} + 1] \\ &= \frac{K^2}{N^2} q^N + \frac{K^2}{N^2} \cdot (1-q^m) + \frac{L}{K(N-1)} \cdot [ \sum_{n \geq 1} B(N,p)[n] \cdot \frac{m}{n} ] \\ &= \frac{K^2}{N^2} + \frac{K L}{N^2(N-1)} \cdot [ \sum_{n \geq 1} B(N,p)[n] \cdot \frac{m}{n} ] \\ Var[r] &= E[r^2] - E[r]^2 \\ &= \frac{K L}{N^2(N-1)} \cdot \sum_{n \geq 1} B(N,p)[n] \cdot \frac{m}{n} \end{aligned} $$

Where $q = 1-p, m = N-n,L=N-K$, and we used the known formulas for moments of the Hypergeometric Distribution.

Simulation

We simulate multiple iterations of sampling decisions is straight forward. We repeatedly select a subset $S \subset X$ by running Bernoulli experiments for each sample $x \in X$. Then we compute the estimator on the sample $S$, and record the results. In the above tables we report mean, and standard-deviation of the computed estimations.

Note that mean and standard-deviation are sensible measure here, since the distribution of the estimators across different samples is well behaved (very close to normal), in all cases.

Explanation

Estimation

Simulation

Comments