Theory meets Practice at the Median: a worst case comparison of relative error quantile algorithms
Graham Cormode∗, Abhinav Mishra, Joseph Ross, and Pavel Veselý∗

∗University of Warwick, {G.Cormode,Pavel.Vesely}@warwick.ac.uk
Splunk, {amishra,josephr}@splunk.com

Abstract
Estimating the distribution and quantiles of data is a foundational task in data mining and data science. We study algorithms which provide accurate results for extreme quantile queries using a small amount of space, thus helping to understand the tails of the input distribution. Namely, we focus on two recent state-of-the-art solutions: t-digest and ReqSketch. While t-digest is a popular compact summary which works well in a variety of settings, ReqSketch comes with formal accuracy guarantees at the cost of its size growing as new observations are inserted. In this work, we provide insight into which conditions make one preferable to the other. Namely, we show how to construct inputs for t-digest that induce an almost arbitrarily large error and demonstrate that it fails to provide accurate results even on i.i.d. samples from a highly non-uniform distribution. We propose practical improvements to ReqSketch, making it faster than t-digest, while its error stays bounded on any instance. Still, our results confirm that t-digest remains more accurate on the “non-adversarial” data encountered in practice.

1 Introduction

Studying the distribution of data is a foundational task in data mining and data science. Given observations from a large domain, we will often want to track the cumulative frequency distribution, to understand the behavior, or to identify anomalies. This cumulative distribution function (CDF) is also known variously as the order statistics, generalizing the median, and the quantiles. When we have a very large number of input observations, an exact characterization is excessively large, and we can be satisfied with an approximate representation, i.e., a compact function whose distance from the true CDF is bounded.
Recent work has argued that, rather than a uniform error bound, it is more important to capture the detail of the tail of the input distribution.

Faced with the problem of processing large volumes of distribution data, there have been many proposals of approximate quantile algorithms to extract the desired compact summary. These are designed to handle the input when seen as a stream of updates, or as distributed observations. Even though these various algorithms all draw on the same set of motivations, the emphasis can vary widely. Some works view the question primarily as one of computational complexity, and seek optimal bounds on the space usage, even if this entails very intricate algorithmic designs and lengthy technical proofs. Other works aspire to highly practical algorithms that can be implemented and run efficiently on real workloads. Although their authors might object, we can crudely characterize these two perspectives as “theoretically-driven” and “pragmatic”.

∗Supported by European Research Council grant ERC-2014-CoG 647557.

In this paper, we study the behavior of two recent algorithms for the quantiles problem, which we take to embody these two mindsets. The pragmatic approach is represented by the t-digest approach, which is a flexible framework that has been reported as being adopted in practice by various tech-focused companies (e.g., Microsoft, Facebook, Google [7]). The theoretical approach is represented by the ReqSketch [3], a work building on a line of prior theoretical papers, each making incremental improvements to the asymptotic bounds.

On first glance, the conclusion seems obvious.
ReqSketch is suited for algorithmic study, and contributes to our understanding of the fundamental computational complexity of the problem. Coding it up is not too hard, but the constants hidden in the “big-O” analysis mean that it requires a fair amount of space to store and so is unlikely to be competitive with the pragmatic approach. Meanwhile, the t-digest is very compact, and gives accurate answers to realistic workloads, especially those uniformly distributed over the domain. Its widespread adoption should give confidence that this is a sensible choice to implement.

Our contribution in this paper is to tell a more nuanced story, with a less clear-cut ending. We dive into the inner workings of the t-digest, and show how to construct inputs that lead to almost arbitrarily bad accuracy levels. While it may seem that such inputs are highly unlikely to be encountered in practice, t-digest may fail to provide accurate estimates even if input items are repeatedly drawn from a non-uniform distribution, and we demonstrate such distributions. Meanwhile, we engineer an implementation of ReqSketch that improves its time and space efficiency, making it faster than t-digest. The outcome is a collection of empirical results showing ReqSketch can be vastly preferable to t-digest, even on i.i.d. samples, flipping the conclusion for uniformly distributed inputs.

Still, the conclusion is therefore less straightforward than one might wish for. Both the input distributions and the careful construction which lead to high error for the t-digest rely on a highly non-uniform data distribution with numbers ranging from infinitesimal to astronomically large. For most realistic data patterns encountered in practice, the t-digest will remain a compelling choice, due to its simplicity and ease of use.
But, particularly since quantile monitoring is often needed to track deviations from expected behavior, there is now a case to adopt ReqSketch for scenarios where a strong guarantee is needed across all eventualities, no matter how unlikely they might appear. Although ReqSketch has higher overheads on average, and relies on internal randomness, its worst case is much more tightly bounded. So, in summary, the practical approach nevertheless has flaws – at least in theory – while the theoretical approach is not so impractical as it may first appear.
Paper outline.
We start with necessary definitions and a brief review of related work in the next section. Then we describe both t-digest and ReqSketch in Section 3, where we also provide a short summary of practical improvements of the latter sketch. In Section 4, we outline a worst-case input construction, showing that t-digest may suffer an almost arbitrarily large error. Empirical results appear in Section 5, where we demonstrate the error behavior of both algorithms on the aforementioned worst-case inputs and also on inputs that consist of i.i.d. samples from a distribution. Finally, we provide a comparison of average update times in Section 5.3.

2 Background

2.1 Definitions

We consider algorithms that operate on a stream of items drawn from some large domain. This could be any domain U equipped with a total order (e.g., strings with lexicographic comparison), or a more restricted setting, such as the reals, where we additionally have arithmetic operations. The core notion needed is that of the rank of an element from the domain, which is the number of items from the input that are (strictly) smaller than the given element. Formally, for an input stream of n items σ = {x_1, . . . , x_n}, the rank of element y is given by R_σ(y) = |{i : x_i < y}|. There is some nuance in how to handle streams with duplicated elements, but we will gloss over it in this presentation.

The quantiles are those elements which achieve specific ranks. For example, the median of σ is y such that R_σ(y) = n/2, the p-th percentile is y such that R_σ(y) = pn/100, and more generally the q-th quantile is y with R_σ(y) = qn for 0 ≤ q ≤ 1. Again, ambiguity can arise since there can be a range of elements satisfying this definition, but this need not concern us here.

The algorithms we consider aim to find approximate quantiles via approximate ranks. That is, they seek to find elements whose rank is sufficiently close to the requested quantile. Specifically, the quantile error of reporting y as the q-th quantile is given by |q − R_σ(y)/n|. A standard observation is that if we have an algorithm to find the approximate rank of an item, as R̂_σ(y), this suffices to answer quantile queries, after accounting for the probability of making errors. In what follows, we focus on the accuracy of rank estimation.

Typically, we would like to have some guarantee on rank estimation accuracy. A uniform (additive) rank estimation guarantee asks that the error on all queries be bounded by the same fraction of the input size, i.e., |R̂_σ(y) − R_σ(y)| ≤ εn, for ε < 1. This will ensure that all quantile queries have the same accuracy. However, it is noted that in practice we often want greater accuracy on the tails of the distribution, where we can see more variation, compared to the centre which is usually more densely packed and unvarying. This leads to notions such as the relative error guarantee, |R̂_σ(y) − R_σ(y)| ≤ εR_σ(y), or more generally |R̂_σ(y) − R_σ(y)| ≤ εnf(R_σ(y)/n), where f is a scale function which captures the desired error curve as a function of the location in quantile space. In this work, we focus on the relative error guarantee and related scale functions (based on logarithmic functions). As defined, relative error focuses on the low end of the distribution (i.e., the elements with low rank), but it is straightforward to flip this to the high end by using the scale function 1 − R_σ(y)/n, or to make it symmetric with the scale function min(R_σ(y)/n, 1 − R_σ(y)/n).

2.2 Related Work

Most related work has focused on providing uniform error guarantees. It is folklore that a random sample of size O(ε⁻²) items from the input is sufficient to provide an ε additive error estimate for any quantile query, with constant probability. Much subsequent work has aimed to improve these space bounds. Munro and Paterson [21] gave initial results, but it was not until two decades later that Manku et al. reinterpreted this (multipass) algorithm as a quantile summary taking one pass over a stream, and showed improved bounds of O(ε⁻¹ log²(εn)) [18]. For many years, the state of the art was the Greenwald-Khanna (GK) algorithm, which comes in two flavours: a somewhat involved algorithm with a O(ε⁻¹ log(εn)) space cost, and a simplified version without a formal guarantee, but good behavior in practice [11]. Subsequent improvements came from Felber and Ostrovsky [9], who proposed combining sampling with a constant number of GK instances; and Agarwal et al.
[2], who adapted the Manku et al. approach with randomness. Both these approaches removed the (logarithmic) dependence on n from the space cost. Most recently, Karnin et al. [16] (KLL) further refined the randomized approach to show an O(ε⁻¹) space bound. The tightest bound is achieved by a more complicated variant of the approach; a slightly simplified approach with weaker guarantees is implemented in the Apache DataSketches library [22].

The study of other scale functions such as relative error can be traced back to the work of Gupta and Zane [12], who gave a simple multi-scale sampling-based approach, with a space bound of O(ε⁻³ polylog n). Subsequent heuristic and theoretical work aimed to reduce this cost, leading to a bound of O(ε⁻¹ log³(εn)) using a deterministic merge & prune strategy due to Zhang and Wang [24]. The cubic dependence on log n can be offputting, and recent work in the form of the ReqSketch [3] has reduced this bound by adopting a randomized algorithm inspired by the KLL method.

The theoretical study of the quantiles problem has also led to lower bounds on the amount of space needed by any algorithm for the problem, based on information theoretic arguments about how many items from the input need to be stored. A simple argument shows that a uniform error guarantee requires space to store Ω(1/ε) items from the input, and a relative error requires Ω(ε⁻¹ log(εn)) (see, e.g., [3]). Some more involved arguments show stronger lower bounds for uniform error of Ω(ε⁻¹ log(1/ε)) [13] and Ω(ε⁻¹ log(εn)) [4], but only for deterministic and comparison-based algorithms. The restriction to comparison-based means that the method can only apply comparisons to items and is not permitted to manipulate them (e.g., by computing the average of a set of items). Nevertheless, these lower bounds are sufficient to show that the analysis of certain approaches described above, such as the GK algorithm, is tight, and cannot be improved further. The deterministic bounds can also be extended to apply to randomized algorithms and non-uniform guarantees, becoming weaker as a result.

There are other significant approaches to the quantiles problem to study. The moment-based sketch takes a statistical approach, by maintaining the moments of the input stream (empirically), and using these to fit the maximum entropy distribution which agrees with these moments [10]. This requires the assumption that the model fitting procedure will yield a distribution that closely agrees with the true distribution. The DDSketch achieves “relative error” guarantees in value space via a dynamic histogram with geometrically growing bucket ranges [20]. Finally, the t-digest [8, 7] has been widely used in practice, and is described in more detail in the subsequent section.

3 The Algorithms

3.1 t-digest

The t-digest consists essentially of a set of weighted centroids {C_1, C_2, . . .}, with a weighted centroid C_i = (c_i, w_i) representing w_i ∈ Z points near c_i ∈ R. Centroids are maintained in the sorted order, that is, c_i < c_j for i < j. Rank queries are answered approximately by accumulating the weights of centroids with means smaller than a query point, and performing linear interpolation between the straddling centroids. The permissible weight of a centroid is governed by a non-decreasing scale function k : [0, 1] → R ∪ {±∞}, which describes the maximal centroid weight as a function on quantile space: faster growth of k enforces smaller centroids and hence higher accuracy. In particular, scale functions which grow rapidly near the tails q = 0, 1 provide higher accuracy near q = 0, 1, but trade accuracy for space near q = 0.5. t-digest is controlled by a compression parameter δ, which (roughly) bounds from above the number of centroids used. (For the scale functions below, Dunning [6] shows that this rough bound does hold for all possible inputs.) Given δ and scale function k, the weight w_i of centroid C_i must satisfy

    k(q_i) − k(q_{i−1}) ≤ 1/δ, where q_i = (1/N) · Σ_{j≤i} w_j    (1)

and N is the current input size. Here, k_0(q) = q/2 provides a uniform weight bound for any q ∈ [0, 1], while k_1 (based on the arcsine function) and the logarithmic scale functions k_2 and k_3, which involve a normalization factor Z(N) that depends on N, get steeper towards the tails q = 0 and
1, which leads to smaller centroids and higher expected accuracy near q = 0, 1. Dunning [5] proves that adding more data to a t-digest or merging two instances of t-digest preserves the constraint (1) if any of these four scale functions is used. Ross [23] describes asymmetric variants of k_1, k_2, and k_3, using the given function k on [α, 1] and the linearization of k at α on [0, α), and shows that the t-digest associated with any of these modified scale functions accepts insertions and is mergeable as well.

There are two main implementations of t-digest that differ in how they incorporate an incoming item into the data structure. The merging variant maintains a buffer for new updates and once the buffer gets full, it performs a merging pass, in which it treats all items in the buffer as (trivial) centroids, sorts all centroids, and merges iteratively any two consecutive centroids whose combined size does not violate the constraint (1). The clustering variant finds the closest centroids to each incoming item x and adds x to a randomly chosen one of the closest centroids to x that still has room for x, i.e., satisfies (1) after accepting x. If there is no such centroid, the incoming item forms a new centroid, which may however lead to exceeding the limit of δ on the number of centroids — in such a case, we perform a merging pass over the centroids that is guaranteed to output at most δ centroids.

In an ideal scenario, the instance of the t-digest would be strongly-ordered, that is, for each x_i represented by centroid C_i and each x_j represented by C_j with i < j, it holds that x_i < x_j. This means that the (open) intervals spanned by data points summarized by each centroid are disjoint, which is the case when data are presented in the sorted order. Together with assuming a uniform distribution of items across the domain, a strongly-ordered t-digest provides highly accurate rank estimates even for, say, δ = 100. However, strong ordering of centroids is impossible to maintain in a limited memory when items arrive in an arbitrary order and in general, centroids are just weakly ordered, i.e., only the means c_i and c_j of centroids C_i and C_j satisfy c_i < c_j if i < j. This weak ordering of centroids, together with non-uniform distribution of items, is the major cause of the error in rank estimates.
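To make the rank-query mechanics concrete, the following is a minimal sketch of answering a rank query from sorted weighted centroids. The midpoint convention (half of each centroid's weight assumed to lie on either side of its mean) and the linear interpolation rule are simplifying assumptions for illustration, not the exact logic of Dunning's implementation:

```python
def estimate_rank(centroids, y):
    """Estimate the rank of y from centroids sorted by mean.

    Illustrative midpoint convention: half of each centroid's weight is
    assumed to lie on either side of its mean, and the estimate is
    interpolated linearly between the two straddling means.
    """
    total = 0.0
    for i, (c, w) in enumerate(centroids):
        if c >= y:
            if i == 0:
                return 0.0  # y is below every centroid mean
            c_prev, w_prev = centroids[i - 1]
            frac = (y - c_prev) / (c - c_prev)  # position within the gap
            return total - w_prev / 2 + frac * (w_prev + w) / 2
        total += w
    return total  # y lies above every centroid mean

# Strongly-ordered centroids summarizing 0..99 in ten groups of ten,
# with (illustrative) means at 5, 15, ..., 95:
cents = [(10 * k + 5, 10) for k in range(10)]
print(estimate_rank(cents, 50))  # → 50.0
```

On a strongly-ordered digest over near-uniform data this estimate is very accurate, as above; on weakly ordered centroids the same routine inherits the bias discussed in the text, since values summarized by a centroid need not lie between the neighboring means.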
The hard instances and distributions presented in this paper are constructed so that they induce a “highly weak” ordering of centroids, meaning that many values summarized by centroid C_i will not lie between the means of neighboring centroids. As we show below, this leads to a highly biased error in rank estimates for certain inputs.

3.2 ReqSketch
The basic building block of ReqSketch [3] is a compactor, which is essentially a buffer of a certain capacity for storing items, and the sketch consists of several such compactors, arranged in levels numbered from 0. At the beginning, we start with one buffer at level 0, which accepts incoming items. Once the buffer at any level h gets full, we discard some items from the sketch in a way that does not affect rank estimates too much. This is done by sorting the buffer, choosing an appropriate prefix of an even size, and removing all items in the chosen prefix from the buffer. Of these removed items, a randomly chosen half is inserted into the compactor at level h + 1 (which possibly needs to be initialized first), namely, items on odd or even indices with equal probability, while removed items in the other half are discarded from the sketch. This procedure is called the compaction operation. Similar compactors were already used to design the KLL sketch [15] and appear also in earlier works, e.g., in [18, 19, 2, 17].

Since the sketch consists of items stored in the compactors, one can view the set of stored items as a weighted coreset, that is, a sample of the input where each item stored at level h is assigned a weight of 2^h (akin to the weight of a centroid in the t-digest setting). Observe that the total weight of items remains equal to the input size: when a compaction operation discards r items, it also promotes r items one level higher, and as the weight of promoted items doubles, the total weight of stored items remains unchanged after performing a compaction. To estimate the rank of some query item y, we simply calculate the total weight of stored items x with x ≤ y. Overall, this approach gives a comparison-based algorithm and thus, its behavior only depends on the ordering of the input and is oblivious to applying an arbitrary order-preserving transformation of the input, even non-linear (which does not hold for t-digest).

The error in the rank estimate for any item is unbiased, that is, 0 in expectation, and since it is a weighted sum of a bounded number of independent uniform ±1 random variables, it is well concentrated. The size of the compacted prefix is governed by an exponential distribution (of the form 2^{−s/k}, where s is the prefix size and k is a parameter of the sketch controlling the accuracy). The prefix is actually chosen according to a derandomization of this distribution, which leads to a cleaner analysis and smaller constant factors. Note that the choice of prefix is qualitatively similar to the choice of scale function for a t-digest.

The number of compactors is bounded by at most O(log(N/B)), where B is the buffer size, since the weight of items is exponentially increasing with each level, and thus, the size of the sketch is O(B · log(N/B)). The analysis in [3] implies that if we take B = O(ε⁻¹ · √(log εN)), then the sketch provides rank estimates with relative error ε with constant probability, while its size is O(ε⁻¹ · log^{1.5}(εN)) (the parameter k mentioned above should be set to O(B/log(εN)) = O(ε⁻¹/√(log εN))). Finally, ReqSketch is fully mergeable, meaning that after an arbitrary sequence of merge operations, the aforementioned accuracy-space trade-off still holds.
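The compaction operation can be sketched as follows. This is a simplified toy (the prefix length is passed in directly, whereas the actual ReqSketch draws it from its derandomized exponential schedule), but the weight bookkeeping is as described above:

```python
import random

def compact(buffer, prefix_len, rng=random):
    """Compact one level's buffer: sort it, remove an even-sized prefix,
    and return (survivors, promoted), where `promoted` keeps every other
    item of the removed prefix (odd or even indices, equally likely).

    Promoted items go one level up, where each carries twice the weight,
    so total weight is preserved: prefix_len/2 items are discarded here
    while the other prefix_len/2 items double their weight.
    """
    assert prefix_len % 2 == 0
    buffer.sort()
    prefix, rest = buffer[:prefix_len], buffer[prefix_len:]
    offset = rng.randrange(2)  # choose odd or even indices at random
    promoted = prefix[offset::2]
    return rest, promoted

rng = random.Random(0)
level0 = [9, 3, 7, 1, 5, 2, 8, 4]          # eight items of weight 1
level0, up = compact(level0, 4, rng)
# weight check: 4 survivors * 1 + 2 promoted * 2 = 8 = input size
print(len(level0), len(up))
```

Repeating this at every level that fills up yields the multi-level structure described above, with items at level h carrying weight 2^h.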
3.3 Practical Improvements to ReqSketch
The brief description of ReqSketch above follows the outline in [3] and is suitable for a mathematical analysis. In this section, we describe practical adjustments to ReqSketch that improve constant factors as well as the running time. These were used in our proof-of-concept Python implementation and have been incorporated in the implementation in the DataSketches library. First, we apply practical improvements for the KLL sketch proposed by Ivkin et al. [14]. These include laziness in compaction operations, i.e., allowing the buffer to exceed its capacity provided that the overall capacity of the sketch is satisfied, and flipping a random coin for choosing even- or odd-indexed items only when the formal guarantees of ReqSketch are desired.

The Python implementation of ReqSketch by the fourth author is available at https://github.com/edoliberty/streaming-quantiles/blob/master/relativeErrorSketch.py. The DataSketches library is available at https://datasketches.apache.org/; ReqSketch is implemented in this library according to the aforementioned Python code.

A specific feature of ReqSketch, compared to the KLL sketch, is that the buffer size B depends on the input size N, and this is needed for the prefix choice by the (derandomized) exponential distribution. For the theoretical result, it is possible to maintain an upper bound N̂ on N and, once it is violated, to increase N̂ and recompute the buffer size at all levels. As it turns out, it suffices for the exponential distribution to count compaction operations performed at each level h and set the level-h buffer size based on this count C_h, i.e., to O(ε⁻¹ · √(log C_h)). This results in levels having different capacities, with lower levels being larger as they process more items. Since the level-0 compactor has the largest size, compaction operations at level 0 are most costly as they take time proportional to the size. To improve the amortized update time, when we perform a compaction operation at level 0, we also compact the buffer at any other level that exceeds its capacity. In other words, we restrict the aforementioned laziness in compaction operations to level 0 only and this postpones the next compaction operation at level 0 for as long as possible. Experiments reveal that such a “partial laziness” improves the amortized update time significantly (depending on the number of streaming updates and parameters); see Section 5.3.

Overall, the empirical results in this paper and in the DataSketches library suggest that on randomly shuffled inputs ReqSketch with the improvements outlined above has over 10 times smaller standard deviation of the error than predicted by the (already quite tight) mathematical worst-case analysis in [3]. Furthermore, results on various particular input orderings (performed in the DataSketches library and with the proof-of-concept Python implementation) reveal that the error on random permutations is representative, that is, we did not encounter a data ordering on which ReqSketch has higher standard deviation.
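The per-level adaptive sizing can be sketched as follows; the constant factor, the floor of 2k, and the base-2 logarithm are illustrative assumptions, not the exact values used in the DataSketches implementation:

```python
import math

def level_capacity(k, num_compactions):
    """Buffer capacity for a level that has already performed
    `num_compactions` compactions: proportional to k * sqrt(log C_h),
    so lower levels, which compact far more often, get larger buffers.
    The floor of 2*k and the base-2 log are illustrative choices."""
    c = max(1, num_compactions)
    return max(2 * k, int(k * math.sqrt(math.log2(c + 1))))

# capacities grow slowly with the compaction count
print([level_capacity(4, c) for c in (0, 10, 10**3, 10**6)])
```

Because the capacity grows only with the square root of the logarithm of the compaction count, even a level that has compacted a million times needs only a modestly larger buffer than a fresh one.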
4 Adversarial Inputs for t-digest

The t-digest and space bounds. The application of merging on the centroids ensures that the number of centroids maintained by the t-digest cannot be too large. In particular, the space parameter δ is used to enforce that there are at most δ centroids maintained, no matter what order the items in the input stream arrive in, for any of the scale functions considered above [6, 23]. While this bound on the size of the summary is reassuring, it appears to stand in opposition to the space lower bounds referenced in Section 2.2 above [13, 4]. This conflict is resolved by observing that the lower bounds hold against algorithms which only apply comparisons to item identifiers, whereas the t-digest combines centroids by taking weighted averages, and uses linear interpolation to answer queries. Still, we should not be entirely reassured. We can consider a “sampling” (instead of averaging) variant of the t-digest approach which stays within the comparison-based model, by keeping some (randomly-chosen) item in each centroid as its representative, and using this to answer queries. For sure, this sampling variant makes less use of the information available, and so we should not expect it to be as accurate as the averaging version. But now this sampling variant is certainly subject to the space lower bounds, and so cannot guarantee accuracy while using only O(δ) space. Since it is not so different to the full averaging t-digest, we would expect the accuracy offered to be comparable, suggesting that this too may be vulnerable. This is the essence of our subsequent study, and we make use of the construction of a “hard” instance from [4] to help form the adversarial inputs to t-digest.

Overview of the attack. The idea of the attack is to produce a (very) weakly ordered collection of centroids by wielding inspiration from [4] against the inner workings of t-digest. For motivation, if centroid (c_1, w_1) summarizes S := {x_1, . . .
, x_{w_1}} and x_i < c_1 < x_{i+1} = next(σ, c_1) (where next(σ, c_1) is the smallest stream element larger than c_1; see [4]), then for a query point in the interval (c_1, x_{i+1}), the rank will be overestimated by (at least) w_1 − i. When c_1 is very close to the median of S, this produces an overestimate of approximately w_1/2. In the worst case, the rank error is closer to w_1: if we operate with integer weights, in the “lopsided” scenario in which x_2 = x_3 = · · · = x_{w_1}, the rank is overestimated by at least w_1 − 1 for a query point in the interval (c_1, x_2). Using real-valued weights, the rank error can be made as close to w_1 as desired, using a weighted set of the form {(x_1, ε), (x_2, w_1 − ε)}.

If a set S_2 is then inserted within the interval (c_1, x_{i+1}) and forms a new centroid (c_2, w_2), then for a query point in the interval (c_2, next(σ, c_2)), the rank will be overestimated by w_1/2 + w_2/2 in the typical case, or w_1 + w_2 − 2 if S_2 is similarly lopsided. (The orientation of the attack may be flipped in the evident way, producing underestimates of the rank.)

As this nested construction proceeds, some of the inserted points will be merged with a previously created centroid, and hence that portion of the inserted weight will not contribute to the rank error. Assuming the merging pass always proceeds from left to right, the sequence of centroid weights progresses as:

    (−l−), w_1, (−r−)
    (−l−), w_1 + v_1, w_2, (−r−)
    (−l−), w_1 + v_1, w_2 + v_2, w_3, (−r−)    (2)

and so on, where (−l−) denotes the ordered collection of centroids, having total weight l and smaller than the centroid with weight w_1, and similarly (−r−) stands for centroids larger than the attacked centroid. The idea is to add items of weight v_i to the “attacked” centroid so that this centroid will be full and none of the next w_{i+1} items will get merged into it, and similarly in the next iteration. Thus, the first insertion has size v_1 + w_2, the second has size v_2 + w_3, etc.

To see how this affects the rank error as the number of nested insertions of lopsided centroids increases, observe that if

    (Σ_{i=1}^{N} (w_i + v_i)) / (l + r) → ∞    (3)

as N → ∞ (i.e., if the weight not covered by these lopsided centroids is negligible), then the asymptotic error can be made arbitrarily close to

    Σ_i (w_i − 1) / Σ_i (w_i + v_i)

if integer weights are required, or

    Σ_i w_i / Σ_i (w_i + v_i)    (4)

if this restriction is dropped. If in addition w_i/(w_i + v_i) ≥ γ > 0 for every i, then Σ_i w_i / Σ_i (w_i + v_i) ≥ γ as well and hence γ serves as a lower bound on the asymptotic error.

We will see that the parameter δ influences the rate of convergence to the asymptotic error (i.e., the growth of the quantity in (3)), but the asymptotic error itself (i.e., γ in (4)) cannot be reduced simply by taking δ large enough; in fact the asymptotic error is increasing in δ for some important scale functions. In the next sections we sketch how to achieve the inequalities (3) and (4) above for several scale functions of interest.

4.1 Scale Functions with Bounded Derivative

Proposition 1.
Let k be a scale function such that 0 < b ≤ k′(q) ≤ B for q ∈ [0, 1]. Then there exists a number γ > 0 and a δ_0 > 0 such that for all δ ≥ δ_0, the t-digest associated with (k, δ) has asymptotic error at least γ on the nested sequence of lopsided insertions described above.

Sketch of proof. Since by (1), the function k increases by 1/δ on the interval I_w in quantile space occupied by the centroid (c, w), the Mean Value Theorem guarantees a point q_w ∈ I_w such that k′(q_w)|I_w| = 1/δ, where |I_w| denotes the length of the interval. Note that |I_w| = w/(l + w + r), where l and r are the weights to the left and right of centroid (c, w), respectively. Since w/(l + r) ≥ |I_w|, we obtain w/(l + r) ≥ 1/(δB). Taking N large enough (depending on δ), the desired limiting behavior (3) is shown.

For the second inequality, we apply similar arguments to the intervals of weights w_i, w_i + v_i, w_{i+1} appearing in consecutive iterations of the attack. Clearing denominators, we obtain equations:

    δ k′(q_{w_i}) w_i = l + w_i + r
    δ k′(q_{w_i+v_i}) (w_i + v_i) = l + w_i + v_i + w_{i+1} + r
    δ k′(q_{w_{i+1}}) w_{i+1} = l + w_i + v_i + w_{i+1} + r    (5)

From these it follows that

    w_i/(w_i + v_i) = (δ k′(q_{w_i+v_i}) k′(q_{w_{i+1}}) − k′(q_{w_i+v_i}) − k′(q_{w_{i+1}})) / (k′(q_{w_{i+1}}) (δ k′(q_{w_i}) − 1)).    (6)

The denominator is bounded above by δB². For any ε >
0, we can find δ_0 such that for all δ ≥ δ_0, the numerator is bounded below by (1 − ε)δb². Hence w_i/(w_i + v_i) is bounded below by (1 − ε)b²/B² and (4) is shown as well.

A consequence of the proof is that for k_0(q) = q/2 the asymptotic error is at least (δ − 4)/(δ − 2); i.e., for sufficiently large δ, the approximations can be arbitrarily poor.

4.2 The Scale Functions k_2 and k_3

Without loss of generality, we assume the attack occurs where q > 0.5 according to the t-digest, i.e., the attacked centroid has more weight to its left than to its right, hence the growth conditions for k_2 under (2) give rise to the following system of equations (derived similarly as in the proof of Proposition 1, using the definition of k_2):

    (w_i + r)/r = exp(Z(N)/δ)
    (w_i + v_i + w_{i+1} + r)/(w_{i+1} + r) = exp(Z(N)/δ)
    (w_{i+1} + r)/r = exp(Z(N)/δ)    (7)

Solving yields w_i/(w_i + v_i) = exp(−Z(N)/δ) and also w_i/r = exp(Z(N)/δ) −
1. We may assume l ≤ C(δ)·r and hence w_i/(l + r) > w_i/((C(δ) + 1)r) = (exp(Z(N)/δ) − 1)/(C(δ) + 1). From this the limiting behavior (3) follows; as δ → ∞, exp(Z(N)/δ) → 1 + Z(N)/δ, and so w_i/(w_i + v_i) = exp(−Z(N)/δ) tends to 1. Hence, in the worst case, the quantile error of t-digest with scale function k_2 can be arbitrarily close to 1.

While k_3 does not seem as amenable to direct calculation, we observe that for all q ∈ (0.5, 1), k_2′(q) < k_3′(q) < 2k_2′(q). Therefore the growth of k_3 on an interval can be bounded on both sides in terms of the growth of k_2, and the system of equations (7) has a corresponding system of inequalities. Eventually we find w_i/r > exp(Z(N)/(2δ)) −
1, giving (3), and also get

    w_i/(w_i + v_i) > 1/(exp(Z(N)/δ)(exp(Z(N)/δ) + 1)).

This lower bound on the asymptotic error approaches 0.5 as δ increases, which implies that the quantile error of t-digest with k_3 can be arbitrarily close to 0.5.

5 Empirical Evaluation

In this section, we study the error behavior of t-digest and of ReqSketch on inputs constructed according to the ideas described in Section 4 and also on inputs consisting of i.i.d. items generated by certain non-uniform distributions. Furthermore, we compare the merging and clustering implementations of t-digest.

The experiments are performed using the Java implementation of t-digest by Dunning and the Java implementation of ReqSketch by the Apache DataSketches library. By default, we run t-digest with compression factor δ = 500, but we can obtain similar results for other values of δ. We then choose the parameters of ReqSketch so that its size is similar to that of t-digest with δ = 500 for the particular input size (recall that the size of ReqSketch depends on the logarithm of the stream length). The measure for the size is chosen to be the number of bytes of the serialized data structure. For instance, for the input sizes used below, using k = 4 as the accuracy parameter of ReqSketch leads to essentially the same size of the two sketches.

In implementing the ideas of Section 4, we note that the size of the interval between the attacked centroid and the next stream value shrinks exponentially as the attack proceeds. Hence the attack as described may run out of precision (at least for float or double variable types) after only a few iterations. To circumvent this difficulty, we target the attack in the neighborhood of zero, where more precision is available, so the attack as implemented chooses the largest centroid less than zero, and uses the smallest positive stream value as its “next” element.
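The lopsided-centroid overestimate that powers the attack can be replayed in miniature. The toy estimator below credits the full weight of any centroid whose mean lies below the query and phases in the next centroid's weight linearly across the gap; this convention is an illustrative simplification, not the interpolation rule of Dunning's code:

```python
def estimate_rank_full_weight(centroids, y):
    """Toy rank estimate: every centroid with mean below y contributes
    its full weight; the next centroid's weight is phased in linearly
    across the gap between the straddling means. Illustrative only."""
    total, prev_c = 0.0, None
    for c, w in centroids:
        if c >= y:
            if prev_c is None:
                return 0.0
            return total + (y - prev_c) / (c - prev_c) * w
        total, prev_c = total + w, c
    return total

# One lopsided centroid summarizing the weighted set {(0, eps), (1, w - eps)}:
w, eps = 1000.0, 1.0
mean = (0.0 * eps + 1.0 * (w - eps)) / w   # 0.999, just below the mass at 1
cents = [(mean, w), (10.0, 50.0)]          # plus a centroid further right
y = 0.9995                                 # query point between mean and 1
true_rank = eps                            # only the eps weight at 0 is below y
print(estimate_rank_full_weight(cents, y) - true_rank)  # overestimate near w
```

Nesting such lopsided centroids inside each other, as in the construction (2), is what drives the accumulated error toward the asymptotic bounds derived above.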
Additionally, the effectiveness of the attack can be sensitive to the exact compression schedule used by an implementation (particularly the clustering variant). Hence the results of the attack are somewhat dependent on the particular manner in which values are chosen from the interval for the ensuing iteration. Nevertheless, equipped with knowledge of the parameters of the t-digest, the ability to query for centroids near zero and centroid weights, and memory of the actual stream presented to the t-digest, an adversary may generate a stream on which the t-digest performs rather poorly. Figure 1 shows the (additive) quantile error of both the merging and clustering implementations of t-digest, all using scale function k and δ = 500 (the error of ReqSketch is not shown as it is very close to 0%, similarly as in the plots below). This shows that the vulnerability of t-digest is not due to the specifics of implementation choices, but persists across a range of approaches.

(All code used is open source, and all scripts and code can be downloaded from our repository at https://github.com/PavelVesely/t-digest/ ; for more details about reproducing our empirical results, see Section A.)

[Figure 1: quantile error of t-digest on a carefully constructed input.]

Here, we provide results for inputs consisting of i.i.d. items generated by some distributions. Our purpose is to study the behavior of the algorithms when the order of item arrivals is not adversarially chosen, demonstrating that the class of "difficult" inputs is larger than just the carefully targeted attack stream. Items drawn i.i.d. form a well-understood scenario: if we knew the description of the distribution, its quantiles would serve as accurate estimates of the quantiles of a sufficiently large input sampled from that distribution. However, we only study algorithms not designed for any particular distribution, and so inputs consisting of i.i.d.
items can still present a challenge.

It is already known that on (i.i.d.) samples from some distributions, such as the uniform or the normal distribution, t-digest performs very well [8, 1, 23], and so we do not replicate these scenarios. Instead, we study a class of distributions inspired by the attack of Section 4 and show under which conditions t-digest fails to provide any accurate rank estimates.

In the attack described in Section 4, we carefully construct a sequence of insertions into t-digest so that the error gets arbitrarily large. Recall that with each iteration of the attack, the interval where future items are generated shrinks exponentially, while the number of items increases linearly. Surprisingly, the large error is not only provoked by the order in which items arrive but also by the large range of scales of the numbers on input, as we demonstrate below.

Taking this property of the attack as an inspiration, a natural idea for a hard distribution is to generate items uniformly on a logarithmic scale, i.e., applying an exponential function to an input drawn uniformly at random; this is called the log-uniform distribution. To capture the increasing number of items in the iterations of the attack, we square the outcome of the uniform distribution used in the exponent and, finally, we let the item have a negative or positive value with equal probability (the variant without the squaring is the signed log-uniform distribution).

(A similar construction may be applied to the logarithmic scale functions, but as data accumulates on both sides of zero (for precision reasons), the error is not pushed into the tails. Higher-precision computation (using, e.g., BigDecimal) would seem necessary for a practical implementation of the attack exhibiting poor performance in the tails of the distribution for the logarithmic scale functions.)
Thus, each item is distributed according to

    D_hard ∼ (−1)^b · 10^(±(2·R−1)^2 · E_max),    (8)

where b ∈ {0, 1} is a uniformly random bit (the sign of the exponent is likewise chosen uniformly at random), R is a uniformly random number between 0 and 1, and E_max is a maximum permissible exponent (for base 10) of the double data type, which is bounded by 308 in the IEEE 754-1985 standard. The input is then constructed by taking N samples from D_hard.

Figure 2 shows the quantile error for t-digest with δ = 500 and the asymmetric k scale function, together with the error of ReqSketch, on this input with N = 2. We show the error for both the merging and clustering variants of t-digest and take the median error at each rank, based on 2 trials, while for ReqSketch, we plot the 95% confidence interval of the error (recall that the error of ReqSketch for any rank is a sub-Gaussian zero-mean random variable). The absolute error is plotted on the y-axis. The relative-error requirement in this experiment is that the error should be close to zero for the high quantiles (close to 1.0), with the requirement relaxing as the quantile value decreases towards 0.0. This is observed for ReqSketch, whose error gradually increases over the range from 1.0 down to 0.5, before approximately plateauing below the median. However, the two t-digest variants show larger errors on high quantiles, approaching −30% absolute error at q = 0.8 for the merging variant. Note that if we simply report the maximum input value for q = 0.8, this would achieve +20% absolute error. The resulting size of the merging t-digest is 2,752 bytes, while the clustering variant needs just 2,048 bytes, and the size of ReqSketch is 2,624 bytes.

For comparison, Figure 3 shows similar results on the signed log-uniform distribution, i.e., items generated according to (−1)^b · 10^((2·R−1)·E_max), with b, R, and E_max as above. The only notable difference is that t-digest achieves slightly better accuracy. Our further experiments suggest that the error of t-digest with other scale functions capturing the relative error is even worse than with the asymmetric k used here. We obtain analogous results even for larger values of δ (with an appropriately increased parameter k for ReqSketch), although this requires larger stream lengths N.

A specific feature of both D_hard and the log-uniform distribution is that there are numbers ranging (in absolute value) from 10^(−308) to 10^(308). Such numbers, however, rarely appear in real-world datasets, and one may naturally wonder what happens if we limit the range, for example, to 10^(−10) to 10^(10), that is, use E_max = 10. As it turns out, the error of t-digest then drops below 2%, as depicted in Figure 4, implying that a very large range of numbers is needed to enforce a large error. This shows a case where t-digest is clearly preferable to ReqSketch.

We also note that the clustering variant of t-digest appears to have better (though admittedly still inadequate) accuracy on the hard inputs than does the merging variant. The merging variant is generally preferred due to its faster updates (see Section 5.3) and avoidance of dynamic allocations. Thus, those efficiencies may come at the price of higher error.

Explanation of the large error for t-digest. The particularly striking feature of the merging t-digest error in Figure 2 is the straight line which approximately goes from rank 0.
48 upward through the high ranks. This is because the last centroid with mean below 0 and the next centroid in the order both have means of infinitesimally small magnitude (for one particular trial). Hence, there is no centroid to represent the values between them, while according to the definition of the hard distribution in (8), approximately 40% of items fall within that range. Furthermore, most of these 40% of items lie in a much smaller interval still, meaning that linear interpolation does not help to make the estimated ranks more accurate. A similar observation can be made about the error of the clustering variant, as well as in other scenarios. While the infinitesimal values are not well represented by centroids, they distort the centroid means. In the clustering variant, for example, all the centroids are pulled towards zero by being averaged with infinitesimal items, leading to overestimates of quantiles for q < 0.5 and underestimates for q > 0.5.

(In our implementation, we use E_max = log_10(M/N), where M is the maximum value of double, so that t-digest does not exceed M when computing the average of any centroid.)

[Figure 2: error of t-digest and ReqSketch on i.i.d. samples from D_hard. Figure 3: error on i.i.d. samples from the signed log-uniform distribution.]

As outlined in Section 3.1, centroids of t-digest are only weakly ordered, which in particular means that the numbers covered by one centroid may be larger than the mean of a subsequent centroid of the t-digest. The mixed scales, when presented in random order, lead to centroids with a high degree of overlap, at least measured locally (on consecutive centroids): there are substantial regions in centroid weight space in which the two-sample Kolmogorov-Smirnov statistic computed on neighboring centroids is close to zero. The centroids produced by the careful attack, by contrast, are pairwise somewhat distinguishable, but have a global nested structure which causes the t-digest to have large error. (A similar phenomenon occurs for the merging variant, but the error shifts, since merging passes always proceed from left to right; this is due to alternating merging passes not being properly supported for asymmetric scale functions.) Note that the same data presented in sorted order does not pose nearly the same difficulty for t-digest, as the infinitesimal items eventually form their own centroids and give the t-digest sufficient detail on that scale.

[Figure 4: error of t-digest on i.i.d. samples from D_hard with E_max = 10.]

Finally, we provide empirical results that compare the running time of the open-source Java implementations of t-digest and ReqSketch. We evaluate both the merging and clustering variants of t-digest with δ = 500 and the asymmetric k scale function, and choose the accuracy parameter of ReqSketch as k = 4. Additionally, we include the KLL sketch as well, with accuracy parameter k = 100. Table 1 shows amortized update times in nanoseconds (rounded to integers) on an input consisting of N i.i.d. samples from a uniform distribution, for N = 2. The results are obtained on a 3 GHz AMD EPYC 7302 processor. We remark that the update times remain constant for varying N, unless N is too small (on the order of thousands). In summary, the results show ReqSketch to be more than two times faster than merging t-digest and over four times faster than clustering t-digest.

Furthermore, we compare ReqSketch with the partial laziness technique from Section 3.3 (which is the default option) and with the "full laziness" that was proposed for the KLL sketch in [14]. Figure 5 shows the average update times of both variants for varying accuracy parameter k on N i.i.d. samples from a uniform distribution, for N = 2. The results show that the partial laziness idea provides a significant speedup, especially for larger values of k.
See https://github.com/PavelVesely/t-digest/blob/master/docs/python/adversarial_plots/notebooks/overlap_computation.ipynb for the supporting computations.
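For reference, the two-sample Kolmogorov-Smirnov statistic used as this overlap measure can be computed by a single merge over the two sorted samples; the following is a minimal Python sketch of the idea (not the notebook's code):

```python
def ks_statistic(xs, ys):
    """Two-sample KS statistic: the largest vertical gap between the
    empirical CDFs of xs and ys (0 = indistinguishable, 1 = disjoint)."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    gap = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        # advance past all copies of v in both samples before comparing CDFs
        while i < len(xs) and xs[i] == v:
            i += 1
        while j < len(ys) and ys[j] == v:
            j += 1
        gap = max(gap, abs(i / len(xs) - j / len(ys)))
    return gap
```

Applied to the items assigned to two neighboring centroids, a value near zero indicates heavily overlapping centroids, as observed on the i.i.d. hard inputs.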
Table 1: Average update time in nanoseconds.

    Merging t-digest:    129
    Clustering t-digest: 251
    ReqSketch:            55
    KLL:                  41

[Figure 5: Average update time of ReqSketch in nanoseconds.]
Our standpoint as authors of this work cannot be viewed as neutral: two of us (from Warwick) are co-authors of the ReqSketch paper [3], and two of us (at Splunk) have deployed t-digest and analyzed its behavior. Our collaboration in this work was driven by a desire to better understand these algorithms, their strengths and weaknesses, and to provide advice to other data scientists on how to make best use of them. As foreshadowed in the introduction, our view at this conclusion is perhaps more complicated than when we started, when we hoped for a simple answer. From our studies, the main takeaway is that t-digest can fail to give the desired levels of accuracy on inputs with a highly non-uniform distribution over the domain. However, these inputs are far from appearing natural, and should not significantly trouble any teams who have deployed this algorithm. Our second observation is that, as implemented, ReqSketch is pretty fast in practice, and quite reliable in accuracy, despite its somewhat off-puttingly technical description. We did not observe any examples where its error shoots up, but on "expected" distributions (like the "easier" input in Figure 4), it is appreciably less accurate than t-digest. There is no clear win for either the pragmatic or the theoretically minded solution, at least in this case. In the final analysis, our advice to practitioners is to consider both styles of algorithm for their applications, and to weigh carefully the tradeoff between performance in the worst case and performance on average.

Acknowledgements.
We wish to thank Lee Rhodes and the other people working on the DataSketches project for many useful discussions about implementing ReqSketch. P. Veselý is grateful to the Computer Science Institute of Charles University for access to the computational servers. The work of G. Cormode and P. Veselý is supported by European Research Council grant ERC-2014-CoG 647557.
References

[1] Apache DataSketches: KLL sketch vs t-digest. https://datasketches.apache.org/docs/Quantiles/KllSketchVsTDigest.html . Accessed: 2021-01-27.

[2] Pankaj K. Agarwal, Graham Cormode, Zengfeng Huang, Jeff M. Phillips, Zhewei Wei, and Ke Yi. Mergeable summaries. ACM Transactions on Database Systems (TODS), 38(4):26, 2013.

[3] Graham Cormode, Zohar Karnin, Edo Liberty, Justin Thaler, and Pavel Veselý. Relative error streaming quantiles. arXiv preprint arXiv:2004.01668, 2020.

[4] Graham Cormode and Pavel Veselý. A tight lower bound for comparison-based quantile summaries. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '20, pages 81–93, New York, NY, USA, 2020. ACM.

[5] Ted Dunning. Conservation of the t-digest scale invariant. arXiv preprint arXiv:1903.09919, 2019.

[6] Ted Dunning. The size of a t-digest. arXiv preprint arXiv:1903.09921, 2019.

[7] Ted Dunning. The t-digest: Efficient estimates of distributions. Software Impacts, 7:100049, 2021.

[8] Ted Dunning and Otmar Ertl. Computing extremely accurate quantiles using t-digests. arXiv preprint arXiv:1902.04023, 2019.

[9] David Felber and Rafail Ostrovsky. A randomized online quantile summary in O((1/ε) log(1/ε)) words. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2015), volume 40 of Leibniz International Proceedings in Informatics (LIPIcs), pages 775–785, Dagstuhl, Germany, 2015. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.

[10] Edward Gan, Jialin Ding, Kai Sheng Tai, Vatsal Sharan, and Peter Bailis. Moment-based quantile sketches for efficient high cardinality aggregation queries. Proceedings of the VLDB Endowment, 11(11):1647–1660, 2018.

[11] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. In ACM SIGMOD Record, volume 30, pages 58–66. ACM, 2001.

[12] Anupam Gupta and Francis X. Zane. Counting inversions in lists. In Proceedings of the 14th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '03, pages 253–254, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics.

[13] Regant Y. S. Hung and Hing-Fung Ting. An Ω((1/ε) log(1/ε)) space lower bound for finding ε-approximate quantiles in a data stream. In Frontiers in Algorithmics, 4th International Workshop, FAW 2010, Wuhan, China, August 11-13, 2010. Proceedings, volume 6213 of Lecture Notes in Computer Science, pages 89–100. Springer, 2010.

[14] Nikita Ivkin, Edo Liberty, Kevin Lang, Zohar Karnin, and Vladimir Braverman. Streaming quantiles algorithms with small space and update time. arXiv preprint arXiv:1907.00236, 2019.

[15] Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 41–52. ACM, 2010.

[16] Zohar Karnin, Kevin Lang, and Edo Liberty. Optimal quantile approximation in streams. In Proceedings of the 57th Annual Symposium on Foundations of Computer Science (FOCS '16), pages 71–78. IEEE, 2016.

[17] Ge Luo, Lu Wang, Ke Yi, and Graham Cormode. Quantiles over data streams: Experimental comparisons, new analyses, and further improvements. The VLDB Journal, 25(4):449–472, August 2016.

[18] Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In ACM SIGMOD Record, volume 27, pages 426–435. ACM, 1998.

[19] Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In ACM SIGMOD Record, volume 28, pages 251–262. ACM, 1999.

[20] Charles Masson, Jee E. Rim, and Homin K. Lee. DDSketch: A fast and fully-mergeable quantile sketch with relative-error guarantees. Proc. VLDB Endow., 12(12):2195–2205, August 2019.

[21] J. Ian Munro and Michael S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.

[22] Lee Rhodes, Kevin Lang, Alexander Saidakov, Edo Liberty, and Justin Thaler. DataSketches: A library of stochastic streaming algorithms. Open source software: https://datasketches.apache.org/ , 2013.

[23] Joseph Ross. Asymmetric scale functions for t-digests. arXiv preprint arXiv:2005.09599, 2020.

[24] Ying Zhang, Xuemin Lin, Jian Xu, Flip Korn, and Wei Wang. Space-efficient relative error order sketch over data streams. In Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), pages 51–51. IEEE, 2006.
A Reproducibility
All code used in obtaining the experimental results is open source and can be downloaded from https://github.com/PavelVesely/t-digest/ , where we also provide documentation and resources needed to reproduce our experiments. Our repository is a clone of the original t-digest repository available at https://github.com/tdunning/t-digest (the original repository was last merged into ours on 2021-01-28), and it additionally incorporates asymmetric scale functions from https://github.com/signalfx/t-digest/tree/asymmetric . The asymmetric scale functions provide a natural t-digest analogue of a ReqSketch with guarantees on one end of the distribution.

The DataSketches library is available at https://datasketches.apache.org/ , and we took the particular Java implementation of ReqSketch from the GitHub repository at https://github.com/apache/datasketches-java . For technical reasons, the code we use in our experiments requires the ReqSketch algorithm to work with the double data type; however, the DataSketches implementation works with float numbers only. We therefore provide an adjusted implementation using double inside our above-mentioned repository for reproducing the experiments. We also incorporate the KLL sketch from the DataSketches library into our repository, with a similar adjustment to the double type.
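The switch to double is not cosmetic: the hard inputs of Section 5.2 contain magnitudes far below the smallest positive 32-bit float (about 1.4·10^(−45)), which a float-based sketch would flush to zero. A quick Python check illustrates the round-trip loss (an illustration only, not part of the experiment code):

```python
import struct

def to_float32(x):
    """Round-trip an IEEE 754 double through 32-bit float storage."""
    return struct.unpack('f', struct.pack('f', x))[0]

# 1e-308 is a valid double but underflows to 0.0 as a 32-bit float,
# so a float-based sketch would conflate such items with exact zeros.
```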
A.1 Main Experimental Setups
We implemented three experimental setups:

• A careful construction of a hard input for t-digest, according to Sections 4 and 5.1.
• A generator of i.i.d. samples from a specified distribution, for reproducing the results in Section 5.2.
• A comparison of the average update times of t-digest (both the merging and clustering variants), ReqSketch, and the KLL sketch, for reproducing the results in Section 5.3.

The parameters of these experiments are adjustable via a configuration file, which allows one, for example, to set the compression parameter δ and the scale function for t-digest and the accuracy parameter k for ReqSketch. Each of the experiments outputs a CSV file with results into a specified directory. See the README file in the repository for more details on how to run the experiments and how to produce the plots.

The first two experiments output statistics on the absolute error of t-digest and of ReqSketch for each of 200 evenly spaced normalized ranks (the number of these ranks can be adjusted). Furthermore, we perform T trials, where T is adjustable and set to 2 by default, and output the median error for each variant of t-digest and the 95% confidence interval for ReqSketch (recall that the error of ReqSketch is unbiased; see Section 3.2). More precisely, the errors for each rank are accumulated using the KLL sketch with accuracy parameter k = 200, and then from this sketch we recover an approximate median or the appropriate quantiles for two standard deviations of the normal distribution. The error introduced by using the KLL sketch instead of exact quantiles is negligible, as we do not need to estimate extreme quantiles of the distribution.

A.2 Auxiliary Experiments

The ability to adjust the parameters allows for verification of other claims in this paper. For instance, one can obtain plots similar to those in Figures 2-4 for other (asymmetric) scale functions or for other values of δ. We remark that, in general, larger values of δ require larger values of the input size N to invoke bad accuracy levels for t-digest, compared to a similarly-sized ReqSketch. Additionally, the configuration files may be altered to produce more verbose output, namely to also write the datapoints underlying the centroids in the resulting t-digest. Some plots describing the local overlap of centroids are available in the repository. These help to illuminate the nature of the weak ordering of centroids discussed in Section 5.2.

Finally, further experiments with ReqSketch can be performed with our proof-of-concept Python implementation and in the DataSketches library. The Python implementation of
ReqSketch by the fourth author is available at https://github.com/edoliberty/streaming-quantiles/blob/master/relativeErrorSketch.py and the generator of some particular data orderings is at https://github.com/edoliberty/streaming-quantiles/blob/master/streamMaker.py . Moreover, the DataSketches library provides a repository for doing extensive accuracy and speed experiments with ReqSketch (as well as other sketches in this library), which is available at https://github.com/apache/datasketches-characterization/ .