[PDF] Sampling Sketches for Concave Sublinear Functions of Frequencies

Abstract

We consider massive distributed datasets that consist of elements modeled as key-value pairs and the task of computing statistics or aggregates where the contribution of each key is weighted by a function of its frequency (sum of values of its elements). This fundamental problem has a wealth of applications in data analytics and machine learning, in particular, with concave sublinear functions of the frequencies that mitigate the disproportionate effect of keys with high frequency. The family of concave sublinear functions includes low frequency moments ( p≤1 ), capping, logarithms, and their compositions. A common approach is to sample keys, ideally, proportionally to their contributions and estimate statistics from the sample. A simple but costly way to do this is by aggregating the data to produce a table of keys and their frequencies, apply our function to the frequency values, and then apply a weighted sampling scheme. Our main contribution is the design of composable sampling sketches that can be tailored to any concave sublinear function of the frequencies. Our sketch structure size is very close to the desired sample size and our samples provide statistical guarantees on the estimation quality that are very close to that of an ideal sample of the same size computed over aggregated data. Finally, we demonstrate experimentally the simplicity and effectiveness of our methods.

Full PDF

aa r X i v : . [ c s . D S ] D ec Sampling Sketches forConcave Sublinear Functions of Frequencies

Edith Cohen

Google Research, CATel Aviv University, Israel [email protected]

Oﬁr Geri

Stanford University, CA [email protected]

Abstract

We consider massive distributed datasets that consist of elements modeled as key-value pairs and the task of computing statistics or aggregates where the contribu-tion of each key is weighted by a function of its frequency (sum of values of itselements). This fundamental problem has a wealth of applications in data analyticsand machine learning, in particular, with concave sublinear functions of the fre-quencies that mitigate the disproportionate effect of keys with high frequency. Thefamily of concave sublinear functions includes low frequency moments ( p ≤ ),capping, logarithms, and their compositions. A common approach is to samplekeys, ideally, proportionally to their contributions and estimate statistics from thesample. A simple but costly way to do this is by aggregating the data to producea table of keys and their frequencies, apply our function to the frequency values,and then apply a weighted sampling scheme. Our main contribution is the designof composable sampling sketches that can be tailored to any concave sublinear function of the frequencies. Our sketch structure size is very close to the desiredsample size and our samples provide statistical guarantees on the estimation qual-ity that are very close to that of an ideal sample of the same size computed overaggregated data. Finally, we demonstrate experimentally the simplicity and effec-tiveness of our methods. We consider massive distributed datasets that consist of elements that are key-value pairs e =( e. key , e. val ) with e. val > . The elements are generated or stored on a large number of serversor devices. A key x may repeat in multiple elements, and we deﬁne its frequency ν x to be thesum of values of the elements with that key, i.e., ν x := P e | e. key = x e. val . For example, the keyscan be search queries, videos, terms, users, or tuples of entities (such as video co-watches or termco-occurrences) and each data element can correspond to an occurrence or an interaction involvingthis key: the search query was issued, the video was watched, or two terms co-occurred in a typedsentence. An instructive common special case is when all elements have the same value and thefrequency ν x of each key x in the dataset is simply the number of elements with key x .A common task is to compute statistics or aggregates, which are sums over key contributions. Thecontribution of each key x is weighted by a function of its frequency ν x . One example of suchsum aggregates are queries of domain statistics P x ∈ H ν x for some domain (subset of keys) H .The domains of interest are often overlapping and speciﬁed at query time. Sum aggregates alsoarise as components of a larger pipeline, such as the training of a machine learning model withparameters θ , labeled examples x ∈ X with frequencies ν x , and a loss objective of the form ℓ ( X ; θ ) = P x f ( ν x ) L ( x ; θ ) . The function f that is applied to the frequencies can be any con-cave sublinear function. Concave sublinear functions, which we discuss further below, are used in pplications to mitigate the disproportionate effect of keys with very high frequencies. The trainingof the model typically involves repeated evaluation of the loss function (or of its gradient that alsohas a sum form) for different values of θ . We would like to compute these aggregates on demand,without needing to go over the data many times.When the number of keys is very large it is often helpful to compute a smaller random sample S ⊆ X of the keys from which aggregates can be efﬁciently estimated. In some applications,obtaining a sample can be the end goal. For example, when the aggregate is a gradient, we can usethe sample itself as a stochastic gradient. To provide statistical guarantees on our estimate quality,the sampling needs to be weighted ( importance sampling), with heavier keys sampled with higherprobability, ideally, proportional to their contribution ( f ( ν x ) ). When the weights of the keys areknown, there are classic sampling schemes that provide estimators with tight worst-case variancebounds [33, 17, 8, 13, 37, 38].The datasets we consider here are presented in an unaggregated form: each key can appear multi-ple times in different locations. The focus of this work is designing composable sketch structures(formally deﬁned below) that allow to compute a sample over unaggregated data with respect tothe weights f ( ν x ) . One approach to compute a sample from unaggregated data is to ﬁrst aggregatethe data to produce a table of key-frequency pairs ( x, ν x ) , compute the weights f ( ν x ) , and applya weighted sampling scheme. This aggregation can be performed using composable structures thatare essentially a table with an entry for each distinct key that occurred in the data. The number ofdistinct keys, however, and hence the size of that sketch, can be huge. For our sampling application,we would hope to use sketches of size that is proportional to the desired sample size, which is gen-erally much smaller than the number of unique keys, and still provide statistical guarantees on theestimate quality that are close to that of a weighted sample computed according to f ( ν x ) . Concave Sublinear Functions.

Typical datasets have a skewed frequency distribution, where asmall fraction of the keys have very large frequencies and we can get better results or learn a bettermodel of the data by suppressing their effect. The practice is to apply a concave sublinear func-tion f to the frequency, so that the importance weight of the key is f ( ν x ) instead of simply itsfrequency ν x . This family of functions includes the frequency moments ν px for p ≤ , ln(1 + ν x ) , cap T ( ν x ) = min { T, ν x } for a ﬁxed T ≥ , their compositions, and more. A formal deﬁnitionappears in Section 2.4.Two hugely popular methods for producing word embeddings from word co-occurrences use thisform of mitigation: word2vec [30] uses f ( ν ) = ν . and f ( ν ) = ν . for positive and negativeexamples, respectively, and GloVe [35] uses f ( ν ) = min { T, ν . } to mitigate co-occurrence fre-quencies. When the data is highly distributed, for example, when it originates or resides at millionsof mobile devices (as in federated learning [29]), it is useful to estimate the loss or compute astochastic gradient update efﬁciently via a weighted sample.The suppression of higher frequencies may also directly arise in applications. One example is cam-paign planning for online advertising, where the value of showing an ad to a user diminishes withthe number of views. Platforms allow an advertiser to specify a cap value T on the number of timesthe same ad can be presented to a user [23, 34]. In this case, the number of opportunities to displayan ad to a user x is a cap function of the frequency of the user f ( ν x ) = min { T, ν x } , and the numberfor a segment of users H is the statistics P x ∈ H f ( ν x ) . When planning a campaign, we need toquickly estimate the statistics for different segments, and this can be done from a sample that ideallyis weighted by f ( ν x ) . Our Contribution.

In this work, we design composable sketches that can be tailored to any con-cave sublinear function f , and allow us to compute a weighted sample over unaggregated data withrespect to the weights f ( ν x ) . Using the sample, we will be able to compute unbiased estimatorsfor the aggregates mentioned above. In order to compute the estimators, we need to make a secondpass over the data: In the ﬁrst pass, we compute the set of sampled keys, and in the second pass wecompute their frequencies. Both passes can be done in a distributed manner.A sketch S ( D ) is a data structure that summarizes a set D of data elements, so that the output ofinterest for D (in our case, a sample of keys) can be recovered from the sketch S ( D ) . A sketchstructure is composable if we can obtain a sketch S ( D ∪ D ) of two sets of elements D and D from the sketches S ( D ) and S ( D ) of the sets. This property alone gives us full ﬂexibility to2arallelize or distribute the computation. The size of the sketch determines the communication andstorage needs of the computation.We provide theoretical guarantees on the quality (variance) of the estimators. The baseline forour analysis is the bounds on the variance that are guaranteed by PPSWOR on aggregated data.PPSWOR [37, 38] is a sampling scheme with tight worst-case variance bounds. The estimatorsprovided by our sketch have variance at most / ((1 − ε ) ) times the variance bound for PPSWOR.The parameter ε ≤ / mostly affects the run time of processing a data element, which grows near-linearly in /ε . Thus, our sketch allows us to get approximately optimal guarantees on the variancewhile avoiding the costly aggregation of the data.We remark that these guarantees are for soft concave sublinear functions. This family approximatesany concave sublinear function up to a multiplicative factor of − /e . As a result, our sketch can beused with any (non-soft) concave sublinear function while incurring another factor of (cid:16) e − (cid:17) in the variance.The space required by our sketch signiﬁcantly improves upon the previous methods (which all re-quire aggregating the data). In particular, if the desired sample size is k , we show that the spacerequired by the sketch at any given time is O ( k ) in expectation. We additionally show that, with prob-ability at least − δ , the space will not exceed O (cid:16) k + min { log m, log log (cid:16) Sum D Min ( D ) (cid:17) } + log (cid:0) δ (cid:1)(cid:17) at any time while processing the dataset D , where m is the number of elements, Sum D the sumof weights of all elements, and Min ( D ) is the minimum value of an element in D . In the com-mon case where all elements have weight , this means that for any δ , the needed space is at most O (cid:0) k + log log m + log (cid:0) δ (cid:1)(cid:1) with probability at least − δ . We complement our work with a small-scale experimental study. We use a simple implementationof our sampling sketch to study the actual performance in terms of estimate quality and sketch size.In particular, we show that the estimate quality is even better than the (already adequate) guaranteesprovided by our worst-case bounds. We additionally compare the estimate quality to that of twopopular sampling schemes for aggregated data, PPSWOR [37, 38] and priority (sequential Poisson)sampling [33, 17]. In the experiments, we see that the estimate quality of our sketch is close to whatachieved by PPSWOR and priority sampling, while our sketch uses much less space by eliminatingthe need for aggregation.The paper is organized as follows. The preliminaries are presented in Section 2. We provide anoverview of PPSWOR and the statistical guarantees it provides for estimation. Then, we formallydeﬁne the family of concave sublinear functions. Our sketch uses two building blocks. The ﬁrstbuilding block, which can be of independent interest, is the analysis of a stochastic PPSWOR sample .Typically, when computing a sample, the data from which we sample is a deterministic part of theinput. In our construction, we needed to analyze the variance bounds for PPSWOR sampling thatis computed over data elements with randomized weights (under certain assumptions). We providethis analysis in Section 3. The second building block is the

SumMax sampling sketch , which isdiscussed in Section 4. This is an auxiliary sketch structure that supports datasets with a certain typeof structured keys. We put it all together and describe our main result in Section 5. The experimentsare discussed in Section 6.

Related Work.

There are multiple classic composable weighted sampling schemes for aggregated datasets (where keys are unique to elements). Schemes that provide estimators with tight worst-casevariance bounds include priority (sequential Poisson) sampling [33, 17] and VarOpt sampling [8, 13].We focus here on PPSWOR [37, 38] as our base scheme because it extends to unaggregated datasets,where multiple elements can additively contribute to the frequency/weight of each key.There is a highly proliﬁc line of research on developing sketch structures for different task overstreamed or distributed unaggregated data with applications in multiple domains. Some early exam-ples are frequent elements [31] and distinct counting [20], and the seminal work of [1] providing atheoretical model for frequency moments. Composable sampling sketches for unaggregated datasetswere also studied for decades. The goal is to meet the quality of samples computed on aggregated For streaming algorithms we are typically interested in deterministic worst-case bounds on the space, butstreaming algorithms with randomized space have also been considered in some cases, in particular whenstudying the sliding window model [3, 7]. f ( ν ) = 1 when ν > ) [27, 39] andsketch structures for sum sampling ( f ( ν ) = ν ) [12]. The latter generalizes the discrete sample andhold scheme [22, 18, 14] and PPSWOR. Sampling sketches for cap functions ( f ( ν ) = min { T, ν } )were provided in [11] and have a slight overhead over the aggregated baseline. The latter work alsoprovided multi-objective/universal samples that with a logarithmic overhead simultaneously providestatistical guarantees for all concave sublinear f . In the current work we propose sampling sketchesthat can be tailored to any concave sublinear function and only have a small constant overhead.An important line of work uses sketches based on random linear projections to estimate frequencystatistics and to sample. In particular, ℓ p sampling sketches [21, 32, 2, 26, 25] sample (roughly)according to f ( ν ) = ν p for p ∈ [0 , . These sketches have a higher logarithmic overhead on thespace compared to sample-based sketches, and do not support all concave sublinear functions ofthe frequencies (for example, f ( ν ) = ln (1 + ν ) ). In some respects they are more limited in theirapplication – for example, they are not designed to produce a sample that includes raw keys. Theiradvantage is that they can be used with super-linear ( p ∈ (1 , ) functions of frequencies and canalso support signed element values (the turnstile model). For the more basic problem of sketchesthat estimate frequency statistics over the full data, a complete characterization of the frequencyfunctions for which the statistics can be estimated via polylogarithmic-size sketches is providedin [6, 4]. Universal sketches for estimating ℓ p norms of subsets were recently considered in [5].The seminal work of Alon et al. [1] established that for some functions of frequencies (momentswith p > ), statistics estimation requires polynomial-size sketches. A double logarithmic sizesketch, extending [19] for distinct counting, that computes statistics over the entire dataset for allsoft concave sublinear functions is provided in [10]. Our design builds on components of that sketch. Consider a set D of data elements of the form e = ( e. key , e. val ) where e. val > . We denotethe set of possible keys by X . For a key z ∈ X , we let Max D ( z ) := max e ∈ D | e. key = z e. val and Sum D ( z ) := P e ∈ D | e. key = z e. val denote the maximum value of a data element in D with key z and the sum of values of data elements in D with key z , respectively. Each key z ∈ X thatappears in D is called active . If there is no element e ∈ D with e. key = z , we say that z is inactive and deﬁne Max D ( z ) := 0 and Sum D ( z ) := 0 . When D is clear from context, it is omitted.For a key z , we use the shorthand ν z := Sum D ( z ) and refer to it as the frequency of z . Thesum and the max-distinct statistics of D are deﬁned, respectively, as Sum D := P e ∈ D e. val and MxDistinct D := P z ∈X Max D ( z ) . For a function f , f D := P z ∈X f ( Sum D ( z )) = P z ∈X f ( ν z ) is the f -frequency statistics of D .For a set A ⊆ R , we use A ( i ) to denote the i -th order statistic of A , that is, the i -th lowest elementin A . k Structure

In this work, we will use composable sketch structures in order to efﬁciently summarize streamedor distributed data elements. A composable sketch structure is speciﬁed by three operations: Theinitialization of an empty sketch structure s , the processing of a data element e into a structure s ,and the merging of two sketch structures s and s . To sketch a stream of elements, we start withan empty structure and sequentially process data elements while storing only the sketch structure.The merge operation is useful with distributed or parallel computation and allows us to compute thesketch of a large set D = S i D i of data elements by merging the sketches of the parts D i .In particular, one of the main building blocks that we use is the bottom- k structure [15], speciﬁed inAlgorithm 1. The structure maintains k data elements: For each key, consider only the element withthat key that has the minimum value. Of these elements, the structure keeps the k elements that havethe lowest values. In this subsection, we describe a scheme to produce a sample of k keys, where at each step theprobability that a key is selected is proportional to its weight. That is, the sample we produce will4 lgorithm 1: Bottom- k Sketch Structure // Initialize structureInput: the structure size ks.set ← ∅ // Set of ≤ k key-value pairs// Process elementInput: element e = ( e. key , e. val ) , a bottom- k structure s if e. key ∈ s.set then replace the current value v of e. key in s.set with min { v, e. val } else insert ( e. key , e. val ) to s.set if | s.set | = k + 1 then Remove the element e ′ with maximum value from s.set // Merge two bottom- k structuresInput: s , s // Bottom- k structures Output: s // Bottom- k structure P ← s .set ∪ s .sets.set ← the (at most) k elements of P with lowest values (at most one element per key) Algorithm 2:

PPSWOR Sampling Sketch // Initialize structureInput: the sample size k Initialize a bottom- k structure s. sample // Algorithm 1// Process elementInput: element e = ( e. key , e. val ) , PPSWOR sample structure sv ∼ Exp [ e. val ] Process the element ( e. key , v ) into the bottom- k structure s. sample // Merge two structures s , s to obtain ss. sample ← Merge the bottom- k structures s . sample and s . sample be equivalent to performing the following k steps. At each step we select one key and add it to thesample. At the ﬁrst step, each key x ∈ X (with weight w x ) is selected with probability w x / P y w y .At each subsequent step, we choose one of the remaining keys, again with probability proportionalto its weight. Since the total weight of the remaining keys is lower, the probability that a key isselected in a subsequent step (provided it was not selected earlier) is only higher. This process iscalled probability proportional to size and without replacement (PPSWOR) sampling.A classic method for PPSWOR sampling is the following scheme [37, 38]. For each key x withweight w x , we independently draw seed ( x ) ∼ Exp ( w x ) . Outputting the sample that includes the k keys with smallest seed ( x ) is equivalent to PPSWOR sampling as described above. This methodtogether with a bottom- k structure can be used to implement PPSWOR sampling over a set of dataelements D according to ν x = Sum D ( x ) . This sampling sketch is presented here as Algorithm 2.The sketch is due to [12] (based on [22, 18, 14]). Proposition 2.1.

Algorithm 2 maintains a composable bottom- k structure such that for each key x , the lowest value of an element with key x (denoted by seed ( x ) ) is drawn independently from Exp ( ν x ) . Hence, it is a PPSWOR sample according to the weights ν x . The proof is provided in Appendix A. k Samples

PPSWOR sampling (Algorithm 2) is a special case of bottom- k sampling [38, 15, 16]. Deﬁnition 2.2.

Let k ≥ . A bottom- k sample over keys X is obtained by drawing independentlyfor each active key x a random variable seed ( x ) ∼ SeedDist x . The k − keys with lowest seed ( x ) values are considered to be included in the sample S , and the k -th lowest value τ := { seed ( x ) | x ∈ X } ( k ) is the inclusion threshold . SeedDist x are such that for all t > , Pr[ seed ( x ) < t ] > , that is, there ispositive probability to be below any positive t . Typically the distributions SeedDist x come from afamily of distributions that is parameterized by the frequency ν x of keys. The frequency is positivefor active keys and otherwise. In the special case of PPSWOR sampling by frequency, SeedDist x is Exp ( ν x ) .We review here how a bottom- k sample is used to estimate domain statistics of the form P x ∈ H f ( ν x ) for H ⊆ X . More generally, we will show how to estimate aggregates of the form X x ∈X L x f ( ν x ) (1)for any set of ﬁxed values L x . Note that we can represent domain statistics in this form by setting L x = 1 for x ∈ H and L x = 0 for x / ∈ H . For the sake of this discussion, we treat f x := f ( ν x ) assimply a set of weights associated with keys, assuming we can have f x > only for active keys.In order to estimate statistics of the form (1), we will deﬁne an estimator b f x for each f x . Theestimator b f x will be non-negative ( b f x ≥ ), unbiased ( E h b f x i = f x ), and such that b f x = 0 when thekey x is not included in the bottom- k sample ( x S using the terms of Deﬁnition 2.2).As a general convention, we will use the notation b z to denote an estimator of any quantity z . Wedeﬁne the sum estimator of the statistics P x ∈X L x f x to be \ X x ∈X L x f x := X x ∈ S L x b f x We also note that since b f x = 0 for x / ∈ S , computing the sum only over x ∈ S is the same ascomputing the sum over all x ∈ X , that is, P x ∈ S L x b f x = P x ∈X L x b f x .Note that the sum estimator can be computed as long as the ﬁxed values L x and the per-key estimates b f x for x ∈ S are available. From linearity of expectation, we get that the sum estimate is unbiased: E " \ X x ∈X L x f x = E "X x ∈ S L x b f x = E " X x ∈X L x b f x = X x ∈X L x E h b f x i = X x ∈X L x f x . We now deﬁne the per-key estimators b f x . The following is a conditioned variant of the Horvitz-Thompson estimator [24]. Deﬁnition 2.3.

Let k ≥ and consider a bottom- k sample, where S is the set of k − keys in thesample and τ is the inclusion threshold. For any x ∈ X , the inverse-probability estimator of f x is b f x = ( f x Pr seed ( x ) ∼ SeedDist x [ seed ( x ) <τ ] x ∈ S x / ∈ S .

In order to compute these estimates, we need to know the weights f x and distributions SeedDist x for the sampled keys x ∈ S . In particular, in our applications when f x = f ( ν x ) is a functionof frequency and the seed distribution is parameterized by frequency, then it sufﬁces to know thefrequencies of sampled keys. Claim 2.4.

The inverse-probability estimator is unbiased, that is, E h b f x i = f x .Proof. We ﬁrst consider b f x when conditioned on the seed values of all other keys X \ { x } and inparticular on τ x := { seed ( z ) | z ∈ X \ { x }} ( k − , In the general case, we assume these functions of the frequency are computationally tractable or can beeasily approximated up to a small constant factor. For our application, the discussion will follow in Section 5.4. k − smallest seed on X \ { x } . Under this conditioning, a key x is included in S withprobability Pr seed ( x ) ∼ SeedDist x [ seed ( x ) < τ x ] . When x S , the estimate is . When x ∈ S , wehave that τ x = τ and the estimate is the ratio of f x and the inclusion probability. So our estimator isa plain inverse probability estimator and thus E h b f x | τ x i = f x .Finally, from the fact that the estimator is unbiased when conditioned on τ x , we also get that it isunconditionally unbiased: E h b f x i = E τ x h E h b f x | τ x ii = E τ x [ f x ] = f x .We now turn to analyze the variance of the estimators. The guarantees we can obtain on the qualityof the sum estimates depend on how well the distributions SeedDist x are tailored to the values f x , where ideally, keys should be sampled with probabilities proportional to f x . PPSWOR, where seed ( x ) ∼ Exp ( f x ) , is such a “gold standard” sampling scheme that provides us with strong guaran-tees: For domain statistics P x ∈ H f x , we get a tight worst-case bound on the coefﬁcient of variation of / p q ( k − , where q = P x ∈ H f x / P x ∈X f x is the fraction of the statistics that is due to thedomain H . Moreover, the estimates are concentrated in a Chernoff bounds sense. For objectivesof the form (1), we obtain additive Hoeffding-style bounds that depend only on sample size and therange of L x .When we cannot implement “gold-standard” sampling via small composable sampling structures, weseek guarantees that are close to that. Conveniently, in the analysis it sufﬁces to bound the variance ofthe per-key estimators [9, 11]: A key property of bottom- k estimators is that ∀ x, z, cov ( b f x , b f z ) ≤ (equality holds for k ≥ ) [11]. Therefore, the variance of the sum estimator can be bounded bythe sum of bounds on the per-key variance. This allows us to only analyze the per-key variance Var (cid:16) b f x (cid:17) . To achieve the guarantees of the “gold standard” sampling, the desired bound on theper-key variance for a sample of size k − (a bottom- k sample where the k -th lowest seed is theinclusion threshold) is Var (cid:16) b f x (cid:17) ≤ k − f x X z ∈X f z . (2)So our goal is to establish upper bounds on the per-key variance that are within a small constant of(2). We refer to this value as the overhead . The overhead factor in the per-key bounds carries overto the sum estimates.We next review the methodology for deriving per-key variance bounds. The starting point is to ﬁrstbound the per-key variance of b f x conditioned on τ x . Claim 2.5.

With the inverse probability estimator we have

Var (cid:16) b f x (cid:17) = E τ x h Var (cid:16) b f x | τ x (cid:17)i = E τ x (cid:20) f x (cid:18) seed ( x ) < τ x ] − (cid:19)(cid:21) . Proof.

Follows from the law of total variance and the unbiasedness of the conditional estimates forany ﬁxed value of τ x , E h b f x | τ x i = f x .For the “gold standard” PPSWOR sample, we have Pr[ seed ( x ) < t ] = 1 − exp( − f x t ) and using e − x − e − x ≤ x , we get Var (cid:16) b f x | τ x (cid:17) = f x (cid:18) seed ( x ) < τ x ] − (cid:19) ≤ f x τ x . (3)In order to bound the unconditional per-key variance, we use the following notion of stochasticdominance. Deﬁnition 2.6.

Consider two density functions a and b both with support on the nonnegative reals.We say that a is dominated by b ( a (cid:22) b ) if for all z ≥ , R z a ( y ) dy ≤ R z b ( y ) dy . Deﬁned as the ratio of the standard deviation to mean. For our unbiased estimators it is equal to the relativeroot mean squared error. a is pointwise at most theCDF of b . In particular, the probability of being below some value y under b is at least that of a . When bounding the variance, we use a distribution B that dominates the distribution of τ x and iseasier to work with and then compute the upper bound Var h b f x i = E τ x (cid:20) f x (cid:18) seed ( x ) < τ x ] − (cid:19)(cid:21) (4) ≤ E t ∼ B (cid:20) f x (cid:18) seed ( x ) < t ] − (cid:19)(cid:21) . With PPSWOR, the distribution of τ x is dominated by the distribution Erlang [ P z ∈X f z , k − ,where Erlang [ V, k ] is the distribution of the sum of k independent exponential random variableswith parameter V . The density function of Erlang [ V, k ] is B V,k ( t ) = V k t k − e − V t ( k − . Choosing B tobe Erlang [ P z ∈X f ( ν z ) , k − in (4) and using (3), we get the bound in (2).Note that if we have an estimator that gives a weaker bound of c · f x τ x on the conditional varianceand the distribution of τ x is similarly dominated by Erlang [ P z ∈X f z , k − , we will obtain acorresponding bound on the unconditional variance with overhead c . A function f : [0 , ∞ ) → [0 , ∞ ) is soft concave sublinear if for some a ( t ) ≥ it can be expressedas f ( ν ) = L c [ a ]( ν ) := Z ∞ a ( t )(1 − e − νt ) dt . (5) L c [ a ]( ν ) is called the complement Laplace transform of a at ν . The function a ( t ) is the inverseLaplace c (complement Laplace) transform of f : a ( t ) = ( L c ) − [ f ]( t ) . (6)A table with the inverse Laplace c transform of several common functions (in particular, the moments ν p for p ∈ (0 , and ln (1 + ν ) ) appears in [10]. We additionally use the notation L c [ a ]( ν ) βα := Z βα a ( t )(1 − e − νt ) dt . The sampling schemes we present in this work will be deﬁned for soft concave sublinear functionsof the frequencies. However, this will allow us to estimate well any function that is within a smallmultiplicative constant of a soft concave sublinear function. In particular, we can estimate concavesublinear functions . These functions can be expressed as f ( ν ) = Z ∞ a ( t ) min { , νt } dt (7)for a ( t ) ≥ . The concave sublinear family includes all functions such that f (0) = 0 , f is mono-tonically non-decreasing, ∂ + f (0) < ∞ , and ∂ f ≤ .Any concave sublinear function f can be approximated by a soft concave sublinear function asfollows. Consider the corresponding soft concave sublinear function ˜ f using the same coefﬁcients a ( t ) . The function ˜ f closely approximates f pointwise [10]: (1 − /e ) f ( ν ) ≤ ˜ f ( ν ) ≤ f ( ν ) . We note in our applications, lower values mean higher inclusion probabilities. In most applications, highervalues are associated with better results, and accordingly, ﬁrst-order stochastic dominance is usually deﬁned asthe reverse. The deﬁnition also allows a ( t ) to have discrete mass at points (that is, we can add a component of the form P i a ( t i )(1 − e − νt i ) ). We generally ignore this component for the sake of presentation, but one way to modelthis is using Dirac delta. Here we also allow a ( t ) to have discrete mass using Dirac delta. For the sake of presentation, we alsoassume bounded ν – otherwise we need to add a linear component A ∞ ν for some A ∞ ≥ . The component A ∞ ν can easily be added to the ﬁnal sketch presented in Section 5, for example, by taking the minimum withanother independent PPSWOR sketch. ˜ f will respectively approximate a weighted sample for f (later explainedin Remark 5.11). In this section, we provide an analysis of PPSWOR for a case that will appear later in our mainsketch. The case we consider is the following. In the PPSWOR sampling scheme described inSection 2.2, the weights w x of the keys were part of the deterministic input to the algorithm. In thissection, we consider PPSWOR sampling when the weights are random variables. We will show thatunder certain assumptions, PPSWOR sampling according to randomized inputs is close to samplingaccording to the expected values of these random inputs.Formally, let X be a set of keys. Each key x ∈ X is associated with r x ≥ independent randomvariables S x, , . . . , S x,r x in the range [0 , T ] (for some constant T > ). The weight of key x is therandom variable S x := P r x i =1 S x,i . We additionally denote its expected weight by v x := E [ S x ] , andthe expected sum statistics by V := P x v x .A stochastic PPSWOR sample is a PPSWOR sample computed for the key-value pairs ( x, S x ) . Thatis, we draw the random variables S x , then we draw for each x a random variable seed ( x ) ∼ Exp [ S x ] , and take the k keys with lowest seed values.We prove two results that relate stochastic PPSWOR sampling to a PPSWOR sample according tothe expected values v x . The ﬁrst result bounds the variance of estimating v x using a stochasticPPSWOR sample. We consider the conditional inverse-probability estimator of v x (Deﬁnition 2.3).Note that even though the PPSWOR sample was computed using the random weight S x , the esti-mator b v x is computed using v x and will be v x Pr[ seed ( x ) <τ ] for keys x in the sample. Based on thediscussion in Section 2.3, it sufﬁces to bound the per-key variance and relate it to the per-key vari-ance bound for a PPSWOR sample computed directly for v x . We show that when V ≥ T k , theoverhead due to the stochastic sample is at most (that is, the variance grows by a multiplicativefactor of ). The proof details would also reveal that when V ≫ T k , the worst-case bound on theoverhead is actually closer to . Theorem 3.1.

Let k ≥ . In a stochastic PPSWOR sample, if V ≥ T k , then for every key x ∈ X ,the variance Var [ˆ v x ] of the bottom- k inverse probability estimator of v x is bounded by Var [ˆ v x ] ≤ v x Vk − . Note that in order to compute these estimates, we need to be able to compute the values v x = E [ S x ] and Pr[ seed ( x ) < τ ] for sampled keys. With stochastic sampling, the precise distribution SeedDist x depends on the distributions of the random variables S x,i . For now, however, we assumethat SeedDist x and v x are available to us with the sample. In Section 5, when we use stochasticsampling, we will also show how to compute SeedDist x .The second result in this section provides a lower bound on the probability that a key x is included inthe stochastic PPSWOR sample of size k = 1 . We show that when V ≥ ε ln (cid:0) ε (cid:1) T , the probabilitykey x is included in the sample is at least − ε times the probability it is included in a PPSWORsample according to the expected weights. Theorem 3.2.

Let ε ≤ . Consider a stochastic PPSWOR sample of size k = 1 . If V ≥ ε ln (cid:0) ε (cid:1) T ,the probability that any key x ∈ X is included in the sample is at least (1 − ε ) vV . The proofs of the two theorems are deferred to Appendix B. SumMax

Sampling Sketch

In this section, we present an auxiliary sampling sketch which will be used in Section 5. The sketchprocesses elements e = ( e. key , e. val ) with keys e. key = ( e. key .p, e. key .s ) that are structured tohave a primary key e. key .p and a secondary key e. key .s . For each primary key x , we deﬁne SumMax D ( x ) := X z | z.p = x Max D ( z ) lgorithm 3: SumMax

Sampling Sketch // Initialize empty structure s Input:

Sample size ks.h ← fully independent random hash with range Exp [1]

Initialize s. sample // A bottom- k structure (Algorithm 1)// Process element e = ( e. key , e. val ) where e. key = ( e. key .p, e. key .s ) Process element ( e. key .p, s.h ( e. key ) /e. val ) to structure s. sample // bottom- k processelement (Algorithm 1)// Merge structures s , s (with s .h = s .h ) to get ss.h ← s .h // s .h = s .hs. sample ← Merge s . sample , s . sample // bottom- k merge (Algorithm 1) where Max is as deﬁned in Section 2. If there are no elements e ∈ D such that e. key .p = x , thenby deﬁnition Max D ( z ) = 0 for all z with z.p = x (as there are no elements in D with key z )and therefore SumMax D ( x ) = 0 . Our goal in this section is to design a sketch that produces aPPSWOR sample of primary keys x according to weights SumMax D ( x ) . Note that while the keyspace of the input elements contains structured keys of the form e. key = ( e. key .p, e. key .s ) , the keyspace for the output sample will be the space of primary keys only. Our sampling sketch is describedin Algorithm 3.The sketch structure consists of a bottom- k structure and a hash function h . We assume we have aperfectly random hash function h such that for every key z = ( z.p, z.s ) , h ( z ) ∼ Exp [1] indepen-dently (in practice, we assume that the hash function is provided by the platform on which we run).We process an input element e by generating a new data element with key e. key .p (the primary keyof the key of the input element) and value ElementScore ( e ) := h ( e. key ) /e. val and then processing that element by our bottom- k structure. The bottom- k structure holds our cur-rent sample of primary keys.By deﬁnition, the bottom- k structure retains the k primary keys x with minimum seed D ( x ) := min e ∈ D | e. key .p = x ElementScore ( e ) . To establish that this is a PPSWOR sample according to

SumMax D ( x ) , we study the distributionof seed D ( x ) . Lemma 4.1.

For all primary keys x that appear in elements of D , seed D ( x ) ∼ Exp [ SumMax D ( x ))] . The random variables seed D ( x ) are independent. The proof is deferred to Appendix C.Note that the distribution of seed D ( x ) , which is Exp [ SumMax D ( x )] , does not depend on the par-ticular structure of D or the order in which elements are processed, but only on the parameter SumMax D ( x ) . The bottom- k sketch structure maintains the k primary keys with smallest seed D ( x ) values. We therefore get the following corollary. Corollary 4.2.

Given a stream or distributed set of elements D , the sampling sketch in Algorithm 3produces a PPSWOR sample according to the weights SumMax D ( x ) . In this section, we are given a set D of elements e = ( e. key , e. val ) and we wish to maintain asample of k keys, that will be close to PPSWOR according to a soft concave sublinear function oftheir frequencies f ( ν x ) . At a high level, our sampling sketch is guided by the sketch for estimatingthe statistics f D due to Cohen [10]. Our sketch uses a parameter ε that will tradeoff the running timeof processing an element with the bound on the variance of the inverse-probability estimator.Recall that a soft concave sublinear function f can be represented as f ( w ) = L c [ a ]( w ) ∞ = R ∞ a ( t )(1 − e − wt ) dt for a ( t ) ≥ . Using this representation, we express f ( ν x ) as a sum of two10ontributions for each key x : f ( ν x ) = L c [ a ]( ν x ) γ + L c [ a ]( ν x ) ∞ γ , where γ is a value we will set adaptively while processing the elements. Our sampling sketch isdescribed in Algorithm 4. It maintains a separate sampling sketch for each set of contributions.The sketch for L c [ a ]( ν x ) γ is discussed in Section 5.1, and the sketch for L c [ a ]( ν x ) ∞ γ is discussedin Section 5.2. In order to produce a sample from the sketch, these separate sketches need to becombined. Algorithm 5 describes how to produce a ﬁnal sample from the sketch. This is discussedfurther in Section 5.3. Finally, we discuss the computation of the inverse-probability estimators \ f ( ν x ) for the sampled keys in Section 5.4. In particular, in order to compute the estimator, we needto know the values f ( ν x ) for the keys in the sample, which will require a second pass over the data.The analysis will result in the following main theorem. Algorithm 4:

Sampling Sketch Structure for f // Initialize empty structure s Input: k : Sample size, ε , a ( t ) ≥ Initialize s. SumMax // SumMax sketch of size k (Algorithm 3) Initialize s.ppswor // PPSWOR sketch of size k (Algorithm 2) Initialize s.sum ← // A sum of all the elements seen so far Initialize s.γ ← ∞ // Threshold

Initialize s. Sideline // A composable max-heap/priority queue//

Process elementInput:

Element e = ( e. key , e. val ) , structure s Process e by s.ppswors.sum ← s.sum + e. val s.γ ← εs.sum // r = k/ε foreach i ∈ [ r ] do y ∼ Exp [ e. val ] // Exponentially distributed with parameter e. val // Process in Sideline if The key ( e. key , i ) appears in s. Sideline then

Update the value of ( e. key , i ) to be the minimum of y and the current value else Add the element (( e. key , i ) , y ) to s. Sideline while s. Sideline contains an element g = ( g. key , g. val ) with g. val ≥ s.γ do Remove g from s. Sideline if R ∞ g. val a ( t ) dt > then Process element ( g. key , R ∞ g. val a ( t ) dt ) by s. SumMax // Merge two structures s and s to s (with same k, ε, a and same h in SumMax sub-structures) s.sum ← s .sum + s .sums.γ ← εs.sum s. Sideline ← merge s . Sideline and s . Sideline // Merge priority queues . s.ppswor ← merge s .ppswor and s .ppswor // Merge PPSWOR structures s. SumMax ← merge s . SumMax and s . SumMax // Merge

SumMax structures while s. Sideline contains an element g = ( g. key , g. val ) with g. val ≥ s.γ do Remove g from s. Sideline if R ∞ g. val a ( t ) dt > then Process element ( g. key , R ∞ g. val a ( t ) dt ) by s. SumMax

Theorem 5.1.

Let k ≥ , < ε ≤ , and f be a soft concave sublinear function. Algorithms 4 and5 produce a stochastic PPSWOR sample of size k − , where each key x has weight V x that satisﬁes f ( ν x ) ≤ E [ V x ] ≤ − ε ) f ( ν x ) . The per-key inverse-probability estimator of f ( ν x ) is unbiased and lgorithm 5: Produce a Final Sample from a Sampling Sketch Structure (Algorithm 4)

Input:

Sampling sketch structure s for f Output:

Sample of size k of key and seed pairs if R ∞ γ a ( t ) dt > thenforeach e ∈ s. Sideline do Process element ( e. key , R ∞ γ a ( t ) dt ) by sketch s. SumMax foreach e ∈ s. SumMax . sample do e. val ← r ∗ e. val // Multiply value by r if R γ ta ( t ) dt > thenforeach e ∈ s. ppswor . sample do e. val ← e. val R γ ta ( t ) dt // Divide value by B ( γ ) sample ← merge s. SumMax . sample and s. ppswor . sample // Bottom- k merge(Algorithm 1) else sample ← s. SumMax . sample return sample has variance Var h \ f ( ν x ) i ≤ f ( ν x ) P z ∈X f ( ν z )(1 − ε ) ( k − . The space required by the sketch at any given time is O ( k ) in expectation. Additionally, with prob-ability at least − δ , the space will not exceed O (cid:16) k + min { log m, log log (cid:16) Sum D Min ( D ) (cid:17) } + log (cid:0) δ (cid:1)(cid:17) at any time while processing D , where m is the number of elements in D , Min ( D ) is the minimumvalue of an element in D , and Sum D is the sum of frequencies of all keys. Remark 5.2.

The parameter ε mainly affects the run time of processing an element. For eachelement processed by the stream, we generate r = kε output elements that are then further processedby the sketch. Hence, the run time of processing an element grows with ε . The space is affected by ε when considering worst case over the randomness. The total number of possible keys for outputelements is r times the number of active keys, and in the worst case (over the randomness), we maystore all of them in Sideline . The sketch and estimator speciﬁcation use the following functions in a black-box fashion A ( γ ) := Z ∞ γ a ( t ) dtB ( γ ) := Z γ ta ( t ) dt where a ( t ) is the inverse complement Laplace transform of f , as speciﬁed in Equation (6) (Sec-tion 2.4). Closed expressions for A ( t ) and B ( t ) for some common concave sublinear functions f are provided in [10]. These functions are well-deﬁned for any soft concave sublinear f . Also notethat it sufﬁces to approximate the values of A and B within a small multiplicative error (which willcarry over to the variance, see Remark 5.11), so one can also use a table of values to compute thefunction.While processing the stream, we will keep track of the sum of values of all elements Sum D = P x ∈X ν x . We will then use Sum D to set γ adaptively to be ε Sum D . Thus, this is a running “candi-date” value that can only decrease over time. The ﬁnal value of γ will be set when we produce asample from the sketch in Algorithm 5. In the discussion below, we will show that setting γ = ε Sum D satisﬁes the conditions needed for each of the sketches for L c [ a ]( ν x ) γ and L c [ a ]( ν x ) ∞ γ .12 .1 The Sketch for L c [ a ]( ν x ) γ For the contributions L c [ a ]( ν x ) γ = R γ a ( t )(1 − e − ν x t ) dt , we will see that as long as we choose asmall enough γ , R γ a ( t )(1 − e − ν x t ) dt will be approximately (cid:0)R γ a ( t ) tdt (cid:1) ν x , up to a multiplicative − ε factor. Note that (cid:0)R γ a ( t ) tdt (cid:1) ν x is simply the frequency ν x scaled by B ( γ ) = R γ a ( t ) tdt .A PPSWOR sample is invariant to the scaling, so we can simply use a PPSWOR sampling sketchaccording to the frequencies ν x (Algorithm 2). The scaling only needs to be considered in a ﬁnalstep when the samples of the two sets of contributions are combined to produce a single sample. Lemma 5.3.

Let ε > and γ ≤ ε max x ν x . Then, for any key x , (1 − ε ) (cid:18)Z γ a ( t ) tdt (cid:19) ν x ≤ Z γ a ( t )(1 − e − ν x t ) dt ≤ (cid:18)Z γ a ( t ) tdt (cid:19) ν x . Proof.

Consider a key x with frequency ν x . Using − e − z ≤ z , we get Z γ a ( t )(1 − e − ν x t ) dt ≤ Z γ a ( t ) ν x tdt = (cid:18)Z γ a ( t ) tdt (cid:19) ν x . Now, using − e − z ≥ z − z for z ≥ , Z γ a ( t )(1 − e − ν x t ) dt ≥ Z γ a ( t ) (cid:18) ν x t − ( ν x t ) (cid:19) dt. Note that γ ≤ ε max y ν y ≤ εν x . Hence, for every ≤ t ≤ γ , ν x t ≤ ε , and ν x t − ( ν x t ) ≥ (1 − ε ) ν x t .As a result, we get that Z γ a ( t )(1 − e − ν x t ) dt ≥ (1 − ε ) (cid:18)Z γ a ( t ) tdt (cid:19) ν x . Note that our choice of γ = ε Sum D satisﬁes the condition of Lemma 5.3. L c [ a ]( ν x ) ∞ γ The sketch for L c [ a ]( ν x ) ∞ γ = R ∞ γ a ( t )(1 − e − ν x t ) dt processes elements in the following way. Wemap each input element e = ( e. key , e. val ) into r = kε output elements with keys ( e. key , through ( e. key , r ) and values Y i ∼ Exp [ e. val ] drawn independently. Each of these elements is then pro-cessed separately.The main component of the sketch is a SumMax sampling sketch of size k . Our goal is thatfor each generated output element (( e. key , i ) , Y i ) , the SumMax sketch will process an element (( e. key , i ) , A (max { Y i , γ } )) . However, since γ decreases over time and we do not know its ﬁnalvalue, we only process the elements with Y i ≥ γ into the SumMax sketch. We keep the rest of theelements in an auxiliary structure (implemented as a maximum priority queue) that we call

Sideline .Every time we update the value of γ , we remove the elements with Y i ≥ γ and process them into the SumMax sketch. Thus, at any time the

Sideline structure only contains elements with value lessthan γ . For any active input key x and i ∈ [ r ] , let M x,i denote the minimum value Y i that was generatedwith key ( x, i ) . We have the following invariants that we will use in our analysis:1. Either the element (( x, i ) , M x,i ) is in Sideline or an element (( x, i ) , A ( M x,i )) was pro-cessed by the SumMax sketch. In our implementation (Section 6) we incorporated an optimization where we only keep in the PPSWORsample elements that may contribute to the ﬁnal sample. In our implementation (see Section 6) we only keep in

Sideline elements that have the potential to modifythe SumMax sketch when inserted.

13. Since γ is decreasing over time, all elements ejected from the Sideline have value that isat least γ .We also will use the following property of the sketch. Lemma 5.4.

For any key x that was active in the input and i ∈ [ r ] , M x,i ∼ Exp [ ν x ] and theserandom variables are independent for different pairs ( x, i ) .Proof. M x,i by deﬁnition is the minimum of independent exponential random variables with sumof parameters ν x .In the following lemma, we bound the size of Sideline (and as a result, the entire sketch) withprobability − δ for < δ < of our choice. Lemma 5.5.

For a set of elements D , denote by m the number of elements in D , and let Min ( D ) denote the minimum value of any element in D . The expected number of elements in Sideline at anygiven time is O ( k ) , and with probability at least − δ , the number of elements in Sideline will notexceed O (cid:16) k + min { log m, log log (cid:16) Sum D Min ( D ) (cid:17) } + log (cid:0) δ (cid:1)(cid:17) at any time while processing D . The proof is deferred to Appendix D.

The ﬁnal sample returned by Algorithm 5 is the merge of two samples:1. The PPSWOR sample for L c [ a ]( ν x ) γ with frequencies scaled by B ( γ ) = R γ ta ( t ) dt .2. The SumMax sample for L c [ a ]( ν x ) ∞ γ with weights scaled by r . Before the scaling, the SumMax sample processes an element ( e. key , A ( γ )) for each remaining e ∈ Sideline .The scaling is performed using a property of exponential random variables and is formalized in thefollowing lemma.

Lemma 5.6.

Given a PPSWOR sample where each key x has frequency ν x , we can obtain aPPSWOR sample for the weights c · ν x by returning the original sample of keys but dividing theseed value of each key by c .Proof. A property of exponential random variables is that if Y ∼ Exp [ w ] , then for any constant c > , y/c ∼ Exp [ cw ] . Consider the set of seed values { seed ( x ) | x ∈ X } computed for theoriginal PPSWOR sample according to the frequencies ν x . If we divided each seed value by c , theseed of key x would come from the distribution Exp ( cν x ) . Hence, a PPSWOR sample according tothe weights cν x would contain the k keys with lowest seed values after dividing by c , and these k keys are the same keys that have lowest seed values before dividing by c .Denote by E the set of all elements that are passed on to the SumMax sketch, either during theprocessing of the set of elements D or in the ﬁnal phase. Lemma 5.7.

The ﬁnal sample computed by Algorithm 5 is a PPSWOR sample with respect toweights V x = 1 r SumMax E ( x ) + ν x Z γ ta ( t ) dt . Proof.

The sample s. ppswor in Algorithm 5 is a PPSWOR sample with respect to frequencies ν x ,which is then scaled by R γ ta ( t ) dt to get a PPSWOR sample according to the weights ν x R γ ta ( t ) dt .The sample s. SumMax is a

SumMax sample, which by Corollary 4.2 is a PPSWOR sample ac-cording to the weights

SumMax E ( x ) . This sample is scaled to be a PPSWOR sample according tothe weights r SumMax E ( x ) .Note that these samples are independent. When we perform a bottom- k merge of the two samples,the seed of key x is then the minimum of two independent exponential random variables with pa-rameters R γ ta ( t ) dt and r SumMax E ( x ) . Therefore, the distribution of seed ( x ) in the mergedsample is Exp ( R γ ta ( t ) dt + r SumMax E ( x )) , which means that the sample is a PPSWOR sampleaccording to those weights, as desired. 14e next interpret r SumMax E ( x ) . From the invariants listed in Section 5.2 and the descrip-tion of Algorithm 5, we have that for any active input key x and i ∈ [ r ] , the element (( x, i ) , A (max { γ, M x,i } )) was processed by the SumMax sketch. Because A is monotonicallynon-increasing, Max E (( x, i )) = max e ∈ E | e. key =( x,i ) e. val = A (max { γ, M x,i } ) . Now, r SumMax E ( x ) = P ri =1 1 r Max E (( x, i )) . By Lemma 5.4, the summands r Max E (( x, i )) are independent random variables for every x and i . By the monotonicity of the function A , eachsummand r Max E (( x, i )) is in the range h , A ( γ ) r i .The ﬁnal sample returned by Algorithm 5 is then a stochastic PPSWOR sample as deﬁned in Sec-tion 3. The weight of key x also includes the deterministic component ν x R γ ta ( t ) dt , which can belarger than A ( γ ) r . However, since this is summand is deterministic, we can break it into smaller deter-ministic parts, each of which will be at most A ( γ ) r . This way, the sample still satisﬁes the conditionthat the weight of every key is the sum of independent random variables in h , A ( γ ) r i . The next stepis to show that it satisﬁes the conditions of Theorem 3.1. Lemma 5.8.

For a key x , deﬁne V x = r SumMax E ( x ) + ν x R γ ta ( t ) dt . Let V = P x ∈X V x . Then,for any < ε ≤ and r ≥ kε , E [ V ] ≥ A ( γ ) r · k. The proof, which is deferred to Appendix D, uses the following lemma which is due to Cohen [10].

Lemma 5.9. [10] For every input key x and i = 1 , . . . , r , E [ Max E (( x, i ))] = L c [ a ]( ν x ) ∞ γ = Z ∞ γ a ( t )(1 − e − ν x t ) dt. The following theorem combines the previous lemmas to show how to estimate f ( ν x ) using thesample and bound the variance. Note that we need to specify how to compute the estimator. Forgetting f ( ν x ) we make another pass, and the computation of the conditioned inclusion probabilityis described in the next subsection. Theorem 5.10.

The sample returned by Algorithm 5 is a stochastic PPSWOR sample, where eachkey x has weight V x that satisﬁes f ( ν x ) ≤ E [ V x ] ≤ − ε ) f ( ν x ) . The per-key inverse-probabilityestimator according to the weights f ( ν x ) , \ f ( ν x ) = ( f ( ν x )Pr[ seed ( x ) <τ ] x ∈ S x / ∈ S . is unbiased and has variance

Var h \ f ( ν x ) i ≤ E [ V x ] E [ V ] k − ≤ f ( ν x ) P z ∈X f ( ν z )(1 − ε ) ( k − where V = P x ∈X V x .Proof. We ﬁrst prove that f ( ν x ) ≤ E [ V x ] ≤ − ε ) f ( ν x ) . The randomized weight V x is the sum oftwo terms: V x = 1 r SumMax E ( x ) + ν x Z γ ta ( t ) dt. As in the proof of Lemma 5.8, using Lemma 5.9, E (cid:20) r SumMax E ( x ) (cid:21) = 1 r r X i =1 E [ Max E (( x, i ))] = Z ∞ γ a ( t )(1 − e − ν x t ) dt. (8)15he quantity ν x R γ ta ( t ) dt is deterministic, and since γ ≤ ε max x ν x , by Lemma 5.3, Z γ a ( t )(1 − e − ν x t ) dt ≤ ν x Z γ ta ( t ) dt ≤ − ε Z γ a ( t )(1 − e − ν x t ) dt. (9)Recall that f ( ν x ) = R ∞ a ( t )(1 − e − ν x t ) dt = R γ a ( t )(1 − e − ν x t ) dt + R ∞ γ a ( t )(1 − e − ν x t ) dt .Combining Equations (8) and (9), we get that f ( ν x ) ≤ E [ V x ] ≤ − ε ) f ( ν x ) . Now consider the estimator \ f ( ν x ) for f ( ν x ) . To show that the estimator is unbiased, that is, E h \ f ( ν x ) i = f ( ν x ) , we can follow the proof of Claim 2.4 exactly as written earlier. It is left tobound the variance. For the sake of the analysis, consider the following estimator for E [ V x ] : [ E [ V x ] = ( E [ V x ]Pr[ seed ( x ) <τ ] x ∈ S x / ∈ S .

This is again the inverse-probability estimator from Deﬁnition 2.3. Our sample is a stochasticPPSWOR sample according to the weights V x , where each one of V x is a sum of independentrandom variables in h , A ( γ ) r i (recall that the deterministic part can also be expressed as a sumwith each summand in h , A ( γ ) r i ). Lemma 5.8 shows that E [ V ] ≥ A ( γ ) r · k . Hence, we satisfy theconditions of Theorem 3.1, which in turn shows that Var h [ E [ V x ] i ≤ E [ V x ] E [ V ] k − . Finally, note that \ f ( ν x ) = f ( ν x ) E [ V x ] · [ E [ V x ] . We established above that f ( ν x ) E [ V x ] ≤ and E [ V x ] ≤ − ε ) f ( ν x ) . We conclude that Var h \ f ( ν x ) i = Var (cid:20) f ( ν x ) E [ V x ] · [ E [ V x ] (cid:21) = (cid:18) f ( ν x ) E [ V x ] (cid:19) Var h [ E [ V x ] i ≤ E [ V x ] E [ V ] k − ≤ f ( ν x ) P z ∈X f ( ν z )(1 − ε ) ( k − . Remark 5.11.

The theorem establishes a bound on the variance of the estimator for E [ V x ] , and thenuses it to bound the variance of the estimator for f ( ν x ) , which is possible since E [ V x ] approximates f ( ν x ) . This results in increasing the variance by an ε -dependent factor. Similarly, if we wish toestimate f ( ν x ) for a concave sublinear function f (and not a soft concave sublinear function), wecan use the same idea and lose another constant factor in the variance. To compute the estimator \ f ( ν x ) , we need to know both f ( ν x ) and the precise conditioned inclusionprobability Pr[ seed ( x ) < τ ] . In order to get f ( ν x ) , we perform a second pass over the data elementsto obtain the exact frequencies ν x for x ∈ S . This can be done via a simple composable sketch thatcollects and sums the values of data elements with keys that occur in the sample S .We next consider computing the conditioned inclusion probabilities. The following lemma considersthe seed distributions of keys in the ﬁnal sample. It shows that the distributions are parameterizedby ν x and describes their CDF. 16 emma 5.12. Algorithms 4 and 5 describe a bottom- k sampling scheme, where in the outputsample the seed of each key x is drawn from a distribution SeedDist ( F ) [ ν x ] . The distribution SeedDist ( F ) [ w ] has the following cumulative distribution function: SeedCDF ( F ) ( w, t ) := Pr s ∼ SeedDist ( F ) [ w ] [ s < t ] = 1 − p p r , where p = exp( − wB ( γ ) t ) p = Z ∞ w exp( − wy ) exp( − A (max { y, γ } ) t/r ) dy The proof is deferred to Appendix D.

We implemented our sampling sketch and report here the results of experiments on real and syntheticdatasets. Our experiments are small-scale and aimed to demonstrate the simplicity and practicalityof our sketch design and to understand the actual space and error bounds (that can be signiﬁcantlybetter than our worst-case bounds).

Our Python 2.7 implementation follows the pseudocode of the sampling sketch (Algorithm 4), thePPSWOR (Algorithm 2) and

SumMax (Algorithm 3) substructures, the sample production fromthe sketch (Algorithm 5), and the estimator (that evaluates the conditioned inclusion probabilities,see Section 5.4). We incorporated two practical optimizations that are not shown in the pseudocode.These optimizations do not affect the outcome of the computation or the worst-case analysis, butreduce the sketch size in practice.

Removing redundant keys from the PPSWOR subsketch

The pseudocode (Algorithm 4) main-tains two samples of size k , the PPSWOR and the SumMax samples. The ﬁnal sample of size k is obtained by merging these two samples. Our implementation instead maintains a truncatedPPSWOR sketch that removes elements that are already redundant (do not have a potential to beincluded in the merged sample). We keep an element in the PPSWOR sketch only when the seedvalue is lower than rB ( γ ) times the current threshold τ of the SumMax sample. This means thatthe “effective” inclusion threshold we use for the PPSWOR sketch is the minimum of the k th largest(the threshold of the PPSWOR sketch) and rB ( γ ) τ . To establish that elements that do not satisfythis condition are indeed redundant, recall that when we later merge the PPSWOR and the SumMax samples, the value of B ( γ ) can only become lower and the SumMax threshold can only be lower,making inclusion more restrictive. This optimization may result in maintaining much fewer than k elements and possibly an empty PPSWOR sketch. The beneﬁt is larger for functions when A ( t ) isbounded (as t approaches ). In particular, when a ( t ) = 0 for t ≤ γ we get B ( γ ) = 0 and thetruncation will result in an empty sample. Removing redundant elements from Sideline

The pseudocode may place elements in

Sideline that have no future potential of modifying the

SumMax sketch. In our implementation, we placeand keep an element (( e. key , i ) , Y ) in Sideline only as long as the following condition holds: If (( e. key , i ) , A ( Y )) is processed by the current SumMax sketch, it would modify the sketch. Toestablish redundancy of discarded elements, note that when an element is eventually processed, thevalue it is processed with is at most A ( Y ) (can be A ( γ ) for γ ≥ Y ) and also at that point the SumMax sketch threshold can only be more restrictive.

We used the following datasets for the experiments: • abcnews [28]: News headlines published by the Australian Broadcasting Corp. For eachword, we created an element with value .17able 1: Experimental Results: f ( ν ) = ν . , rep. k NRMSE Benchmark max abcnews ( . × elements, . × keys)

25 0 .

834 0 .

213 0 .

217 31 . . .

577 0 .

142 0 .

128 0 .

137 58 . . .

468 0 .

120 0 .

111 0 .

110 85 . . .

404 0 .

105 0 .

098 0 .

103 111 . . Dataset: flickr ( . × elements, . × keys)

25 0 .

834 0 .

200 0 .

190 0 .

208 31 . . .

577 0 .

144 0 .

147 0 .

142 57 . . .

468 0 .

123 0 .

114 0 .

110 83 . . .

404 0 .

115 0 .

095 0 .

099 108 . . Dataset: zipf1.1 ( . × elements, . × keys)

25 0 .

834 0 .

215 0 .

198 0 .

217 31 . . .

577 0 .

123 0 .

137 0 .

131 58 . . .

468 0 .

109 0 .

115 0 .

114 84 . . .

404 0 .

106 0 .

103 0 .

097 111 . . Dataset: zipf1.2 ( . × elements, . × keys)

25 0 .

834 0 .

199 0 .

208 0 .

214 31 . . .

577 0 .

144 0 .

138 0 .

145 57 . . .

468 0 .

122 0 .

116 0 .

124 83 . . .

404 0 .

098 0 .

109 0 .

096 109 . . Dataset: zipf1.5 ( . × elements, . × keys)

25 0 .

834 0 .

201 0 .

207 0 .

194 30 . . .

577 0 .

152 0 .

139 0 .

142 56 . . .

468 0 .

115 0 .

112 81 . . .

404 0 .

098 0 .

094 0 .

086 107 . . • flicker [36]: Tags used by Flickr users to annotate images. The key of each element is atag, and the value is the number of times it appeared in a certain folder. • Three synthetic generated datasets that contain × data elements. Each element hasvalue , and the key was chosen according to the Zipf distribution ( numpy.random.zipf ),with Zipf parameter values α ∈ { . , . , . } . The Zipf family in this range is often agood model to real-world frequency distributions.We applied our sampling sketch with sample size parameter values k ∈ { , , , } and set theparameter ε = 0 . in all experiments. We sampled according to two concave sublinear functions:the frequency moment f ( ν ) = ν . and f ( ν ) = ln(1 + ν ) .Tables 1 and 2 report aggregated results of 200 repetitions for each combination of dataset, k , and f values. In each repetition, we were using the ﬁnal sample to estimate the sum P x ∈X f ( ν x ) overall keys. For error bounds, we list the worst-case bound on the CV (which depends only on k and ε and is proportional to / √ k ) and report the actual normalized root of the average squared error(NRMSE). In addition, we report the NRMSE that we got from 200 repetitions of estimating thesame statistics using two common sampling schemes for aggregated data, PPSWOR and prioritysampling, which we use as benchmarks.Also, we consider the size of the sketch after processing each element. Since the representation ofeach key can be explicit and require a lot of space, we separately consider the number of distinctkeys and the number of elements stored in the sketch. We report the maximum number of distinctkeys stored in the sketch at any point (the average and the maximum over the 200 repetitions) and therespective maximum number of elements stored in the sketch at any point during the computations(again, the average and the maximum over the 200 repetitions). The size of the sketch is measured atend of the processing of each input element – during the processing we may store one more distinctkey and temporarily store up to r = k/ε additional elements in the Sideline.We can see that the actual error reported is signiﬁcantly lower than the worst-case bound. Further-more, the error that our sketch gets is close to the error achieved by the two benchmark samplingschemes. We can also see that the maximum number of distinct keys stored in the sketch at any18able 2: Experimental Results: f ( ν ) = ln(1 + ν ) , rep. k NRMSE Benchmark max abcnews ( . × elements, . × keys)

25 0 .

834 0 .

208 0 .

217 0 .

194 29 . . .

577 0 .

138 0 .

136 0 .

142 54 . . .

468 0 .

130 0 .

099 0 .

117 80 . . .

404 0 .

102 0 .

115 0 .

103 104 . . Dataset: flickr ( . × elements, . × keys)

25 0 .

834 0 .

227 0 .

199 0 .

180 28 . . .

577 0 .

144 0 .

151 0 .

129 53 . . .

468 0 .

119 0 .

121 0 .

109 78 . . .

404 0 .

097 0 .

104 0 .

095 102 . . Dataset: zipf1.1 ( . × elements, . × keys)

25 0 .

834 0 .

201 0 .

204 0 .

234 29 . . .

577 0 .

127 0 .

132 0 .

129 54 . . .

468 0 .

116 0 .

122 0 .

110 79 . . .

404 0 .

107 0 .

106 0 .

104 104 . . Dataset: zipf1.2 ( . × elements, . × keys)

25 0 .

834 0 .

209 0 .

195 0 .

218 28 . . .

577 0 .

147 0 .

144 0 .

139 53 . . .

468 0 .

120 0 .

111 0 .

113 78 . . .

404 0 .

098 0 .

106 0 .

102 103 . . Dataset: zipf1.5 ( . × elements, . × keys)

25 0 .

834 0 .

210 0 .

197 0 .

226 27 . . .

577 0 .

141 0 .

146 0 .

149 52 . . .

468 0 .

124 0 .

112 0 .

106 76 . . .

404 0 .

100 0 .

101 0 .

099 101 . . time is relatively close to the speciﬁed sample size of k and that the total sketch size in terms ofelements rarely exceeded k , with the relative excess seeming to decrease with k . In comparison,the benchmark schemes require space that is the number of distinct keys (for the aggregation), whichis signiﬁcantly higher than the space required by our sketch. We presented composable sampling sketches for weighted sampling of unaggregated data tailoredto a concave sublinear function of the frequencies of keys. We experimentally demonstrated thesimplicity and efﬁcacy of our design: Our sketch size is nearly optimal in that it is not much largerthan the ﬁnal sample size, and the estimate quality is close to that provided by a weighted samplecomputed directly over the aggregated data.

Acknowledgments

Oﬁr Geri was supported by NSF grant CCF-1617577, a Simons Investigator Award for MosesCharikar, and the Google Graduate Fellowship in Computer Science in the School of Engineeringat Stanford University. The computing for this project was performed on the Sherlock cluster. Wewould like to thank Stanford University and the Stanford Research Computing Center for providingcomputational resources and support that contributed to these research results.

References [1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequencymoments.

J. Comput. System Sci. , 58:137–147, 1999.[2] A. Andoni, R. Krauthgamer, and K. Onak. Streaming algorithms via precision sampling. In , pages 363–372,Oct 2011. 193] B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data.In

ACM-SIAM Symposium on Discrete Algorithms , pages 633–634, 2002.[4] V. Braverman, S. R. Chestnut, D. P. Woodruff, and L. F. Yang. Streaming space complexity ofnearly all functions of one variable on frequency vectors. In

PODS . ACM, 2016.[5] V. Braverman, R. Krauthgamer, and L. F. Yang. Universal streaming of subset norms.

CoRR ,abs/1812.00241, 2018.[6] V. Braverman and R. Ostrovsky. Zero-one frequency laws. In

STOC . ACM, 2010.[7] A. Chakrabarti, G. Cormode, and A. McGregor. A near-optimal algorithm for estimating theentropy of a stream.

ACM Trans. Algorithms , 6(3):51:1–51:21, July 2010.[8] M. T. Chao. A general purpose unequal probability sampling plan.

Biometrika , 69(3):653–656,1982.[9] E. Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis.

TKDE ,2015.[10] E. Cohen. Hyperloglog hyperextended: Sketches for concave sublinear frequency statistics. In

KDD . ACM, 2017. full version: https://arxiv.org/abs/1607.06517 .[11] E. Cohen. Stream sampling framework and application for frequency cap statistics.

ACMTrans. Algorithms , 14(4):52:1–52:40, September 2018.[12] E. Cohen, G. Cormode, and N. Dufﬁeld. Don’t let the negatives bring you down: Samplingfrom streams of signed updates. In

Proc. ACM SIGMETRICS/Performance , 2012.[13] E. Cohen, N. Dufﬁeld, H. Kaplan, C. Lund, and M. Thorup. Efﬁcient stream sampling forvariance-optimal estimation of subset sums.

SIAM J. Comput. , 40(5), 2011.[14] E. Cohen, N. Dufﬁeld, H. Kaplan, C. Lund, and M. Thorup. Algorithms and estimators foraccurate summarization of unaggregated data streams.

J. Comput. System Sci. , 80, 2014.[15] E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In

ACM PODC , 2007.[16] E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In

Proceedings of the34th VLDB Conference , 2008.[17] N. Dufﬁeld, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums.

J. Assoc. Comput. Mach. , 54(6), 2007.[18] C. Estan and G. Varghese. New directions in trafﬁc measurement and accounting. In

SIG-COMM . ACM, 2002.[19] P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimalcardinality estimation algorithm. In

Analysis of Algorithms (AofA) . DMTCS, 2007.[20] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications.

J.Comput. System Sci. , 31:182–209, 1985.[21] G. Frahling, P. Indyk, and C. Sohler. Sampling in dynamic data streams and applications.

International Journal of Computational Geometry & Applications , 18(01n02):3–28, 2008.[22] P. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximatequery answers. In

SIGMOD . ACM, 1998.[23] Google.

Frequency capping: AdWords help , December 2014. https://support.google.com/adwords/answer/117579 .[24] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from aﬁnite universe.

Journal of the American Statistical Association , 47(260):663–685, 1952.[25] R. Jayaram and D. P. Woodruff. Perfect lp sampling in a data stream. In , pages 544–555, Oct 2018.[26] H. Jowhari, M. Sa˘glam, and G. Tardos. Tight bounds for lp samplers, ﬁnding duplicatesin streams, and related problems. In

Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems , PODS ’11, pages 49–58, 2011.[27] D. E. Knuth.

The Art of Computer Programming, Vol 2, Seminumerical Algorithms . Addison-Wesley, 1st edition, 1968.[28] R. Kulkarni. A million news headlines [csv data ﬁle]. , 2017.2029] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Aguera y Arcas. Communication-Efﬁcient Learning of Deep Networks from Decentralized Data. In Aarti Singh and Jerry Zhu,editors,

Proceedings of the 20th International Conference on Artiﬁcial Intelligence and Statis-tics , volume 54 of

Proceedings of Machine Learning Research , pages 1273–1282. PMLR,2017.[30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representationsof words and phrases and their compositionality. In

Proceedings of the 26th InternationalConference on Neural Information Processing Systems - Volume 2 , NIPS’13, pages 3111–3119,2013.[31] J. Misra and D. Gries. Finding repeated elements. Technical report, Cornell University, 1982.[32] M. Monemizadeh and D. P. Woodruff. 1-pass relative-error lp-sampling with applications. In

Proc. 21st ACM-SIAM Symposium on Discrete Algorithms . ACM-SIAM, 2010.[33] E. Ohlsson. Sequential poisson sampling.

J. Ofﬁcial Statistics , 14(2):149–162, 1998.[34] M. Osborne.

Facebook Reach and Frequency Buying , October 2014. http://citizennet.com/blog/2014/10/01/facebook-reach-and-frequency-buying/ .[35] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation.In

EMNLP , 2014.[36] A. Plangprasopchok, K. Lerman, and L. Getoor. Growing a tree in the forest: Constructingfolksonomies by integrating structured metadata. In

Proceedings of the 16th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining , KDD ’10, pages 949–958, 2010.[37] B. Ros´en. Asymptotic theory for successive sampling with varying probabilities without re-placement, I.

The Annals of Mathematical Statistics , 43(2):373–397, 1972.[38] B. Ros´en. Asymptotic theory for order sampling.

J. Statistical Planning and Inference ,62(2):135–158, 1997.[39] J.S. Vitter. Random sampling with a reservoir.

ACM Trans. Math. Softw. , 11(1):37–57, 1985.

A Proofs Deferred from Section 2

Proof of Proposition 2.1.

Each data element e = ( e. key , e. val ) is processed by giving it a score ElementScore ( e ) ∼ Exp ( e. val ) and then processing the element ( e. key , ElementScore ( e )) bythe bottom- k structure. For a key x , we deﬁne seed ( x ) := min e ∈ D | e. key = x ElementScore ( e ) to be the smallest score of an element with key x .Since seed ( x ) is the minimum of independent exponential random variables, its distributionis Exp ( w x ) . After processing the elements in D , the bottom- k structure contains the k pairs ( x, seed ( x )) with smallest seed ( x ) values, and hence obtains the respective PPSWOR sample.Consider now the sketch resulting from merging two sampling sketches computed for D and D . For each key x , denote by seed ( x ) and seed ( x ) the values of x in the sketches for D and D , respectively. Then, the merged sketch contains the k pairs ( x, seed ( x )) with small-est seed ( x ) := min { seed ( x ) , seed ( x ) } values. As seed ( x ) is the minimum of two inde-pendent exponential random variables with parameters Sum D ( x ) and Sum D ( x ) , we get that seed ( x ) ∼ Exp ( Sum D ∪ D ( x )) , as desired. B Proofs Deferred from Section 3

B.1 Threshold Distribution and Fixed-Threshold Inclusion Probability

Before proceeding, we establish two technical lemmas that will be useful later. The ﬁrst lemmashows that the distribution of the k -th lowest seed is dominated by the Erlang distribution whichtakes the sum of the expected weights V as a parameter. The lemma will be useful later when weconsider the inclusion threshold τ x . 21 emma B.1. Consider a set of keys X , such that the weight of each x ∈ X is a random variable S x ≥ . Let V = P x ∈X E [ S x ] , and for each x with S x > , draw seed ( x ) ∼ Exp [ S x ] inde-pendently. Then, the distribution of the k -th lowest seed, { seed ( x ) | x ∈ X } ( k ) , is dominated by Erlang [ V, k ] . We ﬁrst establish the dominance relation

Exp [ S ] (cid:22) Exp [ E [ S ]] for any nonnegative random variable S . Lemma B.2.

Let S ≥ be a random variable. Let X be a random variable such that X ∼ Exp [ S ] when S > and X = ∞ otherwise. Then, the distribution of X is dominated by Exp [ E [ S ]] , thatis, ∀ τ, Pr X ∼ Exp [ S ] [ X ≤ τ ] ≤ − e − E [ S ] τ .Proof. Follows from Jensen’s inequality:

Pr[ X ≤ τ ] = E S [1 − e − Sτ ] = 1 − E S [ e − Sτ ] ≤ − e − E [ S ] τ Therefore, for any key x , the seed ( x ) distribution with stochastic weights is dominated by Exp [ E [ S x ]] , which is the distribution used by PPSWOR according to the expected weights. We nowconsider the distribution of { seed ( x ) | x ∈ X } ( k ) , which is the k -th lowest seed value. We showthat the distribution of the k -th lowest seed is dominated by Erlang [ V, k ] (recall that V = P x E [ S x ] ).In the proof, we will use the following property of dominance (the proof of the following propertyis standard and included here for completeness). Claim B.3.

Let X , . . . , X k , Y , . . . , Y k be independent random variables such that the distributionof X i is dominated by that of Y i . Then the distribution of X + . . . + X k is dominated by that of Y + . . . + Y k .Proof. We prove for k = 2 (the proof for k > follows from a simple induction argument). Denoteby f i and F i the PDF and CDF functions of X i , respectively, and by g i and G i the PDF and CDF of Y i . From the dominance assumption, we know that F i ( t ) ≤ G i ( t ) for all i . Now, Pr[ X + X < t ] = Z ∞ f ( x ) Pr[ X < t − x ] dx = Z t f ( x ) F ( t − x ) dx ≤ Z t g ( x ) F ( t − x ) dx [from dominance as F ( t − x ) is non-increasing in x ] ≤ Z t g ( x ) G ( t − x ) dx = Pr[ Y + Y < t ] We are now ready to prove Lemma B.1.

Proof of Lemma B.1.

Conditioned on the values of S x , the distribution of { seed ( x ) | x ∈ X } ( k ) is dominated by Erlang [ P x ∈X S x , k ] (for a proof, see Appendix C in [11]). The distribution X takes values in R and cannot be ∞ . However, the random variable X represents the minimum seedvalue (or the inclusion threshold τ x as in Section 2.3), and the event X ≤ τ represents whether the inclusionthreshold for key x is at most τ . The case S = 0 corresponds to no elements generated with keys in X \ { x } ,so we can say that the inclusion threshold for x is ∞ (the event we care about is whether x enters the sampleor not). Here we are trying to show a distribution that dominates the distribution of the inclusion threshold, andfor that purpose, any threshold τ > is more restrictive than ∞ . From a technical perspective, when S = 0 ,we can still use the CDF of Exp [ S ] since Pr[ X ≤ τ ] = 0 = 1 − e − Sτ . Later, when we consider the k -thlowest seed, we will similarly allow it to be ∞ when less than k keys are active. { seed ( x ) | x ∈ X } ( k ) (unconditioned on the values of S x ) is a linear combination of dis-tributions, which are each dominated by the respective Erlang distribution. Using the deﬁnitionof dominance (Deﬁnition 2.6) and the law of total probability, we get that the distribution of { seed ( x ) | x ∈ X } ( k ) (unconditioned on S x ) is dominated by Erlang [ P x ∈X S x , k ] (uncondi-tioned on S x ). A random variable drawn from Erlang [ P x ∈X S x , k ] has the same distribution as thesum of k independent random variables drawn from Exp ( P x ∈X S x ) . The distribution of each ofthese k exponential random variables is dominated by Exp ( V ) (by Lemma B.2). Using Claim B.3,we get that Erlang [ P x ∈X S x , k ] (cid:22) Erlang [ V, k ] . The assertion of the lemma then follows from thetransitivity of dominance.The second lemma provides lower bounds on the CDF of Exp ( S ) under certain conditions. Lemma B.4.

Let the random variable S = P ri =1 S i be a sum of r independent random variablesin the range [0 , T ] . Let v = E [ S ] . Then, Pr X ∼ Exp [ S ] [ X ≤ τ ] ≥ − e − vτ (1 − τT/ . In the regime τ T < , we get that the probability of being less than τ is close to that of Exp ( E [ S ]) . Lemma B.5.

Let S be a random variables in [0 , T ] with expectation E [ S ] = v . Then for all τ , Pr X ∼ Exp [ S ] [ X ≤ τ ] ≥ vT (1 − e − T τ ) . Proof.

Denote the probability density function of S by p S . Conditioned on the value of S , theprobability of X ∼ Exp [ S ] being below τ is Pr[ X ≤ τ | S = s ] = 1 − e − sτ . It follows that

Pr[ X ≤ τ ] = E [1 − e − Sτ ] = Z T p S ( x )(1 − e − xτ ) dx. Consider the function f ( x ) = 1 − e − xτ for a ﬁxed τ ≥ . Since f is concave, for every x ∈ [0 , T ] , f ( x ) = f (cid:16)(cid:16) − xT (cid:17) · xT · T (cid:17) ≥ (cid:16) − xT (cid:17) · f (0) + xT · f ( T )= xT (1 − e − T τ ) . By monotonicity, Z T p S ( x )(1 − e − xτ ) dx ≥ Z T p S ( x ) · xT (1 − e − T τ ) dx and ﬁnally, Pr[ X ≤ τ ] ≥ − e − T τ T · Z T p S ( x ) xdx = vT · (1 − e − T τ ) . Lemma B.6.

Let the random variable S = P ri =1 S i be a sum of r independent random variablesin the range [0 , T ] . Let v = E [ S ] . Then, Pr X ∼ Exp [ S ] [ X ≤ τ ] ≥ − exp (cid:16) − vT (1 − e − T τ ) (cid:17) . Proof.

Let X ∼ Exp [ S ] . Since S = P ri =1 S i , we could deﬁne r independent exponential randomvariables X i ∼ Exp [ S i ] . X has the same distribution as min ≤ i ≤ r X i . Hence, Pr[

X > τ ] = Pr (cid:20) min ≤ i ≤ r X i > τ (cid:21) r Y i =1 Pr [ X i > τ ] ≤ r Y i =1 (cid:18) − E [ S i ] T · (1 − e − T τ ) (cid:19) ≤ (cid:18) − E [ S ] rT · (1 − e − T τ ) (cid:19) r where the last inequality follows from the arithmetic mean-geometric mean inequality. Now, usingthe inequality − x ≤ e − x (for any x ∈ R ), and the fact that f ( x ) = x r is non-decreasing for x, r ≥ , we get that Pr[

X > τ ] ≤ exp (cid:18) − E [ S ] rT · (1 − e − T τ ) · r (cid:19) = exp (cid:18) − E [ S ] T · (1 − e − T τ ) (cid:19) . Consequently,

Pr[ X ≤ τ ] ≥ − exp (cid:16) − vT (1 − e − T τ ) (cid:17) . Proof of Lemma B.4.

Follows from Lemma B.6 using the inequality − e − x ≥ x − x / for x ≥ . B.2 Variance Bounds for the Inverse-Probability Estimator

Proof of Theorem 3.1.

We start bounding the per-key variance as in Claim 2.5:

Var ( b v x ) = E τ x (cid:20) v x (cid:18) seed ( x ) < τ x ] − (cid:19)(cid:21) . By Lemma B.1, we know that the distribution of τ x (the k − lowest seed of the keys in X \ { x } )is dominated by Erlang [ V, k − , hence Var ( b v x ) ≤ E t ∼ Erlang [ V,k − (cid:20) v x (cid:18) seed ( x ) < t ] − (cid:19)(cid:21) = Z ∞ B V,k − ( t ) v x (cid:18) seed ( x ) < t ] − (cid:19) dt = Z /T B V,k − ( t ) · v x (cid:18) seed ( x ) < t ] − (cid:19) dt + Z ∞ /T B V,k − ( t ) · v x (cid:18) seed ( x ) < t ] − (cid:19) dt To bound the ﬁrst summand, since t ≤ T , we get from Lemma B.4 (applied to seed ( x ) ) that Pr[ seed ( x ) < t ] ≥ − e − v x t ( − tT ) ≥ − e − v x t/ . It follows that Z /T B V,k − ( t ) · v x (cid:18) seed ( x ) < t ] − (cid:19) dt ≤ Z /T B V,k − ( t ) · v x (cid:18) − e − v x t/ − (cid:19) dt ≤ Z /T B V,k − ( t ) · v x v x t/ dt [ e − x − e − x ≤ x ] = 2 Z /T B V,k − ( t ) · v x t dt ≤ Z ∞ B V,k − ( t ) · v x t dt v x Vk − [PPSWOR analysis (Section 2.3)]To bound the second summand, since t > T , Pr[ seed ( x ) < t ] ≥ Pr[ seed ( x ) < /T ] ≥ − e − v x / T . Subsequently, Z ∞ /T B V,k − ( t ) · v x (cid:18) seed ( x ) < t ] − (cid:19) dt ≤ Z ∞ /T B V,k − ( t ) · v x (cid:18) − e − v x / T − (cid:19) dt = v x (cid:18) − e − v x / T − (cid:19) Z ∞ /T B V,k − ( t ) dt ≤ v x (cid:18) − e − v x / T − (cid:19) [integral of density] ≤ v x v x / T [ e − x − e − x ≤ x ] = 2 T v x ≤ v x Vk [ V ≥ T k ]Combining, we get that

Var ( b v x ) ≤ v x Vk − v x Vk ≤ v x Vk − . B.3 Inclusion Probability in a Stochastic Sample

Proof of Theorem 3.2.

We ﬁrst separately deal with the case where there is only one key, which wedenote x . In this case, V = v x , and if S x > , then x is included in the sample. Otherwise, thesample is empty. In the proof of Lemma B.4, when S = 0 , we used Pr[ X ≤ τ ] = 1 − e − sτ = 0 and the event X ≤ τ does not happen. Hence, we can use Lemma B.4 to bound Pr[ S x > ≥ Pr[ seed ( x ) ≤ τ ] for any τ > . We pick τ = εT and get that x is included in the sample withprobability Pr[ S x > ≥ − e − v εT (1 − ε ) ≥ − e − εε (1 − ε ) ln ( ε ) ≥ − ε using V ≥ ε ln (cid:0) ε (cid:1) T and − ε ) ≥ .If there is more than one key, a key x is included in the sample if seed ( x ) is smaller than the seedof all other keys. The distribution of min z = x seed ( z ) is Exp (cid:16)P z = x S z (cid:17) , which is dominated by Exp (cid:16)P z = x v z (cid:17) (Lemma B.2). Then, Pr[ seed ( x ) < min z = x seed ( z )] ≥ E t ∼ Exp [ V − v x ] Pr[ seed ( x ) < t ]= Z ∞ ( V − v x ) e − ( V − v x ) t Pr[ seed ( x ) < t ] dt ≥ Z ε/T ( V − v x ) e − ( V − v x ) t Pr[ seed ( x ) < t ] dt + Z ∞ ε/T ( V − v x ) e − ( V − v x ) t Pr[ seed ( x ) < ε/T ] dt ≥ Z ε/T ( V − v x ) e − ( V − v x ) t (cid:16) − e − v x t (1 − tT/ (cid:17) dt Z ∞ ε/T ( V − v x ) e − ( V − v x ) t (cid:16) − e − v x εT (1 − ε ) (cid:17) dt ≥ Z ∞ ( V − v x ) e − ( V − v x ) t dt − Z ε/T ( V − v x ) e − ( V − v x ) t e − v x t (1 − ε ) dt − Z ∞ ε/T ( V − v x ) e − ( V − v x ) t e − v x εT (1 − ε ) dt = 1 − Z ε/T ( V − v x ) e − ( V − εv x ) t dt − e − ( V − v x ) εT e − v x εT (1 − ε ) = 1 − V − v x V − εv x Z ε/T ( V − εv x ) e − ( V − εv x ) t dt − e − ( V − εv x ) εT = 1 − V − v x V − εv x (cid:16) − e − ( V − εv x ) εT (cid:17) − e − ( V − εv x ) εT = (cid:18) − V − v x V − εv x (cid:19) (cid:16) − e − ( V − εv x ) εT (cid:17) = (1 − ε ) v x V − εv x (cid:16) − e − ( V − εv x ) εT (cid:17) ≥ (1 − ε ) v x V (cid:16) − e − εT (1 − ε ) V (cid:17) [ V ≥ v x ] ≥ (1 − ε ) v x V (cid:16) − e − ε (1 − ε ) ǫ ln ( ε ) (cid:17) ≥ (1 − ε ) v x V (cid:16) − e − ln ( ε ) (cid:17) [2(1 − ε ) ≥ − ε ) · v x V ≥ (1 − ε ) v x V .

C Proofs Deferred from Section 4

Proof of Lemma 4.1.

With a slight abuse of notation, for a full key z = ( z.p, z.s ) we deﬁne seed D ( z ) := min e ∈ D | e. key = z ElementScore ( e ) . Now, since we use the same value h ( z ) for all elements with key z , the minimum ElementScore ( e ) value generated for an element e ∈ D with key e. key = z is h ( z ) / Max D ( z ) : seed D ( z ) = min e ∈ D | e. key = z h ( z ) e. val = h ( z ) Max D ( z ) . Recall that for X ∼ Exp [1] and a > , the distribution of X/a is Exp [ a ] , and that h ( z ) ∼ Exp [1] .Therefore, the algorithm effectively draws seed D ( z ) ∼ Exp [ Max D ( z )] for every key z . Moreover,from our assumption of independence of h , the variables seed D ( z ) of different keys z are alsoindependent.We now notice that for a primary key x , seed D ( x ) = min z | z.p = x seed D ( z ) . That is, seed D ( x ) is the minimum, over all keys z with primary key z.p = x that appeared in atleast one element of D , of seed D ( z ) .The random variables seed D ( z ) for input keys z are independent and exponentially distributed withrespective parameters Max D ( z ) . From properties of the exponential distribution, their minimum is26lso exponentially distributed with a parameter that is equal to the sum of their parameters Max D ( z ) : seed D ( x ) ∼ Exp  X z | z.p = x Max D ( z )  , that is, seed D ( x ) ∼ Exp [ SumMax D ( x )] . Moreover, the independence of seed D ( x ) (for primarykeys x ) follows from the independence of seed D ( z ) (for input keys z ). D Proofs Deferred from Section 5

Proof of Lemma 5.5.

Consider a ﬁxed time during the processing of D by Algorithm 4 (after somebut potentially not all elements have been processed). For each key x , let v x be the sum of values ofelements with key x that have been processed so far.For any t > , key x ∈ X , and i ∈ [ r ] , we deﬁne an indicator random variable I tx,i for the eventthat an element with key ( x, i ) was generated with value less than t . In particular, the number ofelements in Sideline is P x ∈X P ri =1 I γx,i . The event I tx,i = 1 is the event that the minimum value ofthe elements generated with key ( x, i ) is at most t . The distribution of the minimum value of theseelements is Exp ( v x ) , and it follows that E [ I tx,i ] = 1 − e − tv x ≤ tv x . In particular, when t = γ = ε P z ∈X v z and r = kε , we get E " X x ∈X r X i =1 I γx,i ≤ X x ∈X r · εv x P z ∈X v z = 2 rε = 2 k. From Chernoff bounds, Pr " X x ∈X r X i =1 I γx,i > m + 3 ln (cid:0) δ (cid:1) k ! k ≤ e − k − ln m − ln ( δ ) ≤ δm . Applying this each time an element is processed and taking union bound, we get that the size of

Sideline increases beyond k + 3 ln m + 3 ln (cid:0) δ (cid:1) at any time with probability at most δ .We now improve the bound to use log log (cid:16) Sum D Min ( D ) (cid:17) instead of log m . Let t > . Consider all thetimes where the value of γ is in the interval (cid:2) t, t (cid:3) , and for every γ ′ in that interval, let v x ( γ ′ ) denote the frequency of key x at the time where γ = γ ′ . Since γ decreases over time as elementsare processed, any generated element stored in Sideline when γ ∈ (cid:2) t, t (cid:3) must have value at most t .Since only more elements are generated as γ decreases, we can look all the elements that have beengenerated until γ reached t . We bound the number of elements with value at most t that have been generated until the time where γ = t . From the way we set γ in Algorithm 4, we get as long as γ ≥ t , P x ∈X v x ( γ ) t ≤ ε . Now,consider the indicator I tx,i as deﬁned above for the time where γ = t . The number of elementsstored in Sideline at any time when γ ∈ (cid:2) t, t (cid:3) is at most P x ∈X P ri =1 I tx,i . We get that E " X x ∈X r X i =1 I tx,i ≤ r X x ∈X t · v x ( t/ ≤ rε = 4 k and using Chernoff bounds, Pr  X x ∈X r X i =1 I tx,i >  ⌈ log (cid:16) Sum D Min ( D ) (cid:17) ⌉ + 3 ln (cid:0) δ (cid:1) k  k  ≤ e − k − ln ⌈ log (cid:16) Sum D Min ( D ) (cid:17) ⌉− ln ( δ ) It may be the case that γ is never t , but in that case we consider the minimum value of γ that is at least t . δ ⌈ log (cid:16) Sum D Min ( D ) (cid:17) ⌉ . (10)Finally, the minimum value γ can get is ε Min ( D ) , and the maximum value is ε Sum D . Hence, wecan divide the interval of possible values for γ into ⌈ log (cid:16) Sum D Min ( D ) (cid:17) ⌉ intervals of the form (cid:2) t, t (cid:3) ,and apply the bound in Equation (10) to each one of them. By the union bound, we get that theprobability that the size of Sideline exceeds k + 3 ln ⌈ log (cid:16) Sum D Min ( D ) (cid:17) ⌉ + 3 ln (cid:0) δ (cid:1) at any time duringthe processing of D is at most δ . Proof of Lemma 5.8.

Using Lemma 5.9, for every key x , E [ V x ] ≥ E (cid:20) r SumMax E ( x ) (cid:21) = 1 r r X i =1 E [ Max E (( x, i ))]= 1 r r X i =1 Z ∞ γ a ( t )(1 − e − ν x t ) dt [By Lemma 5.9] = Z ∞ γ a ( t )(1 − e − ν x t ) dt ≥ Z ∞ γ a ( t )(1 − e − ν x γ ) dt = A ( γ )(1 − e − ν x γ ) . Recall that γ = ε Sum D . Then, using − e − x ≥ x for ≤ x ≤ , E [ V ] = E " X x ∈X V x ≥ X x ∈X A ( γ )(1 − e − ν x γ )= X x ∈X A ( γ ) (cid:16) − e − ν x · ε Sum D (cid:17) ≥ X x ∈X A ( γ ) ν x · ε Sum D = A ( γ ) ε Sum D X x ∈X ν x = A ( γ ) ε. Since r ≥ kε , we conclude that E [ V ] ≥ A ( γ ) r · k. Proof of Lemma 5.12.

Consider a key x . The seed seed ( F ) ( x ) in the output sample is the minimumof seed (1) ( x ) and seed (2) , which are the seed values obtained by the scaled PPSWOR and the SumMax samples, respectively.The scaled PPSWOR sample is computed with respect to the weights ν x B ( γ ) , and thus seed (1) ( x ) ∼ Exp [ ν x B ( γ )] . Therefore using the density function of Exp [ ν x B ( γ )] , we get thatfor all t > , p = Pr[ seed (1) ( x ) > t ] = exp( − ν x B ( γ ) t ) . SumMax sample is a PPSWOR sample with respect to weights r SumMax E ( x ) .Therefore, seed (2) ( x ) ∼ Exp [ r SumMax E ( x )] . Note however that r SumMax E ( x ) is itself arandom variable and in particular, the value SumMax E ( x ) is not available to us with the sam-ple. We recall that SumMax E ( x ) = P ri =1 Max E (( x, i )) where Max E (( x, i )) are i.i.d. randomvariables. Using properties of the exponential distribution, we know that Exp [ r SumMax E ( x )] is the same distribution as the minimum of r independent random variables drawn from Exp [ r Max E (( x, , . . . , Exp [ r Max E (( x, r ))] . Therefore, for t > , Pr[ seed (2) ( x ) > t ] = Y i Pr (cid:20) Exp (cid:20) r Max E (( x, i )) (cid:21) > t (cid:21) . We now express

Pr[

Exp [ r Max E (( x, i ))] > t ] using the fact that Max E (( x, i )) = A (max { y, γ } ) for y ∼ Exp [ ν x ] : p = Pr (cid:20) Exp (cid:20) r Max E (( x, i )) (cid:21) > t (cid:21) = Z ∞ ν x exp( − ν x y ) Pr (cid:20) Exp (cid:20) r A (max { y, γ } ) (cid:21) > t (cid:21) dy = Z ∞ ν x exp( − ν x y ) exp( − A (max { y, γ } ) t/r ) dy . Since

Pr[ seed (2) ( x ) > t ] = p r and using the fact that seed (1) ( x ) and seed (2) ( x ) are independent,we conclude that Pr[ seed ( F ) ( x ) < t ] = 1 − Pr[min { seed (1) ( x ) , seed (2) ( x ) } > t ]= 1 − Pr[ seed (1) ( x ) > t ] Pr[ seed (2) ( x ) > t ]= 1 − p p r ..