Multi-Objective Weighted Sampling
Edith Cohen, Google Research, Mountain View, CA, USA. [email protected]
Abstract — Multi-objective samples are powerful and versatile summaries of large data sets. For a set of keys x ∈ X and associated values f_x ≥ 0, a weighted sample taken with respect to f allows us to approximate segment-sum statistics sum(f; H) = Σ_{x∈H} f_x, for any subset H of the keys, with statistically guaranteed quality that depends on the sample size and on the relative weight of H. When estimating sum(g; H) for g ≠ f, however, the quality guarantees are lost. A multi-objective sample with respect to a set of functions F provides for each f ∈ F the same statistical guarantees as a dedicated weighted sample while minimizing the summary size. We analyze properties of multi-objective samples and present sampling schemes and meta-algorithms for estimation and optimization while showcasing two important application domains. The first is key-value data sets, where different functions f ∈ F applied to the values correspond to different statistics such as moments, thresholds, capping, and sum. A multi-objective sample allows us to approximate all statistics in F. The second is metric spaces, where keys are points, and each f ∈ F is defined by a set of points C with f_x being the service cost of x by C, and sum(f; X) models the centrality or clustering cost of C. A multi-objective sample allows us to estimate costs for each f ∈ F. In these domains, multi-objective samples are often of small size, are efficient to construct, and enable scalable estimation and optimization. We aim here to facilitate further applications of this powerful technique.
1 INTRODUCTION
Random sampling is a powerful tool for working with very large data sets on which exact computation, even of simple statistics, can be time and resource consuming. A small sample of the data allows us to efficiently obtain approximate answers. Consider data in the form of key-value pairs {(x, f_x)}, where keys x are from some universe X, f_x ≥ 0, and we define f_x ≡ 0 for keys x ∈ X that are not present in the data. Very common statistics over such data are segment-sum statistics sum(f; H) = Σ_{x∈H} f_x, where H ⊂ X is a segment of X. Examples of such data sets are IP flow keys and bytes, users and activity, or customers and distance to the nearest facility. Segments may correspond to a certain demographic or location or other meta-data of the keys. Segment statistics in these examples correspond, respectively, to the total traffic, activity, or service cost of the segment. When the data set is large, we can compute a weighted sample, which includes each key x with probability (roughly) proportional to f_x and allows us to estimate segment-sum statistics ŝum(f; H) for query segments H. Popular weighted sampling schemes [30], [32] include Poisson Probability Proportional to Size (pps) [21], VarOpt [5], [12], and the bottom-k schemes [28], [14], [15] Sequential Poisson (priority) [24], [17] and pps without replacement (ppswor) [27]. These weighted samples provide us with nonnegative unbiased estimates ŝum(g; H) for any segment and any g ≥ 0 (provided that f_x > 0 when g_x > 0). For all statistics sum(f; H) we obtain statistical guarantees on estimation quality: The error, measured by the coefficient of variation (CV), which is the standard deviation divided by the mean, is at most the inverse
of the square root of the size of the sample multiplied by the fraction sum(f; H)/sum(f; X) of "weight" that is due to the segment H. This trade-off of quality (across segments) and sample size is (worst-case) optimal. Moreover, the estimates are well-concentrated in the Chernoff-Bernstein sense: The probability of an error that is c times the CV decreases exponentially in c.
In many applications, such as the following examples, there are multiple sets of values f ∈ F that are associated with the keys: (i) Data records can come with explicit multiple weights, as with activity summaries of customers/jobs that specify both bandwidth and computation consumption. (ii) Metric objectives, such as our service-cost example, where each configuration of facility locations induces a different set of distances and hence service costs. (iii) The raw data can be specified in terms of a set of key-value pairs {(x, w_x)}, but we are interested in different functions f_x ≡ f(w_x) of the values that correspond to different statistics, such as:
Statistics            function f(w)
count                 f(w) = 1 for w > 0
sum                   f(w) = w
threshold with T > 0  thresh_T(w) = I(w ≥ T)
moment with p > 0     f(w) = w^p
capping with T > 0    cap_T(w) = min{T, w}
1. The alternative term selection is used in the DB literature and the term domain is used in the statistics literature.
Example 1.1.
Consider a toy data set D of ten key-value pairs (u1, ·), (u3, ·), (u10, ·), (u12, ·), (u17, ·), (u24, ·), (u31, ·), (u42, ·), (u43, ·), (u55, ·). For a segment H whose intersection H ∩ D consists of four of these keys, we have sum(H) = 128, count(H) = 4, thresh(H) = 2, cap(H) = 17, and 2-moment(H) = 10414.
For these applications, we are interested in a summary that can provide us with estimates of statistically guaranteed quality for each f ∈ F. The naive solutions are not satisfactory: We can compute a weighted sample taken with respect to a particular f ∈ F, but the quality of the estimates ŝum(g; H) rapidly degrades with the dissimilarity between g and f. We can compute a dedicated sample for each f ∈ F, but the total summary size can be much larger than necessary. Multi-objective samples, a notion crystallized in [16], provide us with the desired statistical guarantees on quality with minimal summary size. Multi-objective samples build on the classic notion of sample coordination [23], [4], [29], [7], [28], [25]. In a nutshell, coordinated samples are locality-sensitive hashes of f, mapping similar f to similar samples. A multi-objective sample is (roughly) the union S(F) = ∪_{f∈F} S(f) of all the keys that are included in coordinated weighted samples S(f) for f ∈ F. Because the samples are coordinated, the number of distinct keys included, and hence the size of S(F), is (roughly) as small as possible. Since for each f ∈ F the sample S(F) "includes" the dedicated sample S(f), the estimate quality from S(F) dominates that of S(f).
In this paper, we review the definition of multi-objective samples, study their properties, and present efficient sampling schemes. We consider both general sets F of objectives and families F with special structure. By exploiting special structure, we can bound the overhead, which is the increase factor in sample size necessary to meet multi-objective quality guarantees, and obtain efficient sampling schemes that avoid dependence of the computation on |F|. In Section 2 we review (single-objective) weighted sampling, focusing on the Poisson pps and bottom-k sampling schemes. We then review the definitions [16] and establish properties of multi-objective samples. In Section 3 we study multi-objective pps samples. We show that the multi-objective sample size is also necessary for meeting the quality guarantees for segment statistics for all f ∈ F. We also show that the guarantees are met when we use upper bounds on the multi-objective pps sampling probabilities instead of working with the exact values. In Section 4 we study multi-objective bottom-k samples. In Section 5 we establish a fundamental property of multi-objective samples: We define the sampling closure of a set of objectives F as the set of all functions f for which a multi-objective sample S(F) meets the quality guarantees for segment statistics. Clearly F is contained in its closure, but we show that the closure also includes every f that is a non-negative linear combination of functions from F. In Section 6 we consider data sets in the form of key-value pairs and the family M of all monotone non-decreasing functions of the values. This family includes most natural statistics, such as our examples of count, sum, threshold, moments, and capping. Since M is infinite, it is inefficient to apply a generic multi-objective sampling algorithm to compute S(M). We present efficient near-linear sampling schemes for S(M) which also apply over streamed or distributed data.
Moreover, we establish a bound on the sample size of E[|S(M)|] ≤ k ln n, where n is the number of keys in our data set and k is the reference size of the single-objective samples S(f) for each f ∈ M. The design is based on a surprising relation to All-Distances Sketches [7], [8]. Furthermore, we establish that (when key weights are unique) a sample of size Ω(k ln n) is necessary: Intuitively, the "hardness" stems from the need to support all threshold functions. In Section 7 we study the set C = {cap_T | T > 0} of all capping functions. The closure of C includes all concave f ∈ M
with at most linear growth (that is, satisfying f′(x) ≤ 1 and f′′(x) ≤ 0). Since C ⊂ M, the multi-objective sample S(M) includes S(C) and provides estimates with statistical guarantees for all f ∈ C. The more specialized sample S(C), however, can be much smaller than S(M). We design an efficient algorithm for computing S(C) samples. In Section 8 we discuss metric objectives and multi-objective samples as summaries of a set of points that allow us to approximate such objectives. In Section 9 we discuss different types of statistical guarantees across functions f in settings where we are only interested in statistics sum(f; X) over the full data. Our basic multi-objective sample analysis gives the sample size required for ForEach, where the statistical guarantees apply to each estimate ŝum(f; H) in isolation, and in particular to each estimate over the full data set. ForAll is much stronger and bounds the (distribution of the) maximum relative error of the estimates ŝum(f; X) over all f ∈ F. Meeting ForAll typically necessitates a larger multi-objective sample size than meeting
ForEach. In Section 10 we present a meta-algorithm for optimization over samples. The goal is to maximize a (smooth) function of sum(f; X) over f ∈ F. When X is large, we can instead perform the optimization over a small multi-objective sample of X. This framework has important applications to metric objectives and to estimating the loss of a model from examples. The ForOpt guarantee is for a sample size that facilitates such optimization, that is, the approximate maximizer over the sample is an approximate maximizer over the data set. This guarantee is stronger than ForEach but generally weaker than ForAll. We make a key observation that with a ForEach sample we are only prone to testable one-sided errors on the optimization result. Based on that, we present an adaptive algorithm where the sample size is increased until ForOpt is met. This framework unifies and generalizes previous work on optimization over coordinated samples [13], [11]. We conclude in Section 11.
2 WEIGHTED SAMPLING (SINGLE OBJECTIVE)
We review weighted sampling schemes with respect to a set of values f_x, focusing on preparation for the multi-objective generalization. The schemes are specified in terms of a sample-size parameter k which allows us to trade off representation size and estimation quality. The pps sample S(f,k) includes each key x independently with probability
p_x^{(f,k)} = min{1, k f_x / Σ_y f_y}.   (1)
Example 2.1.
The table below lists pps sampling probabilities p_x^{(f,3)} (k = 3, rounded to the nearest hundredth) for the keys in our example data, for sum (f_x = w_x), thresh_T (f_x = I(w_x ≥ T)), and cap_T (f_x = min{T, w_x}), with the same threshold and cap parameters as in Example 1.1. The number in parentheses is sum(f; X) = Σ_x f_x. We can see that the sampling probabilities vary highly between the functions f.
key         u1    u3    u10   u12   u17   u24   u31   u42   u43   u55
w_x
sum (385)   0.04  0.78  0.18  0.05  0.01  0.04  1.00  0.15  0.02  0.02
thresh (4)  0.00  0.75  0.75  0.00  0.00  0.00  0.75  0.75  0.00  0.00
cap (41)    0.37  0.37  0.37  0.37  0.07  0.37  0.37  0.37  0.22  0.15
pps samples can be computed by associating a random value u_x ∼ U[0,1] with each key x and including the key in the sample if u_x ≤ p_x^{(f,k)}. This formulation is useful to us when there are multiple objectives, as it facilitates the coordination of samples taken with respect to the different objectives. Coordination is achieved by using the same set of values u_x.
Bottom-k (order) sampling: Bottom-k sampling unifies priority (Sequential Poisson) [24], [17] and pps without replacement (ppswor) sampling [27]. To obtain a bottom-k sample for f we associate a random value u_x ∼ U[0,1] with each key. To obtain a ppswor sample we use r_x ≡ −ln(1 − u_x) and to obtain a priority sample we use r_x ≡ u_x. The bottom-k sample S(f,k) for f contains the k keys with minimum f-seed, where
f-seed(x) ≡ r_x / f_x.
To support estimation, we also retain the threshold τ^{(f,k)}, which is defined to be the (k+1)st smallest f-seed.
We estimate a statistics sum(g; H) from a weighted sample S(f,k) using the inverse probability estimator [22]:
ŝum(g; H) = Σ_{x ∈ H∩S} g_x / p_x^{(f,k)}.   (2)
The estimate is always nonnegative and is unbiased when the functions satisfy g_x > 0 ⇒ f_x > 0 (which ensures that any key x with g_x > 0 is sampled with positive probability). To apply this estimator, we need to compute p_x^{(f,k)} for x ∈ S. To do so with pps samples (1) we include the sum Σ_x f_x with S as auxiliary information. For bottom-k samples, inclusion probabilities of keys are not readily available. We therefore use the inverse probability estimator (2) with conditional probabilities p_x^{(f,k)} [17], [15]: A key x, fixing the randomization u_y for all other keys, is sampled if and only if f-seed(x) < t, where t is the kth smallest f-seed among the keys y ≠ x. For x ∈ S(f,k), the kth smallest f-seed among the other keys is t = τ^{(f,k)}, and thus
p_x^{(f,k)} = Pr_{u_x ∼ U[0,1]} [ r_x / f_x < τ^{(f,k)} ].   (3)
Note that the right-hand side probability is equal to 1 − e^{−f_x τ^{(f,k)}} with ppswor and to min{1, f_x τ^{(f,k)}} with priority sampling.
We consider the variance and concentration of our estimates. A natural measure of estimation quality of our unbiased estimates is the coefficient of variation (CV), which is the ratio of the standard deviation to the mean. We can upper bound the CV of our estimates (2) of sum(g; H) in terms of the (expected) sample size k and the relative g-weight of the segment H, defined as
q^{(g)}(H) = sum(g; H) / sum(g; X).
To express a bound on the CV when we estimate a statistics sum(g; H) using a weighted sample taken with respect to f, we define the disparity between f and g as
ρ(f, g) = max_x (f_x / g_x) · max_x (g_x / f_x).
The disparity always satisfies ρ(f, g) ≥ 1, and we have equality ρ(f, g) = 1 only when g is a scaling of f, that is, g = c f for some c > 0.
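To make these definitions concrete, the following is a minimal Python sketch (not from the paper) of Poisson pps sampling, a ppswor bottom-k sample built from a shared randomization u_x, and the inverse-probability estimator (2). The toy weights and key names are hypothetical.

```python
import math
import random

def pps_probabilities(f, k):
    """Poisson pps inclusion probabilities (1): p_x = min(1, k*f_x / sum_y f_y)."""
    total = sum(f.values())
    return {x: min(1.0, k * fx / total) for x, fx in f.items()}

def pps_sample(f, k, u):
    """Include x iff u_x <= p_x; reusing the same u across objectives coordinates the samples."""
    p = pps_probabilities(f, k)
    return {x for x in f if u[x] <= p[x]}

def ppswor_bottom_k(f, k, u):
    """Bottom-k (ppswor) sample: the k keys with smallest f-seed r_x/f_x, plus the (k+1)st seed tau."""
    seeds = sorted((-math.log(1.0 - u[x]) / fx, x) for x, fx in f.items())
    sample = {x for _, x in seeds[:k]}
    tau = seeds[k][0] if len(seeds) > k else math.inf
    return sample, tau

def estimate(g, H, sample, p):
    """Inverse-probability estimate (2) of sum(g; H)."""
    return sum(g[x] / p[x] for x in H if x in sample)

if __name__ == "__main__":
    random.seed(1)
    f = {"u%d" % i: w for i, w in enumerate([100, 40, 25, 10, 5, 5, 3, 2], 1)}  # hypothetical weights
    u = {x: random.random() for x in f}      # shared randomization, one value per key
    k = 4
    S = pps_sample(f, k, u)
    p = pps_probabilities(f, k)
    H = {"u1", "u3", "u5"}
    print("pps sample:", S, "estimate:", estimate(f, H, S, p), "truth:", sum(f[x] for x in H))
```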
We obtain the following upper bound:
Theorem 2.1. For pps samples and the estimator (2), for all g and all H,
CV[ŝum(g; H)] ≤ sqrt( ρ(f, g) / (q^{(g)}(H) k) ).
For bottom-k samples, we replace k by k − 1.
The proof for ρ = 1 is standard for pps, is provided in [7], [8] for ppswor, and in [31] for priority samples. The proof for ρ ≥ 1 for ppswor is provided in Theorem A.1. The proof for pps is simpler, using a subset of the arguments. The proof for priority can be obtained by generalizing [31]. Moreover, the estimates obtained from these weighted samples are concentrated in the Chernoff-Hoeffding-Bernstein sense. We provide the proof for the multiplicative form of the bound and for Poisson pps samples:
Theorem 2.2.
For δ ≤ 1,
Pr[ |ŝum(g; H) − sum(g; H)| > δ sum(g; H) ] ≤ 2 exp( −q^{(g)}(H) k ρ^{−1} δ² / 3 ).
For δ > 1,
Pr[ ŝum(g; H) − sum(g; H) > δ sum(g; H) ] ≤ exp( −q^{(g)}(H) k ρ^{−1} δ / 3 ).
Proof.
Consider Poisson pps sampling and the inverse probability estimator. The contribution of keys that are sampled with p_x^{(f,k)} = 1 is computed exactly. Let the contribution of these keys be (1 − α) sum(g; H), for some α ∈ [0, 1). If α = 0, the estimate is the exact sum and we are done. Otherwise, it suffices to estimate the remaining α sum(g; H) with relative error δ′ = δ/α. Consider the remaining keys, which have inclusion probabilities p_x^{(f,k)} ≥ k f_x / sum(f). The contribution of such a key x to the estimate is 0 if x is not sampled and is g_x / p_x^{(f,k)} ≤ ρ(f, g) sum(f)/k when x is sampled. Note that by definition
sum(g; H) = q^{(g)}(H) sum(g) ≥ q^{(g)}(H) sum(f) / ρ(f, g).
We apply the concentration bounds to the sum of random variables in the range [0, ρ(f, g) sum(f)/k]. To use the standard form, we can normalize our random variables to have range [0, 1] and accordingly normalize the expectation α sum(g; H) to obtain
μ = α sum(g; H) / (ρ(f, g) sum(f)/k) ≥ α q^{(g)}(H) ρ(f, g)^{−1} k.
We can now apply multiplicative Chernoff bounds for random variables in the range [0, 1] with δ′ = δ/α and expectation μ. The formula bounds the probability of a relative error that exceeds δ by 2 exp(−δ²μ/3) when δ < 1 and by exp(−δμ/3) when δ > 1.
Consider data presented as streamed or distributed elements of the form of key-value pairs (x, f_x), where x ∈ X and f_x > 0. We define f_x ≡ 0 for keys x that are not in the data. An important property of our samples (bottom-k or pps) is that they are composable (mergeable), meaning that a sample of the union of two data sets can be computed from the samples of the data sets. Composability facilitates efficient streamed or distributed computation. The sampling algorithms can use a random hash function applied to the key x to generate u_x, so seed values can be computed on the fly from (x, f_x) and do not need to be stored. With bottom-k sampling we permit keys x to occur in multiple elements, in which case we define f_x to be the maximum value of the elements with key x. The sample S(D) of a set D of elements contains the pairs (x, f_x) for the k + 1 (unique) keys with smallest f-seeds. The sample of a union ∪_i D_i is obtained from ∪_i S(D_i) by first replacing multiple occurrences of a key with the one with largest f(w) and then returning the pairs for the k + 1 keys with smallest f-seeds. With pps sampling, the information we store with our sample S(D) includes the sum sum(f; D) ≡ Σ_{x∈D} f_x and the sampled pairs (x, f_x), which are those with u_x ≤ k f_x / sum(f; D). Because we need to accurately track the sum, we require that elements have unique keys. The sample of a union D = ∪_i D_i is obtained using the sum sum(f; D) = Σ_i sum(f; D_i), and retaining only the keys in ∪_i S(D_i) that satisfy u_x ≤ k f_x / sum(f; D).
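As a concrete illustration of composability, the sketch below merges bottom-k summaries of data parts (a minimal sketch, not the paper's algorithm). It assumes the randomization u is a shared hash of the keys, available to all parts, and that duplicate keys are aggregated by maximum value as described above.

```python
import math

def bottom_k_summary(elements, k, u):
    """Bottom-k (ppswor) summary of (key, value) elements: keep the k+1 unique keys
    with smallest f-seeds r_x / f_x, where r_x = -ln(1 - u_x)."""
    best = {}
    for x, fx in elements:
        best[x] = max(best.get(x, 0.0), fx)              # duplicate keys: keep maximum value
    seeds = sorted((-math.log(1.0 - u[x]) / fx, x, fx) for x, fx in best.items())
    return seeds[:k + 1]                                  # list of (seed, key, value)

def merge_summaries(summaries, k):
    """Compose summaries of parts into a summary of their union."""
    best = {}
    for summ in summaries:
        for seed, x, fx in summ:
            if x not in best or fx > best[x][1]:          # larger value => smaller seed (same r_x)
                best[x] = (seed, fx)
    merged = sorted((seed, x, fx) for x, (seed, fx) in best.items())
    return merged[:k + 1]
```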
3 MULTI-OBJECTIVE PPS SAMPLES
Our objectives are specified as pairs (f, k_f), where f ∈ F is a function and k_f specifies a desired estimation quality for sum(f; H) statistics, stated in terms of the quality (Theorem 2.1 and Theorem 2.2) provided by a single-objective sample for f with size parameter k_f. To simplify notation, we sometimes omit k_f when it is clear from context. A multi-objective sample S(F) [16] is defined by considering dedicated samples S(f,k_f) for each objective that are coordinated. The dedicated samples are coordinated by using the same randomization, which is the association of u_x ∼ U[0,1] with the keys. The multi-objective sample S(F) = ∪_{f∈F} S(f,k_f) contains all keys that are included in at least one of the coordinated dedicated samples. In the remaining part of this section we study pps samples. Multi-objective bottom-k samples are studied in the next section.
Lemma 3.1.
A multi-objective pps sample for F includes each key x independently with probability
p_x^{(F)} = min{1, max_{f∈F} k_f f_x / Σ_y f_y}.   (4)
Proof.
Consider coordinated dedicated pps samples for f ∈ F obtained using the same set {u_x}. The key x is included in at least one of the samples if and only if the value u_x is at most the maximum over the objectives (f, k_f) of the pps inclusion probability for that objective:
u_x ≤ max_{f∈F} p_x^{(f,k_f)} = max_{f∈F} min{1, k_f f_x / Σ_y f_y} = min{1, max_{f∈F} k_f f_x / Σ_y f_y}.
Since the u_x are independent, so are the inclusions of different keys.
Example 3.2.
Consider the three objectives sum, thresh, and cap, all with k = 3, as in Example 2.1. The expected size of S(F) is E[|S(F)|] = Σ_x p_x^{(F)} ≈ 4.8 (summing, for each key, the maximum of the three per-objective probabilities in the table of Example 2.1). The naive solution of maintaining a separate dedicated sample for each objective would have total expected size ≈ 8.3 (the dedicated expected sample sizes are ≈ 2.3 for sum, 3.0 for thresh, and ≈ 3.0 for cap).
3. When keys are unique to elements it suffices to keep only the (k+1)st smallest f-seed, without the pair (x, f_x).
To estimate a statistics sum(g; H) from S(F), we apply the inverse probability estimator
ŝum(g; H) = Σ_{x ∈ S(F)∩H} g_x / p_x^{(F)},   (5)
using the probabilities p_x^{(F)} of (4). To compute the estimator (5), we need to know p_x^{(F)} when x ∈ S(F). These probabilities can be computed if we maintain the sums sum(f) = Σ_x f_x for f ∈ F as auxiliary information and we have f_x available to us when x ∈ S(f,k_f). In some settings it is easier to obtain upper bounds π_x ≥ p_x^{(F)} on the multi-objective pps inclusion probabilities, compute a Poisson sample using the π_x, and apply the respective inverse-probability estimator
ŝum(g; H) = Σ_{x ∈ S∩H} g_x / π_x.   (6)
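The sketch below (a minimal illustration, assuming each objective is given as a dict of per-key values together with its size parameter k_f; the function names are ours, not the paper's) computes the multi-objective pps probabilities (4), draws the coordinated sample, and applies the estimator (5).

```python
def multi_pps_probabilities(F):
    """Multi-objective pps probabilities (4); F maps an objective name to (f, k_f)."""
    keys = set().union(*(f.keys() for f, _ in F.values()))
    p = {}
    for x in keys:
        best = max(kf * f.get(x, 0.0) / sum(f.values()) for f, kf in F.values())
        p[x] = min(1.0, best)
    return p

def multi_pps_sample(p, u):
    """Coordinated Poisson sample: x is included iff u_x <= p_x (same u as the dedicated samples)."""
    return {x for x in p if u[x] <= p[x]}

def estimate_segment(g, H, sample, p):
    """Inverse-probability estimate (5) of sum(g; H) from the multi-objective sample."""
    return sum(g.get(x, 0.0) / p[x] for x in H if x in sample)
```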
Side note: It is sometimes useful to use other sampling schemes, in particular VarOpt (dependent) sampling [5], [20], [12], to obtain a fixed sample size. The estimation quality bounds on the CV and the concentration also hold with VarOpt (which has negative covariances).
We show that the estimation quality, in terms of the bounds on the CV and concentration, of the estimator (5) is at least as good as that of the estimate we obtain from the dedicated samples. To do so we prove a more general claim that holds for any Poisson sampling scheme that includes each key x in the sample S with probability π_x ≥ p_x^{(f,k_f)} and the respective inverse probability estimator (6). The following lemma shows that estimate quality can only improve when inclusion probabilities increase:
Lemma 3.3.
The variance var[ŝum(g; H)] of (6), and hence CV[ŝum(g; H)], are non-increasing in π_x.
Proof. For each key x consider the inverse probability estimator ĝ_x = g_x / π_x when x is sampled and ĝ_x = 0 otherwise. Note that var[ĝ_x] = g_x² (1/π_x − 1), which is decreasing with π_x. We have ŝum(g; H) = Σ_{x∈H} ĝ_x. When the covariances between the ĝ_x are nonpositive, which is the case in particular with independent inclusions, we have var[ŝum(g; H)] ≤ Σ_{x∈H} var[ĝ_x] (with equality under independence), and the claim follows as it applies to each summand.
We next consider concentration of the estimates.
Lemma 3.4.
The concentration claim of Theorem 2.2 carries over when we use the inverse probability estimator with any sampling probabilities that satisfy π_x ≥ p_x^{(f,k)} for all x.
Proof. The generalization of the proof is immediate, as the range of the random variables f̂_x can only decrease when we increase the inclusion probability.
An important special case is when k_f = k for all f ∈ F, that is, we seek uniform statistical guarantees for all our objectives. We use the notation S(F,k) for the respective multi-objective sample. We can write the multi-objective pps probabilities (4) as
p_x^{(F,k)} = min{1, k max_{f∈F} f_x / Σ_y f_y} = min{1, k p_x^{(F,1)}}.   (7)
The last equality follows when recalling the definition p_x^{(f,1)} = f_x / Σ_y f_y, so that
p_x^{(F,1)} = max_{f∈F} p_x^{(f,1)} = max_{f∈F} f_x / Σ_y f_y.
We refer to p_x^{(f,1)} as the base pps probabilities for f. Note that the base pps probabilities are a rescaling of f, and pps probabilities are invariant to this scaling. We refer to p_x^{(F,1)} as the multi-objective base pps probabilities for F. Finally, for a reason that will soon be clear, we refer to the sum
h^{(pps)}(F) ≡ Σ_x p_x^{(F,1)}
as the multi-objective pps overhead of F. It is easy to see that h(F) ∈ [1, |F|] and that h(F) is closer to 1 when the objectives in F are more similar. We can write
k p_x^{(F,1)} = (k h(F)) p_x^{(F,1)} / Σ_x p_x^{(F,1)}.
That is, the multi-objective pps probabilities (7) are equivalent to single-objective pps probabilities with size parameter k h(F) computed with respect to the base probability "weights" g_x = p_x^{(F,1)}.
Side note: We can apply any single-objective weighted sampling scheme, such as VarOpt or bottom-k, to the weights p_x^{(F,1)}, or to upper bounds π_x ≥ p_x^{(F,1)} on these weights, while adjusting the sample size parameter to k Σ_x π_x.
The following theorem shows that any multi-objective sample for F that meets the quality guarantees on all domain queries for (f, k_f) must include each key x with probability at least p_x^{(f,1)} k_f / (p_x^{(f,1)} k_f + 1) ≥ (1/2) p_x^{(f,k_f)}. Moreover, when p_x^{(f,1)} k_f ≪ 1, the lower bound on the inclusion probability is close to p_x^{(f,k_f)}. This implies that the multi-objective pps sample size is necessary to meet the quality guarantees we seek (Theorem 2.1).
Theorem 3.1.
Consider a sampling scheme that for weights f supports estimators that satisfy, for each segment H,
CV[ŝum(f; H)] ≤ 1 / sqrt( q^{(f)}(H) k ).
Then the inclusion probability of a key x must be at least
p_x ≥ p_x^{(f,1)} k / (p_x^{(f,1)} k + 1) ≥ (1/2) min{1, p_x^{(f,1)} k}.
Proof. Consider a segment of a single key H = {x}. Then p_x^{(f,1)} ≡ q^{(f)}(H) = f_x / Σ_y f_y ≡ q. The best nonnegative unbiased sum estimator is the HT estimator: When the key is not sampled, there is no evidence of the segment and the estimate must be 0. When it is, a uniform estimate minimizes the variance. If the key is included with probability p, the CV of the estimate is CV[ŝum(f; {x})] = (1/p − 1)^{0.5}. From the requirement (1/p − 1)^{0.5} ≤ 1/(qk)^{0.5}, we obtain that p ≥ qk/(qk + 1).
4 MULTI-OBJECTIVE BOTTOM-k SAMPLES
The sample S(F) is defined with respect to random {u_x}. Each dedicated sample S(f,k_f) includes the k_f lowest f-seeds, computed using {u_x}. S(F) accordingly includes all keys that have one of the k_f lowest f-seeds for at least one f ∈ F. To estimate statistics sum(g; H) from a bottom-k S(F), we again apply the inverse probability estimator (5), but here we use the conditional inclusion probability p_x^{(F)} for each key x [16]. This is the probability (over u_x ∼ U[0,1]) that x ∈ S(F), when fixing u_y for all y ≠ x to be as in the current sample. Note that p_x^{(F)} = max_{f∈F} p_x^{(f)}, where the p_x^{(f)} are as defined in (3). In order to compute the probabilities p_x^{(F)} for x ∈ S(F), it always suffices to maintain the slightly larger sample ∪_{f∈F} S(f,k_f+1). For completeness, we show that it suffices to instead maintain with S(F) ≡ ∪_{f∈F} S(f,k_f) a smaller (possibly empty) set Z ⊂ ∪_{f∈F} S(f,k_f+1) \ S(F) of auxiliary keys. We now define the set Z and show how inclusion probabilities can be computed from S(F) ∪ Z. For a key x ∈ S(F), we denote by
g(x) = arg max_{f∈F | x∈S(f)} p_x^{(f)}
the objective with the most forgiving threshold for x. If p_x^{(g(x))} < 1, let y_x be the key with the (k+1)st smallest g(x)-seed (otherwise y_x is not defined). The auxiliary keys are then Z = {y_x | x ∈ S(F)} \ S(F). We use the sample and auxiliary keys S(F) ∪ Z as follows to compute the inclusion probabilities: We first compute, for each f ∈ F, the value τ′_f, which is the (k_f+1)st smallest f-seed of keys in S(F) ∪ Z. For each x ∈ S(F), we then use p_x^{(F)} = max_{f∈F} f(w_x) τ′_f (for priority) or p_x^{(F)} = 1 − exp(−max_{f∈F} f(w_x) τ′_f) (for ppswor). To see that the p_x^{(F)} are correctly computed, note that while we can have τ′_f > τ^{(f,k_f)} for some f ∈ F (Z may not include the threshold keys of all the dedicated samples S(f,k_f)), our definition of Z ensures that τ′_f = τ^{(f,k_f)} for any f such that there is at least one x with f = g(x) and p_x^{(g(x))} < 1.
Composability:
Note that multi-objective samples S(F) are composable, since they are a union of (composable) single-objective samples S(f). It is not hard to see that composability applies with the auxiliary keys: The set of auxiliary keys in the composed sample must be a subset of the sampled and auxiliary keys in the components. Therefore, the sample itself includes all the necessary state for streaming or distributed computation.
Estimate quality:
We can verify that for any f ∈ F and x, and for any random assignment {u_y} for y ≠ x, we have p_x^{(F)} ≥ p_x^{(f)}. Therefore (applying Lemma 3.3 and noting zero covariances [15]) the variance and the CV are at most that of the estimator (2) applied to the bottom-k_f sample S(f). To summarize, we obtain the following statistical guarantees on estimate quality with multi-objective samples:
Theorem 4.1.
For each H and g, the inverse-probability estimator applied to a multi-objective pps sample S(F) has
CV[ŝum(g; H)] ≤ min_{f∈F} sqrt( ρ(f, g) / (q^{(g)}(H) k_f) ).
The estimator applied to a multi-objective bottom-k sample has the same guarantee but with (k_f − 1) replacing k_f.
Sample size overhead:
We must have E[|S(F)|] ≤ Σ_{f∈F} k_f. The worst case, where the size of S(F) is the sum of the sizes of the dedicated samples, materializes when the functions f ∈ F have disjoint supports. The sample size, however, can be much smaller when the functions are more related. With uniform guarantees (k_f ≡ k), we define the multi-objective bottom-k overhead to be
h^{(botk)}(F) ≡ E[|S(F)|] / k.
This is the sample size overhead of a multi-objective versus a dedicated bottom-k sample.
pps versus bottom-k multi-objective sample size: For some sets F, with the same parameter k, we can have a much larger multi-objective overhead with bottom-k than with pps. A multi-objective pps sample is the smallest sample that can include a pps sample for each f. A multi-objective bottom-k sample must include a bottom-k_f sample for each f. Consider a set of n > 2k keys. For each subset of n/2 keys we define a function f that is uniform on the subset and 0 elsewhere. It is easy to see that in this case h^{(pps)}(F) = 2 whereas h^{(botk)}(F) ≥ (n/2)/k.
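For concreteness, the sketch below (a minimal illustration, not the paper's algorithm) forms the multi-objective bottom-k sample as the union of coordinated dedicated bottom-k_f samples, all driven by the same randomization u; objective containers and names are hypothetical.

```python
import math

def f_seed(fx, u_x):
    """ppswor f-seed r_x / f_x with r_x = -ln(1 - u_x)."""
    return -math.log(1.0 - u_x) / fx

def bottom_k_keys(f, k, u):
    """Keys of the dedicated bottom-k sample for objective f (a dict key -> value)."""
    order = sorted(f, key=lambda x: f_seed(f[x], u[x]))
    return set(order[:k])

def multi_objective_bottom_k(F, u):
    """S(F) = union of the coordinated bottom-k_f samples; F maps names to (f, k_f)."""
    S = set()
    for f, kf in F.values():
        S |= bottom_k_keys(f, kf, u)
    return S
```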
Computation: When the data has the form of elements (x, w_x) with unique keys and f_x = f(w_x) for a set of functions F, then, short of further structural assumptions on F, the sampling algorithm that computes S(F) must apply all functions f ∈ F to all elements. The computation is thus Ω(|F| n), and can be done in O(|F| n + |S(F)| log k) time by identifying, for each f ∈ F, the k keys with smallest f-seed(x). In the sequel we will see examples of large or infinite sets F with special structure that allows us to efficiently compute a multi-objective sample.
5 THE SAMPLING CLOSURE
We define the sampling closure of a set of functions F to be the set of all functions f such that for all k and for all H, the estimate of sum(f; H) from S(F,k) has the CV bound of Theorem 2.1. Note that this definition is with respect to uniform guarantees (the same size parameter k for all objectives). We show that the closure of F contains all non-negative linear combinations of functions from F.
Theorem 5.1.
Any f = Σ_{g∈F} α_g g, where α_g ≥ 0, is in the closure of F.
Proof. We first consider pps samples, where we establish the stronger claim S(F∪{f},k) = S(F,k), or equivalently, for all keys x,
p_x^{(f,k)} ≤ p_x^{(F,k)}.   (8)
For a function g, we use the notation g(X) = Σ_y g_y, and recall that p_x^{(g,k)} = min{1, k g_x / g(X)}. We first consider f = cg for some g ∈ F. In this case, p_x^{(f,k)} = p_x^{(g,k)} ≤ p_x^{(F,k)} and (8) follows. To complete the proof, it suffices to establish (8) for f = g^{(1)} + g^{(2)} such that g^{(1)}, g^{(2)} ∈ F. Let c be such that g^{(2)}_x / g^{(2)}(X) = c · g^{(1)}_x / g^{(1)}(X); we can assume WLOG that c ≤ 1 (otherwise reverse g^{(1)} and g^{(2)}). For convenience denote α = g^{(2)}(X) / g^{(1)}(X). Then we can write
f_x / f(X) = (g^{(1)}_x + g^{(2)}_x) / (g^{(1)}(X) + g^{(2)}(X)) = (1 + cα) g^{(1)}_x / ((1 + α) g^{(1)}(X)) = ((1 + cα)/(1 + α)) · g^{(1)}_x / g^{(1)}(X) ≤ g^{(1)}_x / g^{(1)}(X) = max{ g^{(1)}_x / g^{(1)}(X), g^{(2)}_x / g^{(2)}(X) }.
Therefore p_x^{(f,k)} ≤ max{ p_x^{(g^{(1)},k)}, p_x^{(g^{(2)},k)} } ≤ p_x^{(F,k)}.
The proof for multi-objective bottom-k samples is more involved, and is deferred to the full version. Note that the multi-objective bottom-k sample S(F,k) may not include a bottom-k sample S(f,k), but it is still possible to bound the CV.
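The key inequality behind Theorem 5.1 is easy to check numerically. The snippet below is a small sanity check on hypothetical objectives g1, g2 (our own toy values, not from the paper): the base pps probability of a nonnegative combination never exceeds the maximum of the base probabilities of its components.

```python
def base_pps(f):
    """Base pps probabilities p_x^{(f,1)} = f_x / sum_y f_y."""
    total = sum(f.values())
    return {x: fx / total for x, fx in f.items()}

g1 = {"a": 5.0, "b": 1.0, "c": 0.0, "d": 4.0}           # hypothetical objective
g2 = {"a": 0.0, "b": 3.0, "c": 2.0, "d": 1.0}           # hypothetical objective
f = {x: 2.0 * g1[x] + 0.5 * g2[x] for x in g1}          # a nonnegative combination

pf, p1, p2 = base_pps(f), base_pps(g1), base_pps(g2)
assert all(pf[x] <= max(p1[x], p2[x]) + 1e-12 for x in f)   # inequality (8) at the base level
```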
6 THE UNIVERSAL SAMPLE FOR MONOTONE STATISTICS
In this section we consider the (infinite) set M of all monotone non-decreasing functions and the objectives (f, k) for all f ∈ M. We show that the multi-objective pps and bottom-k samples S(M,k), which we refer to as the universal monotone sample, are larger than a single dedicated weighted sample by at most a logarithmic factor in the number of keys. We will also show that this is tight, and we will present an efficient universal monotone bottom-k sampling scheme for streamed or distributed data. We take the following steps. We consider the multi-objective sample S(thresh,k) for the set thresh of all threshold functions (recall that thresh_T(x) = 1 if x ≥ T and thresh_T(x) = 0 otherwise). We express the inclusion probabilities p_x^{(thresh,k)} and bound the sample size. Since all threshold functions are monotone, thresh ⊂ M. We will establish that S(thresh,k) = S(M,k). We start with the simpler case of pps and then move on to bottom-k samples.
Theorem 6.1.
Consider a data set D = {(x, w_x)} of n keys and the sorted order of the keys x by non-increasing w_x. Then a key x that is in position i in the sorted order has base multi-objective pps probability
p_x^{(thresh,1)} = p_x^{(M,1)} ≤ 1/i.
When all keys have unique weights, equality holds.
Proof.
Consider the function thresh_{w_x}. The function has value 1 on all of the ≥ i keys with weight ≥ w_x. Therefore, the base pps probability is p_x^{(thresh_{w_x},1)} ≤ 1/i. When the keys have unique weights there are exactly i keys y with weight w_y ≥ w_x, and we have p_x^{(thresh_{w_x},1)} = 1/i. If we consider all threshold functions, then p_x^{(thresh_T,1)} = 0 when T > w_x and p_x^{(thresh_T,1)} ≤ 1/i when T ≤ w_x. Therefore,
p_x^{(thresh,1)} = max_T p_x^{(thresh_T,1)} = p_x^{(thresh_{w_x},1)} ≤ 1/i,
with equality when weights are unique. We now consider an arbitrary monotone function f_x = f(w_x). From monotonicity there are at least i keys with f_y ≥ f_x, and therefore f_x / Σ_y f_y ≤ 1/i. Thus, p_x^{(f,1)} = f_x / Σ_y f_y ≤ 1/i and p_x^{(M,1)} = max_{f∈M} p_x^{(f,1)} ≤ 1/i.
There is a simple main-memory sampling scheme where we sort the keys, compute the probabilities p^{(M,1)}, then set p^{(M,k)} = min{1, k p^{(M,1)}}, and compute a sample accordingly. We next present universal monotone bottom-k samples and sampling schemes that are efficient on streamed or distributed data.
4. We observe that M is contained in the closure of thresh. This follows from Theorem 5.1, after noticing that any f ∈ M can be expressed as a non-negative combination of threshold functions, f(y) = ∫_0^∞ α(T) thresh_T(y) dT, for some function α(T) ≥ 0. We establish here the stronger relation S(thresh,k) = S(M,k).
Bottom-k samples
Theorem 6.2.
Consider a data set D = {(x, w_x)} of n keys. The universal monotone bottom-k sample has expected size E[|S(M,k)|] ≤ k ln n and can be computed using O(n + k log n log k) operations.
For a particular T, the bottom-k sample S(thresh_T,k) is the set of k keys with smallest u_x among the keys with w_x ≥ T. The set of keys in the multi-objective sample is S(thresh,k) = ∪_{T>0} S(thresh_T,k). We show that a key x is in the multi-objective sample for thresh if and only if it is in the bottom-k sample for thresh_{w_x}:
Lemma 6.1.
Fixing {u_y}, for any key x, x ∈ S(thresh,k) ⟺ x ∈ S(thresh_{w_x},k).
Proof.
Consider the position t(x,T) of x in an ordering of the keys y induced by thresh_T-seed(y). We claim that if for a key y we have thresh_T-seed(x) < thresh_T-seed(y) for some T > 0, this must hold for T = w_x. The claim can be established by separately considering w_y ≥ w_x and w_y < w_x. The claim implies that t(x,T) is minimized for T = w_x.
We now consider the auxiliary keys Z associated with this sample. Recall that these keys are not technically part of the sample, but the information (u_x, w_x) for x ∈ Z is needed in order to compute the conditional inclusion probabilities p_x^{(thresh,k)} for x ∈ S. Note that it follows from Lemma 6.1 that for all keys x, p_x^{(thresh,k)} = p_x^{(thresh_{w_x},k)}. For a key x, let Y_x = {y ≠ x | w_y ≥ w_x} be the set of keys other than x that have weight at least that of x. Let y_x be the key with the kth smallest u value in Y_x, when |Y_x| ≥ k. The auxiliary keys are Z = {y_x | x ∈ S} \ S. A key x is included in the sample with probability 1 if y_x is not defined (which means it has one of the k largest weights). Otherwise, it is (conditionally) included if and only if u_x < u_{y_x}. To compute the inclusion probability p_x^{(thresh,k)} from S ∪ Z, we do as follows. If there are k or fewer keys in S ∪ Z with weight at least w_x, then p_x^{(thresh,k)} = 1 (for correctness, note that in this case all keys with weight ≥ w_x would be in S). Otherwise, observe that y_x is the key with the (k+1)st smallest u value in S ∪ Z among all keys y with w_y ≥ w_x. We compute y_x from the sample and use p_x^{(thresh,k)} = u_{y_x}. Note that when weights are unique, Z = ∅.
The definition of S(thresh,k) is equivalent to that of an All-Distances Sketch (ADS) computed with respect to the weights w_x (as inverse distances) [7], [8], and we can apply some algorithms and analysis. In particular, we obtain that E[|S(thresh,k)|] ≤ k ln n and that the size is well-concentrated around this expectation. The argument is simple: Consider the keys ordered by decreasing weight. The probability that the ith key has one of the k smallest u values, and thus is a member of S(thresh,k), is at most min{1, k/i}. Summing the expectations over all keys we obtain Σ_{i=1}^n min{1, k/i} < k ln n. We shall see that the bound is asymptotically tight when weights are unique. With repeated weights, however, the sample size can be much smaller.
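The characterization of Lemma 6.1 gives a very short (if quadratic-time) reference implementation of the universal monotone sample; a minimal sketch, assuming unique u values, is below. Algorithms 1 and 2 further down are the efficient versions.

```python
def universal_monotone_sample(w, u, k):
    """S(M,k): x is included iff u_x is among the k smallest u values of keys y with w_y >= w_x,
    i.e., iff x is in the bottom-k sample for thresh_{w_x} (Lemma 6.1). Assumes unique u values."""
    S = set()
    for x in w:
        competitors = sorted(u[y] for y in w if w[y] >= w[x])   # includes x itself
        if competitors.index(u[x]) < k:
            S.add(x)
    return S
```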
Lemma 6.2. For any data set {(x, w_x)}, when using the same randomization {u_x} to generate both samples, S(M,k) = S(thresh,k).
Proof. Consider f ∈ M and the samples obtained for some fixed randomization u_y for all keys y. Suppose that a key x is in the bottom-k sample S(f,k). By definition, we have that f-seed(x) = r_x / f(w_x) is among the k smallest f-seeds of all keys. Therefore, it must be among the k smallest f-seeds in the set Y of keys with w_y ≥ w_x. From the monotonicity of f, this implies that r_x must be one of the k smallest in {r_y | y ∈ Y}, which is the same as u_x being one of the k smallest in {u_y | y ∈ Y}. This implies that x ∈ S(thresh_{w_x},k).
The estimator (5) with the conditional inclusion probabilities p_x^{(M,k)} generalizes the HIP estimator of [8] to sketches computed for non-unique weights. Theorem 4.1 implies that for any f ∈ M and H, CV[ŝum(f; H)] ≤ 1/sqrt(q^{(f)}(H)(k − 1)). When weights are unique and we estimate statistics over all keys, we have the tighter bound CV[ŝum(f; X)] ≤ 1/sqrt(k − 1) [8].
The samples, including the auxiliary information, are composable. Composability holds even when we allow multiple elements to have the same key x and interpret w_x to be the maximum weight over the elements with key x. To do so, we use a random hash function to generate u_x consistently for multiple elements with the same key. To compose multiple samples, we take the union of the elements, replace multiple elements with the same key by the one of maximum weight, and apply a sampling algorithm to the set of remaining elements. The updated inclusion probabilities can be computed from the composed sample.
We present two algorithms that compute the sample S(M,k) along with the auxiliary keys Z and the inclusion probabilities p_x^{(M,k)} for x ∈ S(M,k). The algorithms process the elements either in order of decreasing w_x or in order of increasing u_x. These two orderings may be suitable for different applications and it is worthwhile to present both: In the related context of all-distances sketches, both orderings on distances were used [7], [26], [3], [8]. The algorithms are correct when applied to any set of n elements that includes S ∪ Z. Recall that the inclusion probability p_x^{(M,k)} is determined by the kth smallest u value among the keys y with w_y ≥ w_x. Therefore, all keys with the same weight have the same inclusion
probability. For convenience, we thus express the probabilities as a function p(w) of the weights.
5. The sample we would obtain with repeated weights is always a subset of the sample we would have obtained with tie breaking. In particular, the sample size can be at most k times the number of unique weights.
Algorithm 1
Universal monotone sampling: Scan by weight
Initialize empty max-heap H of size k   // the k smallest u_y values processed so far
ptau ← +∞   // ** omit with unique weights
for (x, w_x) by decreasing w_x, then increasing u_x, order do
  if |H| < k then S ← S ∪ {x}; insert u_x into H; p(w_x) ← 1; continue
  if u_x < max(H) then
    S ← S ∪ {x}; p(w_x) ← max(H)   // x is sampled
    ptau ← max(H); prevw ← w_x   // **
    delete max(H) from H; insert u_x into H
  else   // **
    if u_x < ptau and w_x = prevw then Z ← Z ∪ {x}; p(w_x) ← u_x

Algorithm 1 processes keys in order of decreasing weight, breaking ties by increasing u_x. We maintain a max-heap H of the k smallest u_y values processed so far. When processing a current key x, we include x in S if u_x < max(H). If including x, we delete max(H) and insert u_x into H. Correctness follows from H being the k smallest u values of keys with weight at least w_x. When weights are unique, the probability p(w_x) is the value max(H) just before x is inserted. When weights are not unique, we also need to compute Z. To do so, we track the previous max(H), which we call ptau. If the current key x has u_x ∈ (max(H), ptau), we include x in Z. It is easy to verify that in this case p(w_x) = u_x. Note that the algorithm may overwrite p(w) multiple times, as keys with weight w are inserted into the sample or into Z.
Algorithm 2 processes keys in order of increasing u_x. The algorithm maintains a min-heap H of the k largest weights processed so far. With unique weights, the current key x is included in the sample if and only if w_x > min(H). If x is included, we delete from the heap H the key with weight min(H) and insert x. When weights are not unique, we also track the weight w of the previously removed key from H; when processing a key x, if w_x = min(H) and w_x > prevw then x is inserted into Z.
The computed inclusion probability p(w) with unique weights is u_y, where y is the key whose processing triggered the deletion of the key x with weight w from H. To establish correctness, consider the set H just after x is deleted. By construction, H contains the k keys with weight w_y > w_x that have the smallest u values. Therefore, p(w_x) = max_{y∈H} u_y. Since we process keys by increasing u_y, this maximum u value in H is the value of the most recently inserted key, that is, the key y which triggered the removal of x. Finally, the keys that remain in H at the end are those with the k largest weights. The algorithm correctly assigns p(w_x) = 1 for these keys.
When multiple keys can have the same weight, then p(w) is the minimum of max_{y∈H} u_y after the first key of weight w is evicted, and the minimum u_z of a key z with w_z = w that was not sampled. If the minimum is realized at such a key z, that key is included in Z, and the algorithm sets p(w) accordingly when z is processed. If p(w) is not set already when the first key of weight w is deleted from H, the algorithm correctly assigns p(w) to be max_{y∈H} u_y. After all keys are processed, p(w_x) is set for the remaining keys x ∈ H where a key with the same weight was previously deleted from H. Other keys are assigned p(w) = 1.
We now analyze the number of operations. With both algorithms, the cost of processing a key is O(1) if the key is not inserted and O(log k) if the key is included in the sample. Using the bound on the sample size, we obtain a bound of O(n + k ln n log k) on the processing cost. The sorting requires O(n log n) computation, which dominates the computation (since typically k ≪ n). When the u_x are assigned randomly, however, we can generate them already sorted by u_x in O(n) time, enabling a faster O(n + k log k log n) computation.
Algorithm 2
Universal monotone sampling: Scan by u
H ← ⊥   // min-heap of size k, prioritized by lexicographic order on (w_y, −u_y), containing the keys with largest priorities processed so far
prevw ← ⊥
for x by increasing u_x order do
  if |H| < k then S ← S ∪ {x}; insert x into H; continue
  y ← arg min_{z∈H} (w_z, −u_z)   // the min-weight key in H with largest u_z
  if w_x > w_y then
    S ← S ∪ {x}   // add x to the sample
    prevw ← w_y
    if p(prevw) = ⊥ then p(prevw) ← u_x
    delete y from H; insert x into H
  else   // **
    if w_x = w_y and w_x > prevw then Z ← Z ∪ {x}; p(w_x) ← u_x
for x ∈ H do   // keys with the largest weights
  if p(w_x) = ⊥ then p(w_x) ← 1

We now show that the worst-case factor of ln n on the size of the universal monotone sample is in a sense necessary. It suffices to show this for threshold functions:
Theorem 6.3.
Consider data sets where all keys have unique weights. Any sampling scheme with a nonnegative unbiased estimator that for all T > 0 and all H has
CV[ŝum(thresh_T; H)] ≤ 1 / sqrt( q^{(thresh_T)}(H) k )
must have samples of expected size Ω(k ln n).
Proof. We will use Theorem 3.1, which relates estimation quality to sampling probabilities. Consider the key x with the ith heaviest weight. Applying Theorem 3.1 to x and thresh_{w_x} we obtain that p_x ≥ k/(k + i). Summing the sampling probabilities over all keys i ∈ [n] to bound the expected sample size, we obtain Σ_x p_x ≥ k Σ_{i=1}^n 1/(k + i) = k (H_{n+k} − H_k) ≈ k (ln n − ln k).
7 THE UNIVERSAL CAPPING SAMPLE
An important strict subset of the monotone functions is the set C = {cap_T | T > 0} of capping functions. We study the multi-objective bottom-k sample S(C,k), which we refer to as the universal capping sample. From Theorem 5.1, the closure of C includes all functions of the form f(y) = ∫_0^∞ α(T) cap_T(y) dT for some α(T) ≥ 0. This is the set of all non-decreasing concave functions with at most linear growth, that is, f(w) that satisfy df/dw ≤ 1 and d²f/dw² ≤ 0.
We show that the sample S(C,k) can be computed using O(n + k log n log k) operations from any D′ ⊂ D that is a superset of the keys in S(C,k). We start with properties of S(C,k) which we will use to design our sampling algorithm. For a key x, let h_x be the number of keys y with w_y ≥ w_x and u_y < u_x. Let ℓ_x be the number of keys y with w_y < w_x and r_y/w_y < r_x/w_x. For a key x and T > 0, and fixing the assignment {u_y} for all keys y, let t(x,T) be the position of cap_T-seed(x) in the list of values cap_T-seed(y) over all y. The function t has the following properties:
Lemma 7.1.
For a key x, t(x,T) is minimized for T = w_x. Moreover, t(x,T) is non-decreasing for T ≥ w_x and non-increasing for T ≤ w_x.
Proof. We can verify that for any key y such that there is a T > 0 with cap_T-seed(x) < cap_T-seed(y), we must have cap_{w_x}-seed(x) < cap_{w_x}-seed(y). Moreover, the set of T values for which cap_T-seed(x) < cap_T-seed(y) is an interval which contains T = w_x. We can establish the claim by separately considering the cases w_y ≥ w_x and w_y < w_x.
As a corollary, we obtain that a key is in the universal capping sample only if it is in the bottom-k sample for cap_{w_x}:
Corollary 7.2.
Fixing {u_y}, for any key x, x ∈ S(C,k) ⟺ x ∈ S(cap_{w_x},k).
Lemma 7.3.
Fixing the assignment {u_x}, x ∈ S(C,k) ⟺ ℓ_x + h_x < k.
Proof.
From Lemma 7.1, a key x is in a bottom-k cap_T sample for some T if and only if it is in the bottom-k sample for cap_{w_x}. The keys with a lower cap_{w_x}-seed than x are those with w_y ≥ w_x and u_y < u_x, which are counted in h_x, and those with w_y < w_x and r_y/w_y < r_x/w_x, which are counted in ℓ_x. Therefore, a key x is in S(cap_{w_x},k) if and only if there are fewer than k keys with a lower cap_{w_x}-seed, which is the same as having h_x + ℓ_x < k. For reference, a key x is in the universal monotone sample S(M,k) if and only if it satisfies the weaker condition h_x < k.
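Lemma 7.3 directly yields a simple (quadratic-time) reference test for membership in S(C,k); a minimal sketch is below, with hypothetical inputs w (weights) and u (shared randomization). The text describes the efficient O(n + k log n log k) scheme.

```python
import math

def universal_capping_sample(w, u, k):
    """S(C,k) via Lemma 7.3: x is included iff h_x + l_x < k, where
    h_x = #{y : w_y >= w_x and u_y < u_x} and l_x = #{y : w_y < w_x and r_y/w_y < r_x/w_x}."""
    r = {x: -math.log(1.0 - u[x]) for x in w}     # ppswor ranks
    S = set()
    for x in w:
        h = sum(1 for y in w if y != x and w[y] >= w[x] and u[y] < u[x])
        l = sum(1 for y in w if w[y] < w[x] and r[y] / w[y] < r[x] / w[x])
        if h + l < k:
            S.add(x)
    return S
```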
Lemma 7.4. A key x can be auxiliary only if ℓ_x + h_x = k.
Proof.
A key x is auxiliary (in the set Z) only if for some y ∈ S it has the kth smallest cap_{w_y}-seed among all keys other than y. This means it has the (k+1)st smallest cap_{w_y}-seed. The number of keys with seed smaller than seed(x) is minimized for T = w_x. If the cap_{w_x}-seed of x is one of the k smallest ones, it is included in the sample. Therefore, to be auxiliary, it must have the (k+1)st smallest seed.
We are ready to present our sampling algorithm. We first process the data so that multiple elements with the same key are replaced with the one with maximum weight. The next step is to identify all keys x with h_x ≤ k. It suffices to compute S(M,k) of the data with the auxiliary keys. We can apply a variant of Algorithm 1: We process the keys in order of decreasing weight, breaking ties by increasing rank, while maintaining a binary search tree H of size k which contains the k lowest u values of the processed keys. When processing x, if u_x > max(H) then h_x > k, and the key x is removed. Otherwise, h_x is the position of u_x in H, and u_x is inserted into H and max(H) is removed from H. We now only consider keys with h_x ≤ k. Note that in expectation there are at most k ln n such keys.
The algorithm then computes ℓ_x for all keys with ℓ_x ≤ k. This is done by scanning the keys in order of increasing weight, tracking in a binary search tree structure H the (at most) k smallest r_y/w_y values. When processing x, if r_x/w_x < max(H), then ℓ_x is the position of r_x/w_x in H. We then delete max(H) and insert r_x/w_x.
Keys that have ℓ_x + h_x < k then constitute the sample S, and keys with ℓ_x + h_x = k are retained as potentially being auxiliary. Finally, we perform another pass on the sampled and potentially auxiliary keys. For each key x, we determine the (k+1)st smallest cap_{w_x}-seed, which is τ^{(cap_{w_x},k)}. Using Corollary 7.2, we can use (3) to compute p_x^{(cap_{w_x},k)} = p_x^{(C,k)}. At the same time we can also determine the precise set of auxiliary keys by removing those that are not the (k+1)st smallest seed for any cap_{w_x} with x ∈ S.
The size of S(C,k)
The sample S(C,k) is contained in S(M,k), but can be much smaller. Intuitively, this is because two keys with similar, but not necessarily identical, weights are likely to have the same relation between their f-seeds across all f ∈ C. This is not true for M: For a threshold T between the two weights, the thresh_T-seed would always be lower for the higher-weight key, whereas the relation for lower T values can be either one with almost equal probabilities. In particular, we obtain a bound on |S(C,k)| which does not depend on n:
Theorem 7.1. E[|S(C,k)|] ≤ e k ln( max_x w_x / min_x w_x ).
Proof.
Consider a set of keys Y such that max_{x∈Y} w_x / min_{x∈Y} w_x = ρ. We show that the expected number of keys x ∈ Y that for at least one T > 0 have one of the bottom-k cap_T-seeds is at most ρk. The claim then follows by partitioning the keys into ln(max_x w_x / min_x w_x) groups where the weights within each group vary by at most a factor of e, and then noticing that the bottom-k across all groups must be a subset of the union of the bottom-k sets within each group.
We now prove the claim for Y. Denote by τ the (k+1)st smallest r_x value over x ∈ Y. The set of k keys with r_x < τ is the bottom-k sample for cap_T for any T ≤ min_{x∈Y} w_x. Consider a key y. From Lemma 7.1, we have y ∈ S(C,k) only if y ∈ S(cap_{w_y},k). A necessary condition for the latter is that r_y/w_y < τ / min_{x∈Y} w_x. This probability is at most
Pr_{u_y∼U[0,1]} [ r_y/w_y < τ / min_{x∈Y} w_x ] ≤ (w_y / min_{x∈Y} w_x) Pr_{u_y∼U[0,1]} [ r_y < τ ] ≤ ρ k / |Y|.
Thus, the expected number of keys that satisfy this condition is at most ρk.
8 METRIC OBJECTIVES
In this section we discuss the application of multi-objective sampling to additive cost objectives. The formulation has a set of keys X, a set of models Q, and a nonnegative cost function c(Q,x) of servicing x ∈ X by Q ∈ Q. In metric settings, the keys X are points in a metric space M, each Q ∈ Q is a configuration of facilities (that can also be points Q ⊂ M), and c(Q,x) is distance-based and is the cost of servicing x by Q. For each Q ∈ Q we are interested in the total cost of servicing X, which is
c(Q,X) = Σ_{x∈X} c(Q,x).
A concrete example is the k-means clustering cost function, where Q is a set of points (centers) of size k and c(Q,x) = min_{q∈Q} d(q,x)².
In this formulation, we are interested in computing a small summary of X that would allow us to estimate c(Q,X) for each Q ∈ Q. Such summaries in a metric setting are sometimes referred to as coresets [1]. Multi-objective samples can be used as such a summary. Each Q ∈ Q has a corresponding function f_x ≡ c(Q,x). A multi-objective sample for the set F of all the functions for Q ∈ Q allows us to estimate sum(f) = c(Q,X) for each Q. In particular, a sample of size h(F) ε^{−2} allows us to estimate c(Q,X) for each Q ∈ Q with CV at most ε and good concentration.
The challenges, for a domain of such problems, are to
• Upper-bound the multi-objective overhead h(F) as a function of parameters of the domain (|X|, c, structure of Q ∈ Q). The overhead is a fundamental property of the problem domain.
• Efficiently compute upper bounds μ_x on the multi-objective sampling probabilities p_x^{(F,1)} so that the sum Σ_x μ_x is not much larger than h(F). We are interested in obtaining these bounds without enumerating over Q ∈ Q (which can be infinite or very large).
Recently, we [6] applied multi-objective sampling to the problem of centrality estimation in metric spaces. Here M is a general metric space, X ⊂ M is a set of points, each Q is a single point in M, and the cost function is c(Q,x) = d(Q,x)^p. The centrality of Q is the sum Σ_x c(Q,x). We established that the multi-objective overhead is constant and that upper-bound probabilities (with constant overhead) can be computed very efficiently using O(|X|) distance computations. More recently [10], we generalized the result to the k-means objective, where the Q are subsets of size at most k, and established that the overhead is O(k).
9 FOREACH, FORALL
Our multi-objective sampling probabilities provide statistical guarantees that hold for each f and H: Theorem 4.1 states that the estimate ŝum(f; H) has the CV and concentration bounds over the sample distribution S ∼ p (a sample S that includes each x ∈ X independently (or VarOpt) with probability p_x). In this section we focus on uniform per-objective guarantees (k_f = k for all f ∈ F) and statistics sum(f; X) = sum(f) over the full data set. For F and probabilities p, we define the ForEach
Normalized Mean Squared Error (NMSE):
NMSE_e(F, p) = max_{f∈F} E_{S∼p} ( ŝum(f)/sum(f) − 1 )²,   (9)
and the ForAll
NMSE:
NMSE_a(F, p) = E_{S∼p} max_{f∈F} ( ŝum(f)/sum(f) − 1 )².   (10)
The respective normalized root MSEs (NRMSE) are the square roots of the NMSE. Note that ForAll is stronger than
ForEach, as it requires a simultaneously good approximation of sum(f) for all f ∈ F: For all p, NMSE_e(F, p) ≤ NMSE_a(F, p). We are interested in the tradeoff between the expected size of a sample, which is sum(p) ≡ Σ_x p_x, and the NRMSE. The multi-objective pps probabilities are such that for all ℓ > 0, NMSE_e(F, p^{(F,ℓ)}) ≤ 1/ℓ.
For a parameter ℓ ≥ 1, we can also consider the ForAll error NMSE_a(F, p^{(F,ℓ)}) and ask for a bound on ℓ so that NRMSE_a ≤ ε. A union-bound argument establishes that ℓ = ε^{−2} log|F| always suffices. Moreover, when F is the sampling closure of a smaller subset F′, then ℓ = ε^{−2} log|F′| suffices. If we only bound the maximum error on any subset of F of size m, we can use ℓ = ε^{−2} log m. When F is the set of all monotone functions over n keys, then ℓ = O(ε^{−2} log log n) suffices. To see this intuitively, recall that it suffices to consider all threshold functions, since all monotone functions are nonnegative combinations of threshold functions. There are n threshold functions, but these functions have only O(log n) points where the value changes significantly (by a constant factor).
We provide an example of a family F where the sample-size gap between NMSE_e and NMSE_a is linear in the support size. Consider a set of n keys and define an f for each subset of n/2 keys so that f is uniform on the subset and 0 outside it. The multi-objective base pps sampling probabilities are p_x^{(F,1)} = 2/n for all x and hence the overhead is h(F) = 2. Therefore, a p of size ε^{−2} has NRMSE_e(F, p) ≤ ε. In contrast, any p with NRMSE_a(F, p) ≤ 1/2 must contain at least one key from the support of each f in (almost) all samples, implying an expected sample size of at least n/2.
When we seek to bound NRMSE_e, probabilities of the form p^{(F,ℓ)} essentially optimize the size-quality tradeoff. For NRMSE_a, however, the minimum-size p that meets a certain error NRMSE_a(F, p) ≤ ε can be much smaller than the minimum-size p that is restricted to the form p^{(F,ℓ)}. Consider ε > 0 and a set F that has k parts F_i with disjoint supports of equal sizes n/k. All the parts F_i except F_1 have a single f that is uniform on the support, which means that with a uniform p of size ε^{−2} we have NRMSE_e(F_i, p) = NRMSE_a(F_i, p) = ε. The part F_1 has a similar structure to our previous example, which means that any p with NRMSE_a(F_1, p) ≤ 1/2 has size at least n/(2k), whereas a uniform p of size ε^{−2} has NRMSE_e(F_1, p) = ε. The multi-objective base sampling probabilities are therefore p^{(F,1)} = k/n for keys in the supports of the F_i with i > 1 and p^{(F,1)} = 2k/n for keys in the support of F_1, and thus the overhead is k + 1. The minimum-size p with NRMSE_a(F, p) = 1/2 must have value at least 1/2 for keys in the support of F_1 and value about (k/n) log k for the other keys (having a ForEach requirement for each part and a logarithmic factor due to a (tight) union bound). Therefore the sample size is O(n/k + k log k). In contrast, to have NRMSE_a(F, p^{(F,ℓ)}) = 1/2 we have to use ℓ = Ω(n/k), obtaining a sample size of Ω(n).
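The two error measures (9) and (10) are easy to compare empirically. The following is a small Monte Carlo sketch (our own illustration, not from the paper) for Poisson sampling with given probabilities p and the inverse-probability estimator over the full data set; F maps hypothetical objective names to dicts of per-key values.

```python
import random

def nrmse(F, p, trials=2000, for_all=False, seed=0):
    """Monte Carlo estimate of NRMSE_e (max over f of per-f MSE) or NRMSE_a (MSE of the max error)."""
    rng = random.Random(seed)
    keys = list(p)
    totals = {name: sum(f.values()) for name, f in F.items()}
    per_f_sq = {name: 0.0 for name in F}
    max_sq = 0.0
    for _ in range(trials):
        S = {x for x in keys if rng.random() <= p[x]}
        errs = {name: sum(f.get(x, 0.0) / p[x] for x in S) / totals[name] - 1.0
                for name, f in F.items()}
        for name, e in errs.items():
            per_f_sq[name] += e * e
        m = max(abs(e) for e in errs.values())
        max_sq += m * m
    if for_all:
        return (max_sq / trials) ** 0.5
    return max(v / trials for v in per_f_sq.values()) ** 0.5
```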
10 OPTIMIZATION OVER MULTI-OBJECTIVE SAMPLES
In this section we consider optimization problems where we have keys $X$ and nonnegative functions $f \in F$, $f : X \to \mathbb{R}_{\geq 0}$, and we seek to maximize $M(\mathrm{sum}(f))$ over $f \in F$:
$$f = \arg\max_{g \in F} M(\mathrm{sum}(g)). \qquad (11)$$
We assume here that the function $M$ is smooth with a bounded rate of change, $|M(v) - M(v')|/M(v) \leq c\,|v - v'|/v$, so that when $v' \approx v$ then $M(v') \approx M(v)$.

Optimization over the large set $X$ can be costly or infeasible, and we therefore aim to instead compute an approximate maximizer over a multi-objective sample $S$ of $X$. We propose a framework that adaptively increases the sample size until the approximate optimization goal is satisfied with the desired statistical guarantee.

We work with a slightly more general formulation in which we allow the keys to have importance weights $m_x \geq 0$ and consider the weighted sums $\mathrm{sum}(f; X, m) = \sum_{x\in X} f_x m_x$. Note that for the purpose of defining the problem, we can without loss of generality "fold" the weights $m_x$ into the functions $f \in F$ to obtain the set $F' \equiv mF$, which has uniform importance weights, so that $f' \in F'$ is defined from $f \in F$ by $f'_x = f_x m_x$ for all $x$. When estimating from a sample, however, keys get re-weighted, and it is therefore useful to separate out $F$, which may have a particular structure we need to preserve, and the importance weights $m$.

Two example problem domains for such optimizations are clustering (where $X$ are points, each $f \in F$ corresponds to a set of centers, $f_x$ depends on the distance from $x$ to the centers, and $\mathrm{sum}(f)$ is the cost of clustering with $f$) and empirical risk minimization (where $X$ are examples and $\mathrm{sum}(f)$ is the loss of model $f$). In these settings we seek to minimize ($M(v) = -v$) or maximize ($M(v) = v$) $\mathrm{sum}(f)$.

We present a meta-algorithm for approximate optimization that uses the following:
• Upper bounds $\pi_x \geq p^{(mF)}_x$ on the base multi-objective pps probabilities. We would like $h = \sum_x \pi_x$ to be not much larger than $\sum_x p^{(mF)}_x$. (Here we denote by $mF$ the importance weights $m$ folded into $F$.)
• An algorithm $A$ that, for input $S \subset X$ and positive weights $a_x$ for $x \in S$, returns $f \in F$ that (approximately) maximizes $M(\mathrm{sum}(f; S, a))$. By an approximate optimum, we allow a well-concentrated relative error with respect to the optimum $\max_{g\in F} M(\mathrm{sum}(g; S, a))$, or with respect to the optimum $\max_{g\in G} M(\mathrm{sum}(g; S, a))$ on a more restricted set $G \subset F$.

We apply this algorithm to samples $S$ obtained with probabilities $p_x = \min\{1, k\pi_x\}$. The keys $x \in S$ have importance weights $a_x = m_x/p_x$. Note that $\mathrm{sum}(g; S, a) = \sum_{x\in S} m_x g_x / p_x$ is the estimate of $\mathrm{sum}(g; X, m)$ we obtain from $S$.

Optimization over the sample requires that an (approximate) maximizer $f$ that meets our quality guarantees over the sample distribution is also an approximate maximizer of (11). Intuitively, we need that at least one approximate maximizer $f$ of (11) is approximated well by the sample, $M(\mathrm{sum}(f; S, a)) \approx M(\mathrm{sum}(f; X, m))$, and that all $f$ that are far from being approximate maximizers are not approximate maximizers over the sample. A ForAll sample is sufficient but pessimistic. Moreover, meeting
ForAll typically necessitates worst-case, non-adaptive bounds on sample size. A
ForEach sample, obtained with $k \geq \epsilon^{-2}$, is not sufficient in and of itself, but a key observation is that maximization over a ForEach sample can only err (within our
ForEach statistical guarantees) by over-estimating the maximum, that is, returning $f$ such that $M(\mathrm{sum}(f; S, a)) \gg M(\mathrm{sum}(f; X, m))$. Therefore, if
$$\mathrm{sum}(f; X, m) \;\geq\; (1-\epsilon)\,\mathrm{sum}(f; S, a), \qquad (12)$$
we can certify, within the statistical guarantees provided by $A$ and ForEach, that the sample maximizer $f$ is an approximate maximizer of (11). Otherwise, we obtain lower and approximate upper bounds $[M(\mathrm{sum}(f; X, m)),\ (1+\epsilon) M(\mathrm{sum}(f; S, a))]$ on the optimum. Finally, this certification can be done by exact computation of $\mathrm{sum}(f; X, m)$, but it can be performed much more efficiently, with statistical guarantees, using independent "validation" samples from the same distribution.

Algorithm 3 exploits this property to perform approximate optimization with an adaptive sample size. The algorithm starts with a ForEach sample. It iterates approximate optimization over the sample, testing (12) and doubling the sample-size parameter $k$, until condition (12) holds. Note that since the sample size is doubled, the ForEach guarantees tighten with the iterations; thus, from concentration, we gain confidence in the test results over the iterations. The algorithm uses the smallest sample size for which the probabilities are of the form $\min\{1, k\pi_x\}$. Note (see the example in the previous section) that the optimization might be supported by a much smaller sample of a different form. An interesting open question is whether we can devise an algorithm that increases sampling probabilities in a more targeted way and can perform the approximate optimization using a smaller sample size.

Algorithm 3
Optimization over multi-objective samples
Input: points $X$ with weights $m$; function $M$; functions $F$ over $X$; upper bounds $\pi_x \geq p^{(mF)}_x$; an algorithm $A$ which, for input $S \subset X$ and weights $a$, performs $\epsilon$-approximate maximization of $M(\mathrm{sum}(f; S, a))$ over $f \in F$.

foreach $x \in X$ do // for sample coordination
    $u_x \sim U[0,1]$
$k \leftarrow \epsilon^{-2}$ // Initialize with ForEach guarantee
repeat
    $S \leftarrow \emptyset$ // Initialize empty sample
    foreach $x \in X$ such that $u_x \leq \min\{1, k\pi_x\}$ do // build sample
        $S \leftarrow S \cup \{x\}$; $a_x \leftarrow m_x / \min\{1, k\pi_x\}$
    // Optimization over $S$
    Compute $f$ such that $M(\mathrm{sum}(f; S, a)) \geq (1-\epsilon)\max_{g\in F} M(\mathrm{sum}(g; S, a))$
    $k \leftarrow 2k$ // Double the sample size
until $M(\mathrm{sum}(f; X, m)) \geq (1-\epsilon) M(\mathrm{sum}(f; S, a))$ // Exact, or approximate using a validation sample
return $f$
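To make the control flow concrete, the following is a minimal Python sketch of this adaptive loop (our own illustration, not code from the paper). The names and callables (`A`, `M`, `sum_exact`, `pi`) are hypothetical stand-ins, and the certification step applies test (12) directly to the sums; in practice `sum_exact` could be replaced by an estimate from an independent validation sample.

```python
import random

def adaptive_optimize(keys, m, pi, A, sum_exact, eps, seed=0):
    """Sketch of Algorithm 3: adaptive multi-objective sampling for optimization.

    keys      -- list of keys X
    m         -- dict: importance weight m_x per key
    pi        -- dict: upper bounds pi_x >= p^{(mF)}_x
    A         -- oracle: A(S, a) returns a callable f that approximately
                 maximizes M(sum(f; S, a)) over f in F
    sum_exact -- callable: sum_exact(f) = sum(f; X, m), computed exactly
                 or from a validation sample
    """
    rng = random.Random(seed)
    u = {x: rng.random() for x in keys}       # shared seeds: sample coordination
    k = eps ** -2                             # start with a ForEach-size sample
    while True:
        S, a = [], {}
        for x in keys:                        # build the coordinated sample
            p_x = min(1.0, k * pi[x])
            if u[x] <= p_x:
                S.append(x)
                a[x] = m[x] / p_x             # importance weight of sampled key
        f = A(S, a)                           # approximate maximizer over the sample
        est = sum(f(x) * a[x] for x in S)     # sum(f; S, a)
        if sum_exact(f) >= (1 - eps) * est:   # certification test (12)
            return f
        k *= 2                                # otherwise double and repeat
```

Because the uniform seeds $u_x$ are fixed across iterations, each sample is a superset of the previous one, so doubling $k$ only adds keys and the per-iteration work can be made incremental.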
11 CONCLUSION
Multi-objective samples have been studied and applied for nearly five decades. We present a unified review and extended analysis of multi-objective sampling schemes, geared towards efficient computation over very large data sets. We lay some foundations for further exploration and additional applications.

A natural extension is the design of efficient multi-objective sampling schemes for unaggregated data [19], [2] presented in a streamed or distributed form. The data here consists of data elements that are key-value pairs, where multiple elements can share the same key $x$, and the weight $w_x$ is the sum of the values of the elements with key $x$. We are again interested in summaries that support queries of the form $\mathrm{sum}(f; H)$, where $f_x = f(w_x)$ for some function $f \in F$. To sample unaggregated data, we can first aggregate it and then apply sampling schemes designed for aggregated data. Aggregation of streamed or distributed data, however, requires state (memory or communication) of size proportional to the number of unique keys. This number can be large, so instead we aim for efficient sampling without aggregation, using state of size proportional to the sample size. We recently proposed such a sampling framework for capping statistics [9], which can also be used for all statistics in their span.

REFERENCES

[1] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Geometric approximation via coresets. In
Combinatorial and Computational Geometry, MSRI. University Press, 2005.
[2] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. System Sci., 58:137–147, 1999.
[3] P. Boldi, M. Rosa, and S. Vigna. HyperANF: Approximating the neighbourhood function of very large graphs on a budget. In WWW, 2011.
[4] K. R. W. Brewer, L. J. Early, and S. F. Joyce. Selecting several samples from a single population. Australian Journal of Statistics, 14(3):231–239, 1972.
[5] M. T. Chao. A general purpose unequal probability sampling plan. Biometrika, 69(3):653–656, 1982.
[6] S. Chechik, E. Cohen, and H. Kaplan. Average distance queries through weighted samples in graphs and metric spaces: High scalability with tight statistical guarantees. In RANDOM. ACM, 2015.
[7] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. System Sci., 55:441–453, 1997.
[8] E. Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. TKDE, 2015.
[9] E. Cohen. Stream sampling for frequency cap statistics. In KDD. ACM, 2015. Full version: http://arxiv.org/abs/1502.05955.
[10] E. Cohen, S. Chechik, and H. Kaplan. Clustering over multi-objective samples: The one2all sample. CoRR, abs/1706.03607, 2017.
[11] E. Cohen, D. Delling, T. Pajor, and R. F. Werneck. Sketch-based influence maximization and computation: Scaling up with guarantees. In CIKM. ACM, 2014.
[12] E. Cohen, N. Duffield, C. Lund, M. Thorup, and H. Kaplan. Efficient stream sampling for variance-optimal estimation of subset sums. SIAM J. Comput., 40(5), 2011.
[13] E. Cohen, N. Grossuag, and H. Kaplan. Processing top-k queries from samples. In Proceedings of the 2006 ACM Conference on Emerging Network Experiment and Technology (CoNEXT). ACM, 2006.
[14] E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM PODC, 2007.
[15] E. Cohen and H. Kaplan. Tighter estimation using bottom-k sketches. In Proceedings of the 34th VLDB Conference, 2008.
[16] E. Cohen, H. Kaplan, and S. Sen. Coordinated weighted sampling for estimating aggregates over multiple weight assignments. VLDB, 2(1–2), 2009. Full version: http://arxiv.org/abs/0906.4560.
[17] N. Duffield, M. Thorup, and C. Lund. Priority sampling for estimating arbitrary subset sums. J. Assoc. Comput. Mach., 54(6), 2007.
[18] W. Feller. An Introduction to Probability Theory and Its Applications, volume 2. John Wiley & Sons, New York, 1971.
[19] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. System Sci., 31:182–209, 1985.
[20] R. Gandhi, S. Khuller, S. Parthasarathy, and A. Srinivasan. Dependent rounding and its applications to approximation algorithms. J. Assoc. Comput. Mach., 53(3):324–360, 2006.
[21] M. H. Hansen and W. N. Hurwitz. On the theory of sampling from finite populations. Ann. Math. Statist., 14(4), 1943.
[22] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
[23] L. Kish and A. Scott. Retaining units after changing strata and probabilities. Journal of the American Statistical Association, 66(335):461–470, 1971.
[24] E. Ohlsson. Sequential Poisson sampling. J. Official Statistics, 14(2):149–162, 1998.
[25] E. Ohlsson. Coordination of pps samples over time. In The 2nd International Conference on Establishment Surveys, pages 255–264. American Statistical Association, 2000.
[26] C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: A fast and scalable tool for data mining in massive graphs. In KDD, 2002.
[27] B. Rosén. Asymptotic theory for successive sampling with varying probabilities without replacement, I. The Annals of Mathematical Statistics, 43(2):373–397, 1972.
[28] B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135–158, 1997.
[29] P. J. Saavedra. Fixed sample size pps approximations with a permanent random number. In Proc. of the Section on Survey Research Methods, pages 697–700, Alexandria, VA, 1995. American Statistical Association.
[30] C.-E. Särndal, B. Swensson, and J. Wretman. Model Assisted Survey Sampling. Springer, 1992.
[31] M. Szegedy. The DLT priority sampling is essentially optimal. In Proc. 38th Annual ACM Symposium on Theory of Computing. ACM, 2006.
[32] Y. Tillé. Sampling Algorithms. Springer-Verlag, New York, 2006.

APPENDIX
Theorem A.1.
Consider ppswor sampling with respect to weights $f_x$ and the estimator (2) computed using (3). Then for any $g \geq 0$ and segment $H$,
$$\mathrm{CV}[\widehat{\mathrm{sum}}(g;H)] \;\leq\; \sqrt{\frac{\rho(f,g)}{q^{(g)}(H)\,(k-1)}}\,.$$

Proof. We adapt a proof technique from [9] (which builds on [7], [8]). To simplify notation, we use $W = \mathrm{sum}(f; X) = \sum_{x\in X} f_x$ for the total $f$-weight of the population.

We first consider the variance of the inverse-probability estimate for a key $x$ with value $g_x$, conditioned on the threshold $\tau$. We use the notation $\hat g^{(\tau)}_x$ for the estimate that is $g_x/\Pr[f\text{-seed}(x) < \tau]$ when $f\text{-seed}(x) < \tau$ and $0$ otherwise. Using $p = \Pr[f\text{-seed}(x) < \tau] = 1 - e^{-f_x\tau}$, we have
$$\mathrm{var}[\hat g^{(\tau)}_x] \;=\; \frac{1-p}{p}\, g_x^2 \;=\; g_x^2\,\frac{e^{-\tau f_x}}{1-e^{-\tau f_x}} \;\leq\; \frac{g_x^2}{f_x\,\tau} \;\leq\; \max_y \frac{g_y}{f_y}\,\frac{g_x}{\tau}, \qquad (13)$$
using the relation $e^{-z}/(1-e^{-z}) \leq 1/z$.

We now consider the variance of the estimator $\hat g_x$ when $\tau$ is the $k$th smallest seed value $\tau'$ in $X\setminus\{x\}$. We denote by $B_x$ the distribution of $\tau'$ and bound the variance of the estimate using the relation $\mathrm{var}[\hat g_x] = \mathsf{E}_{\tau'\sim B_x}\,\mathrm{var}[\hat g^{(\tau')}_x]$. The distribution of $\tau'$ is that of the $k$th smallest of independent exponential random variables with parameters $f_y$ for $y\in X\setminus\{x\}$. From properties of the exponential distribution, the minimum seed is exponentially distributed with parameter $W - f_x$, the difference between the minimum and second smallest is exponentially distributed with parameter $W - f_x - w$, where $w$ is the weight $f_y$ of the key $y$ with minimum seed, and so on. Therefore, the distribution of $\tau'$ conditioned on the ordered set of smallest-seed keys is a sum of $k$ exponential random variables with parameters at most $W$. The distribution $B_x$ is a convex combination of such distributions. We use the notation $s_{W,k}$ for the density function of the Erlang distribution
$\mathrm{Erlang}(W,k)$, which is the distribution of a sum of $k$ independent exponential random variables with parameter $W$. What we obtained is that the distribution $B_x$ (for any $x$) is dominated by $\mathrm{Erlang}(W,k)$. Since our bound on the conditional variance $\mathrm{var}[\hat g^{(\tau')}_x]$ is non-increasing in $\tau'$, domination implies that
$$\mathsf{E}_{\tau'\sim B_x}\,\overline{\mathrm{var}}[\hat g^{(\tau')}_x] \;\leq\; \mathsf{E}_{\tau'\sim \mathrm{Erlang}(W,k)}\,\overline{\mathrm{var}}[\hat g^{(\tau')}_x],$$
where $\overline{\mathrm{var}}$ is our upper bound (13). We now use the Erlang density function [18], $s_{W,k}(z) = \frac{W^k z^{k-1}}{(k-1)!}\, e^{-Wz}$, and the relation $\int_0^\infty z^a e^{-bz}\,dz = a!/b^{a+1}$ to bound the variance:
$$\mathrm{var}[\hat g_x] \;\leq\; \int_0^\infty s_{W,k}(z)\,\overline{\mathrm{var}}[\hat g^{(z)}_x]\,dz \;\leq\; \int_0^\infty \frac{W^k z^{k-1}}{(k-1)!}\, e^{-Wz}\,\frac{g_x}{z}\,\max_y\frac{g_y}{f_y}\,dz \;\leq\; \max_y\frac{g_y}{f_y}\; g_x\,\frac{W^k}{(k-1)!} \int_0^\infty z^{k-2} e^{-Wz}\,dz \;=\; \max_y\frac{g_y}{f_y}\; g_x\,\frac{W}{k-1}\,.$$
By definition, $\widehat{\mathrm{sum}}(g;H) = \sum_{x\in H}\hat g_x$. Since covariances between different keys are zero [15],
$$\mathrm{var}[\widehat{\mathrm{sum}}(g;H)] \;=\; \sum_{x\in H}\mathrm{var}[\hat g_x] \;\leq\; \max_y\frac{g_y}{f_y}\;\mathrm{sum}(g;H)\,\frac{W}{k-1}\,.$$
Therefore,
$$\mathrm{CV}[\widehat{\mathrm{sum}}(g;H)]^2 \;=\; \frac{\mathrm{var}[\widehat{\mathrm{sum}}(g;H)]}{\mathrm{sum}(g;H)^2} \;\leq\; \frac{\max_y\frac{g_y}{f_y}\; W}{(k-1)\,\mathrm{sum}(g;H)} \;=\; \frac{\max_y\frac{g_y}{f_y}}{k-1}\cdot\frac{\mathrm{sum}(g;X)}{\mathrm{sum}(g;H)}\cdot\frac{W}{\mathrm{sum}(g;X)} \;\leq\; \frac{\rho(f,g)}{q^{(g)}(H)\,(k-1)}\,,$$
and taking square roots yields the claimed bound.
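As a sanity check on the bound, the following is a minimal Python simulation sketch (ours, not from the paper) of bottom-$k$ ppswor sampling and a conditioned inverse-probability estimator. Estimators (2) and (3) are not reproduced in this section, so the sketch uses the standard bottom-$k$ construction implied by the proof, where the threshold $\tau$ for a sampled key is the $(k{+}1)$-st smallest seed overall; the weights, statistic, and segment are arbitrary illustrative choices, and the empirical CV can be compared against $\sqrt{\rho(f,g)/(q^{(g)}(H)(k-1))}$.

```python
import math
import random

def ppswor_estimate(f, g, H, k, rng):
    """One bottom-k ppswor sample and the resulting estimate of sum(g; H).

    f, g -- dicts key -> nonnegative weight / value
    H    -- segment (set of keys)
    k    -- sample size (the k keys with smallest f-seeds are retained)
    """
    # f-seed(x) ~ Exp(f_x); the k keys with smallest seeds form the sample.
    seeds = {x: rng.expovariate(fx) for x, fx in f.items() if fx > 0}
    order = sorted(seeds, key=seeds.get)
    sample, tau = order[:k], seeds[order[k]]   # tau: (k+1)-st smallest seed
    # Conditioned inverse-probability estimate of sum(g; H).
    return sum(g.get(x, 0.0) / (1.0 - math.exp(-f[x] * tau))
               for x in sample if x in H)

if __name__ == "__main__":
    rng = random.Random(1)
    n, k = 1000, 50
    f = {x: 1.0 + (x % 5) for x in range(n)}   # illustrative sampling weights
    g = {x: fx ** 0.5 for x, fx in f.items()}  # a different statistic g != f
    H = set(range(0, n, 3))                    # an arbitrary segment
    truth = sum(g[x] for x in H)
    ests = [ppswor_estimate(f, g, H, k, rng) for _ in range(200)]
    mean = sum(ests) / len(ests)
    cv = math.sqrt(sum((e - truth) ** 2 for e in ests) / len(ests)) / truth
    print(f"relative bias {mean / truth - 1:+.3f}, empirical CV {cv:.3f}")
```

The printed CV is the relative root mean squared error around the true value, which coincides with the coefficient of variation for an unbiased estimator and should stay below the theorem's bound for these parameters.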