CountSketches, Feature Hashing and the Median of Three
Kasper Green Larsen† ([email protected]), Department of Computer Science, Aarhus University, Denmark
Rasmus Pagh ([email protected]), BARC, Department of Computer Science, University of Copenhagen, Denmark
Jakub Tětek‡ ([email protected]), BARC, Department of Computer Science, University of Copenhagen, Denmark

∗ The authors are part of BARC, Basic Algorithms Research Copenhagen, supported by the VILLUM Foundation grant 16582.
† Kasper Green Larsen is supported by a Villum Young Investigator Grant, a DFF Sapere Aude Research Leader Grant and an AUFF Starting Grant.
‡ Jakub Tětek has been supported by the Bakala Foundation Scholarship.
Abstract
In this paper, we revisit the classic CountSketch method, which is a sparse, random projection that transforms a (high-dimensional) Euclidean vector $v$ to a vector of dimension $(2t-1)s$, where $t, s > 0$ are integer parameters. For $t = 1$, a CountSketch allows estimating coordinates of $v$ with variance bounded by $\|v\|_2^2/s$. For $t > 1$, the estimator takes the median of $2t-1$ independent estimates, and the probability that the estimate is off by more than $2\|v\|_2/\sqrt{s}$ is exponentially small in $t$. This suggests choosing $t$ to be logarithmic in a desired inverse failure probability. However, implementations of CountSketch often use a small, constant $t$. Previous work only predicts a constant factor improvement in this setting.

Our main contribution is a new analysis of CountSketch, showing an improvement in variance to $O(\min\{\|v\|_1^2/s^2,\; \|v\|_2^2/s\})$ when $t > 1$. That is, the variance decreases proportionally to $s^{-2}$, asymptotically for large enough $s$. We also study the variance in the setting where an inner product is to be estimated from two CountSketches. This finding suggests that the Feature Hashing method, which is essentially identical to CountSketch but does not make use of the median estimator, can be made more reliable at a small cost in settings where using a median estimator is possible.

We confirm our theoretical findings in experiments and thereby help justify why a small constant number of estimates often suffice in practice. Our improved variance bounds are based on new general theorems about the variance and higher moments of the median of i.i.d. random variables that may be of independent interest.
1 Introduction

CountSketch [3] is a classic low-memory algorithm for processing a data stream in one pass. It supports estimating the number of occurrences of different data items in the stream, and can also be used for fast inner product estimation, or as a building block for finding heavy hitters (see e.g. [16]). Since its introduction, CountSketch has proved to be a strong primitive for approximate computation on high-dimensional vectors. Applications in machine learning include feature selection [1], neural network compression [4], random feature mappings [13], compressed gradient optimizers [14], and multitask learning [15]; see Section 1.5 for more details.
CountSketch works in the turnstile streaming model, where one is to maintain a sketch of a vector $v \in \mathbb{R}^d$ under updates to the entries. Concretely, the vector $v$ is given in a streaming fashion as a sequence of updates $(i_1, \Delta_1), (i_2, \Delta_2), \ldots$, where an update $(i, \Delta)$ has the effect of setting $v_i \leftarrow v_i + \Delta$ for some $\Delta \in \mathbb{R}$.

Figure 1: Variance plot of frequency estimation (point queries) for CountSketch with $t = 1$ and $t = 2$, run on a one-hot vector $v$ with a single nonzero coordinate $v_i = 1$. The first plot shows that the variances behave linearly on a log-log plot, suggesting that the variances decrease polynomially with the number of columns $s$. The second plot shows variance multiplied by $s$; CountSketch with $t = 1$ becomes near-constant, suggesting that its variance grows as $1/s$. The third plot shows variance multiplied by $s^2$ and suggests that the variance for $t = 2$ grows roughly as $1/s^2$.

The sketch can be stored as a matrix $A$ with $2t-1$ rows and $s$ columns, alternatively viewed as a vector of dimension $(2t-1)s$. Updates to the sketch are defined by hash functions $h_1, \ldots, h_{2t-1}$ and $g_1, \ldots, g_{2t-1}$. To initialize an empty CountSketch, we pick a 2-wise independent hash function $h_i : [d] \to [s]$ mapping entries in $v$ to columns of $A$, and a 2-wise independent hash function $g_i : [d] \to \{-1, 1\}$ mapping entries in $v$ to a random sign, each for row $i \in [2t-1]$. (A $k$-wise independent hash function has independent and uniform random hash values when restricted to any set of up to $k$ keys.) To process the update $(j, \Delta)$ the update algorithm sets $A_{i, h_i(j)} \leftarrow A_{i, h_i(j)} + g_i(j)\Delta$ for $i = 1, \ldots, 2t-1$. Thus entry $k$ of the $i$-th row of $A$ contains the sum of all coordinates $v_j$ such that $h_i(j) = k$, with each such coordinate $v_j$ multiplied by a random sign $g_i(j)$.

A frequency estimation query (a.k.a. point query) asks to return an estimate of an entry $v_j$. CountSketch supports such queries by returning the median of $\{g_i(j) A_{i, h_i(j)}\}_{i=1}^{2t-1}$.
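To make the data structure concrete, the following is a minimal sketch of the update and median point-query operations just described. It is our own illustration, not the paper's implementation: the class name is hypothetical, and for simplicity it uses fully random hash tables in place of the 2-wise independent hash functions.

```python
import random
import statistics

class CountSketch:
    """Minimal CountSketch with 2t-1 rows and s columns (illustration only)."""

    def __init__(self, t, s, d, seed=0):
        self.rows, self.s = 2 * t - 1, s
        rng = random.Random(seed)
        # Fully random hash tables stand in for the 2-wise independent h_i and g_i.
        self.h = [[rng.randrange(s) for _ in range(d)] for _ in range(self.rows)]
        self.g = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(self.rows)]
        self.A = [[0.0] * s for _ in range(self.rows)]

    def update(self, j, delta):
        # Turnstile update (j, delta): v_j <- v_j + delta.
        for i in range(self.rows):
            self.A[i][self.h[i][j]] += self.g[i][j] * delta

    def point_query(self, j):
        # Estimate v_j as the median of the 2t-1 per-row estimates g_i(j) * A[i][h_i(j)].
        return statistics.median(self.g[i][j] * self.A[i][self.h[i][j]]
                                 for i in range(self.rows))
```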
The classic analysis of CountSketch shows that for each row $i$ of $A$ and entry $v_j$, the estimate $\hat{v}^i_j = g_i(j) A_{i, h_i(j)}$ has expectation $v_j$ and variance at most $\|v\|_2^2/s$. Using Chebyshev's inequality, this implies that $\Pr[|\hat{v}^i_j - v_j| \geq 2\|v\|_2/\sqrt{s}] \leq 1/4$. This failure probability can be reduced by taking the median $\hat{v}_j$ of the $2t-1$ estimates $\hat{v}^1_j, \ldots, \hat{v}^{2t-1}_j$ and using a Chernoff bound to conclude that $\Pr[|\hat{v}_j - v_j| \geq 2\|v\|_2/\sqrt{s}] \leq \exp(-\Omega(t))$. A similar, but less common, analysis based on Markov's inequality can also be used to give a bound based on the $\ell_1$ norm of $v$. More concretely, it can be shown that $\mathbb{E}[|\hat{v}^i_j - v_j|] \leq \|v\|_1/s$. This can again be combined with the Chernoff bound to conclude that $\Pr[|\hat{v}_j - v_j| \geq 4\|v\|_1/s] \leq \exp(-\Omega(t))$. This latter bound has a better dependency on the number of columns (and hence space usage) but potentially a worse dependency on $v$, as $\|v\|_1 \geq \|v\|_2$ for all $v$ ($\|v\|_1$ and $\|v\|_2$ are close when $v$ consists of a few large non-zero entries).

Both of the above bounds suggest using a value of $t$ that is logarithmic in the desired failure probability. However, practitioners rarely use more than a small constant number of rows, such as 3 or 5 ($t = 2$ or $t = 3$), combined with a much larger number of columns, e.g. $s = 512$.

We explain these observations through new theoretical insights about CountSketch. Concretely, we prove:

Theorem 1.
CountSketch with $t = 2$ (3 rows) satisfies $\mathbb{E}[(\hat{v}_j - v_j)^2] \leq \min\{3\|v\|_1^2/s^2,\; \|v\|_2^2/s\}$.

The new contribution in Theorem 1 is the bound in terms of $\|v\|_1$. Quite interestingly, the bound in terms of $\|v\|_1$ is not true if using just a single row. To see this, consider any vector $v$ with a single non-zero entry $v_i$. The estimate for any other entry $v_j$ then equals 0 with probability $1 - 1/s$ (when $h(i) \neq h(j)$) and it equals $v_i g(i) g(j)$ with probability $1/s$. One therefore has $\mathbb{E}[(\hat{v}_j - v_j)^2] = v_i^2/s = \|v\|_1^2/s$. This shows that using just three rows instead of a single row effectively reduces the variance of CountSketch by a factor $s$ in terms of $\|v\|_1$. We find this new insight into one of the most fundamental sketching primitives quite remarkable. We prove a similar improvement for the 4th moment in terms of $\|v\|_2$:

Theorem 2.
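The single-row counterexample above is easy to reproduce numerically. The following small simulation is our own illustration, not code from the paper; it reuses the hypothetical CountSketch class sketched earlier and estimates the point-query variance on a one-hot vector for $t = 1$ versus $t = 2$, essentially the experiment behind Figure 1.

```python
def empirical_variance(t, s, d=2, trials=100_000):
    """Monte Carlo estimate of E[(v_hat_j - v_j)^2] for a one-hot vector v."""
    total = 0.0
    for trial in range(trials):
        cs = CountSketch(t, s, d, seed=trial)
        cs.update(0, 1.0)          # v has a single non-zero entry v_0 = 1
        total += cs.point_query(1) ** 2   # query a zero coordinate, true value 0
    return total / trials

# For t = 1 the result is close to ||v||_1^2 / s = 1/s,
# while for t = 2 it scales roughly like 3/s^2 (Theorem 1).
for s in (64, 256):
    print(s, empirical_variance(1, s), empirical_variance(2, s))
```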
CountSketch with $t = 2$ (3 rows) satisfies $\mathbb{E}[(\hat{v}_j - v_j)^4] \leq 3\|v\|_2^4/s^2$.

Moreover, we show that this bound is asymptotically optimal. If we consider the same example as above with a vector $v$ with just a single non-zero entry $v_i$, we again see that when estimating any $v_j$ with $j \neq i$, a single row has $\mathbb{E}[(\hat{v}^i_j - v_j)^4] = v_i^4/s = \|v\|_2^4/s$. Thus using $t = 2$ (3 rows) rather than $t = 1$ (1 row) reduces the fourth moment by a factor $s$ in terms of $\|v\|_2$. We find it quite remarkable that a constant factor increase in the number of rows increases the utilization of the number of columns by a linear factor, both in terms of the variance as a function of $\|v\|_1$ and the fourth moment as a function of $\|v\|_2$. Combined with our experiments, this strongly suggests that one should always use at least 3 rows in practice. We extend our results to any $t$ and show:

Theorem 3.
CountSketch with the median of $2t-1$ rows satisfies $\mathbb{E}[|\hat{v}_j - v_j|^t] \leq 4^{t-1}\|v\|_1^t/s^t$ and $\mathbb{E}[(\hat{v}_j - v_j)^{2t}] \leq 4^{t-1}\|v\|_2^{2t}/s^t$.

Thus we can bound the $t$-th moment optimally (up to the $4^{t-1}$ factor) in terms of $\|v\|_1$ and similarly for the $2t$-th moment in terms of $\|v\|_2$.

Another use case of CountSketch is in fast inner product estimation. Concretely, given two vectors $v, w \in \mathbb{R}^d$, if one builds a CountSketch on both vectors using the same random hash functions $h_1, \ldots, h_{2t-1}$ and $g_1, \ldots, g_{2t-1}$ (i.e. the same seeds), then one can quickly estimate $\langle v, w\rangle$ from the two sketches. More precisely, let $A^v$ and $A^w$ denote the matrices constructed for $v$ and $w$, respectively. For any row $i$, the inner product $\langle A^v_i, A^w_i\rangle = \sum_{j=1}^{s} A^v_{i,j} A^w_{i,j}$ is an unbiased estimator of $\langle v, w\rangle$. Moreover, one can show that $\mathbb{E}[(\langle A^v_i, A^w_i\rangle - \langle v, w\rangle)^2] \leq 2\|v\|_2^2\|w\|_2^2/s$ if we replace $g$ by a 4-wise independent hash function (rather than just 2-wise). Combining this with Chebyshev's inequality yields
\[
\Pr[|\langle A^v_i, A^w_i\rangle - \langle v, w\rangle| > 2\sqrt{2}\,\|v\|_2\|w\|_2/\sqrt{s}] < 1/4.
\]
Finally, as with frequency estimation (point queries), one can take the median over the $2t-1$ rows. The resulting estimator, $X$, satisfies
\[
\Pr[|X - \langle v, w\rangle| > 2\sqrt{2}\,\|v\|_2\|w\|_2/\sqrt{s}] < \exp(-\Omega(t)).
\]
CountSketch with just a single row, $t = 1$, is in fact identical to the popular feature hashing scheme [15]. Previous work has not shown any asymptotic benefits of taking the median of a small constant number of rows, using e.g. $t = 2$ or $t = 3$. Our contribution is new bounds on the variance of such inner product estimates:

Theorem 4.
For two vectors $v, w \in \mathbb{R}^d$, let $A^v$ and $A^w$ denote the two matrices representing a CountSketch of the two vectors when using the same random hash functions, where the $g_i$ are 4-wise independent. Let $X$ denote the median of $\langle A^v_i, A^w_i\rangle$ over rows $i = 1, \ldots, 2t-1$. Then CountSketch with $t = 2$ satisfies
\[
\mathbb{E}[(X - \langle v, w\rangle)^2] \leq \min\{3\|v\|_1^2\|w\|_1^2/s^2,\; 2\|v\|_2^2\|w\|_2^2/s\},
\]
and for $t > 1$:
\[
\mathbb{E}[|X - \langle v, w\rangle|^t] \leq 4^{t-1}\|v\|_1^t\|w\|_1^t/s^t \quad\text{and}\quad \mathbb{E}[(X - \langle v, w\rangle)^{2t}] \leq 4^{t-1} 2^t\, \|v\|_2^{2t}\|w\|_2^{2t}/s^t.
\]

We note that the bounds in terms of $\|v\|_1$ and $\|w\|_1$ can be shown only assuming 2-wise independence of the $g_i$. As with frequency estimation queries, a simple example demonstrates that the variance bound in terms of $\|v\|_1\|w\|_1$ is false for $t = 1$. Concretely, let $v$ have a single coordinate $v_i$ that is non-zero and let $w$ have a single coordinate $w_j$ with $j \neq i$ that is non-zero. Then $\langle v, w\rangle = 0$, yet the probability that $v_i$ and $w_j$ hash to the same entry is $1/s$. In that case, the estimate is either $v_i w_j = \|v\|_1\|w\|_1$ or $-v_i w_j$. This implies that $\mathbb{E}[(X - \langle v, w\rangle)^2] = \|v\|_1^2\|w\|_1^2/s$, i.e. a factor $s$ worse than the guarantees with three rows.

We have also performed experiments estimating the variance on real-world data sets, see Section 4. When $s$ is large enough (so that $\|v\|_1^2\|w\|_1^2/s^2$ becomes the smallest term), these experiments support our theoretical findings, as with the frequency estimation queries.

Discussion. Similarly to the frequency estimation queries, our new theoretical bounds and supporting experiments strongly advocate taking the median of at least 3 rows when using CountSketch for inner product estimation. Equivalently, when using feature hashing for inner product estimation, one should take the median of at least 3 independent instantiations. This reduces the variance by a linear factor in the number of columns/coordinates of the sketch. We remark that taking the median might not be allowed in all applications. For instance, when using CountSketch/feature hashing as preprocessing for Support Vector Machines, using one row corresponds to a kernel function, while this is not the case when taking the median of multiple row estimates. The median of three can thus not be directly used in this setting.
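To make the inner-product use case concrete, here is a minimal sketch, in our own Python illustration and assuming the hypothetical CountSketch class from above, of estimating $\langle v, w\rangle$ from two sketches that share hash functions. Sharing the seed is what makes each row-wise inner product an unbiased estimator of $\langle v, w\rangle$.

```python
import statistics

def inner_product_estimate(v, w, t, s, seed=0):
    """Median over rows of <A^v_i, A^w_i>, where both sketches share h_i and g_i."""
    d = len(v)
    cs_v = CountSketch(t, s, d, seed=seed)   # same seed => same h_i and g_i
    cs_w = CountSketch(t, s, d, seed=seed)
    for j, x in enumerate(v):
        if x != 0:
            cs_v.update(j, x)
    for j, x in enumerate(w):
        if x != 0:
            cs_w.update(j, x)
    row_estimates = [
        sum(a * b for a, b in zip(cs_v.A[i], cs_w.A[i]))
        for i in range(cs_v.rows)
    ]
    return statistics.median(row_estimates)
```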
We prove our new variance and moment bounds for CountSketch by showing general theorems relating moments of the median of i.i.d. random variables to smaller moments of the individual random variables. These new bounds are very natural and should have applications beyond CountSketch. Moreover, we show that they are asymptotically optimal.
Theorem 5.
Let $X_1, \ldots, X_{2t-1}$ be $2t-1$ i.i.d. real-valued random variables and let $Y$ denote their median. For all positive integers $q$ it holds that
\[
\mathbb{E}[|Y - \mathbb{E}[X_1]|^{tq}] \leq \binom{2t-1}{t} \cdot \mathbb{E}[|X_1 - \mathbb{E}[X_1]|^q]^t.
\]
In particular, $\mathbb{E}[|Y - \mathbb{E}[X_1]|^{tq}] \leq 4^{t-1} \cdot \mathbb{E}[|X_1 - \mathbb{E}[X_1]|^q]^t$.

In many data science applications, the $X_i$ would be unbiased estimators of some desirable function of a data set, such as e.g. the coordinate $v_i$ in a vector $v$. Theorem 5 thus gives a bound on the $tq$-th moment of the estimation error of the median $Y$ in terms of just the $q$-th moment of a single variable. We remark that the median of $2t-1$ i.i.d. variables need not have the same mean as the individual variables, and a bound on $\mathbb{E}[(Y - \mathbb{E}[X_1])^{tq}]$ is much more desirable than a bound on e.g. $\mathbb{E}[(Y - \mathbb{E}[Y])^{tq}]$, as the mean of $Y$ might be tricky to prove an exact bound for. However, one can, in fact, derive a bound on the variance of $Y$ itself (on $\mathbb{E}[(Y - \mathbb{E}[Y])^2]$) directly from Theorem 5:

Corollary 1.
Let $X_1, X_2, X_3$ be i.i.d. real-valued random variables and let $Y$ denote their median. Then
\[
\mathrm{Var}(Y) \leq \mathbb{E}[(Y - \mathbb{E}[X_1])^2] \leq 3 \cdot \mathbb{E}[|X_1 - \mathbb{E}[X_1]|]^2.
\]

Proof.
From Theorem 5 with $q = 1$ and $t = 2$ we have
\[
\mathbb{E}[(Y - \mathbb{E}[X_1])^2] \leq 3 \cdot \mathbb{E}[|X_1 - \mathbb{E}[X_1]|]^2.
\]
Moreover, the minimizing value $\mu$ for the function $\mu \mapsto \mathbb{E}[(Y - \mu)^2]$ is the mean $\mu = \mathbb{E}[Y]$. Therefore we have
\[
\mathrm{Var}(Y) = \mathbb{E}[(Y - \mathbb{E}[Y])^2] \leq \mathbb{E}[(Y - \mathbb{E}[X_1])^2] \leq 3 \cdot \mathbb{E}[|X_1 - \mathbb{E}[X_1]|]^2.
\]
In this paper, we mainly consider the case $t = 2$ with 3 rows, or equivalently 3 i.i.d. random variables.
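Corollary 1 is easy to sanity-check numerically. The following short simulation is our own illustration, not from the paper: it compares $\mathrm{Var}(Y)$ for the median of three i.i.d. draws against the bound $3\cdot\mathbb{E}[|X - \mathbb{E}[X]|]^2$, using an exponential distribution chosen arbitrarily as an example.

```python
import random

def check_corollary_one(trials=200_000, seed=1):
    rng = random.Random(seed)
    draw = rng.expovariate            # X ~ Exp(1), so E[X] = 1
    mean_x = 1.0
    abs_dev = sum(abs(draw(1.0) - mean_x) for _ in range(trials)) / trials
    medians = [sorted(draw(1.0) for _ in range(3))[1] for _ in range(trials)]
    mean_y = sum(medians) / trials
    var_y = sum((y - mean_y) ** 2 for y in medians) / trials
    print(f"Var(Y) ~ {var_y:.4f}  <=  3*E[|X-E[X]|]^2 ~ {3 * abs_dev ** 2:.4f}")

check_corollary_one()
```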
1.5 Related Work

CountSketch was originally proposed in [3] as a method for finding heavy hitters (i.e., frequently occurring elements) in a data stream. Though there are better methods for finding heavy hitters in insertion-only data streams, CountSketch has the advantage that it is a linear sketch, meaning that sketches can be subtracted to form a sketch of the difference of two vectors. It is known to be space-optimal for the problem of finding approximate $\ell_p$ heavy hitters in the turnstile streaming model, where both positive and negative frequency updates are possible [10].

Analysis by Minton and Price. An improved analysis of the error distribution of CountSketch was given in [12], building on work of [10]. The analysis gives non-trivial bounds only when $t$ is a sufficiently large (unspecified) constant, and the exposition focuses on the case $t = \Theta(\log n)$, where $n$ is the dimension of the vector $v$. Their stated error bounds are incomparable to ours since they are expressed in terms of the (residual) $\ell_2$ norm of $v$.

The reader may wonder if it is possible to derive our results from the analysis in [12]. Their error bound for CountSketch is based on $\|v_{[k]}\|_2$, where $v_{[k]}$ is $v$ with the largest $k$ entries set to 0. More concretely, it is shown that for a single row of CountSketch, the probability that $(\hat{v}^i_j - v_j)^2 > c\|v_{[c_0 s]}\|_2^2/s$ decreases exponentially in $c$, for constants $c, c_0$. The crucial observation is that all entries of $v_{[c_0 s]}$ are bounded by $\|v\|_1/(c_0 s)$ and therefore one has $\|v_{[c_0 s]}\|_2^2 = O(\|v\|_1^2/s)$. Inserting this gives a tail bound for $(\hat{v}^i_j - v_j)^2$ at the scale $\|v\|_1^2/s^2$. Already with one row, this looks similar to our bound on the variance of the median of 3 rows (Theorem 1), which stated that $\mathbb{E}[(\hat{v}_j - v_j)^2] \leq 3\|v\|_1^2/s^2$. However, as our counterexample above suggests, there is no way of extending the ideas of [12] to prove $\mathbb{E}[(\hat{v}^i_j - v_j)^2] = O(\|v\|_1^2/s^2)$ for a single row, as it is simply false for $t = 1$. Indeed, the way [12] proves their bound is by analysing the $c_0 s$ largest entries separately from the remaining entries, bounding $\mathbb{E}[(\hat{v}^i_j - v_j)^2]$ only for the small entries in $v_{[c_0 s]}$. Thus our new variance bounds do not follow from their work.

The experiments in [12] focus on the setting where $t$ is relatively large, with 20 or 50 rows, i.e., about an order of magnitude larger space usage than we have for $t = 2$.

Dimension reduction.
CountSketch can be used as a dimensionality reduction technique that is simpler and more computationally efficient than the classical Johnson-Lindenstrauss embedding [9]. In this setting there is no estimator; the sketch vector is simply considered a vector in $ts$ dimensions. Generalized versions of CountSketch have been shown to yield a time-accuracy trade-off [6, 11].

In machine learning, a variant of CountSketch, now known as feature hashing, was independently introduced in [15], focusing on applications in multitask learning. Feature hashing reduces variance in a slightly different way than CountSketch, by initially increasing the dimension of the input vector by a factor $t$ in a way that preserves $\ell_2$ distances exactly but reduces the $\ell_\infty$ norm of vectors by a factor $\sqrt{t}$. In [4], CountSketch/feature hashing was wired into the architecture of a neural network in order to reduce the number of model parameters (without the use of medians). CountSketch has also been used in the construction of random feature mappings [13, 2], which can be seen as dimension-reduced versions of explicit feature maps.

Further machine learning applications.
CountSketch, with the median estimator, has been used in several machine learning applications. In [1], CountSketch was used with $t = 2$ (3 rows) for large-scale feature selection. In [14], CountSketch was used for compressing gradient optimizers in stochastic gradient descent. The related count-min sketch [5], which is the special case of CountSketch where we fix the sign function $g(x) = 1$, is a popular choice in applications where vectors have non-negative entries. The count-min estimator takes advantage of non-negativity by taking the minimum of $t$ estimates, and the error distribution can be analyzed in terms of the $\ell_1$ norm of $v$. We note that a count-min sketch with a fully random hash function can be used to simulate a CountSketch with $s/2$ columns.

2 Moments of the Median

In this section, we prove our new inequalities for moments of the median. We in fact prove a more general theorem for the median of $2t-1$ i.i.d. random variables.

Lemma 1.
Let $f : \mathbb{R}^+ \to \mathbb{R}^+$ be a non-increasing function and let $t$ be a positive integer. Then
\[
\int_0^\infty f(\sqrt[t]{x})^t \, dx \;\leq\; \Big(\int_0^\infty f(x) \, dx\Big)^t.
\]

Proof.
Since the function is non-increasing, it is measurable. Moreover, since it is non-negative, the integrals are defined (possibly equal to $+\infty$). We have:
\[
\Big(\int_0^\infty f(x)\,dx\Big)^t = \int_0^\infty \cdots \int_0^\infty \prod_{i=1}^t f(x_i)\, dx_1 \ldots dx_t = t! \int_0^\infty \int_0^{x_t} \cdots \int_0^{x_2} \prod_{i=1}^t f(x_i)\, dx_1 \ldots dx_t \tag{1}
\]
\[
\geq t! \int_0^\infty \int_0^{x_t} \cdots \int_0^{x_2} f(x_t)^t\, dx_1 \ldots dx_t \tag{2}
\]
\[
= t! \int_0^\infty f(x_t)^t \int_0^{x_t} \cdots \int_0^{x_2} dx_1 \ldots dx_{t-1}\, dx_t = t! \int_0^\infty f(x_t)^t \frac{x_t^{t-1}}{(t-1)!}\, dx_t \tag{3}
\]
\[
= \int_0^\infty f(x)^t\, t x^{t-1}\, dx = \int_0^\infty f(\sqrt[t]{x})^t\, dx.
\]
The integral in (1) is exactly over the set $0 \leq x_1 \leq x_2 \leq \cdots \leq x_t$. There are $t!$ such sets, each determined by an ordering of the variables. Since $\prod_{i=1}^t f(x_i)$ is a symmetric function (by commutativity) it integrates to the same value over each of these sets. Moreover, these sets partition the set $[0, \infty)^t$ (up to a set of measure 0 corresponding to when two variables are equal). Since we have a partition into $t!$ sets and the integral over each set from the partition is the same, the integral over each set is a $t!$-fraction of the integral over the whole space, and (1) holds. (2) holds because $f$ is non-increasing and $x_1, \ldots, x_{t-1} \leq x_t$. (3) holds because the inner integrals correspond to the volume of the $(t-1)$-dimensional simplex $\{0 \leq x_1 \leq \cdots \leq x_{t-1} \leq x_t\}$, whose volume is $x_t^{t-1}/(t-1)!$ (this holds by symmetry, and can be argued the same way as (1)). The final equality holds by the substitution $x \mapsto x^t$.
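For concreteness, instantiating Lemma 1 with $t = 2$, the case relevant for CountSketch with 3 rows, gives: for non-increasing $f \geq 0$,
\[
\int_0^\infty f(\sqrt{x})^2\,dx \;\leq\; \Big(\int_0^\infty f(x)\,dx\Big)^2.
\]
Applied to $f(x) = \Pr[|X_1 - \mathbb{E}[X_1]| \geq x]$, this is exactly the step needed in the proof of Theorem 5 below for $t = 2$, $q = 1$.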
Restatement of Theorem 5. Let $X_1, \ldots, X_{2t-1}$ be $2t-1$ i.i.d. real-valued random variables and let $Y$ denote their median. For all positive integers $q$ it holds that
\[
\mathbb{E}[|Y - \mathbb{E}[X_1]|^{tq}] \leq \binom{2t-1}{t} \cdot \mathbb{E}[|X_1 - \mathbb{E}[X_1]|^q]^t.
\]
In particular, $\mathbb{E}[|Y - \mathbb{E}[X_1]|^{tq}] \leq 4^{t-1} \cdot \mathbb{E}[|X_1 - \mathbb{E}[X_1]|^q]^t$.

Proof. Notice that since $Y$ is the median of $X_1, \ldots, X_{2t-1}$ and the $X_i$'s have the same mean, we can only have $|Y - \mathbb{E}[X_1]|^{tq} \geq x$ when at least $t$ variables $X_i$ have $|X_i - \mathbb{E}[X_i]|^{tq} \geq x$. There are $\binom{2t-1}{t}$ choices for such $t$ variables, so by the union bound, independence and identical distribution of the $X_i$'s, we have for any $x$ that:
\[
\Pr[|Y - \mathbb{E}[X_1]|^{tq} \geq x] \leq \binom{2t-1}{t} \Pr[|X_1 - \mathbb{E}[X_1]|^{tq} \geq x]^t.
\]
We can thus bound $\mathbb{E}[|Y - \mathbb{E}[X_1]|^{tq}]$ as:
\[
\mathbb{E}[|Y - \mathbb{E}[X_1]|^{tq}] = \int_0^\infty \Pr[|Y - \mathbb{E}[X_1]|^{tq} \geq x]\, dx \leq \binom{2t-1}{t} \int_0^\infty \Pr[|X_1 - \mathbb{E}[X_1]|^{tq} \geq x]^t\, dx
\]
\[
= \binom{2t-1}{t} \int_0^\infty \Pr[|X_1 - \mathbb{E}[X_1]|^q \geq \sqrt[t]{x}]^t\, dx \leq \binom{2t-1}{t} \Big(\int_0^\infty \Pr[|X_1 - \mathbb{E}[X_1]|^q \geq x]\, dx\Big)^t = \binom{2t-1}{t} \cdot \mathbb{E}[|X_1 - \mathbb{E}[X_1]|^q]^t,
\]
where the first and last equalities hold by a standard identity for non-negative random variables, and the last inequality holds by Lemma 1 since $\Pr[|X_1 - \mathbb{E}[X_1]|^q \geq x]$ is a non-increasing non-negative function.

The bound shown in this section can easily be seen to be asymptotically optimal. Consider $X_i$'s which take value $k$ with probability $1/k$ and are zero otherwise. Then
\[
\mathbb{E}[|Y - \mathbb{E}[X_1]|^{qt}] \geq (k-1)^{qt}\,\Pr[Y = k] \geq (k-1)^{qt}\binom{2t-1}{t}\Pr[X_1 = k]^t\,(1 - \Pr[X_1 = k])^{t-1} = \frac{(k-1)^{qt}}{k^t}\binom{2t-1}{t}\Big(1 - \frac{1}{k}\Big)^{t-1} \sim (k-1)^{(q-1)t}\binom{2t-1}{t},
\]
where the limit in $\sim$ is taken for $k \to \infty$. On the other hand, the bound given by our theorem is
\[
\binom{2t-1}{t}\,\mathbb{E}[|X_1 - \mathbb{E}[X_1]|^q]^t = \binom{2t-1}{t}\Big(\frac{1}{k}(k-1)^q\Big)^t \sim (k-1)^{(q-1)t}\binom{2t-1}{t}.
\]

3 Analysis of CountSketch

In this section, we prove our new bounds on the variance (Theorem 1) and 4th moment (Theorem 2) for CountSketch with 3 rows ($t = 2$) as well as our general theorem for the median of $2t-1$ rows (Theorem 3). Our bounds are optimal up to a $1 + o(1)$ factor. This can be seen by considering input consisting of one item with frequency $s$ and a sketch of size $s$. Querying an item with frequency 0 then reproduces the example from the last section, for which our bounds are optimal up to a $1 + o(1)$ factor.

Frequency estimation.
Recall that CountSketch with three rows computes an estimate $\hat{v}^i_j$ for each of the three rows $i = 1, 2, 3$ and returns the median $\hat{v}_j$ as its estimate of $v_j$. From Theorem 5, we see that to obtain variance and 4th moment bounds for $\hat{v}_j$, we only need to bound $\mathbb{E}[|\hat{v}^i_j - \mathbb{E}[\hat{v}^i_j]|^q]$ for $q = 1, 2$. Such bounds essentially follow from previous work and are as follows:
Lemma 2.
CountSketch satisfies $\mathbb{E}[\hat{v}^i_j] = v_j$, $\mathbb{E}[|\hat{v}^i_j - v_j|] \leq \|v\|_1/s$ and $\mathbb{E}[(\hat{v}^i_j - v_j)^2] \leq \|v\|_2^2/s$.

Theorem 1 follows by instantiating Theorem 5 with $q = 1$ and the facts $\mathbb{E}[\hat{v}^i_j] = v_j$, $\mathbb{E}[|\hat{v}^i_j - v_j|] \leq \|v\|_1/s$ from Lemma 2. Theorem 2 follows by instantiating Theorem 5 with $q = 2$ and the facts $\mathbb{E}[\hat{v}^i_j] = v_j$, $\mathbb{E}[(\hat{v}^i_j - v_j)^2] \leq \|v\|_2^2/s$ from Lemma 2. Finally, Theorem 3 also follows as an immediate corollary of Theorem 5 and Lemma 2.
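To spell out the instantiation for $t = 2$: plugging $q = 1$ and $\mathbb{E}[|\hat{v}^i_j - v_j|] \leq \|v\|_1/s$ into Theorem 5 gives
\[
\mathbb{E}[(\hat{v}_j - v_j)^2] \leq \binom{3}{2}\Big(\frac{\|v\|_1}{s}\Big)^2 = \frac{3\|v\|_1^2}{s^2},
\]
and plugging $q = 2$ and $\mathbb{E}[(\hat{v}^i_j - v_j)^2] \leq \|v\|_2^2/s$ gives $\mathbb{E}[(\hat{v}_j - v_j)^4] \leq 3\|v\|_2^4/s^2$, which are exactly the $\ell_1$ bound of Theorem 1 and the bound of Theorem 2.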
We give the proof of Lemma 2 in the following for completeness.

Proof of Lemma 2. For short, let $X = \hat{v}^1_j$, $g = g_1$ and $h = h_1$. We then have:
\[
\mathbb{E}[X] = \mathbb{E}[g(j) A_{1,h(j)}] = \mathbb{E}\Big[g(j)\Big(g(j) v_j + \sum_{i \neq j} 1_{h(i)=h(j)}\, g(i) v_i\Big)\Big] = v_j + \sum_{i \neq j} \mathbb{E}[1_{h(i)=h(j)}\, g(j) g(i)]\, v_i.
\]
By independence of $h$ and $g$, and $g$ being 2-wise independent, we have
\[
\mathbb{E}[1_{h(i)=h(j)}\, g(j) g(i)] = \mathbb{E}[1_{h(i)=h(j)}]\, \mathbb{E}[g(i)]\, \mathbb{E}[g(j)] = 0,
\]
and we conclude $\mathbb{E}[X] = v_j$. Next consider
\[
\mathbb{E}[|X - \mathbb{E}[X]|] = \mathbb{E}[|g(j) A_{1,h(j)} - v_j|] = \mathbb{E}\Big[\Big|\sum_{i \neq j} 1_{h(i)=h(j)}\, g(j) g(i) v_i\Big|\Big] \leq \mathbb{E}\Big[\sum_{i \neq j} 1_{h(i)=h(j)}\, |g(j) g(i)|\, |v_i|\Big] = \sum_{i \neq j} \mathbb{E}[1_{h(i)=h(j)}]\, |v_i| \leq \sum_{i} |v_i|/s = \|v\|_1/s.
\]
Here we used 2-wise independence of $h$ when we concluded that $\mathbb{E}[1_{h(i)=h(j)}] = \Pr[h(i) = h(j)] \leq 1/s$ for all $i \neq j$. Finally consider:
\[
\mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}\Big[\Big(\sum_{i \neq j} 1_{h(i)=h(j)}\, g(j) g(i) v_i\Big)^2\Big] = \mathbb{E}\Big[\sum_{i \neq j}\sum_{k \neq j} 1_{h(i)=h(j)}\, 1_{h(k)=h(j)}\, g(j)^2 g(i) g(k)\, v_i v_k\Big] = \sum_{i \neq j}\sum_{k \neq j} \mathbb{E}[1_{h(i)=h(j)}\, 1_{h(k)=h(j)}]\, \mathbb{E}[g(i) g(k)]\, v_i v_k.
\]
Here we notice by 2-wise independence of $g$ that $\mathbb{E}[g(i) g(k)] = 0$ whenever $i \neq k$ and $1$ otherwise. The above is thus bounded by:
\[
\mathbb{E}[(X - \mathbb{E}[X])^2] \leq \sum_{i \neq j} \mathbb{E}[1_{h(i)=h(j)}]\, v_i^2 \leq \sum_{i \neq j} v_i^2/s \leq \|v\|_2^2/s.
\]

Inner product estimation.
Similarly to the case of frequency estimation (point queries), we prove our new guarantees in Theorem 4 by invoking our general theorems on moments of the median. All we need is moment bounds for a single row. We show the following (which are more or less standard):
Lemma 3.
For two vectors $v, w \in \mathbb{R}^d$, let $A^v$ and $A^w$ denote the two matrices representing a CountSketch of the two vectors when using the same random hash functions. Then $\mathbb{E}[\langle A^v, A^w\rangle] = \langle v, w\rangle$ and $\mathbb{E}[|\langle A^v, A^w\rangle - \langle v, w\rangle|] \leq \|v\|_1\|w\|_1/s$. Moreover, if $g$ is 4-wise independent, then we also have $\mathbb{E}[(\langle A^v, A^w\rangle - \langle v, w\rangle)^2] \leq 2\|v\|_2^2\|w\|_2^2/s$.

Theorem 4 follows by combining Lemma 3 and Theorem 5.
Proof.
For short, let $g = g_1$ and $h = h_1$. We start by observing that
\[
\langle A^v, A^w\rangle = \sum_{i=1}^{d}\sum_{j=1}^{d} g(i) v_i\, g(j) w_j\, 1_{h(i)=h(j)}.
\]
Using 2-wise independence of $g$, we get:
\[
\mathbb{E}[\langle A^v, A^w\rangle] = \sum_{i=1}^{d}\sum_{j=1}^{d} \mathbb{E}[g(i) g(j)]\, v_i w_j\, \mathbb{E}[1_{h(i)=h(j)}] = \sum_{i=1}^{d} v_i w_i = \langle v, w\rangle.
\]
Next, we see that
\[
\mathbb{E}[|\langle A^v, A^w\rangle - \langle v, w\rangle|] = \mathbb{E}\Big[\Big|\sum_{i=1}^{d}\sum_{j \neq i} g(i) v_i\, g(j) w_j\, 1_{h(i)=h(j)}\Big|\Big] \leq \sum_{i=1}^{d}\sum_{j \neq i} |v_i||w_j|\, \mathbb{E}[1_{h(i)=h(j)}] \leq \sum_{i=1}^{d} |v_i| \sum_{j=1}^{d} |w_j|/s = \|v\|_1\|w\|_1/s.
\]
And finally, for 4-wise independent $g$, we have:
\[
\mathbb{E}[(\langle A^v, A^w\rangle - \langle v, w\rangle)^2] = \mathbb{E}\Big[\Big(\sum_{i=1}^{d}\sum_{j \neq i} g(i) v_i\, g(j) w_j\, 1_{h(i)=h(j)}\Big)^2\Big] = \sum_{i=1}^{d}\sum_{j \neq i}\sum_{a=1}^{d}\sum_{b \neq a} v_i w_j\, \mathbb{E}[1_{h(i)=h(j)}\, 1_{h(a)=h(b)}]\, \mathbb{E}[g(i) g(j) g(a) g(b)]\, v_a w_b.
\]
Recall that $a \neq b$ and $i \neq j$. Thus if $a \notin \{i, j\}$ or $b \notin \{i, j\}$, then at least one $g(\cdot)$ is independent of the remaining three by 4-wise independence of $g$. The expectation $\mathbb{E}[g(i) g(j) g(a) g(b)]$ then splits into the product of the expectation of that single term and the remaining three. Since $\mathbb{E}[g(\cdot)] = 0$, the whole term in the sum becomes 0. Thus for any given $(i, j)$ with $i \neq j$, there are two choices of $(a, b)$ that do not result in the term disappearing, namely $(a, b) = (i, j)$ and $(a, b) = (j, i)$. In both these cases, $g(i) g(j) g(a) g(b) = 1$. When $(a, b) = (i, j)$ we have $v_i w_j v_a w_b = v_i^2 w_j^2$ and when $(a, b) = (j, i)$ we have $v_i w_j v_a w_b = v_i w_i v_j w_j$. Therefore:
\[
\mathbb{E}[(\langle A^v, A^w\rangle - \langle v, w\rangle)^2] = \sum_{i=1}^{d}\sum_{j \neq i} (v_i^2 w_j^2 + v_i w_i v_j w_j)\, \mathbb{E}[1_{h(i)=h(j)}] \leq \sum_{i=1}^{d}\sum_{j=1}^{d} (v_i^2 w_j^2 + v_i w_i v_j w_j)/s = \|v\|_2^2\|w\|_2^2/s + \langle v, w\rangle^2/s.
\]
By Cauchy-Schwarz, we have $\langle v, w\rangle^2 \leq \|v\|_2^2\|w\|_2^2$ and thus the whole expression is bounded by $2\|v\|_2^2\|w\|_2^2/s$.

4 Experiments

In this section, we empirically support our new theoretical bounds by estimating the variance of CountSketch with 1 row and 3 rows on different data sets. We implemented CountSketch in C++ using the multiply-shift hash function [7] as the 2-wise independent hash functions $h$ and $g$. We seeded the hash functions using random numbers generated using the built-in Mersenne twister 64-bit pseudorandom generator. Experiments were run both for frequency estimation and for inner product estimation (see below).
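As a reference point for the hashing scheme mentioned above, here is a small Python sketch of multiply-add-shift hashing in the style of [7]. The 64-bit word size, the restriction to 32-bit keys, and the variable names are our own choices for illustration; a production implementation (like the C++ one used for the experiments) would use native unsigned arithmetic.

```python
import random

def make_multiply_shift(log2_s, rng=random.Random(42)):
    """2-wise independent hash for 32-bit keys into [2^log2_s] (multiply-add-shift)."""
    a = rng.randrange(1 << 64)             # random 64-bit multiplier
    b = rng.randrange(1 << 64)             # random 64-bit additive term
    mask = (1 << 64) - 1
    return lambda x: ((a * x + b) & mask) >> (64 - log2_s)

def make_sign_hash(rng=random.Random(43)):
    """Random sign g: keys -> {-1, +1}, from one multiply-shift output bit."""
    h = make_multiply_shift(1, rng)
    return lambda x: 1 if h(x) else -1

h, g = make_multiply_shift(9), make_sign_hash()   # s = 2^9 = 512 columns
print(h(12345), g(12345))
```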
Frequency estimation. We ran experiments on two real-world data sets and two synthetic data sets. The real-world data sets come in the form of a stream of items, with the same item occurring multiple times. Instead of running numerous $(i, 1)$ updates ($v_i \leftarrow v_i + 1$), we have simply computed the number of occurrences $c_i$ of each item. We then normalize the occurrences $c_i \leftarrow c_i / \sum_j c_j$ to obtain unit $\ell_1$-norm and then run a single update $v_i \leftarrow v_i + c_i$ for each item $i$ at the end. This produces the exact same CountSketch as when processing the updates one by one (with normalization). The data sets are described in the following:

• Kosarak:
An anonymized click-stream dataset of a Hungarian online news portal (provided by Ferenc Bodon to the FIMI data set located at http://fimi.uantwerpen.be/data/). It consists of transactions, each of which has several items. We created a vector with one entry for each item, storing the total number of occurrences of that item. The vector has 41270 entries, and when normalized to have $\ell_1$-norm 1, its $\ell_2$-norm is 0.112 and the largest entry is 0.… .

• Sentiment140:
A collection of 1.6M tweets from Twitter [8]. We extracted all words that occur at least twice, and created a vector with one entry per word, containing the total number of occurrences of that word in the tweets. The vector has 147071 entries, and when normalized to have $\ell_1$-norm 1, its $\ell_2$-norm is 0.… .

• Zipfian:
The Zipfian distribution with skew $\alpha$ and $n$ items is a probability distribution where the $k$-th item has probability $k^{-\alpha} / \sum_{j=1}^{n} j^{-\alpha}$. Such distributions have been shown to fit a large variety of real-world data. We created two data sets with $n = 1000$ items using skews $\alpha = 0.8$ and $\alpha = 1.2$. For $\alpha = 0.8$, the $\ell_2$-norm is 0.097 and the largest entry is 0.…; for $\alpha = 1.2$, the $\ell_2$-norm is 0.… .

We run experiments with $s = 2, 4, \ldots, 1024$ (powers of two) on each data set. For each choice of $s$, we estimate the variance by constructing 1000 CountSketches on the input with new randomness for each. For each CountSketch we pick 100 random items and compute the estimation error for each. We sum the squares of all these estimation errors and divide by $100 \times 1000$ (we could equivalently have constructed $100 \times 1000$ CountSketches and made a single estimation on each).

Figure 2: Variance experiments on the Kosarak data set.
Figure 3: Variance experiments on the Sentiment140 data set.

On all four data sets, we make three plots of the data. On the first, we show a log-log plot and observe that in all experiments, the variances look linear on the plot, supporting a polynomial dependency on $s$. Second, we scale the variances by $s$ and plot them on a linear scale. In all experiments, the scaled variance for $t = 1$ looks constant, supporting a $1/s$ dependency on the number of columns $s$. Third, we scale the variance by $s^2$ and plot it on a linear scale. The scaled variance for $t = 2$ looks almost constant in all experiments, supporting a $1/s^2$ dependency on the number of columns.

Figure 4: Variance experiments on Zipfian distribution with skew $\alpha = 0.8$.
Figure 5: Variance experiments on Zipfian distribution with skew $\alpha = 1.2$.

We remark that our theoretical bound from Theorem 1 is $\mathbb{E}[(\hat{v}_j - v_j)^2] \leq 3\|v\|_1^2/s^2$. Since $\|v\|_1 = 1$ in all our data sets, we expect a CountSketch with $t = 2$ on the third plots to stay below 3 on the y-axis, which it does in all experiments (it even stays below 1).
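The estimation procedure described above can be summarized in a few lines. The following is our own Python illustration (the paper's experiments use a C++ implementation), reusing the hypothetical CountSketch class from the introduction.

```python
import random

def estimate_moments(v, t, s, n_sketches=1000, n_queries=100, seed=0):
    """Empirical variance and 4th moment of CountSketch point queries on vector v."""
    d = len(v)
    rng = random.Random(seed)
    sq_errs, quad_errs = [], []
    for rep in range(n_sketches):
        cs = CountSketch(t, s, d, seed=rep)
        for j, x in enumerate(v):
            if x != 0:
                cs.update(j, x)
        for _ in range(n_queries):
            j = rng.randrange(d)
            err = cs.point_query(j) - v[j]
            sq_errs.append(err * err)
            quad_errs.append(err ** 4)
    n = n_sketches * n_queries
    return sum(sq_errs) / n, sum(quad_errs) / n
```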
Data Set            Variance t = 1    Variance t = 2    Ratio
Kosarak             …                 …                 …
Sentiment140        …                 …                 …
Zipfian α = 0.8     …                 …                 …
Zipfian α = 1.2     …                 …                 …

Table 1: Variances for different data sets with 1 and 3 rows ($t = 1, 2$) of CountSketch. In all experiments, we consider a CountSketch with $s = 1024$ columns. The ratio in the last column of the table gives the relative difference between using 1 and 3 rows.

Table 1 shows the variance on the different data sets using CountSketch with $s = 1024$ columns. In all cases, increasing the CountSketch parameter $t$ from 1 to 2 clearly provides major reductions in variance, ranging from a factor of about 28 to 174.

We also perform experiments measuring the 4th moment of the estimation errors. The results of these experiments can be seen in Figures 6-9.

Figure 6: 4th moment experiments on the Kosarak data set.
Figure 7: 4th moment experiments on the Sentiment140 data set.
Figure 8: 4th moment experiments on Zipfian distribution with skew $\alpha = 0.8$.
Figure 9: 4th moment experiments on Zipfian distribution with skew $\alpha = 1.2$.

Again, we have plotted the 4th moment times $s$ and the 4th moment times $s^2$. Similar to the variance experiments, it appears that CountSketch with $t = 1$ has a 4th moment error growing as $1/s$ and with $t = 2$ it grows as $1/s^2$, supporting our new theoretical findings in Theorem 2.

To summarize, we believe our empirical findings support our new theoretical bounds on the variance and 4th moment. Moreover, our results strongly suggest that practitioners use $t \geq 2$, i.e., at least 3 rows.

Inner product estimation.
In the following, we perform experiments where we use CountSketch for inner product estimation. We perform experiments on two data sets, a synthetic one and a real-world one:

• Disjoint 64 non-zeros:
A synthetic data set with two vectors, both having 64 non-zero entries, each with value $1/64$. The two vectors have disjoint supports and thus inner product 0. The $\ell_2$-norm of the vectors is $1/8 = 0.125$ and the largest entry is $1/64 \approx 0.016$.

• News20:
A collection of newsgroup documents on different topics (http://qwone.com/~jason/20Newsgroups/). Each document is represented by a tf-idf vector constructed on the words occurring in the documents. We used the training part of the data set for our experiments. The data set has 11314 distinct vectors. For comparison to our theoretical bounds, we normalize the vector $v$ representing each document such that it has $\|v\|_1 = 1$. After normalization, the average $\ell_2$-norm of a document vector is 0.… .

Figure 10: Variance experiments on the Disjoint 64 Non-Zeros data set.
Figure 11: Variance experiments on the News20 data set.
Figure 12: Variance experiments on the News20 data set and number of columns up to $s = 2^{20}$.

For the Disjoint 64 Non-Zeros data set, for 10 iterations, we constructed a new CountSketch on the two vectors using the same random hash functions. We then computed the squared error of the estimates and averaged over all 10 iterations. For the News20 data set, we run 1000 iterations where we pick new random hash functions in each iteration. In an iteration, we pick 100 random pairs of distinct vectors, build a CountSketch on both vectors in a pair, and compute the squared estimation error. We finally average over all $100 \times 1000$ estimates.

Interestingly, on the News20 data set the plots show that CountSketch with 3 rows ($t = 2$) has a variance decreasing as $1/s$, not $1/s^2$. To explain this, recall that the guarantee from Theorem 4 is $\mathbb{E}[(X - \langle v, w\rangle)^2] \leq \min\{3\|v\|_1^2\|w\|_1^2/s^2, 2\|v\|_2^2\|w\|_2^2/s\}$. In the News20 data set, the average $\|v\|_2$ is 0.…, so when we take the product of the squared $\ell_2$-norms ($\|v\|_2^2$ and $\|w\|_2^2$), it becomes very small compared to $\|v\|_1^2\|w\|_1^2 = 1$; thus the $1/s^2$ dependency should only kick in for large values of $s$. To confirm this, we have run more experiments, this time with values of $s$ ranging from 2 to $2^{20}$. The results are shown in Figure 12. With these larger values of $s$, we see the expected $1/s^2$ dependency in the variance for $t = 2$. To conclude on this, one may need a larger value of $s$ to see the $1/s^2$ behaviour in variance when performing inner product estimation compared to frequency estimation. This is due to the dependency on the product of the norms of two vectors, either $\|v\|_1^2\|w\|_1^2$ or $\|v\|_2^2\|w\|_2^2$, compared to just the single dependency on $\|v\|_1^2$ or $\|v\|_2^2$ for frequency estimation.

As with frequency estimation, we also experimentally examine the 4th moments. For the results of these experiments, see Figures 13 and 14.

Figure 13: 4th moment experiments on the Disjoint 64 Non-Zeros data set.
Figure 14: 4th moment experiments on the News20 data set.
References

[1] Amirali Aghazadeh, Ryan Spring, Daniel LeJeune, Gautam Dasarathy, Anshumali Shrivastava, and Richard G. Baraniuk. MISSION: Ultra large-scale feature selection using count-sketches. In Proceedings of the International Conference on Machine Learning (ICML), pages 80–88. PMLR, 2018.
[2] Thomas D. Ahle, Michael Kapralov, Jakob B. T. Knudsen, Rasmus Pagh, Ameya Velingker, David P. Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), pages 141–160, 2020.
[3] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3–15, 2004.
[4] Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In Proceedings of the International Conference on Machine Learning (ICML), pages 2285–2294. PMLR, 2015.
[5] Graham Cormode and Shan Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
[6] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson-Lindenstrauss transform. In Proceedings of the Symposium on Theory of Computing (STOC), pages 341–350, 2010.
[7] Martin Dietzfelbinger. Universal hashing and k-wise independent random variables via integer arithmetic without primes. In Proceedings of the Annual Symposium on Theoretical Aspects of Computer Science (STACS), volume 1046 of Lecture Notes in Computer Science, pages 569–580. Springer, 1996.
[8] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009, 2009. URL http://help.sentiment140.com/for-students.
[9] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206):1, 1984.
[10] Hossein Jowhari, Mert Sağlam, and Gábor Tardos. Tight bounds for lp samplers, finding duplicates in streams, and related problems. In Proceedings of the Symposium on Principles of Database Systems (PODS), pages 49–58, 2011.
[11] Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1), January 2014. ISSN 0004-5411. doi: 10.1145/2559902.
[12] Gregory T. Minton and Eric Price. Improved concentration bounds for count-sketch. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), pages 669–686, 2014. doi: 10.1137/1.9781611973402.51.
[13] Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), pages 239–247, 2013.
[14] Ryan Spring, Anastasios Kyrillidis, Vijai Mohan, and Anshumali Shrivastava. Compressing gradient optimizers via count-sketches. In Proceedings of the International Conference on Machine Learning (ICML), pages 5946–5955. PMLR, 2019.
[15] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 1113–1120, 2009.
[16] David P. Woodruff. New algorithms for heavy hitters in data streams (invited talk). In