Density Sketches for Sampling and Estimation
Aditya Desai
Department of Computer Science, Rice University, Houston, Texas
[email protected]

Benjamin Coleman
Department of Electrical and Computer Engineering, Rice University, Houston, Texas
[email protected]

Anshumali Shrivastava
Department of Computer Science, Rice University, Houston, Texas
[email protected]

Abstract
We introduce Density Sketches (DS): a succinct online summary of the data distribution. A DS can accurately estimate pointwise probability density. Interestingly, a DS also provides the capability to sample unseen, novel data from the underlying data distribution. Thus, analogous to popular generative models, DS allows us to succinctly replace the real data in almost all machine learning pipelines with synthetic examples drawn from the same distribution as the original data. However, unlike generative models, which do not have any statistical guarantees, DS yields theoretically sound, asymptotically converging, consistent estimators of the underlying density function. Density sketches also have many appealing properties that make them ideal for large-scale distributed applications. DS construction is an online algorithm, and the sketches are additive: the sum of two sketches is the sketch of the combined data. These properties allow data to be collected from distributed sources, compressed into a density sketch, efficiently transmitted in sketch form to a central server, merged, and re-sampled into a synthetic database for modeling applications. Thus, density sketches can potentially revolutionize how we store, communicate, and distribute data.
1 Introduction

The capability to sample from a distribution is a prerequisite for the classical task of statistical estimation, with innumerable applications. The popular Monte Carlo estimation [1] algorithm uses sampling as the primary tool to approximate statistical quantities of interest. Data-driven machine learning is yet another example: it is essentially a statistical estimation problem in which a model is estimated from data.

Often, we use a set of samples (popularly known as the data) to specify the distribution. Sampling from a distribution described in this way requires estimating the underlying distribution. Popular methods to infer the distribution and sample from it belong to the following three categories: 1) parametric density estimation [2]; 2) non-parametric estimation, such as histograms and kernel density estimators (KDE) [3]; and 3) learning-based approaches such as Variational Auto-Encoders (VAEs), Generative Adversarial Networks (GANs), and related methods [4, 5]. Generally, parametric estimation is not suitable for modeling most real data, as it can incur a large, unavoidable bias from the choice of the model [3]. Learning the distribution, e.g., via neural networks, is one solution to this problem. Although learning-based methods have recently found remarkable success, they do not have any theoretical guarantees on the distribution of the generated samples. Histograms and KDEs, on the other hand, are theoretically well understood. These statistical estimators of density are known to uniformly converge to the underlying true distribution almost surely. In this paper, we focus on such estimators, which have theoretical guarantees.
Our contribution:
Histograms and KDEs are the most popular non-parametric density estimation methods. However, they are known to scale poorly with the dimension and size of the data. In this work, we propose Density Sketches (DS): a succinct sketch constructed from the data. These sketches are tiny in size and can be used to 1) query the density at a point and 2) sample points from the underlying distribution. We show that Density Sketches are backed by theoretical guarantees and asymptotically approximate true histogram densities. However, unlike histograms, which store counts for a number of bins exponential in the dimension, DS only uses memory logarithmic in the size of the histogram. Any methodology for exact computation of the KDE, or for KDE-based sampling, requires storing the entire dataset. In contrast, Density Sketches do not store the actual data in any form. Furthermore, density sketches have friendly properties that open up a variety of exciting and vital applications. We state some applications below.
Density Sketch as a compressed surrogate for data: We propose Density Sketches (DS) constructed from data as a compressed alternative to keeping the data itself. Specifically, for statistical estimation tasks, a sample from the underlying distribution is sufficient for the vast majority of applications. In this regard, the original data is not sacrosanct: any sample from the underlying distribution may be used. A DS can be used to efficiently sample the required amount of synthetic data from the underlying distribution. With increasing amounts of data, storage and transfer are practical concerns of growing importance, and DS provides an advantageous trade-off for such applications. Our experiments show that with more data, the accuracy of the density sketch improves while its size is practically unaffected. Our mathematical results show that the size depends on the variety of the data rather than its volume. These properties make density sketches an appealing alternative to data.
Data privacy and sketching: Fine-grained data collected from many individuals is necessary for many downstream estimation and modeling tasks. However, this can compromise the privacy of an individual, and maintaining the privacy of individuals in released data is an increasingly important topic of study. Data anonymization is an essential operation toward making a data release private, although it comes with weak privacy guarantees. As Density Sketches do not store the exact data, they are a possible way to release anonymized data. Differential privacy [6] provides better privacy guarantees for released data. With some modifications along the lines of [7], the Density Sketch, which has similar properties to count-based sketches, can be made differentially private.
Data collection on edge and mobile devices: A significant portion of the data today is generated on edge and mobile devices. Frequent transfer of data between these devices and a central server is unavoidable because of their limited storage capacity. Density Sketches, because of their small size, offer a natural alternative to storing and transferring data on such devices. The Density Sketch, being a sketching data structure, possesses the mergeable property: the density sketch of the complete data can be obtained by composing density sketches of parts of the data. This makes it possible to process data on edge devices into sketches and combine them at a central location to obtain the final Density Sketch; the count-sketch example in the next section illustrates merging by addition.
2 Background

2.1 Count Sketch

The count sketch [8, 9], along with its variants, is one of the most popular probabilistic data structures for the heavy-hitter problem. Informally, the heavy-hitter problem is to identify the most frequent items in a stream of data. The setting generalizes to streams of key-value pairs: given a stream of items $(a_t, c_t)$, where $a_t$ is a key belonging to a very large set $\mathcal{U}$ and $c_t$ is the value associated with $a_t$ at time step $t$, the goal is to output the top keys by total value $c(a)$, where $c(a)$ is the sum of all values associated with key $a$ in the stream.

[Figure 1: Count sketch: sketching and query.]

$\mathcal{U}$ can be very large in certain scenarios. For example, to keep track of the most-queried web pages, the key is a web page address, which can be up to 100 characters (800 bits); in this case $|\mathcal{U}| = 2^{800}$. This prohibits the use of an array of size $|\mathcal{U}|$, and a dictionary storing all keys is prohibitive in both its update time and the cost of storing all keys. The count sketch offers a probabilistic solution in memory logarithmic in $|\mathcal{U}|$, with a standard memory-accuracy trade-off. Let $m$ be the number of distinct keys and $C$ be the vector of counts indexed by key. For the count-median sketch [9], the $(\epsilon, \delta)$ guarantee $P(|\hat c(a) - c(a)| > \epsilon \|C\|_2) \leq \delta$ is achieved using $O\!\left(\frac{1}{\epsilon^2} \log\frac{1}{\delta} (\log m + \log |\mathcal{U}|)\right)$ space [10]. As can be seen from this guarantee, the approximation accuracy for a particular key depends on how its count compares to $\|C\|_2$. In particular, the count sketch gives very good approximations for the keys with the highest values in a setting where most other keys have very low values.

The procedure for sketching and querying is illustrated in Figure 1. A count sketch, parameterized by $K$ and $R$, uses $K$ independent pairs of hash functions $(h_i : \mathcal{U} \to \{0, \dots, R-1\},\; g_i : \mathcal{U} \to \{-1, +1\})$, $i \in [1, K]$, drawn uniformly from a family of universal hash functions, together with a 2D array $A$ of size $K \times R$. While processing each element (insert operation), say $(a, c)$, for all $i \in [1, K]$ we update $A[i, h_i(a)]$ by adding $g_i(a)\, c$ to it. To query the value $c(a)$, we obtain $K$ unbiased estimates $g_i(a)\, A[i, h_i(a)]$, one from each row, and combine them to get the final estimate of $c(a)$; the median, mean, or median-of-means are the preferred ways of combining them to improve the variance and concentration of the estimator around the true value.
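To make the description concrete, here is a minimal, illustrative count-sketch implementation (our own sketch under stated assumptions, not the authors' released code; the tuple-hashing scheme stands in for a proper universal hash family). It also demonstrates the mergeable property discussed in the introduction: two tables built with the same seeds merge by elementwise addition.

```python
import numpy as np

class CountSketch:
    """Minimal count sketch: K repetitions (rows) of range R, median recovery."""

    def __init__(self, K=5, R=1024, seed=42):
        self.K, self.R = K, R
        self.table = np.zeros((K, R))
        # Shared seeds => two sketches hash identically and can be merged.
        self.seeds = np.random.default_rng(seed).integers(0, 2**31, size=K)

    def _hash(self, i, key):
        v = hash((int(self.seeds[i]), key))
        return v % self.R, (1 if v & 1 else -1)      # (h_i(key), g_i(key))

    def insert(self, key, c=1):
        for i in range(self.K):
            col, sign = self._hash(i, key)
            self.table[i, col] += sign * c           # A[i, h_i(a)] += g_i(a) * c

    def query(self, key):
        # Median of the K unbiased estimates g_i(a) * A[i, h_i(a)].
        return float(np.median([sign * self.table[i, col]
                                for i in range(self.K)
                                for col, sign in [self._hash(i, key)]]))

    def merge(self, other):
        self.table += other.table                    # valid when seeds match
        return self

cs1, cs2 = CountSketch(), CountSketch()
for k in [1, 1, 2]: cs1.insert(k)
for k in [1, 3]: cs2.insert(k)
print(cs1.merge(cs2).query(1))   # ~3: total count of key 1 across both streams
```

The merge step is what enables the distributed pipeline from the abstract: sketch locally, add the tables centrally, then query or sample as if the sketch had seen all the data.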
2.2 Histograms and Kernel Density Estimation

Histograms and KDEs [3, 11] are popular methods for estimating the density of a distribution given a finite i.i.d. sample of size $n$ drawn from the true density $f(x)$.
A histogram divides the support $S \subset \mathbb{R}^d$ of the data into multiple partitions. It then uses the counts in every partition to predict the density $\hat f_H(x)$ at a point $x$. Formally, the density predicted at a point $x \in S$ is
\[
\hat f_H(x) = \frac{c(bin(x))}{n \cdot volume(bin(x))},
\]
where $bin(x)$ identifies the partition in which $x$ lies, $c(\cdot)$ counts the number of samples in that partition, and $volume(\cdot)$ measures the partition volume. A regular histogram uses hypercube partitions aligned with the data axes with a width $h$, called the smoothing parameter. As $h$ increases, the estimate's bias increases and its variance decreases. Histograms suffer from bin-edge problems, where a slight change in the data across a bin's edge can change predictions. One solution to the bin-edge problem is the Averaged Shifted Histogram (ASH) [3], which uses the same bin partitions but with shifted origins. Consider a one-dimensional histogram with width $h$: an ASH with $m$ histograms has origins at $0, \frac{h}{m}, \frac{2h}{m}, \dots, \frac{(m-1)h}{m}$. The density estimate is then the average of the density estimates obtained from the different histograms. Asymptotically, as $m \to \infty$, the ASH converges to a KDE with the triangle kernel.

Kernel density estimation gives another, smoother estimate of $f(x)$, which resolves the bin-edge problem of histograms. For a given kernel function $k(x, y) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and dataset $D$, the KDE at a point $x$ is defined as
\[
\hat f_K(x) = KDE(x) = \frac{1}{n} \sum_{i \in [1, n],\, x_i \in D} k(x, x_i).
\]
Kernel functions are generally positive, symmetric, and integrate to 1. The Gaussian, Epanechnikov, and uniform kernels [2] are among the most widely used. The kernel function is also parameterized by a smoothing parameter $h$: it determines the variance of the Gaussian kernel, while for the uniform and Epanechnikov kernels it determines the size of the window around $x$ where the function is non-zero. Again, as $h$ increases, the bias increases and the variance decreases.

Histograms and KDEs cannot provide unbiased estimates of the true density [12]; hence, mean square error (MSE) is used to analyze them. Both of these estimators uniformly converge to the underlying true distribution asymptotically. However, both suffer from the curse of dimensionality: to get a decent estimate of the density in high dimensions, the number of samples needed is exponential in the dimension. For the density estimation task, dimensions of 4-50 are generally considered large [13].
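To make the histogram estimator concrete, here is a minimal illustrative implementation (ours, not from the paper) of $\hat f_H$ with $bin(x)_i = \lfloor x_i / h \rfloor$:

```python
import numpy as np
from collections import Counter

def bin_id(x, h):
    """Regular histogram bin: bin(x)_i = floor(x_i / h)."""
    return tuple(np.floor(np.asarray(x) / h).astype(int))

def fit_histogram(data, h):
    """Count how many of the n samples fall in each occupied bin."""
    return Counter(bin_id(x, h) for x in data)

def f_H(x, counts, n, h, d):
    """f_H(x) = c(bin(x)) / (n * volume(bin(x))), volume = h^d."""
    return counts.get(bin_id(x, h), 0) / (n * h ** d)

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 2))        # n i.i.d. samples, d = 2
counts = fit_histogram(data, h=0.25)
print(f_H([0.0, 0.0], counts, n=len(data), h=0.25, d=2))  # ~1/(2*pi) ~ 0.159 for N(0, I)
```

Note that only the occupied bins are stored; the Density Sketch of Section 3 compresses exactly this sparse count table.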
2.3 Random Partitions and KDE

The density estimate at $x$ using a histogram with a randomly drawn partition is
\[
\hat f(x) \propto \frac{1}{n} \sum_{i \in [1, n],\, x_i \in D} \mathbb{I}(x_i \in bin(x)).
\]
The expected value of this estimate over random partitions looks very similar to a KDE; the subscript $p$ in the expectation makes explicit that the expectation is over the random partitions:
\[
E_p(\hat f(x)) \propto \frac{1}{n} \sum_{i \in [1, n],\, x_i \in D} P(x_i \in bin(x)).
\]
The kernel of this KDE is the probability of collision between the query point $x$ and the data point $x_i$. For example, randomly shifted regular histograms can approximate the triangle kernel, i.e., the asymptotic ASH. Of course, to get a reasonable estimate, we would need to construct multiple random partitions and combine the estimates obtained from them. Depending on the partitioning scheme, we can obtain estimators for different kernels. A recent paper [14] observes this connection and shows that, using different LSH functions, it is possible to approximate the corresponding kernels. It is useful to note that if we use L1-LSH or L2-LSH functions [15], the partition is a grid of parallelepipeds of random shape.

2.4 Uniform Sampling from Convex Bodies

Uniform sampling from convex spaces is a well-studied problem [16, 17]. For general convex polytopes, it is achieved by finding a point inside the polytope using convex feasibility algorithms and then running an MCMC walk inside the polytope to generate a point with uniform probability. For regular convex polytopes such as hypercubes and parallelepipeds, uniform sampling is much simpler. Sampling a point uniformly at random in a $d$-dimensional hypercube of width 1 is equivalent to sampling $d$ real values uniformly in the interval $[0, 1]$. For sampling within a $d$-dimensional parallelepiped, we first locate a $(d-1)$-dimensional hyperplane parallel to each face at a distance drawn uniformly from $[0, h]$, where $h$ is the width of the parallelepiped in that direction. The sampled point is then the intersection of these $(d-1)$-dimensional hyperplanes; a minimal sketch of this procedure appears after the notation below.

2.5 Notation

The data $D$ consists of $n$ i.i.d. samples of dimension $d$ drawn from the true distribution $f(x) : \mathbb{R}^d \to \mathbb{R}$. $bin(x)$ is the ID of the partition in which the point $x$ falls. For regular histograms and RACE-style partitioning, $bin(x) : \mathbb{R}^d \to \mathbb{N}^d$, and each bin can be identified with a unique tuple of $d$ integers; for example, in a regular histogram with width $h$, $bin(x)_i = \lfloor x_i / h \rfloor$. The shapes of the partitions for regular histograms and RACE-style partitioning are regular, so simple algorithms can be used to sample a point inside a bin. The functions $bin(x)$ and the sampling algorithms for some partitioning schemes are given in Table 1.
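Before Table 1, here is a minimal sketch of the parallelepiped sampling step from Section 2.4 (illustrative code with our own variable names; solving $W s = y - b$ maps a uniform point in the projected box back into the cell):

```python
import numpy as np

# Illustrative sketch: sample uniformly inside one cell of the random
# parallelepiped grid induced by d LSH functions bin(x)_i = floor((<x, w_i> + b_i) / h).
# The cell with integer id `bin_id` is {s : W s + b in [h*bin_id, h*(bin_id + 1))},
# so we pick a uniform offset r in [0, 1)^d and solve W s = y - b.

rng = np.random.default_rng(1)
d, h = 3, 0.5
W = rng.normal(size=(d, d))        # rows are the LSH projection vectors w_i
b = rng.uniform(0, h, size=d)      # LSH offsets b_i

def sample_in_cell(bin_id):
    r = rng.uniform(0, 1, size=d)          # uniform in the unit cell
    y = h * (np.asarray(bin_id) + r)       # uniform point in the projected box
    return np.linalg.solve(W, y - b)       # map back: W s = y - b

s = sample_in_cell(bin_id=(2, -1, 0))
assert np.array_equal(np.floor((W @ s + b) / h).astype(int), [2, -1, 0])
```

Because the map $y \mapsto W^{-1}(y - b)$ is linear, a uniform sample in the axis-aligned box maps to a uniform sample in the parallelepiped cell.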
Table 1: $bin(x)$ and within-bin sampling for different partitioning schemes.

1) Regular histogram (parameter: width $h$):
   $bin(x)_i = \lfloor x_i / h \rfloor$;
   sample $r \in \mathbb{R}^d$, $r_i \sim Uniform[0, 1]$, and return $s = h\,(bin_{id} + r)$.
2) Aligned histogram (parameters: widths $h = (h_1, \dots, h_d)$):
   $bin(x)_i = \lfloor x_i / h_i \rfloor$;
   sample $r_i \sim Uniform[0, 1]$ and return $s_i = h_i\,(bin_{id, i} + r_i)$.
3) Random partitions using $d$ L1/L2 LSH functions (parameters: $W \in \mathbb{R}^{d \times d}$ with rows $w_1, \dots, w_d$; offsets $b = (b_1, \dots, b_d)$; width $h$):
   $bin(x)_i = \lfloor (\langle x, w_i \rangle + b_i) / h \rfloor$;
   sample $r_i \sim Uniform[0, 1]$, set $y = h\,(bin_{id} + r)$, then solve $W s = y - b$ for $s$.
3 Density Sketches

3.1 Data Structures

For a given partitioning scheme $bin(\cdot)$, a Density Sketch is parameterized by $(K, R, H)$ and comprises two data structures:

mem: a count sketch with range $R$ and $K$ repetitions, as described in Section 2.

heap: an augmented min-heap of size $H$, used together with mem to track the heaviest bins.

3.2 Constructing the Sketch

A histogram has a number of partitions exponential in $d$; hence, we cannot directly build or store a histogram in high dimensions. However, most high-dimensional real data lies in clusters, making the histogram highly sparse in high dimensions, so the histogram is an ideal candidate for the heavy-hitter problem. We use a count sketch to store a compressed version of the histogram. To sample from the histogram, we sample a bin with probability proportional to its count and then choose a random point in that bin. With an exponential number of bins this is computationally prohibitive, but it becomes achievable if we only consider and store the heavy bins. However, recovering the heavy bins directly from the count sketch requires enumerating all possible bins; hence, we additionally store the heaviest bins in a min-heap, which can be updated while data is inserted into the sketch.

As shown in Figure 2 and Algorithm 1, we process the data in a streaming fashion. For each data point $x$, we first find the partition $bin_{id} = bin(x)$ and increment the count of this partition by inserting $(bin_{id}, 1)$ into mem. Along with each insertion, we also update the heap. If the heap is not at capacity, we insert $bin_{id}$ along with its updated count. If the heap is at capacity, we compare $bin_{id}$'s updated count against the minimum of the heap; if $bin_{id}$'s count is greater, we pop the minimum element and insert $(bin_{id}, count)$. The heap is ordered by the counts of the inserted bin ids.

Algorithm 1: Constructing a density sketch of $f(x)$
  Result: Density Sketch
  $f(x) : \mathbb{R}^d \to \mathbb{R}$: true distribution
  $x_1, x_2, \dots, x_n \sim f(x)$: sample drawn from $f(x)$
  $bin(x) : \mathbb{R}^d \to \mathbb{N}^d$: bin to which $x$ belongs
  mem: CountSketch($R$, $K$): sketch with range $R$ and $K$ repetitions
  heap: Heap($H$): min-heap storing the top $H$ elements
  for $i \leftarrow 1$ to $n$ do
    $bin_{id} = bin(x_i)$
    mem.insert($bin_{id}$, 1)
    $c$ = mem.query($bin_{id}$)
    heap.update($bin_{id}$, $c$)
  end

3.3 $\hat f_C(x)$: Estimate of the Density at a Point

We can use the sketch to query the density estimate at a particular point. The query procedure is presented in Algorithm 2 and illustrated in Figure 2. The density predicted by the histogram can be written as
\[
\hat f_H(x) = \frac{c(bin(x))}{n \cdot volume(bin(x))}, \qquad volume(bin(x)) = h^d \ \text{(regular histogram)},
\]
where $c(bin(x))$ is the count of data points that lie in $bin(x)$. When using the sketch, instead of the actual $c(bin(x))$ we use its count-sketch estimate $\hat c(bin(x))$, which gives the density estimate
\[
\hat f_C(x) = \frac{\hat c(bin(x))}{n \cdot volume(bin(x))}.
\]
We know from the count-sketch literature that $\hat c(x)$ is closely concentrated around $c(x)$, so we can expect $\hat f_C(x)$ to be close to $\hat f_H(x)$ and hence to $f(x)$. Note that although $\hat f_C(x)$ is a good estimate of the density at a point $x$, the function $\hat f_C(\cdot)$ is not a density function, as it does not integrate to 1.

Algorithm 2: Querying $\hat f_C(y)$, $y \in \mathbb{R}^d$
  Result: $\hat f_C(y)$
  $bin_{id} = bin(y)$
  $count$ = mem.query($bin_{id}$)
  return $count / (n \cdot Volume(bin_{id}))$
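Putting Sections 3.1-3.3 together, the following illustrative sketch (ours, not the paper's released code) implements Algorithms 1 and 2 for the regular-histogram partition, reusing the CountSketch class from Section 2.1 and a plain dict scan in place of a real augmented min-heap:

```python
import numpy as np

class DensitySketch:
    """Sketch of Algorithms 1-2 (illustrative; `mem` is the CountSketch above,
    and a dict with a linear min-scan stands in for the augmented min-heap)."""

    def __init__(self, d, h, K=5, R=2**15, H=1024):
        self.d, self.h, self.H, self.n = d, h, H, 0
        self.mem = CountSketch(K, R)
        self.heap = {}                                  # bin_id -> estimated count

    def _bin(self, x):                                  # bin(x)_i = floor(x_i / h)
        return tuple(np.floor(np.asarray(x) / self.h).astype(int))

    def insert(self, x):                                # Algorithm 1 body
        b = self._bin(x)
        self.n += 1
        self.mem.insert(b, 1)
        c = self.mem.query(b)
        if b in self.heap or len(self.heap) < self.H:
            self.heap[b] = c
        else:
            worst = min(self.heap, key=self.heap.get)   # min of the heap
            if c > self.heap[worst]:                    # evict min, insert (b, c)
                del self.heap[worst]
                self.heap[b] = c

    def density(self, x):                               # Algorithm 2
        return self.mem.query(self._bin(x)) / (self.n * self.h ** self.d)

ds = DensitySketch(d=2, h=0.25)
for x in np.random.default_rng(2).normal(size=(20_000, 2)):
    ds.insert(x)
print(ds.density([0.0, 0.0]))    # roughly 1/(2*pi) ~ 0.159 for N(0, I) data
```

Only the count-sketch table and up to $H$ heavy bin ids are retained; the raw points are discarded as they stream by.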
[Figure 2: Overview of the sketching algorithm.]

3.4 $\hat f^*_C(x)$: Estimate of the Density Function

To obtain a density function from the sketch, we have to normalize the function $\hat f_C(x)$ over the support:
\[
\hat f^*_C(x) \propto \hat c(x), \qquad \hat f^*_C(x) = \frac{\hat c(x)}{\int \hat c(x)\, dx}.
\]
It is easy to check that the integral can be written as a sum over all the bins in the support:
\[
\hat f^*_C(x) = \frac{\hat c(x)}{volume(bin(x)) \sum_{b \in bins} \hat c(b)} = \frac{\hat c(x)}{volume(bin(x))\, \hat n}.
\]
As is clear from the equations for $\hat f^*_C(x)$ and $\hat f_C(x)$, $n = \sum_{b \in bins} c(b)$ is replaced by $\hat n = \sum_{b \in bins} \hat c(b)$ to obtain a density function; one can check that $\hat n$ is an estimate of $n$ built from the sketch's count estimates for each bin.

3.5 $\hat f_S(x)$: Sampling from Density Sketches

The count sketch is a good enough representation for querying the density at a point. However, it is not the best data structure for efficiently generating samples. One naive way to sample from the sketch is to randomly select a point in the support of $f(x)$ and then perform rejection sampling using the estimate $\hat f_C(x)$; given the enormous volume of the support in high dimensions, this method is bound to be immensely inefficient. Another way is to choose a partition with probability proportional to the count of elements in that partition and then sample a random point from the chosen partition. It is easy to check that the probability of sampling a point $x$ in this manner is precisely $\hat f_H(x)$ if we use exact counts, and $\hat f^*_C(x)$ if we use approximate counts from the count sketch. However, since the number of bins is exponential in the dimension, sampling a bin proportional to its count requires prohibitive memory and computation; this is essentially why we needed a count sketch in the first place. Here, we further approximate the distribution by storing only the top $H$ partitions, which contain the most data points, and discarding the others. As mentioned in Section 3.1, we can maintain the top $H$ partitions efficiently with an augmented heap. We then sample a partition present in this heap with probability proportional to its count and sample a random data point from that partition (Algorithm 3). The probability of sampling a data point whose bin is not present in the augmented heap is then zero. The distribution of this sampling algorithm is
\[
\hat f_S(x) = \mathbb{I}(bin(x) \in heap)\, \frac{\hat c(bin(x))}{\hat n_h\, volume(bin(x))},
\]
where $\hat n_h = \sum_{b \in heap} \hat c(b)$ is the count-sketch estimate of the total number of elements captured by all partitions present in the heap, and $\mathbb{I}(\cdot)$ is the 0/1 indicator function of the Boolean statement inside it. Let $ratio_h = \hat n_h / \hat n$ be the capture ratio of the heap. It is easy to see that as the capture ratio tends to 1, $\hat f_S(x)$ tends to $\hat f^*_C(x)$. Note that $\hat f_S(x)$ is indeed a density function.

Algorithm 3: Sample $y \in \mathbb{R}^d$ such that $y \sim \hat f_S$
  Result: $y$: sample from $\hat f_S(y)$
  $P$: multinomial distribution such that
    $P(bin_{id}) = \hat c(bin_{id}) / \hat n_h$ if $bin_{id} \in$ heap,
    $P(bin_{id}) = 0$ if $bin_{id} \notin$ heap
  $bin_{id} \sim P$
  $y$ = UniformRandom(partition($bin_{id}$))
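Continuing the illustrative DensitySketch above, Algorithm 3 might look as follows (a sketch; the clipping of negative count estimates is our own guard, not part of the paper's algorithm):

```python
import numpy as np

def sample(ds, rng=np.random.default_rng(3)):
    """Algorithm 3 sketch: draw y ~ f_S from the heavy bins kept in ds.heap."""
    bins = list(ds.heap)
    counts = np.clip(np.array([ds.heap[b] for b in bins]), 0, None)
    p = counts / counts.sum()                     # P(bin_id) = c(bin_id) / n_h
    b = bins[rng.choice(len(bins), p=p)]          # bin_id ~ P
    r = rng.uniform(0, 1, size=ds.d)              # UniformRandom(partition(bin_id))
    return ds.h * (np.asarray(b) + r)             # regular histogram cell

synthetic = np.array([sample(ds) for _ in range(5_000)])   # surrogate dataset
```

For the LSH-partitioned variant, the last two lines would be replaced by the $W s = y - b$ solve from Section 2.4.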
4 Theoretical Analysis

Histograms and kernel density estimators are well-studied non-parametric estimators of density, and both are capable of approximating a large class of functions [3]. For example, under a Lipschitz continuity condition on $f$, one can prove that the pointwise $MSE(\hat f_H(x))$ converges to 0 at a rate of $O(n^{-2/(d+2)})$; better rates can be obtained for functions with continuous derivatives. In our analysis, we make the assumptions of [3]; specifically, the existence and boundedness of all function-dependent terms that appear in the theorems below. We refer the reader to [3] for an in-depth discussion of these assumptions. We restrict our analysis to convergence in probability for all the estimators discussed in this paper, which is standard [3]. In this section, we consider the regular histogram partitioning scheme and show that our density estimate $\hat f^*_C(x)$ and sampling distribution $\hat f_S(x)$ approximate the underlying distribution $f(x)$ and converge to it. A similar analysis holds for random partitioning schemes / KDEs and is omitted here.

Mean integrated square error (MISE): The MISE of an estimator of a function is a widely used tool for analyzing the performance of a density estimator:
\[
MISE = E \int \big( \hat f(x) - f(x) \big)^2\, dx.
\]
A density estimator whose MISE asymptotically tends to 0 is a consistent estimator of the true density and converges to it in probability. We use this tool to make statements about the convergence of our estimators. By Fubini's theorem, the MISE equals the IMSE (integrated mean square error). Also, for any estimator $\hat e$, $MSE(\hat e) = Variance(\hat e) + Bias(\hat e)^2$. Hence, we can write the IMSE (and hence the MISE) as the sum of the integrated variance (IV) and the integrated squared bias (ISB):
\[
IMSE(\hat f) = \int E\big( (\hat f(x) - f(x))^2 \big)\, dx = IV + ISB.
\]

The remainder of this section is organized as follows. We first state the main theorem of the paper, which relates the sampling distribution of Density Sketches, $\hat f_S(x)$, to the underlying true distribution. We then state various theorems that interrelate the different density estimates used in this paper; the combination of these theorems yields our main theorem. The proofs of all theorems can be found in Appendix A. We briefly interpret each theorem below.

Theorem 1 (Main Theorem: $\hat f_S(x)$ to $f(x)$). The probability density function of sampling, $\hat f_S(x)$, using a Density Sketch with parameters $(K, R, H)$ over a regular histogram of width $h$, created with $n$ i.i.d. samples from the original density function $f(x)$, has an IMSE
\[
\begin{aligned}
IMSE(\hat f_S(x)) \leq\; & 6(1 - ratio_h)^2 + 3(1 + 2\epsilon)\Big(\frac{1}{nh^d} + \frac{R(f)}{n} + o\big(\tfrac{1}{n}\big) + \frac{bins - 1}{KRnh^d}\Big) \\
& + 3(1 + 3\epsilon)\, \frac{h^2 d}{4} R(\|\nabla f\|) + 3\epsilon \Big(1 + 2R(f) + h\sqrt{d} \int_{x \in S} f(x) \|\nabla f\|\Big)
\end{aligned}
\]
with probability $(1 - \delta)$, where $\delta = \frac{bins}{\epsilon^2 n R}$, $bins$ is the number of non-empty bins in the histogram, and $ratio_h$ is the estimated capture ratio described in Section 3.5.

The dependence of the IMSE on properties of $f(x)$, such as its roughness, is standard [3] and cannot be avoided.
Interpretation
The estimator $\hat f_S(x)$ of $f(x)$ is obtained by a series of approximations: $f(x) \to \hat f_H(x) \to \hat f_C(x) \to \hat f^*_C(x) \to \hat f_S(x)$. Hence, to interpret this result, we break it down into multiple theorems, enabling the reader to see which step of the approximation leads to which terms in the theorem above. We urge readers to read the interpretation of each theorem below to obtain a complete picture of the main theorem.

Theorem 2 ($\hat f_H(x)$ to $f(x)$). The IMSE of the estimator $\hat f_H(x)$ using a regular histogram of width $h$ built over $n$ i.i.d. samples drawn from the true distribution $f(x)$ is
\[
IMSE(\hat f_H) \leq \frac{1}{nh^d} + \frac{R(f)}{n} + o\big(\tfrac{1}{n}\big) + \frac{h^2 d}{4} R(\|\nabla f\|).
\]
Specifically, its $IV = \frac{1}{nh^d} - \frac{R(f)}{n} + o(\frac{1}{n})$ and $ISB \leq \frac{h^2 d}{4} R(\|\nabla f\|)$, where $R(\phi)$ is the roughness of the function $\phi$, defined as $R(\phi) = \int_{x \in S} \phi(x)^2\, dx$.

Interpretation: Theorem 2 applies to all functions for which we can apply a Taylor series expansion up to two terms and for which the roughness terms $R(f)$ and $R(\|\nabla f\|)$ exist and are bounded. It is clear that if $h \to 0$ and $nh^d \to \infty$, then $IMSE \to 0$, which implies that $\hat f_H(x)$ converges to $f(x)$ asymptotically. For $nh^d \to \infty$, $n$ must grow at a rate faster than the shrinkage of $h^d$; this is exactly the curse of dimensionality.

Theorem 3 ($\hat f_C(x)$ to $f(x)$). The IMSE of the estimator $\hat f_C(x)$ obtained from the Density Sketch with parameters $(R, K, \_)$ using a histogram of width $h$ built over $n$ i.i.d. samples drawn from the true distribution $f(x)$ is
\[
IMSE(\hat f_C) = IMSE(\hat f_H) + \frac{bins - 1}{KRnh^d},
\]
where $bins$ is the number of non-zero bins in the histogram. Specifically, $IV(\hat f_C) = IV(\hat f_H) + \frac{bins - 1}{KRnh^d}$ and $ISB(\hat f_C) = ISB(\hat f_H)$.

Interpretation:
Note that although $\hat f_C(x)$ is not a density estimator, as $\int \hat f_C(x) \neq 1$, it is still useful to look at the IMSE of this function. It is clear from the theorem above that if $\frac{bins}{KR}$ is bounded, then as $h \to 0$ and $nh^d \to \infty$, $IMSE(\hat f_C)$ tends to 0 and $\hat f_C(x)$ converges to $f(x)$ simultaneously at all points, in probability. As expected, using count sketches adds to the variance of the estimator while keeping the bias unchanged. The term $\frac{bins}{KR}$ captures the memory-accuracy trade-off, as it compares the number of keys inserted into the sketch against the Density Sketch memory ($KR$, excluding the heap).

Theorem 4 ($\hat f^*_C(x)$ to $\hat f_C(x)$). The IMSE of the estimator $\hat f^*_C(x)$ obtained from the Density Sketch with parameters $(R, K, \_)$ using a histogram of width $h$ built over $n$ i.i.d. samples drawn from the true distribution $f(x)$ satisfies
\[
|IMSE(\hat f^*_C) - IMSE(\hat f_C)| \leq \epsilon (N + 2M).
\]
Specifically,
\[
|IV(\hat f^*_C) - IV(\hat f_C)| \leq 2\epsilon M, \qquad |ISB(\hat f^*_C) - ISB(\hat f_C)| \leq \epsilon N,
\]
where
\[
N = (1 + ISB(\hat f_C)), \qquad M \leq IV(\hat f_C) + 2\Big( R(f) + \frac{h^2 d}{4} R(\|\nabla f\|) + h\sqrt{d} \int_{x \in S} f(x) \|\nabla f\| \Big),
\]
with probability $(1 - \delta)$, where $\delta = \frac{bins}{\epsilon^2 n R}$.
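As a worked consequence of Theorem 2 (a standard bias-variance calculation in the style of [3]; the explicit constant is ours, not stated in the paper), balancing the variance term against the bias term gives the optimal bandwidth and rate:
\[
\frac{d}{dh}\left(\frac{1}{nh^d} + \frac{h^2 d}{4} R(\|\nabla f\|)\right) = 0
\;\Rightarrow\;
h^* = \left(\frac{2}{R(\|\nabla f\|)}\right)^{\frac{1}{d+2}} n^{-\frac{1}{d+2}},
\qquad
IMSE(\hat f_H)\Big|_{h = h^*} = O\!\big(n^{-\frac{2}{d+2}}\big).
\]
This recovers the $O(n^{-2/(d+2)})$ rate quoted at the start of this section and makes the curse of dimensionality explicit: the exponent degrades as $d$ grows.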
[Figure 3: Visualization of samples drawn from a Density Sketch. (a) Sample drawn from the true distribution. (b) Sample drawn from the Density Sketch. (c) MNIST samples from Density Sketches with widths 0.01, 0.1, 1, 10, from top to bottom.]
Table 2: Classification datasets from [18] with dimension < 500 and number of samples per class > 100,000.

Dataset             Dimension   Samples   Zipped size
skin/nonskin        3           180K      670KB
susy                18          4.5M      0.8GB
higgs               28          10M       2.5GB
webspam (unigram)   254         300K      120MB

Interpretation:
We provide a probabilistic $(\epsilon, \delta)$ guarantee on how $\hat f^*_C(x)$ approaches $\hat f_C(x)$ as the parameters of the Density Sketch and the number of data points change. The $\epsilon, \delta$ pair captures the accuracy-memory trade-off arising from the restricted memory of the Density Sketch (excluding the heap). Using a sufficiently large $R$, we can essentially control the deviation of $\hat f^*_C$ from $\hat f_C$.

Lemma 1 ($\hat f_S(x)$ to $\hat f^*_C(x)$). The estimators $\hat f_S(x)$ and $\hat f^*_C(x)$, obtained from the Density Sketch with parameters $(R, K, H)$ using a histogram of width $h$ built over $n$ i.i.d. samples drawn from the true distribution, satisfy
\[
\int |\hat f^*_C(x) - \hat f_S(x)|\, dx = 2(1 - ratio_h),
\]
where $ratio_h$ is the capture ratio defined in Section 3.5.

Using the lemma above, we obtain a bound on the IMSE as follows.
Theorem 5 ($\hat f_S(x)$ to $\hat f^*_C(x)$). The IMSE of the estimator $\hat f_S(x)$ obtained from the Density Sketch with parameters $(R, K, H)$ using a histogram of width $h$ built over $n$ i.i.d. samples drawn from the true distribution $f(x)$ is
\[
IMSE(\hat f_S(x)) \leq 6(1 - ratio_h)^2 + 3\, IMSE(\hat f^*_C(x)),
\]
where $ratio_h$ is the capture ratio defined in Section 3.5.

Interpretation:
From Lemma 1, we can infer that for a sufficiently large $H$, when $ratio_h = 1$, $\hat f^*_C(x)$ and $\hat f_S(x)$ are exactly the same simultaneously at all points in the support. Specifically, $ratio_h$ for a particular $H$ captures how strongly the accuracy of $\hat f_S(x)$ relative to $\hat f^*_C(x)$ depends on the data. If the data is clustered into various pockets of the space, the heap captures almost all of the data, so $ratio_h$ is essentially close to 1, suggesting that $\hat f_S$ will be close to $\hat f^*_C$. Similar behavior occurs if the distribution of data over bins follows a power law, implying that most data is concentrated in a few bins, which will be captured in the heap. If the data is scattered, then $ratio_h$ is worse, and the two estimators diverge.

5 Experiments

Figure 3 (a) shows samples drawn from the actual multi-Gaussian distribution, whereas Figure 3 (b) shows samples drawn from the Density Sketch (DS) built on the samples from the true distribution. As can be seen, the two samples are indistinguishable. Figure 3 (c) shows samples drawn from a DS built over the MNIST dataset [18]. The DS gives sensible samples, identifiable as digits by the naked eye, across different partition widths. In these experiments, we used L2-LSH random partitioning.
For most datasets, it is not possible to inspect samples visually. Hence, we evaluate the quality of samples from a DS by using them to train classification models.
Datasets:
To perform a fair evaluation, we chose all datasets from the LIBSVM website [18] that satisfy the constraints of 1) data dimension less than 500 and 2) number of samples per class greater than 100,000; large datasets are the main application domain for DS. The datasets so obtained are listed in Table 2. We use L2-LSH random partitions with a bandwidth of 0.01.
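A hedged sketch of the evaluation protocol (our reading of the setup; the paper does not publish this exact script, and `X_train`, `y_train`, `X_test`, `y_test` are assumed dataset splits): build one DS per class, re-sample a synthetic training set, and train a standard classifier on it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed available from the earlier sketches: DensitySketch and sample().
def synthetic_trainset(X, y, n_per_class, h=0.25):
    Xs, ys = [], []
    for label in np.unique(y):
        ds = DensitySketch(d=X.shape[1], h=h)
        for x in X[y == label]:
            ds.insert(x)                      # stream the class data into a DS
        Xs.append([sample(ds) for _ in range(n_per_class)])
        ys.append(np.full(n_per_class, label))
    return np.vstack(Xs), np.concatenate(ys)

# Model trained on DS samples, evaluated on held-out real data:
# X_syn, y_syn = synthetic_trainset(X_train, y_train, n_per_class=10_000)
# print(LogisticRegression(max_iter=1000).fit(X_syn, y_syn).score(X_test, y_test))
```

Only the per-class sketches need to be stored or shipped; the synthetic training set is regenerated on demand.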
Results:
As can be seen, for the "higgs" dataset, the accuracy achieved by a model trained on the original 2.5GB of data can be achieved using a DS of size 50MB: around 50x compression! We see similar results for the "skin" (670KB) and "susy" (0.8GB) datasets as well. The results show that a DS is much more informative than a random sample. The dimension of the webspam data is 254, and the DS for this dataset is at a disadvantage due to the curse of dimensionality: the number of data points required for good estimates is exponential in the dimension (see Section 6 for a qualitative discussion of this aspect). For these reasons, the performance of DS is comparatively poor on the "webspam" dataset, although it still beats random sampling in this experiment.
Other Baselines:
We considered comparing against other, more sophisticated baselines. However, none are comparable to DS in their purpose and utility.
[Figure 4: (a) skin/nonskin, (b) higgs, (c) susy, (d) webspam (unigram). Classification results (top): test accuracy of models trained on samples drawn from density sketches and on a random sample; models trained on the original data are shown as horizontal lines. When compared at the accuracy on the original data, we get substantial compression with Density Sketches on "skin", "susy", and "higgs" (around 50x on "higgs"; see the Results paragraph above). Estimation error (bottom): mean square error of the covariance matrix predicted from samples drawn from Density Sketches and from a random sample, relative to the empirical covariance matrix of the original data. Datasets as in Table 2.]

(1) Coresets for sampling would first require a KDE estimate, which has a huge memory cost to construct the point set. Moreover, despite recent progress toward coresets in the streaming setting [19], coresets remain challenging to implement for KDE problems [20]. (2) Algorithms like k-means clustering are inappropriate for streaming datasets and do not have the mergeability property of DS. An alternative approach is to select points based on importance sampling [20], geometric properties [21], and other sampling techniques [22]; however, recent experiments show that for many real-world datasets, random samples have competitive performance with these approaches [14]. Dimensionality reduction via random projections is another way to reduce size; however, given the relatively small dimension of these datasets, it is unlikely to work well.
We evaluate the DS on another task: estimating various properties of the distribution on the same datasets. Specifically, in this section we show that samples drawn from the DS can be used effectively to estimate the covariance matrix of the underlying distribution. See Figure 4 for plots. The prediction from the DS is superior to a random sample except on "webspam". Again, due to the high-dimensional data, the DS does not perform well there; in this case it is worse than a random sample.
6 Discussion

As the theory shows, many factors affect the performance of a DS, among them $nh^d$, $ratio_h$, and $bins/KR$. In our experiments, "webspam" has a higher $d$ and a smaller $n$ than, say, "higgs"; hence the quality of the histogram is poor for webspam, which translates to the DS as well. Another factor affecting the sampling distribution is the capture ratio $ratio_h$: we observe that the data is more scattered in "webspam", so its $ratio_h$ is worse than for the other datasets, adversely affecting the results.

In the context of density estimators, the curse of dimensionality implies 1) that the number of data points required to get decent density estimates increases exponentially with dimension, and 2) that the number of bins in a histogram is exponential in dimension. As Density Sketches are built over histograms, they inherit this curse from histograms. With increased data collection, however, the issue of the unavailability of large amounts of data is fast vanishing, and we want to emphasize that the advantages of Density Sketches are best seen when the data is humongous: Density Sketches can absorb tons of data and give better density estimates and samples without increasing their memory usage. Also, most real data in high dimensions is clustered or lies on a low-dimensional manifold. Density Sketches throw away empty bins and only store the histogram's populated bins. Hence, Density Sketches deal with the curse of dimensionality better than histograms.
7 Conclusion

We introduce Density Sketches (DS): a sketch summarizing the density of a data distribution from samples. With their very small size, streaming construction, sketch properties, and capability to sample, DS have the potential to change the way we store, communicate, and distribute data. We provide code for Density Sketches in the supplementary material.
References

[1] George Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer Science & Business Media, 2013.
[2] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, New York, 2001.
[3] David W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.
[4] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint arXiv:1406.2661, 2014.
[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.
[6] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211-407, 2014.
[7] Graham Cormode, Magda Procopiuc, Divesh Srivastava, and Thanh T. L. Tran. Differentially private publication of sparse data. arXiv preprint arXiv:1103.0825, 2011.
[8] Graham Cormode and S. Muthukrishnan. Count-min sketch. 2009.
[9] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693-703. Springer, 2002.
[10] Amit Chakrabarti. Data stream algorithms (lecture notes), ~ac/Teach/data-streams-lecnotes.pdf, 2020.
[11] David W. Scott and Stephan R. Sain. Multi-dimensional density estimation. In Handbook of Statistics, volume 23: Data Mining and Computational Statistics, eds. C. R. Rao and E. J. Wegman, 2004.
[12] Richard A. Davis, Keh-Shin Lii, and Dimitris N. Politis. Remarks on some nonparametric estimates of a density function. In Selected Works of Murray Rosenblatt, pages 95-100. Springer, 2011.
[13] Zhipeng Wang and David W. Scott. Nonparametric density estimation for high-dimensional data: algorithms and applications. Wiley Interdisciplinary Reviews: Computational Statistics, 11(4):e1461, 2019.
[14] Benjamin Coleman and Anshumali Shrivastava. Sub-linear RACE sketches for approximate kernel density estimation on streaming data. In Proceedings of The Web Conference 2020, pages 1739-1749, 2020.
[15] Trevor Darrell, Piotr Indyk, and Gregory Shakhnarovich. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. MIT Press, 2005.
[16] Claude J. P. Bélisle, H. Edwin Romeijn, and Robert L. Smith. Hit-and-run algorithms for generating multivariate distributions. Mathematics of Operations Research, 18(2):255-266, 1993.
[17] Yuansi Chen, Raaz Dwivedi, Martin J. Wainwright, and Bin Yu. Vaidya walk: A sampling algorithm based on the volumetric barrier. Pages 1220-1227. IEEE, 2017.
[18] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011.
[19] Jeff M. Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63(4):867-887, 2020.
[20] Moses Charikar and Paris Siminelakis. Hashing-based estimators for kernel density in high dimensions. Pages 1032-1043. IEEE, 2017.
[21] Efren Cruz Cortes and Clayton Scott. Sparse approximation of a kernel mean. IEEE Transactions on Signal Processing, 65(5):1310-1323, 2016.
[22] Yutian Chen, Max Welling, and Alex Smola. Super-samples from kernel herding. arXiv preprint arXiv:1203.3472, 2012.
[23] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.
Appendix A: Proofs

A.1 $\hat f_H$ to $f$

While estimating the true distribution $f(x) : \mathbb{R}^d \to \mathbb{R}$, the integrated mean square error (IMSE) of the estimator $\hat f_H(x)$, using a regular histogram with width $h$ and $n$ samples, is
\[
IMSE(\hat f_H) \leq \frac{1}{nh^d} + \frac{R(f)}{n} + o\big(\tfrac{1}{n}\big) + \frac{h^2 d}{4} R(\|\nabla f\|).
\]
Specifically, its $IV = \frac{1}{nh^d} - \frac{R(f)}{n} + o(\frac{1}{n})$ and $ISB \leq \frac{h^2 d}{4} R(\|\nabla f\|)$, where the roughness of a function $\phi$ is $R(\phi) = \int \phi(x)^2\, dx$.

Proof. Let $x \in S$, where $S$ is the support of the distribution, and let $V(x)$ denote the volume of the bin in which $x$ lies (equivalently, $V(b)$ for a bin $b$). For a regular histogram, $V(x) = h^d$. The estimator $\hat f_H(x)$ is defined as
\[
\hat f_H(x) = \frac{1}{n V(bin(x))} \sum_{i=1}^{n} \mathbb{I}(x_i \in bin(x)). \tag{1}
\]
First, consider the integrated variance:
\[
IV = \int_{x \in S} Var(\hat f_H(x))\, dx = \sum_{b \in bins(S)} \int_{x \in b} Var(\hat f_H(x))\, dx. \tag{2}
\]
For a particular bin $b$, the variance is constant over all values of $x$ in the bin. Also, for a particular $x$ in bin $b$, the independence of the samples gives
\[
Var(\hat f_H(x)) = \frac{1}{n V(bin(x))^2}\, Var\big(\mathbb{I}(x_i \in bin(x))\big), \tag{3}
\]
and $Var(\mathbb{I}(x_i \in b)) = p_b (1 - p_b)$, where $p_b = \int_{x \in b} f(x)\, dx$ is the probability of $x_i$ lying in bin $b$. Using this in equation 2,
\[
IV = \sum_{b \in bins(S)} V(b)\, \frac{1}{n V(b)^2}\, p_b (1 - p_b) = \sum_{b \in bins(S)} \frac{1}{n V(b)}\, p_b (1 - p_b). \tag{4, 5}
\]
For a regular histogram, $V(b)$ is the same across bins, so
\[
IV = \frac{1}{n V(b)} \Big( \sum_{b \in bins(S)} p_b - \sum_{b \in bins(S)} p_b^2 \Big) = \frac{1}{n V(b)} \Big( 1 - \sum_{b \in bins(S)} p_b^2 \Big). \tag{6}
\]
By the mean value theorem, $p_b = V(b) f(\xi_b)$ for some point $\xi_b \in b$, so
\[
\sum_{b \in bins} p_b^2 = \sum_{b \in bins} V(b)^2 f(\xi_b)^2 = V(b) \sum_{b \in bins} V(b) f(\xi_b)^2. \tag{7}
\]
By the Riemann integral approximation, as the bin size shrinks,
\[
\sum_{b \in bins} V(b) f(\xi_b)^2 = \int_{x \in S} f(x)^2\, dx + o(1) = R(f) + o(1), \tag{8}
\]
where $R(f) = \int_{x \in S} f(x)^2\, dx$ is the roughness of $f$. Hence,
\[
IV = \frac{1}{n V(b)} \big( 1 - V(b)(R(f) + o(1)) \big) = \frac{1}{n V(b)} - \frac{R(f)}{n} + o\big(\tfrac{1}{n}\big). \tag{9, 10}
\]
Putting $V(b) = h^d$,
\[
IV = \frac{1}{n h^d} - \frac{R(f)}{n} + o\big(\tfrac{1}{n}\big), \tag{11}
\]
and keeping only the leading term, $IV = O(\frac{1}{nh^d})$. (12)

Now consider the ISB of this estimator:
\[
ISB(\hat f_H) = \int_{x \in S} \big( E(\hat f_H(x)) - f(x) \big)^2\, dx. \tag{13}
\]
The expectation of the estimator is
\[
E(\hat f_H(x)) = \frac{1}{V(bin(x))} \int_{t \in bin(x)} f(t)\, dt. \tag{14}
\]
To be clear, $x \in \mathbb{R}^d$; we treat it as a vector below. Using the second-order multivariate Taylor series expansion of $f(t)$ around $x$,
\[
f(t) = f(x) + \langle t - x, \nabla f(x) \rangle + \tfrac{1}{2} (t - x)^\top H(f(x))\, (t - x), \tag{15}
\]
where $H(f(t))$ is the Hessian of $f$ at $t$. Without loss of generality, consider $bin(x) = [0, h]^d$, the bin at the origin. Then
\[
\int_{t \in bin} f(t)\, dt = f(x)\, h^d + h^d \Big\langle \tfrac{h}{2}\mathbf{1} - x, \nabla f(x) \Big\rangle + O(h^{d+2}). \tag{16}
\]
Using equation 16 in equation 14, we get
\[
E(\hat f_H(x)) = f(x) + \Big\langle \tfrac{h}{2}\mathbf{1} - x, \nabla f(x) \Big\rangle + O(h^2), \tag{17}
\]
and hence, keeping only the leading term,
\[
Bias(\hat f_H(x)) = \Big\langle \tfrac{h}{2}\mathbf{1} - x, \nabla f(x) \Big\rangle. \tag{18}
\]
Now,
\[
\int_{x \in b} Bias(\hat f_H(x))^2\, dx = \int_{x \in b} \Big\langle \tfrac{h}{2}\mathbf{1} - x, \nabla f(x) \Big\rangle^2\, dx. \tag{19}
\]
By the Cauchy-Schwarz inequality,
\[
\int_{x \in b} Bias(\hat f_H(x))^2\, dx \leq \int_{x \in b} \Big\| \tfrac{h}{2}\mathbf{1} - x \Big\|^2\, \|\nabla f(x)\|^2\, dx. \tag{20}
\]
Since $[h/2, h/2, \dots, h/2]$ is the midpoint of the bin, the maximum norm of $x - \frac{h}{2}\mathbf{1}$ is $h\sqrt{d}/2$, so
\[
\int_{x \in b} Bias(\hat f_H(x))^2\, dx \leq \frac{h^2 d}{4} \int_{x \in b} \|\nabla f(x)\|^2\, dx. \tag{21}
\]
Summing over bins,
\[
ISB = \sum_{b \in bins} \int_{x \in b} Bias(\hat f_H(x))^2\, dx \leq \frac{h^2 d}{4} \int_{x \in S} \|\nabla f(x)\|^2\, dx = \frac{h^2 d}{4} R(\|\nabla f\|). \tag{22, 23}
\]
A.2 $\hat f_C$ to $\hat f_H$

While estimating the true distribution $f(x) : \mathbb{R}^d \to \mathbb{R}$, the integrated mean square error (IMSE) of the estimator $\hat f_C(x)$, using a regular histogram with width $h$, $n$ samples, and a count sketch with parameters (R: range, K: repetitions) and average recovery, is
\[
IMSE(\hat f_C) = IMSE(\hat f_H) + \frac{bins - 1}{KRnh^d},
\]
where $bins$ is the number of non-zero bins/partitions. Specifically, $IV(\hat f_C) = IV(\hat f_H) + \frac{bins - 1}{KRnh^d}$ and $ISB(\hat f_C) = ISB(\hat f_H)$.

Proof. Consider a count sketch with range $R$ and a single repetition, parameterized by the randomly drawn hash functions $g : bin \to \{0, 1, 2, \dots, R-1\}$ and $s : bin \to \{-1, +1\}$. The estimate of the density at a point $x$ can then be written as
\[
\hat f_C(x) = \frac{1}{n V(bin(x))} \Big( c(bin(x)) + \sum_{i=1}^{n} \mathbb{I}\big(x_i \notin bin(x) \wedge g(bin(x_i)) = g(bin(x))\big)\, s(bin(x_i))\, s(bin(x)) \Big), \tag{24}
\]
which we can rewrite as
\[
\hat f_C(x) = \hat f_H(x) + \frac{1}{n V(bin(x))} \sum_{i=1}^{n} \mathbb{I}\big(x_i \notin bin(x) \wedge g(bin(x_i)) = g(bin(x))\big)\, s(bin(x_i))\, s(bin(x)), \tag{25}
\]
where $c(\cdot)$ is the count and $V(\cdot)$ the volume of a bin. As $E(s(b)) = 0$, it can be clearly seen that
\[
E(\hat f_C(x)) = E(\hat f_H(x)), \tag{26}
\]
and hence it follows that $ISB(\hat f_C(x)) = ISB(\hat f_H(x))$. (27)

It can be checked that each of the terms in the summation on the right-hand side of equation 25, including the terms in $\hat f_H(x)$, are pairwise uncorrelated (covariance 0). Hence, we can write the variance of the estimator as
\[
Var(\hat f_C(x)) = Var(\hat f_H(x)) + \frac{1}{n V(bin(x))^2}\, Var\big(\mathbb{I}(x_i \notin bin(x) \wedge g(bin(x_i)) = g(bin(x)))\, s(bin(x_i))\, s(bin(x))\big) \tag{28}
\]
\[
= Var(\hat f_H(x)) + \frac{1}{n V(bin(x))^2}\, E\big(\mathbb{I}(x_i \notin bin(x) \wedge g(bin(x_i)) = g(bin(x)))\big) \tag{29}
\]
\[
= Var(\hat f_H(x)) + \frac{1}{n V(bin(x))^2}\, (1 - p_{bin(x)})\, \frac{1}{R}. \tag{30}
\]
Hence, the IV is
\[
IV(\hat f_C) = IV(\hat f_H) + \sum_{b \in bins} \int_{x \in b} \frac{1}{n V(b)^2} (1 - p_b) \frac{1}{R}\, dx = IV(\hat f_H) + \sum_{b \in bins} \frac{1}{n V(b)} (1 - p_b) \frac{1}{R}. \tag{31-33}
\]
Assuming standard partitions, $V(b) = h^d$ for all $b$, and using $\sum_b (1 - p_b) = bins - 1$,
\[
IV(\hat f_C) = IV(\hat f_H) + \frac{1}{n h^d} \cdot \frac{bins - 1}{R}. \tag{34}
\]
With average recovery over $K$ repetitions, the analysis extends easily to give
\[
IV(\hat f_C) = IV(\hat f_H) + \frac{1}{n h^d} \cdot \frac{bins - 1}{KR}. \tag{35}
\]
The ISB remains the same in this case.
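A quick numerical sanity check of the single-repetition analysis above (our own illustrative simulation, not from the paper): the count-sketch estimate of a bin's count is unbiased, and its excess variance is the collision noise $\sum_{b \neq a} c(b)^2 / R$, the same mechanism that produces the $(bins - 1)/(KRnh^d)$ inflation of the IV in Theorem 3.

```python
import numpy as np

# For a one-row count sketch with range R, the estimate c_hat(a) = s(a) * A[g(a)]
# satisfies E[c_hat(a)] = c(a) and Var[c_hat(a)] = sum_{b != a} c(b)^2 / R.

rng = np.random.default_rng(0)
R = 32
counts = rng.integers(1, 20, size=100)     # true counts c(b) for 100 bins
a = 0                                      # key whose count we estimate
trials = 50_000
estimates = np.empty(trials)
for t in range(trials):
    g = rng.integers(0, R, size=counts.size)      # fresh random hashes each trial
    s = rng.choice([-1, 1], size=counts.size)
    table = np.zeros(R)
    np.add.at(table, g, s * counts)               # sketch all bins at once
    estimates[t] = s[a] * table[g[a]]
print(estimates.mean(), counts[a])                           # unbiased: mean ~ c(a)
print(estimates.var(), ((counts**2).sum() - counts[a]**2) / R)  # matches collision noise
```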
A.3 $\hat f^*_C$ to $\hat f_C$

While estimating the true distribution $f(x) : \mathbb{R}^d \to \mathbb{R}$, the integrated mean square error (IMSE) of the estimator $\hat f^*_C(x)$, using a regular histogram with width $h$, $n$ samples, and a count sketch with parameters (R: range, K: repetitions), is related to that of the estimator $\hat f_C(x)$ as follows:
\[
IMSE(\hat f_C(x)) - \epsilon (N + 2M) \leq IMSE(\hat f^*_C(x)) \leq IMSE(\hat f_C(x)) + \epsilon (N + 2M).
\]
Specifically,
\[
IV(\hat f_C(x)) - 2\epsilon M \leq IV(\hat f^*_C(x)) \leq IV(\hat f_C(x)) + 2\epsilon M,
\]
\[
ISB(\hat f_C(x)) - \epsilon N \leq ISB(\hat f^*_C(x)) \leq ISB(\hat f_C(x)) + \epsilon N,
\]
where
\[
M \leq IV(\hat f_C(x)) + 2\Big( R(f) + \frac{h^2 d}{4} R(\|\nabla f\|) + h\sqrt{d} \int_{x \in S} f(x) \|\nabla f\| \Big), \qquad N = (1 + ISB(\hat f_C(x))),
\]
with probability $(1 - \delta)$, where $\delta = \frac{bins}{\epsilon^2 n R}$.

Proof. Consider the estimator
\[
\hat f^*_C(x) = \frac{\hat c(bin(x))}{V(bin(x)) \sum_b \hat c(b)} = \hat f_C(x) \cdot \frac{n}{\hat n}, \tag{36}
\]
where $\hat n = \sum_b \hat c(b)$ and $n = \sum_b c(b)$.

$\hat n$ and its relation to $n$. Let us first analyze $\hat n$ and how it relates to $n$:
\[
\hat n = \sum_b \hat c(b) = \sum_b \sum_{i=1}^{n} \Big( \mathbb{I}(x_i \in b) + \mathbb{I}\big(x_i \notin b \wedge g(bin(x_i)) = g(b)\big)\, s(bin(x_i))\, s(b) \Big). \tag{37, 38}
\]
Note that $E(\hat n) = n$. For the variance, observe that most of the terms in the summation have covariance 0, except the terms $Cov(\mathbb{I}(x_i \in b_1), \mathbb{I}(x_i \in b_2))$, which are negatively correlated. Hence,
\[
Var(\hat n) = \sum_{b, i} \Big[ Var(\mathbb{I}(x_i \in b)) + Var\big(\mathbb{I}(x_i \notin b \wedge g(bin(x_i)) = g(b))\, s(bin(x_i))\, s(b)\big) \Big] + \sum_{i} \sum_{b_1 \neq b_2} Cov\big(\mathbb{I}(x_i \in b_1), \mathbb{I}(x_i \in b_2)\big). \tag{39}
\]
We know that $Var(\mathbb{I}(x_i \in b)) = p_b (1 - p_b)$; that $Var(\mathbb{I}(x_i \notin b \wedge g(bin(x_i)) = g(b))\, s(bin(x_i))\, s(b)) = E(\mathbb{I}(x_i \notin b \wedge g(bin(x_i)) = g(b))) = \frac{1 - p_b}{R}$; and that $Cov(\mathbb{I}(x_i \in b_1), \mathbb{I}(x_i \in b_2)) = -p_{b_1} p_{b_2}$. Plugging these values into the previous equation,
\[
Var(\hat n) = n \sum_b p_b (1 - p_b) + n \sum_b \frac{1 - p_b}{R} - n \sum_{b_1 \neq b_2} p_{b_1} p_{b_2} \tag{40}
\]
\[
= n \Big( 1 - \sum_b p_b^2 \Big) + n \sum_b \frac{1 - p_b}{R} - n \Big( \big( \sum_b p_b \big)^2 - \sum_b p_b^2 \Big) \tag{41, 42}
\]
\[
= n \sum_b \frac{1 - p_b}{R}, \tag{43, 44}
\]
using $\sum_b p_b = 1$. Hence,
\[
Var(\hat n) = n\, \frac{bins - 1}{R} < \frac{n \cdot bins}{R}. \tag{45}
\]
Using Chebyshev's inequality,
\[
P(|\hat n - n| > \epsilon n) \leq \frac{Var(\hat n)}{\epsilon^2 n^2} \leq \frac{bins}{\epsilon^2 n R}. \tag{46, 47}
\]
Hence, with probability $(1 - \delta)$, $\delta = \frac{bins}{\epsilon^2 n R}$, $\hat n$ is within a $(1 \pm \epsilon)$ multiplicative factor of $n$.

Relation of pointwise bias and ISB. With probability $1 - \delta$,
\[
\frac{\hat f_C(x)}{1 + \epsilon} \leq \hat f^*_C(x) \leq \frac{\hat f_C(x)}{1 - \epsilon}. \tag{48}
\]
As expectations respect inequalities,
\[
\frac{E(\hat f_C(x))}{1 + \epsilon} \leq E(\hat f^*_C(x)) \leq \frac{E(\hat f_C(x))}{1 - \epsilon}, \tag{49}
\]
\[
\frac{E(\hat f_C(x))}{1 + \epsilon} - f(x) \leq Bias(\hat f^*_C(x)) \leq \frac{E(\hat f_C(x))}{1 - \epsilon} - f(x), \tag{50}
\]
\[
\frac{Bias(\hat f_C(x)) - \epsilon f(x)}{1 + \epsilon} \leq Bias(\hat f^*_C(x)) \leq \frac{Bias(\hat f_C(x)) + \epsilon f(x)}{1 - \epsilon}. \tag{51, 52}
\]
Integrating again respects the inequalities, and using $\int f(x)\, dx = 1$,
\[
\frac{ISB(\hat f_C(x)) - \epsilon}{1 + \epsilon} \leq ISB(\hat f^*_C(x)) \leq \frac{ISB(\hat f_C(x)) + \epsilon}{1 - \epsilon}. \tag{53, 54}
\]
Using a first-order Taylor expansion in $\epsilon$ and ignoring squared terms,
\[
(1 - \epsilon)\, ISB(\hat f_C(x)) - \epsilon \leq ISB(\hat f^*_C(x)) \leq (1 + \epsilon)\, ISB(\hat f_C(x)) + \epsilon, \tag{55}
\]
\[
ISB(\hat f_C(x)) - \epsilon \big( 1 + ISB(\hat f_C(x)) \big) \leq ISB(\hat f^*_C(x)) \leq ISB(\hat f_C(x)) + \epsilon \big( 1 + ISB(\hat f_C(x)) \big). \tag{56}
\]
Hence,
\[
ISB(\hat f_C(x)) - \epsilon N \leq ISB(\hat f^*_C(x)) \leq ISB(\hat f_C(x)) + \epsilon N, \tag{57}
\]
where $N = (1 + ISB(\hat f_C(x)))$.
Pointwise variance and IV. By similar arguments,
\[
\frac{E(\hat f_C(x)^2)}{(1 + \epsilon)^2} - \frac{E(\hat f_C(x))^2}{(1 - \epsilon)^2} \leq Var(\hat f^*_C(x)) \leq \frac{E(\hat f_C(x)^2)}{(1 - \epsilon)^2} - \frac{E(\hat f_C(x))^2}{(1 + \epsilon)^2}. \tag{58}
\]
Again making first-order Taylor expansions of the denominators and ignoring squared terms,
\[
Var(\hat f_C(x)) - 2\epsilon \big( E(\hat f_C(x)^2) + E(\hat f_C(x))^2 \big) \leq Var(\hat f^*_C(x)) \leq Var(\hat f_C(x)) + 2\epsilon \big( E(\hat f_C(x)^2) + E(\hat f_C(x))^2 \big). \tag{59}
\]
Since $Var(\hat f_C(x)) = E(\hat f_C(x)^2) - E(\hat f_C(x))^2$,
\[
Var(\hat f_C(x)) - 2\epsilon \big( Var(\hat f_C(x)) + 2 E(\hat f_C(x))^2 \big) \leq Var(\hat f^*_C(x)) \leq Var(\hat f_C(x)) + 2\epsilon \big( Var(\hat f_C(x)) + 2 E(\hat f_C(x))^2 \big), \tag{60}
\]
\[
IV(\hat f_C(x)) - 2\epsilon \Big( IV(\hat f_C(x)) + 2 \int_{x \in S} E(\hat f_C(x))^2 \Big) \leq IV(\hat f^*_C(x)) \leq IV(\hat f_C(x)) + 2\epsilon \Big( IV(\hat f_C(x)) + 2 \int_{x \in S} E(\hat f_C(x))^2 \Big). \tag{61}
\]
Let us now bound $\int_{x \in S} E(\hat f_C(x))^2$:
\[
\int_{x \in S} E(\hat f_C(x))^2 = \int_{x \in S} E(\hat f_H(x))^2. \tag{62}
\]
From equation 17, $E(\hat f_H(x))^2 = f(x)^2 + \langle \frac{h}{2}\mathbf{1} - x, \nabla f(x) \rangle^2 + 2 f(x) \langle \frac{h}{2}\mathbf{1} - x, \nabla f(x) \rangle$, so
\[
\int_{x \in S} E(\hat f_H(x))^2 \leq R(f) + \frac{h^2 d}{4} R(\|\nabla f\|) + h \sqrt{d} \int_{x \in S} f(x) \|\nabla f\|. \tag{63}
\]
Hence,
\[
IV(\hat f_C(x)) - 2\epsilon M \leq IV(\hat f^*_C(x)) \leq IV(\hat f_C(x)) + 2\epsilon M, \tag{64}
\]
where
\[
M \leq IV(\hat f_C(x)) + 2 \Big( R(f) + \frac{h^2 d}{4} R(\|\nabla f\|) + h \sqrt{d} \int_{x \in S} f(x) \|\nabla f\| \Big). \tag{65}
\]

A.4 $\hat f_S$ to $\hat f^*_C$ (Lemma 1)

The estimators $\hat f_S(x)$ and $\hat f^*_C(x)$, obtained from the Density Sketch with parameters $(R, K, H)$ using a histogram of width $h$ built over $n$ i.i.d. samples drawn from the true distribution, satisfy
\[
\int |\hat f^*_C(x) - \hat f_S(x)|\, dx = 2(1 - ratio_h),
\]
where $ratio_h$ is the capture ratio defined in Section 3.5.

Proof.
\[
\int |\hat f^*_C(x) - \hat f_S(x)|\, dx = \sum_{b \in bins} \int_{x \in b} |\hat f^*_C(x) - \hat f_S(x)|\, dx \tag{66}
\]
\[
= \sum_{b \in bins(H)} \int_{x \in b} |\hat f^*_C(x) - \hat f_S(x)|\, dx + \sum_{b \notin bins(H)} \int_{x \in b} |\hat f^*_C(x) - \hat f_S(x)|\, dx. \tag{67}
\]
We know that for $x \in b$, $b \notin bins(H)$, $\hat f_S(x) = 0$. Hence,
\[
\int |\hat f^*_C(x) - \hat f_S(x)|\, dx = \sum_{b \in bins(H)} \int_{x \in b} |\hat f^*_C(x) - \hat f_S(x)|\, dx + \sum_{b \notin bins(H)} \int_{x \in b} \hat f^*_C(x)\, dx. \tag{68}
\]
Now, $\int_{x \in b} \hat f^*_C(x)\, dx$ is the probability of a data point lying in bucket $b$ according to $\hat f^*_C(x)$, namely $\hat c_b / \hat n$:
\[
\int |\hat f^*_C(x) - \hat f_S(x)|\, dx = \sum_{b \in bins(H)} \int_{x \in b} |\hat f^*_C(x) - \hat f_S(x)|\, dx + \sum_{b \notin bins(H)} \frac{\hat c_b}{\hat n}. \tag{69}
\]
For points $x \in b$, $b \in bins(H)$, we have $\hat f^*_C(x)\, \hat n = \hat f_S(x)\, \hat n_h$, i.e., $\hat f_S(x) = \frac{\hat n}{\hat n_h} \hat f^*_C(x)$. Hence,
\[
\int |\hat f^*_C(x) - \hat f_S(x)|\, dx = \sum_{b \in bins(H)} \int_{x \in b} \hat f^*_C(x) \Big( \frac{\hat n}{\hat n_h} - 1 \Big)\, dx + \sum_{b \notin bins(H)} \frac{\hat c_b}{\hat n} \tag{70, 71}
\]
\[
= \Big( \frac{\hat n}{\hat n_h} - 1 \Big) \sum_{b \in bins(H)} \frac{\hat c_b}{\hat n} + \sum_{b \notin bins(H)} \frac{\hat c_b}{\hat n} \tag{72}
\]
\[
= \Big( \frac{\hat n}{\hat n_h} - 1 \Big) \frac{\hat n_h}{\hat n} + \frac{\hat n - \hat n_h}{\hat n} \tag{73}
\]
\[
= \Big( 1 - \frac{\hat n_h}{\hat n} \Big) + \frac{\hat n - \hat n_h}{\hat n} = 2 \Big( 1 - \frac{\hat n_h}{\hat n} \Big) = 2(1 - ratio_h). \tag{74-76}
\]

A.5 $\hat f_S$ to $\hat f^*_C$ (Theorem 5)

The IMSE of the estimator $\hat f_S(x)$ obtained from the Density Sketch with parameters $(R, K, H)$ using a histogram of width $h$ built over $n$ i.i.d. samples drawn from the true distribution $f(x)$ is
\[
IMSE(\hat f_S(x)) \leq 6(1 - ratio_h)^2 + 3\, IMSE(\hat f^*_C(x)),
\]
where $ratio_h$ is the capture ratio defined in Section 3.5.

Proof.
We give a very loose relation between $\hat f_S$ and $f$. We can write
\[
\int (\hat f_S(x) - f(x))^2\, dx = \int \big( (\hat f_S(x) - \hat f^*_C(x)) + (\hat f^*_C(x) - f(x)) \big)^2\, dx \tag{77}
\]
\[
\leq \frac{3}{2} \int (\hat f_S(x) - \hat f^*_C(x))^2\, dx + 3 \int (\hat f^*_C(x) - f(x))^2\, dx \tag{78}
\]
\[
\leq \frac{3}{2} \Big( \int |\hat f_S(x) - \hat f^*_C(x)|\, dx \Big)^2 + 3 \int (\hat f^*_C(x) - f(x))^2\, dx \tag{79}
\]
\[
\leq 6 (1 - ratio_h)^2 + 3 \int (\hat f^*_C(x) - f(x))^2\, dx, \tag{80}
\]
and hence
\[
IMSE = MISE(\hat f_S(x)) \leq 6(1 - ratio_h)^2 + 3\, IMSE(\hat f^*_C(x)). \tag{81}
\]
A.6 Main Theorem: $\hat f_S$ to $f$

This theorem directly relates the sampling distribution $\hat f_S(x)$ to the true distribution $f(x)$. Combining the previous results,
\[
IMSE(\hat f_S(x)) \leq 6(1 - ratio_h)^2 + 3\, IMSE(\hat f^*_C(x)) \tag{82}
\]
\[
\leq 6(1 - ratio_h)^2 + 3 \big( IMSE(\hat f_C(x)) + \epsilon (N + 2M) \big) \tag{83}
\]
\[
\leq 6(1 - ratio_h)^2 + 3 \Big( IMSE(\hat f_H) + \frac{bins - 1}{KRnh^d} + \epsilon (N + 2M) \Big) \tag{84}
\]
\[
\leq 6(1 - ratio_h)^2 + 3 \Big( \frac{1}{nh^d} + \frac{R(f)}{n} + o\big(\tfrac{1}{n}\big) + \frac{h^2 d}{4} R(\|\nabla f\|) + \frac{bins - 1}{KRnh^d} + \epsilon (N + 2M) \Big), \tag{85}
\]
with
\[
N = (1 + ISB(\hat f_C)) \leq 1 + \frac{h^2 d}{4} R(\|\nabla f\|),
\]
\[
M \leq IV(\hat f_C) + 2\Big( R(f) + \frac{h^2 d}{4} R(\|\nabla f\|) + h\sqrt{d} \int_{x \in S} f(x) \|\nabla f\| \Big)
\leq \frac{1}{nh^d} + \frac{R(f)}{n} + o\big(\tfrac{1}{n}\big) + \frac{bins - 1}{KRnh^d} + 2 R(f) + \frac{h^2 d}{2} R(\|\nabla f\|) + 2 h\sqrt{d} \int_{x \in S} f(x) \|\nabla f\|.
\]
Substituting and grouping terms,
\[
\begin{aligned}
IMSE(\hat f_S(x)) \leq\; & 6(1 - ratio_h)^2 + 3(1 + 2\epsilon)\Big( \frac{1}{nh^d} + \frac{R(f)}{n} + o\big(\tfrac{1}{n}\big) + \frac{bins - 1}{KRnh^d} \Big) \\
& + 3(1 + 3\epsilon)\, \frac{h^2 d}{4} R(\|\nabla f\|) + 3\epsilon \Big( 1 + 2 R(f) + h\sqrt{d} \int_{x \in S} f(x) \|\nabla f\| \Big). \tag{86}
\end{aligned}
\]
10 Other Baselines
Coresets:
We considered a comparison with sophisticated data summaries such as coresets. Briefly, a coreset is acollection of (possibly weighted) points that can be used to estimate functions over the dataset. To use coresets togenerate a synthetic dataset, we would need to estimate the KDE. Unfortunately, coresets for the KDE suffer frompractical issues such as a large memory cost to construct the point set. Despite recent progress toward coresets in thestreaming environment [19], coresets remain difficult to implement for real-world KDE problems [20].
Clustering and Importance Sampling:
Another reasonable strategy is to represent the dataset as a collection of weighted cluster centers, which may be used to compute the KDE and sample synthetic points. Unfortunately, algorithms such as k-means clustering are inappropriate for large streaming datasets and do not have the same mergeability properties as our sketch. Furthermore, such techniques are unlikely to substantially improve over random sampling when the sample is spread sufficiently well over the support of the distribution. An alternative approach is to select points from the dataset based on importance sampling [20], geometric properties [21], and other sampling techniques [22]. However, recent experiments show that for many real-world datasets, random samples have competitive performance when compared to point sets obtained via importance sampling and cluster-based approaches [14].
Dimensionality Reduction: Random projections [23] can reduce the size of the data by reducing its dimension. However, given the relatively small dimension of the datasets considered here, dimensionality reduction is unlikely to work well as a surrogate for the data.