Learning Structured Distributions From Untrusted Batches: Faster and Simpler
Sitan Chen∗   Jerry Li†   Ankur Moitra‡

June 9, 2020
Abstract
We revisit the problem of learning from untrusted batches introduced by Qiao and Valiant [QV17]. Recently, Jain and Orlitsky [JO19] gave a simple semidefinite programming approach based on the cut-norm that achieves essentially information-theoretically optimal error in polynomial time. Concurrently, Chen et al. [CLM19] considered a variant of the problem where $\mu$ is assumed to be structured, e.g. log-concave, monotone hazard rate, $t$-modal, etc. In this case, it is possible to achieve the same error with sample complexity sublinear in $n$, and they exhibited a quasi-polynomial time algorithm for doing so using Haar wavelets.

In this paper, we find an appealing way to synthesize [JO19] and [CLM19] to give the best of both worlds: an algorithm which runs in polynomial time and can exploit structure in the underlying distribution to achieve sublinear sample complexity. Along the way, we simplify the approach of [JO19] by avoiding the need for SDP rounding and giving a more direct interpretation of it via soft filtering, a powerful recent technique in high-dimensional robust estimation. We validate the usefulness of our algorithms in preliminary experimental evaluations.

∗ EECS, Massachusetts Institute of Technology. Email: [email protected]. This work was supported in part by a Paul and Daisy Soros Fellowship, NSF CAREER Award CCF-1453261, and NSF Large CCF-1565235.
† Microsoft Research AI. Email: [email protected].
‡ Department of Mathematics, Massachusetts Institute of Technology. Email: [email protected]. This work was supported in part by a Microsoft Trustworthy AI Grant, NSF CAREER Award CCF-1453261, NSF Large CCF-1565235, a David and Lucile Packard Fellowship, an Alfred P. Sloan Fellowship and an ONR Young Investigator Award.
¹ The total variation distance between two distributions $\mu, \nu$ over a shared probability space $\Omega$ is defined to be $\sup_{U \subseteq \Omega} \mu(U) - \nu(U)$.

1 Introduction

In this paper, we consider the problem of learning structured distributions from untrusted batches. This is a variant of the problem of learning from untrusted batches, as introduced in [QV17]. Here, there is an unknown distribution $\mu$ over $\{1, \dots, n\}$, and we are given $N$ batches of samples, each of size $k$. A $(1-\epsilon)$-fraction of these batches are "good," and consist of $k$ i.i.d. samples from some distribution $\mu_i$ with distance at most $\omega$ from $\mu$ in total variation distance,¹ but an $\epsilon$-fraction of these batches are "bad," and can be adversarially corrupted. The goal then is to estimate $\mu$ in total variation distance.

This problem models a situation where we get batches of data from many different users, for instance, in a crowdsourcing application. Each honest user provides a relatively small batch of data, which is by itself insufficient to learn a good model and, moreover, can come from slightly different distributions depending on the user, due to heterogeneity. At the same time, a non-trivial fraction of the data can come from malicious users who wish to game our algorithm to their own ends. The high-level question is whether or not we can exploit the batch structure of our data to improve the robustness of our estimator.

For this problem, there are three separate, but equally important, metrics under which we can evaluate any estimator:

Robustness  How accurately can we estimate $\mu$ in total variation distance?

Runtime  Are there algorithms that run in polynomial time in all the relevant parameters?

Sample complexity  How few samples do we need in order to estimate $\mu$?
In the original paper, Qiao and Valiant [QV17] focus primarily on robustness. They give an algorithm for learning general $\mu$ from untrusted batches that uses a polynomial number of samples and estimates $\mu$ to within $O\big(\omega + \epsilon/\sqrt{k}\big)$ in total variation distance, and they proved that this is the best possible up to constant factors. However, their estimator runs in time $2^n$. Qiao and Valiant [QV17] also gave an $n^k$-time algorithm based on low-rank tensor approximation; however, that algorithm also needs $n^k$ samples.

A natural question is whether or not this robustness can be achieved efficiently. [CLM19] gave an $n^{O(\log 1/\epsilon)}$-time algorithm with $n^{O(\log 1/\epsilon)}$ sample complexity for the general problem based on the sum-of-squares hierarchy. It estimates $\mu$ to within
$$O\Big(\omega + \frac{\epsilon}{\sqrt{k}}\sqrt{\log 1/\epsilon}\Big)$$
in total variation distance. In concurrent and independent work, Jain and Orlitsky [JO19] gave a polynomial time algorithm based on a much simpler semidefinite program that estimates $\mu$ to within the same total variation distance. Their approach was based on an elegant way to combine approximation algorithms for the cut-norm [AN04] with the filtering approach for robust estimation [DKK+19, SCV18, DKK+17, DKK+18, DHL19].

To some extent, the results of [CLM19, JO19] also address the third consideration, sample complexity. In particular, the estimator of [JO19] requires $N = \widetilde{\Omega}(n/\epsilon^2)$ batches to achieve the error rate mentioned above. Without any assumptions on the structure of $\mu$, even in the case where there are no corruptions, at least $\Omega(n/\epsilon^2)$ batches of size $k$ are required in order to learn $\mu$ to within total variation distance $O(\omega + \epsilon/\sqrt{k})$. Thus, this sample complexity is nearly optimal for this problem, unless we make additional assumptions.

Unfortunately, in many cases the domain size $n$ can be very large, and a sample complexity which grows strongly with $n$ can render the estimator impractical. However, in most applications we have prior knowledge about the shape of $\mu$ that could in principle be used to drastically reduce the sample complexity. For example, if $\mu$ is log-concave, monotone, or multimodal with a bounded number of modes, it is known that $\mu$ can be approximated by a piecewise polynomial function, and when there are no corruptions, this meta structural property can be used to reduce the sample complexity to logarithmic in the domain size [CDSS14b]. An appealing aspect of the relaxation in [CLM19] was that it was possible to incorporate shape constraints into the relaxation, through the Haar wavelet basis, which allowed us to improve the sample complexity to quasipolynomial in $d$ and $s$, respectively the degree and number of parts in the piecewise polynomial approximation, and quasipolylogarithmic in $n$. Unfortunately, while [JO19] achieves better runtime and sample complexity in the unstructured setting, their techniques do not obviously extend to obtain a similar sample complexity under structural assumptions.

This raises a natural question: can we build on [JO19] and [CLM19] to incorporate shape constraints into a simple semidefinite programming approach that achieves nearly optimal robustness, in polynomial runtime, and with sample complexity which is sublinear in $n$? In this paper, we answer this question in the affirmative:

Theorem 1.1 (Informal, see Theorem 4.1). Let $\mu$ be a distribution over $[n]$ that is approximated by an $s$-part piecewise polynomial function with degree at most $d$. Then there is a polynomial-time algorithm which estimates $\mu$ to within
$$O\Big(\omega + \frac{\epsilon}{\sqrt{k}}\sqrt{\log 1/\epsilon}\Big)$$
in total variation distance after drawing $N$ $\epsilon$-corrupted batches, each of size $k$, where $N = \widetilde{O}\big((s^2 d^2/\epsilon^2) \cdot \log^3(n)\big)$ is the number of batches needed.

Any algorithm for learning structured distributions from untrusted batches must take at least $\Omega(sd/\epsilon^2)$ batches to achieve error $O(\omega + \epsilon/\sqrt{k})$, and an interesting open question is whether there is a polynomial time algorithm that achieves these bounds. For robustly estimating the mean of a Gaussian in high dimensions, there is evidence for an $\Omega(\sqrt{\log 1/\epsilon})$ gap between the best possible estimation error and what can be achieved by polynomial time algorithms [DKS17]. It seems plausible that the $\Omega(\sqrt{\log 1/\epsilon})$ gap between the best possible estimation error and what we achieve is unavoidable in this setting as well.

1.1 Technical Overview

[JO19] demonstrated how to learn general distributions from untrusted batches in polynomial time using a filtering algorithm similar to those found in [DKK+19, SCV18, DKK+17, DKK+18, DHL19], and in [CLM19] it was shown how to learn structured distributions from untrusted batches in quasipolynomial time using an SoS relaxation based on Haar wavelets.

In this work we show how to combine the filtering framework of [JO19] with the Haar wavelet technology of [CLM19] to obtain a polynomial-time, sample-efficient algorithm for learning structured distributions from untrusted batches. In the discussion in this section, we will specialize to the case of $\omega = 0$ for the sake of clarity.

Learning via Filtering
A useful first observation is that the problem of learning from untrusted batches can be thought of as robust mean estimation of multinomial distributions in $L_1$ distance: given a batch of samples $Y_i = (Y^1_i, \dots, Y^k_i)$ from a distribution $\mu$ over $[n]$, the frequency vector $\{\frac{1}{k}\sum_{j=1}^{k} \mathbb{1}[Y^j_i = a]\}_{a \in [n]}$ is distributed according to the normalized multinomial distribution $\mathrm{Mul}_k(\mu)$ given by $k$ draws from $\mu$. Note that $\mu$ is precisely the mean of $\mathrm{Mul}_k(\mu)$, so the problem of estimating $\mu$ from an $\epsilon$-corrupted set of $N$ frequency vectors is equivalent to that of robustly estimating the mean of a multinomial distribution.

As such, it is natural to try to adapt the existing algorithms for robust mean estimation of other distributions; the fastest of these are based on a simple filtering approach which works as follows. We maintain weights for each point, initialized to uniform. At every step, we measure the maximum "skew" of the weighted dataset in any direction, and if this skew is still too high, update the weights by:

1. Finding the direction $v$ in which the corruptions "skew" the dataset the most.
2. Giving a "score" to each point based on how badly it skews the dataset in the direction $v$.
3. Downweighting or removing points with high scores.

Otherwise, if the skew is low, output the empirical mean of the weighted dataset. To prove correctness of this procedure, one must show three things for the particular skewness measure and score function chosen:

• Regularity: For any sufficiently large collection of $\epsilon$-corrupted samples, a particular deterministic regularity condition holds (Definition 4.3 and Lemma 4.6).

• Soundness: Under the regularity condition, if the skew of the weighted dataset is small, then the empirical mean of the weighted dataset is sufficiently close to the true mean (Lemma 4.7).

• Progress: Under the regularity condition, if the skew of the weighted dataset is large, then one iteration of the above update scheme will remove more weight from the bad samples than from the good samples (Lemma 4.10).

For isotropic Gaussians, skewness is just given by the maximum variance of the weighted dataset in any direction, i.e. $\max_{v \in \mathbb{S}^{n-1}} \langle vv^\top, \widetilde{\Sigma}\rangle$, where $\widetilde{\Sigma}$ is the empirical covariance of the weighted dataset. Given a maximizing $v$, the "score" of a point $X$ is then simply its contribution to the skewness.

To learn in $L_1$ distance, the right set of test vectors $v$ to use is the Hamming cube $\{0,1\}^n$, so a natural attempt at adapting the above skewness measure to robust mean estimation of multinomials is to consider the quantity $\max_{v \in \{0,1\}^n} \langle vv^\top, \widetilde{\Sigma}\rangle$. But one of the key challenges in passing from isotropic Gaussians to multinomial distributions is that this quantity is not very informative, because we do not have a good handle on the covariance of $\mathrm{Mul}_k(\mu)$. In particular, it could be that for a direction $v$, $\langle vv^\top, \widetilde{\Sigma}\rangle$ is high simply because the good points have high variance to begin with.

The Jain-Orlitsky Correction Term
The clever workaround of [JO19] was to observe that we know exactly what the projection of a multinomial distribution $\mathrm{Mul}_k(\mu)$ in any $\{0,1\}^n$ direction $v$ is, namely $\mathrm{Bin}(k, \langle v, \mu\rangle)$. And so to discern whether the corrupted points skew our estimate in a given direction $v$, one should measure not the variance in the direction $v$, but rather the following corrected quantity: the variance in the direction $v$, minus what the variance would be if the distribution of the projections in the $v$ direction were actually given by $\mathrm{Bin}(k, \langle v, \widetilde{\mu}\rangle)$, where $\widetilde{\mu}$ is the empirical mean of the weighted dataset. This new skewness measure can be written as
$$\max_{v \in \{0,1\}^n} \Big\{\langle vv^\top, \widetilde{\Sigma}\rangle - \tfrac{1}{k}\big(\langle v, \widetilde{\mu}\rangle - \langle v, \widetilde{\mu}\rangle^2\big)\Big\}. \quad (1)$$
Finding the direction $v \in \{0,1\}^n$ which maximizes this corrected quantity is a Boolean quadratic programming problem which can be solved approximately by solving the natural SDP relaxation and rounding to a Boolean vector $v$ using the machinery of [AN04]. Using this approach, [JO19] obtained a polynomial-time algorithm for learning general discrete distributions from untrusted batches.
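To make the corrected quantity (1) concrete, the following is a minimal numpy sketch that evaluates it for a single fixed test direction $v$; the function name and interface are ours, purely for illustration.

```python
import numpy as np

def corrected_skewness(X, w, v, k):
    """Corrected variance (1) of the weighted batches in direction v.

    X : (N, n) array of frequency vectors (rows sum to 1).
    w : (N,) nonnegative weights summing to 1.
    v : (n,) test vector with entries in {0, 1}.
    k : batch size.
    """
    mu_tilde = w @ X                            # weighted empirical mean
    proj = X @ v                                # projections <X_i, v>
    emp_var = w @ (proj - mu_tilde @ v) ** 2    # <vv^T, empirical covariance>
    p = mu_tilde @ v
    return emp_var - p * (1 - p) / k            # subtract Var(Bin(k, p) / k)
```

Maximizing this objective over all $v \in \{0,1\}^n$ is the Boolean quadratic program discussed above; the sketch only evaluates the objective for one candidate $v$.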
Learning Structured Distributions  [CLM19] introduced the question of learning from untrusted batches when the distribution is known to be structured. Learning structured distributions in the classical sense is well-understood: if a distribution $\mu$ is $\eta$-close in total variation distance to being $s$-piecewise degree-$d$, then to estimate $\mu$ in total variation distance it is enough to approximate $\mu$ in a much weaker norm which we will denote by $\|\cdot\|_{\mathcal{A}_K}$, where $K$ is a parameter that depends on $s$ and $d$. We review the details for this in Section 2.4.

[CLM19] gave a sum-of-squares algorithm for robust mean estimation in the $\mathcal{A}_K$ norm that achieved $\frac{\epsilon}{\sqrt{k}}\sqrt{\log 1/\epsilon}$ error in quasipolynomial time, and a natural open question was to achieve this with a polynomial-time algorithm.

The key challenge that [CLM19] had to address was that, unlike for the Hamming cube or $\mathbb{S}^{n-1}$, it is unclear how to optimize over the set of test vectors dual to the $\mathcal{A}_K$ norm. Combinatorially, this set is easy to characterize: $\|\mu - \hat{\mu}\|_{\mathcal{A}_K}$ is small if and only if $\langle \mu - \hat{\mu}, v\rangle$ is small for all $v \in \mathcal{V}^n_K \subset \{\pm 1\}^n$, where $\mathcal{V}^n_K$ is the set of all $v \in \{\pm 1\}^n$ with at most $2K$ sign changes when read as a vector from left to right (for example, $(1, 1, -1, -1, 1, 1, 1) \in \mathcal{V}^7_1$).

The main observation in [CLM19] is that vectors with few sign changes admit sparse representations in the Haar wavelet basis, so instead of working with $\mathcal{V}^n_K$, one can simply work with a convex relaxation of this Haar-sparsity constraint. As such, if we let $\mathcal{K} \subseteq \mathbb{R}^{n \times n}$ denote the relaxation of the set $\{vv^\top \mid v \in \mathcal{V}^n_K\}$ to all matrices $\Sigma$ whose Haar transforms are "analytically sparse" in some appropriate, convex sense (see Section 3 for a formal definition), then as this set of test matrices contains the set of test matrices $vv^\top$ for $v \in \mathcal{V}^n_K$, it is enough to learn $\mu$ in the norm associated to $\mathcal{K}$, which is strictly stronger than the $\mathcal{A}_K$ norm.²

Our goal then is to produce $\hat{\mu}$ for which $\|\hat{\mu} - \mu\|_{\mathcal{K}} \triangleq \sup_{\Sigma \in \mathcal{K}} \langle \Sigma, (\hat{\mu} - \mu)^{\otimes 2}\rangle^{1/2}$ is small. And even though $\|\cdot\|_{\mathcal{K}}$ is a stronger norm, it turns out that the metric entropy of $\mathcal{K}$ is still small enough that one can get good sample complexity guarantees. Indeed, showing that this is the case (see Lemma A.1) was where the bulk of the technical machinery of [CLM19] went, and as we elaborate on in Appendix B, the analysis there left some room for tightening. In this work, we give a refined analysis of $\mathcal{K}$ which allows us to get nearly tight sample complexity bounds.

² Note that in [CLM19], because moment bounds beyond degree 2 were used, they also needed to use higher-order tensor analogues of $\mathcal{K}$, but in this work it will suffice to work with degree 2.
Putting Everything Together  Almost all of the pieces are in place to instantiate the filtering framework: in lieu of the quantity in (1), which can be phrased as the maximization of some quadratic $\langle vv^\top, M(w)\rangle$ over $\{\pm 1\}^n$,³ where $M(w) \in \mathbb{R}^{n \times n}$ depends on the dataset and the weights $w$ on its points, we can define our skewness measure as $\max_{\Sigma \in \mathcal{K}} \langle \Sigma, M(w)\rangle = \|M(w)\|_{\mathcal{K}}$, and we can define the score for each point in the dataset to be its contribution to the skewness measure (see Section 4.2).

At this point the reader may be wondering why we never round $\Sigma$ to an actual vector $v \in \mathcal{V}^n_K$ before computing skewness and scores. As our subsequent analysis will show, it turns out that rounding is unnecessary, both in our setting and even in the unstructured distribution setting considered in [JO19]. Indeed, if one examines the three proof ingredients of regularity, soundness, and progress that we enumerated above, it becomes evident that the filtering framework for robust mean estimation does not actually require finding a concrete direction in $\mathbb{R}^n$ in which to filter, merely a skewness measure and score functions which are amenable to showing the above three statements. That said, as we will see, it becomes more technically challenging to prove these ingredients when $\Sigma$ is not rounded to an actual direction (see e.g. the discussion after Lemmas A.2 and A.3 in Appendix A), though it is nevertheless possible. We hope that this observation will prove useful in future applications of filtering.

³ Note that we have switched to $\{\pm 1\}^n$ in place of $\{0,1\}^n$. We do not belabor this point here, as the difference turns out to be immaterial, and the former is more convenient for understanding how we handle $\mathcal{V}^n_K$, which is a subset of $\{\pm 1\}^n$.

1.2 Related Work

The problem of learning from untrusted batches was introduced by [QV17], and is motivated by problems in reliable distributed learning and federated learning [MMR+17, KMY+15, ADLS17]. Our techniques are also related to a recent line of work on robust statistics [DKK+19, LRV16, CSV17, DKK+17, HL18, KSS18], a classical problem dating back to the 60s and 70s [Ans60, Tuk60, Hub92, Tuk75]. See [Li18, Ste18, DK19] for a more comprehensive survey of this line of work.

Finally, the papers most relevant to our result are [CLM19, JO19], which improve upon the result of [QV17] in terms of runtime and sample complexity. As mentioned above, our result can be thought of as a way to combine the improved filtering algorithm of [JO19] with the shape-constrained technology introduced in [CLM19].

Concurrently and independently of this work, a newer work of Jain and Orlitsky [JO20] obtains very similar results, though our quantitative guarantees are incomparable: the number of batches $N$ they need scales linearly in $s \cdot d$ and independently of $n$, but also scales with $\sqrt{k}$ and $1/\epsilon$.
Roadmap  In Section 2, we overview notation, formally define our generative model, give miscellaneous technical tools, and review the basics on classical learning of structured distributions and on Haar wavelets. In Section 3, we define the semidefinite program that we use to compute skewness. In Section 4, we give our algorithm LearnWithFilter and prove our main result, Theorem 1.1. In Section 5, we describe our empirical evaluations of LearnWithFilter on synthetic data. In Appendices A, B, and C, we complete the proofs of some deferred technical statements relating to deterministic regularity conditions and metric entropy bounds.

2 Preliminaries

2.1 Notation

• Given $p \in [0,1]$, let $\mathrm{Bin}(k, p)$ denote the normalized binomial distribution, which takes values in $\{0, 1/k, \dots, 1\}$ rather than $\{0, 1, \dots, k\}$.

• Let $\Delta^n \subset \mathbb{R}^n$ be the simplex of nonnegative vectors whose coordinates sum to 1. Any $p \in \Delta^n$ naturally corresponds to a probability distribution over $[n]$.

• Let $\mathbb{1}_n \in \mathbb{R}^n$ denote the all-ones vector. We omit the subscript when the context is clear.

• Given a matrix $M \in \mathbb{R}^{n \times n}$, let $\|M\|_{\max}$ denote the maximum absolute value of any entry in $M$, let $\|M\|_{1,1}$ denote the absolute sum of its entries, and let $\|M\|_F$ denote its Frobenius norm.

• Given $\mu \in \Delta^n$, let $\mathrm{Mul}_k(\mu)$ denote the distribution over $\Delta^n$ given by sampling a frequency vector from the multinomial distribution arising from $k$ draws from the distribution over $[n]$ specified by $\mu$, and dividing by $k$.

• Given samples $X_1, \dots, X_N \sim \mathrm{Mul}_k(\mu)$ and $U \subseteq [N]$, define $w(U) : [N] \to [0, 1/N]$ to be the set of weights which assigns $1/N$ to all points in $U$ and 0 to all other points. Also define its normalization $\hat{w}(U) \triangleq w(U)/\|w\|_1$. Let $W_\epsilon$ denote the set of weights $w : [N] \to [0, 1/N]$ which are convex combinations of such weights for $|U| \ge (1-\epsilon)N$. Given $w$, define $\mu(w) \triangleq \sum_{i=1}^N \frac{w_i}{\|w\|_1} X_i$, and define $\mu(U) \triangleq \mu(w(U))$, that is, the empirical mean of the samples indexed by $U$.

• Given samples $X_1, \dots, X_N \sim \mathrm{Mul}_k(\mu)$, weights $w$, and $\nu_1, \dots, \nu_N \in \Delta^n$, define the matrices
$$A(w, \{\nu_i\}) = \sum_{i=1}^N w_i (X_i - \nu_i)^{\otimes 2} \quad \text{and} \quad B(\{\nu_i\}) = \frac{1}{N}\sum_{i=1}^N \mathbb{E}_{X \sim \mathrm{Mul}_k(\nu_i)}\big[(X - \nu_i)^{\otimes 2}\big].$$
When $\nu_1 = \cdots = \nu_N = \nu$, denote these matrices by $A(w, \nu)$ and $B(\nu)$, and note that
$$B(\nu) = \frac{1}{k}\big(\mathrm{diag}(\nu) - \nu^{\otimes 2}\big). \quad (2)$$
Also define $M(w, \{\nu_i\}) \triangleq A(w, \{\nu_i\}) - B(\{\nu_i\})$ and $M(w, \nu) \triangleq A(w, \nu) - B(\nu)$. We will also denote $M(w, \mu(w))$ by $M(w)$, and $M(\hat{w}(U))$ by $M_U$.

To get intuition for these definitions, note that any bitstring $v \in \{0,1\}^n$ corresponding to $S \subseteq [n]$ induces a normalized binomial distribution $Y \triangleq \mathrm{Bin}(k, \langle \mu, v\rangle)$ supported in $[0,1]$, and each $X_i \sim \mathrm{Mul}_k(\mu)$ induces a corresponding sample $\langle X_i, v\rangle$ from $Y$. Then $\langle vv^\top, M_U\rangle$ is the difference between the empirical variance of $Y$ and the variance of the binomial distribution $\mathrm{Bin}(k, \langle \mu(U), v\rangle)$.

Throughout the rest of the paper, let $\epsilon, \omega > 0$, $n, k, N \in \mathbb{N}$, and let $\mu$ be some probability distribution over $[n]$.

2.2 Generative Model
Definition 2.1. We say $Y_1, \dots, Y_N$ is an $\epsilon$-corrupted, $\omega$-diverse set of $N$ batches of size $k$ from $\mu$ if they are generated via the following process:

• For every $i \in [(1-\epsilon)N]$, $\widetilde{Y}_i = (\widetilde{Y}^1_i, \dots, \widetilde{Y}^k_i)$ is a set of $k$ i.i.d. draws from $\mu_i$, where $\mu_i \in \Delta^n$ is some probability distribution over $[n]$ for which $d_{\mathrm{TV}}(\mu, \mu_i) \le \omega$.

• A computationally unbounded adversary inspects $\widetilde{Y}_1, \dots, \widetilde{Y}_{(1-\epsilon)N}$ and adds $\epsilon N$ arbitrarily chosen tuples $\widetilde{Y}_{(1-\epsilon)N+1}, \dots, \widetilde{Y}_N \in [n]^k$, and returns the entire collection of tuples in an arbitrary order as $Y_1, \dots, Y_N$.

Let $S_G, S_B \subset [N]$ denote the indices of the uncorrupted (good) and corrupted (bad) batches.

It turns out that we might as well treat each $Y_i$ as an unordered tuple. That is, for any $Y_i$, define $X_i \in \Delta^n$ to be the vector of frequencies whose $a$-th entry is $\frac{1}{k}\sum_{j=1}^k \mathbb{1}[Y^j_i = a]$ for all $a \in [n]$. Then for each $i \in S_G$, $X_i$ is an independent draw from $\mathrm{Mul}_k(\mu_i)$. Henceforth, we will work solely in this frequency-vector perspective.
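As a concrete illustration of Definition 2.1 and of the matrices $A$, $B$, and $M$ defined above, here is a small numpy sketch. The particular $\omega$-perturbation and adversary below are our own arbitrary choices, not ones prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupted_batches(mu, N, k, eps, omega=0.0):
    """Frequency vectors X_i of an eps-corrupted, omega-diverse set of batches
    (Definition 2.1). The adversary here plants point masses on symbol 0,
    one arbitrary choice among many."""
    n = len(mu)
    n_good = int((1 - eps) * N)
    X = np.zeros((N, n))
    for i in range(n_good):
        shift = rng.dirichlet(np.ones(n)) - 1.0 / n   # crude mean-zero direction
        mu_i = np.clip(mu + omega * shift, 0, None)
        mu_i /= mu_i.sum()                            # d_TV(mu, mu_i) = O(omega)
        X[i] = rng.multinomial(k, mu_i) / k
    X[n_good:, 0] = 1.0                               # bad batches: k copies of symbol 0
    return X

def M_matrix(X, w, k):
    """M(w) = A(w, mu(w)) - B(mu(w)), with B as in (2)."""
    mu_w = (w / w.sum()) @ X
    C = X - mu_w
    A = C.T @ (w[:, None] * C)
    B = (np.diag(mu_w) - np.outer(mu_w, mu_w)) / k
    return A - B
```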
2.3 Elementary Facts

In this section we collect miscellaneous elementary facts that will be useful in subsequent sections.

Fact 2.2. For $X_1, \dots, X_m \in \mathbb{R}^n$, weights $w : [m] \to \mathbb{R}_{\ge 0}$, $v \in \mathbb{R}^n$, $\mu \in \mathbb{R}^n$, and $\Sigma \in \mathbb{R}^{n \times n}$ symmetric,
$$\sum w_i \big\langle (X_i - \mu)^{\otimes 2}, \Sigma\big\rangle = \sum w_i \big\langle (X_i - \mu(w))^{\otimes 2}, \Sigma\big\rangle + \|w\|_1 \cdot \big\langle (\mu(w) - \mu)^{\otimes 2}, \Sigma\big\rangle. \quad (3)$$
In particular, by taking $\Sigma = vv^\top$ for any $v \in \mathbb{R}^n$,
$$\sum w_i \langle X_i - \mu, v\rangle^2 = \sum w_i \langle X_i - \mu(w), v\rangle^2 + \|w\|_1 \cdot \langle \mu(w) - \mu, v\rangle^2.$$
That is, the function $\nu \mapsto \sum_i w_i \langle X_i - \nu, v\rangle^2$ is minimized over $\nu \in \mathbb{R}^n$ by $\nu = \mu(w)$.

Proof. Without loss of generality we may assume $\|w\|_1 = 1$. Using the fact that $\langle u^{\otimes 2}, \Sigma\rangle - \langle v^{\otimes 2}, \Sigma\rangle = (u - v)^\top \Sigma (u + v)$ for symmetric $\Sigma$, we see that
$$\big\langle (X_i - \mu)^{\otimes 2} - (X_i - \mu(w))^{\otimes 2}, \Sigma\big\rangle = (\mu(w) - \mu)^\top \Sigma (2X_i - \mu - \mu(w)).$$
Because $\sum w_i X_i = \mu(w)$, we see that
$$\sum w_i (\mu(w) - \mu)^\top \Sigma (2X_i - \mu - \mu(w)) = \big\langle (\mu(w) - \mu)^{\otimes 2}, \Sigma\big\rangle,$$
from which (3) follows. The remaining parts of the claim follow trivially.
Fact 2.3. For any $0 < \epsilon < 1$, let weights $w : [N] \to [0, 1/N]$ satisfy $\sum_{i \in [N]} w_i \ge 1 - O(\epsilon)$. If $w'$ is the set of weights defined by $w'_i = w_i$ for $i \in S_G$ and $w'_i = 0$ otherwise, and if $|S_G| \ge (1-\epsilon)N$, then we have that $\|\mu(w) - \mu(w')\|_1 \le O(\epsilon)$.

Proof. We may write
$$\|\mu(w) - \mu(w')\|_1 \le \Big\|\frac{1}{\|w\|_1}\sum_{i \in S_B} w_i X_i\Big\|_1 + \Big(\frac{1}{\|w'\|_1} - \frac{1}{\|w\|_1}\Big)\Big\|\sum_{i \in S_G} w_i X_i\Big\|_1 \le O(\epsilon) + \Big(\frac{1}{\|w'\|_1} - \frac{1}{\|w\|_1}\Big)\Big\|\sum_{i \in S_G} w_i X_i\Big\|_1 \le O(\epsilon),$$
where the first step follows by the definition of $\mu(\cdot)$ and the triangle inequality, the second step follows from the fact that $|S_B| \le \epsilon N$, and the third step follows from the fact that $\big|\|w\|_1 - \|w'\|_1\big| = \big|\sum_{i \in S_B} w_i\big| \le \epsilon$, while $\|\sum_{i \in S_G} w_i X_i\|_1 \le 1$ since the $X_i$ lie in $\Delta^n$.

It will be useful to have a basic bound on the Frobenius norm of $M(w, \nu)$.
Lemma 2.4. For any $\nu \in \Delta^n$ and any weights $w$ for which $\sum w_i = 1$, we have that $\|M(w, \nu)\|_F \le 6$.

Proof. For any sample $X \in \Delta^n$, we have that $\|(X - \nu)(X - \nu)^\top\|_F \le \|X - \nu\|_2^2 \le 4$, and
$$\|B(\nu)\|_F \le \frac{1}{k}\|\nu\|_2 + \frac{1}{k}\|\nu\|_2^2 \le 2/k,$$
from which the lemma follows by the triangle inequality and the assumption that $\sum w_i = 1$.

2.4 $\mathcal{A}_K$ Norms and VC Complexity
In this section we review basics about learning distributions which are close to piecewise polynomial.
Definition 2.5 ($\mathcal{A}_K$ norms, see e.g. [DL01]). For positive integers $K \le n$, define $\mathcal{A}_K$ to be the set of all unions of at most $K$ disjoint intervals over $[n]$, where an interval is any subset of $[n]$ of the form $\{a, a+1, \dots, b-1, b\}$. The $\mathcal{A}_K$ distance between two distributions $\mu, \nu$ over $[n]$ is
$$\|\mu - \nu\|_{\mathcal{A}_K} = \max_{S \in \mathcal{A}_K} |\mu(S) - \nu(S)|.$$
Equivalently, say that $v \in \{\pm 1\}^n$ has $K$ sign changes if there are exactly $K$ indices $i \in [n-1]$ for which $v_{i+1} \ne v_i$. Then if $\mathcal{V}^n_K$ denotes the set of all $v \in \{\pm 1\}^n$ with at most $2K$ sign changes, we have
$$\|\mu - \nu\|_{\mathcal{A}_K} = \frac{1}{2}\max_{v \in \mathcal{V}^n_K} \langle \mu - \nu, v\rangle.$$
Note that $\|\cdot\|_{\mathcal{A}_1} \le \|\cdot\|_{\mathcal{A}_2} \le \cdots \le \|\cdot\|_{\mathcal{A}_{n/2}} = \|\cdot\|_{\mathrm{TV}}$.
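For intuition, $\|\mu - \nu\|_{\mathcal{A}_K}$ can be computed exactly with a standard dynamic program over unions of at most $K$ disjoint intervals; the following sketch is our own implementation and runs in $O(nK)$ time.

```python
import numpy as np

def best_k_intervals(d, K):
    """Max of sum(d over S) over unions S of at most K disjoint intervals."""
    n = len(d)
    best = np.zeros(K + 1)            # best[j]: <= j intervals within the prefix
    cur = np.full(K + 1, -np.inf)     # cur[j]: j-th interval ends at current i
    for i in range(n):
        for j in range(K, 0, -1):     # descending so best[j-1] is still prefix i-1
            cur[j] = max(cur[j], best[j - 1]) + d[i]
            best[j] = max(best[j], cur[j])
    return best[K]                    # empty selection (value 0) always allowed

def ak_distance(mu, nu, K):
    d = np.asarray(mu, float) - np.asarray(nu, float)
    return max(best_k_intervals(d, K), best_k_intervals(-d, K))
```

With $K = n/2$ this recovers the total variation distance, matching the chain of inequalities above.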
Definition 2.6. We say that a distribution $\mu$ over $[n]$ is $(\eta, s)$-piecewise degree-$d$ if there is a partition of $[n]$ into $t \le s$ disjoint intervals $\{[a_i, b_i]\}_{1 \le i \le t}$, together with univariate degree-$d$ polynomials $r_1, \dots, r_t$ and a distribution $\mu'$ on $[n]$, such that $d_{\mathrm{TV}}(\mu, \mu') \le \eta$ and such that for all $i \in [t]$, $\mu'(x) = r_i(x)$ for all $x \in [a_i, b_i]$.

A proof of the following lemma, a consequence of [ADLS17], can be found in [CLM19].
Lemma 2.7 (Lemma 5.1 in [CLM19], follows from [ADLS17]). Let $K = s(d+1)$. If $\mu$ is $(\eta, s)$-piecewise degree-$d$ and $\|\mu - \hat{\mu}\|_{\mathcal{A}_K} \le \zeta$, then there is an algorithm which, given the vector $\hat{\mu}$, outputs a distribution $\mu^*$ for which $d_{\mathrm{TV}}(\mu, \mu^*) \le \zeta + 4\eta$ in time $\mathrm{poly}(s, d, 1/\eta)$.

Henceforth, we will focus solely on the problem of learning in the $\mathcal{A}_\ell$ norm, where
$$\ell \triangleq s(d+1). \quad (4)$$

2.5 Haar Wavelets

We briefly recall the definition of Haar wavelets; further details and examples can be found in [CLM19].
Definition 2.8. Let $m$ be a positive integer and let $n = 2^m$. The Haar wavelet basis is an orthonormal basis over $\mathbb{R}^n$ consisting of the father wavelet $\psi_{\mathrm{father}} = n^{-1/2} \cdot (1, \dots, 1)$, the mother wavelet $\psi_{\mathrm{mother}} = n^{-1/2} \cdot (1, \dots, 1, -1, \dots, -1)$ (where $(1, \dots, 1, -1, \dots, -1)$ contains $n/2$ 1's and $n/2$ $-1$'s), and, for every $i, j$ for which $1 \le i < m$ and $0 \le j < 2^i$, the wavelet $\psi_{i,j}$ whose $(2^{m-i} \cdot j + 1), \dots, (2^{m-i} \cdot j + 2^{m-i-1})$-th coordinates are $2^{-(m-i)/2}$, whose $(2^{m-i} \cdot j + 2^{m-i-1} + 1), \dots, (2^{m-i} \cdot j + 2^{m-i})$-th coordinates are $-2^{-(m-i)/2}$, and whose remaining coordinates are 0.

Additionally, we will use the following notation when referring to Haar wavelets:

• Let $H_m$ denote the $n \times n$ matrix whose rows consist of the vectors of the Haar wavelet basis for $\mathbb{R}^n$. When the context is clear, we will omit the subscript and refer to this matrix as $H$.

• For $\nu \in [n]$, if the $\nu$-th element of the Haar wavelet basis for $\mathbb{R}^n$ is some $\psi_{i,j}$, then define the weight $h(\nu) \triangleq 2^{-(m-i)/2}$.

• For any index $i \in \{\mathrm{father}, \mathrm{mother}, 1, \dots, m-1\}$, let $T_i \subset [n]$ denote the set of indices $\nu$ for which the $\nu$-th Haar wavelet is of the form $\psi_{i,j}$ for some $j$.

• Given any $p \ge 1$, define the Haar-weighted $L^p$ norm $\|\cdot\|_{p;h}$ on $\mathbb{R}^n$ by $\|w\|_{p;h} \triangleq \|w'\|_p$, where for every $a \in [n]$, $w'_a \triangleq h(a) w_a$. Likewise, given any norm $\|\cdot\|_*$ on $\mathbb{R}^{n \times n}$, define the Haar-weighted $*$-norm $\|\cdot\|_{*;h}$ on $\mathbb{R}^{n \times n}$ by $\|M\|_{*;h} \triangleq \|M'\|_*$, where for every $a, b \in [n]$, $M'_{a,b} \triangleq h(a)h(b) M_{a,b}$.

The key observation is that any $v \in \{\pm 1\}^n$ with at most $\ell$ sign changes, where $\ell$ is given by (4), has an $(\ell \log n + 1)$-sparse representation in the Haar wavelet basis. We will use the following fundamental fact about Haar wavelets, part of which appears as Lemma 6.3 in [CLM19].
Lemma 2.9. Let $v \in \{\pm 1\}^n$ have at most $\ell$ sign changes. Then $Hv$ has at most $\ell \log n + 1$ nonzero entries, and furthermore $\|Hv\|_{\infty;h} \le 1$. In particular, $\|Hv\|_{1;h}, \|Hv\|_{2;h}^2 \le \ell \log n + 1$.

Proof. We first show that $Hv$ has at most $\ell \log n + 1$ nonzero entries. For any $\psi_{i,j}$ with nonzero entries at indices $[a, b] \subset [n]$ and such that $i \ne \mathrm{father}$, if $v$ has no sign change in the interval $[a, b]$, then $\langle \psi_{i,j}, v\rangle = 0$. For every index $\nu \in [n]$ at which $v$ has a sign change, there are at most $m = \log n$ choices of $i, j$ for which $\psi_{i,j}$ has a nonzero entry at index $\nu$, from which the claim follows by a union bound over all $\ell$ choices of $\nu$, together with the fact that $\langle \psi_{\mathrm{father}}, v\rangle$ may be nonzero.

Now for each $(i, j)$ for which $\langle \psi_{i,j}, v\rangle \ne 0$, note that
$$2^{-(m-i)/2} \cdot |\langle \psi_{i,j}, v\rangle| \le 2^{-(m-i)/2} \cdot \big(2^{-(m-i)/2} \cdot 2^{m-i}\big) = 1,$$
as claimed. The bounds on $\|Hv\|_{1;h}$ and $\|Hv\|_{2;h}^2$ follow immediately.
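The following numpy sketch (ours) builds $H$ as in Definition 2.8 and numerically checks the sparsity guarantee of Lemma 2.9 on an example; the construction indexes scales so that $i = 0$ plays the role of the mother wavelet.

```python
import numpy as np

def haar_matrix(m):
    """Rows are the Haar wavelet basis of R^n, n = 2^m (Definition 2.8)."""
    n = 2 ** m
    rows = [np.full(n, n ** -0.5)]                  # father wavelet
    for i in range(m):                              # scale i = 0 is the mother
        width = 2 ** (m - i)                        # support length of psi_{i,j}
        for j in range(2 ** i):
            psi = np.zeros(n)
            s = j * width
            psi[s: s + width // 2] = 2.0 ** (-(m - i) / 2)
            psi[s + width // 2: s + width] = -2.0 ** (-(m - i) / 2)
            rows.append(psi)
    return np.array(rows)

m = 6; n = 2 ** m
H = haar_matrix(m)
assert np.allclose(H @ H.T, np.eye(n))              # orthonormal basis

ell = 3
v = np.ones(n); v[17:40] = -1; v[50:] = -1          # exactly 3 sign changes
nnz = np.count_nonzero(np.abs(H @ v) > 1e-9)
print(nnz, "<=", ell * m + 1)                       # sparsity from Lemma 2.9
```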
3 The Semidefinite Program

Recall that in [JO19], the authors consider the binary optimization problem $\max_{v \in \{0,1\}^n} |v^\top M_U v|$. We would like to approximate the optimization problem $\max_{v \in \mathcal{V}^n_\ell} |v^\top M_U v|$. Motivated by [CLM19] and Lemma 2.9, we consider the following convex relaxation:

Definition 3.1. Let $\ell$ be given by (4). Let $\mathcal{K}$ denote the (convex) set of all matrices $\Sigma \in \mathbb{R}^{n \times n}$ for which:

1. $\|\Sigma\|_{\max} \le 1$.
2. $\|H \Sigma H^\top\|_{1,1;h} \le (\ell \log n + 1)^2$.
3. $\|H \Sigma H^\top\|_{F;h} \le \ell \log n + 1$.
4. $\|H \Sigma H^\top\|_{\max;h} \le 1$.
5. $\Sigma \succeq 0$.

Let $\|\cdot\|_{\mathcal{K}}$ denote the associated norm given by $\|M\|_{\mathcal{K}} \triangleq \sup_{\Sigma \in \mathcal{K}} |\langle M, \Sigma\rangle|$. By abuse of notation, for vectors $v \in \mathbb{R}^n$ we will also use $\|v\|_{\mathcal{K}}$ to denote $\|vv^\top\|_{\mathcal{K}}^{1/2}$.

Because $\mathcal{K}$ has an efficient separation oracle, one can compute $\|\cdot\|_{\mathcal{K}}$ in polynomial time.
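Definition 3.1 can be handed directly to an off-the-shelf convex solver. The sketch below is our transcription using the cvxpy package, assuming $H$ and the weight vector $h$ have been built as in Section 2.5; the right-hand-side budgets follow our reading of Constraints 2 and 3 above. Since $\Sigma \mapsto \langle M, \Sigma\rangle$ is linear, the absolute value is handled by solving for $\pm M$ and taking the larger value.

```python
import cvxpy as cp
import numpy as np

def K_norm(M, H, h, ell, n):
    """||M||_K = sup_{Sigma in K} |<M, Sigma>| for symmetric M (Definition 3.1)."""
    budget = ell * np.log2(n) + 1
    W = np.outer(h, h)                              # h(a) h(b) weight mask
    best_val, best_Sigma = -np.inf, None
    for sign in (+1, -1):
        S = cp.Variable((n, n), PSD=True)           # Constraint 5
        T = cp.multiply(W, H @ S @ H.T)             # Haar-weighted transform
        cons = [cp.abs(S) <= 1,                     # Constraint 1
                cp.sum(cp.abs(T)) <= budget ** 2,   # Constraint 2
                cp.norm(T, "fro") <= budget,        # Constraint 3
                cp.abs(T) <= 1]                     # Constraint 4
        prob = cp.Problem(cp.Maximize(cp.trace(sign * M @ S)), cons)
        prob.solve()
        if prob.value > best_val:
            best_val, best_Sigma = prob.value, S.value
    return best_val, best_Sigma
```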
Remark 3.2. Note that, besides not being a sum-of-squares program like the one considered in [CLM19], this relaxation is also slightly different because of Constraints 3 and 4. As we will see in Appendix B, these additional constraints will be crucial for getting refined sample complexity bounds.

By Lemma 2.9, $\mathcal{K}$ is a relaxation of $\mathcal{V}^n_\ell$:

Corollary 3.3 (Corollary of Lemma 2.9). $vv^\top \in \mathcal{K}$ for any $v \in \mathcal{V}^n_\ell$.

Note also that Constraint 1 in Definition 3.1 ensures that $\|\cdot\|_{\mathcal{K}}$ is weaker than $\|\cdot\|_1$, and more generally that:
Fact 3.4. For any $a, b \in \mathbb{R}^n$ and $\Sigma \in \mathcal{K}$, $a^\top \Sigma b \le \|a\|_1 \cdot \|b\|_1$. In particular, for any $v \in \mathbb{R}^n$, $\|v\|_{\mathcal{K}} \le \|v\|_1$.

As a consequence, we conclude the following useful fact about the stability of the $B(\cdot)$ matrix.
Corollary 3.5. For any $\mu, \mu' \in \Delta^n$, $\|B(\mu) - B(\mu')\|_{\mathcal{K}} \le \frac{3}{k}\|\mu - \mu'\|_1$.

Proof. Take any $\Sigma \in \mathcal{K}$. By symmetry, it is enough to show that $\langle B(\mu) - B(\mu'), \Sigma\rangle \le \frac{3}{k}\|\mu - \mu'\|_1$. By Constraint 1, we have that $\langle \mu - \mu', \mathrm{diag}(\Sigma)\rangle \le \|\mu - \mu'\|_1$. On the other hand, note that
$$\mu'^\top \Sigma \mu' - \mu^\top \Sigma \mu = (\mu' - \mu)^\top \Sigma (\mu' + \mu) \le \|\mu' - \mu\|_1 \cdot \|\mu' + \mu\|_1 \le 2\|\mu' - \mu\|_1,$$
where the second step follows from Fact 3.4. The corollary now follows from (2).

Note that if the solution to the convex program $\mathrm{argmax}_{\Sigma \in \mathcal{K}} \langle M_U, \Sigma\rangle$ were actually integral, that is, some rank-1 matrix $vv^\top$ for $v \in \mathcal{V}^n_\ell$, it would correspond to the direction $v$ in which the samples in $U$ have the largest discrepancy between the empirical variance and the variance predicted by the empirical mean. Then $v$ would correspond to a subset of the domain $[n]$ on which one could filter out bad points as in [JO19]. In the sequel, we will show that this kind of analysis applies even if the solution to $\mathrm{argmax}_{\Sigma \in \mathcal{K}} \langle M_U, \Sigma\rangle$ is not integral.

4 Algorithm and Analysis

In this section we prove our main theorem, stated formally below:
Theorem 4.1. Let $\mu$ be an $(\eta, s)$-piecewise degree-$d$ distribution over $[n]$. Then for any $0 < \epsilon < 1/2$ smaller than some absolute constant, and any $0 < \delta < 1$, there is a $\mathrm{poly}(n, k, 1/\epsilon, 1/\delta)$-time algorithm LearnWithFilter which, given
$$N = \widetilde{O}\big(\log(1/\delta)(s^2 d^2/\epsilon^2)\log^3(n)\big)$$
$\epsilon$-corrupted, $\omega$-diverse batches of size $k$ from $\mu$, outputs an estimate $\hat{\mu}$ such that
$$d_{\mathrm{TV}}(\hat{\mu}, \mu) \le O\Big(\eta + \omega + \frac{\epsilon\sqrt{\log 1/\epsilon}}{\sqrt{k}}\Big)$$
with probability at least $1 - \delta$ over the samples.

In Section 4.1, we first describe and prove guarantees for a basic but important subroutine, Filter, of our algorithm. In Section 4.2, we describe our learning algorithm, LearnWithFilter, in full. In Section 4.3 we define the deterministic conditions that the dataset must satisfy for LearnWithFilter to succeed, deferring the proof that these deterministic conditions hold with high probability (Lemma 4.6) to Appendix A. In Section 4.4 we prove a key geometric lemma (Lemma 4.7). Finally, in Section 4.5, we complete the proof of correctness of LearnWithFilter.

4.1 The Filter Subroutine

In this section, we define and analyze a simple deterministic subroutine Filter which takes as input a set of weights $w$ and a set of scores on the batches $X_1, \dots, X_N$, and outputs a new set of weights $w'$ such that, if the weighted average of the scores among the bad batches exceeds that of the scores among the good batches, then $w'$ places even less weight, relatively, on the bad batches than does $w$. This subroutine is given in Algorithm 1 below.

Algorithm 1: Filter($\tau, w$)
Input: Scores $\tau : [N] \to \mathbb{R}_{\ge 0}$, weights $w : [N] \to \mathbb{R}_{\ge 0}$
Output: New weights $w'$ with even less mass on bad points than good points (see Lemma 4.2)
1. $\tau_{\max} \leftarrow \max_{i : w_i > 0} \tau_i$
2. $w'_i \leftarrow \big(1 - \frac{\tau_i}{\tau_{\max}}\big) w_i$ for all $i \in [N]$
3. Output $w'$
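Algorithm 1 is a one-liner in practice; a minimal numpy transcription:

```python
import numpy as np

def filter_step(tau, w):
    """Algorithm 1: soft down-weighting. tau, w are (N,) nonnegative arrays."""
    tau_max = tau[w > 0].max()
    return (1.0 - tau / tau_max) * w
```

Note that any batch attaining the maximum score has its weight zeroed out, which is what drives the termination argument (part (b) of Lemma 4.2 below).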
Lemma 4.2. Let $\tau : [N] \to \mathbb{R}_{\ge 0}$ be a set of scores, and let $w : [N] \to \mathbb{R}_{\ge 0}$ be a set of weights. Given a partition $[N] = S_G \sqcup S_B$ for which
$$\sum_{i \in S_G} w_i \tau_i < \sum_{i \in S_B} w_i \tau_i,$$
the output $w'$ of Filter($\tau, w$) satisfies (a) $w'_i \le w_i$ for all $i \in [N]$, (b) the support of $w'$ is a strict subset of the support of $w$, and (c) $\sum_{i \in S_G} w_i - w'_i < \sum_{i \in S_B} w_i - w'_i$.

Proof. (a) and (b) are immediate. For (c), note that
$$\sum_{i \in S_G} w_i - w'_i = \frac{1}{\tau_{\max}} \sum_{i \in S_G} \tau_i w_i < \frac{1}{\tau_{\max}} \sum_{i \in S_B} \tau_i w_i = \sum_{i \in S_B} w_i - w'_i,$$
from which the lemma follows.

We note that this kind of downweighting scheme and its analysis are not new; see e.g. Lemma 4.5 from [CSV17] or Lemma 17 from [SCV18].

4.2 Our Algorithm
We can now describe our algorithm LearnWithFilter. At a high level, we maintain weights $w : [N] \to \mathbb{R}_{\ge 0}$ for each of the batches. In every iteration, we compute $\Sigma \in \mathcal{K}$ maximizing $|\langle M(w), \Sigma\rangle|$. If $|\langle M(w), \Sigma\rangle| \le O\big(\omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\big)$, then we output $\mu(w)$. Otherwise, we update the weights as follows: for every batch $X_i$, compute the score $\tau_i$ given by
$$\tau_i \triangleq \big\langle (X_i - \mu(w))^{\otimes 2}, \Sigma\big\rangle, \quad (5)$$
and set the weights to be the output of Filter($\tau, w$). The pseudocode for LearnWithFilter is given in Algorithm 2 below.

Algorithm 2: LearnWithFilter($\{X_i\}_{i \in [N]}, \epsilon$)
Input: Frequency vectors $X_1, \dots, X_N$ coming from an $\epsilon$-corrupted, $\omega$-diverse set of batches from $\mu$, where $\mu$ is $(\eta, s)$-piecewise, degree-$d$
Output: $\hat{\mu}$ such that $d_{\mathrm{TV}}(\hat{\mu}, \mu) \le O\big(\eta + \omega + \frac{\epsilon\sqrt{\log 1/\epsilon}}{\sqrt{k}}\big)$, provided the uncorrupted samples are $\epsilon$-good
1. $w \leftarrow w([N])$
2. while $\|M(w)\|_{\mathcal{K}} \ge \Omega\big(\omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\big)$ do
3.   $\Sigma \leftarrow \mathrm{argmax}_{\Sigma' \in \mathcal{K}} |\langle M(w), \Sigma'\rangle|$
4.   Compute scores $\tau : [N] \to \mathbb{R}_{\ge 0}$ according to (5)
5.   $w \leftarrow$ Filter($\tau, w$)
6. Using the algorithm of [ADLS17] (see Lemma 2.7), output the $s$-piecewise, degree-$d$ distribution $\hat{\mu}$ minimizing $\|\mu(w) - \hat{\mu}\|_{\mathcal{A}_{s(d+1)}}$ (up to additive error $\eta$).
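For reference, the following is a compact Python sketch (ours) of the main loop, with the SDP step abstracted behind a callback such as the cvxpy sketch from Section 3; the threshold argument stands in for the $\Omega(\omega^2 + \frac{\epsilon}{k}\log 1/\epsilon)$ cutoff, whose constant we leave unspecified, and the final piecewise-polynomial fitting step of [ADLS17] is omitted.

```python
import numpy as np

def learn_with_filter(X, k, skewness_argmax, threshold):
    """Main loop of Algorithm 2. X is the (N, n) array of frequency vectors;
    skewness_argmax(M) returns (value, Sigma) approximating ||M||_K."""
    N, n = X.shape
    w = np.full(N, 1.0 / N)
    while True:
        mu_w = (w / w.sum()) @ X
        C = X - mu_w
        M = C.T @ (w[:, None] * C)                       # A(w, mu(w))
        M -= (np.diag(mu_w) - np.outer(mu_w, mu_w)) / k  # minus B(mu(w)), eq. (2)
        val, Sigma = skewness_argmax(M)
        if val < threshold:
            return mu_w        # then fit an s-piecewise degree-d distribution
        tau = np.einsum("ij,jk,ik->i", C, Sigma, C)      # scores (5)
        w = (1.0 - tau / tau[w > 0].max()) * w           # Filter (Algorithm 1)
```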
4.3 Deterministic Conditions

Definition 4.3 ($\epsilon$-goodness). Take a set of points $U \subset [N]$, and let $\{\mu_i\}_{i \in U}$ be a collection of distributions over $[n]$. For any $W \subseteq U$, define $\mu_W \triangleq \frac{1}{|W|}\sum_{i \in W} \mu_i$. Denote $\mu \triangleq \mu_U$. We say $U$ is $\epsilon$-good if it satisfies, for all $W \subset U$ for which $|W| = \epsilon|U|$:

(I) (Concentration of mean)
$$\|\mu(U) - \mu\|_{\mathcal{K}} \le O\Big(\frac{\epsilon\sqrt{\log 1/\epsilon}}{\sqrt{k}}\Big) \quad \text{and} \quad \|\mu(W) - \mu_W\|_{\mathcal{K}} \le O\Big(\frac{\sqrt{\log 1/\epsilon}}{\sqrt{k}}\Big)$$

(II) (Concentration of covariance)
$$\|M(\hat{w}(U), \{\mu_i\}_{i \in U})\|_{\mathcal{K}} \le O\Big(\frac{\epsilon \log 1/\epsilon}{k}\Big) \quad \text{and} \quad \|A(\hat{w}(W), \{\mu_i\}_{i \in W})\|_{\mathcal{K}} \le O\Big(\frac{\log 1/\epsilon}{k}\Big)$$

(III) (Concentration of variance proxy)
$$\|B(\mu(U)) - B(\{\mu_i\}_{i \in U})\|_{\mathcal{K}} \le O(\omega^2/k + \epsilon/k)$$

(IV) (Heterogeneity has negligible effect, see Lemma 4.4)
$$\sup_{\Sigma \in \mathcal{K}} \Big\{\frac{1}{|U|}\sum_{i \in U} (\mu_i - \mu)^\top \Sigma (X_i - \mu_i)\Big\} \le O\Big(\omega \cdot \frac{\epsilon\sqrt{\log 1/\epsilon}}{\sqrt{k}}\Big) \quad \text{and} \quad \sup_{\Sigma \in \mathcal{K}} \Big\{\frac{1}{|W|}\sum_{i \in W} (\mu_i - \mu)^\top \Sigma (X_i - \mu_i)\Big\} \le O\Big(\omega \cdot \frac{\sqrt{\log 1/\epsilon}}{\sqrt{k}}\Big).$$
We first remark that we only need extremely mild concentration in Condition (III), but it turns out this suffices in the one place where we use it (see Lemma 4.9). Additionally, note that we can completely ignore Condition (IV) when $\omega = 0$. The following makes clear why it is useful when $\omega > 0$:
Lemma 4.4. For $\epsilon$-good $U$ and all $W \subset U$ of size $\epsilon|U|$,
$$\|A(\hat{w}(U), \mu) - A(\hat{w}(U), \{\mu_i\})\|_{\mathcal{K}} \le O\Big(\omega^2 + \frac{\omega\,\epsilon\sqrt{\log 1/\epsilon}}{\sqrt{k}}\Big) \quad \text{and} \quad \|A(\hat{w}(W), \mu) - A(\hat{w}(W), \{\mu_i\})\|_{\mathcal{K}} \le O\Big(\omega^2 + \frac{\omega\sqrt{\log 1/\epsilon}}{\sqrt{k}}\Big).$$
Proof. For $S = U$ or $S = W$ and any $\Sigma \in \mathcal{K}$,
$$\langle \Sigma, A(\hat{w}(S), \mu) - A(\hat{w}(S), \{\mu_i\})\rangle = \frac{1}{|S|}\sum_{i \in S} \big\langle (X_i - \mu)^{\otimes 2} - (X_i - \mu_i)^{\otimes 2}, \Sigma\big\rangle = \frac{1}{|S|}\sum_{i \in S} (\mu_i - \mu)^\top \Sigma (2X_i - \mu_i - \mu) = \frac{2}{|S|}\sum_{i \in S} (\mu_i - \mu)^\top \Sigma (X_i - \mu_i) + \frac{1}{|S|}\sum_{i \in S} \big\langle (\mu_i - \mu)^{\otimes 2}, \Sigma\big\rangle. \quad (6)$$
The first (resp. second) part of the lemma follows by taking $S = U$ (resp. $S = W$) and invoking the first (resp. second) part of Condition (IV) of $\epsilon$-goodness to upper bound the first term in (6), and Fact 3.4 together with the fact that $\|\mu_i - \mu\|_1 \le 2\omega$ for all $i$ to upper bound the second term in (6).

Corollary 4.5. If $U$ is $\epsilon$-good and $\mu \triangleq \frac{1}{|U|}\sum_{i \in U} \mu_i$, then
$$\|A(\hat{w}(U), \mu) - B(\{\mu_i\})\|_{\mathcal{K}} \le O\Big(\omega^2 + \frac{\epsilon \log 1/\epsilon}{k}\Big).$$
Proof. This follows immediately from Lemma 4.4 and the first part of Condition (II) of $\epsilon$-goodness.

In Appendix A, we will show that for $N$ sufficiently large, the set $S_G$ of uncorrupted batches will satisfy the above deterministic condition.

Lemma 4.6 (Regularity of good samples). If $U$ is a set of $\widetilde{\Omega}\big(\log(1/\delta)(\ell^2/\epsilon^2) \cdot \log^3(n)\big)$ independent samples from $\mathrm{Mul}_k(\mu_1), \dots, \mathrm{Mul}_k(\mu_{|U|})$, then $U$ is $\epsilon$-good with probability at least $1 - \delta$.

4.4 Key Geometric Lemma

The key property of $\epsilon$-good sets is the following geometric lemma, bounding the accuracy of an estimate $\mu(w)$ given by weights $w$ in terms of $\|M(w)\|_{\mathcal{K}}$.

Lemma 4.7 (Spectral signatures). If $S_G$ is $\epsilon$-good and $|S_G| \ge (1-\epsilon)N$, then for any $w \in W_\epsilon$,
$$\|\mu(w) - \mu\|_{\mathcal{K}} \le O\bigg(\frac{\epsilon}{\sqrt{k}}\sqrt{\log 1/\epsilon} + \epsilon \cdot \omega + \sqrt{\epsilon\Big(\|M(w)\|_{\mathcal{K}} + \omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\Big)}\bigg).$$

It turns out the proof ingredients for Lemma 4.7 will also be useful in our analysis of LearnWithFilter later, so we will now prove this lemma in full.

Proof.
Take any $\Sigma \in \mathcal{K}$. Recalling that $\Sigma$ is psd by Constraint 5 in Definition 3.1, we will sometimes write it as $\Sigma = \mathbb{E}_v[vv^\top]$, where the distribution over $v$ is defined according to the eigendecomposition of $\Sigma$. We wish to bound $\mathbb{E}_v[\langle \mu(w) - \mu, v\rangle^2]$. By splitting $w_i \triangleq 1/N - \delta_i$ for $i \in S_G$, we have that
$$\langle \mu(w) - \mu, v\rangle = \sum_{i=1}^N w_i \langle X_i - \mu, v\rangle = \Big\langle \frac{|S_G|}{N}(\mu(S_G) - \mu), v\Big\rangle - \sum_{i \in S_G} \delta_i \langle X_i - \mu, v\rangle + \sum_{i \in S_B} w_i \langle X_i - \mu(w), v\rangle + \langle \mu(w) - \mu, v\rangle \sum_{i \in S_B} w_i.$$
We may rewrite this as
$$\Big(1 - \sum_{i \in S_B} w_i\Big)\langle \mu(w) - \mu, v\rangle = \Big\langle \frac{|S_G|}{N}(\mu(S_G) - \mu), v\Big\rangle - \sum_{i \in S_G} \delta_i \langle X_i - \mu, v\rangle + \sum_{i \in S_B} w_i \langle X_i - \mu(w), v\rangle.$$
Note further that
$$\sum_{i \in S_G} \delta_i \langle X_i - \mu, v\rangle = \sum_{i \in S_G} \delta_i \langle X_i - \mu_i, v\rangle + \sum_{i \in S_G} \delta_i \langle \mu_i - \mu, v\rangle,$$
so in particular,
$$\Big(1 - \sum_{i \in S_B} w_i\Big)^2 \cdot \mathbb{E}_v\big[\langle \mu(w) - \mu, v\rangle^2\big] \le 4(T_1 + T_2 + T_3 + T_4), \quad (7)$$
where
$$T_1 \triangleq \frac{|S_G|^2}{N^2}\,\mathbb{E}_v\big[\langle \mu(S_G) - \mu, v\rangle^2\big], \quad T_2 \triangleq \mathbb{E}_v\Big[\Big(\sum_{i \in S_G} \delta_i \langle X_i - \mu_i, v\rangle\Big)^2\Big], \quad T_3 \triangleq \mathbb{E}_v\Big[\Big(\sum_{i \in S_G} \delta_i \langle \mu_i - \mu, v\rangle\Big)^2\Big], \quad T_4 \triangleq \mathbb{E}_v\Big[\Big(\sum_{i \in S_B} w_i \langle X_i - \mu(w), v\rangle\Big)^2\Big].$$
For $T_1$, note that
$$T_1 \le \frac{|S_G|^2}{N^2}\,\|\mu(S_G) - \mu\|_{\mathcal{K}}^2 \le O\Big(\frac{\epsilon^2 \log 1/\epsilon}{k}\Big)$$
by the first part of Condition (I) of $\epsilon$-goodness of $S_G$ and the fact that $|S_G|/N \ge 1 - \epsilon$.

For $T_2$, by Cauchy-Schwarz we have that
$$T_2 \le \Big(\sum_{i \in S_G} \delta_i\Big) \cdot \mathbb{E}_v\Big[\sum_{i \in S_G} \delta_i \langle X_i - \mu_i, v\rangle^2\Big] \le \epsilon \cdot \Big\langle \sum_{i \in S_G} \delta_i (X_i - \mu_i)^{\otimes 2}, \mathbb{E}_v[vv^\top]\Big\rangle = \epsilon\,\langle A(\delta, \{\mu_i\}), \Sigma\rangle \le O\Big(\frac{\epsilon^2}{k}\log 1/\epsilon\Big), \quad (8)$$
where the last step follows by Lemma 4.8 below.

For $T_3$, again by Cauchy-Schwarz,
$$T_3 \le \Big(\sum_{i \in S_G} \delta_i\Big) \cdot \mathbb{E}_v\Big[\sum_{i \in S_G} \delta_i \langle \mu_i - \mu, v\rangle^2\Big] \le \epsilon \cdot \sum_{i \in S_G} \delta_i \|\mu_i - \mu\|_{\mathcal{K}}^2 \le \epsilon^2 \cdot \max_{i \in S_G} \|\mu_i - \mu\|_1^2 \le O(\epsilon^2 \cdot \omega^2),$$
where the penultimate step follows by Fact 3.4.

Finally, we will relate $T_4$ to $\|M(w)\|_{\mathcal{K}}$. Let $w'$ be the set of weights given by $w'_i = w_i$ for $i \in S_G$ and $w'_i = 0$ for $i \notin S_G$.
By another application of Cauchy-Schwarz,
$$T_4 \le \Big(\sum_{i \in S_B} w_i\Big) \cdot \mathbb{E}_v\Big[\sum_{i \in S_B} w_i \langle X_i - \mu(w), v\rangle^2\Big] \le \epsilon\,\mathbb{E}_v\Big[\sum_{i=1}^N w_i \langle X_i - \mu(w), v\rangle^2\Big] - \epsilon\,\mathbb{E}_v\Big[\sum_{i \in S_G} w_i \langle X_i - \mu(w), v\rangle^2\Big]$$
$$= \epsilon\,\big\langle A(w, \mu(w)) - A(w', \mu(w)), \Sigma\big\rangle \quad (9)$$
$$\le \epsilon\,\big\langle A(w, \mu(w)) - A(w', \mu(w')), \Sigma\big\rangle \quad (10)$$
$$\le \epsilon\,\Big\langle A(w, \mu(w)) - \Big(\sum w'_i\Big) B(\mu(w')), \Sigma\Big\rangle + O\Big(\epsilon \cdot \omega^2 + \frac{\epsilon^2}{k}\log 1/\epsilon\Big) \quad (11)$$
$$= \epsilon\,\langle M(w), \Sigma\rangle + \epsilon\,\Big\langle B(\mu(w)) - \Big(\sum w'_i\Big) B(\mu(w')), \Sigma\Big\rangle + O\Big(\epsilon \cdot \omega^2 + \frac{\epsilon^2}{k}\log 1/\epsilon\Big)$$
$$\le \epsilon\,\|M(w)\|_{\mathcal{K}} + \epsilon\,\Big\|B(\mu(w)) - \Big(\sum w'_i\Big) B(\mu(w'))\Big\|_{\mathcal{K}} + O\Big(\epsilon \cdot \omega^2 + \frac{\epsilon^2}{k}\log 1/\epsilon\Big), \quad (12)$$
where (9) follows by the definition of $A(w, \nu)$, (10) follows by Fact 2.2, and (11) follows by Lemma 4.9 below. Lastly, by the triangle inequality, we may upper bound $\|B(\mu(w)) - (\sum w'_i) B(\mu(w'))\|_{\mathcal{K}}$ by
$$\|B(\mu(w)) - B(\mu(w'))\|_{\mathcal{K}} + O(\epsilon) \cdot \|B(\mu(w'))\|_{\mathcal{K}} \le \frac{3}{k}\|\mu(w) - \mu(w')\|_1 + O(\epsilon/k) \le O(\epsilon/k), \quad (13)$$
where the first inequality follows by Corollary 3.5, and the bound on $\|\mu(w) - \mu(w')\|_1$ in the last step follows from Fact 2.3. The lemma then follows from (7), (8), (12), and (13).

Next, we show in Lemma 4.8 that small subsets of the good samples cannot contribute too much to the total energy. Lemma 4.9, which bounds the norm of $M(w)$ for any set of weights $w$ which is close to the uniform set of weights over $S_G$, will follow as a consequence.
Lemma 4.8. For any $0 < \epsilon < 1/2$, if $U$ is $\epsilon$-good and $\delta : U \to [0, 1/|U|]$ is a set of weights satisfying $\sum_{i \in U} \delta_i \le \epsilon$, then we have the following bounds:

1. $\|A(\delta, \{\mu_i\})\|_{\mathcal{K}} \le O\big(\frac{\epsilon}{k}\log 1/\epsilon\big)$
2. $\big\|\sum_{i \in U} \delta_i (X_i - \mu_i)\big\|_{\mathcal{K}} \le O\big(\frac{\epsilon}{\sqrt{k}}\sqrt{\log 1/\epsilon}\big)$
3. $\|A(\delta, \mu)\|_{\mathcal{K}} \le O\big(\epsilon \cdot \omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\big)$
4. $\big\|\sum_{i \in U} \delta_i (X_i - \mu)\big\|_{\mathcal{K}} \le O\big(\frac{\epsilon}{\sqrt{k}}\sqrt{\log 1/\epsilon} + \epsilon \cdot \omega\big)$

Proof. For the first part, we may assume without loss of generality that $\sum_{i \in U} \delta_i = \epsilon$. But then we may write $\delta$ as $\epsilon\,\mathbb{E}_W[\hat{w}(W)]$ for some distribution over subsets $W \subset U$ of size $\epsilon|U|$. By Jensen's inequality and the second part of Condition (II) of $\epsilon$-goodness of $U$, we conclude that
$$\|A(\delta, \{\mu_i\})\|_{\mathcal{K}} \le \epsilon \cdot \mathbb{E}_W\big[\|A(\hat{w}(W), \{\mu_i\})\|_{\mathcal{K}}\big] \le O\Big(\frac{\epsilon}{k}\log 1/\epsilon\Big),$$
giving the first part of the lemma.

For the second part, for any $\Sigma \in \mathcal{K}$ of the form $\Sigma = \mathbb{E}[vv^\top]$,
$$\Big\langle \Sigma, \Big(\sum_{i \in U} \delta_i (X_i - \mu_i)\Big)^{\otimes 2}\Big\rangle = \mathbb{E}\Big[\Big(\sum_{i \in U} \delta_i \langle X_i - \mu_i, v\rangle\Big)^2\Big] \le \mathbb{E}\Big[\Big(\sum_{i \in U} \delta_i\Big) \cdot \Big(\sum_{i \in U} \delta_i \langle X_i - \mu_i, v\rangle^2\Big)\Big] \le \epsilon\,\|A(\delta, \{\mu_i\})\|_{\mathcal{K}} \le O\Big(\frac{\epsilon^2}{k}\log 1/\epsilon\Big),$$
where the second step follows by Cauchy-Schwarz and the fourth step follows by the first part of the lemma. As this holds for all $\Sigma \in \mathcal{K}$, we get the second part of the lemma.

This also implies the fourth part of the lemma, because
$$\Big\|\sum_{i \in U} \delta_i (X_i - \mu)\Big\|_{\mathcal{K}} \le \Big\|\sum_{i \in U} \delta_i (X_i - \mu_i)\Big\|_{\mathcal{K}} + \Big\|\sum_{i \in U} \delta_i (\mu_i - \mu)\Big\|_{\mathcal{K}} \le O\Big(\frac{\epsilon}{\sqrt{k}}\sqrt{\log 1/\epsilon}\Big) + \sum_{i \in U} \delta_i \|\mu_i - \mu\|_1 \le O\Big(\frac{\epsilon}{\sqrt{k}}\sqrt{\log 1/\epsilon} + \epsilon \cdot \omega\Big).$$
Finally, writing $\delta$ as $\epsilon\,\mathbb{E}_W[\hat{w}(W)]$ as before and applying Jensen's inequality to the second part of Lemma 4.4, we get that
$$\|A(\delta, \mu) - A(\delta, \{\mu_i\})\|_{\mathcal{K}} \le \epsilon \cdot O\Big(\omega^2 + \frac{\omega\sqrt{\log 1/\epsilon}}{\sqrt{k}}\Big) \le O\Big(\epsilon \cdot \omega^2 + \frac{\epsilon \log 1/\epsilon}{k}\Big).$$
The third part of the lemma then follows from the first part, together with the triangle inequality.
Lemma 4.9. If $S_G$ is $\epsilon$-good, and $w : S_G \to [0, 1/|S_G|]$ satisfies $\|w - \hat{w}(S_G)\|_1 \le \epsilon$ and $\sum_{i \in S_G} w_i = 1$, then $\|M(w)\|_{\mathcal{K}} \le O\big(\omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\big)$.

Proof. Define $\delta_i = 1/|S_G| - w_i$ for all $i \in S_G$, and take any $\Sigma \in \mathcal{K}$. By Fact 2.2 and the assumption that $\|w\|_1 = 1$,
$$\langle A(w, \mu(w)), \Sigma\rangle = \langle A(w, \mu), \Sigma\rangle - \big\langle (\mu(w) - \mu)^{\otimes 2}, \Sigma\big\rangle. \quad (14)$$
For the second term on the right-hand side of (14), note that we can write
$$\mu(w) - \mu = \sum_{i \in S_G} w_i (X_i - \mu) = \sum_{i \in S_G} (1/|S_G| - \delta_i)(X_i - \mu) = (\mu(S_G) - \mu) - \sum_{i \in S_G} \delta_i (X_i - \mu) = (\mu(S_G) - \mu) - \sum_{i \in S_G} \delta_i (X_i - \mu_i) - \sum_{i \in S_G} \delta_i (\mu_i - \mu),$$
where the first step follows from the fact that $\sum_{i \in S_G} w_i = 1$. So by the triangle inequality,
$$\|\mu(w) - \mu\|_{\mathcal{K}} \le \|\mu(S_G) - \mu\|_{\mathcal{K}} + \Big\|\sum_{i \in S_G} \delta_i (X_i - \mu)\Big\|_{\mathcal{K}} \le O\Big(\frac{\epsilon}{\sqrt{k}}\sqrt{\log 1/\epsilon} + \epsilon \cdot \omega\Big), \quad (15)$$
where the second step follows by the first part of Condition (I) in the definition of $\epsilon$-goodness for $S_G$, together with the fourth part of Lemma 4.8.

Next, we bound the first term on the right-hand side of (14). We have
$$|\langle A(w, \mu), \Sigma\rangle| \le |\langle A(\hat{w}(S_G), \mu), \Sigma\rangle| + |\langle A(\delta, \mu), \Sigma\rangle| \le |\langle A(\hat{w}(S_G), \mu), \Sigma\rangle| + O\Big(\epsilon \cdot \omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\Big) \le |\langle B(\{\mu_i\}), \Sigma\rangle| + O\Big(\omega^2 + \frac{\epsilon \log 1/\epsilon}{k}\Big) \le |\langle B(\mu(S_G)), \Sigma\rangle| + O\Big(\omega^2 + \frac{\epsilon \log 1/\epsilon}{k}\Big), \quad (16)$$
where the second step follows by the third part of Lemma 4.8, the third step follows by Corollary 4.5, and the fourth step follows by Condition (III) of $\epsilon$-goodness.

Additionally, by Corollary 3.5, we can bound
$$|\langle B(\mu(w)), \Sigma\rangle - \langle B(\mu(S_G)), \Sigma\rangle| \le \frac{3}{k}\|\mu(w) - \mu(S_G)\|_1 \le \frac{3}{k}\|w - \hat{w}(S_G)\|_1 \le O(\epsilon/k). \quad (17)$$
By (16) and (17), we conclude that $\langle A(w, \mu), \Sigma\rangle \le \langle B(\mu(w)), \Sigma\rangle + O\big(\omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\big)$, so this together with (14) and (15) yields the desired bound.

4.5 Analyzing the Filter With Spectral Signatures

We now use Lemma 4.7 to show that, under the deterministic condition that the uncorrupted points are $\epsilon$-good, LearnWithFilter satisfies the guarantees of Theorem 4.1. The main step is to show that as long as we remain in the main loop of LearnWithFilter, and we have so far thrown out more bad weight than good weight, we are guaranteed to throw out more bad weight than good weight in the next iteration of the main loop:
Lemma 4.10. Let $w$ and $w'$ be the weights at the start and end of a single iteration of the main loop of LearnWithFilter. There is an absolute constant $C > 0$ such that if $\|M(w)\|_{\mathcal{K}} > C \cdot \big(\omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\big)$ and $\sum_{i \in S_G} \frac{1}{N} - w_i < \sum_{i \in S_B} \frac{1}{N} - w_i$, then $\sum_{i \in S_G} w_i - w'_i < \sum_{i \in S_B} w_i - w'_i$.

Proof. Suppose the scores $\tau_1, \dots, \tau_N$ in this iteration are sorted in decreasing order, and let $T$ denote the smallest index for which $\sum_{i \in [T]} w_i \ge 2\epsilon$. As Filter does not modify $w_i$ for $i > T$, we just need to show that $\sum_{i \in S_G \cap [T]} w_i - w'_i < \sum_{i \in S_B \cap [T]} w_i - w'_i$, and by Lemma 4.2 it is enough to show that
$$\sum_{i \in S_G \cap [T]} w_i \tau_i < \sum_{i \in S_B \cap [T]} w_i \tau_i. \quad (18)$$
First note that, because each weight is at most $1/N$, we may assume that $\sum_{i \in [T]} w_i \le 3\epsilon$. We begin by upper bounding the left-hand side of (18).

Lemma 4.11. $\sum_{i \in S_G \cap [T]} w_i \tau_i \le O\big(\frac{\epsilon}{k}\log 1/\epsilon + \epsilon \cdot \omega^2 + \epsilon\,\|M(w)\|_{\mathcal{K}}\big)$.

Proof. Let $w''$ be the weights given by $w''_i = w_i$ for $i \in S_G \cap [T]$ and $w''_i = 0$ otherwise. Then $\sum_{i \in S_G \cap [T]} w_i \tau_i$ is equal to
$$\sum_{i \in [N]} w''_i \tau_i = \sum_{i \in [N]} w''_i \big\langle (X_i - \mu(w))^{\otimes 2}, \Sigma\big\rangle = \sum_{i \in [N]} w''_i \big\langle (X_i - \mu(w''))^{\otimes 2}, \Sigma\big\rangle + \|w''\|_1 \cdot \big\langle (\mu(w'') - \mu(w))^{\otimes 2}, \Sigma\big\rangle \quad (19)$$
$$\le \sum_{i \in [N]} w''_i \big\langle (X_i - \mu(w''))^{\otimes 2}, \Sigma\big\rangle + O(\epsilon) \cdot \|\mu(w'') - \mu(w)\|_{\mathcal{K}}^2 \quad (20)$$
$$\le \sum_{i \in [N]} w''_i \big\langle (X_i - \mu)^{\otimes 2}, \Sigma\big\rangle + O(\epsilon) \cdot \|\mu(w'') - \mu(w)\|_{\mathcal{K}}^2 \quad (21)$$
$$\le O\Big(\epsilon \cdot \omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\Big) + O(\epsilon) \cdot \|\mu(w'') - \mu(w)\|_{\mathcal{K}}^2,$$
where (19) and (21) both follow from Fact 2.2, (20) follows from the earlier assumption that $\sum_{i \in [T]} w_i \le 3\epsilon$ and the definition of $\|\cdot\|_{\mathcal{K}}$, and the last step follows by the third part of Lemma 4.8. Now note that
$$\|\mu(w'') - \mu(w)\|_{\mathcal{K}} \le \|\mu(w'') - \mu\|_{\mathcal{K}} + \|\mu(w) - \mu\|_{\mathcal{K}} \le O\Big(\frac{\sqrt{\log 1/\epsilon}}{\sqrt{k}} + \omega\Big) + \|\mu(w) - \mu\|_{\mathcal{K}} \le O\bigg(\frac{\sqrt{\log 1/\epsilon}}{\sqrt{k}} + \omega + \sqrt{\epsilon\Big(\|M(w)\|_{\mathcal{K}} + \omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\Big)}\bigg),$$
where the second step follows by the fourth part of Lemma 4.8 and the third step holds by Lemma 4.7. The desired bound follows.

One consequence of this is that, outside of the tails, the scores among good samples are small.
Corollary 4.12. For all $i > T$, $\tau_i \le O\big(\frac{1}{k}\log 1/\epsilon + \|M(w)\|_{\mathcal{K}} + \omega^2\big)$.

Proof. Note that
$$\sum_{i \in S_G \cap [T]} w_i = \sum_{i \in [T]} w_i - \sum_{i \in S_B \cap [T]} w_i \ge 2\epsilon - \sum_{i \in S_B} w_i \ge \epsilon,$$
so the claim follows from Lemma 4.11 and averaging.

Next, we show that the deviation of the total scores of the good points from their expectation is negligible.

Lemma 4.13. $\sum_{i \in S_G} w_i \tau_i - \langle B(\mu(w)), \Sigma\rangle \le O\big(\frac{\epsilon}{k}\log 1/\epsilon + \epsilon \cdot \omega^2 + \epsilon \cdot \|M(w)\|_{\mathcal{K}}\big)$.

Proof. Let $w'$ be the weights given by $w'_i = w_i$ for $i \in S_G$ and $w'_i = 0$ otherwise. Then by Fact 2.2,
$$\sum_{i \in S_G} w_i \tau_i = \sum_{i \in S_G} w_i \big\langle (X_i - \mu(w'))^{\otimes 2}, \Sigma\big\rangle + \|w'\|_1 \cdot \big\langle (\mu(w) - \mu(w'))^{\otimes 2}, \Sigma\big\rangle \le \sum_{i \in S_G} w_i \Big(\langle B(\mu(w')), \Sigma\rangle + O\Big(\omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\Big)\Big) + \|\mu(w) - \mu(w')\|_{\mathcal{K}}^2,$$
where in the second step we used Lemma 4.9 and the definition of $\|\cdot\|_{\mathcal{K}}$. To bound the $\|\mu(w) - \mu(w')\|_{\mathcal{K}}$ term, note that
$$\|\mu(w) - \mu(w')\|_{\mathcal{K}} \le \|\mu(w) - \mu\|_{\mathcal{K}} + \|\mu(w') - \mu\|_{\mathcal{K}} \le \|\mu(w) - \mu\|_{\mathcal{K}} + O\Big(\frac{\epsilon\sqrt{\log 1/\epsilon}}{\sqrt{k}} + \epsilon \cdot \omega\Big) \le O\bigg(\frac{\epsilon\sqrt{\log 1/\epsilon}}{\sqrt{k}} + \epsilon \cdot \omega + \sqrt{\epsilon\Big(\|M(w)\|_{\mathcal{K}} + \omega^2 + \frac{\epsilon}{k}\log 1/\epsilon\Big)}\bigg),$$
where the second step follows by the fourth part of Lemma 4.8, and the third step follows by Lemma 4.7. Finally, by Corollary 3.5 we have that
$$\langle B(\mu(w')), \Sigma\rangle \le \langle B(\mu(w)), \Sigma\rangle + \frac{3}{k}\|\mu(w') - \mu(w)\|_1 \le \langle B(\mu(w)), \Sigma\rangle + O(\epsilon/k),$$
where the last step follows by Fact 2.3. This completes the proof of the claim.

We are now ready to complete the proof of Lemma 4.10. In light of Lemma 4.11, we wish to lower bound the right-hand side of (18). Let $\Sigma^*$ denote the maximizer computed in this iteration of the main loop.

Claim 4.14. If $C > 0$ in the lower bound $\|M(w)\|_{\mathcal{K}} > C\big(\frac{\epsilon}{k}\log 1/\epsilon + \omega^2\big)$ is sufficiently large, then $\langle M(w), \Sigma^*\rangle$ must be positive.

Proof. Let $w'$ denote the weights given by $w'_i = w_i$ for $i \in S_G$ and $w'_i = 0$ otherwise. We have
$$M(w) = \sum_{i \in [N]} w_i (X_i - \mu(w))^{\otimes 2} - B(\mu(w)) \succeq \sum_{i \in S_G} w'_i (X_i - \mu(w))^{\otimes 2} - B(\mu(w)) \succeq \sum_{i \in S_G} w'_i (X_i - \mu(w'))^{\otimes 2} - B(\mu(w)) = M(w') + B(\mu(w')) - B(\mu(w)), \quad (22)$$
where the third step follows by Fact 2.2. Furthermore,
$$\|B(\mu(w')) - B(\mu(w))\|_{\mathcal{K}} \le \frac{3}{k}\|\mu(w') - \mu(w)\|_1 \le O(\epsilon/k) \quad (23)$$
by Corollary 3.5 and Fact 2.3. Lastly, we must bound $\|M(w')\|_{\mathcal{K}}$.
Letting $\hat{w}'$ denote the normalized version of $w'$, we have that
$$\|M(w')\|_{\mathcal{K}} \le \|M(\hat{w}')\|_{\mathcal{K}} + \|M(w') - M(\hat{w}')\|_{\mathcal{K}} \le \|M(\hat{w}')\|_{\mathcal{K}} + \|A(\hat{w}' - w', \mu)\|_{\mathcal{K}} \le O\Big(\frac{\epsilon}{k}\log(1/\epsilon) + \omega^2\Big), \tag{24}$$
where the penultimate step follows by Fact 2.2 and the definition of the matrix $M(\cdot)$, and the last step follows by Lemma 4.9 and the third part of Lemma 4.8.

We conclude by (22), (23), and (24) that
$$\min_{\Sigma \in \mathcal{K}} \langle M(w), \Sigma\rangle \ge -O\Big(\frac{\epsilon}{k}\log(1/\epsilon) + \omega^2\Big), \tag{25}$$
so we simply need to take $C$ larger than the constant implicit in the right-hand side of (25) to ensure that
$$\sum_{i \in [N]} w_i \tau_i - \langle B(\mu(w)), \Sigma^*\rangle = \langle M(w), \Sigma^*\rangle \ge \|M(w)\|_{\mathcal{K}}.$$
This, together with Lemma 4.13, yields $\sum_{i \in S_B} w_i \tau_i \ge C'\|M(w)\|_{\mathcal{K}}$ for some $C' < C$ which we can take to be arbitrarily large. We want to show that this same sum, over only $S_B \cap [T]$, enjoys essentially the same bound. Indeed,
$$\sum_{i \in S_B \cap [T]} w_i \tau_i \ge C'\|M(w)\|_{\mathcal{K}} - \sum_{i \in S_B \setminus [T]} w_i \tau_i \ge C'\|M(w)\|_{\mathcal{K}} - \sum_{i \in S_B} w_i \cdot O\Big(\frac{1}{k}\log(1/\epsilon) + \omega^2 + \frac{1}{\epsilon}\|M(w)\|_{\mathcal{K}}\Big) \ge C_1 \cdot \|M(w)\|_{\mathcal{K}},$$
for some arbitrarily large absolute constant $C_1$, where the second step follows by Corollary 4.12, and the last by the assumption that $\|M(w)\|_{\mathcal{K}} > C\big(\frac{\epsilon}{k}\log(1/\epsilon) + \omega^2\big)$. On the other hand, by this same assumption and by Lemma 4.11,
$$\sum_{i \in S_G \cap [T]} w_i \tau_i \le O\Big(\frac{\epsilon}{k}\log(1/\epsilon) + \epsilon\cdot\omega^2 + \epsilon\|M(w)\|_{\mathcal{K}}\Big) \le C_2 \cdot \|M(w)\|_{\mathcal{K}},$$
where $C_2$ can be taken to be smaller than $C_1$. This proves (18) and thus Lemma 4.10.

We can now combine Lemma 4.7 and Lemma 4.10 to get a proof of Theorem 4.1.

Proof of Theorem 4.1. Let $\hat{\mu}$ be the output of LearnWithFilter. By Lemma 2.7, it suffices to show that $\hat{\mu}$ satisfies $\|\hat{\mu} - \mu\|_{\mathcal{A}_{s(d+1)}} \le O\big(\omega + \frac{\epsilon}{\sqrt{k}}\sqrt{\log(1/\epsilon)}\big)$, or equivalently that for all $v \in V^n_\ell$, where $\ell \triangleq s(d+1)$, we have that $\langle(\hat{\mu} - \mu)^{\otimes 2}, vv^\top\rangle^{1/2} \le O\big(\omega + \frac{\epsilon}{\sqrt{k}}\sqrt{\log(1/\epsilon)}\big)$. By Corollary 3.3, it is enough to show that $\|\hat{\mu} - \mu\|_{\mathcal{K}} \le O\big(\omega + \frac{\epsilon}{\sqrt{k}}\sqrt{\log(1/\epsilon)}\big)$. By Lemma 4.7 together with the termination condition of the main loop of LearnWithFilter, we just need to show that the algorithm terminates (in polynomial time) and that $w \in W_{O(\epsilon)}$.

But by induction and Lemma 4.10, every iteration of the loop removes more mass from the bad points than from the good points. Furthermore, by Lemma 4.2, the support of $w$ goes down by at least one every time Filter is run, so the loop terminates after at most $N$ iterations, each of which can be implemented in polynomial time. At the end, at most an $\epsilon$ fraction of the total mass on $S_G$ has been removed, so the final weights $w$ satisfy $w \in W_{O(\epsilon)}$ as desired.
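To make the loop just analyzed concrete, here is a minimal sketch of one filtering iteration in Python. It assumes the batch empirical means are stacked in an array X, and it uses the standard soft-filter downweighting rule $w_i \leftarrow w_i(1 - \tau_i/\tau_{\max})$ over the top-$\epsilon$ tail of scores; the function name filter_step is ours, and the update is meant only to illustrate the two properties used in the proof (weights only decrease, and the batch with the largest score is zeroed out), not to reproduce the paper's exact Filter pseudocode.

```python
import numpy as np

def filter_step(X, w, Sigma, eps):
    """One soft-filtering iteration (illustrative sketch).

    X     : (N, n) array of batch empirical means
    w     : (N,) weight vector with entries in [0, 1/N]
    Sigma : (n, n) test matrix, ideally the SDP maximizer Sigma*
    eps   : corruption fraction
    """
    mu_w = (w @ X) / w.sum()                     # weighted mean mu(w)
    D = X - mu_w
    tau = np.einsum("ij,jk,ik->i", D, Sigma, D)  # scores tau_i = <(X_i - mu(w))^{(x)2}, Sigma>
    order = np.argsort(-tau)                     # sort scores in decreasing order
    T = np.searchsorted(np.cumsum(w[order]), eps) + 1
    head = order[:T]                             # smallest prefix with weight >= eps
    w_new = w.copy()
    # downweight high-score batches; the max-score batch is zeroed out,
    # so the support of w shrinks by at least one per call
    w_new[head] = w[head] * (1.0 - tau[head] / tau[head].max())
    return w_new
```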
[Figure 1: Arbitrary Distributions. Panels (i)-(iv) plot the $\mathcal{A}_{\ell/2}$ distance achieved by the filter, oracle, and naive estimators against (i) domain size $n$, (ii) batch size $k$, (iii) corruption fraction $\epsilon$, and (iv) number of batches $N$, alongside the $\epsilon/\sqrt{k}$ threshold.]

5 Experiments

In this section we report on empirical evaluations of our algorithm on synthetic data. We compared our algorithm LearnWithFilter, the naive estimator which simply takes the empirical mean of all samples, the "oracle" algorithm which computes the empirical mean of the uncorrupted samples, and the threshold of $\epsilon/\sqrt{k}$ which our theorems show that LearnWithFilter achieves, up to constant factors (in Figures 1 and 2, these are labeled "filter", "naive", "oracle", and $\epsilon/\sqrt{k}$ respectively).

[Figure 2: Structured Distributions. Panels (i)-(iv) plot the $L_1$ distance achieved by the filter, oracle, and naive estimators against (i) domain size $n$, (ii) batch size $k$, (iii) corruption fraction $\epsilon$, and (iv) number of batches $N$, alongside the $\epsilon/\sqrt{k}$ threshold.]

Note that by definition, the oracle dominates the algorithms considered in [CLM19] and [JO19] for the unstructured case, as those algorithms search for a subset of the data and output the empirical mean of that subset. But as Theorem 4.1 predicts,
LearnWithFilter should actually outperform the oracle in settings where the underlying distribution $\mu$ is structured and there are too few samples for the empirical mean of the uncorrupted points to concentrate sufficiently. In these experiments, we confirm this empirically.

Our experiments fall under two types: (A) those on learning an arbitrary distribution in $\mathcal{A}_{\ell/2}$ norm, and (B) those on learning a structured distribution in total variation distance. The purpose of experiments of type (A) is to convey that LearnWithFilter can be used to learn from untrusted batches in $\mathcal{A}_{\ell/2}$ norm even for distributions which are not necessarily structured. The purpose of experiments of type (B) is to demonstrate that LearnWithFilter can outperform the oracle for structured distributions.

Throughout, $\omega = 0$ and $\ell = 10$. While our algorithm can also be implemented for larger $\ell$ (as the size of the SDP we solve does not depend on $\ell$), we choose $\ell/2 = 5$ because it is small enough that the sample complexity savings of our algorithm are very pronounced, yet large enough that for the domain sizes $n$ we work with, enumerating over $V^n_\ell$ would be prohibitively expensive, justifying the need to use an SDP.

For experiments of type (A), we chose the true underlying distribution $\mu$ by sampling uniformly from $[0,1]^n$ and normalizing, and for experiments of type (B), we chose $\mu$ by sampling a uniformly random piecewise constant function with $\ell/2 = 5$ pieces. Given $\mu$ and a prescribed parameter $\delta$, the distribution from which the corrupted batches were drawn was taken to be $\mathrm{Mul}_k(\nu)$, where $\nu$ was constructed to satisfy $d_{TV}(\mu,\nu) = \delta$ by adding $2\delta/n$ to the smallest entries of $\mu$ and subtracting $2\delta/n$ from the largest. Sometimes this does not give a probability distribution, in which case we resample $\mu$. When $k, \epsilon, N$ are clear from context and we say that $N$ $\epsilon$-corrupted batches are drawn from the distribution specified by $(\mu,\nu)$, we mean that $\lfloor(1-\epsilon)N\rfloor$ batches are drawn from $\mathrm{Mul}_k(\mu)$ and $N - \lfloor(1-\epsilon)N\rfloor$ from $\mathrm{Mul}_k(\nu)$.

As noted in [JO19], choosing $\delta$ too high makes it too easy to detect the corruptions in the data, while choosing $\delta$ too low means the naive estimator will already perform quite well. In light of this and the fact that the above process for generating $\nu$ only ensures that $d_{TV}(\mu,\nu) = \delta$, whereas $\|\mu - \nu\|_{\mathcal{A}_\ell}$ might be much smaller, we chose $\delta$ for our experiments as follows. For experiments of type (A), we chose $\delta$ large enough that the $\mathcal{A}_{\ell/2}$ distance between the empirical mean and the truth was still sufficiently large that the naive estimator was not competitive. For experiments of type (B), where we measure error in terms of total variation distance, we could afford to choose $\delta$ slightly smaller.

We varied the domain size $n$, batch size $k$, corruption fraction $\epsilon$, and total number of batches $N$. Each of the following four experiments was repeated for a total of ten trials.
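Before turning to the individual experiments, note that the sampling procedure just described is easy to replicate; the following sketch does so with numpy (the function names are ours, and the retry loop mirrors the re-sampling of $\mu$ when the perturbation leaves the simplex).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pair(n, delta):
    # Sample mu uniformly from [0,1]^n and normalize; build nu at total
    # variation distance delta by moving mass from the largest entries of
    # mu to the smallest.  Resample mu if nu leaves the simplex.
    while True:
        mu = rng.random(n)
        mu /= mu.sum()
        order = np.argsort(mu)
        nu = mu.copy()
        nu[order[: n // 2]] += 2 * delta / n  # smallest entries gain mass
        nu[order[n // 2:]] -= 2 * delta / n   # largest entries lose mass
        if (nu >= 0).all():
            return mu, nu

def corrupted_batches(mu, nu, N, k, eps):
    # floor((1 - eps) N) good batches from Mul_k(mu), the rest from
    # Mul_k(nu); each row is the empirical mean of one batch of size k.
    n_good = int((1 - eps) * N)
    good = rng.multinomial(k, mu, size=n_good) / k
    bad = rng.multinomial(k, nu, size=N - n_good) / k
    return np.vstack([good, bad])
```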
(a) Varying domain size $n$: We fixed $\epsilon$, $k = 1000$, and $N = \lfloor(\ell^2/\epsilon^2)/(1-\epsilon)\rfloor$ to ensure $\lfloor\ell^2/\epsilon^2\rfloor$ batches from $\mathrm{Mul}_k(\mu)$. We chose such a large $k$ to ensure the gap between the empirical mean and our algorithm was very noticeable. In each trial and for each $n$ (starting from $n = 4$), we randomly generated $(\mu,\nu)$ via the above procedure and drew $N$ $\epsilon$-corrupted batches from the distribution specified by $(\mu,\nu)$. Note that while $N$ is independent of $n$, the performance of our algorithm is comparable to that of the oracle. (The naive estimator's error is decreasing in $n$ for an unrelated reason: as $n$ increases, the above procedure for sampling $(\mu,\nu)$ appears to skew towards $\mu$ for which the resulting perturbation $\nu$ is close in $\mathcal{A}_{\ell/2}$.)

(b) Varying batch size $k$: We fixed $\epsilon$, $n = 64$, and $N = \lfloor(\ell^2/\epsilon^2)/(1-\epsilon)\rfloor$. In each trial, we randomly generated $(\mu,\nu)$ via the above procedure, and then for each of several values of $k$, drew $N$ batches from the distribution specified by $(\mu,\nu)$. Note that while our algorithm's error and the oracle's error decay with $k$, the empirical mean's error remains fixed.

(c) Varying corruption fraction $\epsilon$: We fixed $\epsilon^*$, $n = 64$, $k = 1000$, and $N = \lfloor\ell^2/\epsilon^{*2}\rfloor$. In each trial, we randomly generated $(\mu,\nu)$ via the above procedure and drew $N$ batches from $\mathrm{Mul}_k(\mu)$. Then for each of five values of $\epsilon$, we drew an additional $\lfloor\epsilon N/(1-\epsilon)\rfloor$ batches from $\mathrm{Mul}_k(\nu)$. Note that while our algorithm's error remains close to $\epsilon^*/\sqrt{k}$, the empirical mean's error increases linearly in $\epsilon$.

(d) Varying number of batches $N$: We fixed $\epsilon$, $n = 128$, and $k = 500$. In each trial, we randomly generated $(\mu,\nu)$ via the above procedure, and then for each of five values of $\rho$, drew $N = \lfloor\rho\cdot\ell^2/\epsilon^2\rfloor$ batches from the distribution specified by $(\mu,\nu)$. Note that even with such a small number of samples, our algorithm can compete with the oracle. Also note that our error bottoms out at $\epsilon/\sqrt{k}$ while the oracle's error goes beneath this threshold.

For type (B), we ran the exact same set of four experiments but over structured $\mu$, with the key difference that after generating an estimate with LearnWithFilter, we post-processed it into a piecewise constant hypothesis and compared its total variation distance to that of the empirical mean of the whole dataset and the empirical mean of the uncorrupted points. As is evident from Figure 2, our algorithm outperforms even the oracle, as predicted by Theorem 4.1.

The experiments were conducted on a MacBook Pro with a 2.6 GHz Dual-Core Intel Core i5 processor and 8 GB of RAM. The experiments of type (A) respectively took 110m36.499s, 73m19.477s, 50m54.655s, and 536m39.212s to run. The experiments of type (B) respectively took 64m28.346s, 52m7.859s, 39m36.754s, and 362m50.742s to run. The discrepancy in runtimes between (A) and (B) can be explained by the fact that a number of unrelated processes were also running at the time of the former. The experiment varying the number of batches $N$ was the most expensive because we chose domain size $n = 128$ to accentuate the gap between our algorithm and the oracle. The abovementioned runtimes imply that over a domain of size 128, each run of LearnWithFilter takes roughly 7-10 minutes.

For the implementation, we used the SCS solver in CVXPY for our semidefinite programs. In order to achieve reasonable runtimes, we needed to relax the feasibility tolerance, and as a result the SDP solver would occasionally output matrices $\Sigma$ which are moderately far from $\mathcal{K}$; in particular, one mode of failure that arose was that $\Sigma$ might be non-PSD and give rise to negative scores in LearnWithFilter. We chose to address this mode of failure heuristically by terminating the algorithm whenever this happened and simply outputting the estimate for $\mu$ at that point in time. Of the 480 total trials that were run across all experiments, this happened 53 times.
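For concreteness, the inner SDP solve can be set up in CVXPY roughly as follows. The constraint list here is a simplified stand-in for the actual set $\mathcal{K}$ of Definition 3.1 (which also carries the Haar-weighted constraints), the function name max_over_K is ours, and the exact keyword for the solver tolerance depends on the installed SCS version.

```python
import cvxpy as cp
import numpy as np

def max_over_K(M, tol=1e-4):
    # Maximize <M, Sigma> over a simplified stand-in for K: symmetric PSD
    # matrices with entries bounded by 1.  The real constraint set in the
    # paper additionally imposes the Haar-weighted constraints.
    n = M.shape[0]
    Sigma = cp.Variable((n, n), symmetric=True)
    constraints = [Sigma >> 0, cp.abs(Sigma) <= 1]
    prob = cp.Problem(cp.Maximize(cp.trace(M @ Sigma)), constraints)
    prob.solve(solver=cp.SCS, eps=tol)
    S = Sigma.value
    # with a loose tolerance the solver may return a slightly non-PSD
    # matrix; this is exactly the failure mode handled heuristically above
    min_eig = np.linalg.eigvalsh((S + S.T) / 2).min()
    return S, prob.value, min_eig
```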
Another heuristic that we used was to terminate the algorithm as soon as $\|M(w)\|_{\mathcal{K}}$ stopped decreasing during a run of LearnWithFilter; this was primarily to have a stopping criterion that avoids the need to tune constant factors. As demonstrated by Figures 1 and 2, these heuristic decisions ultimately had negligible effect on the performance of our algorithm.
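Putting the pieces together, the experimental driver can be organized as below, with the two stopping heuristics just described implemented as early exits; filter_step and max_over_K are the sketches given earlier, all names are ours rather than those of the released code, and the $B(\mu(w))$ correction term is omitted for brevity.

```python
import numpy as np

def learn_with_filter(X, eps, tol=1e-6):
    # Outer loop (illustrative): alternate between solving the SDP and
    # soft filtering, stopping on either heuristic described above.
    N, _ = X.shape
    w = np.full(N, 1.0 / N)
    prev_val = np.inf
    for _ in range(N):
        mu_w = (w @ X) / w.sum()
        D = X - mu_w
        M = np.einsum("i,ij,ik->jk", w, D, D)  # sum_i w_i (X_i - mu(w))^{(x)2}
        Sigma, val, min_eig = max_over_K(M)
        if min_eig < -tol:   # heuristic 1: solver strayed from the PSD cone
            break
        if val >= prev_val:  # heuristic 2: SDP value stopped decreasing
            break
        prev_val = val
        w = filter_step(X, w, Sigma, eps)
    return (w @ X) / w.sum()
```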
All code, data, and documentation can be found at https://github.com/secanth/federated.

Acknowledgments

We would like to thank the authors of the concurrent work [JO20] for coordinating submissions with us.
References

[ADH+15] J. Acharya, I. Diakonikolas, C. Hegde, J. Li, and L. Schmidt. Fast and near-optimal algorithms for approximating distributions by histograms. In PODS, 2015.

[ADLS17] Jayadev Acharya, Ilias Diakonikolas, Jerry Li, and Ludwig Schmidt. Sample-optimal density estimation in nearly-linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1278-1289. SIAM, 2017.

[AN04] Noga Alon and Assaf Naor. Approximating the cut-norm via Grothendieck's inequality. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 72-80. ACM, 2004.

[Ans60] Frank J. Anscombe. Rejection of outliers. Technometrics, 2(2):123-146, 1960.

[BBBB72] Richard E. Barlow, David J. Bartholomew, James M. Bremner, and H. Daniel Brunk. Statistical Inference Under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, New York, 1972.

[CDSS13] Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Learning mixtures of structured distributions over discrete domains. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1380-1394. SIAM, 2013.

[CDSS14a] S. Chan, I. Diakonikolas, R. Servedio, and X. Sun. Near-optimal density estimation in near-linear time using variable-width histograms. In NIPS, pages 1844-1852, 2014.

[CDSS14b] Siu-On Chan, Ilias Diakonikolas, Rocco A. Servedio, and Xiaorui Sun. Efficient density estimation via piecewise polynomial approximation. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 604-613. ACM, 2014.

[CLM19] Sitan Chen, Jerry Li, and Ankur Moitra. Efficiently learning structured distributions from untrusted batches. arXiv preprint arXiv:1911.02035, 2019.

[CSV17] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 47-60. ACM, 2017.

[DHL19] Yihe Dong, Samuel Hopkins, and Jerry Li. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. In Advances in Neural Information Processing Systems, pages 6065-6075, 2019.

[Dia16] Ilias Diakonikolas. Learning structured distributions. Handbook of Big Data, 267, 2016.

[DK19] Ilias Diakonikolas and Daniel M. Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.

[DKK+17] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In Proceedings of the 34th International Conference on Machine Learning, pages 999-1008. JMLR.org, 2017.

[DKK+18] Ilias Diakonikolas, Gautam Kamath, Daniel M. Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. arXiv preprint arXiv:1803.02815, 2018.

[DKK+19] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Robust estimators in high-dimensions without the computational intractability. SIAM Journal on Computing, 48(2):742-864, 2019.

[DKS17] Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 73-84. IEEE, 2017.

[DL01] Luc Devroye and Gabor Lugosi. Combinatorial Methods in Density Estimation. Springer Science & Business Media, 2001.

[HL18] Samuel B. Hopkins and Jerry Li. Mixture models, robustness, and sum of squares proofs. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1021-1034. ACM, 2018.

[Hub92] Peter J. Huber. Robust estimation of a location parameter. In Breakthroughs in Statistics, pages 492-518. Springer, 1992.

[JO19] Ayush Jain and Alon Orlitsky. Robust learning of discrete distributions from batches. arXiv preprint arXiv:1911.08532, 2019.

[JO20] Ayush Jain and Alon Orlitsky. A general method for robust learning from batches. arXiv preprint arXiv:2002.11099, 2020.

[KMY+16] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.

[KSS18] Pravesh K. Kothari, Jacob Steinhardt, and David Steurer. Robust moment estimation and improved clustering via sum of squares. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 1035-1046. ACM, 2018.

[Li18] Jerry Zheng Li. Principled Approaches to Robust Machine Learning and Beyond. PhD thesis, Massachusetts Institute of Technology, 2018.

[LRR13] Reut Levi, Dana Ron, and Ronitt Rubinfeld. Testing properties of collections of distributions. Theory of Computing, 9(1):295-347, 2013.

[LRV16] Kevin A. Lai, Anup B. Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 665-674. IEEE, 2016.

[MMR+17] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[O'B16] Carl M. O'Brien. Nonparametric estimation under shape constraints: Estimators, algorithms and asymptotics. International Statistical Review, 84(2):318-319, 2016.

[QV17] Mingda Qiao and Gregory Valiant. Learning discrete distributions from untrusted batches. arXiv preprint arXiv:1711.08113, 2017.

[SCV18] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning in the presence of arbitrary outliers. In 9th Innovations in Theoretical Computer Science Conference (ITCS). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.

[SHKT97] C. J. Stone, M. H. Hansen, C. Kooperberg, and Y. K. Truong. Polynomial splines and their tensor products in extended linear modeling: 1994 Wald memorial lecture. Annals of Statistics, 25(4):1371-1470, 1997.

[Ste18] Jacob Steinhardt. Robust Learning: Information Theory and Algorithms. PhD thesis, Stanford University, 2018.

[Sto94] C. J. Stone. The use of polynomial splines and their tensor products in multivariate function estimation. The Annals of Statistics, 22(1):118-171, 1994.

[TKV17] Kevin Tian, Weihao Kong, and Gregory Valiant. Learning populations of parameters. In Advances in Neural Information Processing Systems, pages 5778-5787, 2017.

[Tuk60] John W. Tukey. A survey of sampling from contaminated distributions. Contributions to Probability and Statistics, pages 448-485, 1960.

[Tuk75] John W. Tukey. Mathematics and the picturing of data. In Proceedings of the International Congress of Mathematicians, Vancouver, 1975, volume 2, pages 523-531, 1975.

[WN07] R. Willett and R. D. Nowak. Multiscale Poisson intensity and density estimation. IEEE Transactions on Information Theory, 53(9):3171-3187, 2007.

[WW83] E. J. Wegman and I. W. Wright. Splines in statistics. Journal of the American Statistical Association, 78(382):351-365, 1983.
A Concentration
In this section we prove Lemma 4.6, restated here for convenience:
Lemma 4.6 (Regularity of good samples). If $U$ is a set of $\widetilde{\Omega}\big(\log(1/\delta)(\ell^2/\epsilon^2)\cdot\log^2(n)\big)$ independent samples from $\mathrm{Mul}_k(\mu_1), ..., \mathrm{Mul}_k(\mu_{|U|})$, then $U$ is $\epsilon$-good with probability at least $1 - \delta$.

A.1 Technical Ingredients
The key technical fact we use to get sample complexity that depends quadratically on $\ell$ is:
Lemma A.1.
For every $0 < \eta \le 1$, there exists a net $\mathcal{N} \subset \mathbb{R}^{n\times n}$ of size $O(n^2\ell\log n/\eta)^{(\ell\log n + 1)}$ of matrices such that for every $\Sigma \in \mathcal{K}$, there exists some $\widetilde{\Sigma} = \sum_\nu \alpha_\nu \Sigma^*_\nu$ for $\Sigma^*_\nu \in \mathcal{N}$ such that the following holds: 1) $\|\Sigma - \widetilde{\Sigma}\|_F \le \eta$, 2) $\sum_\nu \alpha_\nu \le 1$, and 3) $\|\Sigma^*_\nu\|_{\max} \le O(1)$.

Note that this is a strengthening of a special case of Lemma 6.9 from [CLM19]. We defer the proof of Lemma A.1 to Appendix B.

For $\epsilon$-goodness to hold, it will be crucial to establish the following sub-exponential tail bounds for the empirical covariance of a set of samples $X_1, \cdots, X_N \sim \mathrm{Mul}_k(\mu)$, as well as for $\|\hat{\mu} - \mu\|_{\mathcal{K}}$, where $\hat{\mu}$ is the empirical mean of those samples.

Lemma A.2.
Let $t > 0$ and let $\mathcal{N} \subset \mathbb{R}^{n\times n}$ be any finite set for which $\|\Sigma\|_{\max} \le O(1)$ for all $\Sigma \in \mathcal{N}$. Let $\mu_1, ..., \mu_N, \mu \in \Delta^n$ satisfy $\mu \triangleq \frac{1}{N}\sum_{i=1}^N \mu_i$. Then for $X_i \sim \mathrm{Mul}_k(\mu_i)$ for $i \in [N]$,
$$\Pr\Bigg[\bigg|\bigg\langle \frac{1}{N}\sum_{i=1}^N \Big((X_i - \mu_i)^{\otimes 2} - \mathbb{E}_{X \sim \mathrm{Mul}_k(\mu_i)}\big[(X - \mu_i)^{\otimes 2}\big]\Big), \Sigma\bigg\rangle\bigg| > t \ \text{for some } \Sigma \in \mathcal{N}\Bigg] < |\mathcal{N}|\exp\bigg(-\Omega\bigg(\frac{Nk^2t^2}{1 + kt}\bigg)\bigg),$$
where the probability is over the samples $X_1, \cdots, X_N$.

Lemma A.3. Let $t > 0$ and let $\mathcal{N} \subset \mathbb{R}^{n\times n}$ be any finite set for which $\|\Sigma\|_{\max} \le O(1)$ for all $\Sigma \in \mathcal{N}$. For $X_i \sim \mathrm{Mul}_k(\mu_i)$ for $i \in [N]$, $\hat{\mu} \triangleq \frac{1}{N}\sum_{i=1}^N X_i$, and $\mu \triangleq \frac{1}{N}\sum_{i=1}^N \mu_i$,
$$\Pr\Big[\big|\langle(\hat{\mu} - \mu)^{\otimes 2}, \Sigma\rangle - \mathbb{E}\big[\langle(\hat{\mu} - \mu)^{\otimes 2}, \Sigma\rangle\big]\big| > t \ \text{for some } \Sigma \in \mathcal{N}\Big] < |\mathcal{N}|\exp\bigg(-\Omega\bigg(\frac{N^2k^2t^2}{1 + Nkt}\bigg)\bigg),$$
where the probability is over the samples $X_1, \cdots, X_N$.

Lemma A.4.
Let $t > 0$ and let $\mathcal{N} \subset \mathbb{R}^{n\times n}$ be any finite set for which $\|\Sigma\|_{\max} \le O(1)$ for all $\Sigma \in \mathcal{N}$. Let $\mu_1, ..., \mu_N, \mu \in \Delta^n$ satisfy $\|\mu_i - \mu\|_1 \le \omega$ for all $i \in [N]$. For $X_i \sim \mathrm{Mul}_k(\mu_i)$ for $i \in [N]$,
$$\Pr\Bigg[\bigg|\frac{1}{N}\sum_{i=1}^N (\mu_i - \mu)^\top\Sigma(X_i - \mu_i)\bigg| > \omega\cdot t \ \text{for some } \Sigma \in \mathcal{N}\Bigg] < |\mathcal{N}|\exp\big(-\Omega\big(kNt^2\big)\big),$$
where the probability is over the samples $X_1, \cdots, X_N$.

Note that if $\mathcal{N}$ consisted solely of matrices of the form $vv^\top$ for $v \in \{\pm 1\}^n$, these lemmas would follow straightforwardly from standard binomial tail bounds. Instead, we only have entrywise bounds for the matrices in $\mathcal{N}$ and will therefore need to compute moment estimates from scratch in order to prove Lemmas A.2 and A.3. We defer the details of this to Appendix C.

Lastly, we will need the following elementary consequence of Stirling's formula:

Fact A.5.
For any $m \ge 1$, $\log\binom{m}{\epsilon m} \le 2m\cdot\epsilon\log(1/\epsilon)$.
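Fact A.5 is easy to sanity-check numerically; the following lines (our own, using the exact log-Gamma function) compare both sides over a small grid of $m$ and $\epsilon$.

```python
from math import lgamma, log

def log_binom(m, j):
    # log of the binomial coefficient C(m, j) via the log-Gamma function
    return lgamma(m + 1) - lgamma(j + 1) - lgamma(m - j + 1)

for m in (100, 1000, 10000):
    for eps in (0.01, 0.05, 0.1, 0.2):
        lhs = log_binom(m, int(eps * m))
        rhs = 2 * m * eps * log(1 / eps)
        assert lhs <= rhs, (m, eps, lhs, rhs)
```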
A.2 Proof of Lemma 4.6

We are now ready to prove that the four conditions for $\epsilon$-goodness hold for a set $U$ of independent draws from $\mathrm{Mul}_k(\mu_1), ..., \mathrm{Mul}_k(\mu_{|U|})$ respectively, of size
$$|U| = \widetilde{\Omega}\big(\log(1/\delta)(\ell^2/\epsilon^2)\cdot\log^2(n)\big). \tag{26}$$

Proof of Lemma 4.6. As $\|\cdot\|_{\mathcal{K}}$ is defined as a supremum over $\mathcal{K}$, we will reduce controlling the infinitely many directions in $\mathcal{K}$ to controlling a finite net of such directions by invoking Lemma A.1. Specifically, recall that for any $\Sigma \in \mathcal{K}$, by Lemma A.1, there is some $\widetilde{\Sigma} = \sum_\nu \alpha_\nu \Sigma^*_\nu$ such that $\Sigma^*_\nu \in \mathcal{N}$ and $\|\Sigma - \widetilde{\Sigma}\|_F \le \eta$.

(Condition (I)) By Lemma A.3, with probability at least $1 - |\mathcal{N}|\exp\big(-\Omega\big(\frac{N^2k^2t^2}{1 + Nkt}\big)\big)$, we have that for all $\Sigma \in \mathcal{K}$,
$$\big\langle(\mu(U) - \mu)^{\otimes 2}, \Sigma\big\rangle \le \big\langle(\mu(U) - \mu)^{\otimes 2}, \widetilde{\Sigma}\big\rangle + \|\mu(U) - \mu\|_2^2\cdot\|\Sigma - \widetilde{\Sigma}\|_F \le \big\langle(\mu(U) - \mu)^{\otimes 2}, \widetilde{\Sigma}\big\rangle + 2\eta$$
$$= \sum_\nu \alpha_\nu\big\langle(\mu(U) - \mu)^{\otimes 2}, \Sigma^*_\nu\big\rangle + 2\eta \le \sum_\nu \alpha_\nu\Big(\mathbb{E}\big[\big\langle(\mu(U) - \mu)^{\otimes 2}, \Sigma^*_\nu\big\rangle\big] + t\Big) + 2\eta \le O\big(1/(k|U|)\big) + t + 2\eta, \tag{27}$$
where the first step follows by Cauchy-Schwarz and the triangle inequality, the second step follows by the trivial bound $\|\mu(U) - \mu\|_2^2 \le 2$ and the bound on $\|\Sigma - \widetilde{\Sigma}\|_F$ guaranteed by Lemma A.1, the fourth step holds with the claimed probability by Lemma A.3 and the fact that $\|\Sigma^*_\nu\|_{\max} \le O(1)$ for all $\nu$ by the guarantees of Lemma A.1, and the last step follows by the bound on $\sum_\nu \alpha_\nu$ from the guarantees of Lemma A.1, as well as the moment bound in Lemma C.2 applied to $r = 1$. If $|U|$ satisfies (26) and $\eta, t = O\big(\frac{\epsilon^2}{k}\log(1/\epsilon)\big)$, the first part of Condition (I) holds.

For the second part, by the steps leading to (27), a union bound over the $\binom{|U|}{\epsilon|U|}$ subsets $W$, and Fact A.5, with probability at least
$$1 - \exp(2\epsilon|U|\log(1/\epsilon))\cdot|\mathcal{N}|\exp\bigg(-\Omega\bigg(\frac{\epsilon^2|U|^2k^2t^2}{1 + \epsilon|U|kt}\bigg)\bigg)$$
we have that $\|\mu(W) - \mu_W\|_{\mathcal{K}}^2 \le O\big(\frac{1}{\epsilon k|U|}\big) + t + 2\eta$ for all $W$. Note that $2\epsilon|U|\log(1/\epsilon) \le O\big(\frac{\epsilon^2|U|^2k^2t^2}{1 + \epsilon|U|kt}\big)$ provided $t = \Omega\big(\frac{\log(1/\epsilon)}{k}\big)$, so if $|U|$ satisfies (26) and $\eta = O\big(\frac{\log(1/\epsilon)}{k}\big)$, the second part of Condition (I) holds.

(Condition (II)) For the first part, let $\hat{M} \triangleq M(\hat{w}(U), \{\mu_i\}_{i\in U})$.
By Lemma A.2, with probability at least $1 - |\mathcal{N}|\exp\big(-\Omega\big(\frac{|U|k^2t^2}{1 + kt}\big)\big)$, we have that for all $\Sigma \in \mathcal{K}$,
$$\langle \hat{M}, \Sigma\rangle \le \langle \hat{M}, \widetilde{\Sigma}\rangle + \|\hat{M}\|_F\cdot\|\Sigma - \widetilde{\Sigma}\|_F \le \langle \hat{M}, \widetilde{\Sigma}\rangle + 3\eta = \sum_\nu \alpha_\nu\langle \hat{M}, \Sigma^*_\nu\rangle + 3\eta \le \sum_\nu \alpha_\nu \cdot t + 3\eta \le t + 3\eta, \tag{28}$$
where the first step follows by Cauchy-Schwarz and the triangle inequality, the second step follows by Lemma 2.4 and the bound on $\|\Sigma - \widetilde{\Sigma}\|_F$ guaranteed by Lemma A.1, the fourth step holds with the claimed probability by Lemma A.2 and the fact that $\|\Sigma^*_\nu\|_{\max} \le O(1)$ for all $\nu$ by the guarantees of Lemma A.1, and the last step follows by the bound on $\sum_\nu \alpha_\nu$ from the guarantees of Lemma A.1. If $|U|$ satisfies (26) and $\eta, t = O\big(\frac{\epsilon}{k}\log(1/\epsilon)\big)$, the first part of Condition (II) holds.

For the second part, first note that it is slightly different from the first part because we do not subtract out $B(\mu)$, the reason being that $\|B(\mu)\|_{\mathcal{K}} \le O(1/k) = o\big(\frac{\log(1/\epsilon)}{k}\big)$, so this term is negligible. By the steps leading to (28), a union bound over the $\binom{|U|}{\epsilon|U|}$ subsets $W$, and Fact A.5, with probability at least
$$1 - |\mathcal{N}|\exp(2\epsilon|U|\log(1/\epsilon))\cdot\exp\bigg(-\Omega\bigg(\frac{\epsilon|U|k^2t^2}{1 + kt}\bigg)\bigg),$$
we have that $\|M(\hat{w}(W), \{\mu_i\}_{i\in W})\|_{\mathcal{K}} \le t + 3\eta$ for all $W$. Note that $2\log(1/\epsilon) \le O\big(\frac{k^2t^2}{1 + kt}\big)$ provided $t = \Omega\big(\frac{\log(1/\epsilon)}{k}\big)$, so if $|U|$ satisfies (26) and $\eta = O\big(\frac{\log(1/\epsilon)}{k}\big)$, the second part of Condition (II) holds.

(Condition (III)) First note that
$$B(\{\mu_i\}) - B(\mu) = \frac{1}{|U|}\sum_{i\in U}\frac{1}{k}\big(\mathrm{diag}(\mu_i - \mu) - (\mu_i^{\otimes 2} - \mu^{\otimes 2})\big) = -\frac{1}{|U|}\sum_{i\in U}\frac{1}{k}(\mu_i^{\otimes 2} - \mu^{\otimes 2}),$$
where the diagonal terms cancel because $\mu = \frac{1}{|U|}\sum_{i\in U}\mu_i$. Also note that
$$\bigg\langle\Sigma, \frac{1}{|U|}\sum_{i\in U}(\mu_i^{\otimes 2} - \mu^{\otimes 2})\bigg\rangle = \frac{1}{|U|}\sum_{i\in U}\big\langle(\mu_i - \mu)^{\otimes 2}, \Sigma\big\rangle \le \max_i \|\mu_i - \mu\|_1^2 \le \omega^2,$$
so $\|B(\{\mu_i\}) - B(\mu)\|_{\mathcal{K}} \le \omega^2/k$.

It remains to bound $\|B(\hat{\mu}(U)) - B(\mu)\|_{\mathcal{K}}$. As we only need to show extremely mild concentration here, we will not make an effort to obtain tight bounds. Note that by (2),
$$|\langle\Sigma, B(\hat{\mu}(U)) - B(\mu)\rangle| \le \frac{1}{k}|\langle\mathrm{diag}(\hat{\mu}(U) - \mu), \Sigma\rangle| + \frac{1}{k}\big|\langle\hat{\mu}(U)^{\otimes 2} - \mu^{\otimes 2}, \Sigma\rangle\big|. \tag{29}$$
We have
$$\langle\mathrm{diag}(\hat{\mu}(U) - \mu), \Sigma\rangle \le \sum_\nu \alpha_\nu\langle\mathrm{diag}(\hat{\mu}(U) - \mu), \Sigma^*_\nu\rangle + \|\Sigma - \widetilde{\Sigma}\|_F\cdot\|\hat{\mu}(U) - \mu\|_2 \le \sum_\nu \alpha_\nu\langle\hat{\mu}(U) - \mu, \mathrm{diag}(\Sigma^*_\nu)\rangle + O(\eta). \tag{30}$$
Note that for any $\nu$, $\langle\hat{\mu}(U) - \mu, \mathrm{diag}(\Sigma^*_\nu)\rangle = \frac{1}{|U|}\sum_{i\in U} Z^\nu_i$ for $Z^\nu_i \triangleq \langle X_i - \mu_i, \mathrm{diag}(\Sigma^*_\nu)\rangle$. These are independent, mean-zero, $O(1)$-bounded random variables, so by Hoeffding's inequality, for any fixed $\nu$ we have that $|\langle\hat{\mu}(U) - \mu, \mathrm{diag}(\Sigma^*_\nu)\rangle| \le t$ with probability at least $1 - 2\exp(-\Omega(|U|t^2))$.
If we union bound over $\mathcal{N}$, then by taking $\eta, t = O(\epsilon)$ and $|U|$ satisfying (26), (30) will be at most $O(\epsilon)$. We also have that
$$\big|\langle\hat{\mu}(U)^{\otimes 2} - \mu^{\otimes 2}, \Sigma\rangle\big| = \Big|\langle(\hat{\mu}(U) - \mu)^{\otimes 2}, \Sigma\rangle - 2\mu^\top\Sigma(\hat{\mu}(U) - \mu)\Big| \le O\bigg(\frac{\epsilon^2\log(1/\epsilon)}{k}\bigg) + 2\Big|\mu^\top\Sigma(\hat{\mu}(U) - \mu)\Big|, \tag{31}$$
where the second step follows by the first part of this lemma. For the other term, we have
$$\mu^\top\Sigma(\hat{\mu}(U) - \mu) \le \sum_\nu \alpha_\nu\,\mu^\top\Sigma^*_\nu(\hat{\mu}(U) - \mu) + \|\Sigma - \widetilde{\Sigma}\|_F\cdot\|\mu\|_2\cdot\|\hat{\mu}(U) - \mu\|_2 \le \sum_\nu \alpha_\nu\,\mu^\top\Sigma^*_\nu(\hat{\mu}(U) - \mu) + O(\eta). \tag{32}$$
For any $\nu$, $\mu^\top\Sigma^*_\nu(\hat{\mu}(U) - \mu) = \frac{1}{|U|}\sum_{i\in U} W^\nu_i$ for $W^\nu_i \triangleq \mu^\top\Sigma^*_\nu(X_i - \mu_i)$. These are independent, mean-zero, $O(1)$-bounded random variables, so by Hoeffding's inequality, for any fixed $\nu$, we have that $|\mu^\top\Sigma^*_\nu(\hat{\mu}(U) - \mu)| \le t$ with probability at least $1 - 2\exp(-\Omega(|U|t^2))$. If we union bound over $\mathcal{N}$, then by taking $\eta, t = O(\epsilon)$ and $|U|$ satisfying (26) again, (32) and thus (31) will be at most $O(\epsilon)$. By (29), we thus conclude that $\|B(\hat{\mu}(U)) - B(\mu)\|_{\mathcal{K}} \le O(\epsilon/k)$ as claimed.

(Condition (IV)) By Lemma A.4, with probability at least $1 - |\mathcal{N}|\exp\big(-\Omega\big(k|U|t^2\big)\big)$, we have that for all $\Sigma \in \mathcal{K}$,
$$\frac{1}{|U|}\sum_{i\in U}(\mu_i - \mu)^\top\Sigma(X_i - \mu_i) \le \frac{1}{|U|}\sum_{i\in U}(\mu_i - \mu)^\top\widetilde{\Sigma}(X_i - \mu_i) + \frac{1}{|U|}\sum_{i\in U}\|\Sigma - \widetilde{\Sigma}\|_F\cdot\|\mu_i - \mu\|_1\cdot\|X_i - \mu_i\|_2$$
$$\le \sum_\nu \alpha_\nu\cdot\frac{1}{|U|}\sum_{i\in U}(\mu_i - \mu)^\top\Sigma^*_\nu(X_i - \mu_i) + 2\omega\cdot\eta \le \sum_\nu \alpha_\nu\cdot\omega\cdot t + 2\omega\cdot\eta \le \omega\cdot t + 2\omega\cdot\eta, \tag{33}$$
where the first step follows by the triangle inequality and Cauchy-Schwarz, the second step follows by the bound on $\|\Sigma - \widetilde{\Sigma}\|_F$ guaranteed by Lemma A.1 and the assumption that $\|\mu_i - \mu\|_1 \le \omega$, the third step holds with the claimed probability by Lemma A.4 and the fact that $\|\Sigma^*_\nu\|_{\max} \le O(1)$ for all $\nu$ by Lemma A.1, and the last step follows by the bound on $\sum_\nu \alpha_\nu$ from the guarantees of Lemma A.1. If $|U|$ satisfies (26) and $\eta, t = O\big(\frac{\epsilon\sqrt{\log(1/\epsilon)}}{\sqrt{k}}\big)$, the first part of Condition (IV) holds.

For the second part, by the steps leading to (33), a union bound over $W$, and Fact A.5, with probability at least $1 - |\mathcal{N}|\exp(2\epsilon|U|\log(1/\epsilon))\cdot\exp\big(-\Omega\big(\epsilon k|U|t^2\big)\big)$, we have that $\frac{1}{|W|}\sum_{i\in W}(\mu_i - \mu)^\top\Sigma(X_i - \mu_i) \le \omega\cdot t + 2\omega\cdot\eta$ for all $W$. Note that $2\log(1/\epsilon) \le O(kt^2)$ provided $t = \Omega\big(\frac{\sqrt{\log(1/\epsilon)}}{\sqrt{k}}\big)$, so if $|U|$ satisfies (26) and $\eta = O\big(\frac{\sqrt{\log(1/\epsilon)}}{\sqrt{k}}\big)$, the second part of Condition (IV) holds.

B Netting Over $\mathcal{K}$

In this section we prove Lemma A.1, restated here for convenience:
Lemma A.1.
For every $0 < \eta \le 1$, there exists a net $\mathcal{N} \subset \mathbb{R}^{n\times n}$ of size $O(n^2\ell\log n/\eta)^{(\ell\log n + 1)}$ of matrices such that for every $\Sigma \in \mathcal{K}$, there exists some $\widetilde{\Sigma} = \sum_\nu \alpha_\nu \Sigma^*_\nu$ for $\Sigma^*_\nu \in \mathcal{N}$ such that the following holds: 1) $\|\Sigma - \widetilde{\Sigma}\|_F \le \eta$, 2) $\sum_\nu \alpha_\nu \le 1$, and 3) $\|\Sigma^*_\nu\|_{\max} \le O(1)$.

As alluded to in Remark 3.2 and Appendix A, we will use the extra Constraints 3 and 4 in the definition of $\mathcal{K}$ to tighten the proof of Lemma 6.9 from [CLM19] to obtain Lemma A.1 above. The following well-known trick will be useful.

Lemma B.1 ("Shelling"). If $v \in \mathbb{R}^m$ satisfies $\|v\|_2 \le C$ and $\|v\|_1 \le C\sqrt{k}$, then there exist $k$-sparse vectors $v[1], ..., v[m/k]$ with disjoint supports for which 1) $v = \sum_{i=1}^{m/k} v[i]$, 2) $\sum_{i=1}^{m/k}\|v[i]\|_2 \le 2C$, and 3) $\sum_{i=1}^{m/k}\|v[i]\|_\infty \le \frac{1}{k}\|v\|_1 + \|v\|_\infty$.

Proof. Assume without loss of generality that $C = 1$. Letting $B_1 \subset [m]$ be the indices of the $k$ largest entries of $v$ in absolute value, $B_2$ those of the next $k$ largest, etc., we can write $[m] = B_1 \sqcup \cdots \sqcup B_{m/k}$. For $i \in [m/k]$, define $v[i] \in \mathbb{R}^m$ to be the restriction of $v$ to the coordinates indexed by $B_i$. For any $i$ and $j \in B_i$, $|v_j| \le \frac{1}{k}\|v[i-1]\|_1$. This immediately implies that
$$\sum_{i=1}^{m/k}\|v[i]\|_\infty \le \|v\|_\infty + \frac{1}{k}\sum_{i=1}^{m/k}\|v[i]\|_1,$$
yielding 3) above. Likewise, it implies that
$$\|v[i]\|_2^2 = \sum_{j\in B_i} v_j^2 \le k\cdot\frac{1}{k^2}\cdot\|v[i-1]\|_1^2 = \frac{1}{k}\|v[i-1]\|_1^2.$$
So $\|v[i]\|_2 \le \|v[i-1]\|_1/\sqrt{k}$ and thus
$$\sum_{i=1}^{m/k}\|v[i]\|_2 \le \|v[1]\|_2 + \frac{1}{\sqrt{k}}\|v\|_1 \le 2,$$
giving 2) above.

By rescaling the entries of $v$ in Lemma B.1, we immediately get the following extension to Haar-weighted norms:

Corollary B.2. If $v \in \mathbb{R}^m$ satisfies $\|v\|_{2;h} \le C$ and $\|v\|_{1;h} \le C\sqrt{k}$, then there exist $k$-sparse vectors $v[1], ..., v[m/k]$ with disjoint supports for which 1) $v = \sum_{i=1}^{m/k} v[i]$, 2) $\sum_{i=1}^{m/k}\|v[i]\|_{2;h} \le 2C$, and 3) $\sum_{i=1}^{m/k}\|v[i]\|_{\infty;h} \le \frac{1}{k}\|v\|_{1;h} + \|v\|_{\infty;h}$.

We remark that whereas in [CLM19], shelling was applied to the unweighted $L^1$ and $L^2$ norms, and the only $L^2$ information used about $v \in V^n_\ell$ was that $\|v\|_2^2 = n$, in the sequel we will shell under the Haar-weighted norms and use the refined bounds on the Haar-weighted norms given by Constraints 3 and 4 from Definition 3.1. This will be crucial to getting a net of size exponential in $\ell$ rather than just poly($\ell$).
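The shelling decomposition is constructive and easy to implement; the sketch below (ours) carries it out for the unweighted case of Lemma B.1, with assertions spot-checking guarantees 1)-3) (the Haar-weighted version of Corollary B.2 is identical after rescaling the entries).

```python
import numpy as np

def shell(v, k):
    # Split v into k-sparse, disjointly supported blocks: the k largest
    # entries in absolute value, then the next k, and so on (Lemma B.1).
    order = np.argsort(-np.abs(v))
    blocks = []
    for start in range(0, len(v), k):
        block = np.zeros_like(v)
        idx = order[start:start + k]
        block[idx] = v[idx]
        blocks.append(block)
    return blocks

rng = np.random.default_rng(1)
v = rng.standard_normal(1024)
k = 32
C = max(np.linalg.norm(v), np.linalg.norm(v, 1) / np.sqrt(k))
blocks = shell(v, k)
assert np.allclose(sum(blocks), v)                                  # 1)
assert sum(np.linalg.norm(b) for b in blocks) <= 2 * C + 1e-9       # 2)
assert (sum(np.abs(b).max() for b in blocks)
        <= np.abs(v).sum() / k + np.abs(v).max() + 1e-9)            # 3)
```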
We now complete the proof of Lemma A.1.

Proof of Lemma A.1. Let $s = \ell\log n + 1$, and let $m = \log n$. Let $\mathcal{N}'$ be an $O\big(\frac{\eta}{n\sqrt{s}}\big)$-net in Frobenius norm for all $s$-sparse $n\times n$ matrices of unit Frobenius norm. Because $\mathbb{S}^{s-1}$ has an $O\big(\frac{\eta}{n\sqrt{s}}\big)$-net in $L^2$ norm of size $O(n\cdot s/\eta)^s$, by a union bound we have that
$$|\mathcal{N}'| \le \binom{n^2}{s}\cdot O(n\cdot s/\eta)^s = O(n^2\ell\log n/\eta)^s.$$
Take any $\Sigma \in \mathcal{K}$ and consider $L \triangleq H\Sigma H^\top$. By Constraints 2, 3, 4 in Definition 3.1,
$$\|L\|_{1;h} \le s, \qquad \|L\|_{F;h} \le \sqrt{s}, \qquad \text{and} \qquad \|L\|_{\max;h} \le 1. \tag{34}$$
We can use the first two of these and apply Corollary B.2 to the $n^2$-dimensional vector $L$ to conclude that $L = \sum_j L_j$ for some matrices $\{L_j\}_j$ of sparsity at most $s$ and for which $\sum_j\|L_j\|_{F;h} \le 2\sqrt{s}$ and $\sum_j\|L_j\|_{\max;h} \le \frac{1}{s}\|L\|_{1;h} + \|L\|_{\max;h}$. By definition of the Haar-weighted Frobenius norm, $\|L_j\|_F \le n\cdot\|L_j\|_{F;h}$, so
$$\sum_j \|L_j\|_F \le O(n\cdot\sqrt{s}).$$
For each $L_j$, there is some $(L')_j \in \mathcal{N}'$ such that for $\widetilde{L}_j \triangleq \|L_j\|_F\cdot(L')_j$,
$$\|L_j - \widetilde{L}_j\|_F \le O\Big(\frac{\eta}{n\cdot\sqrt{s}}\Big)\|L_j\|_F. \tag{35}$$
We conclude that if we define $\widetilde{L} \triangleq \sum_j \widetilde{L}_j$, then $\|L - \widetilde{L}\|_F \le \eta$.

Now let $\mathcal{N} \triangleq H^{-1}\mathcal{N}'(H^{-1})^\top$. As $\Sigma = H^{-1}L(H^{-1})^\top$ and $H^{-1}$ is an isometry, if we define $\widetilde{\Sigma}_j \triangleq H^{-1}\widetilde{L}_j(H^{-1})^\top$ and $\widetilde{\Sigma} \triangleq \sum_j \widetilde{\Sigma}_j$, then we likewise get that $\|\Sigma - \widetilde{\Sigma}\|_F \le \eta$, and clearly each $\widetilde{\Sigma}_j$ is a nonnegative multiple of an element of $\mathcal{N}$, concluding the proof of part 1) of the lemma.

For each $\widetilde{\Sigma}_j$, define $\alpha_j \triangleq \|L_j\|_{\max;h}/2$ and $\Sigma^*_j \triangleq \widetilde{\Sigma}_j/\alpha_j$, so that $\widetilde{\Sigma} = \sum_j \alpha_j\cdot\Sigma^*_j$. Note that by part 3) of Corollary B.2 and (34),
$$\sum_j \alpha_j = \frac{1}{2}\sum_j\|L_j\|_{\max;h} \le \frac{1}{2}\Big(\frac{1}{s}\|L\|_{1;h} + \|L\|_{\max;h}\Big) \le 1, \tag{36}$$
where we used that $\|L\|_{1;h} \le s$ and $\|L\|_{\max;h} \le 1$. This concludes the proof of part 2) of the lemma.

Finally, we need to bound $\|\Sigma^*_j\|_{\max}$. Note first that for any matrix $J$ supported only on a submatrix consisting of entries of $L$ from the rows $i$ (resp. columns $j$) for which $i \in T_\sigma$ (resp. $j \in T_\tau$), we have that
$$\|H^{-1}J(H^{-1})^\top\|_{\max} = 2^{-(m-\sigma)/2}\cdot 2^{-(m-\tau)/2}\cdot\|J\|_{\max} = \frac{2^{(\sigma+\tau)/2}}{n}\|J\|_{\max},$$
because the Haar wavelets $\{\psi_{\sigma,j}\}_j$ (resp. $\{\psi_{\tau,j}\}_j$) have disjoint supports and $L^\infty$ norm $2^{-(m-\sigma)/2}$ (resp. $2^{-(m-\tau)/2}$). For general $J$, by decomposing $J$ into such submatrices, call them $J[\sigma,\tau]$, we get by the triangle inequality that
$$\|H^{-1}J(H^{-1})^\top\|_{\max} \le \sum_{\sigma,\tau}\frac{2^{(\sigma+\tau)/2}}{n}\|J[\sigma,\tau]\|_{\max} \le 4\|J\|_{\max}. \tag{37}$$
By applying this to $J = \widetilde{L}_j$, we get
$$\|\widetilde{\Sigma}_j\|_{\max} \le 4\Big(\|L_j\|_{\max} + \|L_j - \widetilde{L}_j\|_{\max}\Big) \le 4\Big(\|L_j\|_{\max} + \|L_j - \widetilde{L}_j\|_F\Big) \le 4\Big(\|L_j\|_{\max} + O\Big(\frac{\eta}{n\sqrt{s}}\Big)\|L_j\|_F\Big) \le 4\|L_j\|_{\max}\cdot(1 + O(\eta/n)) \le 8\|L_j\|_{\max},$$
where the first inequality is the triangle inequality together with (37), the second inequality follows from monotonicity of $L^p$ norms, the third inequality follows from (35), and the last two follow from the fact that $L_j$ is $s$-sparse. Recalling (36) and the definition of $\Sigma^*_j$, we conclude that $\|\Sigma^*_j\|_{\max} \le O(1)$ as claimed.

C Sub-Exponential Tail Bounds From Section A
In this section, we provide proofs for Lemmas A.2, A.3, and A.4, restated here for convenience.
Lemma A.2.
Let $t > 0$ and let $\mathcal{N} \subset \mathbb{R}^{n\times n}$ be any finite set for which $\|\Sigma\|_{\max} \le O(1)$ for all $\Sigma \in \mathcal{N}$. Let $\mu_1, ..., \mu_N, \mu \in \Delta^n$ satisfy $\mu \triangleq \frac{1}{N}\sum_{i=1}^N \mu_i$. Then for $X_i \sim \mathrm{Mul}_k(\mu_i)$ for $i \in [N]$,
$$\Pr\Bigg[\bigg|\bigg\langle \frac{1}{N}\sum_{i=1}^N \Big((X_i - \mu_i)^{\otimes 2} - \mathbb{E}_{X \sim \mathrm{Mul}_k(\mu_i)}\big[(X - \mu_i)^{\otimes 2}\big]\Big), \Sigma\bigg\rangle\bigg| > t \ \text{for some } \Sigma \in \mathcal{N}\Bigg] < |\mathcal{N}|\exp\bigg(-\Omega\bigg(\frac{Nk^2t^2}{1 + kt}\bigg)\bigg),$$
where the probability is over the samples $X_1, \cdots, X_N$.

Lemma A.3.
Let $t > 0$ and let $\mathcal{N} \subset \mathbb{R}^{n\times n}$ be any finite set for which $\|\Sigma\|_{\max} \le O(1)$ for all $\Sigma \in \mathcal{N}$. For $X_i \sim \mathrm{Mul}_k(\mu_i)$ for $i \in [N]$, $\hat{\mu} \triangleq \frac{1}{N}\sum_{i=1}^N X_i$, and $\mu \triangleq \frac{1}{N}\sum_{i=1}^N \mu_i$,
$$\Pr\Big[\big|\langle(\hat{\mu} - \mu)^{\otimes 2}, \Sigma\rangle - \mathbb{E}\big[\langle(\hat{\mu} - \mu)^{\otimes 2}, \Sigma\rangle\big]\big| > t \ \text{for some } \Sigma \in \mathcal{N}\Big] < |\mathcal{N}|\exp\bigg(-\Omega\bigg(\frac{N^2k^2t^2}{1 + Nkt}\bigg)\bigg),$$
where the probability is over the samples $X_1, \cdots, X_N$.

Lemma A.4. Let $t > 0$ and let $\mathcal{N} \subset \mathbb{R}^{n\times n}$ be any finite set for which $\|\Sigma\|_{\max} \le O(1)$ for all $\Sigma \in \mathcal{N}$. Let $\mu_1, ..., \mu_N, \mu \in \Delta^n$ satisfy $\|\mu_i - \mu\|_1 \le \omega$ for all $i \in [N]$. For $X_i \sim \mathrm{Mul}_k(\mu_i)$ for $i \in [N]$,
$$\Pr\Bigg[\bigg|\frac{1}{N}\sum_{i=1}^N (\mu_i - \mu)^\top\Sigma(X_i - \mu_i)\bigg| > \omega\cdot t \ \text{for some } \Sigma \in \mathcal{N}\Bigg] < |\mathcal{N}|\exp\big(-\Omega\big(kNt^2\big)\big),$$
where the probability is over the samples $X_1, \cdots, X_N$.

We remark that if we restricted our attention to test matrices of the form $\Sigma = vv^\top$ for $v \in \{\pm 1\}^n$, these lemmas would follow straightforwardly from Bernstein's inequality and the sub-Gaussianity of binomial distributions.

We will need the following well-known combinatorial fact, a proof of which we include for completeness in Section C.1.

Fact C.1.
For any $m, r \in \mathbb{N}$, there are at most $O(m)^r\cdot r!$ tuples $(i_1, ..., i_{2r}) \in [m]^{2r}$ for which every element of $[m]$ occurs an even (possibly zero) number of times.

Central to the proofs of Lemmas A.2 and A.3 is the following sub-exponential moment bound. We remark that this moment bound would be an immediate consequence of McDiarmid's inequality if $\Sigma$ not only satisfied $\|\Sigma\|_{\max} \le O(1)$ but were also psd, but because the matrices arising from shelling need not be psd, it turns out to be unavoidable that we must prove this moment bound from scratch.

In this section, given $\mu \in \Delta^n$, let $\mathcal{D}_\mu$ denote the distribution over standard basis vectors $\{e_i\}$ of $\mathbb{R}^n$ where for any $i \in [n]$, $e_i$ has probability mass equal to the $i$-th entry of $\mu$.

Lemma C.2.
Let $\Sigma \in \mathbb{R}^{n\times n}$ have entries bounded in absolute value by $O(1)$, and for $\mu_1, ..., \mu_m, \mu \in \Delta^n$, let $\mu \triangleq \frac{1}{m}\sum_{i=1}^m \mu_i$. If $Y_1, ..., Y_m$ are independent draws from $\mathcal{D}_{\mu_1}, \ldots, \mathcal{D}_{\mu_m}$ respectively, and $\hat{\mu} \triangleq \frac{1}{m}\sum_{i=1}^m Y_i$, then for every $r \ge 1$,
$$\mathbb{E}\Big[\big((\hat{\mu} - \mu)^\top\Sigma(\hat{\mu} - \mu)\big)^r\Big] \le \Omega(m)^{-r}\cdot r!.$$

Proof. Without loss of generality, suppose $\Sigma$ has entries bounded in absolute value by 1. For $i, i' \in [m]$, define $Z_{i,i'} \triangleq (Y_i - \mu_i)^\top\Sigma(Y_{i'} - \mu_{i'})$. Note that because $\|Y_i - \mu_i\|_1 \le 2$ with probability 1 for all $i \in [m]$, and the entries of $\Sigma$ are bounded in absolute value by 1, $|Z_{i,i'}| \le 4$ with probability 1 for all $i, i' \in [m]$. We can write $\mathbb{E}\big[\big((\hat{\mu} - \mu)^\top\Sigma(\hat{\mu} - \mu)\big)^r\big]$ as
$$\frac{1}{m^{2r}}\,\mathbb{E}\Bigg[\bigg(\sum_{i,i'\in[m]} Z_{i,i'}\bigg)^r\Bigg] = \frac{1}{m^{2r}}\sum_{(i_1,i'_1),...,(i_r,i'_r)}\mathbb{E}\Bigg[\prod_{j=1}^r Z_{i_j,i'_j}\Bigg]. \tag{38}$$
Now note that if there exists some index $i \in [m]$ which occurs an odd number of times among $i_1, i'_1, ..., i_r, i'_r$, then by the fact that the tensor $\mathbb{E}\big[(Y_i - \mu_i)^{\otimes a}\big]$ is identically zero for odd $a$, we have that $\mathbb{E}\big[\prod_{j=1}^r Z_{i_j,i'_j}\big] = 0$. So the nonzero summands on the right-hand side of (38) correspond to indices $\{(i_j, i'_j)\}_{j\in[r]}$ which must satisfy that every index appearing among $i_1, i'_1, ..., i_r, i'_r$ appears an even number of times. By Fact C.1, there are $O(m)^r\cdot r!$ such tuples.

Finally, by the fact that $|Z_{i,i'}| \le 4$ with probability 1 for all $i, i' \in [m]$, each monomial $\mathbb{E}\big[\prod_{j=1}^r Z_{i_j,i'_j}\big]$ is upper bounded by $4^r$. We conclude that
$$\mathbb{E}\Big[\big((\hat{\mu} - \mu)^\top\Sigma(\hat{\mu} - \mu)\big)^r\Big] \le \frac{1}{m^{2r}}\cdot O(m)^r\cdot r!\cdot 4^r,$$
from which the claim follows.

Similarly, a crucial ingredient to the proof of Lemma A.4 is the following moment bound.

Lemma C.3.
Let $\Sigma \in \mathbb{R}^{n\times n}$ have entries bounded in absolute value by $O(1)$, and suppose $\mu_1, ..., \mu_m, \mu \in \Delta^n$ satisfy $\|\mu_i - \mu\|_1 \le \omega$ for all $i \in [m]$. Then for every $r \in \mathbb{N}$, $\mathbb{E}\big[\big(\frac{1}{m}\sum_{i=1}^m(\mu_i - \mu)^\top\Sigma(Y_i - \mu_i)\big)^r\big]$ is $0$ if $r$ is odd and at most $O(r\omega^2/m)^{r/2}$ otherwise.

Proof. It is clear that the $r$-th moment is zero when $r$ is odd. Henceforth, write $r$ as $2r'$. Without loss of generality, suppose $\Sigma$ has entries bounded in absolute value by 1. For $i \in [m]$, define $Z_i \triangleq (\mu_i - \mu)^\top\Sigma(Y_i - \mu_i)$. Note that because $\|Y_i - \mu_i\|_1 \le 2$ with probability 1 for all $i \in [m]$, and the entries of $\Sigma$ are bounded in absolute value by 1, $|Z_i| \le 2\omega$ with probability 1 for all $i \in [m]$. We can write $\mathbb{E}\big[\big(\frac{1}{m}\sum_{i\in[m]} Z_i\big)^{2r'}\big]$ as
$$\frac{1}{m^{2r'}}\,\mathbb{E}\Bigg[\bigg(\sum_{i\in[m]} Z_i\bigg)^{2r'}\Bigg] = \frac{1}{m^{2r'}}\sum_{i_1,...,i_{2r'}}\mathbb{E}\Bigg[\prod_{j=1}^{2r'} Z_{i_j}\Bigg].$$
As in the proof of Lemma C.2, the only nonzero summands correspond to tuples $(i_1, ..., i_{2r'})$ such that every element of $[m]$ appears an even (possibly zero) number of times. By Fact C.1, there are at most $O(m)^{r'}\cdot r'!$ such tuples, from which we can complete the proof.

Lemmas A.2 and A.3 will now follow as consequences of Lemma C.2 and the following standard tail bound for random variables with sub-exponential moments:

Fact C.4.
Let $Z_1, ..., Z_m$ be random variables for which there exists a constant $\nu > 0$ such that $\mathbb{E}[Z_i^r] \le \nu^r\cdot r!$ for all integers $r \ge 1$ and $i \in [m]$. Then
$$\Pr\Bigg[\bigg|\frac{1}{m}\sum_{i=1}^m Z_i - \mathbb{E}[Z]\bigg| > t\Bigg] \le e^{-\Omega\big(\frac{mt^2}{\nu^2 + \nu t}\big)}.$$

Similarly, Lemma A.4 will follow as a consequence of Lemma C.3 and the following standard tail bound for random variables with sub-Gaussian moments:
Fact C.5.
Let $Z_1, ..., Z_m$ be random variables for which there exists a constant $\nu > 0$ such that $\mathbb{E}[Z_i^r] \le (r\cdot\nu)^{r/2}$ for all integers $r \ge 1$ and $i \in [m]$. Then
$$\Pr\Bigg[\bigg|\frac{1}{m}\sum_{i=1}^m Z_i - \mathbb{E}[Z]\bigg| > t\Bigg] \le e^{-\Omega(mt^2/\nu)}.$$

Proof of Lemma A.2.
This follows by taking $m = k$ in Lemma C.2 and $m = N$ in Fact C.4, and noting that for any $\Sigma \in \mathcal{N}$, $\|\Sigma\|_{\max} \le O(1)$ by Lemma A.1.

Proof of Lemma A.3.
This follows by taking $m = kN$ in Lemma C.2 and $m = 1$ in Fact C.4, and noting that for any $\Sigma \in \mathcal{N}$, $\|\Sigma\|_{\max} \le O(1)$ by Lemma A.1.

Proof of Lemma A.4.
This follows by taking $m = k$ in Lemma C.3 and $m = N$ in Fact C.5, and noting that for any $\Sigma \in \mathcal{N}$, $\|\Sigma\|_{\max} \le O(1)$ by Lemma A.1.

C.1 Proof of Fact C.1
Proof.
To count the number $N^*$ of such tuples $(i_1, ..., i_{2r})$, for every $1 \le s \le r$ let $N_s$ denote the number of tuples $\beta \in \{2, 4, ..., 2r\}^s$ for which $\sum_{i=1}^s \beta_i = 2r$. By balls-and-bins, $N_s \le \binom{2r+s-1}{2r} \le (es/r)^{2r}$. Now note that to enumerate $N^*$, we can 1) choose the number $1 \le s \le \min(m, r)$ of unique indices among $\{i_j\}$, 2) choose a subset $S$ of $[m]$ of size $s$, 3) choose one of the $N_s$ tuples $\beta$, and 4) choose one of the $\binom{2r}{\beta_1,...,\beta_s}$ ways of assigning index $S_1$ to $\beta_1$ indices in $\{i_j\}$, $S_2$ to $\beta_2$ indices, etc. For convenience, let $r' \triangleq \min(m, r)$. We get an upper bound of
$$N^* \le \sum_{s=1}^{\min(m,r)}\binom{m}{s}\cdot N_s\cdot\binom{2r}{\beta_1, ..., \beta_s} \le \sum_{s=1}^{\min(m,r)}\frac{m^s}{s!}\Big(\frac{es}{r}\Big)^{2r}\cdot(2s)! \le \frac{m^{r'}}{(r')!}\cdot r'\cdot\Big(\frac{er'}{r}\Big)^{2r}\cdot(2r')!$$
$$\le \frac{m^r}{r!}\cdot r\cdot(3e/2)^{2r}\cdot(2r)! = m^r\cdot r\cdot(3e/2)^{2r}\cdot\binom{2r}{r}\cdot r! \le O(m)^r\cdot r!,$$
where in the second step we used basic bounds on binomial and multinomial coefficients together with the above bound on $N_s$, in the third step we used the fact that the summands are increasing in $s$, and in the fourth step we used this fact along with the fact that $r' \le r$.
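Fact C.1 can also be checked exhaustively for small parameters. The short script below (ours) counts the tuples directly and prints the count next to $m^r\cdot r!$; the ratio stays bounded, consistent with the constant hidden in the $O(m)^r$ factor.

```python
from itertools import product
from collections import Counter
from math import factorial

def n_star(m, r):
    # count tuples (i_1, ..., i_{2r}) in [m]^{2r} in which every element
    # of [m] occurs an even (possibly zero) number of times
    return sum(
        all(c % 2 == 0 for c in Counter(t).values())
        for t in product(range(m), repeat=2 * r)
    )

for m in (2, 3, 4):
    for r in (1, 2, 3):
        print(m, r, n_star(m, r), m**r * factorial(r))
```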