Statistical Query Algorithms and Low-Degree Tests Are Almost Equivalent
Matthew Brennan∗, Guy Bresler†, Samuel B. Hopkins‡, Jerry Li§, Tselil Schramm¶

November 12, 2020
Abstract
Researchers currently use a number of approaches to predict and substantiate information-computation gaps in high-dimensional statistical estimation problems. A prominent approach is to characterize the limits of restricted models of computation, which on the one hand yields strong computational lower bounds for powerful classes of algorithms and on the other hand helps guide the development of efficient algorithms. In this paper, we study two of the most popular restricted computational models, the statistical query framework and low-degree polynomials, in the context of high-dimensional hypothesis testing. Our main result is that under mild conditions on the testing problem, the two classes of algorithms are essentially equivalent in power. As corollaries, we obtain new statistical query lower bounds for sparse PCA, tensor PCA and several variants of the planted clique problem.

∗ MIT, [email protected]. Supported by MIT-IBM Watson AI Lab, NSF CAREER Award CCF-1940205, and ONR N00014-17-1-2147.
† MIT, [email protected]. Supported by MIT-IBM Watson AI Lab, NSF CAREER Award CCF-1940205, and ONR N00014-17-1-2147.
‡ UC Berkeley, [email protected]. Supported by a Miller Postdoctoral Fellowship.
§ Microsoft Research, [email protected].
¶ Stanford University, [email protected]. Part of this work was done while virtually visiting the Microsoft Research Machine Learning and Optimization group.

Contents
A SDA, Product-SDA, and Simple-vs-Simple Hypothesis Testing
  A.1 Counterexample to Equivalence of Two Notions of Statistical Dimension
  A.2 Statistical Dimension as a Lower Bound for Hypothesis Testing
B VSTAT Algorithms Imply Low-Degree Distinguishers
C Proofs of Cloning Facts
D Omitted Calculations from Applications
  D.1 Tensor PCA
  D.2 Planted Clique
  D.3 Spiked Wishart PCA
  D.4 Gaussian Graphical Models
Introduction
Information-computation tradeoffs are ubiquitous in high-dimensional statistics. As the amount and quality of the data increase, inference and estimation tasks often require fewer computational resources, creating an information-computation gap between the signal-to-noise ratios at which the problem is information-theoretically solvable and at which computationally efficient algorithms are known. This phenomenon is widespread, appearing in estimation of a sparse vector from linear observations, low-rank matrix estimation, sparse principal component analysis, subgraph recovery, random constraint satisfaction, dictionary learning, tensor completion, covariance estimation, phase retrieval, graph matching, and well beyond (cf. [Don06, CRT06, FB96, CT07, LDP07, RFP10, JNS13, CMP10, RCLV13, JOH, CSV13, ACV14, ACBL12, Mon15, Fei02, JL09, BR13b, RBE10, SWW12, FHT08]). Tradeoffs between computational resources and statistical accuracy are also widely observed empirically in machine learning: both increasing model size and using more iterations of gradient descent to fit models to training data often improve generalization [JT18, SHN+18, NKB+19, KMH+20].

A canonical example of an information-computation gap arises in the planted clique problem, where the goal is to find a clique of size k placed at random in a random graph on n vertices. The problem is solvable by exhaustive search for k ≫ log n, but all known polynomial-time algorithms require k = Ω(√n); the planted clique conjecture postulates that the problem is computationally hard if k = o(√n). The foundational work [Jer92] showed lower bounds for Markov-Chain Monte-Carlo methods. [FK03] prove lower bounds against Lovász–Schrijver semidefinite programs, and lower bounds against stronger Sum-of-Squares semidefinite programs were developed later in [BHK+19, DM15, MPW15, HKP+18]. [FGR+17] rules out algorithms for a similar problem in the statistical query model, while [ABDR+18, Ros08, Ros14] study proof and circuit complexity. Most of these lower bounds rule out algorithms for any k = o(√n).

Taken together, these works constitute some evidence for the planted clique conjecture. However, the proliferation of lower bounds suggests a need for unifying principles, especially because this story is repeated for numerous statistical estimation problems: lower bounds against a variety of restricted computational models are proven independently, all usually pointing to the same signal-to-noise ratios tolerated by efficient algorithms. This appears to be a miracle: why, for so many distinct problems, should so many restricted computational models point to the same signal-to-noise thresholds for efficient algorithms? (E.g., k ≥ Ω(√n) for planted clique.) We ask: Are some or all of these restricted models equivalent in power? Do lower bounds in some models imply lower bounds in others?
If a single class of algorithms were to turn out to be at least as powerful as any of the other popular computational models for an interesting class of statistics problems, then numerous lower bounds could be replaced with a single bound. One might hope to achieve this objective by giving reductions between computational models, establishing a hierarchy among them and quelling the proliferation of lower bounds.

In this paper, we make a small step towards this goal. Under mild conditions, we establish the equivalence of two popular frameworks for lower bounds on restricted models of computation for high-dimensional hypothesis testing: statistical dimension and low-degree polynomials. Statistical dimension is closely related to statistical query (SQ) algorithms, and our results also show that algorithms based on low-degree polynomials are at least as powerful as SQ algorithms.
Hypothesis Testing.
We consider simple-versus-simple hypothesis testing problems in which we have one null distribution D_∅ over ℝ^n, and a family of alternative distributions S = {D_u}_{u∈S} over the same space, with a prior distribution μ on S. Under the null hypothesis H_0 we are given samples x_1, …, x_m ∈ ℝ^n generated independently according to D_∅, whereas under the alternative hypothesis H_1 the samples are instead generated according to D_u for u ∼ μ (we often write u ∼ S). The objective is to determine which hypothesis is correct. One example is the sparse principal component analysis problem (sparse PCA), where D_∅ = N(0, I_n), S = {D_u} where for each u ∈ ℝ^n with ‖u‖ = 1 and ρn nonzero entries, D_u = N(0, I_n + c·uu^⊤) (for an absolute constant c > 0), and μ is taken uniform over S; here, the testing problem amounts to detecting the presence of the sparse rank-one spike.

Testing problems are of great interest in their own right; moreover, to give a lower bound for an estimation problem, it is often sufficient to show that a related hypothesis testing problem is hard (see, e.g., [BB20]; estimation and testing are related similarly to search and decision in worst-case complexity).

Since we study a model of computation (low-degree polynomials) which most naturally outputs real rather than Boolean values, we will use the following notion of a successful test between H_0 and H_1.

Definition 1.1 (ε-distinguisher). We call a function p : ℝ^{n×m} → ℝ of m vectors x = (x_1, …, x_m), x_i ∈ ℝ^n, an m-sample ε-distinguisher for a testing problem D_∅ vs. S if

    | E_{x∼D_∅} p(x) − E_{u∼S} E_{x∼D_u} p(x) | > ε · √( Var_{x∼D_∅} p(x) ).

If ε > 1, we call p a good distinguisher.

A hypothesis test with small probability of error automatically furnishes a good distinguisher. The converse is not necessarily true; though one might naturally try to apply thresholding to a distinguisher to obtain a hypothesis test, a good distinguisher may have large variance under the alternative hypothesis H_1, so there is only a one-sided error guarantee. Thus, from the perspective of lower bounds, ruling out the existence of an ε-distinguisher in a restricted computational model is at least as strong as ruling out the existence of a small-error hypothesis test (in that model).

Low Degree Polynomials.
Given m samples x = (x_1, …, x_m) ∈ ℝ^{n×m}, our first model of computation is allowed to output the value of any fixed polynomial p(x) of bounded degree, usually constant or logarithmic in m, n. (A degree-k polynomial can be evaluated in time (nm)^{O(k)} by evaluating all monomials.) Note that this model allows polynomials in all m samples jointly, not just empirical averages over m samples of the form (1/m)∑_{i=1}^m p(x_i).

An extraordinary variety of high-dimensional hypothesis testing algorithms boil down to evaluating low-degree polynomials: for example, most spectral algorithms, the method of moments, algorithms based on small-subgraph statistics, and message passing algorithms (see [KWB19, Hop18]). (As we discuss below, the sparse PCA problem is unlike planted clique in that the number of samples, rather than the signal per sample, governs its information-theoretic and computational complexity.) A recent line of work characterizes the limitations of such algorithms by ruling out the existence of low-degree distinguishers: such lower bounds are now known in the computationally-hard regimes of planted clique [BHK+19] and a number of other problems, in each case matching the performance of any known poly-time hypothesis test.

Statistical Queries and Statistical Dimension.
Our second model of computation is the statistical query (SQ) model VSTAT( m ). VSTAT( m ) algorithms access a distribution D over R n via queries φ : R n → [0 ,
1] to an oracle. For each query φ , the oracle returns E x ∼ D φ ( x ) + ζ , foran adversarially chosen ζ ∈ R with | ζ | max( m , q E [ φ ](1 − E [ φ ]) m ). This approximates E D φ with thesame accuracy as an m -sample empirical estimate under the guarantees of Bernstein’s inequality.The SQ model was first proposed as a framework for designing noise-tolerant algorithms [Kea98],and is a popular restricted model of computation for studying information-computation tradeoffs(see e.g. [FGR +
17, FPV18, DKS17], as well as numerous supervised learning problems). Analgorithm which makes q queries to VSTAT( m ) is a proxy for an algorithm running in time q on m samples, albeit an imperfect one, since (1) the queries φ need not be polynomial-time computable,and (2) each query φ is permitted to be a function of only a single sample.We will treat the SQ model via statistical dimension , a complexity measure on hypothesis testingproblems which implies lower bounds against SQ algorithms. Most existing SQ lower boundsare proved by analyzing one of a few possible notions of statistical dimension. We use a mildstrengthening of the statistical dimension introduced by [FGR + Definition 1.2 (Statistical Dimension) . Let D ∅ vs. S be a testing problem with prior µ . For D u ∈ S , define the relative density D u ( x ) = D u ( x ) D ∅ ( x ) , and the inner product h f, g i = E x ∼ D ∅ f ( x ) g ( x ).The statistical dimension SDA( S , µ, m ) measures tails of (cid:10) D u , D v (cid:11) − u, v drawn from µ .SDA( S , µ, m ) = max (cid:26) q ∈ N : E u,v ∼ µ (cid:2)(cid:12)(cid:12)(cid:10) D u , D v (cid:11) − (cid:12)(cid:12) | A (cid:3) m for all events A s.t. Pr u,v ∼ µ ( A ) > q (cid:27) . Often we will write SDA( m ) or SDA( S , m ) when S and/or µ are clear from context.We offer some intuition about the definition, which may be opaque at first. The quantity h D u , D v i− E x ∼ D u Pr Dv [ x ] Pr D ∅ [ x ] −
1; that is, the centered average of the likelihood ratio of D v to D ∅ over samples from D u . When this quantity is at least δ , D u and D v may have common eventsthat allow one to distinguish them both from D ∅ with probability δ ′ . The statistical dimensionquantifies the measure of pairs of distributions (according to µ ) with no such common events.In [FGR + Theorem 1.3 (Theorem 2.7 of [FGR + . Let D ∅ be a null distribution and S be a set of alternatedistributions over R n . Then any (randomized) statistical query algorithm which solves the hypoth-esis testing problem of D ∅ vs. S with probability at least (1 − δ ) requires at least (1 − δ )SDA( S , m ) queries to VSTAT( m/ (corresponding to m/ samples). We remark on technical differences between our setup and that of [FGR +
17] in Appendices A.1 and A.2. We extend their result to our notion of SDA via a near-identical argument in Appendix A.2. .2 Our Results Our main result is a surprisingly tight equivalence, under mild conditions, between statisticaldimension and the minimum degree of any good distinguisher.Summarizing the discussion of running times and sample complexities above, we might hopeto equate m -sample distinguishers of degree k (which can be evaluated in time ( nm ) O ( k ) ) with2 O ( k ) -query VSTAT( m ) algorithms. To understand the conditions under which this is possible, wefirst observe that planted clique already furnishes a counterexample – a case where a single-querySQ algorithm exists but there is no corresponding low-degree distinguisher. Concretely, to detect a k -clique planted in a graph G from G ( n, / k ≫ log n it suffices to make the single query φ ( G ) = ( G contains a k -clique) to VSTAT(4). By contrast, it is known that no degree o (log n )polynomial successfully distinguishes for any k < n / − ε [BHK + niceness condition, which asks for just slightly more: no high degree function of a very small number ofsamples is a nontrivial distinguisher.While niceness rules out problems like planted clique (which is what we want), we will see thatit allows “many-sample” problems such as sparse PCA – precisely the type of problems for whichthe SQ model can capture interesting information-computation gaps. After our main theoremstatement (Remark 1.9) we describe a principled approach to transform one-shot problems likeplanted clique into many-sample problems, so that they can also be studied with our techniques. Definition 1.4 (( δ, k )-nice) . Fix a null distribution D ∅ on R N . Call a function p : R N × k → R of k vectors x , . . . , x k ∈ R N k -purely high degree if it is orthogonal to all functions f ( x , . . . , x k ) whichhave degree at most k in one of x , . . . , x k – that is, E x ,...,x k ∼ D ∅ p ( x , . . . , x k ) f ( x , . . . 
, x k ) = 0 forall such f . The testing problem D ∅ , { D u } u ∈S is ( δ, k )-nice if no k -purely high-degree function of k samples is a δ -distinguisher.We emphasize that ( δ, k )-niceness concerns hardness of a testing problem when given very few samples – we typically think of k = O (1) or k = polylog N . We will show that almost any reasonablemulti-sample testing problem which is not too easy to solve with k samples becomes nice after theaddition of a small amount of noise. The following is stated for a coordinate-wise resampling noiseprocess – it follows from standard arguments about noise operators and high-degree functions. InSection 5 we give versions allowing a broad class of noise processes (additive Gaussian noise, randomrestriction, etc.). Fact 1.5 (See Theorem 5.2) . Let S = { D u } , D ∅ be a testing problem on R N and suppose that D ∅ = D ⊗ N is a product distribution. Let k ∈ N and suppose that S , D ∅ does not have a k -sample C -distinguisher. Let S ′ = { D ′ u } , where to sample x ′ ∼ D ′ u we first sample x ∼ D u and then eachcoordinate x i is independently replaced with a fresh sample from D with probability ρ ∈ [0 , . Then D ∅ versus S ′ is ( C (1 − ρ ) k , k ) -nice. Many natural high-dimensional hypothesis testing problems are robust to noise (including themain examples we have mentioned so far), and remain qualitatively unchanged by the additionof some form of noise captured by our theorems. The typical effect is a small decrease in thesignal-to-noise ratio in each sample. In typical applications, C = O (1), and when working with m samples we will want roughly ( m − k/ , k )-niceness, which we can achieve by taking k ≈ log m and ρ a small constant, so that S and S ′ are very similar. In this case, our main theorem will4ead to (log m ) -degree distinguishers, whereas brute-force algorithms would correspond to degree N m ≫ (log m ) – with more refined definitions later on, in many cases (e.g. 
Planted Clique) wecan avoid the logarithmic loss and replace (log m ) with log m . Main Theorem.
We turn to our main theorem. On first reading we suggest the interpretation that m′ = m and k is constant or logarithmic in m.

Theorem 1.6 (Main Theorem, see Theorem 3.1 and Theorem 4.1). Let D_∅ vs. S be an (m^{−k/2}/4, k)-nice testing problem on ℝ^N for some even k > 0.

1. If there is some m′ ≤ m such that SDA(S, m′) ≤ (m/m′)^{k/2} (in particular, if there is an SQ algorithm making o(2^{k/2}) queries to VSTAT(m/3)), then there is a good mk-sample distinguisher p which has degree d ≤ k², and

2. if there is a degree-k function p which is a good m-sample distinguisher, then there exists m′ ≤ m such that SDA(S, m′) ≤ (m/m′)^{O(k)} (e.g. SDA(S, m) ≤ 2^{O(k)}).

Using Fact 1.5, we already see that Theorem 1.6 applies to any noisy testing problem. Even without adding noise, our next theorem shows that the guarantees of Theorem 1.6 apply to some problems with additional structure; for instance, if D_∅ and the D_u's are all product distributions. (This is the case even though such problems may not be nice; we are still able to apply a variant of the proof of Theorem 1.6.) This leads to slightly tighter results, especially for problems where the difference between degree-log m and degree-poly(log m) distinguishers is important.

Theorem 1.7 (Gaussian or Independent Coordinates, see Theorems 6.1 & 6.3). Let S = {D_u}, D_∅ be a testing problem on ℝ^N with one of the following structures:

• D_∅ = N(0, I_N) is the standard Gaussian distribution and each D_u = N(u, I_N) for some vector u ∈ ℝ^N, or

• D_∅ and all D_u are product measures on {±1}^N.

Let m, k ∈ ℕ with k ≪ m and suppose that S, D_∅ has no k-sample k-distinguisher. Then the conclusion of Theorem 1.6 holds for S (with the upper bound on d in part 1 replaced by d ≤ O(k)).

Even with the additional requirements, Theorem 1.7 captures numerous interesting problems: spiked matrix and tensor models, variants of random constraint satisfaction and linear equations, community detection, and beyond.
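For the Gaussian structure in Theorem 1.7, the pairwise quantity appearing in Definition 1.2 has a well-known closed form: when D_u = N(u, I_N), the relative density is D̄_u(x) = exp(⟨u, x⟩ − ‖u‖²/2), and a standard Gaussian moment-generating-function computation gives ⟨D̄_u, D̄_v⟩ = exp(⟨u, v⟩). The following sketch (our own illustration, not code from the paper) checks this identity by Monte Carlo:

```python
import numpy as np

def rel_density(u, x):
    """Relative density dN(u, I)/dN(0, I), evaluated at the rows of x."""
    return np.exp(x @ u - 0.5 * u @ u)

rng = np.random.default_rng(0)
N = 4
u = 0.3 * rng.standard_normal(N)
v = 0.3 * rng.standard_normal(N)

x = rng.standard_normal((400_000, N))       # Monte Carlo samples from the null N(0, I_N)
mc = np.mean(rel_density(u, x) * rel_density(v, x))   # estimates <D̄_u, D̄_v>
closed_form = np.exp(u @ v)                 # the identity <D̄_u, D̄_v> = exp(<u, v>)
print(mc, closed_form)
```

In particular, the SDA computations for Gaussian mean-shift problems reduce to tail bounds on exp(⟨u, v⟩) − 1 for u, v drawn from the prior.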
Remark 1.8 (Simulation Arguments Are Lossy). A natural approach to prove a theorem like Theorem 1.6 would be to naïvely simulate SQ algorithms by low-degree distinguishers and vice versa. However, direct simulation arguments that we are aware of (for instance, taking each monomial in a low-degree distinguisher to be an SQ query) at best relate SDA(S, m) to low-degree distinguishers on poly(m) samples (or vice versa). By contrast, Theorem 1.6 translates between SDA(S, m) and low-degree distinguishers on approximately m samples; this is crucial for most applications, where information-computation gaps occur on the scale of m versus poly(m) samples.

Our proof of Theorem 1.6 directly relates statistical dimension to the minimum degree of a distinguisher, without a simulation argument. We also give (Appendix B) a different proof of a slightly weaker version of part 1 of Theorem 1.6, which is based on a simulation-style argument (though it has a non-constructive component) of an algorithm making calls to VSTAT via a low-degree distinguisher without poly(m) losses. (As mentioned above, Theorems 3.1 and 4.1 are stated in terms of a more refined notion of degree, defined in Section 2, which allows us in many cases to improve the bound to d ≤ O(k), which is the best we can hope for. The quantitative bounds obtained in Appendix B are identical to those of Theorem 1.6; the theorem is weaker because the existence of a VSTAT algorithm is a stronger assumption than an upper bound on the statistical dimension.)

Remark 1.9 (One-Shot Versus Multi-Sample Problems). Theorem 1.6 only applies to nice testing problems. In particular, niceness rules out many "one-shot" problems which are information-theoretically easy to solve with a single sample, such as the usual formulation of planted clique, where the SQ model does not make sense; the model originates in PAC learning, where having many independent samples is fundamental. By contrast, low-degree tests can still be formulated for one-shot problems.

To give evidence of hardness for a one-shot problem in the SQ framework, one must first formulate a multi-sample version. For instance, the SQ lower bounds of [FGR+17] for planted clique treat a "bipartite" version where each sample is the adjacency list of a node in a bipartite graph. These multi-sample formulations are often ad hoc, which is problematic, as the choice of multi-sample version can significantly affect the resulting statistical query complexity!

Based on Theorem 1.6, we propose a canonical approach to translate one-shot problems into nice many-sample problems: decrease the per-sample signal-to-noise ratio (e.g., clique size versus graph density in planted clique) until the resulting problem is information-theoretically unsolvable given O(1) independent samples, while simultaneously increasing the number of samples appropriately. For example, in a Gaussian model, one sample from N(u, I) is equivalent to m samples from N(u/√m, I). In numerous cases (additive Gaussian models and planted clique, for example) this yields problems which are polynomial-time equivalent to the underlying one-shot problem (see Section 7). For an illustration, see the Tensor PCA problem discussed in and above Corollary 1.11.

Proof Overview. We outline the proof of case (1) of our main theorem; case (2) follows a similar argument in reverse. We argue contrapositively, starting with the hypothesis that there is no good degree-k m-sample distinguisher. For this sketch, we ignore the case m′ < m and consider the goal of proving a lower bound on the statistical dimension SDA(S, m). Unpacking the definition of SDA, this amounts to the tail bound E_{u,v∼S}[|⟨D̄_u, D̄_v⟩ − 1| | A] ≲ 1/m for any event A of probability roughly 2^{−k}. This tail bound will be implied by an upper bound on the k-th moment: our goal will be to show E_{u,v∼S}(⟨D̄_u, D̄_v⟩ − 1)^k ≲ m^{−k}. (Indeed, by Hölder's inequality, E[|X| | A] ≤ (E[|X|^k])^{1/k} · Pr(A)^{−1/k}, which is O(1/m) when E[|X|^k] ≤ m^{−k} and Pr(A) ≥ 2^{−k}.)

Simple manipulations (which rely on the independence of the samples) show that the maximum value of α such that there is a k-sample α-distinguisher is given by the related quantity α = √( E_{u,v∼S} ⟨D̄_u, D̄_v⟩^k − 1 ). To see why, recall that a k-sample ε-distinguisher is a function of k samples, p(x_1, …, x_k), that satisfies

    ε · ( Var_{D_∅^{⊗k}} p )^{1/2} ≤ | E_{u∼S} E_{D_u^{⊗k}} p − E_{D_∅^{⊗k}} p | = | ⟨p, E_{u∼S} D̄_u^{⊗k} − 1⟩_{D_∅^{⊗k}} |.

By rescaling we may without loss of generality consider p with Var_{D_∅^{⊗k}} p = ⟨p, p⟩_{D_∅^{⊗k}} = 1. So now by Cauchy–Schwarz and by the independence of the samples,

    ε = | ⟨p, E_{u∼S} D̄_u^{⊗k} − 1⟩_{D_∅^{⊗k}} | ≤ √( E_{u,v∼S} ⟨D̄_u^{⊗k} − 1, D̄_v^{⊗k} − 1⟩_{D_∅^{⊗k}} ) = √( E_{u,v∼S} ⟨D̄_u, D̄_v⟩_{D_∅}^k − 1 ),

where in the final step we have used that D̄_u is a density and so ⟨D̄_u^{⊗k}, 1⟩ = 1, as well as the independence of the samples. (Here, for a distribution D, ⟨f, g⟩_D = E_{x∼D} f(x)g(x), D^{⊗k} is the joint distribution of k random samples from D, and for a function f(x), f^{⊗k}(x_1, …, x_k) = ∏_{i=1}^k f(x_i).) By choosing the p for which the Cauchy–Schwarz is tight, we have our conclusion.

Thus, pretending for the sake of this overview that the k-th moment E_{u,v}(⟨D̄_u, D̄_v⟩ − 1)^k ≈ E_{u,v} ⟨D̄_u, D̄_v⟩^k − 1, to show that E_{u,v∼S}(⟨D̄_u, D̄_v⟩ − 1)^k ≲ m^{−k}, it suffices for us to rule out k-sample m^{−k/2}-distinguishers. Since by assumption D_∅ versus S is (m^{−k/2}, k)-nice, such a distinguisher cannot be k-purely high degree. Via a careful application of Hölder's inequality (Lemma 3.4), we are able to show that it suffices to consider only functions of purely high degree or purely low degree. The main challenge is now to rule out a low-degree k-sample m^{−k/2}-distinguisher; that is, we need to show that every function p(x_1, …, x_k) with degree at most k in each sample x_i has

    | E_{u∼S} E_{D_u^{⊗k}} p − E_{D_∅^{⊗k}} p | ≲ m^{−k/2} · √( Var_{D_∅^{⊗k}} p ).    (1)

Since we are analyzing k-sample distinguishers, it is not a priori clear how such a 1/poly(m) bound on the distinguishing power can appear, especially given that m ≫ k. Our key insight is that this strong quantitative bound follows from the assumption that there is no good degree-k m-sample distinguisher:

Lemma 1.10 (Key Lemma, Informal; see Claim 3.3, Lemma 3.5). If there is no good m-sample degree-k distinguisher for the testing problem D_∅ versus S, then no function p(x_1, …, x_k) with degree at most k in each sample is an m^{−k/2}-distinguisher.

Once the (very careful) setup is in place, this lemma follows from elementary Fourier analysis, exploiting independence of samples. Nonetheless, we find it striking that a relatively mild assumption on the distinguishing power of low-degree polynomials of m samples can be boosted into a strong quantitative bound on the distinguishing power of low-degree polynomials of k ≪ m samples. This lemma leads to (1), finishing the proof.

Niceness of Noise-Robust Problems.
To show that noise-robust testing problems satisfy the niceness criterion (Fact 1.5 and its generalizations in Section 5), we again use Fourier analysis; for some types of noise our arguments are entirely standard, exploiting the attenuation of high-degree functions under i.i.d. noise. We also allow for noise processes which make sense for problems with combinatorial structure which would be adversely affected by i.i.d. coordinate-wise noise (e.g. hypergraph planted clique); showing that these also lead to nice testing problems uses similar ideas but requires more care.
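The attenuation mechanism behind Fact 1.5 can be seen in a toy computation (our own sketch, not code from the paper): under coordinate-wise resampling noise with parameter ρ, each coordinate of the noisy sample correlates with the original by a factor (1 − ρ), so a degree-D multilinear term such as a parity on {±1}^D retains only a (1 − ρ)^D fraction of its correlation.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, D, M = 0.3, 5, 400_000

x = rng.choice([-1, 1], size=(M, D))        # samples from the uniform product null on {±1}^D
resample = rng.random((M, D)) < rho         # coordinates hit by the resampling noise
fresh = rng.choice([-1, 1], size=(M, D))
x_noisy = np.where(resample, fresh, x)

# Correlation of the degree-D parity before and after noise:
parity = x.prod(axis=1)
parity_noisy = x_noisy.prod(axis=1)
empirical = np.mean(parity * parity_noisy)
predicted = (1 - rho) ** D                  # (1 - 0.3)^5 ≈ 0.168
print(empirical, predicted)
```

A k-purely high-degree function of k samples has degree greater than k in every sample, hence total degree greater than k², which is the source of the (1 − ρ)^{k²} factor in Fact 1.5.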
Avoiding Niceness for Product and Gaussian Distributions.
Finally, we overview the proof of Theorem 1.7. We need to avoid the use of the niceness assumption that we described in the overview above of the proof of Theorem 1.6. That is, we need a different way to rule out high-degree k-sample m^{−k/2}-distinguishers. Roughly speaking, we show that under either the product or Gaussian assumptions, a high-degree k-sample α-distinguisher cannot exist unless a low-degree one does; then we follow the argument above to rule out low-degree k-sample m^{−k/2}-distinguishers. This argument turns on the fact that, for Gaussian and product distributions, high-degree moments are simple functions of low-degree moments. (See Lemmas 6.2 and 6.4 for the details.)

Applications. We use our equivalence theorems to obtain new information-computation lower bounds for a number of testing problems. We obtain new lower bounds against SQ algorithms for tensor PCA (Corollary 8.4), (Hypergraph) Planted Clique and Planted Dense Subgraph (8.14), and sparse PCA (8.22), and we obtain new lower bounds against low-degree distinguishers for Gaussian mixture models (8.29) and Gaussian Graphical Models (8.32). Our bounds are obtained essentially "for free" by starting with known SDA or degree lower bounds, then applying Theorem 1.6 and its derivatives. (One exception is the Gaussian Graphical Models bound, for which we prove an SQ lower bound from scratch. Interestingly, for this problem, it seems easier to prove SDA lower bounds than degree lower bounds.)

In the case of planted clique, in addition to capturing the "bipartite" model of [FGR+17], we introduce a multi-sample version in which one observes m independent copies of the adjacency matrix of G(n, p^{1/m}), or of G(n, p^{1/m}) with the same planted k-clique in each copy. We show in Lemma 7.3 that our version is information-theoretically and computationally equivalent to the standard version of planted clique (albeit with slightly higher-than-usual edge density p > 1/2).

For tensor PCA, the one-shot problem is to distinguish an order-3 tensor G with i.i.d. entries from N(0, 1) from a planted tensor of the form G + λu^{⊗3}, where G is as before, λ > 0, and u is a unit vector. In Lemma 7.2 we show that this problem is in fact equivalent (both statistically and computationally) to the following m-sample problem: distinguish between i.i.d. G_1, …, G_m and G_1 + (λ/√m)·u^{⊗3}, …, G_m + (λ/√m)·u^{⊗3}.

By combining known bounds against low-degree distinguishers [HKP+17, KWB19] with Theorem 1.6, we obtain a new SQ lower bound against the multi-sample version of Tensor PCA:

Corollary 1.11 (SQ lower bound for Tensor PCA (special case of Corollary 8.4)). Let D_∅ = N(0, I_{n³}) and for unit u ∈ ℝ^n let D_u = N(u^{⊗3}, I_{n³}). Let S be the uniform distribution on {D_u}_{u∈{±1/√n}^n}. Any SQ algorithm solving the testing problem S versus D_∅ requires at least n^{ω(1)} queries to VSTAT(n^{3/2}/(log n)^{O(1)}).

Up to logarithmic factors, this SQ lower bound matches the best known polynomial-time algorithms, which require at least m ≥ Ω̃(n^{3/2}) samples (or, for the one-shot problem, λ ≥ Ω̃(n^{3/4})) [HSS15]. We discuss the information-computation tradeoff in greater detail in Section 8.1. We note that similar bounds for tensor PCA were obtained concurrently and independently in [DH20].

Related Work. Researchers have long been aware of the information-computation gap phenomenon, with early work showing such gaps in artificially constructed learning problems [DGR00, Ser99, SSST12] and more recent work focusing on algorithms that trade off between statistical and computational efficiency [SSS08, BKR+11, SSST12, CJ13, CX16]. Our goal here is to establish an equivalence between large classes of algorithms for a wide range of problems in high-dimensional statistics: low-degree distinguishers and SQ algorithms. Several prior works have a similar theme: in related contexts, [HKP+17] shows that Sum-of-Squares semidefinite programs are no more powerful than a restricted class of spectral algorithms for hypothesis testing, and [FGV17] shows that a restricted class of convex programs is captured by SQ algorithms. (This class of spectral algorithms, to our knowledge, is not captured by low-degree distinguishers.)

Several related lines of work establish algorithm-independent or structural properties of high-dimensional statistics problems which imply hardness results against restricted models of computation, statistical dimension being one example. Other examples come from statistical physics, where overlap gaps and, more generally, solution-space geometry are related to the performance of algorithms such as Markov-Chain Monte Carlo and message passing [AWZ20, GJS19].

More broadly, information-computation tradeoffs have been studied in many restricted computational models: e.g. message-passing algorithms (see [MM09, ZK16] for overviews; we highlight recent work [WEAM19] focusing on running time versus information tradeoffs) and Markov-Chain Monte Carlo (e.g. [Jer92] and subsequent work).

Statistical Query Model.
The SQ model was proposed by Kearns as a framework for designingnoise-tolerant algorithms for PAC learning [Kea98]. Blum et al. shortly thereafter introducedstatistical query dimension [BFJ +
94] as a framework for proving lower bounds on SQ algorithmsfor supervised learning. The SQ framework has since been generalized to hypothesis testing andestimation [FGR +
17, FPV18].An advantage of SQ lower bounds is their implications for other algorithms: since many al-gorithms can be implemented with SQ oracle access, SQ lower bounds immediately imply lowerbounds against a number of other algorithms, including some convex programs, gradient descent,and more (see e.g. [FGV17]).SQ lower bounds abound in the study of high-dimensional learning – recent examples are inrobust statistics [DKS17, DKS19], polytopes [KS07], neural nets [GGJ + q but we do notknow any q -query VSTAT algorithms. A complete characterization is given in [Fel12]. In lightof this, our results equate the power of low-degree distinguishers with a computational modelthat is at least as powerful as VSTAT. There are a number of other statistical query models forhypothesis testing problems defined in the literature, for example the MVSTAT oracle of [FPV18].An interesting open problem is whether a more direct equivalence (via simulation argument) canbe achieved in an alternative SQ model. Low-Degree Tests.
Using low-degree polynomials to prove computational lower bounds is a classical idea in theoretical computer science; see e.g. [Bei93] on the polynomial method in circuit complexity. Their recent study as a restricted model of computation for high-dimensional estimation and hypothesis testing problems emerged implicitly in the literature on Sum-of-Squares lower bounds [BHK+...].

Organization.
Section 2 contains preliminaries; the proofs of parts 1 and 2 of Theorem 1.6 follow in Sections 3 and 4. In Section 5 we obtain corollaries for noise-robust problems (generalizations of Fact 1.5), and in Section 6 we derive even stronger corollaries for product measures (Theorem 1.7). Section 7 contains a discussion of the cloning methodology for transforming a one-shot problem into an appropriate multi-sample problem for the SQ framework. Section 8 applies our main results to obtain new lower bounds for a number of testing problems. Appendices A.1 and A.2 give some further details on statistical dimension. Appendix B gives an argument showing how VSTAT algorithms can be simulated directly by low-degree distinguishers. Some calculations are postponed to Appendices C and D.
We study hypothesis testing problems $D_\emptyset$ vs. $\mathcal{S} = \{D_u\}_{u \in S}$ with a prior $\mu$ over $S$. We frequently write $u \sim \mathcal{S}$ or $u \sim S$ to indicate that $D_u$ is sampled from $\mathcal{S}$ according to the marginal $\mu$. We use $\bar{D}_u$ to refer to the likelihood ratio or relative density $D_u / D_\emptyset$, where the background measure $D_\emptyset$ will be clear from context. We always assume that the likelihood ratio is finite and that $\mathbb{E}_{x \sim D_\emptyset} (D_u(x)/D_\emptyset(x))^2 < \infty$ for every $D_u$. This holds if $D_\emptyset, D_u$ have finite support and the support of $D_u$ is contained in that of $D_\emptyset$; it can also be enforced for continuous distributions by mild truncation of tails.

For $\mathbb{R}$-valued functions $f, g$, let the inner product $\langle f, g \rangle_{D_\emptyset} = \mathbb{E}_{x \sim D_\emptyset} f(x) g(x)$ and the corresponding norm $\|f\|_{D_\emptyset} = \langle f, f \rangle_{D_\emptyset}^{1/2}$. We drop the subscript $D_\emptyset$ when $D_\emptyset$ is clear from context. Note that always $\langle \bar{D}_u, 1 \rangle = 1$. For a distribution $D$ and an integer $k$, let $D^{\otimes k}$ denote the joint distribution of $k$ independent samples from $D$. We will often use $\langle f^{\otimes k}, g^{\otimes k} \rangle_{D_\emptyset^{\otimes k}} = \langle f, g \rangle_{D_\emptyset}^k$, which is a consequence of independence.

For $D_\emptyset$ over $\mathbb{R}^n$, $d$ a non-negative integer, and any function $f : \mathbb{R}^n \to \mathbb{R}$, we let $f^{\le d}$ denote the orthogonal (w.r.t. $D_\emptyset$) projection of $f$ to the span of functions of degree at most $d$ in $x$. We similarly define $f^{>d} = f - f^{\le d}$.
We will repeatedly use the folklore fact that the optimal $m$-sample low-degree test for a problem $\mathcal{S}, D_\emptyset$ has a canonical form: it is the projection of the $m$-sample likelihood ratio $\mathbb{E}_{u \sim \mathcal{S}} \bar{D}_u^{\otimes m}$ to the span of functions of low degree. In fact, a more general statement is true (which we have essentially proved in Section 1.2.1):

Fact 2.1. Let $D_\emptyset$ vs. $\mathcal{S}$ be a testing problem on $\mathbb{R}^n$. Let $\mathcal{C}$ be a linear subspace of functions $p : (\mathbb{R}^n)^{\otimes m} \to \mathbb{R}$, and let $\Pi_{\mathcal{C}}$ be the orthogonal projection to the subspace $\mathcal{C}$. Then

$$\operatorname*{arg\,max}_{p \in \mathcal{C},\; \mathbb{E}_{D_\emptyset^{\otimes m}} p^2 \le 1} \left| \mathbb{E}_{u \sim \mathcal{S}} \mathbb{E}_{D_u^{\otimes m}} p - \mathbb{E}_{D_\emptyset^{\otimes m}} p \right| = \frac{\Pi_{\mathcal{C}}\left( \mathbb{E}_{u \sim \mathcal{S}} \bar{D}_u^{\otimes m} - 1 \right)}{\left\| \Pi_{\mathcal{C}}\left( \mathbb{E}_{u \sim \mathcal{S}} \bar{D}_u^{\otimes m} - 1 \right) \right\|_{D_\emptyset^{\otimes m}}}.$$

Letting $p^*$ be the optimizer of the above program, observe also that

$$\mathbb{E}_{u \sim \mathcal{S}} \mathbb{E}_{D_u^{\otimes m}} p^* - \mathbb{E}_{D_\emptyset^{\otimes m}} p^* = \left\| \Pi_{\mathcal{C}}\left( \mathbb{E}_{u \sim \mathcal{S}} \bar{D}_u^{\otimes m} - 1 \right) \right\|_{D_\emptyset^{\otimes m}}.$$

Consequently,
Fact 2.2. If $\left\| \Pi_{\mathcal{C}}\left( \mathbb{E}_{u \sim \mathcal{S}} \bar{D}_u^{\otimes m} - 1 \right) \right\| \le \varepsilon$, then $D_\emptyset$ vs. $\mathcal{S}$ has no $m$-sample $\varepsilon$-distinguisher in $\mathcal{C}$.

Samplewise Degree. Rather than directly ruling out distinguishers of low degree, it will be convenient for us to introduce a notion of degree which agrees with the product structure (across samples) of $D_\emptyset^{\otimes m}$.

Definition 2.3 (Samplewise degree). For integers $m, n \ge 1$, we say that a function $f : (\mathbb{R}^n)^{\otimes m} \to \mathbb{R}$ has samplewise degree $(d, k)$ if $f(x_1, \ldots, x_m)$ can be written as a linear combination of functions which have degree at most $d$ in each $x_i$, and nonzero degree in at most $k$ of the $x_i$'s.

Note that a function of samplewise degree $(d, k)$ has degree at most $d \cdot k$, and a function of degree $d$ has samplewise degree at most $(d, d)$. In order to rule out low-degree distinguishers, we will rule out low-samplewise-degree distinguishers using Fact 2.2. We denote the orthogonal projection of $f : (\mathbb{R}^n)^{\otimes m} \to \mathbb{R}$ to the span of samplewise degree $(d, k)$ functions by $f^{\le d, \le k}$. We define the following quantity:

Definition 2.4 (Low degree likelihood ratio). For a hypothesis testing problem $D_\emptyset$ vs. $\mathcal{S} = \{D_u\}$, the $m$-sample $(d, k)$-low degree likelihood ratio function is the projection of the $m$-sample likelihood ratio $\mathbb{E}_{u \sim \mathcal{S}} \bar{D}_u^{\otimes m}$ to the span of non-constant functions of samplewise degree at most $(d, k)$:

$$\left( \mathbb{E}_{u \sim \mathcal{S}} \bar{D}_u^{\otimes m} - 1 \right)^{\le d, \le k} = \mathbb{E}_{u \sim \mathcal{S}} \left( \bar{D}_u^{\otimes m} \right)^{\le d, \le k} - 1.$$

We refer to this function as the $(d, k)$-LDLR$_m$. Abusing terminology, we also use $(d, k)$-LDLR$_m$ to refer to the norm of the low degree likelihood ratio, $\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\otimes m} )^{\le d, \le k} - 1 \|$.

In this section, we prove part 1 of Theorem 1.6, showing that an upper bound on the low-degree likelihood ratio's norm (LDLR) implies lower bounds on the statistical dimension.
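As a sanity check of Definition 2.4 (our own illustration, not part of the paper), the following sketch computes the $(1,k)$-LDLR$_m$ for the simplest possible instance: one $\pm 1$ bit per sample, uniform null, and a symmetric two-point prior on the bias. The parameter values ($m$, $k$, and the bias `mu`) are arbitrary choices.

```python
import itertools, math

m, k = 6, 3
mu = 0.4
prior = [mu, -mu]  # u is +mu or -mu with equal probability

# Relative density of one sample under D_u vs. the uniform null on {±1}:
# D_u(x)/D_0(x) = 1 + u*x, so for m i.i.d. samples it is prod_i (1 + u*x_i).
def lr(u, x):
    out = 1.0
    for xi in x:
        out *= 1.0 + u * xi
    return out

cube = list(itertools.product([-1, 1], repeat=m))
f = {x: sum(lr(u, x) for u in prior) / len(prior) for x in cube}  # E_u of the m-sample LR

def fourier(S):
    # Fourier coefficient of f at chi_S(x) = prod_{i in S} x_i (uniform measure).
    return sum(fx * math.prod(x[i] for i in S) for x, fx in f.items()) / len(cube)

# Squared (1,k)-LDLR_m: sum of squared Fourier coefficients over 1 <= |S| <= k
# (degree <= 1 per sample is automatic here, since the LR is multilinear).
brute = sum(fourier(S) ** 2
            for r in range(1, k + 1)
            for S in itertools.combinations(range(m), r))

# Closed form: sum_{t=1}^k binom(m,t) * (E_u[u^t])^2, since here
# <LR_u^{<=1}, LR_v^{<=1}> - 1 = u*v for a single sample.
Eut = lambda t: sum(u ** t for u in prior) / len(prior)
closed = sum(math.comb(m, t) * Eut(t) ** 2 for t in range(1, k + 1))

print(brute, closed)
assert abs(brute - closed) < 1e-9
```

For this symmetric prior only the even $t$ survive, so both quantities equal $\binom{6}{2}\mu^4$; the closed form is exactly the binomial structure that Claim 3.3 below establishes in general.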
Theorem 3.1 (LDLR to SDA Lower Bounds). Let $d, k \in \mathbb{N}$ with $k$ even and let $\mathcal{S} = \{D_u\}_{u \in S}$ be a collection of probability distributions with prior $\mu$ over $S$. Suppose that $\mathcal{S}$ satisfies:

1. The $k$-sample high-degree part of the likelihood ratio is bounded by $\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{>d} )^{\otimes k} \| \le \delta$.

2. For some $m \in \mathbb{N}$, the $(d, k)$-LDLR$_m$ is bounded by $\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\otimes m} )^{\le d, \le k} - 1 \| \le \varepsilon$.

Then for any $q > 0$, it follows that

$$\mathrm{SDA}\left( \mathcal{S}, \frac{m}{q^{1/k}\left( k \varepsilon^{2/k} + \delta^{2/k} m \right)} \right) > q.$$

Notice that for an $(m^{-k/2}/2, k)$-nice testing problem, Condition 1 of Theorem 3.1 holds with $d = k$ and $\delta = m^{-k/2}/2$. For $(m^{-k/2}/2, k)$-nice problems with no good $4mk$-sample degree-$k$ distinguisher (and therefore no good samplewise degree $(k, k)$ distinguisher), setting $q = (2m/m')^{k/2}$ in Theorem 3.1 implies that $\mathrm{SDA}(\mathcal{S}, \Theta(m'/k)) > (2m/m')^{k/2}$, which establishes the contrapositive of part 1 of Theorem 1.6. In subsequent sections, we will demonstrate that the niceness condition holds for many natural hypothesis testing problems (or in some cases, holds if the $(d, k)$-LDLR$_m$ is small). Combining these conditions with Theorem 3.1 will yield Theorems 5.2, 6.1 and 6.3.

Proof of Theorem 3.1; for an overview see Section 1.2.
Let $X$ be the random variable $X = \left| \langle \bar{D}_u, \bar{D}_v \rangle - 1 \right|$ for $u, v \sim \mathcal{S}$ sampled independently according to the prior $\mu$. By definition, $\mathrm{SDA}(\mathcal{S}, t) > q$ if $\mathbb{E}[X \mid A] \le 1/t$ for all events $A$ over the choice of $u, v$ of probability at least $1/q$. So our goal is to show that $\mathbb{E}[X \mid A] \le q^{1/k}( k m^{-1} \varepsilon^{2/k} + \delta^{2/k} )$. We relate $\mathbb{E}[X \mid A]$ to moments of $X$ via Hölder's inequality:

Fact 3.2. If $x$ is a real-valued random variable and $A$ is any event, then $\mathbb{E}[\,|x| \mid A] \le \left( \mathbb{E}[\,|x|^k] / \Pr[A] \right)^{1/k}$.

We prove the fact below for completeness. Since we have assumed that $k$ is even,

$$\mathbb{E} X^k = \mathbb{E}_{u,v \sim \mathcal{S}} \left( \langle \bar{D}_u, \bar{D}_v \rangle_{D_\emptyset} - 1 \right)^k = \mathbb{E}_{u,v \sim \mathcal{S}} \left( \langle \bar{D}_u - 1, \bar{D}_v - 1 \rangle_{D_\emptyset} \right)^k = \left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u - 1 )^{\otimes k} \right\|_{D_\emptyset^{\otimes k}}^2,$$

where we have first used that $\langle \bar{D}_u, 1 \rangle = 1$ for all $u \in S$, and then the independence of the samples. Applying Fact 3.2,

$$\max_{A \,:\, \Pr_{u,v \sim \mathcal{S}}[A] \ge 1/q} \mathbb{E}_{u,v \sim \mathcal{S}}\left[ \left| \langle \bar{D}_u, \bar{D}_v \rangle - 1 \right| \,\middle|\, A \right] \le \left( q \cdot \left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u - 1 )^{\otimes k} \right\|^2 \right)^{1/k}. \quad (2)$$

Now, applying Hölder's inequality (see Lemma 3.4 below), we can split the degree $\le d$ and degree $> d$ parts of $\bar{D}_u - 1$:

$$\left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u - 1 )^{\otimes k} \right\|^{2/k} \le \left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\le d} - 1 )^{\otimes k} \right\|^{2/k} + \left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{>d} )^{\otimes k} \right\|^{2/k}. \quad (3)$$

The second right-hand-side term is bounded by $\delta^{2/k}$ from Condition 1. So, it remains to bound the first term. This is our crucial "boosting" step. We employ the following structural claim, which uses the independence of the samples to relate the correlation of the $(d, k)$ projections of $m$-sample likelihood ratios to the correlation of the $(d, k)$ projections of $k$-sample likelihood ratios, with $k \ll m$:

Claim 3.3.
Let $D_u, D_v$ be distributions with relative densities $\bar{D}_u, \bar{D}_v$. Then their $(d, k)$-projections are related as follows:

$$\left\langle ( \bar{D}_u^{\otimes m} )^{\le d, \le k}, ( \bar{D}_v^{\otimes m} )^{\le d, \le k} \right\rangle - 1 = \sum_{t=1}^{k} \binom{m}{t} \cdot \left( \langle \bar{D}_u^{\le d}, \bar{D}_v^{\le d} \rangle - 1 \right)^t.$$

We give the (simple) proof of this claim below. Now, by linearity of expectation, the squared $(d, k)$-LDLR$_m$ is equal to

$$\left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\otimes m} )^{\le d, \le k} - 1 \right\|^2 = \mathbb{E}_{u,v \sim \mathcal{S}} \left\langle ( \bar{D}_u^{\otimes m} )^{\le d, \le k}, ( \bar{D}_v^{\otimes m} )^{\le d, \le k} \right\rangle - 1 = \mathbb{E}_{u,v \sim \mathcal{S}} \sum_{t=1}^{k} \binom{m}{t} \left( \langle \bar{D}_u^{\le d}, \bar{D}_v^{\le d} \rangle - 1 \right)^t,$$

where in the final equality we applied Claim 3.3. Each term in the sum is non-negative, since $\mathbb{E}_{u,v \sim \mathcal{S}} ( \langle \bar{D}_u^{\le d}, \bar{D}_v^{\le d} \rangle - 1 )^t = \| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\le d} - 1 )^{\otimes t} \|^2 \ge 0$. So Condition 2 ($\| \mathbb{E}_u ( \bar{D}_u^{\otimes m} )^{\le d, \le k} - 1 \| \le \varepsilon$) combined with the above implies that

$$\varepsilon^2 \ge \left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\otimes m} )^{\le d, \le k} - 1 \right\|^2 \ge \left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\otimes m} )^{\le d, \le k} - 1 \right\|^2 - \left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\otimes m} )^{\le d, \le k-1} - 1 \right\|^2 = \binom{m}{k} \cdot \mathbb{E}_{u,v \sim \mathcal{S}} \left( \langle \bar{D}_u^{\le d}, \bar{D}_v^{\le d} \rangle - 1 \right)^k \ge 0.$$

Dividing through by $\binom{m}{k}$, we have

$$\mathbb{E}_{u,v}\left( \langle \bar{D}_u^{\le d}, \bar{D}_v^{\le d} \rangle - 1 \right)^k = \left\| \mathbb{E}_u ( \bar{D}_u^{\le d} - 1 )^{\otimes k} \right\|^2 \le \frac{\varepsilon^2}{\binom{m}{k}} \le \varepsilon^2 \left( \frac{k}{m} \right)^k.$$

Combining this with Equations (2) and (3) finishes the proof.

We now prove the outstanding claims, in order of mathematical interest.

Proof of Claim 3.3. We write $\bar{D}_u = 1 + ( \bar{D}_u^{\le d} - 1 ) + \bar{D}_u^{>d}$. Expanding the tensor power,

$$( \bar{D}_u^{\otimes m} )^{\le d, \le k} = \sum_{A \subseteq [m],\, B \subseteq [m] \setminus A} \left( 1^{\otimes A} \otimes ( \bar{D}_u^{\le d} - 1 )^{\otimes B} \otimes ( \bar{D}_u^{>d} )^{\otimes [m] \setminus (A \cup B)} \right)^{\le d, \le k}.$$

Now, $\bar{D}_u^{>d}$ is orthogonal to all functions of degree at most $d$. So the projection

$$\left( 1^{\otimes A} \otimes ( \bar{D}_u^{\le d} - 1 )^{\otimes B} \otimes ( \bar{D}_u^{>d} )^{\otimes [m] \setminus (A \cup B)} \right)^{\le d, \le k} = 0$$

unless $A \cup B = [m]$, and hence

$$( \bar{D}_u^{\otimes m} )^{\le d, \le k} = \sum_{A \subseteq [m]} \left( 1^{\otimes A} \otimes ( \bar{D}_u^{\le d} - 1 )^{\otimes [m] \setminus A} \right)^{\le d, \le k}.$$

Furthermore, if $|[m] \setminus A| > k$, then $1^{\otimes A} \otimes ( \bar{D}_u^{\le d} - 1 )^{\otimes [m] \setminus A}$ is orthogonal to every function depending on at most $k$ samples. So, again applying the projection to samplewise degree $(d, k)$,

$$( \bar{D}_u^{\otimes m} )^{\le d, \le k} = \sum_{B \subseteq [m],\, |B| \le k} 1^{\otimes [m] \setminus B} \otimes ( \bar{D}_u^{\le d} - 1 )^{\otimes B}.$$

Observe also that if $B, B' \subseteq [m]$ and $B \ne B'$, then

$$\left\langle 1^{\otimes [m] \setminus B} \otimes ( \bar{D}_u^{\le d} - 1 )^{\otimes B},\; 1^{\otimes [m] \setminus B'} \otimes ( \bar{D}_v^{\le d} - 1 )^{\otimes B'} \right\rangle = 0,$$

since on any sample in the symmetric difference one factor is $1$ and the other is the mean-zero function $\bar{D}^{\le d} - 1$. So we have

$$\left\langle ( \bar{D}_u^{\otimes m} )^{\le d, \le k}, ( \bar{D}_v^{\otimes m} )^{\le d, \le k} \right\rangle - 1 = \sum_{B \subseteq [m],\, B \ne \emptyset,\, |B| \le k} \left\langle \bar{D}_u^{\le d} - 1, \bar{D}_v^{\le d} - 1 \right\rangle^{|B|},$$

which, by the independence of samples, proves the claim.

Lemma 3.4.
Let $D_\emptyset$ be a null distribution and $\mathcal{S} = \{D_u\}_{u \in S}$ be a set of alternate distributions, with $D_u$'s density relative to $D_\emptyset$ given by $\bar{D}_u$ for each $u \in S$. Let $k, d \ge 1$ be integers with $k$ even. Then the centered $k$-sample likelihood ratio may be bounded in terms of the $k$-sample-homogeneous low-degree part and the $k$-sample-homogeneous high-degree part:

$$\left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u - 1 )^{\otimes k} \right\|^{2/k} \le \left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\le d} - 1 )^{\otimes k} \right\|^{2/k} + \left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{>d} )^{\otimes k} \right\|^{2/k}.$$

Proof. By the triangle inequality, Hölder's inequality, and the fact that $k$ is even, we have that

$$\mathbb{E}_{u,v}\left[ \left( \langle \bar{D}_u, \bar{D}_v \rangle - 1 \right)^k \right] = \mathbb{E}_{u,v}\left[ \left( \langle \bar{D}_u^{\le d}, \bar{D}_v^{\le d} \rangle - 1 + \langle \bar{D}_u^{>d}, \bar{D}_v^{>d} \rangle \right)^k \right] \le \mathbb{E}_{u,v}\left[ \left( \left| \langle \bar{D}_u^{\le d}, \bar{D}_v^{\le d} \rangle - 1 \right| + \left| \langle \bar{D}_u^{>d}, \bar{D}_v^{>d} \rangle \right| \right)^k \right]$$
$$\le \sum_{\ell=0}^{k} \binom{k}{\ell}\, \mathbb{E}_{u,v}\left[ \left( \langle \bar{D}_u^{\le d}, \bar{D}_v^{\le d} \rangle - 1 \right)^k \right]^{\ell/k} \mathbb{E}_{u,v}\left[ \left( \langle \bar{D}_u^{>d}, \bar{D}_v^{>d} \rangle \right)^k \right]^{(k - \ell)/k}$$
$$= \left( \mathbb{E}_{u,v}\left[ \left( \langle \bar{D}_u^{\le d}, \bar{D}_v^{\le d} \rangle - 1 \right)^k \right]^{1/k} + \mathbb{E}_{u,v}\left[ \left( \langle \bar{D}_u^{>d}, \bar{D}_v^{>d} \rangle \right)^k \right]^{1/k} \right)^k,$$

and the conclusion now follows because $\langle \bar{D}_u, 1 \rangle = 1$ for all $u \in S$, which implies $\mathbb{E}_{u,v}( \langle \bar{D}_u, \bar{D}_v \rangle - 1 )^k = \| \mathbb{E}_u ( \bar{D}_u - 1 )^{\otimes k} \|^2$ and $\mathbb{E}_{u,v}( \langle \bar{D}_u^{\le d}, \bar{D}_v^{\le d} \rangle - 1 )^k = \| \mathbb{E}_u ( \bar{D}_u^{\le d} - 1 )^{\otimes k} \|^2$.

Proof of Fact 3.2. Observe that

$$\mathbb{E}[\,|x| \mid A] = \frac{\mathbb{E}[\,|x| \cdot \mathbb{1}[A]]}{\Pr[A]} \le \frac{\mathbb{E}[\,|x|^k]^{1/k}\, \mathbb{E}[\mathbb{1}[A]]^{1 - 1/k}}{\Pr[A]} = \left( \frac{\mathbb{E}[\,|x|^k]}{\Pr[A]} \right)^{1/k},$$

where we have applied Hölder's inequality.

We encapsulate the conclusion of the boosting argument above in the following standalone lemma, which will be useful later:

Lemma 3.5 (Samplewise-LDLR boosting). If the $(d, k)$-LDLR$_m$ for the hypothesis testing problem $D_\emptyset$ vs. $\{D_v\}_{v \in S}$ is bounded, then the moments of the low-degree single-sample LR are also bounded, by

$$\left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\le d} - 1 )^{\otimes k} \right\|^2 = \mathbb{E}_{u,v \sim \mathcal{S}} \left( \langle \bar{D}_u^{\le d}, \bar{D}_v^{\le d} \rangle - 1 \right)^k \le \binom{m}{k}^{-1} \left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\otimes m} )^{\le d, \le k} - 1 \right\|^2.$$

The proof is identical to the end of the proof of Theorem 3.1.
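Both Fact 3.2 and Lemma 3.4 are, at bottom, instances of standard $L^k$-norm inequalities (Hölder and Minkowski respectively), so they can be checked mechanically on empirical data. A minimal sketch (our own illustration, not code from the paper; the sample sizes and the event $A = \{x > 1\}$ are arbitrary choices):

```python
import random

random.seed(0)
k = 4  # an even moment, as in the results above
xs = [random.gauss(0, 1) for _ in range(200_000)]

# Fact 3.2 for the empirical distribution of xs and the event A = {x > 1}:
# E[|x| | A] <= (E[|x|^k] / Pr[A])^(1/k). This is exact Hölder, so it holds
# for any sample, not just in expectation.
inA = [x for x in xs if x > 1.0]
prA = len(inA) / len(xs)
lhs = sum(abs(x) for x in inA) / len(inA)
rhs = (sum(abs(x) ** k for x in xs) / len(xs) / prA) ** (1 / k)
assert lhs <= rhs

# Lemma 3.4 rests on the triangle (Minkowski) inequality for the L^k norm,
# here checked on empirical data: ||a + b||_k <= ||a||_k + ||b||_k.
a = [random.gauss(0, 1) for _ in range(50_000)]
b = [random.gauss(1, 2) for _ in range(50_000)]
norm_k = lambda v: (sum(t ** k for t in v) / len(v)) ** (1 / k)
assert norm_k([ai + bi for ai, bi in zip(a, b)]) <= norm_k(a) + norm_k(b) + 1e-9
print(lhs, rhs)
```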
In this section, we show that lower bounds on the statistical dimension imply that the low-degree likelihood ratio norm is small (hence ruling out good low-degree distinguishers). We will prove the following theorem:
Theorem 4.1.
Let $\mathcal{S}$ be a hypothesis testing problem on $\mathbb{R}^N$ with respect to null hypothesis $D_\emptyset$. Let $m, k \in \mathbb{N}$ with $k$ even. Suppose that for all $m' \le m$, $\mathrm{SDA}(\mathcal{S}, m') \ge 2^k \cdot (m/m')^k$. (In particular, $\mathrm{SDA}(\mathcal{S}, m) \ge 2^k$.) Then for all $d$, $\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\otimes m} )^{\le d, \Omega(k)} - 1 \|^2 \le O(1)$.

The key lemma to prove Theorem 4.1 is the following, which translates the bound $\mathrm{SDA}(\mathcal{S}, m') \ge 2^k \cdot (m/m')^k$ to a bound on the moments of $\langle \bar{D}_u, \bar{D}_v \rangle - 1$.

Lemma 4.2. In the setting of Theorem 4.1, for any $t \le k/2$,

$$\mathbb{E}_{u,v \sim \mathcal{S}} \left| \langle \bar{D}_u, \bar{D}_v \rangle - 1 \right|^t \le 4 \cdot \left( \frac{1}{2m} \right)^t.$$

Now we prove Theorem 4.1.
Proof of Theorem 4.1.
We use Claim 3.3 and Lemma 4.2 to obtain

$$\mathbb{E}_{u,v \sim \mathcal{S}} \left\langle ( \bar{D}_u^{\otimes m} )^{\le \infty, \le k/2}, ( \bar{D}_v^{\otimes m} )^{\le \infty, \le k/2} \right\rangle - 1 \le \sum_{t=1}^{k/2} \binom{m}{t}\, \mathbb{E}_{u,v \sim \mathcal{S}} \left| \langle \bar{D}_u, \bar{D}_v \rangle - 1 \right|^t \le \sum_{t=1}^{k/2} \binom{m}{t} \cdot 4 \cdot \left( \frac{1}{2m} \right)^t.$$

Using $\binom{m}{t} \le (me/t)^t$, we find that this is at most $4 \sum_{t=1}^{k/2} \left( \frac{e}{2t} \right)^t \le 4 ( e^{e/2} - 1 )$. But for all $d \in \mathbb{N}$ we have

$$\left\| \mathbb{E}_{u \sim \mathcal{S}} ( \bar{D}_u^{\otimes m} )^{\le d, \le k/2} - 1 \right\|^2 \le \mathbb{E}_{u,v \sim \mathcal{S}} \left\langle ( \bar{D}_u^{\otimes m} )^{\le \infty, \le k/2}, ( \bar{D}_v^{\otimes m} )^{\le \infty, \le k/2} \right\rangle - 1,$$

which completes the proof.

We turn to the proof of Lemma 4.2. We need the following basic fact to relate the moments and tails of $\langle \bar{D}_u, \bar{D}_v \rangle - 1$. (The proof is straightforward calculus; see e.g. Appendix A.2 of [HL19].)

Fact 4.3.
Let $X$ be an $\mathbb{R}$-valued random variable. For every $p > q > 0$,

$$\mathbb{E}|X|^q \le \left( 2 \sup_A \Pr[A] \cdot \left( \mathbb{E}[\,|X| \mid A] \right)^p \right)^{q/p} \cdot \frac{p}{p - q}.$$

(The supremum is taken over all events $A$.)

Proof of Lemma 4.2. Let $X = | \langle \bar{D}_u, \bar{D}_v \rangle - 1 |$ be the $\mathbb{R}$-valued random variable given by two random draws $u, v \sim \mathcal{S}$. Our assumption $\mathrm{SDA}(\mathcal{S}, m') \ge 2^k \cdot (m/m')^k$ for all $m' \le m$ implies that for every event $A$ of probability $\alpha \ge 2^{-k} \cdot (m'/m)^k$, we have $\mathbb{E}[X \mid A] \le 1/m'$. Rearranging, for all events $A$ of probability $\alpha$, we have $\mathbb{E}[X \mid A] \le \frac{1}{2m} \alpha^{-1/k}$. So for any $t \le k/2$,

$$\sup_A \Pr[A] \cdot \left( \mathbb{E}[X \mid A] \right)^{2t} \le \sup_{0 < \alpha \le 1} \alpha^{1 - 2t/k} \cdot \left( \frac{1}{2m} \right)^{2t} \le \left( \frac{1}{2m} \right)^{2t}.$$

So applying Fact 4.3 (with exponents $2t$ and $t$), for any $t \le k/2$, $\mathbb{E} X^t \le 4 \cdot (1/2m)^t$.

In this section, we observe that Theorem 3.1 immediately applies to noise-robust problems, as noise-robustness implies a bound on the high-degree part of the LR.
We define a class of Markov operators which generalize the Gaussian and discrete noise operators. Recall that a Markov operator $T$ is a linear operator such that if $f$ is a probability density, then so is $Tf$.

Definition 5.1 ($(d, \epsilon)$-Markov operator). Let $D_\emptyset$ be a probability measure on $\mathbb{R}^N$ (or a discrete distribution on $\Omega^N$ for some finite set $\Omega$), inducing an inner product on functions $f, g : \mathbb{R}^N \to \mathbb{R}$ (or $f, g : \Omega^N \to \mathbb{R}$) by $\langle f, g \rangle = \mathbb{E}_{x \sim D_\emptyset} f(x) g(x)$. Let $\ell_2 = \{ f : \mathbb{R}^N \to \mathbb{R} \text{ s.t. } \mathbb{E}_{x \sim D_\emptyset} f(x)^2 < \infty \}$. Let $d \in \mathbb{N}$, and let $\ell_2^{>d}$ be the orthogonal complement of $\mathrm{span}\{ f \in \ell_2 : f \text{ has degree at most } d - 1 \}$ with respect to $\langle \cdot, \cdot \rangle$.

Any hypothesis testing problem $(D_\emptyset, \mathcal{S})$ and Markov operator $T : \ell_2 \to \ell_2$ induce another hypothesis testing problem $(D_\emptyset, T\mathcal{S})$ by applying $T$ to each of the distributions $D_u \in \mathcal{S}$. We call a Markov operator $T$ a $(d, \epsilon)$-operator if

$$\ell_2^{>d} \subseteq \mathrm{span}\{ f \in \ell_2 : f \text{ is an eigenfunction of } T \text{ with eigenvalue } \lambda \text{ such that } |\lambda| \le \epsilon \}.$$

Our main examples are the Ornstein-Uhlenbeck operator $U_\rho$ (a.k.a. the Gaussian noise operator) and the discrete noise operator $T_\rho$, both of which are $(d, \rho^d)$-operators. In both cases, the testing problems $(D_\emptyset, T\mathcal{S})$ will be noisy versions of the original problems $(D_\emptyset, \mathcal{S})$. However, we will use a different family of noise operators to treat certain statistical problems where there is planted structure which is not robust to independent entrywise noise, such as planted clique.

Theorem 5.2.
Let $d, k \in \mathbb{N}$ with $k$ even, let $\mathcal{S} = \{D_v\}_{v \in S}$ be a collection of probability distributions, and let $\bar{D}_u$ be the relative density of $D_u$ with respect to $D_\emptyset$. Let $T$ be a $(d+1, \rho^{d+1})$ Markov operator. Suppose that the $k$-sample likelihood ratio is bounded by $\| \mathbb{E}_u \bar{D}_u^{\otimes k} \|^2 \le C^k$, and the noised $(d, k)$-LDLR$_m$ is bounded by $\| \mathbb{E}_u ( T\bar{D}_u^{\otimes m} )^{\le d, \le k} - 1 \| \le \varepsilon$. Then it follows that for any $q > 0$,

$$\mathrm{SDA}\left( T\mathcal{S}, \frac{m}{q^{1/k}\left( k \varepsilon^{2/k} + \rho^{2(d+1)} C m \right)} \right) > q.$$

Proof. Since $T$ is a $(d+1, \rho^{d+1})$ Markov operator by assumption, the $k$-sample high-degree part of the LR is bounded by

$$\left\| \mathbb{E}_u ( T \bar{D}_u^{>d} )^{\otimes k} \right\|^2 \le \rho^{2(d+1)k} \cdot \left\| \mathbb{E}_u ( \bar{D}_u^{>d} )^{\otimes k} \right\|^2 \le \rho^{2(d+1)k} \cdot \left\| \mathbb{E}_u \bar{D}_u^{\otimes k} \right\|^2 \le \rho^{2(d+1)k} \cdot C^k.$$

Applying Theorem 3.1 now completes the proof of this theorem.
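For intuition, the discrete noise operator $T_\rho$ from Definition 5.1 can be checked by brute force on a small hypercube: every Fourier character $\chi_S$ is an eigenfunction with eigenvalue $\rho^{|S|}$, which is exactly what makes $T_\rho$ a $(d, \rho^d)$-operator. A small exact verification of our own (the values of $n$ and $\rho$ are arbitrary):

```python
import itertools, math

n, rho = 4, 0.6
cube = list(itertools.product([-1, 1], repeat=n))

def T(f):
    # Discrete noise operator T_rho on the uniform measure over {±1}^n:
    # independently keep each coordinate with probability (1+rho)/2 and
    # flip it otherwise, then average f over the result.
    def Tf(x):
        total = 0.0
        for flips in itertools.product([0, 1], repeat=n):
            p = math.prod((1 - rho) / 2 if fl else (1 + rho) / 2 for fl in flips)
            y = tuple(-xi if fl else xi for xi, fl in zip(x, flips))
            total += p * f(y)
        return total
    return Tf

# Each character chi_S(x) = prod_{i in S} x_i satisfies T chi_S = rho^{|S|} chi_S,
# so functions of degree >= d live in eigenspaces with |eigenvalue| <= rho^d.
for r in range(n + 1):
    S = tuple(range(r))
    chi = lambda x, S=S: math.prod(x[i] for i in S)
    Tchi = T(chi)
    for x in cube:
        assert abs(Tchi(x) - rho ** r * chi(x)) < 1e-9
print("chi_S is an eigenfunction of T_rho with eigenvalue rho^|S|, for |S| = 0..", n)
```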
Some problems of interest are not noise-robust under nontrivial $(d, \epsilon)$-operators. For example, consider the (bipartite) planted clique problem: the clique structure is not preserved if the coordinates are resampled independently. To accommodate such problems, we generalize Theorem 5.2 to a different class of noise operators: random restrictions. A random restriction fixes a random subset of coordinates, then applies noise to the remaining coordinates across all of the samples.
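The operation just described can be sketched in code. The helper below is our own hypothetical illustration (not code or notation from the paper), specialized to flip noise on $\{\pm 1\}^N$: a single random subset $R$ of coordinates is fixed, and noise is applied off $R$ identically in every sample.

```python
import random

def random_restriction(samples, s, rho, rng):
    # Hypothetical helper: the (T, s)-random restriction for flip noise
    # T_rho on {±1}^N. Fix one random subset R (each coordinate kept with
    # probability s/N), then apply the noise to every coordinate outside R,
    # in every sample.
    N = len(samples[0])
    R = {j for j in range(N) if rng.random() < s / N}
    noised = []
    for x in samples:
        y = list(x)
        for j in range(N):
            if j not in R and rng.random() < (1 - rho) / 2:
                y[j] = -y[j]  # T_rho flips a coordinate with probability (1-rho)/2
        noised.append(tuple(y))
    return R, noised

rng = random.Random(1)
samples = [tuple(rng.choice([-1, 1]) for _ in range(10)) for _ in range(3)]
R, noised = random_restriction(samples, s=4, rho=0.9, rng=rng)
# Coordinates inside R pass through unchanged in every sample; crucially, the
# same R is shared across samples, which is what can preserve planted structure.
assert all(y[j] == x[j] for x, y in zip(samples, noised) for j in R)
```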
Definition 5.3 (Random Restriction). Let $T$ be a Markov operator on $\mathbb{R}^N$. Given a subset $R \subseteq [N]$, let $T_R$ be the Markov operator on $\mathbb{R}^N$ that applies $T$ to all entries except those in $R$. Given a set of probability distributions $\mathcal{S}$ and a prior $\mu$ over $\mathcal{S}$, the $(T, s)$-random restriction of $\mathcal{S}$ is the set of distributions

$$\mathcal{S}' = \left\{ T_R D \;\middle|\; D \in \mathcal{S},\; R \subseteq [N] \right\}$$

equipped with the prior $\mu'$, where a sample $T_R D \sim \mu'$ is generated by sampling $D \sim \mu$ and sampling $R$ by including every coordinate in $[N]$ independently with probability $\frac{s}{N}$. Denote the distribution on subsets as $\mathcal{R}_N(s)$.

We will often abuse notation and let $T_R$ stand in for $(T^{\otimes n})_R$ when $T$ is a noise operator on $\mathbb{R}$. For simplicity we restrict our attention to distributions $D_v$ over the boolean hypercube $\{\pm 1\}^n$, and to null distributions $D_\emptyset$ which are product measures for which all biases are the same, $D_\emptyset = D^{\otimes N}$. (In the bipartite version, we further require that the resampling procedure be dependent across samples. We expect that a near-identical proof will extend to the case when $D_\emptyset$ is a product measure with arbitrary coordinate biases.) We now have the following lemma:
Lemma 5.4.
Let $D_\emptyset$ be a product measure over $\{\pm 1\}^N$. Let $d, k \in \mathbb{N}$, and let $T$ be a $(1, \rho)$-operator over $\{\pm 1\}$ (with respect to the measure induced by $D_\emptyset$ on a single coordinate). Then for $\mathcal{S} = \{D_v\}_{v \in S}$ a family of distributions over $\{\pm 1\}^N$ with prior $\mu$, we have that the $(T, s)$-random restriction $\mathcal{S}', \mu'$ of $\mathcal{S}$ has degree-$(>d, =k)$ part bounded by

$$\left\| \mathbb{E}_{R \sim \mathcal{R}_N(s)} \mathbb{E}_{u \sim \mu} \left( T_R \bar{D}_u^{>d} \right)^{\otimes k} \right\|^2 \le \max\left\{ 4^{d+1} \rho^{2(d+1)k}, \left( \frac{2s}{N} \right)^{2(d+1)} \right\} \cdot \left\| \mathbb{E}_{u \sim \mu} \bar{D}_u^{\otimes k} \right\|^2.$$

Proof.
We will abuse notation and let $T_R$ simultaneously denote the noise operator on $(\mathbb{R}^N)^{\otimes k}$ that applies $T_R$ independently to each copy of $\mathbb{R}^N$. Let $D = \mathbb{E}_{u \sim \mu} \bar{D}_u^{\otimes k}$, and let $\hat{D}(\alpha_1, \ldots, \alpha_k)$ denote the Fourier coefficient of $D$ at the subsets $\alpha_1, \ldots, \alpha_k \subseteq [N]$. By the definition of $T_R$, we have that

$$\widehat{T_R D}(\alpha_1, \ldots, \alpha_k) = \rho^{\sum_{i=1}^k |\alpha_i \cap R^c|} \cdot \hat{D}(\alpha_1, \ldots, \alpha_k)$$

for any $\alpha_1, \ldots, \alpha_k \subseteq [N]$. Let $T'$ denote the operator $\mathbb{E}_{R \sim \mathcal{R}_N(s)} T_R$ and observe that

$$\widehat{T' D}(\alpha_1, \ldots, \alpha_k) = \mathbb{E}_{R \sim \mathcal{R}_N(s)}\left[ \rho^{\sum_{i=1}^k |\alpha_i \cap R^c|} \right] \cdot \hat{D}(\alpha_1, \ldots, \alpha_k).$$

By Hölder's inequality,

$$\mathbb{E}_{R \sim \mathcal{R}_N(s)}\left[ \rho^{\sum_{i=1}^k |\alpha_i \cap R^c|} \right] \le \prod_{i=1}^k \mathbb{E}_{R \sim \mathcal{R}_N(s)}\left[ \rho^{k |\alpha_i \cap R^c|} \right]^{1/k} = \prod_{i=1}^k \mathbb{E}_{R \sim \mathcal{R}_N(s)}\left[ \prod_{j \in \alpha_i} \rho^{k \cdot \mathbb{1}(j \notin R)} \right]^{1/k} = \left( \frac{s}{N} + \left( 1 - \frac{s}{N} \right) \rho^k \right)^{\sum_{i=1}^k |\alpha_i| / k},$$

where the final equality follows from the fact that the events $(j \notin R)$ are independent and occur with probability $1 - \frac{s}{N}$ under $R \sim \mathcal{R}_N(s)$. Now by Parseval's inequality, we have that

$$\left\| \mathbb{E}_{R \sim \mathcal{R}_N(s)} \mathbb{E}_{u \sim \mu} \left( T_R \bar{D}_u^{>d} \right)^{\otimes k} \right\|^2 = \sum_{|\alpha_1|, \ldots, |\alpha_k| > d} \widehat{T' D}(\alpha_1, \ldots, \alpha_k)^2 \le \sum_{|\alpha_1|, \ldots, |\alpha_k| > d} \left( \frac{s}{N} + \left( 1 - \frac{s}{N} \right) \rho^k \right)^{2 \sum_{i=1}^k |\alpha_i| / k} \cdot \hat{D}(\alpha_1, \ldots, \alpha_k)^2$$
$$\le \left( \frac{s}{N} + \rho^k \right)^{2(d+1)} \sum_{|\alpha_1|, \ldots, |\alpha_k| > d} \hat{D}(\alpha_1, \ldots, \alpha_k)^2 \le \left( \frac{s}{N} + \rho^k \right)^{2(d+1)} \cdot \left\| \mathbb{E}_{u \sim \mu} \bar{D}_u^{\otimes k} \right\|^2. \quad (4)$$

The lemma then follows from the fact that $\frac{s}{N} + \rho^k \le 2 \max\{ \rho^k, \frac{s}{N} \}$.

Applying Theorem 3.1 yields the following corollary:

Corollary 5.5.
Let $D_\emptyset$ be a product measure over $\{\pm 1\}^N$. Let $d, k \in \mathbb{N}$ with $k$ even, and let $T$ be a $(1, \rho)$-operator over $\{\pm 1\}$ (with respect to the measure induced by $D_\emptyset$ on a single coordinate). Let $\mathcal{S} = \{D_v\}_{v \in S}$ be a family of distributions over $\{\pm 1\}^N$ with prior $\mu$ over $S$, and let $\bar{D}_u$ be the relative density of $D_u$ with respect to $D_\emptyset$. Suppose that the $k$-sample likelihood ratio is bounded by $\| \mathbb{E}_u \bar{D}_u^{\otimes k} \|^2 \le C^k$, and suppose that the $(T, s)$-randomly restricted alternate hypothesis class $\mathcal{S}', \mu'$ has $(d, k)$-LDLR$_m$ bounded,

$$\left\| \mathbb{E}_{R \sim \mathcal{R}_N(s)} \mathbb{E}_{u \sim \mu} \left( T_R \bar{D}_u^{\otimes m} \right)^{\le d, \le k} - 1 \right\| \le \varepsilon.$$

Then it follows that for any $q > 0$,

$$\mathrm{SDA}\left( \mathcal{S}', \mu', \; \frac{m}{q^{1/k}\left( k \varepsilon^{2/k} + \max\left\{ 4^{(d+1)/k} \rho^{2(d+1)}, \left( \frac{2s}{N} \right)^{2(d+1)/k} \right\} C m \right)} \right) > q.$$

Remark 5.6 (Comparison to Theorem 5.2). As long as $k = \Omega(d)$, $4^{(d+1)/k} = O(1)$, and thus this corollary can be viewed as a natural extension of Theorem 5.2, recovering (essentially) the same result when $s = 0$.

In Section 8.2, we show that Corollary 5.5 implies an equivalence between distinguishers and statistical queries for a number of models such as planted clique, in which the planted structure is not robust to independent noise. We also remark that the $(2s/N)^{2(d+1)}$ factor in Lemma 5.4 cannot in general be improved: when $\rho = 0$, the diagonal Fourier coefficients of the form $\widehat{T'D}(\alpha, \alpha, \ldots, \alpha)$ are exactly equal to $(s/N)^{|\alpha|} \cdot \hat{D}(\alpha, \alpha, \ldots, \alpha)$. However, other Fourier coefficients are scaled down more heavily under $T'$, and it is possible to improve the bound in Lemma 5.4 under further assumptions about the Fourier coefficients of $D$.

5.3.1 Random Subtensor Restrictions

In the above, we treated random restrictions in which coordinates in $[N]$ are fixed independently. In tensor and matrix problems, where $\{\pm 1\}^N$ is identified with $(\{\pm 1\}^n)^{\otimes p}$ for an integer $p$, the natural notion of random restriction restricts to a random principal minor $(\{\pm 1\}^R)^{\otimes p}$.
Below, we will generalize Corollary 5.5 to this type of random restriction. Let $\mathcal{R}_n(s)$ be as in the section above, and for $R \sim \mathcal{R}_n(s)$ let $R^{\otimes p}$ denote the set of all coordinates in $(\{\pm 1\}^n)^{\otimes p}$ where all $p$ modes lie in $R$.

Lemma 5.7.
Let $p, s, n, k, d \in \mathbb{N}$ and $\rho \in (0, 1)$ with $2s \le n$ and $2^{p/k} \rho \le 1$. Let $D_\emptyset$ be a product measure over $\{\pm 1\}^N$ where $N = n^p$, and let $T$ be a $(1, \rho)$-operator over $\{\pm 1\}$ (with respect to the measure induced by $D_\emptyset$ on a single coordinate). Then for $\mathcal{S} = \{D_v\}_{v \in S}$ a family of distributions over $(\{\pm 1\}^n)^{\otimes p}$ with prior $\mu$, we have that the $(T, s)$-random restriction $\mathcal{S}', \mu'$ of $\mathcal{S}$ has degree-$(>d, =k)$ part bounded by

$$\left\| \mathbb{E}_{R \sim \mathcal{R}_n(s)} \mathbb{E}_{u \sim \mu} \left( T_{R^{\otimes p}} \bar{D}_u^{>d} \right)^{\otimes k} \right\|^2 \le \max\left\{ 2^{d+1} \rho^{(d+1)k/p}, \left( \frac{2s}{n} \right)^{2((d+1)/2)^{1/p}} \right\} \cdot \left\| \mathbb{E}_{u \sim \mu} \bar{D}_u^{\otimes k} \right\|^2.$$

Proof.
As in Lemma 5.4, let $D = \mathbb{E}_{u \sim \mu} \bar{D}_u^{\otimes k}$, with Fourier coefficients $\hat{D}(\alpha_1, \ldots, \alpha_k)$ for any sequence of subsets $\alpha_1, \ldots, \alpha_k \subseteq [n]^p$. Similarly, let $T' = \mathbb{E}_{R \sim \mathcal{R}_n(s)} T_{R^{\otimes p}}$. Applying Hölder's inequality just as in the proof of Lemma 5.4, we have that

$$\widehat{T' D}(\alpha_1, \ldots, \alpha_k) = \mathbb{E}_{R \sim \mathcal{R}_n(s)}\left[ \rho^{\sum_{\ell=1}^k |\alpha_\ell \cap (R^{\otimes p})^c|} \right] \cdot \hat{D}(\alpha_1, \ldots, \alpha_k) \le \prod_{\ell=1}^k \mathbb{E}_{R \sim \mathcal{R}_n(s)}\left[ \prod_{(i_1, \ldots, i_p) \in \alpha_\ell} \rho^{k \cdot \mathbb{1}(\exists a \in [p],\, i_a \notin R)} \right]^{1/k} \cdot \hat{D}(\alpha_1, \ldots, \alpha_k). \quad (5)$$

We now will prove the following claim, which will complete the proof of the lemma.

Claim 5.8. For any $\alpha \subseteq [n]^p$, so long as $2^{p/k} \rho \le 1$ and $2s \le n$,

$$\mathbb{E}_{R \sim \mathcal{R}_n(s)} \prod_{(i_1, \ldots, i_p) \in \alpha} \rho^{k \cdot \mathbb{1}(\exists a \in [p],\, i_a \notin R)} \le \max\left\{ 2^{|\alpha|/2} \rho^{\frac{k|\alpha|}{2p}}, \left( \frac{2s}{n} \right)^{(|\alpha|/2)^{1/p}} \right\}. \quad (6)$$

Proof.
Let $V(\alpha) = \{ i \in [n] \mid \exists (i_1, \ldots, i_p) \in \alpha,\, a \in [p] \text{ s.t. } i = i_a \}$ be the set of indices of $[n]$ that appear in $\alpha$. For each $i \in V(\alpha)$, let $d_i \ge 1$ be the number of tuples of $\alpha$ in which $i$ appears as an index. Since $\mathbb{1}(\exists a \in [p],\, i_a \notin R) \ge \frac{1}{p} \sum_{a \in [p]} \mathbb{1}(i_a \notin R)$ and $\rho \le 1$, we have that

$$\mathbb{E} \prod_{(i_1, \ldots, i_p) \in \alpha} \rho^{k \mathbb{1}(\exists a,\, i_a \notin R)} \le \mathbb{E} \prod_{(i_1, \ldots, i_p) \in \alpha} \rho^{\frac{k}{p} \sum_{a \in [p]} \mathbb{1}(i_a \notin R)} = \mathbb{E} \prod_{i \in V(\alpha)} \rho^{\frac{k}{p} d_i \mathbb{1}(i \notin R)} = \prod_{i \in V(\alpha)} \mathbb{E}\left[ \rho^{\frac{k}{p} d_i \mathbb{1}(i \notin R)} \right]$$
$$= \prod_{i \in V(\alpha)} \left( \frac{s}{n} + \left( 1 - \frac{s}{n} \right) \rho^{\frac{k}{p} d_i} \right) \le \prod_{i \in V(\alpha)} 2 \max\left\{ \frac{s}{n}, \rho^{\frac{k}{p} d_i} \right\} \le \max_{U \subseteq V(\alpha)} \left( \frac{2s}{n} \right)^{|V(\alpha) \setminus U|} \cdot \left( 2^{p/k} \rho \right)^{\frac{k}{p} \sum_{i \in U} d_i},$$

where to obtain the third expression we have used the independence of the events $(i \notin R)$; in the penultimate step we have bounded each factor by twice its larger term; and in the final step we have collected into $U$ the indices where the $\rho$ term dominates, using that $d_i \ge 1$ for $i \in U$ so that $2 \le (2^{p/k})^{\frac{k}{p} d_i}$. If $\sum_{i \in U} d_i \ge |\alpha|/2$, then since $2s \le n$ and $2^{p/k} \rho \le 1$,

$$\left( \frac{2s}{n} \right)^{|V(\alpha) \setminus U|} \left( 2^{p/k} \rho \right)^{\frac{k}{p} \sum_{i \in U} d_i} \le \left( 2^{p/k} \rho \right)^{\frac{k |\alpha|}{2p}} = 2^{|\alpha|/2} \rho^{\frac{k|\alpha|}{2p}},$$

and we have our conclusion. Otherwise, suppose $\sum_{i \in U} d_i < |\alpha|/2$ and consider the set $\alpha'$ of tuples which do not contain elements from $U$. We have that $|\alpha'| \ge |\alpha|/2$, because the elements of $U$ participate in at most $\sum_{i \in U} d_i$ tuples. Further, $|\alpha'| \le (|V(\alpha) \setminus U|)^p$, since this is the number of distinct tuples of at most $p$ elements that can be formed from the elements of $V(\alpha) \setminus U$. Thus $|V(\alpha) \setminus U| \ge (|\alpha|/2)^{1/p}$, and the bound now follows because $\left( \frac{2s}{n} \right)^{|V(\alpha) \setminus U|} \left( 2^{p/k} \rho \right)^{\frac{k}{p} \sum_{i \in U} d_i} \le \left( \frac{2s}{n} \right)^{(|\alpha|/2)^{1/p}}$.

Combining Equations (5) and (6) with a similar application of Parseval's inequality as in Equation (4) from Lemma 5.4 now completes the proof of the lemma.

Combining this lemma with Theorem 3.1 now yields that LDLR bounds for problems that can be realized as random submatrix or subtensor restrictions imply SQ lower bounds, as in Corollary 5.5 in the previous section.
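The independence computation at the heart of the proofs of Lemmas 5.4 and 5.7, namely that $\mathbb{E}_R[\rho^{|\alpha \cap R^c|}]$ factorizes into a per-coordinate product, can be verified exactly on a small example (our own check; the values of $N$, $s$, $\rho$, and the index set are arbitrary):

```python
import itertools

N, s, rho = 8, 3.0, 0.7
alpha = (0, 2, 5)   # an arbitrary set of coordinates
p = s / N           # each coordinate enters R independently with prob s/N

# E_R[rho^{|alpha ∩ R^c|}] by exact enumeration over all 2^N subsets R.
total = 0.0
for bits in itertools.product([0, 1], repeat=N):  # bits[j] = 1 iff j in R
    prob = 1.0
    for b in bits:
        prob *= p if b else (1 - p)
    outside = sum(1 for j in alpha if not bits[j])
    total += prob * rho ** outside

# Product form used in the proofs: by independence of the events {j in R},
# the expectation factorizes as (s/N + (1 - s/N) * rho)^{|alpha|}.
closed = (p + (1 - p) * rho) ** len(alpha)
print(total, closed)
assert abs(total - closed) < 1e-12
```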
We remark that the bounds in Lemma 5.7 are nearly tight. Remark 5.9.
A final setting of interest (e.g. for multi-sample planted clique) is when $N = \binom{n}{p}$ and the indices of samples are identified with subsets in $\binom{[n]}{p}$. The natural notion of a random restriction is then to subsets of the form $\binom{R}{p} \subseteq \binom{[n]}{p}$ where $R \sim \mathcal{R}_n(s)$. Lemma 5.7 can be seen to handle this case as well: repeating the argument identically, but considering only tuples $(i_1, \ldots, i_p)$ with $i_1 < \cdots < i_p$, yields the following theorem.

Theorem 5.10.
Let $p, s, n, k, d \in \mathbb{N}$ and $\rho \in (0, 1)$ with $2s \le n$ and $2^{p/k} \rho \le 1$. Let $D_\emptyset$ be a product measure over $\{\pm 1\}^N$ where $N = \binom{n}{p}$, and let $T$ be a $(1, \rho)$-operator over $\{\pm 1\}$ (with respect to the measure induced by $D_\emptyset$ on a single coordinate). Then for $\mathcal{S} = \{D_v\}_{v \in S}$ a family of distributions over $\{\pm 1\}^{\binom{[n]}{p}}$ with prior $\mu$, we have that the $(T, s)$-random restriction $\mathcal{S}', \mu'$ of $\mathcal{S}$ has degree-$(>d, =k)$ part bounded by

$$\left\| \mathbb{E}_{R \sim \mathcal{R}_n(s)} \mathbb{E}_{u \sim \mu} \left( T_{\binom{R}{p}} \bar{D}_u^{>d} \right)^{\otimes k} \right\|^2 \le \max\left\{ 2^{d+1} \rho^{(d+1)k/p}, \left( \frac{2s}{n} \right)^{2((d+1)/2)^{1/p}} \right\} \cdot \left\| \mathbb{E}_{u \sim \mu} \bar{D}_u^{\otimes k} \right\|^2.$$

In this section, we prove Theorems 6.1 and 6.3. In each case, we bound the high-degree part of the LR in terms of the LDLR and then apply Theorem 3.1 to deduce the result. (When $\rho = 0$, the diagonal Fourier coefficients corresponding to submatrices are given by $\widehat{T'D}(R^{\otimes p}, \ldots, R^{\otimes p}) = (s/n)^{|R|} \cdot \hat{D}(R^{\otimes p}, \ldots, R^{\otimes p})$. This implies that the $((d+1)/2)^{1/p}$ factor in the exponent of $(2s/n)^{2((d+1)/2)^{1/p}}$ in Lemma 5.7 is necessary.)

6.1 Identity-Covariance Gaussians

Theorem 6.1.
Let $k$ be an even integer. For the null distribution $D_\emptyset = \mathcal{N}(0, I_n)$ and alternate distributions $\mathcal{S} = \{D_v\}_{v \in S}$ with $D_v = \mathcal{N}(v, I_n)$, let $\bar{D}_u$ be the relative density of $D_u$ with respect to $D_\emptyset$. Suppose that the $k$-sample likelihood ratio is bounded by $\| \mathbb{E}_u \bar{D}_u^{\otimes k} \|^2 \le C^k$, and the $(1, 4k)$-LDLR$_m$ is bounded by $\| \mathbb{E}_u ( \bar{D}_u^{\otimes m} )^{\le 1, \le 4k} - 1 \| \le \varepsilon$. Then for any $q > 0$,

$$\mathrm{SDA}\left( \mathcal{S}, \frac{m}{q^{1/k} \varepsilon^{1/k} \left( k \varepsilon^{1/k} + \frac{(2ek)^2 (1 + C)}{m} \right)} \right) > q.$$

We first will prove a lemma bounding the high-degree part of the LR in terms of its low-degree part.
Lemma 6.2.
Let $\mathcal{S} = \{D_u\}_{u \in S}$ be a set of identity-covariance Gaussian distributions, where $D_u = \mathcal{N}(u, I_n)$ and $D_\emptyset = \mathcal{N}(0, I_n)$. For each $u \in S$, let $\bar{D}_u$ be the relative density of $D_u$ with respect to $D_\emptyset$. For any integers $d, k \ge 1$ with $k$ even,

$$\left\| \mathbb{E}_u ( \bar{D}_u^{>d} )^{\otimes k} \right\|^{2/k} \le \frac{2}{(d+1)!}\, \mathbb{E}_{u,v}\left[ \left( \langle \bar{D}_u^{\le 1}, \bar{D}_v^{\le 1} \rangle - 1 \right)^{2k(d+1)} \right]^{1/2k} \left( 1 + \left\| \mathbb{E}_u \bar{D}_u^{\otimes 2k} \right\|^2 \right)^{1/2k}.$$

Proof.
We will exploit some properties of identity-covariance Gaussians. Let $\exp_{>d}(x) = \sum_{t=d+1}^{\infty} \frac{x^t}{t!}$ be the truncation error of the degree-$d$ Taylor approximation of $\exp(x)$ about $0$. In this setting, for each $u, v \in S$, it is shown in [KWB19] (Theorem 2.6) that

$$\langle \bar{D}_u^{>d}, \bar{D}_v^{>d} \rangle_{D_\emptyset} = \exp_{>d}(\langle u, v \rangle). \quad (7)$$

By Taylor's theorem, we have that $\exp_{>d}(x)$ is bounded by

$$\left| \exp_{>d}(x) \right| \le \left| \frac{x^{d+1}}{(d+1)!} \cdot \exp(\xi(x)) \right|,$$

for some function $\xi(x)$ with $\mathrm{sign}(\xi(x)) = \mathrm{sign}(x)$ and $|\xi(x)| \le |x|$. Thus, using that $k$ is even,

$$\left\| \mathbb{E}_u ( \bar{D}_u^{>d} )^{\otimes k} \right\|^2 = \mathbb{E}_{u,v}\left[ \left( \langle \bar{D}_u^{>d}, \bar{D}_v^{>d} \rangle \right)^k \right] = \mathbb{E}_{u,v}\left[ \left| \exp_{>d}(\langle u, v \rangle) \right|^k \right] \le \mathbb{E}_{u,v}\left[ \left| \frac{\langle u, v \rangle^{d+1}}{(d+1)!} \exp(\xi(\langle u, v \rangle)) \right|^k \right]$$
$$\le \left( \frac{1}{(d+1)!} \right)^k \sqrt{ \mathbb{E}_{u,v}\left[ \langle u, v \rangle^{2k(d+1)} \right]\, \mathbb{E}_{u,v}\left[ \exp(\xi(\langle u, v \rangle))^{2k} \right] } \le \left( \frac{1}{(d+1)!} \right)^k \sqrt{ \mathbb{E}_{u,v}\left[ \langle u, v \rangle^{2k(d+1)} \right]\, \mathbb{E}_{u,v}\left[ 1 + \exp(\langle u, v \rangle)^{2k} \right] }$$
$$= \left( \frac{1}{(d+1)!} \right)^k \sqrt{ \mathbb{E}_{u,v}\left[ \left( \langle \bar{D}_u^{\le 1}, \bar{D}_v^{\le 1} \rangle - 1 \right)^{2k(d+1)} \right] \left( 1 + \mathbb{E}_{u,v}\left[ \langle \bar{D}_u, \bar{D}_v \rangle^{2k} \right] \right) }.$$

The second inequality follows from Cauchy-Schwarz, and the third uses that $\mathrm{sign}(\xi(x)) = \mathrm{sign}(x)$ and therefore $1 + \exp(x)^{2k} \ge \max(1, \exp(x))^{2k} \ge \exp(\xi(x))^{2k}$. The final line then follows from (7): taking $d = 0$ there gives $\langle \bar{D}_u, \bar{D}_v \rangle = \exp(\langle u, v \rangle)$, and for identity-covariance Gaussians $\langle \bar{D}_u^{\le 1}, \bar{D}_v^{\le 1} \rangle - 1 = \langle u, v \rangle$. Substituting this back in for the above, we have our desired conclusion.

Proof of Theorem 6.1. We will show that a more general result holds given $\| \mathbb{E}_u ( \bar{D}_u^{\otimes m} )^{\le d, \le 2k(d+1)} - 1 \| \le \varepsilon$, and then set $d = 1$. By Lemma 3.5, we have that

$$\left\| \mathbb{E}_u ( \bar{D}_u^{\le 1} - 1 )^{\otimes 2k(d+1)} \right\|^2 \le \left\| \mathbb{E}_u ( \bar{D}_u^{\le d} - 1 )^{\otimes 2k(d+1)} \right\|^2 \le \frac{\varepsilon^2}{\binom{m}{2k(d+1)}}.$$
Therefore Lemma 6.2 implies that
\begin{align*}
\Big\| \mathbb{E}_u \big(\bar{D}_u^{>d}\big)^{\otimes k} \Big\|^{2/k} &\leq \frac{1}{(d+1)!} \cdot \frac{\varepsilon^{1/k}}{\binom{m}{2k(d+1)}^{1/2k}} \cdot \big(1 + C^{4k}\big)^{1/2k} \\
&\leq \frac{1 + C^2}{(d+1)!} \cdot \varepsilon^{1/k} \cdot \frac{(2k(d+1))^{d+1}}{m^{d+1}} \leq (1 + C^2) \cdot \varepsilon^{1/k} \cdot \frac{(2ke)^{d+1}}{m^{d+1}},
\end{align*}
using Stirling's approximation to the factorials and the fact that $\binom{a}{b} \geq (a/b)^b$. Since the $(d,k)$-LDLR$_m$ is bounded by the $(d,\, 2k(d+1))$-LDLR$_m$, we also have that $\|(\mathbb{E}_u D_u^{\otimes m})^{\leq d,k} - 1\| \leq \varepsilon$. Now applying Theorem 3.1 to the $(d,k)$-LDLR$_m$ and then setting $d = 1$ completes the proof of the theorem.

Theorem 6.3.
Let $k$ be an even integer. Let $\mathcal{S} = \{D_u\}_{u \in S}$ be a set of product distributions over the $n$-dimensional hypercube. Let $D_\emptyset$ be any product measure over $\{\pm 1\}^n$ with no fixed coordinates, and let $\bar{D}_u$ be the relative density of $D_u$. Suppose that the $k$-sample likelihood ratio is bounded by $\|\mathbb{E}_u \bar{D}_u^{\otimes k}\| \leq C^k$, and the $(1,k)$-LDLR$_m$ is bounded by $\|(\mathbb{E}_u D_u^{\otimes m})^{\leq 1,k} - 1\| \leq \varepsilon$. Then for any $q > 0$,
\[
\mathrm{SDA}\left(\mathcal{S},\ \frac{m}{q^{2/k}\,\varepsilon^{2/k}\left(k^2\,\varepsilon^{2/k} + k\,C^2/m\right)}\right) \geq q.
\]
We again will prove a lemma bounding the high-degree part of the LR in terms of its low-degree part.
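The product-measure structure exploited in the lemma that follows can be sanity-checked numerically. The sketch below is hypothetical code (the helper names are ours); it assumes the uniform base measure $D_\emptyset$ on $\{\pm 1\}^n$, so that $\chi_i(x) = x_i$ and the relative density of the product measure with mean vector $u$ is $\bar{D}_u(x) = \prod_i (1 + u_i x_i)$. It verifies by brute force that $\langle \bar{D}_u, \bar{D}_v \rangle_{D_\emptyset} = \prod_i (1 + u_i v_i) = \sum_{t} e_t(u \circ v)$:

```python
import itertools
import math
import random

def elem_sym(x, t):
    """t-th elementary symmetric polynomial e_t(x) = sum over t-subsets."""
    return sum(math.prod(c) for c in itertools.combinations(x, t))

def rel_density(w, x):
    """Relative density of the product measure with mean vector w,
    with respect to the uniform measure on {+-1}^n: prod_i (1 + w_i x_i)."""
    return math.prod(1 + wi * xi for wi, xi in zip(w, x))

n = 4
random.seed(0)
u = [random.uniform(-0.9, 0.9) for _ in range(n)]
v = [random.uniform(-0.9, 0.9) for _ in range(n)]

# <bar D_u, bar D_v> under the uniform base measure, exact brute force.
inner = sum(rel_density(u, x) * rel_density(v, x)
            for x in itertools.product([1, -1], repeat=n)) / 2 ** n

prod_form = math.prod(1 + ui * vi for ui, vi in zip(u, v))
sym_form = sum(elem_sym([ui * vi for ui, vi in zip(u, v)], t)
               for t in range(n + 1))

assert abs(inner - prod_form) < 1e-9
assert abs(inner - sym_form) < 1e-9
```

Restricting the last sum to $t > d$ gives exactly the quantity $\langle \bar{D}_u^{>d}, \bar{D}_v^{>d}\rangle = \sum_{t = d+1}^n e_t(u \circ v)$ used in the proof below.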
Lemma 6.4.
Let $\mathcal{S} = \{D_u\}_{u \in S}$ be a set of product distributions over the $n$-dimensional hypercube. Let $D_\emptyset$ be any product measure over $\{\pm 1\}^n$ with no fixed coordinates, and let $\bar{D}_u$ be the relative density of $D_u$. For any integers $d, k \geq 1$ with $k$ even,
\[
\Big\| \mathop{\mathbb{E}}_u \big(\bar{D}_u^{>d}\big)^{\otimes k} \Big\|^2 \leq \mathop{\mathbb{E}}_{u,v \sim S}\Big[\big(\langle \bar{D}_u, \bar{D}_v \rangle - 1\big)^{2k(d+1)}\Big]^{1/2} \cdot \Big\| \mathop{\mathbb{E}}_{u \sim S} \bar{D}_u^{\otimes 2k} \Big\|.
\]
Proof.
As in Lemma 6.2, $\|\mathbb{E}_u (\bar{D}_u^{>d})^{\otimes k}\|^2 = \mathbb{E}_{u,v}\big[\langle \bar{D}_u^{>d}, \bar{D}_v^{>d}\rangle^k\big]$. We let $\chi_i(x)$ be the unique function depending only on $x_i$ such that $\mathbb{E}_{x \sim D_\emptyset} \chi_i(x) = 0$, $\mathbb{E}_{x \sim D_\emptyset} \chi_i(x)^2 = 1$, and $\chi_i(x) > 0$ when $x_i = 1$. For convenience, we associate each $u \in S$ with the vector $u \in \mathbb{R}^n$ whose coordinates are given by $\mathbb{E}_{x \sim D_u}[\chi_i(x)] = u_i$, so that $D_u$ is the (unique) product measure over $\{\pm 1\}^n$ with these means. Let $e_k : \mathbb{R}^n \to \mathbb{R}$ be the $k$-th elementary symmetric polynomial:
\[
e_k(x) = \sum_{\substack{S \subseteq [n] \\ |S| = k}} \prod_{i \in S} x_i.
\]
For each $t \in [n]$, using standard Fourier analysis over the Boolean hypercube one can see that
\[
\langle \bar{D}_u^{=t}, \bar{D}_v^{=t} \rangle = \sum_{\substack{S \subseteq [n] \\ |S| = t}} \mathbb{E}_{D_u}\Big[\prod_{i \in S} \chi_i(x)\Big] \, \mathbb{E}_{D_v}\Big[\prod_{i \in S} \chi_i(x)\Big] = \sum_{\substack{S \subseteq [n] \\ |S| = t}} \prod_{i \in S} u_i v_i = e_t(u \circ v),
\]
where $u \circ v \in \mathbb{R}^n$ is the Hadamard (or "entrywise") product of $u$ and $v$. So we may re-express
\[
\langle \bar{D}_u^{>d}, \bar{D}_v^{>d} \rangle = \sum_{t=d+1}^{n} e_t(u \circ v). \tag{8}
\]
We will exploit the following claims regarding polynomials in $u \circ v$ and the elementary symmetric polynomials:

Claim 6.5.
Let $A$ be any multiset of elements from $[n]$, and for a vector $x \in \mathbb{R}^n$ denote $x^A = \prod_{i \in A} x_i$. Then, for any set $S \subseteq \mathbb{R}^n$,
\[
\mathop{\mathbb{E}}_{u,v \sim S}\big[(u \circ v)^A\big] = \mathop{\mathbb{E}}_{u,v \sim S}\Big[\prod_{i \in A} u_i v_i\Big] = \Big(\mathop{\mathbb{E}}_{u \sim S} u^A\Big)^2 \geq 0.
\]
The proof of Claim 6.5 is evident from the expression above. One consequence is the following:
Claim 6.6.
Let $p : \mathbb{R}^n \to \mathbb{R}$ be any polynomial which is a sum of monomials with non-negative coefficients, and let $S \subseteq \mathbb{R}^n$. Then for any integers $a, b \geq 0$,
\[
\mathop{\mathbb{E}}_{u,v}\big[e_{a+b}(u \circ v) \cdot p(u \circ v)\big] \leq \mathop{\mathbb{E}}_{u,v}\big[e_a(u \circ v) \cdot e_b(u \circ v) \cdot p(u \circ v)\big].
\]
Proof.
For any $x \in \mathbb{R}^n$, we can expand the product
\[
e_a(x)\, e_b(x) = \sum_{\substack{A \subseteq [n] \\ |A| = a}} x^A \sum_{\substack{B \subseteq [n] \\ |B| = b}} x^B = \sum_{i=0}^{\min(a,b)} \sum_{\substack{I \subseteq [n] \\ |I| = i}} \Big(\prod_{j \in I} x_j^2\Big) \sum_{\substack{S, T \subseteq [n] \setminus I \\ |S| = a - i,\ |T| = b - i \\ S \cap T = \emptyset}} x^{S \cup T},
\]
where we have arranged the second sum according to the intersection size $i$ that a monomial from $e_a$ and a monomial from $e_b$ may have. Extracting the $i = 0$ summand, we have that
\[
\sum_{\substack{S, T \subseteq [n] \\ |S| = a,\ |T| = b,\ S \cap T = \emptyset}} x^{S \cup T} = \binom{a+b}{a}\, e_{a+b}(x),
\]
since each set $S \cup T$ is counted in this sum $\binom{a+b}{a}$ times. Write $p(x) = \sum_C \hat{p}_C \cdot x^C$, where the sum is over monomials. Therefore we have that
\[
e_a(x) \cdot e_b(x) \cdot p(x) = \binom{a+b}{a}\, e_{a+b}(x) \cdot p(x) + q(x)\, p(x),
\]
where $q(x)$ (the summation over $i > 0$) is a sum of monomials with non-negative coefficients. The claim now follows from taking expectations on both sides and applying Claim 6.5, together with $\binom{a+b}{a} \geq 1$.

Given these facts and (8), we can deduce the following upper bound:
\begin{align*}
\mathbb{E}_{u,v}\Big[\langle \bar{D}_u^{>d}, \bar{D}_v^{>d}\rangle^k\Big] &= \mathbb{E}_{u,v}\Bigg[\bigg(\sum_{t=d+1}^n e_t(u \circ v)\bigg)^k\Bigg] = \mathbb{E}_{u,v}\Bigg[\sum_{t=d+1}^n e_t(u \circ v) \cdot \bigg(\sum_{t=d+1}^n e_t(u \circ v)\bigg)^{k-1}\Bigg] \\
&\leq \mathbb{E}_{u,v}\Bigg[\sum_{t=d+1}^n e_{d+1}(u \circ v)\, e_{t-(d+1)}(u \circ v) \cdot \bigg(\sum_{t=d+1}^n e_t(u \circ v)\bigg)^{k-1}\Bigg] \\
&= \mathbb{E}_{u,v}\Bigg[e_{d+1}(u \circ v) \cdot \sum_{s=0}^{n-d-1} e_s(u \circ v) \cdot \bigg(\sum_{t=d+1}^n e_t(u \circ v)\bigg)^{k-1}\Bigg],
\end{align*}
where to obtain the inequality we have applied Claim 6.6 with $p = \big(\sum_{t=d+1}^n e_t(u \circ v)\big)^{k-1}$, $a = d+1$, and $b = t - d - 1$. Repeating this for the $k-1$ remaining factors,
\begin{align*}
\mathbb{E}_{u,v}\Big[\langle \bar{D}_u^{>d}, \bar{D}_v^{>d}\rangle^k\Big] &\leq \mathbb{E}_{u,v}\Bigg[e_{d+1}(u \circ v)^k \cdot \bigg(\sum_{s=0}^{n-d-1} e_s(u \circ v)\bigg)^k\Bigg] \leq \mathbb{E}_{u,v}\Bigg[e_{d+1}(u \circ v)^k \cdot \bigg(\sum_{s=0}^{n} e_s(u \circ v)\bigg)^k\Bigg] \\
&= \mathbb{E}_{u,v}\Big[\big(e_{d+1}(u \circ v) \cdot \langle \bar{D}_u, \bar{D}_v \rangle\big)^k\Big],
\end{align*}
where in the second-to-last step we have used Claim 6.5 to add the terms for $s = n - d, \ldots, n$, as they contribute positively to the expectation. Applying Cauchy-Schwarz to the conclusion of the above display,
\begin{align*}
\mathbb{E}_{u,v}\Big[\langle \bar{D}_u^{>d}, \bar{D}_v^{>d}\rangle^k\Big] &\leq \sqrt{\mathbb{E}_{u,v}\big[e_{d+1}(u \circ v)^{2k}\big]\; \mathbb{E}_{u,v}\big[\langle \bar{D}_u, \bar{D}_v\rangle^{2k}\big]} \\
&\leq \sqrt{\mathbb{E}_{u,v}\Big[\big(\langle \bar{D}_u, \bar{D}_v\rangle - 1\big)^{2k(d+1)}\Big]} \cdot \Big\|\mathbb{E}_u \bar{D}_u^{\otimes 2k}\Big\|,
\end{align*}
where we have used that $\mathbb{E}_{u,v}\big[(\langle \bar{D}_u, \bar{D}_v\rangle - 1)^{2k(d+1)}\big] \geq \mathbb{E}_{u,v}\big[e_{d+1}(u \circ v)^{2k}\big]$, again by applying Claim 6.5 in a similar manner to the proof of Claim 6.6. This completes the proof.

Proof of Theorem 6.3.
As in the proof of Theorem 6.1, we will show that a more general result holds given $\|(\mathbb{E}_u D_u^{\otimes m})^{\leq d,\, 2k(d+1)} - 1\| \leq \varepsilon$, and then set $d = 1$. By Lemma 3.5, we have that
\[
\Big\|\mathbb{E}_u (\bar{D}_u - 1)^{\otimes 2k(d+1)}\Big\|^2 \leq \Big\|\mathbb{E}_u (\bar{D}_u^{\leq d} - 1)^{\otimes 2k(d+1)}\Big\|^2 \leq \frac{\varepsilon^2}{\binom{m}{2k(d+1)}}.
\]
The same application of Lemma 3.5 as in the proof of Theorem 6.1, together with Lemma 6.4, implies that
\[
\Big\|\mathbb{E}_u \big(\bar{D}_u^{>d}\big)^{\otimes k}\Big\|^{2/k} \leq C^2 \cdot \frac{\varepsilon^{1/k}}{\binom{m}{2k(d+1)}^{1/2k}} \leq C^2 \cdot \varepsilon^{1/k} \cdot \frac{(2k(d+1))^{d+1}}{m^{d+1}},
\]
using the fact that $\binom{a}{b} \geq (a/b)^b$. As in the proof of Theorem 6.1, we have that $\|(\mathbb{E}_u D_u^{\otimes m})^{\leq d,k} - 1\| \leq \varepsilon$. Applying Theorem 3.1 to the $(d,k)$-LDLR$_m$ and then setting $d = 1$ completes the proof of the theorem.

7 Diluting the Power of Statistical Queries via Cloning: Leveling the Playing Field
As discussed in Remark 1.9, many average-case problems of interest such as planted clique and tensor PCA do not have a natural notion of samples. In contrast, the SQ framework requires problem formulations involving multiple samples. In this section we describe how to convert certain single-sample problems into multiple-sample problems, and then address the question of how to choose the number of samples so that the SQ complexity of the resulting problem captures the computational complexity of the original problem (as predicted by e.g. low-degree tests).
Multi-sample formulations of single-sample problems.
The idea is to apply an SQ bound to a "diluted" or "cloned" version of the single-sample problem, wherein each "diluted" sample carries little information compared to a single sample. When multiple cloned samples can be combined into one original sample in polynomial time, a lower bound against the cloned problem implies a lower bound against the original problem (within the framework of polynomial-time algorithms). We first state a general and somewhat obvious sufficient condition for the existence of an average-case reduction from a multi-sample problem to a single-sample problem. A computational lower bound for the multi-sample problem is then transferred to the single-sample problem via the reduction.
Fact 7.1.
Let $D_\emptyset$ and $\mathcal{S} = \{D_u\}_{u \in S}$ be distributions on $\mathbb{R}^N$ and let $\mu$ be a prior over $S$. Let $\{P_\theta\}_{\theta \in \Omega}$ be an exponential family of distributions on $\mathbb{R}^N$ with sufficient statistic $T$ that can be computed in time polynomial in the size of its input. Suppose that for each distribution $D \in \{D_\emptyset\} \cup \mathcal{S}$, there is a $\theta = \theta(D)$ such that if $Y_1, \ldots, Y_m \overset{\text{i.i.d.}}{\sim} P_\theta$ then $T(Y_1, \ldots, Y_m) \sim D$. Then if there is no polynomial-time algorithm testing between $H_0 : (Y_1, \ldots, Y_m) \sim P_{\theta(D_\emptyset)}^{\otimes m}$ versus $H_1 : (Y_1, \ldots, Y_m) \sim P_{\theta(D_u)}^{\otimes m}$ where $u \sim \mu$, with Type I + II error at most $1 - \varepsilon$, then the same is true for the original testing problem.

If one can efficiently generate $m$ samples $Y_1, \ldots, Y_m$ as described in the fact just above given the single sample $X$, then the mapping is invertible, which implies that no signal is lost and the single and multi-sample versions of the problem are computationally and statistically equivalent. Note that by the definition of sufficient statistic it is possible to generate samples with a given value of the sufficient statistic, but it is not always possible to do so efficiently (assuming the widely believed computational complexity conjecture RP $\neq$ NP) [BGS14, Mon14].

We now describe two examples where simple randomized algorithms show that it is possible to generate samples efficiently given a sufficient statistic. In the first, the data consists of unit-variance Gaussians, for which the mean is the sufficient statistic.

Lemma 7.2 (Gaussian Cloning). There is a randomized algorithm taking as input a real number $x$ and outputting $m$ independent random variables $Y_1, \ldots, Y_m$ such that for any $\mu \in \mathbb{R}$, if $x \sim N(\mu, 1)$, then $Y_i \sim N(\mu/\sqrt{m}, 1)$.

We will give the proof in Appendix C. In the second example, we show that the planted clique problem has an equivalent multi-sample version. Given a subset $U \subseteq [n]$, let $G(n, U, \gamma)$ denote the distribution of $G(n, \gamma)$ conditioned on the vertices in $U$ forming a clique (again see Appendix C for a proof).
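A minimal sketch of one way such a Gaussian cloning map can be realized (an illustrative construction under the stated sufficient-statistic requirement; the proof in Appendix C may use a different presentation): sample $Y \sim N(0, I_m)$ conditioned on $\sum_i Y_i = \sqrt{m}\, x$, by adding fresh Gaussian noise projected orthogonally to the all-ones direction.

```python
import math
import random
import statistics

def gaussian_clone(x, m, rng=random):
    """Given x ~ N(mu, 1), output Y_1, ..., Y_m i.i.d. N(mu / sqrt(m), 1).

    Construction: Y = (x / sqrt(m)) * 1 + (I - (1/m) 11^T) g with g ~ N(0, I_m),
    i.e. the mean component dictated by x plus Gaussian noise projected
    orthogonally to the all-ones vector. Covariance: (1/m) 11^T from x plus
    I - (1/m) 11^T from the projection, which sums to I. By construction,
    sum(Y) / sqrt(m) == x exactly, so T(Y) recovers the original sample.
    """
    g = [rng.gauss(0.0, 1.0) for _ in range(m)]
    gbar = sum(g) / m
    return [x / math.sqrt(m) + gi - gbar for gi in g]

# The sufficient statistic is preserved exactly, so the map is invertible.
rng = random.Random(0)
x = rng.gauss(3.0, 1.0)
Y = gaussian_clone(x, 5, rng)
assert abs(sum(Y) / math.sqrt(5) - x) < 1e-9
```

Applied entrywise to a tensor, this is the dilution step discussed below: each of the $m$ clones carries signal strength reduced by a factor of $1/\sqrt{m}$.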
This reduction is a mild variant of Bernoulli Cloning in [BBH18], which corresponds to the regime where $m = O(1)$.

Lemma 7.3 (Planted Clique Cloning). There is an algorithm that, when given $m$ independent samples from $G(n, U, \gamma)$ for any $U \subseteq [n]$, efficiently produces a single instance distributed according to $G(n, U, \gamma^m)$. Conversely, there is an efficient algorithm taking a graph as input and producing $m$ random graphs, such that given an instance of planted clique $G(n, U, \gamma)$ with unknown clique position $U$, it produces $m$ independent samples from $G(n, U, \gamma^{1/m})$.

The same equivalence holds in the hypergraph formulation of planted clique. The Gaussian cloning algorithm runs in poly($m$) time given access to an oracle for sampling standard normal random variables. When applied entry-wise, this cloning procedure can be used to show average-case equivalences between single and multi-sample variants of problems with Gaussian noise, such as tensor PCA and the spiked Wigner model. Furthermore, increasing the number of samples from 1 to $m$ dilutes the level of signal in the problem exactly by a factor of $1/\sqrt{m}$. The planted clique cloning algorithm runs in poly($m, n$) randomized time. This again shows a precise tradeoff between the level of signal and the number of samples $m$: the ambient edge density varies from $\gamma$ to $\gamma^{1/m}$ as the number of samples varies from 1 to $m$.

Choosing the number of samples.
The number of queries used by statistical query algorithms is a proxy for runtime. However, the statistical query framework allows queries that cannot be computed in polynomial time, and for this reason can lead to predictions that do not correspond to polynomial-time algorithms. For example, a naive application of the statistical query framework of [FGR+17] to the planted clique problem, which treats an instance as a single sample from the planted clique distribution, admits a single-query VSTAT($O(1)$) algorithm using the $\{0,1\}$-valued query: does the graph $G$ have a clique of size at least $k$? For this reason, prior SQ lower bounds for planted clique [FGR+
17] consider instead the planted biclique problem in a bipartite graph and, furthermore, assume that i.i.d. data is generated by observing random columns of the adjacency matrix. While this is an interesting problem to study, it is not known to be equivalent to planted clique, the original problem of interest. More troubling is that this approach of generating samples fails badly for hypergraph planted clique. If one views a sample as a random slice of the adjacency tensor, then statistical query algorithms can perform an exhaustive search over what amounts to an instance of planted clique; this succeeds if at least one sample contains a planted clique, which occurs with positive probability once one has $n/k$ samples.

The methodology described earlier in this section of converting a single-sample problem to a many-sample problem is applicable to a broad class of problems and thus gives a unified way of addressing a variety of problems within the SQ framework. If we are free to study multi-sample versions of problems, it remains to specify the correct number of samples in order to obtain meaningful predictions within the SQ framework. As noted in the introduction, a prescription is suggested by Theorem 1.6: we should dilute the signal so that the problem is information-theoretically unsolvable from $O(1)$ samples. Concretely, we convert to a hypothesis testing problem with $m$ samples, $D_\emptyset^{\otimes m}$ vs. $D_u^{\otimes m}$, where $\|\mathbb{E}_u \bar{D}_u\| = O(1)$.

Problem 8.1 (Tensor Principal Components Analysis (PCA)). For positive integers $n, r$, $\lambda \in \mathbb{R}$, and $S = \{\pm 1/\sqrt{n}\}^n$, the $n$-dimensional $r$-tensor PCA problem with signal strength $\lambda$ is the following many-vs-one hypothesis testing problem:

• Null: a tensor in $(\mathbb{R}^n)^{\otimes r}$ with independent standard Gaussian entries, $D_\emptyset = N(0, I_{n^r})$.
• Alternate: uniform mixture of $D_u = N(\lambda \cdot u^{\otimes r}, I_{n^r})$ over $u \in S$.

Variations on the tensor PCA problem are possible; for example, one may insist that the tensors be symmetric, or that $S$ be a different subset of $S^{n-1}$.

Claim 8.2.
For any integers $k, n, r \geq 1$ with $2k\lambda^2 < n$, the $k$-sample likelihood ratio for the $n$-dimensional $r$-tensor PCA problem with signal strength $\lambda$ is bounded by
\[
\Big\|\mathop{\mathbb{E}}_{u \sim S} \bar{D}_u^{\otimes k}\Big\|^2 \leq \sqrt{\frac{2}{\pi\big(1 - 2k\lambda^2/n\big)}}.
\]
We prove this claim in Appendix D.1.
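To make the testing problem concrete, the following sketch samples from the null and alternate of Problem 8.1 and evaluates the matched-filter statistic $\langle u^{\otimes r}, T\rangle$. This is hypothetical illustration code: it fixes $r = 2$ and gives the tester the planted spike $u$, which is strictly easier than the actual detection problem. Since $\|u\| = 1$ for $u \in \{\pm 1/\sqrt{n}\}^n$, the statistic is $N(0,1)$ under the null and $N(\lambda, 1)$ under the alternate.

```python
import math
import random

def sample_instance(n, lam, u=None, rng=random):
    """Sample an n x n instance of 2-tensor PCA: lam * u u^T + Gaussian noise.

    With u=None this is the null D_0 = N(0, I_{n^2}); otherwise it is the
    alternate D_u = N(lam * (u tensor u), I_{n^2}) for the given unit vector u.
    """
    T = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    if u is not None:
        for i in range(n):
            for j in range(n):
                T[i][j] += lam * u[i] * u[j]
    return T

def matched_filter(T, u):
    """<u tensor u, T>: distributed N(0, 1) under the null, N(lam, 1) under D_u."""
    n = len(u)
    return sum(u[i] * u[j] * T[i][j] for i in range(n) for j in range(n))

n, lam = 16, 20.0
rng = random.Random(0)
u = [rng.choice([-1, 1]) / math.sqrt(n) for _ in range(n)]
null_stat = matched_filter(sample_instance(n, lam, None, rng), u)
alt_stat = matched_filter(sample_instance(n, lam, u, rng), u)
assert abs(null_stat) < 10.0 < alt_stat  # clear separation at lam = 20
```

Consistent with the sufficient-statistic discussion in Section 7, averaging $m$ such samples and rescaling yields a single instance with effective signal strength $\sqrt{m}\,\lambda$.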
Claim 8.3.
For any integers $n, r, k, m$ and real number $\lambda$ which satisfy $2em\lambda^2 k^{(r-2)/2} \leq n^{r/2}$, the $(1, k)$-LDLR$_m$ for the $m$-sample, dimension-$n$, $r$-tensor PCA problem with signal strength $\lambda$ is bounded by
\[
\Big\|\big(\mathop{\mathbb{E}}_u (D_u^{\otimes m})\big)^{\leq 1, k} - 1\Big\|^2 \leq \frac{e^{r+1}\, m \lambda^2\, k^{(r-2)/2}}{n^{r/2}}.
\]
The proof is a straightforward calculation which appears in [HKP+17, KWB19]; these works consider the single-sample version, but it is not difficult to see that their bounds imply ours. For completeness we give a full proof in Appendix D.1. Together these claims are sufficient to deduce the following corollary of Theorem 6.1.
Corollary 8.4.
For integers $k, n, m, r$ and real numbers $\lambda, \delta$ with $\delta \in (0, 1)$ satisfying
\[
|\lambda| \leq \min\left\{ \left(\frac{n}{(4k)^{(r-2)/r}}\right)^{r/2} \frac{1}{2em},\ \sqrt{\frac{(1-\delta)\, n}{2k}} \right\}
\quad \text{and} \quad
ek\left(\frac{2}{\pi\delta}\right)^{1/2k} \leq m,
\]
then for the $n$-dimensional $r$-tensor PCA problem with signal strength $\lambda$, for all $q > 0$, $\mathrm{SDA}\big(m/(ek\,q^{2/k})\big) \geq q$.

Proof. By Claims 8.2 and 8.3 and our assumptions, we have that
\[
\Big\|\mathop{\mathbb{E}}_u \bar{D}_u^{\otimes k}\Big\|^2 \leq \sqrt{\frac{2}{\pi\big(1 - 2k\lambda^2/n\big)}} \leq \sqrt{\frac{2}{\pi\delta}}, \qquad
\Big\|\big(\mathop{\mathbb{E}}_u (D_u^{\otimes m})\big)^{\leq 1, k} - 1\Big\| \leq \frac{2em\lambda^2 (4k)^{(r-2)/2}}{n^{r/2}} \leq 1.
\]
We instantiate Theorem 6.1 with $C = \big(\tfrac{2}{\pi\delta}\big)^{1/4k}$ and $\varepsilon = 1$, and using our assumption on $\delta$ we have our conclusion.

Comparison with prior work and predictions.
In the literature, it is most common to consider the single-sample version of tensor PCA; for translation's sake, notice that $m$ samples from $N(\lambda u^{\otimes r}, I_{n^r})$ are equivalent to a single sample from $N(\sqrt{m}\lambda u^{\otimes r}, I_{n^r})$, since the sum of the samples is a sufficient statistic. So we compare the $m$-sample problem to the single-sample hypothesis testing problem with signal strength $\sqrt{m}\lambda$. Similarly, we compare VSTAT($M$) to the single-sample hypothesis testing problem with signal strength $\sqrt{M}\lambda$.

Applying this transformation, the best known $n^k$-time algorithms for the $n$-dimensional $r$-tensor PCA problem require signal strength $\sqrt{m}\lambda \geq \tilde{\Omega}\big(\sqrt{k}\,(n/k)^{r/4}\big)$ [BGL17, RRS17, WEAM19]. To see that this is consistent with the obtained VSTAT($M$) bound with $M = m/(ek\,q^{2/k})$, note that by Theorem A.5 our bound implies that any $q = 2^k$-query algorithm requires the "adjusted signal strength" to satisfy either $\lambda\sqrt{k} = \Omega(\sqrt{n})$ (which we will discuss below) or
\[
\sqrt{M}\,|\lambda| \geq \left(\frac{n}{(4k)^{(r-2)/r}}\right)^{r/4} \cdot \frac{1}{\sqrt{ek\,q^{2/k}}} = \Omega\left(\Big(\frac{n}{k}\Big)^{r/4}\right).
\]
In the $k \gg \log n$ regime, this is equivalent to the performance of the best-known algorithms up to a factor of $\tilde{O}(\sqrt{k})$.

We remark as well that the condition $\lambda\sqrt{k} \leq O(\sqrt{n})$ is necessary to rule out statistical query algorithms which use brute force on individual samples. If $\lambda \gg \sqrt{n}$, then there is a single-query SQ algorithm for the many-vs-one hypothesis testing problem: for a given sample $T \in (\mathbb{R}^n)^{\otimes r}$, simply query whether there exists some vector $x \in \{\pm 1/\sqrt{n}\}^n$ which achieves $|\langle x^{\otimes r}, T \rangle| \geq \lambda/2$. When $|\lambda| \gg \sqrt{n}$, it is easy to see that for $T \sim D_\emptyset$ this query will return false with high probability; this follows from a union bound over the $2^n$ choices of $x$ and the fact that $\langle x^{\otimes r}, T \rangle \sim N(0, 1)$ for each fixed $x$, since $\|x^{\otimes r}\| = 1$. On the other hand, for any $T \sim D_u$, this query will return true with high probability for similar reasons.

In this section, we consider several formulations of planted clique (PC) and planted dense subgraph (PDS).
We begin by using our results to reproduce SQ lower bounds for "bipartite" formulations previously considered in the SQ literature [FGR+17]. The classical planted clique problem is a single-sample problem, which makes it incompatible with the SQ framework. In an effort to address the complexity of the PC problem, the authors of [FGR+17] give an SQ lower bound for the following related problem: "bipartite planted clique", where each column of the adjacency matrix is treated as an i.i.d. sample from a mixture distribution.
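One natural reading of this column-sampling formulation can be sketched as follows. This is hypothetical illustration code: the mixture weight $K/N$ and the Ber($p$)/Ber($q$) entry pattern are assumptions for illustration, not verbatim the distribution of [FGR+17]; the formal problem statement is given next.

```python
import random

def sample_bipartite_column(N, K, p, q, planted, rng=random):
    """One i.i.d. sample, i.e. one column of the bipartite adjacency matrix.

    Under the null (planted=None) every entry is Ber(q). Under the alternate,
    with probability K/N the column is a 'clique column': entries indexed by
    the planted vertex set are Ber(p) (p = 1 for planted clique) and the
    remaining entries are Ber(q); otherwise the column is an i.i.d. Ber(q)
    column, identical to the null.
    """
    col = [1 if rng.random() < q else 0 for _ in range(N)]
    if planted is not None and rng.random() < K / N:
        for i in planted:
            col[i] = 1 if rng.random() < p else 0
    return col

rng = random.Random(0)
N, K, p, q = 100, 10, 1.0, 0.5
planted = set(range(K))
null_col = sample_bipartite_column(N, K, p, q, None, rng)
alt_col = sample_bipartite_column(N, K, p, q, planted, rng)
assert len(null_col) == len(alt_col) == N
```

Under this sketch, the marginal density of an entry on a planted row rises from $q$ to $q + \frac{K}{N}(p - q)$, which is the small per-sample signal that SQ algorithms must detect.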
Problem 8.5 (Bipartite Planted Dense Subgraph/Planted Clique). Given $K, N \in \mathbb{N}$ and $0 < q < p \leq 1$, bipartite PDS is the many-sample testing problem in which each sample is a column of the bipartite adjacency matrix. Under the null, each entry of a column is i.i.d. Ber($q$). Under the alternate, a planted set $u \subseteq [N]$ is drawn by including each vertex independently with probability $K/N$; each sample is then, with probability $K/N$, a column whose entries indexed by $u$ are Ber($p$) and whose remaining entries are Ber($q$), and is otherwise an i.i.d. Ber($q$) column. Bipartite PC is the special case $p = 1$.

Claim 8.6. For any $K, N, k, d, m \in \mathbb{N}$, define $\gamma = \frac{(p-q)^2}{q(1-q)}$. Then the $(d, k)$-LDLR$_m$ for bipartite PDS is bounded, $\|(\mathbb{E}_{u \sim \mu}(D_u^{\otimes m}))^{\leq d, k} - 1\| = O_N(1)$, if
\[
\frac{K^2}{N} \cdot \max\left\{ \frac{m}{N},\ (1+\gamma)^k \right\} \leq N^{-\Omega_N(1)}. \tag{1}
\]

Claim 8.7.
For any $K, N, k \in \mathbb{N}$, the $k$-sample LR is bounded, $\|\mathbb{E}_{u \sim \mu} \bar{D}_u^{\otimes k}\| = O_N(1)$, if
\[
\frac{K^2}{N} \cdot \max\left\{ \frac{k}{N},\ (1+\gamma)^k \right\} \leq N^{-\Omega_N(1)}, \tag{1}
\]
where $\gamma = \frac{(p-q)^2}{q(1-q)}$.

Implications of our results.
Given these computations, we can now deduce the following implication of Corollary 5.5.
Corollary 8.8.
Suppose that $K = \Theta(N^{1/2-\delta})$ for some small constant $\delta > 0$ and that $0 < q < p \leq 1$ are constants. Then for bipartite PC and PDS with $N$ vertices, edge densities $0 < q < p \leq 1$, and planted dense subgraph size $K$, it holds that $\mathrm{SDA}(N^{\delta}) = N^{\omega(1)}$.

Proof. Let $T$ be the noise operator that resamples independently from Ber($q$), so $T$ is a $(1, 0)$-noise operator. Bipartite PDS with $K = \Theta(N^{1/2-\delta})$ can be realized as a random restriction with noise operator $T$ of bipartite PDS with $K = \Theta(N^{1/2-\delta/2})$, restriction probability $s/N = N^{-\delta/2}$, and noise parameter $\rho = 0$. Suppose that $d, k = \Theta((\log N)^{c_1})$ where $c_1 \in (0, 1)$ and $d/k \sim c_2$ where $c_2$ is a sufficiently large constant. If again $m = \Theta(N^{\delta})$, then the parameters for both the restricted and unrestricted bipartite PDS instances satisfy condition (1) in Claims 8.6 and 8.7. Now consider applying Corollary 5.5 with dimension lower bound $q' \sim 2^{k(\log N)^{c_3}}$ for some constant $c_3 \in (1 - c_1, 1)$. If $c_2$ is sufficiently large, then $(2s/N)^{(d+1)} (q')^{2/k} m = o(1)$ and we have that $\mathrm{SDA}(N^{\delta}) \geq q' = N^{\omega(1)}$.
Our generic noise-robustness result (Theorem 5.2) also recovers this lower bound inthe case of bipartite PDS when p <
1. We choose T to be the (1 , ρ )-noise operator that resamplesentries independently from Ber( q ) with probability 1 − ρ = p − q − q . Then the distributions D u canbe realized by applying T entrywise to an instance of bipartite PC with edge density q . Note thatthe parameters d ∼ c log N for a sufficiently large constant c , k ∼ c log N for a sufficiently smallconstant c , K = Θ( N / − δ ) and m = Θ( N δ ) satisfy condition (1) in Claims 8.6 and 8.7 forboth the bipartite PDS instance in question and the bipartite PC instance before applying T . Nowapply Theorem 5.2 with dimension lower bound q ′ ∼ k (log N ) c for some constant c ∈ (0 , c is sufficiently large, then ρ d +1) m = o (1) and it again follows that SDA( N ) > q ′ = N ω (1) . Wealso remark that, unlike in our previous applications of our main results where we set q ′ = 2 k , wemust take q ′ = 2 ω ( k ) in this application of our noise-robustness theorem to show superpolynomialSQ lower bounds. Comparison to prior work and predictions.
Corollary 8.8 recovers the $K = \Theta(N^{1/2-\delta})$ barrier from [FGR+17] at which the SDA for bipartite PC/PDS with constant edge densities ceases to be poly($N$). Despite being the consequence of a much more general theorem on random restrictions, our results for bipartite PC/PDS also nearly recover the precise SDA lower bounds from [FGR+17]: there, for any integer $\ell \geq 1$, it is shown that $\mathrm{SDA}\big(N^{\ell+1}/K^{\ell}\big) \geq N^{2\ell\delta}/3^{\ell}K$. Fine-tuning our parameter choices in Corollary 8.8 yields that $\mathrm{SDA}\big(N^{(1-\epsilon)(\ell+1)}/K^{\ell}\big) \geq N^{\Omega(\ell)}$ for any constant $\epsilon > 0$, which matches the bound from [FGR+17] up to arbitrarily small polynomial factors in the sample complexity.

8.2.2 Multi-Sample Hypergraph Planted Clique
We now consider a variant of planted clique where the observations consist of multiple samples from the planted clique distribution. As discussed in Section 7, there is a natural tradeoff between the number of samples $m$ and the edge density $q$ for which this variant has an average-case equivalence with ordinary PC. In this section, we will treat a generalization of this variant to $s$-uniform hypergraphs (including the case $s = 2$ corresponding to simple graphs).

Let $G_s(N, q)$ denote the Erdős–Rényi distribution over $s$-uniform hypergraphs, where each $s$-subset of $[N]$ is included as a hyperedge independently with probability $q$. Given a subset $u \subseteq [N]$, let $G_s(N, u, q)$ denote the distribution over hypergraphs where hyperedges among the vertices within $u$ are always included and all other hyperedges are included independently with probability $q$. Throughout this section, we will treat $s$ as a fixed positive integer constant.

Problem 8.10 (Multi-Sample Hypergraph PC). Given $s, K, N, m \in \mathbb{N}$ with $N \gg K \gg s \geq 2$ and $q \in (0, 1)$, the $m$-sample $s$-uniform hypergraph planted clique problem with edge density $q$ is the following hypothesis testing problem:

• Null: the Erdős–Rényi hypergraph $D_\emptyset = G_s(N, q)$.

• Alternate: uniform mixture of $D_u = G_s(N, u, q)$ over $K$-subsets $u \subseteq [N]$.

The complexity of multi-sample hypergraph PC as $m$ and $q$ vary. To the best of our knowledge, multi-sample hypergraph PC has not been considered in this generality before. However, because of the average-case equivalence from Section 7, its complexity can be extrapolated exactly from that of ordinary hypergraph planted clique, i.e. when $m = 1$. For $m = 1$, its complexity conjecturally behaves as follows (as a function of $q$):

1. If $q$ is near constant with $N^{-o(1)} \leq q \leq 1 - N^{-o(1)}$, then the threshold at which polynomial-time algorithms begin to solve the distinguishing problem is $K = N^{1/2 \pm o(1)}$, which is consistent with the threshold in the classical setting of $q = 1/2$.

2. If $q$ is polynomially small with $q = \Theta(N^{-\alpha})$ for some $\alpha > 0$, then the clique number of $G_s(N, q)$ is constant and the problem begins to be easy when $K = \Theta(1)$.

3. If $q$ is very close to 1 with $q = 1 - \Theta(N^{-\alpha})$ for some $\alpha \in (0, 1)$, then the problem begins to be easy when $K = \tilde{\Theta}(N^{\alpha/s})$.

The best known algorithm in the last regime simply counts the total number of hyperedges. In the graph case when $s = 2$, it was shown in [BBH18] that the PC conjecture with $q = 1/2$ implies hardness up to $K = \tilde{\Theta}(N^{\alpha/2})$ when $q = 1 - \Theta(N^{-\alpha})$. We remark that, in this regime, recovering the vertices in the planted clique is conjectured to be a harder problem that only becomes easy at larger values of $K$. Our focus in this section will be on the transition in the first parameter regime, when $N^{-o(1)} \leq q \leq 1 - N^{-o(1)}$.

As discussed in Section 7, there is a natural average-case equivalence between the single and multi-sample problems. Specifically, hypergraph PC with $m$ samples and edge density $q$ is equivalent to hypergraph PC with $m = 1$ sample and edge density $q^m$. Thus the parameter regime of interest corresponds to $q$ with $N^{-o(1)}/m \leq 1 - q \ll \frac{\log N}{m}$. We remark that at $1 - q = \Theta\big(\frac{\log N}{m}\big)$, the distinguishing problem undergoes a (conjecturally sharp) transition to algorithmically easy. Specifically, taking the bit-wise AND of the hyperedge indicators across the different samples corresponds to a single-sample instance of hypergraph PC with edge density $q^m = N^{-\Theta(1)}$, which can be solved in polynomial time whenever $K$ is a sufficiently large constant.

As also discussed in Section 7, another concern when choosing $m$ is the existence of inefficient algorithms that can be implemented with a small number of calls to VSTAT($m$). Let $h(G) \in \{0, 1\}$ be the indicator that $G$ has a clique of size $K$. While $h$ is NP-hard to compute, the single query of $h$ to a VSTAT($\Theta(1)$) oracle will solve the distinguishing problem unless $1 - q$ is sufficiently small.
The expected number of cliques of size $K$ in $G_s(N, q)$ is
\[
\binom{N}{K} q^{\binom{K}{s}} \leq \exp\left(K \log N - (1-q)\binom{K}{s}\right) = o(1)
\]
as long as $1 - q \geq CK^{1-s}\log N$ for a sufficiently large constant $C$. Thus unless $1 - q = O(K^{1-s}\log N)$, Markov's inequality implies that $G_s(N, q)$ has no clique of size $K$ with probability $1 - o(1)$, and the SQ query of $h$ solves the distinguishing problem where no polynomial-time algorithms are known to succeed. Thus, to make the performance of SQ and polynomial-time algorithms comparable, it seems necessary to restrict to $q$ with $1 - q = O(K^{1-s}\log N)$. As will be shown in Claim 8.13, this threshold is also roughly when the $k$-sample LR begins to have a constant-sized norm. To summarize this discussion, the natural choices of $m$ and $q$ are:

• sufficiently large $q$ with $1 - q = O(K^{1-s}\log N)$; and

• $m$ such that $q$ lies in the range $N^{-o(1)}/m \leq 1 - q \ll \frac{\log N}{m}$.

Note that this requires we take $m = \tilde{\Omega}(K^{s-1})$ samples.

Remark 8.11.
A different natural alternative formulation of hypergraph PC views the adjacency lists of individual vertices as independent samples, as in bipartite PC. However, since each adjacency list is itself an $(s-1)$-uniform hypergraph, there is again a brute-force single-query SQ algorithm whenever $s > 2$: ask if the adjacency list contains a clique of size at least $K$. For this reason, the bipartite model is not appropriate for the SQ framework.

Choice of prior µ. We now discuss why the choice of prior $\mu$ over the clique vertex set $u$ differs in the definitions of multi-sample hypergraph PC and bipartite PDS. The prior $\mu$ in which each vertex is included in the clique independently with probability $K/N$ was used in defining bipartite PDS because it is more convenient to work with when computing the LDLR, the $k$-sample LR, and applying our main results.

However, a subtle technical issue arises in multi-sample PC that precludes using this prior. The underlying problem is that $D_\emptyset$ and the mixture of $D_u$ induced by this prior do not necessarily converge in $\chi^2$ divergence even when they converge in total variation. This is because $\chi^2$ divergence is large if certain tail events have very mismatched probabilities, while total variation is not. Specifically, the probability that the mixture of $D_u$ contains a clique of size $t \gg K$ is at least $\Pr[\mathrm{Bin}(N, K/N) \geq t]$, which is much larger than the probability that $D_\emptyset$ contains a clique of size $t$. This issue causes the average correlations defining SDA and the key quantity $\|\mathbb{E}_{u \sim \mu} \bar{D}_u^{\otimes k}\|$ to be very different between the two priors. Specifically, carrying out a computation similar to Claim 8.13 for the prior where each vertex is included with probability $K/N$ yields that $\|\mathbb{E}_{u \sim \mu} \bar{D}_u^{\otimes k}\|$ is only $O_N(1)$ for much smaller values of $\gamma$.

The important properties of the prior $\mu$ used in this section, where $u$ is a random $K$-subset of $[N]$, are that: (1) $u$ is symmetric; (2) the size of $u$ concentrates around $K$; and (3) the distribution of $|u|$ has very small upper tails. In particular, replacing $\mu$ with any prior that chooses a clique size from the interval $[CK, K]$ for some constant $C > 0$ and satisfies these properties would also suffice.

LDLR and $k$-sample LR bounds. The following claims bound the LDLR and $k$-sample LR in multi-sample hypergraph PC in order to verify the conditions needed to apply our main results. Their proofs are standard computations and deferred to Appendix D.2. Let $\mu$ denote the uniform distribution over $K$-subsets $u \subseteq [N]$.

Claim 8.12.
For any $s, K, N, k, d, m \in \mathbb{N}$, the $(d, k)$-LDLR$_m$ for multi-sample hypergraph PC satisfies $\|(\mathbb{E}_{u \sim \mu}(D_u^{\otimes m}))^{\leq d, k} - 1\| = O_N(1)$ if the following conditions are satisfied:
\[
\gamma \cdot \max\{m,\ (ksd)^s\} = O_N(1) \qquad \text{and} \qquad \frac{2ske K}{N} \leq 1 - \Omega_N(1), \tag{1}
\]
where $\gamma = \frac{1-q}{q}$.
For any
For any $K, N, k \in \mathbb{N}$, the $k$-sample LR is bounded, $\|\mathbb{E}_{u \sim \mu} \bar{D}_u^{\otimes k}\| = O_N(1)$, if the following conditions are satisfied:
\[
K^2 \leq N \qquad \text{and} \qquad \gamma k \cdot K^{s-1} \leq \log\left(\frac{N}{K^2}\right), \tag{1}
\]
where $\gamma = \frac{1-q}{q}$.

Implications of our results and comparison to conjectured complexity barriers.
We can now deduce the implications of our main theorems.
Corollary 8.14.
Suppose that $s$ is a fixed constant, $K = \Theta(N^{1/2-\delta})$ for some small constant $\delta > 0$, and $q \in (0, 1)$ satisfies $q \geq 1 - c_1 K^{1-s}$ for a sufficiently small constant $c_1 > 0$. Then for multi-sample hypergraph PC with $N$ vertices, clique size $K$, and edge density $q$, it holds that
\[
\mathrm{SDA}\left(\Theta\Big(\frac{1}{t(1-q)}\Big)\right) \geq N^{\Omega(\log t)}
\]
for any $t \geq (\log N)^2$.

Proof. In multi-sample hypergraph PC, each $D_u$ is a product measure on the hypercube and Theorem 6.3 applies. Consider setting the parameters $d = 1$, $k = c_2 \log N$ for a sufficiently small constant $c_2 > 0$, $K = \Theta(N^{1/2-\delta})$ for a constant $\delta > 0$, and $m = c_3/(1-q)$ for some constant $c_3 > 0$. Note that $m$ is polynomially large in $N$. It can now be verified that, if $c_3$ is sufficiently small, these parameters satisfy the conditions in Claim 8.12 and, if $c_1$ is sufficiently small, they also satisfy the condition in Claim 8.13. Now consider applying Theorem 6.3 with SDA lower bound $q' = N^{c_4(\log t - \log\log N)}$. It can be verified that this implies $\mathrm{SDA}(\Theta(m/t)) \geq q'$, proving the corollary.

Setting $t = (\log N)^{2+\delta'}$ for some small $\delta' > 0$ recovers the $K = \Theta(N^{1/2-\delta})$ computational barrier in the SQ model for multi-sample hypergraph PC in the regime $N^{-o(1)}/m \leq 1 - q \leq O(1/m)$ of interest. It is worth noting that the loss of the $t = (\log N)^2$ factor in $m$ upon applying Theorem 6.3 means that we cannot arrive at $m$ and $q$ satisfying $1 - q = \Theta(1/m)$ exactly. Under the average-case equivalence from Section 7, this corresponds to single-sample hypergraph PC with exactly constant edge density. However, this constraint does not affect the tightness of Corollary 8.14, as the resulting lower bound still corresponds to a single-sample instance of hypergraph PC with a nearly constant edge density in the range $N^{-o(1)} \leq q \leq 1 - N^{-o(1)}$, and thus $K = N^{1/2 \pm o(1)}$ is still the conjectured computational barrier.

Remark 8.15.
Our partial noise robustness results imply SQ lower bounds in multi-sample hyper-graph PC, with a slightly different choice of the prior µ . Let µ ′ be the prior formed by choosing a31lique size according to Bin( K, N − δ ) and then choosing a vertex set of this size uniformly at randomfrom [ N ] to be the planted clique, where δ > K, N − δ ) has zero probability mass above K , Claims 8.12 and 8.13 can be adapted to accom-modate this different prior. Furthermore, this prior concentrates will around KN − δ = Θ( N / − δ )if K = Θ( N / − δ ).If T is the (1 ,
0) noise operator that resamples independently from Ber( q ), then m -samplehypergraph PC with the prior µ ′ can be realized as a subtensor random restriction of the type inTheorem 5.10 of m -sample hypergraph PC with the prior µ . In particular, it can be realized withthe noise operator T , restriction probability N − δ and correlation parameter ρ = 0. Now considersetting the parameters d = c − (log N ) s , k = c log N for a sufficiently small constant c > K = Θ( N / − δ ) for a constant δ > q and number of samples m to again be m = c / (1 − q ). If c and c are sufficiently small, then the conditions in Claims 8.12 and 8.13 aremet. Adapting the arguments in these claims to accommodate µ ′ yields that the relevant LDLRand k -sample LR are both O N (1). Now consider applying Theorem 5.10 together with Theorem3.1, similarly to as in Corollary 5.5, again with the SDA lower bound q ′ = N c (log t − log log N ) . If c is sufficiently small, then ( N − δ ) k − p √ ( d +1) / m = o (1) and we recover the same lower bound as inCorollary 8.14 for the prior µ ′ . The spiked Wishart model is a well-studied model for understanding sparse PCA. We considerthe following, standard version the problem. As with the other problems considered here, manyvariations of this problem exist in the literature, see e.g. [PWB +
18] for a more detailed discussion.
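Before the formal statement, the alternate-hypothesis sampling process of Problem 8.16 below can be sketched in code. This is our own illustration, not code from the paper; the truncation constant 2 and the 1/√(ρn) normalization follow our reading of the problem statement.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spike(n, rho):
    """Draw s: each entry of s' is 0 w.p. 1 - rho and +/-1 w.p. rho/2 each;
    if s' has more than 2*rho*n nonzero entries, truncate to s = 0,
    otherwise normalize so that ||s|| is roughly 1."""
    signs = rng.choice([-1.0, 1.0], size=n)
    support = rng.random(n) < rho
    s_prime = signs * support
    if support.sum() > 2 * rho * n:
        return np.zeros(n)
    return s_prime / np.sqrt(rho * n)

def sample_alternate(n, rho, lam, m):
    """m i.i.d. samples from D_s = N(0, I_n + lam * s s^T) for a fresh spike s."""
    s = sample_spike(n, rho)
    cov = np.eye(n) + lam * np.outer(s, s)
    return rng.multivariate_normal(np.zeros(n), cov, size=m), s

n, rho, lam, m = 500, 0.05, 1.0, 20_000
X, s = sample_alternate(n, rho, lam, m)
# Variance along the planted direction: s^T (I + lam s s^T) s = ||s||^2 (1 + lam ||s||^2)
print(np.var(X @ s) / (s @ s))   # ≈ 1 + lam * ||s||^2
```

Under the null one would instead draw X from N(0, I_n); the testing problem asks whether the two sample sets can be distinguished.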
Problem 8.16 (Sparse PCA with Wishart Noise). For a positive integer n, ρ ∈ [0, 1] and λ ∈ [0, ∞), the sparse PCA with Wishart noise problem is the following many-vs-one hypothesis testing problem:
• Null: m i.i.d. samples from the standard normal Gaussian, i.e. D_∅ = N(0, I_n).
• Alternate: m i.i.d. samples from a Gaussian with randomly spiked covariance. Specifically, sample a vector s via the following process. First draw s′ ∈ {−1, 0, 1}^n so that each entry of s′ is independent and distributed as s′_i = 0 with probability 1 − ρ and s′_i = ±1 each with probability ρ/2. Then, if ‖s′‖² > 2ρn, let s = 0; otherwise let s = s′/√(ρn). Finally, draw m samples from D_s = N(0, I_n + λss^⊤).
Denote the distribution over s by S_ρ. The choice of constant 2 in this model is arbitrary and can be replaced by any constant larger than 1. By a Chernoff bound, for ρ = ω(1/n), s ≠ 0 with high probability. Note that this problem is naturally stated as a multi-sample problem.
Unfortunately, while the null hypothesis for this problem is the standard normal Gaussian, it does not cleanly fit into the framework of Theorem 6.1, as the alternate hypotheses are not additive shifts of N(0, I_n). However, the (d, k)-LDLR_m for this problem still has a nice form, which allows us to apply our main theorem. Recall that the Hermite basis for D_∅^{⊗t} is the set of polynomials over (R^n)^t given by {H_α}, where H_α is parametrized by multi-indices α = (α^1, …, α^t) ∈ (N^n)^t. For any multi-index α ∈ N^n and any x ∈ R^n, let x^α = ∏_{i=1}^n x_i^{α_i}. Then we have the following bound from [BKW19]:
Lemma 8.17 (Lemma 5.8 in [BKW19]). Let (α^1, …, α^t) ∈ (N^n)^t. Then we have:
(E_{u∼S_ρ} ⟨D_u, H_α⟩)² = λ^{Σ_{i=1}^t |α^i|} · ∏_{i=1}^t ((|α^i| − 1)!!)²/α^i! · (E_{u∼S_ρ} u^{Σ_{i=1}^t α^i})² if all the |α^i| are even, and 0 otherwise.
As a result, we have the following:
Lemma 8.18.
Let t, d ∈ N. Suppose that nρ ≥ 1 and that dtλ ≤ ρn. Then we have:
‖E_{u∼S_ρ}(D_u^{d} − 1)^{⊗t}‖² ≤ (dkλ/(ρn))^t.
We prove Lemma 8.18 in Appendix D.3. Together with Claim 3.3, this immediately implies:
Corollary 8.19.
Let t, d be as in Lemma 8.18. Let m be so that m ≤ ρn/(λdk). Then
‖E_{u∼S_ρ}(D_u^{⊗m})^{d,k} − 1‖² ≤ O(1).
We now seek to bound the norm of the high-degree part of the correlation. To do so, we rely on the following lemma:
Lemma 8.20 ([BKW19]). Let φ(x) = (1 − x)^{−1/2}, and let φ_d(x) = Σ_{ℓ=0}^{d} binom(2ℓ, ℓ)(x/4)^ℓ and φ_{>d}(x) = Σ_{ℓ=d+1}^{∞} binom(2ℓ, ℓ)(x/4)^ℓ denote the low-degree approximation and the approximation error of the degree-d Taylor approximation to φ(x) at zero, respectively. Then
‖E_{u∼S_ρ} D_u^{>d}‖² = E_{u,v∼S_ρ}[φ_{>⌊d/2⌋}(λ²⟨u, v⟩²)].
As a result, we obtain the following bound:
Lemma 8.21.
Assume that n ≥ k(d + 1)/ρ. For λ < 1/2 and d even, we have:
‖E_{u∼S_ρ}(D_u^{>d})^{⊗k}‖² ≤ (λ/(ρn))^{k(d+1)}.
The proof closely resembles the proof of Lemma 6.2, and we defer it to Appendix D.3. Combining Corollary 8.19 and Lemma 8.21 with Theorem 3.1, we obtain:
Corollary 8.22.
Let d, k ∈ N. Let λ ≤ 1/2, let ρ be so that n ≥ k(d + 1)/ρ, and let m be so that m ≤ ρn/(dkλ). Then SDA(S, Θ̃(m/k)) > 2^k.
Comparison to prior work and predictions.
The Wishart model for spiked PCA has two well-studied regimes: the sparse regime, where the sparsity, governed by ρ, is sublinear in n, and the dense regime, where ρ = Θ(1). In the dense regime, the celebrated BBP transition [BAP +
05] gives an exact prediction of when detection is computationally possible, andthe computational limits in terms of the low degree likelihood ratio are known to exactly matchthese predictions [PWB +
18, DKWB19, BKW19]. In particular, it is predicted that when ρ is a fixed universal constant, recovery is possible if and only if m > n/λ². While it is possible to plug the LDLR bounds attained in [BKW19] into the machinery here, it appears to be an inherent limitation of the SDA framework for proving SQ lower bounds that it cannot predict exact (i.e. including constants) thresholds. Thus, while we can attain SQ lower bounds matching the BBP transition up to constants, we cannot prove SQ lower bounds up to the transition. For this reason, the calculations in the previous section primarily focus on the sparse regime. The problem is well-studied in this setting, and the best known sample complexity for this problem is m = Θ((ρn)² log n/λ²) [dBG08, BR13b]. In contrast, information-theoretically m = O((ρn) log n/λ²) samples suffice. There is a slew of evidence [BR13a, HKP +
17, BB19] suggesting that this is the best possible. Note that the SQ lower bounds and LDLR lower bounds we obtain witness this gap, up to logarithmic factors. To the best of our knowledge, prior to our work there were no LDLR lower bounds for sparse PCA in the ρ ≪ 1/√n regime, and existing SQ lower bounds required λ = o(1) and ρ = n^{−1/2} [WGL15]. In this section, we prove LDLR bounds for robustly testing Gaussian mixtures. We use the SDA bounds of [DKS17] in an almost black-box fashion (we must modify their proofs slightly to account for the different notions of statistical dimension considered).
Problem 8.23 (Testing Gaussian Mixture Models). For n, s positive integers and ε ∈ (0, 1), the (1 − ε)-separated Gaussian s-mixture model testing problem is the following hypothesis testing problem:
• Null: N(0, I_n).
• Alternate: uniform over S = {D_U}_{U∈S} for some S ⊂ (R^n)^s, where each D_U for U = (u_1, …, u_s) is a mixture of N(u_1, I), …, N(u_s, I) satisfying the conditions d_TV(D_U, D_∅) > 1/4 and d_TV(N(u_i, I), N(u_j, I)) > 1 − ε for all i ≠ j ∈ [s].
In [DKS17], the authors show lower bounds on the SDA^× for this problem; however, because the lower bounds are for product-SDA, we must make some mild modifications to their proofs. We use the following building blocks:
Lemma 8.24 (Lemma 3.4 of [DKS17]). Suppose A is a distribution over R which matches m moments of N(0, 1). For each u ∈ S^{n−1}, define the distribution with probability density function D_u(x) = A(⟨x, u⟩) · γ_{⊥u}(x), where γ_{⊥u} is the projection of D_∅ = N(0, I_n) orthogonal to u. Letting D̄_u be the relative density of D_u with respect to D_∅, we have that for any u, v ∈ S^{n−1}, |⟨D̄_u, D̄_v⟩ − 1| ≤ |⟨u, v⟩|^{m+1} · ‖Ā‖², for Ā the relative density of A with respect to N(0, 1).
Lemma 8.25 (Lemma 3.7 of [DKS17]). For any c ∈ (0, 1), there is a set S of 2^{Ω(n^c)} unit vectors in R^n so that for each u, v ∈ S with u ≠ v, |⟨u, v⟩| ≤ O(n^{c−1/2}).
Now, we use the following proposition of [DKS17], which selects a distribution A for the GMM testing problem:
Proposition 8.26 (Proposition 4.2 of [DKS17]). For any ε ∈ (0, 1), c ∈ (0, 1), and integer s > 0, there exists a distribution A on R that is a mixture of s Gaussians A_1, …, A_s with d_TV(A_i, A_j) > 1 − ε for all i ≠ j ∈ [s]. Further, ‖Ā‖² ≤ exp(O(s)) log(1/ε) and A agrees with N(0, 1) on 2s − 1 moments, and if we construct {D_u}_{u∈S} as described in Lemmas 8.24 and 8.25, then each D_u is a mixture of s Gaussians and further for all u, v ∈ S, d_TV(D_u, D_v) > 1/2.
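The hidden-direction distributions of Lemma 8.24 are straightforward to sample: draw a standard Gaussian, delete its component along u, and replace it with an independent draw from A. The sketch below is our own illustration; the toy two-component mixture stands in for the moment-matched mixture A of Proposition 8.26 (which it does not reproduce), so only the sampling mechanism is faithful.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_A(size):
    """Stand-in for the 1-D distribution A: a two-component Gaussian mixture.
    (Proposition 8.26's A would match many moments of N(0,1); this toy
    mixture does not, and is used only to illustrate the construction.)"""
    centers = rng.choice([-1.0, 1.0], size=size)
    return centers + rng.standard_normal(size)

def sample_D_u(u, m):
    """m samples from D_u(x) = A(<x, u>) * gamma_{perp u}(x): standard
    Gaussian orthogonal to the unit vector u, with the component along u
    drawn from A."""
    n = len(u)
    g = rng.standard_normal((m, n))       # N(0, I_n)
    g -= np.outer(g @ u, u)               # remove the component along u
    return g + np.outer(sample_A(m), u)   # reinstall it with law A

n, m = 10, 200_000
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
X = sample_D_u(u, m)

v = np.eye(n)[0] - u[0] * u               # some direction orthogonal to u
v /= np.linalg.norm(v)
print(np.var(X @ v))   # ≈ 1: orthogonal marginals stay standard Gaussian
print(np.var(X @ u))   # ≈ 2: the u-marginal has the variance of the toy A
```

Distinguishing D_u from D_∅ thus amounts to finding the hidden direction u along which the one-dimensional marginal deviates from N(0, 1).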
Putting these together, we have the following instance of the GMM testing problem: Problem 8.27 ((1 − ε)-separated GMM testing instance from [DKS17]). For n, ℓ positive integers and any ε ∈ (0, 1), let A be the mixture of ℓ Gaussians described in Proposition 8.26 and let S be the subset of S^{n−1} described in Lemma 8.25 with c = 0.
26. Consider the following instance of the (1 − ε)-separated Gaussian ℓ-mixture model testing problem:
• Null: D_∅ = N(0, I_n).
• Alternate: uniform over the set of distributions S = {D_u}_{u∈S′}, where D_u(x) = A(⟨x, u⟩) · γ_{⊥u}(x) and S′ is the subset of u ∈ S with d_TV(D_u, D_∅) > 1/4 (note |S′| ≥ |S|/2).
We note that Problem 8.27 is a valid instance of the (1 − ε)-separated Gaussian ℓ-mixture testing problem: since by Proposition 8.26 A is a one-dimensional mixture of ℓ Gaussians with pairwise total variation distance > 1 − ε, each D_u is also a mixture of ℓ Gaussians with pairwise total variation distance > 1 − ε. Proposition 8.26 also guarantees that for each u ≠ v, d_TV(D_u, D_v) > 1/2. By the triangle inequality, we have that d_TV(D_u, D_∅) + d_TV(D_v, D_∅) ≥ d_TV(D_u, D_v) > 1/2, which implies that for at least half of u ∈ S, d_TV(D_u, D_∅) > 1/4, and this half is exactly S′. Putting these lemmas together, we have the following easy corollary: Corollary 8.28.
Let ℓ, n be integers with n sufficiently large and n^{ℓ+1} ≤ 2^{n^{1/4}}. Let S = {D_u}_{u∈S′} be as described in Problem 8.27. Then there exists a constant c so that for all integers n sufficiently large, for any q > 0,
SDA(S, ((n/c)^{(ℓ+1)/2}/log(1/ε)) · n^{1/4}/q²) > q.
We have that Pr_{u,v∼S}[u = v] = 1/|S′|. Since Problem 8.27 uses the construction from Lemma 8.25 with c = 0.
26, for n sufficiently large |S′| ≥ 2^{n^{1/4}} and |⟨u, v⟩| ≤ n^{−0.24} for all u ≠ v ∈ S′. Since Lemma 8.24 furnishes a bound on the correlation for u ≠ v, for any event E,
E_{u,v∼µ}[|⟨D̄_u, D̄_v⟩ − 1| | E] ≤ min(1, 1/(|S′| Pr[E])) · ‖Ā‖² + max(0, 1 − 1/(|S′| Pr[E])) · n^{−(ℓ+1)/2} · ‖Ā‖²,
and substituting our bound on |S′|, using that ‖Ā‖² ≤ log(1/ε) · C^ℓ for some constant C, and using the assumption that n^{ℓ+1} ≤ 2^{n^{1/4}}, we have our conclusion. Applying Theorem 4.1, we deduce the following bound: Corollary 8.29.
There exists a real number c > 0 so that for any ε ∈ (0, 1) and integer ℓ, there exists n sufficiently large that for any even integer k ≪ n^{1/10} and any m ≤ (n/c)^{(ℓ+1)/2} ε, the (1 − ε)-separated Gaussian ℓ-mixture model testing problem S = {D_u}_{u∈S} vs. D_∅ described in Problem 8.27 has (∞, k)-LDLR_m bounded by
‖E_{u∼S}(D_u^{⊗m})^{∞,k} − 1‖² ≤ O(1). Proof.
Let m = (n/c)^{(ℓ+1)/2} ε. We notice that |⟨D̄_u, D̄_v⟩ − 1| ≤ exp(O(ℓ)) log(1/ε) ≤ m^{1/2} always, since ε, ℓ are fixed constants. Hence we meet the condition of Theorem 4.1 that ‖E_u(D̄_u − 1)^{⊗k}‖² ≤ m^{k/2}. Applying Corollary 8.28 with q = √(n^{1/4} · m/m′), we have that for all 1 ≤ m′ ≤ m,
SDA(S, m′) > √(n^{1/4} · m/m′) > (m/m′)^k
for any k ≪ n^{1/10}. This concludes the argument.
Comparison with prior work and predictions. The lower bound of Corollary 8.29 is consistent with the SQ lower bounds of [DKS17], suggesting that efficient algorithms for learning a mixture of ℓ Gaussians in n dimensions, each pair separated in total-variation distance, require n^{Ω(ℓ)} samples. Information-theoretically, only poly(n, ℓ) samples are required in this setting, although the information-theoretic sample complexity becomes exponential in ℓ if the Gaussians are not required to have total variation distance close to 1 [MV10]. An algorithm using time and samples n^{poly(ℓ)} is known [MV10]. In this section, we prove an SDA lower bound for a hypothesis testing problem over Gaussian Graphical Models, and then show that this implies an LDLR lower bound for the same problem. We will not succeed in establishing evidence for information-computation gaps: the point of this example is to illustrate the utility of Theorem 4.1 in a setting where LDLR lower bounds are highly intractable while SDA lower bounds are approachable. In Gaussian Graphical Models, we observe samples x_1, …, x_m ∼ N(µ, Θ^{−1}), where Θ is a sparse positive semidefinite matrix; since it is sparse, it is thought of as a graph. The goal is to get algorithms for estimating Θ which do not depend on its condition number, and which take advantage of the graph sparsity. The relevant parameters are the maximum degree d and the non-degeneracy parameter κ := min_{i,j∈[n]} |Θ_ij|/√(Θ_ii Θ_jj). Problem 8.30 (Gaussian Graphical Models: planted d-regular subgraph).
For n > s > d positive integers and κ ∈ R with κ√d < 1, the κ-nondegenerate d-sparse s-planted n-dimensional planted regular subgraph Gaussian Graphical Model ((κ, d, s, n)-prsGGM) problem is the following many-vs-one hypothesis testing problem:
• Null: D_∅ = N(0, I_n).
• Alternate: uniform mixture of D_u = N(0, (I_n + κΔ_u)^{−1}) over u ∼ S, where each u is sampled by choosing s of n indices uniformly at random, and then planting a randomly signed random d-regular graph on those indices (conditioned on the graph having all eigenvalues bounded in magnitude by 2√d), then taking Δ_u to be the adjacency matrix of that graph.
We will prove the following lemma, from which we obtain an LDLR lower bound as a corollary of Theorem 4.1: Lemma 8.31.
For any integer d sufficiently large, any s ≫ d sufficiently large, any n ≫ s sufficiently large, and any κ ∈ (0, 1/√d), the following holds: if S vs. D_∅ is an instance of the (κ, d, s, n)-prsGGM problem, then for any even integer k and q > 0,
SDA(S, (nq/s²)^{2/k} · (exp(sdκ²) − 1)^{−1}) > q,
and further,
E_{u,v}⟨D̄_u, D̄_v⟩^k ≤ ((s²/n)^{2/k} · (exp(sdκ²) − 1))^k.
We give the proof of this lemma in Appendix D.4. Combining Lemma 8.31 with Theorem 4.1 gives us the following corollary: Corollary 8.32.
For any integer d sufficiently large, any s ≫ d sufficiently large, any n ≫ s sufficiently large, and any κ ∈ (0, 1/√d), the following holds: if S vs. D_∅ is an instance of the (κ, d, s, n)-prsGGM problem, then for any even integers k, t and m ≤ (n/s²)^{2/k} · (exp(sdκ²) − 1)^{−1} with sdκ² ≤ k ≤ log m, the m-sample (t, Ω(k))-LDLR_m is bounded:
‖E_{u∼S}(D_u^{⊗m})^{t,k/2} − 1‖² ≤ O(1).
For an arbitrary Gaussian Graphical Model with maximum degree d, κ-nondegeneracy, and dimension n, information-theoretically m ≥ log n/κ² samples are required [WWR10], and the fastest known algorithms for m = Θ(κ^{−2} log n) run in time n^{O(d)} [KKMM19], though faster algorithms are known for more structured cases [KKMM19, RWR+]; in general, no n^{o(d)}-time algorithms are known. Our bounds are not strong enough to give evidence for an information-computation gap: for signal-to-noise ratios corresponding to m = Θ(log n/κ²) samples, by choosing s = log n and κ small enough we can rule out SQ algorithms with fewer than √(n/(d log n)) queries, or degree-O(log n/log d) polynomial distinguishers (these bounds degrade as d increases, instead of the other way around). We do not expect that this bound is tight, and our bound from Lemma 8.31 might easily be improved with a more careful analysis. But, because the matrices that we use are well-conditioned, and because there are algorithms for well-conditioned matrices that require fewer samples, it is unlikely that the hypothesis testing problem we consider will give evidence for this information-computation tradeoff, even if analyzed optimally. However, this example does illustrate that it is possible to obtain a bound depending on the sparsity and non-degeneracy; in this, it highlights the usefulness of Theorem 4.1. In the GGM problem, any set of alternate hypotheses S by definition involves Gaussian distributions whose inverse covariance matrices are easy to describe, but the covariance matrices themselves are not; this would make calculating the LDLR directly extremely arduous, even for our toy example of alternate distributions. However, calculating some bound on the SDA is relatively tractable, and Theorem 4.1 lets us draw conclusions for the LDLR.
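To make the alternate hypothesis of Problem 8.30 concrete, here is a sampling sketch. This is our own illustration: we generate the signed d-regular graph as a union of d random perfect matchings with rejection, a stand-in for the paper's random d-regular graph model, and enforce the eigenvalue condition by rejection as well.

```python
import numpy as np

rng = np.random.default_rng(1)

def signed_regular_adjacency(s, d):
    """Randomly signed d-regular adjacency matrix on s vertices (s even),
    built as a union of d random perfect matchings, rejecting multi-edges
    and spectral norm > 2*sqrt(d), as in Problem 8.30."""
    assert s % 2 == 0
    while True:
        A = np.zeros((s, s))
        ok = True
        for _ in range(d):
            pairs = rng.permutation(s).reshape(-1, 2)
            if any(A[a, b] != 0 for a, b in pairs):
                ok = False            # edge already used: restart
                break
            for a, b in pairs:
                A[a, b] = A[b, a] = rng.choice([-1.0, 1.0])
        if ok and np.linalg.norm(A, 2) <= 2 * np.sqrt(d):
            return A

def sample_prsGGM_alternate(n, s, d, kappa, m):
    """m samples from D_u = N(0, (I_n + kappa * Delta_u)^{-1}), with the
    signed d-regular graph planted on s uniformly random coordinates."""
    idx = rng.choice(n, size=s, replace=False)
    Delta = np.zeros((n, n))
    Delta[np.ix_(idx, idx)] = signed_regular_adjacency(s, d)
    Theta = np.eye(n) + kappa * Delta     # sparse, well-conditioned precision matrix
    return rng.multivariate_normal(np.zeros(n), np.linalg.inv(Theta), size=m), Theta

X, Theta = sample_prsGGM_alternate(n=20, s=8, d=3, kappa=0.2, m=50_000)
# Sanity check: the empirical precision matrix should be close to Theta.
max_err = np.max(np.abs(np.linalg.inv(np.cov(X, rowvar=False)) - Theta))
print(max_err)
```

Inverting the empirical covariance, as in the sanity check, ignores the sparsity; the algorithms discussed above instead exploit the graph structure of Θ.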
Theorem 5.2 shows that if, for the hypothesis testing problem T_ρ S vs. D_∅, the (s − 1, k)-LDLR_m is bounded by ε, ‖E_u(D̄_u)^{⊗k}‖² ≤ O(1), and ρ^{−2s} = O(m), then at least 2^k queries to VSTAT(O(m/k)) are necessary. The following example illustrates that this dependence on ρ is tight. Problem 8.33.
The following is the 2^k-subset of s-sparse parities problem:
• Null: D_∅ is uniform over {±1}^n.
• Alternate: For S an arbitrary subset of ([n] choose s) with |S| = 2^k, define S = {D_u}_{u∈S}, where for each u ∈ S we take D_u uniform over x ∼ {±1}^n conditioned on x^u = 1. Claim 8.34.
For any ρ ∈ [−1, 1], T_ρ the standard Boolean noise operator, and any integer m, the many-vs-one 2^k-subset of s-sparse parities problem D_∅ vs. S = {D_u} has ‖E_{u∼S}(T_ρ D_u^{⊗m})^{s−1,∞} − 1‖ = 0. Proof. This is because each D_u has no Fourier mass on degrees 1 through s − 1. Claim 8.35.
For the many-vs-one 2^k-subset of s-sparse parities problem, ‖E_{u∼S}(D_u)^{⊗k}‖² ≤ 2. Proof.
For each u ≠ v, ⟨D̄_u, D̄_v⟩ = 1, and ⟨D̄_u, D̄_u⟩ = 2. We then use the fact that |S| = 2^k to calculate
‖E_u(D_u)^{⊗k}‖² = E_{u,v∼S}⟨D̄_u, D̄_v⟩^k = (1/|S|) · 2^k + (1 − 1/|S|) · 1 ≤ 2.
Together, the above claims demonstrate that we meet the conditions of Theorem 5.2. However, there is also a 2^k-query VSTAT(ρ^{−2s}) algorithm: Claim 8.36.
There is a 2^k-query VSTAT(ρ^{−2s}) algorithm for the ρ-noisy 2^k-subset of s-sparse parities problem, T_ρ S vs. D_∅. Proof.
The algorithm is as follows: for each u ∈ S, make the query φ_u(x) = (1 + x^u)/2. Under the null, E_{D_∅} φ_u = 1/2. Under T_ρ D_u, E_{T_ρ D_u} φ_u = (1 + ρ^s)/2. Thus, a VSTAT(ρ^{−2s}) algorithm can distinguish these cases. Hence, the requirement in Theorem 5.2 that ρ^{−2s} = O(m) is tight.
Acknowledgments
T.S. thanks Ankur Moitra, Alex Wein, Fred Koehler, and Adam Klivans for helpful conversations regarding the nature of statistical query algorithms and the implications of this work.
References [ABDR +
18] Albert Atserias, Ilario Bonacina, Susanna De Rezende, Massimo Lauria, Jakob Nordström, and Alexander Razborov,
Clique is hard on average for regular resolution, Symposium on the Theory of Computing (STOC), 2018. 1 [ACBL12] Ery Arias-Castro, Sébastien Bubeck, and Gábor Lugosi,
Detection of correlations , TheAnnals of Statistics (2012), no. 1, 412–435. 1[ACO08] Dimitris Achlioptas and Amin Coja-Oghlan, Algorithmic barriers from phase tran-sitions , 2008 49th Annual IEEE Symposium on Foundations of Computer Science,IEEE, 2008, pp. 793–802. 9[ACV14] Ery Arias-Castro and Nicolas Verzelen,
Community detection in dense random net-works , The Annals of Statistics (2014), no. 3, 940–969. 1[AGJ +
20] Gerard Ben Arous, Reza Gheissari, Aukosh Jagannath, et al.,
Algorithmic thresholds for tensor pca, Annals of Probability (2020), no. 4, 2052–2087. 8, 9 [AWZ20] Gérard Ben Arous, Alexander S Wein, and Ilias Zadik, Free energy wells and overlap gap property in sparse pca, Conference on Learning Theory, 2020, pp. 479–482. 9 [BAP +
05] Jinho Baik, Gérard Ben Arous, Sandrine Péché, et al.,
Phase transition of the largesteigenvalue for nonnull complex sample covariance matrices , The Annals of Probability (2005), no. 5, 1643–1697. 33[BB19] Matthew Brennan and Guy Bresler, Optimal average-case reductions to sparse pca:From weak assumptions to strong hardness , Conference on Learning Theory, 2019,pp. 469–470. 1, 34[BB20] ,
Reducibility and statistical-computational gaps from secret leakage , Conferenceon Learning Theory (COLT), 2020. 1, 2[BBH18] Matthew Brennan, Guy Bresler, and Wasim Huleihel,
Reducibility and computationallower bounds for problems with planted sparse structure , Conference on Learning The-ory (COLT), 2018. 1, 24, 29[BBH19] ,
Universality of computational lower bounds for submatrix detection , Confer-ence on Learning Theory (COLT), 2019. 1[BBKW19] Afonso S Bandeira, Jess Banks, Dmitriy Kunisky, and Alexander S Wein,
Spectralplanting and the hardness of refuting cuts, colorability,and communities in randomgraphs , arXiv preprint arXiv:2008.12237 (2019). 3[Bei93] Richard Beigel,
The polynomial method in circuit complexity , [1993] Proceedings of theEigth Annual Structure in Complexity Theory Conference, IEEE, 1993, pp. 82–95. 9[BFJ +
94] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, andSteven Rudich,
Weakly learning dnf and characterizing statistical query learning usingfourier analysis , Proceedings of the twenty-sixth annual ACM symposium on Theoryof computing, 1994, pp. 253–262. 9[BGL17] Vijay Bhattiprolu, Venkatesan Guruswami, and Euiwoong Lee,
Sum-of-squares certificates for maxima of random tensors on the sphere, APPROX/RANDOM 2017 (Klaus Jansen, José D. P. Rolim, David Williamson, and Santosh S. Vempala, eds.), LIPIcs, vol. 81, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017, pp. 31:1–31:20. 26[BGS14] G. Bresler, D. Gamarnik, and D. Shah,
Hardness of parameter estimation in graphicalmodels , Neural Information Processing Systems, 2014. 24[BHK +
19] Boaz Barak, Samuel Hopkins, Jonathan Kelner, Pravesh K Kothari, Ankur Moitra,and Aaron Potechin,
A nearly tight sum-of-squares lower bound for the planted cliqueproblem , SIAM Journal on Computing (2019), no. 2, 687–735. 1, 3, 4, 9[BKR +
11] Sivaraman Balakrishnan, Mladen Kolar, Alessandro Rinaldo, Aarti Singh, and LarryWasserman,
Statistical and computational tradeoffs in biclustering , NeurIPS 2011workshop on computational trade-offs in statistical learning, vol. 4, 2011. 8[BKW19] Afonso S Bandeira, Dmitriy Kunisky, and Alexander S Wein,
Computational hardnessof certifying bounds on constrained pca problems , arXiv preprint arXiv:1902.07324(2019). 9, 32, 33[BR13a] Quentin Berthet and Philippe Rigollet,
Complexity theoretic lower bounds for sparse principal component detection, Conference on Learning Theory, 2013, pp. 1046–1066. 1, 34 [BR13b] ,
Optimal detection of sparse principal components in high dimension , TheAnnals of Statistics (2013), no. 4, 1780–1815. 1, 34[CJ13] Venkat Chandrasekaran and Michael I Jordan, Computational and statistical tradeoffsvia convex relaxation , Proceedings of the National Academy of Sciences (2013),no. 13, E1181–E1190. 8[CMP10] Anwei Chai, Miguel Moscoso, and George Papanicolaou,
Array imaging usingintensity-only measurements , Inverse Problems (2010), no. 1, 015005. 1[CRT06] Emmanuel J Candes, Justin K Romberg, and Terence Tao, Stable signal recoveryfrom incomplete and inaccurate measurements , Communications on Pure and AppliedMathematics: A Journal Issued by the Courant Institute of Mathematical Sciences (2006), no. 8, 1207–1223. 1[CSV13] Emmanuel J Candes, Thomas Strohmer, and Vladislav Voroninski, Phaselift: Exactand stable signal recovery from magnitude measurements via convex programming ,Communications on Pure and Applied Mathematics (2013), no. 8, 1241–1274. 1[CT07] Emmanuel Candes and Terence Tao, The Dantzig selector: Statistical estimation whenp is much larger than n , The Annals of Statistics (2007), no. 6, 2313–2351. 1[CX16] Yudong Chen and Jiaming Xu, Statistical-computational tradeoffs in planted problemsand submatrix localization with a growing number of clusters and submatrices , Journalof Machine Learning Research (2016), no. 27, 1–57. 8[dBG08] Alexandre d’Aspremont, Francis Bach, and Laurent El Ghaoui, Optimal solutions forsparse principal component analysis , Journal of Machine Learning Research (2008),no. Jul, 1269–1294. 34[DGR00] Scott E Decatur, Oded Goldreich, and Dana Ron, Computational sample complexity ,SIAM Journal on Computing (2000), no. 3, 854–879. 8[DH20] Rishabh Dudeja and Daniel Hsu, Statistical query lower bounds for tensor PCA , arXivpreprint arXiv:2008.04101 (2020). 8, 9[DKS17] Ilias Diakonikolas, Daniel M Kane, and Alistair Stewart,
Statistical query lower boundsfor robust estimation of high-dimensional gaussians and gaussian mixtures , 2017 IEEE58th Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2017,pp. 73–84. 3, 9, 34, 35, 36[DKS19] Ilias Diakonikolas, Weihao Kong, and Alistair Stewart,
Efficient algorithms and lowerbounds for robust linear regression , Proceedings of the Thirtieth Annual ACM-SIAMSymposium on Discrete Algorithms, SIAM, 2019, pp. 2745–2754. 9[DKWB19] Yunzi Ding, Dmitriy Kunisky, Alexander S Wein, and Afonso S Bandeira,
Subexponential-time algorithms for sparse PCA , arXiv preprint arXiv:1907.11635(2019). 3, 9, 33[DM15] Yash Deshpande and Andrea Montanari,
Improved sum-of-squares lower bounds for hidden clique and hidden submatrix problems, Conference on Learning Theory (COLT), 2015, pp. 523–562. 1 [Don06] David L Donoho,
Compressed sensing , IEEE Transactions on information theory (2006), no. 4, 1289–1306. 1[FB96] Ping Feng and Yoram Bresler, Spectrum-blind minimum-rate sampling and reconstruc-tion of multiband signals , Acoustics, Speech, and Signal Processing, 1996. ICASSP-96.Conference Proceedings., 1996 IEEE International Conference on, vol. 3, IEEE, 1996,pp. 1688–1691. 1[Fei02] Uriel Feige,
Relations between average case complexity and approximation complexity ,Proceedings of the thiry-fourth annual ACM symposium on Theory of computing,ACM, 2002, pp. 534–543. 1[Fel12] Vitaly Feldman,
A complete characterization of statistical query learning with appli-cations to evolvability , Journal of Computer and System Sciences (2012), no. 5,1444–1459. 9[FGR +
17] Vitaly Feldman, Elena Grigorescu, Lev Reyzin, Santosh S Vempala, and Ying Xiao,
Statistical algorithms and a lower bound for detecting planted cliques , Journal of theACM (JACM) (2017), no. 2, 1–37. 1, 3, 6, 8, 9, 25, 27, 28, 45, 46, 47, 48[FGV17] Vitaly Feldman, Cristobal Guzman, and Santosh Vempala, Statistical query algorithmsfor mean vector estimation and stochastic convex optimization , Proceedings of theTwenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SIAM, 2017,pp. 1265–1277. 8, 9[FHT08] J. Friedman, T. Hastie, and R. Tibshirani,
Sparse inverse covariance estimation with the graphical lasso, Biostatistics (2008), no. 3, 432–441. 1 [FK03] Uriel Feige and Robert Krauthgamer, The probable value of the Lovász–Schrijver relaxations for maximum independent set, SIAM Journal on Computing (2003), no. 2, 345–370. 1 [FPV18] Vitaly Feldman, Will Perkins, and Santosh Vempala, On the complexity of random satisfiability problems with planted solutions, SIAM Journal on Computing (2018), no. 4, 1294–1338. 3, 9[GGJ +
20] Surbhi Goel, Aravind Gollakota, Zhihan Jin, Sushrut Karmalkar, and Adam Klivans,
Superpolynomial lower bounds for learning one-layer neural networks using gradientdescent , arXiv preprint arXiv:2006.12011 (2020). 9[GJS19] David Gamarnik, Aukosh Jagannath, and Subhabrata Sen,
The overlap gap propertyin principal submatrix recovery , arXiv preprint arXiv:1908.09959 (2019). 9[GJW20] David Gamarnik, Aukosh Jagannath, and Alexander S Wein,
Low-degree hardness ofrandom optimization problems , arXiv preprint arXiv:2004.12063 (2020). 9[Gri01] Dima Grigoriev,
Linear lower bound on degrees of positivstellensatz calculus proofs forthe parity , Theoretical Computer Science (2001), no. 1-2, 613–622. 9[GS14] David Gamarnik and Madhu Sudan,
Limits of local algorithms over sparse random graphs, Proceedings of the 5th conference on Innovations in theoretical computer science, 2014, pp. 369–376. 9 [GZ19] David Gamarnik and Ilias Zadik,
The landscape of the planted clique problem: Densesubgraphs and the overlap gap property , arXiv preprint arXiv:1904.07174 (2019). 9[HKP +
17] Samuel B Hopkins, Pravesh K Kothari, Aaron Potechin, Prasad Raghavendra, TselilSchramm, and David Steurer,
The power of sum-of-squares for detecting hidden struc-tures , 2017 IEEE 58th Annual Symposium on Foundations of Computer Science(FOCS), IEEE, 2017, pp. 720–731. 8, 9, 26, 34[HKP +
18] Samuel B Hopkins, Pravesh Kothari, Aaron Henry Potechin, Prasad Raghavendra,and Tselil Schramm,
On the integrality gap of degree-4 sum of squares for plantedclique , ACM Transactions on Algorithms (TALG) (2018), no. 3, 28. 1[HL19] Samuel B Hopkins and Jerry Li, How hard is robust mean estimation? , arXiv preprintarXiv:1903.07870 (2019). 14[Hop18] Samuel B Hopkins,
Statistical inference and the sum of squares method , Ph.D. thesis,Cornell University, 2018. 2[HS17] Samuel B Hopkins and David Steurer,
Efficient bayesian estimation from few samples:community detection and related problems , 2017 IEEE 58th Annual Symposium onFoundations of Computer Science (FOCS), IEEE, 2017, pp. 379–390. 3, 9[HSS15] Samuel B Hopkins, Jonathan Shi, and David Steurer,
Tensor principal componentanalysis via sum-of-square proofs , Conference on Learning Theory, 2015, pp. 956–1006. 8[HW20] Justin Holmgren and Alexander S Wein,
Counterexamples to the low-degree conjecture ,arXiv preprint arXiv:2004.08454 (2020). 3[HWX15] Bruce E Hajek, Yihong Wu, and Jiaming Xu,
Computational lower bounds for com-munity detection on random graphs. , Conference on Learning Theory (COLT), 2015,pp. 899–928. 1[IKKM12] Morteza Ibrahimi, Yashodhan Kanoria, Matt Kraning, and Andrea Montanari,
Theset of solutions of random xorsat formulae , Proceedings of the twenty-third annualACM-SIAM symposium on Discrete Algorithms, SIAM, 2012, pp. 760–779. 9[Jer92] Mark Jerrum,
Large cliques elude the metropolis process , Random Structures & Algo-rithms (1992), no. 4, 347–359. 1, 9[JL09] Iain M Johnstone and Arthur Yu Lu, On consistency and sparsity for principal com-ponents analysis in high dimensions , Journal of the American Statistical Association (2009), no. 486, 682–693. 1[JMS04] Haixia Jia, Cris Moore, and Bart Selman,
From spin glasses to hard satisfiable for-mulas , International Conference on Theory and Applications of Satisfiability Testing,Springer, 2004, pp. 199–210. 9[JNS13] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi,
Low-rank matrix completion using alternating minimization, Proceedings of the forty-fifth annual ACM symposium on Theory of computing, ACM, 2013, pp. 665–674. 1 [JOH] Kishore Jaganathan, Samet Oymak, and Babak Hassibi,
Sparse phase retrieval: Con-vex algorithms and limitations , 2013 IEEE International Symposium on InformationTheory. 1[JT18] Ziwei Ji and Matus Telgarsky,
Risk and parameter convergence of logistic regression ,arXiv preprint arXiv:1803.07300 (2018). 1[Kea98] Michael Kearns,
Efficient noise-tolerant learning from statistical queries, Journal of the ACM (JACM) (1998), no. 6, 983–1006.
[KKMM19] Jonathan Kelner, Frederic Koehler, Raghu Meka, and Ankur Moitra, Learning some popular Gaussian graphical models without condition number bounds, arXiv preprint arXiv:1905.01282 (2019).
[KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020).
[KMOW17] Pravesh K. Kothari, Ryuhei Mori, Ryan O'Donnell, and David Witmer, Sum of squares lower bounds for refuting any CSP, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 2017, pp. 132–145.
[KS07] Adam R. Klivans and Alexander A. Sherstov, Unconditional lower bounds for learning intersections of halfspaces, Machine Learning (2007), no. 2-3, 97–114.
[KWB19] Dmitriy Kunisky, Alexander S. Wein, and Afonso S. Bandeira, Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio, arXiv preprint arXiv:1907.11636 (2019).
[LDP07] Michael Lustig, David Donoho, and John M. Pauly, Sparse MRI: The application of compressed sensing for rapid MR imaging, Magnetic Resonance in Medicine (2007), no. 6, 1182–1195.
[LML+17] Thibault Lesieur, Léo Miolane, Marc Lelarge, Florent Krzakala, and Lenka Zdeborová, Statistical and computational phase transitions in spiked tensor estimation, 2017 IEEE International Symposium on Information Theory (ISIT), IEEE, 2017, pp. 511–515.
[LZ20] Yuetian Luo and Anru R. Zhang, Tensor clustering with planted structures: Statistical optimality and computational limits, arXiv preprint arXiv:2005.10743 (2020).
[MM09] Marc Mézard and Andrea Montanari, Information, physics, and computation, Oxford University Press, 2009.
[Mon14] Andrea Montanari, Computational implications of reducing data to sufficient statistics, arXiv e-prints (2014).
[Mon15] Andrea Montanari, Finding one community in a sparse graph, Journal of Statistical Physics (2015), no. 2, 273–299.
[MPW15] Raghu Meka, Aaron Potechin, and Avi Wigderson, Sum-of-squares lower bounds for planted clique, Proceedings of the 47th Annual ACM Symposium on Theory of Computing, ACM, 2015, pp. 87–96.
[MV10] Ankur Moitra and Gregory Valiant, Settling the polynomial learnability of mixtures of Gaussians, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, IEEE, 2010, pp. 93–102.
[MW15] Zongming Ma and Yihong Wu, Computational barriers in minimax submatrix detection, The Annals of Statistics (2015), no. 3, 1089–1116.
[NKB+19] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever, Deep double descent: Where bigger models and more data hurt, arXiv preprint arXiv:1912.02292 (2019).
[PWB+18] Amelia Perry, Alexander S. Wein, Afonso S. Bandeira, and Ankur Moitra, Optimality and sub-optimality of PCA I: Spiked random matrix models, The Annals of Statistics (2018), no. 5, 2416–2451.
[RBE10] Ron Rubinstein, Alfred M. Bruckstein, and Michael Elad, Dictionaries for sparse representation modeling, Proceedings of the IEEE (2010), no. 6, 1045–1057.
[RCLV13] Juri Ranieri, Amina Chebira, Yue M. Lu, and Martin Vetterli, Phase retrieval for sparse signals: Uniqueness conditions, arXiv preprint arXiv:1308.3058 (2013).
[RFP10] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review (2010), no. 3, 471–501.
[RM14] Emile Richard and Andrea Montanari, A statistical model for tensor PCA, Advances in Neural Information Processing Systems, 2014, pp. 2897–2905.
[Ros08] Benjamin Rossman, On the constant-depth complexity of k-clique, Proceedings of the 40th Annual ACM Symposium on Theory of Computing, ACM, 2008, pp. 721–730.
[Ros14] Benjamin Rossman, The monotone complexity of k-clique on random graphs, SIAM Journal on Computing (2014), no. 1, 256–279.
[RRS17] Prasad Raghavendra, Satish Rao, and Tselil Schramm, Strongly refuting random CSPs below the spectral threshold, Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 2017, pp. 121–131.
[RSS18] Prasad Raghavendra, Tselil Schramm, and David Steurer, High-dimensional estimation via sum-of-squares proofs, arXiv preprint arXiv:1807.11419 (2018).
[RWR+11] Pradeep Ravikumar, Martin J. Wainwright, Garvesh Raskutti, and Bin Yu, High-dimensional covariance estimation by minimizing ℓ₁-penalized log-determinant divergence, Electronic Journal of Statistics (2011), 935–980.
[Ser99] Rocco A. Servedio, Computational sample complexity and attribute-efficient learning, Proceedings of the 31st Annual ACM Symposium on Theory of Computing, ACM, 1999, pp. 701–710.
[SHN+18] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro, The implicit bias of gradient descent on separable data, The Journal of Machine Learning Research (2018), no. 1, 2822–2878.
[SSS08] Shai Shalev-Shwartz and Nathan Srebro, SVM optimization: inverse dependence on training set size, Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 928–935.
[SSST12] Shai Shalev-Shwartz, Ohad Shamir, and Eran Tromer, Using more data to speed-up training time, Artificial Intelligence and Statistics (AISTATS), 2012, pp. 1019–1027.
[SW20] Tselil Schramm and Alexander S. Wein, Computational barriers to estimation from low-degree polynomials, arXiv preprint arXiv:2008.02269 (2020).
[SWW12] Daniel A. Spielman, Huan Wang, and John Wright, Exact recovery of sparsely-used dictionaries, Conference on Learning Theory (COLT), 2012.
[WEAM19] Alexander S. Wein, Ahmed El Alaoui, and Cristopher Moore, The Kikuchi hierarchy and tensor PCA, 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2019, pp. 1446–1468.
[WGL15] Zhaoran Wang, Quanquan Gu, and Han Liu, Sharp computational-statistical phase transitions via oracle computational model, arXiv preprint arXiv:1512.08861 (2015).
[WWR10] Wei Wang, Martin J. Wainwright, and Kannan Ramchandran, Information-theoretic bounds on model selection for Gaussian Markov random fields, 2010 IEEE International Symposium on Information Theory, IEEE, 2010, pp. 1373–1377.
[ZK16] Lenka Zdeborová and Florent Krzakala, Statistical physics of inference: thresholds and algorithms, Advances in Physics (2016), no. 5, 453–552.
[ZX18] Anru Zhang and Dong Xia, Tensor SVD: Statistical and computational limits, IEEE Transactions on Information Theory (2018).
A SDA, Product-SDA, and Simple-vs-Simple Hypothesis Testing
We make several remarks here on technical differences between our hypothesis testing and statistical dimension setup and those of [FGR+17]. Our notion of statistical dimension bounds the conditional expectation $\mathbb{E}\big[\,|\langle \bar D_u, \bar D_v\rangle - 1|\,\big|\,A\big]$ for all events $A$ in the joint distribution of $u, v \sim \mu$, while [FGR+17] considers only $A$ of the form $A = B \otimes B$ for some event $B$ in $\mu$. Our version corresponds to a stronger computational model, in the sense that a lower bound on $\mathrm{SDA}(\mathcal S, m)$ implies a lower bound on the statistical dimension of [FGR+17]. Furthermore, the testing problems of [FGR+17] are many vs. one (simple vs. composite) hypothesis testing problems, but in Appendix A.2 we show that statistical dimension implies lower bounds on SQ algorithms in our simple vs. simple hypothesis testing setting as well. Notationally, we write $\mathrm{SDA}(\mathcal S, m)$ where [FGR+17] writes $\mathrm{SDA}(\mathcal S, D_\emptyset, m)$; the difference between these two settings is the presence of the prior $\mu$. For consistency with [FGR+17], we use $\Pr(A) > 1/q^2$ in our definition, rather than the more natural $\Pr(A) > 1/q$.

A.1 Counterexample to Equivalence of Two Notions of Statistical Dimension

In this appendix we construct a testing problem which shows that the definition of statistical dimension we use in this paper can differ from the statistical dimension of [FGR+17]. Let $D_\emptyset$ vs. $\mathcal S$ be a testing problem with prior $\mu$. For $D_u, D_v \in \mathcal S$, we write as usual the relative density $\bar D_u(x) = D_u(x)/D_\emptyset(x)$ (and $\bar D_v$ for $v$), and the inner product $\langle \bar D_u, \bar D_v\rangle = \mathbb{E}_{x\sim D_\emptyset} \bar D_u(x)\bar D_v(x)$. We have used the following notion of statistical dimension:

Definition A.1 (SDA). $\mathrm{SDA}(\mathcal S, m) = \max\big\{q \in \mathbb{N} : \mathbb{E}_{u,v\sim\mu}\big[\,|\langle \bar D_u, \bar D_v\rangle - 1|\,\big|\,A\big] \le \frac{1}{m}$ for all events $A$ s.t. $\Pr_{u,v\sim\mu}(A) > \frac{1}{q^2}\big\}$.

The work [FGR+17] employs a different, weaker notion, which we term product-SDA or $\mathrm{SDA}^\times$ to distinguish it from the above:

Definition A.2 (Product SDA). $\mathrm{SDA}^\times(\mathcal S, m) = \max\big\{q \in \mathbb{N} : \mathbb{E}_{u,v\sim\mu}\big[\,|\langle \bar D_u, \bar D_v\rangle - 1|\,\big|\,u \in A_u, v \in A_v\big] \le \frac{1}{m}$ for all events $A_u, A_v$ s.t. $\Pr_{u\sim\mu}(A_u), \Pr_{v\sim\mu}(A_v) > \frac{1}{q}\big\}$.

In the definition of product-SDA, the event $\{u \in A_u\} \wedge \{v \in A_v\}$ is a product of events occurring for single samples $u, v \sim \mu$, rather than an arbitrary event over the joint distribution of two samples $u, v \sim \mu$. In the definition of SDA, we use $1/q^2$ so that the event $A$ has probability equal to the probability of the event $\{u \in A_u, v \in A_v\}$ when $u \in A_u$ has probability $1/q$ according to $\mu$. Since the value of the product-SDA is the value of an optimization problem whose constraints range over a smaller class of events than in our notion of SDA, it is clear that $\mathrm{SDA}^\times(m) \ge \mathrm{SDA}(m)$. We will sketch a proof of the following claim, which demonstrates an example for which this inequality is far from equality.

Claim A.3.
For every $n \in \mathbb{N}$ there is a number $t(n)$ and a family $\mathcal S = \{D_i\}_{i\in[n]}$ of distributions over $[n]$ such that, for the hypothesis testing problem $\mathcal S$ vs. $D_\emptyset$ with $D_\emptyset$ the uniform distribution over $[n]$, $\mathrm{SDA}(\mathcal S, t(n)) \le O(1)$ while $\mathrm{SDA}^\times(\mathcal S, t(n)) \ge n^{\Omega(1)}$.

We turn to our construction. Regarding notation in what follows: for vectors in $\mathbb{R}^n$, which we typically denote by lower-case letters, $\langle v, w\rangle$ is the usual Euclidean inner product $\langle v, w\rangle = \sum_{i\le n} v_i w_i$. For functions $F : [n] \to \mathbb{R}$, which we denote by upper-case letters, $\langle F, G\rangle$ is given by $\mathbb{E}_{i\sim[n]} F(i)G(i)$ (this is merely a difference in normalization). We will use the following claim.

Claim A.4. Let $v_1, \ldots, v_n \in \mathbb{R}^n$. Let $v_{\max} = \max_i \|v_i\|_\infty$ be the largest-magnitude entry in any $v_i$, and let $\alpha = \max_i |\langle v_i, \mathbf{1}\rangle|/\sqrt{n}$, where $\mathbf{1}$ denotes the all-1's vector. Then there exists a family of distributions $D_1, \ldots, D_n$ on $[n]$ such that, if $\bar D_i$ is the density of $D_i$ relative to the uniform distribution on $[n]$, then $\langle \bar D_i, \bar D_j\rangle - 1 = \frac{1}{4 n v_{\max}^2}\left(\langle v_i, v_j\rangle \pm \alpha^2\right)$.

Proof. Let $w_i = v_i - \langle v_i, \mathbf{1}\rangle\cdot\mathbf{1}/n$. By construction, $\langle w_i, \mathbf{1}\rangle = 0$. Let $\bar D_i : [n] \to \mathbb{R}$ be the function $\bar D_i(k) = \frac{1}{2 v_{\max}}(w_{ik} + 2 v_{\max})$. Then by construction $\mathbb{E}_{k\sim[n]} \bar D_i(k) = 1$ and $\bar D_i(k) \ge 0$ for all $i, k$, so $\bar D_i$ is a density relative to the uniform distribution on $[n]$. Furthermore,
$$\mathbb{E}_{k\sim[n]} \bar D_i(k)\bar D_j(k) - 1 = \frac{1}{4 n v_{\max}^2}\langle w_i, w_j\rangle = \frac{1}{4 n v_{\max}^2}\left(\langle v_i, v_j\rangle - \langle v_i, \mathbf{1}\rangle\langle v_j, \mathbf{1}\rangle/n\right) = \frac{1}{4 n v_{\max}^2}\left(\langle v_i, v_j\rangle \pm \alpha^2\right)$$
as desired.

Now we will construct a random testing problem and sketch its analysis. Let $G$ be an $n\times n$ symmetric matrix with i.i.d. entries from $\mathcal N(0,1)$ and let $M = G + 3\sqrt{n}\,I$. With probability at least 0.99 the following all hold (by standard concentration of measure):

• $M \succeq 0$, since the least eigenvalue of $G$ is at most $2\sqrt{n}$ in magnitude, with high probability.
• If $v_1, \ldots, v_n \in \mathbb{R}^n$ are such that $\langle v_i, v_j\rangle = M_{ij}$, then $|\langle v_i, \mathbf{1}\rangle|/\sqrt{n} \le O(\sqrt{\log n}/n^{1/4})$ for all $i$, by rotation-invariance of $M$.
• $\max_i \|v_i\|_\infty \le O(\sqrt{\log n}/n^{1/4})$, again by rotation invariance.

Let $\beta = \max_i \|v_i\|_\infty$. By Claim A.4, there is a family of distributions $D_1, \ldots, D_n$ on $[n]$ such that for all $i, j$,
$$\left|\langle \bar D_i, \bar D_j\rangle - 1\right| = \left|\frac{1}{4 n \beta^2}\left(\langle v_i, v_j\rangle \pm O(\log n/\sqrt{n})\right)\right|.$$
Now, for all constant $q$, we can find a subset of $n^2/q^2$ entries of $M$ such that $M_{ij} = \langle v_i, v_j\rangle \ge \Omega(\sqrt{\log q})$. So there is some constant $C$ such that for all constant $q$,
$$\mathrm{SDA}\left(\{D_i\},\ \frac{C n \beta^2}{\sqrt{\log q}}\right) \le q.$$
On the other hand, we consider product-SDA; we aim to show that $\mathrm{SDA}^\times(\{D_i\}, C n\beta^2/\sqrt{\log q}) \gg q$. Take any subset $S \subseteq [n]$ of size $s$. Then
$$\frac{1}{4 n\beta^2}\,\mathbb{E}_{i,j\sim S}\left|\langle v_i, v_j\rangle \pm O(\log n/\sqrt{n})\right| \le \frac{1}{4 n\beta^2}\left[(1 \pm o(1))\,\mathbb{E}_{g\sim\mathcal N(0,1)}|g| + \frac{1}{s}\cdot O(\sqrt{n}) + O(\log n/\sqrt{n})\right].$$
We can take $s$ as small as $n^{1-\Omega(1)}$ and still have $\mathbb{E}_{i,j\sim S}\left|\langle \bar D_i, \bar D_j\rangle - 1\right| \ll \frac{\sqrt{\log q}}{C n\beta^2}$, so $\mathrm{SDA}^\times(\{D_i\}, C n\beta^2/\sqrt{\log q}) \ge n^{\Omega(1)}$.

A.2 Statistical Dimension as a Lower Bound for Hypothesis Testing
Here, we extend the argument of [FGR+17], which relates the product statistical dimension to the SQ complexity of many-to-one hypothesis testing, to simple hypothesis tests and to our more powerful notion of statistical dimension.
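For reference in the proof below, the quantity that decides whether a single query separates the two distributions is the VSTAT answer tolerance. A minimal sketch of this bookkeeping (the helper names are ours, not from the paper; the tolerance is the standard VSTAT tolerance of [FGR+17], with the VSTAT$(1/t)$ parametrization used in this appendix):

```python
import math

def vstat_tolerance(p, t):
    # A VSTAT(1/t) oracle, queried with a [0,1]-valued function of true mean p,
    # may return any value within max(t, sqrt(t * p * (1 - p))) of p.
    return max(t, math.sqrt(t * p * (1 - p)))

def query_separates(p_null, p_alt, t):
    # A single query is only guaranteed to distinguish the null from the
    # alternative if its two means differ by more than the tolerance at both.
    gap = abs(p_alt - p_null)
    return gap > max(vstat_tolerance(p_null, t), vstat_tolerance(p_alt, t))
```

For example, a query with null mean 0.5 and alternative mean 0.6 separates the two at $t = 0.01$ (tolerance 0.05) but not at $t = 0.04$ (tolerance 0.1).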
Theorem A.5.
Let $\mathcal S = \{D_u\}$ vs. $D_\emptyset$ be a hypothesis testing problem with prior $\mu$ on $\mathcal S$, and let $q \in \mathbb{N}$ and $t > 0$. If $\mathrm{SDA}(3/t) > q$, then no $q$-query VSTAT$(1/t)$ algorithm solves the hypothesis testing problem $\mathcal S$ vs. $D_\emptyset$.

Proof. We prove the contrapositive. Let the distributions be supported on $\mathcal X$. Suppose there is a $q$-query VSTAT$(1/t)$ algorithm for the testing problem. Then there must be some $h : \mathcal X \to [0,1]$ which distinguishes $D_\emptyset$ and $D_u \sim \mathcal S$ with probability at least $1/q$ over the choice of $D_u$, given oracle access to VSTAT$(1/t)$. Without loss of generality $\mathbb{E}_{D_\emptyset} h \le 1/2$ (otherwise replace $h$ with $1-h$), as this affects the bounds below by a factor of at most 2. Let $a_0 := \mathbb{E}_{D_\emptyset} h$, and let $a_u = \mathbb{E}_{D_u} h$. Whenever $h$ succeeds in distinguishing $D_u$ from $D_\emptyset$, by definition of VSTAT$(1/t)$ we have that for every $u$ for which $h$ is successful,
$$\min\left(\sqrt{t\,a_0(1-a_0)},\ \sqrt{t\,a_u(1-a_u)}\right) \le \left|\langle \bar D_u - 1, h\rangle\right|.$$
By Lemma 3.5 of [FGR+17] (a simple calculation), using the fact that $a_0 \le 1/2$, this further implies that
$$\sqrt{\frac{t\,a_0}{3}} \le \left|\langle \bar D_u - 1, h\rangle\right|.$$
Now we have that
$$\Pr_{u\sim\mu}\left[h\text{ succeeds on }D_u\right]\cdot\sqrt{\frac{t\,a_0}{3}} \le \mathbb{E}_{u\sim\mu}\left[\left|\langle \bar D_u - 1, h\rangle\right|\cdot\mathbb{1}[h\text{ succeeds on }D_u]\right] = \left\langle \mathbb{E}_{u\sim\mu}\left[(\bar D_u - 1)\cdot\mathrm{sign}\left(\langle \bar D_u - 1, h\rangle\right)\cdot\mathbb{1}[h\text{ succeeds on }D_u]\right],\ h\right\rangle$$
$$\le \|h\|\cdot\sqrt{\mathbb{E}_{u,v\sim\mu}\left[\left|\langle \bar D_u - 1,\ \bar D_v - 1\rangle\right|\cdot\mathbb{1}[h\text{ succeeds on }D_u, D_v]\right]} = \sqrt{a_0}\cdot\sqrt{\mathbb{E}_{u,v\sim\mu}\left[\left|\langle \bar D_u, \bar D_v\rangle - 1\right|\cdot\mathbb{1}[h\text{ succeeds on }D_u, D_v]\right]},$$
where in the penultimate step we have applied Cauchy–Schwarz with the worst-case signs, and in the final step we have used that $\|h\|^2 = \mathbb{E}_{D_\emptyset} h^2 \le a_0$ and that $\langle \bar D_u - 1, \bar D_v - 1\rangle = \langle \bar D_u, \bar D_v\rangle - 1$. Now, we square the above expression and divide by $\Pr_{u\sim\mu}[h\text{ succeeds on }D_u]^2$:
$$\frac{t}{3} \le \mathbb{E}_{u,v\sim\mu}\left[\left|\langle \bar D_u, \bar D_v\rangle - 1\right|\ \middle|\ h\text{ succeeds on }D_u, D_v\right],$$
where we have used that $u, v \sim \mu$ independently. Furthermore, again by the independence of $u, v \sim \mu$, $\Pr_{u,v\sim\mu}[h\text{ succeeds on }D_u, D_v] > 1/q^2$. So by definition of SDA, if VSTAT$(1/t)$ succeeds then $\mathrm{SDA}(3/t) \le q$.

B VSTAT Algorithms Imply Low-Degree Distinguishers
In this section, we will give a direct argument that the existence of a VSTAT algorithm implies the existence of a good low-degree algorithm. We will prove the following theorem, which recovers a nearly identical parameter dependence to Theorem 3.1 and successfully transfers lower bounds against low-degree algorithms to statistical query algorithms. However, since SDA is not a characterization of VSTAT, and $q$-query VSTAT$(m)$ algorithms may fail even when $\mathrm{SDA}(m) < q$, Theorem 3.1 is stronger.

Theorem B.1 (VSTAT Algorithms to LDLR). Let $d, k, m, q \in \mathbb{N}$ with $k$ even, and $\tau, \eta \in (0,1)$. Let $D_\emptyset$ be a null distribution over $\mathbb{R}^n$, and let $\mathcal S = \{D_u\}_{u\in S}$ be a collection of alternative probability distributions, with $\bar D_u$ the relative density of $D_u$ with respect to $D_\emptyset$. Suppose that the $k$-sample high-degree part of the likelihood ratio of $\mathcal S$ is bounded by $\|\mathbb{E}_{u\sim S}(\bar D_u^{>d})^{\otimes k}\| \le \delta$. If there is a (randomized) $q$-query VSTAT$(1/\tau)$ algorithm which solves the many-vs-one hypothesis testing problem of $D_\emptyset$ vs. $\mathcal S = \{D_u\}_{u\in S}$ with probability at least $1 - \eta$, then it must follow that
$$\tau \le \frac{4\,q^{2/k}\,k}{m\,(1-\eta)^{2/k}}\cdot\left(\left\|\mathbb{E}_{u\sim S}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^{2/k} + \delta^{2/k}\,m\right).$$

The proof of this theorem will consist of two lemmas. The first uses a VSTAT algorithm to construct a good polynomial test of sample-wise degree $(\infty, k)$.

Lemma B.2. Let $m, q$ be non-negative integers, let $k$ be a non-negative even integer, and let $\tau > 0$ and $\eta \in [0,1]$. If there is a (randomized) $q$-query VSTAT$(1/\tau)$ algorithm which solves the many-vs-one hypothesis testing problem of $D_\emptyset$ vs. $\mathcal S = \{D_u\}_{u\in S}$ with probability at least $1-\eta$, then there is a polynomial $f : (\mathbb{R}^n)^{\otimes m} \to \mathbb{R}$ of sample-wise degree $(\infty, k)$ such that
$$\mathbb{E}_{u\sim S}\,\mathbb{E}_{D_u^{\otimes m}} f \ge (1-\eta)\sqrt{\binom{m}{k}}\cdot\left(\frac{\tau}{2}\right)^{k/2}, \qquad \mathbb{E}_{D_\emptyset^{\otimes m}} f = 0, \qquad\text{and}\qquad \sqrt{\mathbb{E}_{D_\emptyset^{\otimes m}} f^2} \le q.$$
Furthermore, $f = \mathbb{E}_{g\sim\Psi}\sum_{i_1 < \cdots < i_k \in [m]}\prod_{\ell=1}^k g(x_{i_\ell})$ for some distribution $\Psi$ over functions $g : \mathbb{R}^n \to \mathbb{R}$.

Proof. Let $\Psi = (\psi_1, \ldots, \psi_q)$ with $\psi_t : \mathbb{R}^n \to [0,1]$ be any sequence of $q$ statistical queries, and without loss of generality assume that $0 < \mathbb{E}_{D_\emptyset}\psi_t \le 1/2$ for all $t \in [q]$. Call $p_t = \mathbb{E}_{D_\emptyset}\psi_t$, and define $\bar\psi_t(x) := \frac{1}{\sqrt{p_t}}(\psi_t(x) - p_t)$, the re-centered and re-normalized version of $\psi_t$, so that $\mathbb{E}_{D_\emptyset}\bar\psi_t(x) = 0$ and $\mathbb{E}_{D_\emptyset}\bar\psi_t(x)^2 \le 1$. Define $f_\Psi : (\mathbb{R}^n)^{\otimes m} \to \mathbb{R}$ by
$$f_\Psi(x_1, \ldots, x_m) = \sum_{t=1}^q \frac{1}{\sqrt{\binom{m}{k}}}\sum_{i_1 < \cdots < i_k \in [m]}\prod_{\ell=1}^k \bar\psi_t(x_{i_\ell}).$$
Each of the $q$ summands has mean $0$ and second moment at most $1$ under $D_\emptyset^{\otimes m}$. Therefore, for any distribution $Q$ over $\Psi$,
$$\mathbb{E}_{D_\emptyset^{\otimes m}}\left[\mathbb{E}_{\Psi\sim Q} f_\Psi\right] = 0, \qquad\text{and}\qquad \mathbb{E}_{D_\emptyset^{\otimes m}}\left[\left(\mathbb{E}_{\Psi\sim Q} f_\Psi\right)^2\right] \le q^2.$$
Take $Q$ to be a distribution over $\Psi$ so that with probability at least $1-\eta$ over $u \sim S$, the queries in $\Psi$ give a VSTAT$(1/\tau)$ algorithm for distinguishing $D_u, D_\emptyset$; that is, with probability at least $1-\eta$ over $u \sim S$, $\Psi \sim Q$, we have the event
$$\mathcal E := \left\{\max_{t\in[q]}\left|\mathbb{E}_{D_u}\psi_t - \mathbb{E}_{D_\emptyset}\psi_t\right| > \max\left(\tau, \sqrt{\tau\,p_t(1-p_t)}\right)\right\} \implies \left\{\max_{t\in[q]}\left|\mathbb{E}_{D_u}\bar\psi_t\right| > \sqrt{\frac{\tau}{2}}\right\},$$
where we have used the definition of $\bar\psi_t$ and the fact that $(1-p_t) \ge 1/2$ by assumption. This implies
$$\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes m}}\,\mathbb{E}_{\Psi\sim Q} f_\Psi = \mathbb{E}_u\,\mathbb{E}_{\Psi\sim Q}\sum_{t=1}^q \sqrt{\binom{m}{k}}\left(\mathbb{E}_{D_u}\bar\psi_t\right)^k \ge (1-\eta)\sqrt{\binom{m}{k}}\cdot\left(\sqrt{\frac{\tau}{2}}\right)^k,$$
where we use the law of conditional expectation and the fact that $k$ is even (so every term is non-negative) to drop the contribution outside the event $\mathcal E$, and then the implication of $\mathcal E$. Letting $f := \mathbb{E}_{\Psi\sim Q} f_\Psi$, our conclusion now follows by linearity of expectation.

We now will show that if the $k$-sample high-degree part of the likelihood ratio of $\mathcal S$ is bounded, then a good polynomial test of sample-wise degree $(\infty, k)$ also implies one of sample-wise degree $(d, k)$. We remark that the resulting test is not necessarily the degree-$(d,k)$ projection $f^{\le d,k}$ of the degree-$(\infty, k)$ test $f$. We instead bound the distance between $f$ and $f^{\le d,k}$ directly by the $(d,k)$-$\mathrm{LDLR}_m$. This amounts to showing that if $f$ and $f^{\le d,k}$ are far, then there must be a different good polynomial test of sample-wise degree $(d,k)$. This argument is carried out below.

Lemma B.3.
Let $D_\emptyset$ vs. $\mathcal S$ be a hypothesis testing problem over $\mathbb{R}^n$, and suppose that the $k$-sample high-degree part of the likelihood ratio of $\mathcal S$ is bounded, $\|\mathbb{E}_{u\sim S}(\bar D_u^{>d})^{\otimes k}\| \le \delta$. Let $\Psi$ be a distribution over functions from $\mathbb{R}^n \to \mathbb{R}$. If $f : (\mathbb{R}^n)^{\otimes m} \to \mathbb{R}$ is a sample-wise degree-$(\infty, k)$ polynomial of the form
$$f(x_1, \ldots, x_m) = \mathbb{E}_{g\sim\Psi}\sum_{i_1 < \cdots < i_k \in [m]}\prod_{\ell=1}^k g(x_{i_\ell}),$$
then
$$2\left(\left\|\mathbb{E}_{u\sim S}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^{2/k} + \delta^{2/k}\binom{m}{k}^{1/k}\right)^{k/2} \ge \frac{\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes m}} f}{\sqrt{\mathbb{E}_{D_\emptyset^{\otimes m}} f^2}}.$$

Proof. Since the samples $x_1, \ldots, x_m \sim D_u^{\otimes m}$ are independent and identically distributed, the moments of $f$ under the $m$-sample distribution $D_u^{\otimes m}$ are within a multiplicative factor of the moments of one of the summands under the $k$-sample distribution $D_u^{\otimes k}$:
$$\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes m}} f = \binom{m}{k}\,\mathbb{E}_{g\sim\Psi}\,\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes k}}\,g^{\otimes k}. \tag{9}$$
By Cauchy–Schwarz, the effect of truncating $\bar D_u$ to degree $d$ is bounded as
$$\left|\mathbb{E}_{g\sim\Psi}\,\mathbb{E}_u\left\langle g^{\otimes k},\ \bar D_u^{\otimes k} - (\bar D_u^{\le d})^{\otimes k}\right\rangle\right| \le \left\|\mathbb{E}_{g\sim\Psi} g^{\otimes k}\right\|\cdot\left\|\mathbb{E}_u\left(\bar D_u^{\otimes k} - (\bar D_u^{\le d})^{\otimes k}\right)\right\|.$$
Now note that $\mathbb{E}_u(\bar D_u^{\le d})^{\otimes k}$ is the orthogonal projection of $\mathbb{E}_u \bar D_u^{\otimes k}$ onto the set of degree-$(d,k)$ polynomials. This set contains all constant polynomials, and the projection of $\mathbb{E}_u \bar D_u^{\otimes k}$ onto the set of constant polynomials is $1$. Combining this with Lemma 3.4, we have
$$\left\|\mathbb{E}_u\left(\bar D_u^{\otimes k} - (\bar D_u^{\le d})^{\otimes k}\right)\right\|^2 \le \left\|\mathbb{E}_u \bar D_u^{\otimes k} - 1\right\|^2 = \mathbb{E}_{u,v}\left(\langle \bar D_u, \bar D_v\rangle - 1\right)^k \le \left(\mathbb{E}_{u,v}\left[\left(\langle \bar D_u^{\le d}, \bar D_v^{\le d}\rangle - 1\right)^k\right]^{1/k} + \delta^{2/k}\right)^k \le \left(\binom{m}{k}^{-1/k}\left\|\mathbb{E}_{u\sim S}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^{2/k} + \delta^{2/k}\right)^k,$$
where the last inequality is from Lemma 3.5. Returning to (9), by linearity of projection to sample-wise degree $(d,k)$ and since $f$ is already sample-wise degree-$(\infty, k)$, we have that
$$\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes m}} f^{\le d,k} \ge \mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes m}} f - \binom{m}{k}\cdot\left\|\mathbb{E}_{g\sim\Psi} g^{\otimes k}\right\|\cdot\left(\binom{m}{k}^{-1/k}\left\|\mathbb{E}_{u\sim S}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^{2/k} + \delta^{2/k}\right)^{k/2}, \tag{10}$$
where we used the independence of the samples to equate $\binom{m}{k}\,\mathbb{E}_{g\sim\Psi}\,\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes k}} g^{\otimes k}$ and $\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes m}} f$. By independence of samples, the terms $\prod_{\ell=1}^k g(x_{i_\ell})$ and $\prod_{\ell=1}^k h(x_{j_\ell})$ are uncorrelated when $x \sim D_\emptyset^{\otimes m}$, unless $i_1, \ldots, i_k = j_1, \ldots, j_k$. Using the fact that for every $g \sim \Psi$, $\mathbb{E}_{D_\emptyset} g = 0$, and the independence of the samples, this implies that
$$\mathbb{E}_{D_\emptyset^{\otimes m}} f^2 = \mathbb{E}_{g,h\sim\Psi}\sum_{i_1 < \cdots < i_k \in [m]}\left\langle g^{\otimes k}, h^{\otimes k}\right\rangle = \binom{m}{k}\left\|\mathbb{E}_{g\sim\Psi} g^{\otimes k}\right\|^2. \tag{11}$$
Putting these together,
$$\left\|\mathbb{E}_{u\sim S}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\| \ge \frac{\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes m}} f^{\le d,k}}{\sqrt{\mathbb{E}_{D_\emptyset^{\otimes m}}(f^{\le d,k})^2}} \ge \frac{\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes m}} f^{\le d,k}}{\sqrt{\mathbb{E}_{D_\emptyset^{\otimes m}} f^2}} \ge \frac{\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes m}} f}{\sqrt{\mathbb{E}_{D_\emptyset^{\otimes m}} f^2}} - \left(\left\|\mathbb{E}_{u\sim S}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^{2/k} + \delta^{2/k}\binom{m}{k}^{1/k}\right)^{k/2}. \tag{12}$$
The first inequality follows from the fact that the left-hand side gives the optimal signal-to-noise ratio among all sample-wise degree-$(d,k)$ polynomials for the distinguishing problem of $D_\emptyset^{\otimes m}$ versus $\mathbb{E}_u D_u^{\otimes m}$ (see Section 2). The second inequality follows since $f^{\le d,k}$ is a projection of $f$ onto a convex set, and the final inequality follows by combining (10) and (11). Finally, note that
$$\left\|\mathbb{E}_{u\sim S}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\| \le \left(\left\|\mathbb{E}_{u\sim S}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^{2/k} + \delta^{2/k}\binom{m}{k}^{1/k}\right)^{k/2}.$$
Applying this after rearranging (12) now completes the proof of the lemma.

Theorem B.1 now follows immediately on applying these two lemmas.
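The null second-moment computation in (11) above rests on the fact that products over distinct index sets are uncorrelated under $D_\emptyset^{\otimes m}$ once $\mathbb{E}_{D_\emptyset} g = 0$. This can be checked exactly in a toy case: take $D_\emptyset$ uniform on $\{\pm 1\}$, a single $g(x) = x$, $m = 4$ and $k = 2$, so that $f = \sum_{i<j} x_i x_j$ and the prediction is $\mathbb{E} f = 0$ and $\mathbb{E} f^2 = \binom{4}{2} = 6$. A small exhaustive check (an illustration of ours, not from the paper):

```python
from itertools import product, combinations
from math import prod, comb

m, k = 4, 2
points = list(product([-1, 1], repeat=m))  # whole support of the null D_0^{⊗m}

def f(xs):
    # f(x_1, ..., x_m) = sum over k-subsets of the product of chosen coordinates
    return sum(prod(xs[i] for i in idx) for idx in combinations(range(m), k))

mean_f = sum(f(xs) for xs in points) / len(points)        # expect 0
mean_f2 = sum(f(xs) ** 2 for xs in points) / len(points)  # expect C(m, k) = 6
```

Only the "diagonal" terms with matching index sets survive in the second moment, which is exactly the cancellation used in (11).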
Proof of Theorem B.1.
Let $f$ be as in Lemma B.2. Combining Lemmas B.2 and B.3 now yields that
$$\frac{1}{q}\,(1-\eta)\sqrt{\binom{m}{k}}\cdot\left(\frac{\tau}{2}\right)^{k/2} \le \frac{\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes m}} f}{\sqrt{\mathbb{E}_{D_\emptyset^{\otimes m}} f^2}} \le 2\left(\left\|\mathbb{E}_{u\sim S}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^{2/k} + \delta^{2/k}\binom{m}{k}^{1/k}\right)^{k/2}.$$
Rearranging and upper bounding $2^{2/k} \le 2$,
$$\tau \le \frac{4\,q^{2/k}}{(1-\eta)^{2/k}\binom{m}{k}^{1/k}}\cdot\left(\left\|\mathbb{E}_{u\sim S}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^{2/k} + \delta^{2/k}\binom{m}{k}^{1/k}\right).$$
The fact that $(m/k)^k \le \binom{m}{k}$ now completes the proof of the theorem.

C Proofs of Cloning Facts
Lemma (Restatement of Lemma 7.2). There is a randomized algorithm taking as input a real number $x$ and outputting $m$ independent random variables $Y_1, \ldots, Y_m$ such that for any $\mu \in \mathbb{R}$, if $x \sim \mathcal N(\mu, 1)$, then $Y_i \sim \mathcal N(\mu/\sqrt{m}, 1)$.

Proof. Let $U \in \mathbb{R}^{m\times m}$ be a matrix with all entries in the first column equal to $1/\sqrt{m}$ and with remaining columns chosen so that $U$ is orthogonal, i.e., $U^\top U = I_m$. Generate independent variables $Z_2, \ldots, Z_m \sim \mathcal N(0,1)$ and let $Z = (x, Z_2, \ldots, Z_m)^\top$. Now put $Y = UZ$. Note that $Z \stackrel{d}{=} \mu\cdot e_1 + W$, where $W \sim \mathcal N(0, I_m)$ and $e_1$ is the first standard basis vector, and the result follows since $U e_1 = \mathbf{1}/\sqrt{m}$ and $UW \stackrel{d}{=} W$.

Lemma (Restatement of Lemma 7.3). There is an algorithm that, when given $m$ independent samples from $G(n, U, \gamma)$ for any $U \subseteq [n]$, efficiently produces a single instance distributed according to $G(n, U, \gamma^m)$. Conversely, there is an efficient algorithm taking a graph as input and producing $m$ random graphs, such that given an instance of planted clique $G(n, U, \gamma)$ with unknown clique position $U$, it produces $m$ independent samples from $G(n, U, \gamma^{1/m})$.

Proof. The first direction is immediate: given $Y_1, \ldots, Y_m \sim G(n, U, \gamma)$, form the graph $X$ by letting $X_e = \prod_{i\in[m]} Y_{i,e}$. For the other direction, we will show how to produce $m$ independent Bernoulli variables with appropriate bias from a single Bernoulli. The claim for planted clique will then follow immediately by applying the procedure to the edge indicators of the input graph.

Suppose that $p \in \{\gamma, 1\}$ for some $\gamma \in [0,1]$; we wish to transform $x \sim \mathrm{Bern}(p)$ to $(y_1, \ldots, y_m) \sim \mathrm{Bern}(p^{1/m})^{\otimes m}$ without knowing which is the true value of $p$. Given input $x = 1$, output $y_1 = \cdots = y_m = 1$. Now suppose $x = 0$. Let $y = v$ for each $v \in \{0,1\}^m \setminus \{\mathbf{1}\}$ with probability $\gamma^{|v|/m}(1-\gamma^{1/m})^{m-|v|}/(1-\gamma)$, where $|v| = \sum v_i$ is the number of ones in $v$. Note that this probability mass function is exchangeable and thus can be sampled in $\mathrm{poly}(m)$ time as follows. First sample the support size $|y| \in \{0, 1, \ldots, m-1\}$, which has distribution explicitly given by $\Pr(|y| = s) = \binom{m}{s}\gamma^{s/m}(1-\gamma^{1/m})^{m-s}/(1-\gamma)$, since the distribution of $y$ is exchangeable. Then produce $y$ by sampling a random binary string in $\{0,1\}^m$ with support size exactly $|y|$, uniformly at random.

To check that the output distribution of $(y_1, \ldots, y_m)$ is indeed $\mathrm{Bern}(p^{1/m})^{\otimes m}$ for $p \in \{\gamma, 1\}$, first observe that if $p = 1$ then $x = 1$ deterministically, and so too are $y_1, \ldots, y_m$. If $p = \gamma$, then
$$\Pr(y = v) = \gamma\cdot\mathbb{1}[v = \mathbf{1}] + (1-\gamma)\cdot\mathbb{1}[v \ne \mathbf{1}]\cdot\frac{\gamma^{|v|/m}(1-\gamma^{1/m})^{m-|v|}}{1-\gamma} = \left(\gamma^{1/m}\right)^{|v|}\left(1-\gamma^{1/m}\right)^{m-|v|},$$
which is precisely the probability mass function of $\mathrm{Bern}(\gamma^{1/m})^{\otimes m}$.

D Omitted Calculations from Applications
In this section, we include the calculations omitted from Section 8.
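As a numerical sanity check on the first of these calculations (Claim 8.2 below), the $k$-sample likelihood-ratio norm $\mathbb{E}_{u,v}\exp(k\lambda^2\langle u, v\rangle^r)$ can be estimated by Monte Carlo, assuming the hypercube prior $u \in \{\pm 1/\sqrt{n}\}^n$ as in Section 8, and compared against the bound $\sqrt{2\pi/(1 - 2k\lambda^2/n)}$. A sketch with purely illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, r, lam = 64, 4, 3, 1.0   # illustrative parameters; note 2*k*lam**2 < n

# For u, v uniform over {±1/sqrt(n)}^n, each product n * u_i * v_i is a
# Rademacher variable, so <u, v> is distributed as a mean of n Rademachers.
trials = 20000
signs = rng.choice([-1.0, 1.0], size=(trials, n))
inner = signs.mean(axis=1)                       # samples of <u, v>
estimate = np.exp(k * lam**2 * inner**r).mean()  # ~ E exp(k λ² <u, v>^r)

bound = np.sqrt(2 * np.pi / (1 - 2 * k * lam**2 / n))
```

Here the estimate is close to 1 while the bound is about 2.7; the bound is loose at these parameters but degrades as $2k\lambda^2$ approaches $n$.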
D.1 Tensor PCA
Claim (Restatement of Claim 8.2). For any integers $k, n$, and $r \ge 2$ with $2k\lambda^2 < n$, the $k$-sample likelihood ratio for the $n$-dimensional $r$-tensor PCA problem with signal strength $\lambda$ is bounded by
$$\left\|\mathbb{E}_{u\sim S}\,\bar D_u^{\otimes k}\right\|^2 \le \sqrt{\frac{2\pi}{1 - 2k\lambda^2/n}}.$$

Proof. To obtain the conclusion, we expand
$$\left\|\mathbb{E}_{u\sim S}\,\bar D_u^{\otimes k}\right\|^2 = \mathbb{E}_{u,v}\,\langle \bar D_u, \bar D_v\rangle^k = \mathbb{E}_{u,v}\,\exp\left(k\lambda^2\langle u, v\rangle^r\right),$$
where for the final equality we have used a simple calculation analogous to that in the proof of Proposition 2.5 of [KWB19]. Since $\langle u, v\rangle$ for $u, v$ sampled uniformly and independently from $S$ is distributed as the mean of $n$ Rademacher random variables, we have that $\Pr[|\langle u, v\rangle| > C/\sqrt{n}] \le 2\exp(-C^2/2)$, and $|\langle u, v\rangle| \le 1$. So we have
$$\mathbb{E}_{u,v}\,\exp\left(k\lambda^2\langle u, v\rangle^r\right) \le \mathbb{E}_{u,v}\,\exp\left(k\lambda^2|\langle u, v\rangle|^r\right) \le \int_0^{\sqrt{n}}\exp\left(k\lambda^2\left(\frac{C}{\sqrt{n}}\right)^2 - \frac{C^2}{2}\right)dC \le \int_0^\infty \exp\left(-\left(\frac{1}{2} - \frac{k\lambda^2}{n}\right)C^2\right)dC \le \sqrt{\frac{2\pi}{1 - 2k\lambda^2/n}},$$
where to obtain the second inequality we have parametrized $|\langle u, v\rangle| = C/\sqrt{n}$ with $C \le \sqrt{n}$ and used that $(C/\sqrt{n})^r \le (C/\sqrt{n})^2$ for $r \ge 2$, and to obtain the final conclusion we have used that $2k\lambda^2 < n$ and the expression for the Gaussian probability density function.

Claim (Restatement of Claim 8.3). For any integers $n, r, k, m$ with $r \ge 2$ and real number $\lambda$ which satisfy $2^{(r+2)/2}\,e\,m\lambda^2 k^{(r-2)/2} \le n^{r/2}$, the $(1,k)$-$\mathrm{LDLR}_m$ for the $m$-sample, dimension-$n$ tensor PCA problem with signal strength $\lambda$ is bounded by
$$\left\|\mathbb{E}_u\left(\bar D_u^{\otimes m}\right)^{1,k} - 1\right\|^2 \le e^{r+1}\,m\lambda^2\,k^{(r-2)/2}\,n^{-r/2}.$$

Proof. For a given $D_u = \mathcal N(\lambda u^{\otimes r}, I_{n^r})$, from $D_u^{\otimes m}$ we have $m$ samples $\{T_i\}_{i=1}^m$ with each $T_i = \lambda u^{\otimes r} + G_i$, where $G_i \sim \mathcal N(0, I_{n^r})$ are independent across samples. We will use the Fourier basis of sample-wise degree $(1, k)$ for $(D_\emptyset^{\otimes m})$,
$$\left\{\chi_S\ \middle|\ S \in \bigcup_{t=1}^k \left([n]^r\right)^t \times \binom{[m]}{t}\right\};$$
that is, for each $S = \{(A_\ell, j_\ell)\}_{\ell=1}^t$, which specifies a collection $(A_1, \ldots, A_t)$ of $t$ indices in $[n]^r$ and $t$ distinct sample indices $(j_1, \ldots, j_t)$ in $[m]$, we take $\chi_S(T_1, \ldots, T_m) = \prod_{\ell=1}^t (T_{j_\ell})_{A_\ell}$. For any such $S$ with $|S| = t$, we may compute
$$\mathbb{E}_u\,\mathbb{E}_{T_1,\ldots,T_m\sim D_u}\left[\chi_S(T_1, \ldots, T_m)\right] = \mathbb{E}_u\prod_{\ell=1}^t\left(\lambda u_{A_\ell} + G^{(\ell)}_{A_\ell}\right) = \left(\frac{\lambda}{(\sqrt{n})^r}\right)^{|S|}\cdot\mathbb{1}[S\text{ is even}],$$
where by "$S$ is even" we mean that the multiset $\cup_{\ell=1}^t A_\ell$ contains every $i \in [n]$ with even multiplicity. This is because the indices $j_1, \ldots, j_t \in [m]$ are all distinct, so any term in the expansion of the product with nonzero degree in the $G^{(\ell)}_{A_\ell}$ variables has expectation $0$, and for any multiset of indices $B \subset [n]^r$, $\mathbb{E}_u\,u_B = 0$ if any index appears in $B$ with odd multiplicity, and $\mathbb{E}_u\,u_B = n^{-r|B|/2}$ otherwise.

The even $S$ of size $t$ for a fixed set of samples $j_1, \ldots, j_t \in \binom{[m]}{t}$ are in bijection with $t$-edge hypergraphs with hyperedges from $[n]^r$ in which every vertex has even degree. Such a hypergraph spans at most $rt/2$ distinct vertices, and there are at most $\frac{(rt)!}{2^{rt/2}(rt/2)!\,(r!)^t}$ ways of choosing an even hypergraph on them according to the configuration model (assign every vertex 2 half-edges, assign every hyperedge $r$ half-edges, and then count the number of distinct matchings), so
$$\left|\{S \mid |S| = t,\ S\text{ even}\}\right| \le \binom{m}{t}\cdot n^{rt/2}\cdot\frac{(rt)!}{2^{rt/2}(rt/2)!\,(r!)^t} \le \left(\frac{em}{t}\right)^t\cdot n^{rt/2}\cdot(2t)^{rt/2},$$
where we have applied Stirling's approximation. Thus, we can bound the LDLR,
$$\left\|\mathbb{E}_u\left(\bar D_u^{\otimes m}\right)^{1,k} - 1\right\|^2 = \sum_{t=1}^k \left|\{S \mid |S| = t,\ S\text{ even}\}\right|\cdot\left(\mathbb{E}_u\,\mathbb{E}_{D_u^{\otimes m}}[\chi_S]\right)^2 \le \sum_{t=1}^k\left(\frac{em}{t}\right)^t n^{rt/2}(2t)^{rt/2}\left(\frac{\lambda}{n^{r/2}}\right)^{2t} = \sum_{t=1}^k\left(\frac{2^{r/2}\,e\,m\lambda^2\,t^{(r-2)/2}}{n^{r/2}}\right)^t \le \sum_{t=1}^k\left(\frac{2^{r/2}\,e\,m\lambda^2\,k^{(r-2)/2}}{n^{r/2}}\right)^t \le e^{r+1}\,m\lambda^2\,k^{(r-2)/2}\,n^{-r/2},$$
where in the final line we have used that $2^{(r+2)/2}\,e\,m\lambda^2 k^{(r-2)/2} \le n^{r/2}$ and the fact that the sum is geometric with ratio at most $1/2$.

D.2 Planted Clique
Claim (Restatement of Claim 8.6). For any $K, N, k, d, m \in \mathbb{N}$, define $\gamma = \frac{(p-q)^2}{q(1-q)}$. Then the $(d,k)$-$\mathrm{LDLR}_m$ for bipartite PDS is bounded, $\left\|\mathbb{E}_{u\sim\mu}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^2 = O_N(1)$, if
$$\frac{K^2}{N}\cdot\max\left\{\frac{m}{N},\ (1+\gamma)^k\right\} = 1 - \Omega_N(1).$$

Proof. We will compute the Fourier coefficients of $\bar D = \mathbb{E}_{u\sim\mu}\bar D_u^{\otimes m}$ as a function on $\{0,1\}^{m\times N}$. For each $m$-tuple of subsets $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$ where $\alpha_i \subseteq [N]$, define the Fourier character
$$\chi_\alpha(x) = \prod_{i=1}^m\prod_{j\in\alpha_i}\frac{x_{ij} - q}{\sqrt{q(1-q)}}$$
for each $x \in \{0,1\}^{m\times N}$. Note that the $\chi_\alpha$ form an orthogonal basis with respect to $D_\emptyset^{\otimes m}$. For each $\alpha$, let $L(\alpha) = \alpha_1\cup\alpha_2\cup\cdots\cup\alpha_m$ and $R(\alpha) = \{i \in [m] : \alpha_i \ne \emptyset\}$. A direct computation yields that the Fourier coefficients of $\bar D$ are given by
$$\hat{\bar D}(\alpha) = \mathbb{E}_{u\sim\mu}\,\mathbb{E}_{x\sim D_u^{\otimes m}}\,\chi_\alpha(x) = \left(\frac{K}{N}\right)^{|L(\alpha)| + |R(\alpha)|}\gamma^{\frac{1}{2}\sum_{i=1}^m|\alpha_i|}.$$
By Parseval's identity, we now have that
$$\left\|\mathbb{E}_{u\sim\mu}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^2 = \sum_{t=1}^k\binom{m}{t}\sum_{1\le|\alpha_1|,\ldots,|\alpha_t|\le d}\hat{\bar D}(\alpha_1, \ldots, \alpha_t, \emptyset, \ldots, \emptyset)^2 = \sum_{t=1}^k\binom{m}{t}\sum_{1\le|\alpha_1|,\ldots,|\alpha_t|\le d}\left(\frac{K}{N}\right)^{2|L(\alpha)|+2t}\gamma^{\sum_{i=1}^t|\alpha_i|}. \tag{13}$$
Here, we have used the fact that $\hat{\bar D}(\alpha) = \hat{\bar D}(\alpha_\sigma)$ where $\alpha_\sigma = (\alpha_{\sigma(1)}, \alpha_{\sigma(2)}, \ldots, \alpha_{\sigma(m)})$ for all $\sigma \in S_m$, by symmetry. Now note that for any fixed $A \subseteq [N]$, we have that
$$\sum_{\substack{1\le|\alpha_1|,\ldots,|\alpha_t|\le d\\ L(\alpha) = A}}\left(\frac{K}{N}\right)^{2|L(\alpha)|+2t}\gamma^{\sum_{i=1}^t|\alpha_i|} \le \left(\frac{K}{N}\right)^{2|A|+2t}\sum_{\substack{1\le|\alpha_1|,\ldots,|\alpha_t|\le d\\ L(\alpha)\subseteq A}}\gamma^{\sum_{i=1}^t|\alpha_i|} = \left(\frac{K}{N}\right)^{2|A|+2t}\left(\sum_{\ell=1}^{\min(d,|A|)}\binom{|A|}{\ell}\gamma^\ell\right)^t \le \left(\frac{K}{N}\right)^{2|A|+2t}(1+\gamma)^{|A|t},$$
using $\sum_{\ell=1}^{\min(d,|A|)}\binom{|A|}{\ell}\gamma^\ell \le \sum_{\ell=0}^{|A|}\binom{|A|}{\ell}\gamma^\ell = (1+\gamma)^{|A|}$. Note that $|L(\alpha)|$ can vary between $1$ and $kd$. The fact that there are $\binom{N}{s} \le N^s$ possible $A$ with a given fixed size $|A| = s$, combined with Equation (13), now yields that
$$\left\|\mathbb{E}_{u\sim\mu}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^2 \le \sum_{t=1}^k\sum_{s=1}^{kd} m^t N^s\left(\frac{K}{N}\right)^{2s+2t}(1+\gamma)^{ts} \le \sum_{t=1}^k\sum_{s=1}^{kd}\left(\frac{K^2 m}{N^2}\right)^t\left(\frac{K^2(1+\gamma)^k}{N}\right)^s,$$
where the second inequality follows from the fact that $(1+\gamma)^{ts} \le (1+\gamma)^{ks}$ and rearranging. Under the given condition, this upper bound is the product of two geometric series with ratios $1 - \Omega_N(1)$, completing the proof of the claim.

Claim (Restatement of Claim 8.7). For any $K, N, k \in \mathbb{N}$, the $k$-sample LR is bounded, $\left\|\mathbb{E}_{u\sim\mu}\bar D_u^{\otimes k}\right\|^2 = O_N(1)$, if
$$\frac{K^2}{N}\cdot\max\left\{\frac{k}{N},\ (1+\gamma)^k\right\} = 1 - \Omega_N(1),$$
where $\gamma = \frac{(p-q)^2}{q(1-q)}$.

Proof.
This follows from Claim 8.6 applied with $d = N$ and $m = k$, and the observation
$$\left\|\mathbb{E}_{u\sim\mu}\bar D_u^{\otimes k}\right\|^2 = \left\|\mathbb{E}_{u\sim\mu}\left(\bar D_u^{\otimes k}\right)^{\le N,k} - 1\right\|^2 + 1,$$
since $(\bar D_u^{\otimes k})^{\le N,k} = \bar D_u^{\otimes k}$ and $\langle\mathbb{E}_{u\sim\mu}\bar D_u^{\otimes k}, 1\rangle = 1$.

Claim (Restatement of Claim 8.12). For any $s, K, N, k, d, m \in \mathbb{N}$, the $(d,k)$-$\mathrm{LDLR}_m$ for multi-sample hypergraph PC satisfies $\left\|\mathbb{E}_{u\sim\mu}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^2 = O_N(1)$ if the following conditions are satisfied:
$$\gamma\cdot\max\left\{m,\ (ksd)^s\right\} = O_N(1) \qquad\text{and}\qquad \frac{2^{sk}\,e^2 K^2}{N} = 1 - \Omega_N(1),$$
where $\gamma = \frac{1-q}{q}$.

Proof. Similarly to Claim 8.6, we will compute the Fourier coefficients of $\bar D = \mathbb{E}_{u\sim\mu}\bar D_u^{\otimes m}$ as a function on $\{0,1\}^{m\times\mathcal H}$ where $\mathcal H = \binom{[N]}{s}$. The relevant orthogonal basis of Fourier characters is indexed by $m$-tuples of families of subsets $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$ where $\alpha_i \subseteq \mathcal H$, and given by
$$\chi_\alpha(x) = \prod_{i=1}^m\prod_{e\in\alpha_i}\frac{x_{ie} - q}{\sqrt{q(1-q)}}$$
for each $x \in \{0,1\}^{m\times\mathcal H}$. Given some $\alpha_i \subseteq \mathcal H$, let $V(\alpha_i) = \bigcup_{\{v_1, v_2, \ldots, v_s\}\in\alpha_i}\{v_1, v_2, \ldots, v_s\}$ be the vertex set of the hyperedges in $\alpha_i$. Furthermore, let $V(\alpha) = V(\alpha_1)\cup V(\alpha_2)\cup\cdots\cup V(\alpha_m)$ where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)$. Note that $\mathbb{E}_{x\sim D_u^{\otimes m}}\chi_\alpha(x) = 0$ unless $V(\alpha) \subseteq u$, which occurs with probability $\binom{K}{|V(\alpha)|}\big/\binom{N}{|V(\alpha)|}$ if $u \sim \mu$. Therefore the Fourier coefficients of $\bar D$ are given by
$$\hat{\bar D}(\alpha) = \mathbb{E}_{u\sim\mu}\,\mathbb{E}_{x\sim D_u^{\otimes m}}\chi_\alpha(x) = \frac{\binom{K}{|V(\alpha)|}}{\binom{N}{|V(\alpha)|}}\cdot\gamma^{\frac{1}{2}\sum_{i=1}^m|\alpha_i|} \le \left(\frac{eK}{N}\right)^{|V(\alpha)|}\gamma^{\frac{1}{2}\sum_{i=1}^m|\alpha_i|},$$
where the inequality follows from $(a/b)^b \le \binom{a}{b} \le (ea/b)^b$. The same application of Parseval's identity as in Claim 8.6 now yields that
$$\left\|\mathbb{E}_{u\sim\mu}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^2 \le \sum_{t=1}^k\binom{m}{t}\sum_{1\le|\alpha_1|,\ldots,|\alpha_t|\le d}\left(\frac{eK}{N}\right)^{2|V(\alpha)|}\gamma^{\sum_{i=1}^t|\alpha_i|}.$$
We now have that for any $A \subseteq [N]$,
$$\sum_{\substack{1\le|\alpha_1|,\ldots,|\alpha_t|\le d\\ V(\alpha) = A}}\left(\frac{eK}{N}\right)^{2|V(\alpha)|}\gamma^{\sum_{i=1}^t|\alpha_i|} \le \left(\frac{eK}{N}\right)^{2|A|}\sum_{\substack{1\le|\alpha_1|,\ldots,|\alpha_t|\le d\\ \alpha_i\subseteq\binom{A}{s}}}\gamma^{\sum_{i=1}^t|\alpha_i|} = \left(\frac{eK}{N}\right)^{2|A|}\left(\sum_{\ell=1}^{\min\left(d,\binom{|A|}{s}\right)}\binom{\binom{|A|}{s}}{\ell}\gamma^\ell\right)^t \le \left(\frac{eK}{N}\right)^{2|A|}\gamma^t\binom{|A|}{s}^t(1+\gamma)^{\binom{|A|}{s}t}.$$
The last inequality holds because of the following observation:
$$\sum_{\ell=1}^{\min(d,y)}\binom{y}{\ell}\gamma^\ell \le y\gamma\cdot\sum_{\ell=1}^{\min(d,y)}\binom{y-1}{\ell-1}\gamma^{\ell-1} \le y\gamma(1+\gamma)^y$$
for any $y \in \mathbb{N}$. Note that if $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_t)$ satisfies $1 \le |\alpha_i| \le d$, then $s \le |V(\alpha)| \le ksd$. Given that there are $\binom{N}{a} \le N^a$ sets $A \subseteq [N]$ of a fixed size $|A| = a$, and using $\binom{a}{s} \le a^s$, we have
$$\left\|\mathbb{E}_{u\sim\mu}\left(\bar D_u^{\otimes m}\right)^{\le d,k} - 1\right\|^2 \le \sum_{t=1}^k\sum_{a=s}^{ksd}\frac{m^t}{t!}\cdot N^a\left(\frac{eK}{N}\right)^{2a}\gamma^t a^{st}(1+\gamma)^{a^s t} = \sum_{a=s}^{ksd}\left(\frac{e^2K^2}{N}\right)^a\sum_{t=1}^k\frac{\left(m\gamma\,a^s(1+\gamma)^{a^s}\right)^t}{t!} \le \sum_{a=s}^{ksd}a^{sk}\left(\frac{e^2K^2}{N}\right)^a\sum_{t=1}^k\frac{\left(m\gamma\cdot\exp\left(\gamma k^s s^s d^s\right)\right)^t}{t!} \le \left(\sum_{a=s}^{ksd}\left(\frac{2^{sk}\,e^2K^2}{N}\right)^a\right)\cdot\exp\left(m\gamma\cdot\exp\left(\gamma k^s s^s d^s\right)\right).$$
The second-to-last step follows from the inequalities $a^{st} \le a^{sk}$, $a \le ksd$ and $1+\gamma \le \exp(\gamma)$. The last step follows from the fact that for $x > 0$, $\sum_{t=1}^k x^t/t! \le \exp(x)$, and $a^{sk} \le 2^{ask}$ since $a \ge 1$. The given conditions now imply that the exponential factor is $O_N(1)$ and that the geometric series has ratio $1 - \Omega_N(1)$ and thus is also $O_N(1)$, completing the proof of the claim.

Claim (Restatement of Claim 8.13). For any $K, N, k \in \mathbb{N}$, the $k$-sample LR is bounded, $\left\|\mathbb{E}_{u\sim\mu}\bar D_u^{\otimes k}\right\|^2 = O_N(1)$, if the following conditions are satisfied:
$$2K^2 \le N \qquad\text{and}\qquad \gamma k\cdot K^{s-1} \le \frac{1}{2}\log\left(\frac{N}{K^2}\right),$$
where $\gamma = \frac{1-q}{q}$.

Proof.
Note that $D_u(x) = \prod_{e \in \binom{u}{s}} \frac{x_e}{q}$ for each $x \in \{0,1\}^{\binom{[N]}{s}}$. Therefore we have that
$$\langle D_u, D_v \rangle = \mathop{\mathbb{E}}_{x \sim D_\emptyset} \left[ \prod_{e \in \binom{u \cap v}{s}} \frac{x_e^2}{q^2} \prod_{e \in \binom{u}{s} \Delta \binom{v}{s}} \frac{x_e}{q} \right] = \prod_{e \in \binom{u \cap v}{s}} \frac{\mathbb{E}_{x_e \sim \mathrm{Ber}(q)}[x_e^2]}{q^2} \prod_{e \in \binom{u}{s} \Delta \binom{v}{s}} \frac{\mathbb{E}_{x_e \sim \mathrm{Ber}(q)}[x_e]}{q} = q^{-\binom{|u \cap v|}{s}},$$
where $A \Delta B$ denotes the symmetric difference of the sets $A$ and $B$. Now since $X = |u \cap v|$ is distributed as $\mathrm{Hypergeometric}(N, K, K)$, we have that
$$\left\| \mathop{\mathbb{E}}_{u \sim \mu} D_u^{\otimes k} \right\|^2 = \mathop{\mathbb{E}}_{u,v \sim \mu} \langle D_u, D_v \rangle^k = \mathbb{E}\left[ q^{-k \binom{X}{s}} \right] = \sum_{x=0}^K \frac{\binom{K}{x} \binom{N-K}{K-x}}{\binom{N}{K}} \cdot q^{-k \binom{x}{s}}.$$
Now note that for each $0 \le x \le K$,
$$\frac{\binom{K}{x} \binom{N-K}{K-x}}{\binom{N}{K}} \le \binom{K}{x} \cdot \frac{K(K-1) \cdots (K-x+1)}{N^x} \cdot \left( 1 - \frac{K^2}{N-K+1} \right)^{-1} \le \frac{K^{2x}}{N^x} \left( 1 - \frac{K^2}{N-K+1} \right)^{-1} \le 2 \left( \frac{K^2}{N} \right)^x,$$
where the last inequality follows from the fact that $2K^2 \le N$. Now since $q^{-1} \le \exp(\gamma)$ and $\binom{x}{s} \le x K^{s-1}$ for all $x \le K$, we have that
$$\left\| \mathop{\mathbb{E}}_{u \sim \mu} D_u^{\otimes k} \right\|^2 \le 2 \sum_{x=0}^K \exp\left( k \gamma x K^{s-1} - x \log\left( \frac{N}{K^2} \right) \right) \le 2 \sum_{x=0}^K \left( \frac{K^2}{N} \right)^{x/2} = O_N(1)$$
by the given condition on $\gamma$. This completes the proof of the claim.

D.3 Spiked Wishart PCA
Lemma (Restatement of Lemma 8.18). Let $t, d \in \mathbb{N}$. Suppose that $n\rho^4 \le 1$, and that $2d^2 t \lambda \le \rho^2 n$. Then, we have:
$$\left\| \mathop{\mathbb{E}}_{u \sim S_\rho} \left( D_u^{\le d} - 1 \right)^{\otimes t} \right\|^2 \le \left( \frac{2 d^2 t \lambda}{\rho^2 n} \right)^{2t}.$$

Proof. Fix any multi-index $\alpha = (\alpha_1, \dots, \alpha_t)$ so that $|\alpha_i|$ is even and so that $2 \le |\alpha_i| \le d$, for all $i = 1, \dots, t$. Suppose moreover that $|\{ j : \exists i : \alpha_{ij} \ne 0 \}| = \ell$, and let $s = |\alpha|$. Then the preceding lemma implies that
$$\left( \mathop{\mathbb{E}}_{u \sim S_\rho} \langle D_u, H_\alpha \rangle \right)^2 \le \left( \frac{d\lambda}{\rho^2 n} \right)^s \rho^{4\ell}.$$
The total number of such monomials can be naively upper bounded by $\binom{n}{\ell} \ell^s$. Hence the contribution to the LDLR of all such monomials, for a fixed $\ell$ and $s$, can be upper bounded by
$$\binom{n}{\ell} \ell^s \left( \frac{d\lambda}{\rho^2 n} \right)^s \rho^{4\ell} \le \left( \frac{d \ell \lambda}{\rho^2 n} \right)^s \left( n \rho^4 \right)^\ell \le \left( \frac{d \ell \lambda}{\rho^2 n} \right)^s,$$
by assumption. Summing over all $2t \le s \le dt$ and $1 \le \ell \le dt$, we obtain that
$$\left\| \mathop{\mathbb{E}}_{u \sim S_\rho} \left( D_u^{\le d} - 1 \right)^{\otimes t} \right\|^2 \le \sum_{2t \le s \le dt,\ 1 \le \ell \le dt} \left( \frac{d \ell \lambda}{\rho^2 n} \right)^s \le \left( \frac{2 d^2 t \lambda}{\rho^2 n} \right)^{2t},$$
since from our assumptions, the sum is convergent.

Lemma (Restatement of Lemma 8.21). Assume that $nk(d/2+1)\rho^4 \le 1/2$. For $\lambda < 1/2$ and $d$ even, we have:
$$\left\| \mathop{\mathbb{E}}_{u \sim S_\rho} \left( D_u^{>d} \right)^{\otimes k} \right\|^2 \le \left( \frac{\lambda^2}{\rho^4 n^2} \right)^{k(d/2+1)}.$$

Proof.
This proof closely resembles the proof of Lemma 6.2. Let $Z$ be the random variable given by $Z = \lambda^2 \langle u, v \rangle^2$ when $u, v \sim S_\rho$ independently. From the preceding lemma, we have that
$$\left\| \mathop{\mathbb{E}}_{u \sim S_\rho} \left( D_u^{>d} \right)^{\otimes k} \right\|^2 \le \mathop{\mathbb{E}}_Z \left[ \phi^{>d/2}(Z)^k \right].$$
By Taylor's theorem, since the function $\phi(x) = (1-x)^{-1/2}$ is analytic for all $|x| \le 1/4$, we know that
$$\left| \phi^{>d/2}(x) \right| \le \binom{d+2}{d/2+1} x^{d/2+1} (1 - \eta(x))^{-(d+3)/2} \le \binom{d+2}{d/2+1} x^{d/2+1} \phi(x)^{d+3},$$
where $0 \le \eta(x) \le x$, and the last inequality follows since $\phi$ is monotone. Hence
$$\left\| \mathop{\mathbb{E}}_{u \sim S_\rho} \left( D_u^{>d} \right)^{\otimes k} \right\|^2 \le \frac{2^{k(d+2)}}{(1 - \lambda^2)^{k(d+3)/2}} \mathop{\mathbb{E}}_Z \left[ Z^{k(d/2+1)} \right].$$
The moment can only be increased by considering the inner product between the two untruncated vectors. Let $Z'$ be distributed as the untruncated version of $Z$. Then $Z' = \frac{\lambda^2}{\rho^4 n^2} \left( \sum_{i=1}^n Y_i \right)^2$ where each $Y_i$ is independent, $Y_i = 0$ with probability $1 - \rho^4$, $Y_i = 1$ with probability $\rho^4/2$, and $Y_i = -1$ with probability $\rho^4/2$. Hence
$$\mathop{\mathbb{E}}_Z \left[ Z^{k(d/2+1)} \right] \le \mathop{\mathbb{E}}_{Z'} \left[ (Z')^{k(d/2+1)} \right] = \left( \frac{\lambda^2}{\rho^4 n^2} \right)^{k(d/2+1)} \mathop{\mathbb{E}}_{Y_1, \dots, Y_n} \left( \sum_{i=1}^n Y_i \right)^{2k(d/2+1)} = \left( \frac{\lambda^2}{\rho^4 n^2} \right)^{k(d/2+1)} \sum_{|\alpha| = 2k(d/2+1)} \mathbb{E}\left[ Y^\alpha \right]$$
$$\le \left( \frac{\lambda^2}{\rho^4 n^2} \right)^{k(d/2+1)} \sum_{\ell=1}^{k(d/2+1)} \binom{n}{\ell} \cdot \binom{k(d/2+1) + \ell}{\ell} \rho^{4\ell} \le \left( \frac{\lambda^2}{\rho^4 n^2} \right)^{k(d/2+1)} \sum_{\ell=1}^{k(d/2+1)} \left( n k(d/2+1) \rho^4 \right)^\ell \le \left( \frac{\lambda^2}{\rho^4 n^2} \right)^{k(d/2+1)},$$
where the final sum is convergent by assumption.

D.4 Gaussian Graphical Models
Lemma (Restatement of Lemma 8.31). For any integer $d$ sufficiently large, any $s \gg d$ sufficiently large, any $n \gg s$ sufficiently large, and any $\kappa \in \left(0, \frac{1}{2\sqrt{2d}}\right)$, the following holds: If $\mathcal{S}$ vs. $D_\emptyset$ is an instance of the $(\kappa, d, s, n)$-prsGGM problem, then for any even integer $k$ and $q > 0$,
$$\mathrm{SDA}\left( \mathcal{S},\ \left( \frac{n}{q s^2} \right)^{1/k} \left( \exp(4 s d \kappa^2) - 1 \right)^{-1} \right) > q,$$
and further,
$$\mathop{\mathbb{E}}_{u,v} \langle D_u, D_v \rangle^k \le \left( 1 + \left( \frac{s^2}{n} \right)^{1/k} \left( \exp(4 s d \kappa^2) - 1 \right) \right)^k.$$
To prove this lemma, we will make use of the following claim:
Claim D.1. Let $A, B$ be symmetric $n \times n$ real matrices, and let $D_\emptyset = \mathcal{N}(0, I_n)$. Suppose that $I_n + A + B \succ 0$, $I_n + A \succ 0$, and $I_n + B \succ 0$. Let $D_a = \mathcal{N}(0, (I+A)^{-1})$ and $D_b = \mathcal{N}(0, (I+B)^{-1})$, and let $\bar{D}_a, \bar{D}_b$ be the respective relative densities. Then
$$\langle \bar{D}_a, \bar{D}_b \rangle_{D_\emptyset} = \frac{1}{\sqrt{\det\left( I - (I+A)^{-1} A B (I+B)^{-1} \right)}}.$$

Proof.
We have that
$$\langle \bar{D}_a, \bar{D}_b \rangle = \frac{1}{\sqrt{(2\pi)^n \det((I+A)^{-1}) \det((I+B)^{-1})}} \int_{\mathbb{R}^n} \exp\left( -\frac{1}{2} x^\top (I_n + A + B) x \right) dx = \sqrt{\frac{\det((I+A+B)^{-1})}{\det((I+A)^{-1}) \det((I+B)^{-1})}} = \frac{1}{\sqrt{\det\left( I - (I+A)^{-1} A B (I+B)^{-1} \right)}},$$
where the second equality follows by integrating the Gaussian pdf with covariance $(I+A+B)^{-1}$, and the third follows by noting that $\det(X^{-1}) = \det(X)^{-1}$, that $\det(X)\det(Y) = \det(XY)$, and that $I + A + B = (I+A)(I+B) - AB$. This completes the proof.

Proof of Lemma 8.31.
First, since a random signed $d$-regular graph on $s$ vertices has its spectrum within $[-\sqrt{d}(1-\varepsilon), \sqrt{d}(1-\varepsilon)]$ with high probability, for sufficiently large $d$ the condition on the spectrum is met with very high probability, and $\mathcal{S}$ has size at least $\binom{n}{s} \cdot \binom{s}{d}^{s/2}$ (a vast underestimate of the number of $d$-regular random graphs on $s$ vertices planted within $n$-vertex empty graphs).

Since $\kappa \sqrt{d} < \frac{1}{2}$, the matrices $I + \kappa \Delta_u$ and $I + \kappa \Delta_u + \kappa \Delta_v$ meet the conditions of Claim D.1. Using Claim D.1, it suffices to bound
$$\mathop{\mathbb{E}}_{u,v \sim \mathcal{S}} \left( \langle D_u, D_v \rangle - 1 \right)^k = \mathop{\mathbb{E}}_{u,v \sim \mathcal{S}} \left[ \left( \frac{1}{\sqrt{\det\left( I - \kappa^2 (I + \kappa \Delta_u)^{-1} \Delta_u \Delta_v (I + \kappa \Delta_v)^{-1} \right)}} - 1 \right)^k \right], \qquad (14)$$
since to obtain the SDA bound we may apply Equation (2), and to get the second conclusion we use Hölder's inequality and the triangle inequality,
$$\mathop{\mathbb{E}}_{u,v \sim \mathcal{S}} \langle D_u, D_v \rangle^k \le \sum_{\ell=0}^k \binom{k}{\ell} \mathop{\mathbb{E}}_{u,v} \left[ \left| \langle D_u, D_v \rangle - 1 \right|^\ell \right] \le \left( 1 + \mathop{\mathbb{E}}_{u,v} \left[ \left( \langle D_u, D_v \rangle - 1 \right)^k \right]^{1/k} \right)^k.$$
Now, when $u, v \sim \mathcal{S}$, with probability at least $1 - \frac{s^2}{n}$, $\Delta_u$ and $\Delta_v$ correspond to graphs with disjoint support, so $\Delta_u \Delta_v = 0$. For such $u, v$, the right-hand side of (14) is zero. Otherwise, if $\Delta_u, \Delta_v$ overlap, the matrix $(I + \kappa \Delta_u)^{-1} \Delta_u \Delta_v (I + \kappa \Delta_v)^{-1}$ has at most $s$ eigenvalues which are not $0$ (since $\Delta_u, \Delta_v$ have rank at most $s$). Further, since all eigenvalues of $\Delta_u, \Delta_v$ are in the interval $[-\sqrt{d}, \sqrt{d}]$, and since $\Delta_u$ and $(I + \kappa \Delta_u)^{-1}$ commute, the eigenvalues of $(I + \kappa \Delta_u)^{-1} \Delta_u$ and of $\Delta_v (I + \kappa \Delta_v)^{-1}$ are in the interval $\left[ -\frac{\sqrt{d}}{1 - \kappa \sqrt{d}}, \frac{\sqrt{d}}{1 - \kappa \sqrt{d}} \right]$. This implies that all eigenvalues of $(I + \kappa \Delta_u)^{-1} \Delta_u \Delta_v (I + \kappa \Delta_v)^{-1}$ are in the interval $\left[ -\frac{d}{(1 - \kappa \sqrt{d})^2}, \frac{d}{(1 - \kappa \sqrt{d})^2} \right]$. Thus, for such $u, v$,
$$\frac{1}{\sqrt{\det\left( I - \kappa^2 (I + \kappa \Delta_u)^{-1} \Delta_u \Delta_v (I + \kappa \Delta_v)^{-1} \right)}} \le \left( 1 - \frac{\kappa^2 d}{(1 - \kappa \sqrt{d})^2} \right)^{-s/2}.$$
Putting these observations together with (14),
$$\mathop{\mathbb{E}}_{u,v} \left( \langle D_u, D_v \rangle - 1 \right)^k \le \frac{s^2}{n} \left( \left( 1 - \frac{\kappa^2 d}{(1 - \kappa \sqrt{d})^2} \right)^{-s/2} - 1 \right)^k \le \frac{s^2}{n} \left( \left( 1 - 4 \kappa^2 d \right)^{-s/2} - 1 \right)^k,$$
where we have used that $\kappa \sqrt{d} < \frac{1}{2}$. We can further simplify the above by noting that $1 + x \le \exp(x)$, so that
$$\left( 1 - 4 \kappa^2 d \right)^{-s/2} = \left( 1 + \frac{4 \kappa^2 d}{1 - 4 \kappa^2 d} \right)^{s/2} \le \exp\left( \frac{2 s d \kappa^2}{1 - 4 \kappa^2 d} \right) \le \exp(4 s d \kappa^2),$$
since $4 \kappa^2 d \le \frac{1}{2}$. Thus, applying Equation (2), we have that for any $q > 0$,
$$\mathrm{SDA}\left( \mathcal{S},\ \left( \frac{n}{q s^2} \right)^{1/k} \left( \exp(4 s d \kappa^2) - 1 \right)^{-1} \right) > q,$$
and we obtain the bound on $\left\| \mathbb{E}_u D_u^{\otimes k} \right\|^2 = \mathbb{E}_{u,v} \langle D_u, D_v \rangle^k$ from the Hölder and triangle inequality computation above.
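The determinant identity of Claim D.1 underpinning the computation above can be sanity-checked numerically. The sketch below is an illustrative check, not part of the paper's argument: it picks small symmetric $2 \times 2$ matrices $A, B$ satisfying the positivity conditions (the specific entries are arbitrary choices) and compares the Gaussian-integral form $\sqrt{\det(I+A)\det(I+B)/\det(I+A+B)}$ of $\langle \bar{D}_a, \bar{D}_b \rangle$ against the closed form $\det\left(I - (I+A)^{-1}AB(I+B)^{-1}\right)^{-1/2}$, using the factorization $I + A + B = (I+A)(I+B) - AB$.

```python
# Numerical check of Claim D.1: the two expressions for <D_a, D_b> agree.
# The 2x2 matrices A, B below are illustrative choices, not from the paper.
import math


def mat_mul(X, Y):
    """Multiply two 2x2 matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_add(X, Y):
    """Add two 2x2 matrices."""
    return [[X[i][j] + Y[i][j] for j in range(2)] for i in range(2)]


def det2(X):
    """Determinant of a 2x2 matrix."""
    return X[0][0] * X[1][1] - X[0][1] * X[1][0]


def inv2(X):
    """Inverse of a 2x2 matrix."""
    d = det2(X)
    return [[X[1][1] / d, -X[0][1] / d], [-X[1][0] / d, X[0][0] / d]]


I = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.10, 0.05], [0.05, 0.20]]   # symmetric, small entries so that
B = [[0.20, -0.03], [-0.03, 0.10]]  # I+A, I+B, I+A+B are all positive definite

IA, IB = mat_add(I, A), mat_add(I, B)
IAB = mat_add(IA, B)  # = I + A + B

# Gaussian-integral form: sqrt(det(I+A) det(I+B) / det(I+A+B)).
lhs = math.sqrt(det2(IA) * det2(IB) / det2(IAB))

# Closed form: det(I - (I+A)^{-1} A B (I+B)^{-1})^{-1/2}.
M = mat_mul(mat_mul(inv2(IA), A), mat_mul(B, inv2(IB)))
I_minus_M = mat_add(I, [[-M[i][j] for j in range(2)] for i in range(2)])
rhs = 1.0 / math.sqrt(det2(I_minus_M))

assert abs(lhs - rhs) < 1e-12
```

The agreement follows exactly from $\det(I+A+B) = \det(I+A)\det\left(I - (I+A)^{-1}AB(I+B)^{-1}\right)\det(I+B)$, so the check passes up to floating-point error for any admissible pair $A, B$.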