Counterexamples to the Low-Degree Conjecture
Justin Holmgren∗
Simons Institute for the Theory of Computing

Alexander S. Wein‡
Department of Mathematics, Courant Institute of Mathematical Sciences, New York University

April 21, 2020

∗Email: [email protected].
‡Email: [email protected]. Partially supported by NSF grant DMS-1712730 and by the Simons Collaboration on Algorithms and Geometry.
Abstract
A conjecture of Hopkins [Hop18] posits that for certain high-dimensional hypothesis testing problems, no polynomial-time algorithm can outperform so-called "simple statistics", which are low-degree polynomials in the data. This conjecture formalizes the beliefs surrounding a line of recent work that seeks to understand statistical-versus-computational tradeoffs via the low-degree likelihood ratio. In this work, we refute the conjecture of Hopkins [Hop18]. However, our counterexample crucially exploits the specifics of the noise operator used in the conjecture, and we point out a simple way to modify the conjecture to rule out our counterexample. We also give an example illustrating that (even after the above modification), the symmetry assumption in the conjecture is necessary. These results do not undermine the low-degree framework for computational lower bounds, but rather aim to better understand what class of problems it is applicable to.
1 Introduction

A primary goal of computer science is to understand which problems can be solved by efficient algorithms. Given the formidable difficulty of proving unconditional computational hardness, state-of-the-art results typically rely on unproven conjectures. While many such results rely only upon the widely-believed conjecture P ≠ NP, other results have only been proven under stronger assumptions such as the unique games conjecture [Kho02, Kho05], the exponential time hypothesis [IP01], the learning with errors assumption [Reg09], or the planted clique hypothesis [Jer92, BR13].

It has also been fruitful to conjecture that a specific algorithm (or limited class of algorithms) is optimal for a suitable class of problems. This viewpoint has been particularly prominent in the study of average-case noisy statistical inference problems, where it appears that optimal performance over a large class of problems can be achieved by methods such as the sum-of-squares hierarchy (see [RSS18]), statistical query algorithms [Kea93, BFJ+94], message-passing algorithms [DMM09, LKZ15], and low-degree polynomials [HS17, HKP+17, Hop18]. It is helpful to have such a conjectured-optimal meta-algorithm because this often admits a systematic analysis of hardness. However, the exact class of problems for which we believe these methods are optimal remains poorly understood.

The low-degree likelihood ratio [HS17, HKP+17, Hop18] has recently emerged as a framework for studying computational hardness in high-dimensional statistical inference problems. It has been shown that for many "natural statistical problems," all known polynomial-time algorithms only succeed in the parameter regime where certain "simple" (low-degree) statistics succeed. The power of low-degree statistics can often be understood via a relatively simple explicit calculation, yielding a tractable way to precisely predict the statistical-versus-computational tradeoffs in a given problem. These "predictions" can rigorously imply lower bounds against a broad class of spectral methods [KWB19, Theorem 4.4] and are intimately connected to the sum-of-squares hierarchy (see [HKP+17, Hop18, RSS18]). Recent work has (either explicitly or implicitly) carried out this type of low-degree analysis for a variety of statistical tasks [BHK+19, HS17, HKP+17, Hop18, BCL+19, BKW19, KWB19, DKWB19, BB19, MRX19, CHK+19]. For these problems, the known polynomial-time algorithms succeed precisely in the parameter regime where there is an O(log n)-degree polynomial of the data whose value behaves noticeably differently under the null and planted distributions (in a precise sense). Thus, barring the discovery of a drastically new algorithmic approach, the low-degree conjecture seems to hold for all the above problems. In fact, a more general version of the conjecture seems to hold for runtimes that are not necessarily polynomial: degree-D statistics are as powerful as all n^{Θ̃(D)}-time algorithms, where Θ̃ hides factors of log n [Hop18, Hypothesis 2.1.5] (see also [KWB19, DKWB19]).

A precise version of the low-degree conjecture was formulated in the PhD thesis of Hopkins [Hop18]. This includes precise conditions on the null distribution ν and planted distribution µ which capture most of the problems mentioned above. The key conditions are that there should be sufficient symmetry, and that µ should be injected with at least a small amount of noise. Most of the problems above satisfy this symmetry condition (a notable exception being the spiked Wishart model, which satisfies a mild generalization of it), but it remained unclear whether this assumption was needed in the conjecture. (By the spiked Wishart model we mean the formulation used in [BKW19], where we directly observe Gaussian samples instead of only their covariance matrix.) On the other hand, the noise assumption is certainly necessary, as illustrated by the example of solving a system of linear equations over a finite field: if the equations have an exact solution then it can be obtained via Gaussian elimination even though low-degree statistics suggest that the problem should be hard; however, if a small amount of noise is added (so that only a 1 − ε fraction of the equations can be satisfied) then Gaussian elimination is no longer helpful, and the low-degree conjecture seems to hold.
In this work we investigate more precisely what kinds of noise and symmetry conditions are needed in the conjecture of Hopkins [Hop18]. Our first result (Theorem 3.1) actually refutes the conjecture in the case where the underlying random variables are real-valued. Our counterexample exploits the specifics of the noise operator used in the conjecture, along with the fact that a single real number can be used to encode a large (but polynomially bounded) amount of data. In other words, we show that a stronger noise assumption than the one in [Hop18] is needed; Remark 3.3 explains a modification of the conjecture that we do not know how to refute. Our second result (Theorem 3.4) shows that the symmetry assumption in [Hop18] cannot be dropped, i.e., we give a counterexample for a weaker variant of the conjecture that does not require symmetry. Both of our counterexamples are based on efficiently decodable error-correcting codes.
Notation
Asymptotic notation such as o(1) and Ω(1) pertains to the limit n → ∞. We say that an event occurs with high probability if it occurs with probability 1 − o(1), and we use the abbreviation w.h.p. ("with high probability"). We use [n] to denote the set {1, 2, . . . , n}. The Hamming distance between vectors x, y ∈ F^n (for some field F) is ∆(x, y) = |{i ∈ [n] : x_i ≠ y_i}|, and the Hamming weight of x is ∆(x, 0).

2 The Conjecture

We now state the formal variant of the low-degree conjecture proposed in the PhD thesis of Hopkins [Hop18, Conjecture 2.2.4]. The terminology used in the statement will be explained below.
Conjecture 2.1.
Let Ω be a finite set or ℝ, and let k ≥ 1 be a fixed integer. Let N = (n choose k). Let ν be a product distribution on Ω^N. Let µ be another distribution on Ω^N. Suppose that µ is S_n-invariant and (log n)^{1+Ω(1)}-wise almost independent with respect to ν. Then no polynomial-time computable test distinguishes T_δµ and ν with probability 1 − o(1), for any δ > 0. Formally, for all δ > 0 and every polynomial-time computable t : Ω^N → {0, 1} there exists δ′ > 0 such that for every large enough n,

    (1/2) P_{x∼ν}(t(x) = 0) + (1/2) P_{x∼T_δµ}(t(x) = 1) ≤ 1 − δ′.

(The asymptotic notation Ω(1) is not to be confused with the set Ω.) We now explain some of the terminology used in the conjecture, referring the reader to [Hop18] for the full details. We will be concerned with the case k = 1, in which case S_n-invariance of µ means that for any x ∈ Ω^n and any π ∈ S_n (the symmetric group) we have P_µ(x) = P_µ(π · x), where π acts by permuting coordinates. The notion of D-wise almost independence captures how well degree-D polynomials can distinguish µ and ν. For our purposes, we do not need the full definition of D-wise almost independence (see [Hop18]), but only the fact that it is implied by exact D-wise independence, defined as follows.
Definition 2.2. A distribution µ on Ω^N is D-wise independent with respect to ν if for any S ⊆ [N] with |S| ≤ D we have equality of the marginal distributions: µ|_S = ν|_S.

Finally, the noise operator T_δ is defined as follows.
Definition 2.3. Let ν be a product distribution on Ω^N and let µ be another distribution on Ω^N. For δ ∈ [0, 1], let T_δµ be the distribution on Ω^N generated as follows. To sample z ∼ T_δµ, first sample x ∼ µ and y ∼ ν independently, and then, independently for each i, let

    z_i = x_i with probability 1 − δ, and z_i = y_i with probability δ.

(Note that T_δ depends on ν but the notation suppresses this dependence; ν will always be clear from context.)
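To make the noise model concrete, here is a minimal sketch (ours, not from the paper) of sampling z ∼ T_δµ; the samplers sample_mu and sample_nu are hypothetical stand-ins for the planted and null distributions.

```python
import random

def sample_T_delta(sample_mu, sample_nu, delta):
    """Sample z ~ T_delta(mu): each coordinate of x ~ mu is independently
    replaced, with probability delta, by the corresponding coordinate of an
    independent y ~ nu (the resampling noise of Definition 2.3)."""
    x, y = sample_mu(), sample_nu()
    return [yi if random.random() < delta else xi for xi, yi in zip(x, y)]

# Toy usage with hypothetical samplers: nu = Unif({0,1}^n), mu = all-equal bits.
n = 10
sample_nu = lambda: [random.randint(0, 1) for _ in range(n)]
sample_mu = lambda: [random.randint(0, 1)] * n
print(sample_T_delta(sample_mu, sample_nu, delta=0.1))
```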
3 Main Results

We first give a counterexample that refutes Conjecture 2.1 in the case Ω = ℝ.

Theorem 3.1. The following holds for infinitely many n. Let Ω = ℝ and ν = Unif([0, 1]^n). There exists a distribution µ on Ω^n such that µ is S_n-invariant (with k = 1) and Ω(n)-wise independent with respect to ν, and for some constant δ > 0 there exists a polynomial-time computable test distinguishing T_δµ and ν with probability 1 − o(1).

The proof is given in Section 5.1.
Remark 3.2.
We assume a standard model of finite-precision arithmetic over ℝ, i.e., the algorithm t can access polynomially-many bits in the binary expansion of its input.

Note that we refute an even weaker statement than Conjecture 2.1, because our counterexample has exact Ω(n)-wise independence instead of only (log n)^{1+Ω(1)}-wise almost independence.
Remark 3.3. Essentially, our counterexample exploits the fact that a single real number can be used to encode a large block of data, and that the noise operator T_δ will leave many of these blocks untouched (effectively allowing us to use a super-constant alphabet size). We therefore propose modifying Conjecture 2.1 in the case Ω = ℝ by using a different noise operator that applies a small amount of noise to every coordinate instead of resampling a small number of coordinates. If ν is i.i.d. N(0, 1) then the standard Ornstein–Uhlenbeck noise operator is a natural choice (and in fact, this is mentioned by [Hop18]). Formally, this is the noise operator T_δ that samples T_δµ as follows: draw x ∼ µ and y ∼ ν and output √(1 − δ) x + √δ y.
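For contrast with the resampling operator of Definition 2.3, here is a minimal sketch (again ours) of this Ornstein–Uhlenbeck-style operator; sample_mu is a hypothetical sampler for µ, and ν is i.i.d. N(0, 1).

```python
import numpy as np

def sample_OU_noise(sample_mu, n, delta):
    """Perturb every coordinate slightly, rather than fully resampling a
    delta-fraction of coordinates; this preserves i.i.d. N(0,1) marginals."""
    x = np.asarray(sample_mu(), dtype=float)
    y = np.random.randn(n)  # y ~ nu = i.i.d. N(0, 1)
    return np.sqrt(1 - delta) * x + np.sqrt(delta) * y
```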
Our second result illustrates that in the case where Ω is a finite set, the S_n-invariance assumption cannot be dropped from Conjecture 2.1. (In stating the original conjecture, Hopkins [Hop18] remarked that he was not aware of any counterexample when the S_n-invariance assumption is dropped.)

Theorem 3.4. The following holds for infinitely many n. Let Ω = {0, 1} and ν = Unif({0, 1}^n). There exists a distribution µ on Ω^n such that µ is Ω(n)-wise independent with respect to ν, and for some constant δ > 0 there exists a polynomial-time computable test distinguishing T_δµ and ν with probability 1 − o(1).

The proof is given in Section 5.2.

Both of our counterexamples are in fact still valid in the presence of a stronger noise operator T_δ that adversarially changes any δ-fraction of the coordinates.

The rest of the paper is organized as follows. Our counterexamples are based on error-correcting codes, so in Section 4 we review the basic notions from coding theory that we will need. In Section 5 we construct our counterexamples and prove our main results.

4 Coding Theory Preliminaries
Let F = F_q be a finite field. A linear code C (over F) is a linear subspace of F^n. Here n is called the (block) length, and the elements of C are called codewords. The distance of C is the minimum Hamming distance between two distinct codewords, or equivalently, the minimum Hamming weight of a nonzero codeword.
Definition 4.1. Let C be a linear code. The dual distance of C is the minimum Hamming weight of a nonzero vector in F^n that is orthogonal to all codewords. Equivalently, the dual distance is the distance of the dual code C^⊥ = {x ∈ F^n : ⟨x, c⟩ = 0 for all c ∈ C}.

The following standard fact will be essential to our arguments.
Proposition 4.2. If C is a linear code with dual distance d, then the uniform distribution over codewords is (d − 1)-wise independent with respect to the uniform distribution on F^n.

Proof. This is a standard fact in coding theory, but we give a proof here for completeness. Fix S ⊆ [n] with |S| ≤ d − 1. For some k, we can write C = {x^⊤G : x ∈ F^k} for some k × n generator matrix G whose rows form a basis for C. Let G_S be the k × |S| matrix obtained from G by keeping only the columns in S. It is sufficient to show that if x is drawn uniformly from F^k then x^⊤G_S is uniform over F^{|S|}. The columns of G_S must be linearly independent, because otherwise there is a nonzero vector y ∈ F^n of Hamming weight ≤ d − 1 with Gy = 0, implying y ∈ C^⊥, which contradicts the dual distance. Thus there exists a set T ⊆ [k] of |S| linearly independent rows of G_S (which form a basis for F^{|S|}). For any fixed choice of x_{[k]\T} (i.e., the coordinates of x outside T), if the coordinates x_T are chosen uniformly at random then x^⊤G_S is uniform over F^{|S|}. This completes the proof.
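As a concrete sanity check (our own toy example, not from the paper), the snippet below verifies Proposition 4.2 for the [7,4] binary Hamming code: its dual is the [7,3] simplex code, whose nonzero codewords all have weight 4, so the dual distance is d = 4 and the uniform distribution over codewords should be 3-wise independent.

```python
from itertools import product, combinations
from collections import Counter

# Generator matrix of the [7,4] Hamming code (dual distance d = 4).
G = [
    [1, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1],
]
n, k = 7, 4

codewords = [
    tuple(sum(x[i] * G[i][j] for i in range(k)) % 2 for j in range(n))
    for x in product([0, 1], repeat=k)
]

# Every marginal on at most d - 1 = 3 coordinates must be uniform, i.e.,
# each of the 2^|S| patterns appears exactly |C| / 2^|S| times.
for size in (1, 2, 3):
    for S in combinations(range(n), size):
        counts = Counter(tuple(c[i] for i in S) for c in codewords)
        assert len(counts) == 2 ** size
        assert all(v == len(codewords) // 2 ** size for v in counts.values())
print("3-wise independence verified for the [7,4] Hamming code")
```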
Definition 4.3. A code C admits efficient decoding from r errors and s erasures if there is a deterministic polynomial-time algorithm D : (F ∪ {⊥})^n → F^n ∪ {fail} with the following properties.

• For any codeword c ∈ C, let c′ ∈ (F ∪ {⊥})^n be any vector obtained from c by changing the values of at most r coordinates (to arbitrary elements of F) and replacing at most s coordinates with the erasure symbol ⊥. Then D(c′) = c.

• For any arbitrary c′ ∈ (F ∪ {⊥})^n (not obtained from some codeword as above), D(c′) can output any codeword or fail, but must never output a vector that is not a codeword. (The codes we deal with in this paper can be efficiently constructed, and so it is easy to test whether a given vector is a codeword; thus this assumption is without loss of generality.)

Note that the decoding algorithm knows where the erasures have occurred but does not know where the errors have occurred.
Our first counterexample (Theorem 3.1) is based on the classical Reed–Solomon codes RS_q(n, k), which consist of univariate polynomials of degree at most k evaluated at n canonical elements of the field F_q, and are known to have the following properties.

Proposition 4.4 (Reed–Solomon Codes). For any integers 0 ≤ k < n and for any prime power q ≥ n, there is a length-n linear code C over F_q with the following properties:

• the dual distance is k + 2, and

• C admits efficient decoding from r errors and s erasures whenever 2r + s < n − k.

Proof. See, e.g., [GS98] for the construction and basic facts regarding Reed–Solomon codes RS_q(n, k). The distance of RS_q(n, k) is n − k. It is well known that the dual code of RS_q(n, k) is RS_q(n, n − k − 2), which therefore has distance k + 2. Efficient decoding from r errors and s erasures whenever 2r + s < n − k is classical; see, e.g., [GS98].
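The following sketch (ours; simplified relative to the paper's setting) illustrates Reed–Solomon encoding and erasure-only decoding via Lagrange interpolation over a prime field F_p; the paper instead takes q = 2^m, and correcting errors as well as erasures requires a classical decoder such as Berlekamp–Welch, which we omit.

```python
import random

p = 13          # field size (prime here for simplicity), p >= n
n, k = 7, 2     # evaluate polynomials of degree <= k at n points

def encode(coeffs, xs):
    """Evaluate the message polynomial at the evaluation points xs."""
    return [sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p for x in xs]

def decode_erasures(word, xs):
    """Recover the codeword from erasures (None entries) by Lagrange
    interpolation, assuming at least k+1 coordinates survive."""
    pts = [(x, y) for x, y in zip(xs, word) if y is not None][: k + 1]
    def poly_at(t):
        total = 0
        for i, (xi, yi) in enumerate(pts):
            num, den = 1, 1
            for j, (xj, _) in enumerate(pts):
                if i != j:
                    num = num * (t - xj) % p
                    den = den * (xi - xj) % p
            total = (total + yi * num * pow(den, p - 2, p)) % p  # den^(p-2) = den^(-1) mod p
        return total
    return [poly_at(x) for x in xs]

xs = list(range(n))
c = encode([random.randrange(p) for _ in range(k + 1)], xs)
received = list(c)
for i in random.sample(range(n), n - (k + 1)):  # erase all but k+1 positions
    received[i] = None
assert decode_erasures(received, xs) == c
print("erasure decoding recovered the codeword")
```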
Proposition 4.5. There exists a universal constant ζ > 0 such that for every integer i ≥ 1 there is a linear code C over F = {0, 1} of block length n = 42 · 2^{i+1}, with the following properties:

• the dual distance is at least ζn, and

• C admits efficient decoding from ζn/2 errors (with no erasures).

(Such codes can be constructed using algebraic-geometric codes; see [FR93, GS01].)

5 Proofs

Before proving the main results, we state some prerequisite notation and lemmas.
Definition 5.1.
Let F be a finite field. For S ⊆ [n] (representing erased positions), the S-restricted Hamming distance ∆_S(x, y) is the number of coordinates in [n] \ S where x and y differ: ∆_S(x, y) = |{i ∈ [n] \ S : x_i ≠ y_i}|. We allow x, y to belong to either F^n or F^{[n]\S}, or even to (F ∪ {⊥})^n so long as the "erasures" ⊥ occur only in S.
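In code, this metric is a one-liner (a hypothetical helper, for illustration, with 0-indexed coordinates):

```python
def delta_S(x, y, S):
    """S-restricted Hamming distance: count disagreements outside the erased set S."""
    return sum(1 for i in range(len(x)) if i not in S and x[i] != y[i])
```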
The following lemma shows that a random string is sufficiently far from any codeword.

Lemma 5.2.
Let C be a length-n linear code over a finite field F. Suppose C admits efficient decoding from 2r errors and s erasures, for some r, s satisfying r ≤ (n − s)/(8e). For any fixed choice of at most s erasure positions S ⊆ [n], if x is a uniformly random element of F^{[n]\S} then

    P_x(∃c ∈ C : ∆_S(c, x) ≤ r) ≤ (r + 1) 2^{−r}.
Proof. Let B_r(c) = {x ∈ F^{[n]\S} : ∆_S(c, x) ≤ r} ⊆ F^{[n]\S} denote the Hamming ball (with erasures S) of radius r and center c, and let |B_r| denote its cardinality (which does not depend on c). We have the following basic bounds on |B_r|:

    (n − |S| choose r) (|F| − 1)^r ≤ |B_r| ≤ (r + 1) (n − |S| choose r) (|F| − 1)^r.

Since decoding from 2r errors and s erasures is possible, the Hamming balls {B_{2r}(c)}_{c∈C} are disjoint, so |C| · |B_{2r}| ≤ |F|^{n−|S|} and therefore

    P_x(∃c ∈ C : ∆_S(c, x) ≤ r) ≤ |C| · |B_r| / |F|^{n−|S|} ≤ |B_r| / |B_{2r}|
        ≤ (r + 1) (n − |S| choose r) (n − |S| choose 2r)^{−1} (|F| − 1)^{−r}
        ≤ (r + 1) (n − |S| choose r) (n − |S| choose 2r)^{−1}.

Using the standard bounds (n/k)^k ≤ (n choose k) ≤ (en/k)^k, this becomes

    ≤ (r + 1) (4er / (n − |S|))^r ≤ (r + 1) (4er / (n − s))^r,

which is at most (r + 1) 2^{−r} provided r ≤ (n − s)/(8e).
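The last chain of inequalities is easy to sanity-check numerically. The snippet below (our own check) verifies |B_r|/|B_{2r}| ≤ (r + 1)(4er/(n − s))^r ≤ (r + 1)2^{−r} on a small grid with |F| = 2, writing m = n − s for the number of non-erased coordinates.

```python
import math

def ball_size(m, r, q):
    """|B_r|: words within S-restricted Hamming distance r of a fixed center,
    over an alphabet of size q, on m = n - |S| non-erased coordinates."""
    return sum(math.comb(m, j) * (q - 1) ** j for j in range(r + 1))

q = 2
for m in (200, 500, 1000):
    r_max = int(m / (8 * math.e))
    for r in range(1, r_max + 1):
        ratio = ball_size(m, r, q) / ball_size(m, 2 * r, q)
        bound = (r + 1) * (4 * math.e * r / m) ** r
        assert ratio <= bound <= (r + 1) * 2 ** (-r)
print("ball-ratio bounds verified on the sample grid")
```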
Lemma 5.3. Let j_1, . . . , j_n be uniformly and independently chosen from [n]. For any constant α < 1/e, the number of indices i ∈ [n] that occur exactly once among j_1, . . . , j_n is at least αn with high probability.

Proof. Let X_i ∈ {0, 1} be the indicator that i occurs exactly once among j_1, . . . , j_n, and let X = Σ_{i=1}^n X_i. We have (as n → ∞) E[X_i] = n(1/n)(1 − 1/n)^{n−1} → e^{−1}, E[X_i^2] = E[X_i], and for i ≠ i′, E[X_i X_{i′}] = n(n − 1)(1/n)^2 (1 − 2/n)^{n−2} → e^{−2}. This means E[X] = (1 + o(1)) n/e and

    Var(X) = Σ_i (E[X_i^2] − E[X_i]^2) + Σ_{i≠i′} (E[X_i X_{i′}] − E[X_i] E[X_{i′}])
           ≤ n(1 + o(1))(e^{−1} − e^{−2}) + n(n − 1) · o(1) = o(n^2).

The result now follows by Chebyshev's inequality.
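A quick Monte Carlo experiment (ours) illustrates the concentration: the fraction of unique indices hovers near 1/e ≈ 0.368, comfortably above, say, α = 1/3 for large n.

```python
import random
from collections import Counter

def unique_fraction(n):
    """Fraction of indices appearing exactly once among n uniform draws from [n]."""
    counts = Counter(random.randrange(n) for _ in range(n))
    return sum(1 for v in counts.values() if v == 1) / n

n = 10_000
fractions = [unique_fraction(n) for _ in range(20)]
print(min(fractions), max(fractions))  # both close to 1/e ~ 0.3679
```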
5.1 Proof of Theorem 3.1

The idea of the proof is as follows. First imagine the setting where µ is not required to be S_n-invariant. By using each real number to encode an element of F = F_q, we can take ν to be the uniform distribution on F^n and take µ to be a random Reed–Solomon codeword in F^n. Under T_δµ, the noise operator will corrupt a few symbols ("errors"), but the decoding algorithm can correct these and thus distinguish T_δµ from ν.

In order to have S_n-invariance, we need to modify the construction. Instead of observing an ordered list y = (y_1, . . . , y_n) of symbols, we will observe pairs of the form (i, y_i) (with each pair encoded by a single real number) where i is a random index. If the same i value appears in two different pairs, this gives conflicting information; we deal with this by simply throwing it out and treating y_i as an "erasure" that the code needs to correct. If some i value does not appear in any pairs, we also treat this as an erasure. The full details are given below.
Proof of Theorem 3.1. Let C be the length-n code from Proposition 4.4 with k = ⌈αn⌉ for some constant α ∈ (0, 1) to be chosen later. Fix a scheme by which a real number encodes a tuple (j, y) with j ∈ [n] and y ∈ F = F_q, in such a way that a uniformly random real number in [0, 1] encodes a uniformly random tuple (j, y). More concretely, we can take n = q = 2^m for some integer m ≥ 1, so that (j, y) can be directly encoded using the first 2m bits of the binary expansion of a real number. Under x ∼ ν, each coordinate x_i encodes an independent uniformly random tuple (j_i, y_i). Under µ, let each coordinate x_i be a uniformly random encoding of (j_i, y_i), drawn as follows. Let ˜c be a uniformly random codeword from C. Draw j_1, . . . , j_n ∈ [n] independently and uniformly. For each i, if j_i is a unique index (in the sense that j_i ≠ j_{i′} for all i′ ≠ i) then set y_i = ˜c_{j_i}; otherwise choose y_i uniformly from F.

Note that µ is S_n-invariant. Since the dual distance of C is k + 2 = Ω(n), it follows (using Proposition 4.2) that µ is Ω(n)-wise independent with respect to ν. By choosing the constants δ > 0 and α > 0 sufficiently small, we have 16δn + (2n/3 + 4δn) < n − k, and so C admits efficient decoding from 8δn errors and 2n/3 + 4δn erasures (see Proposition 4.4). The algorithm to distinguish T_δµ and ν is as follows. Given a list of (j_i, y_i) pairs, produce c′ ∈ F^n by setting c′_{j_i} = y_i wherever j_i is a unique index (in the above sense), and setting all other positions of c′ to ⊥ (an "erasure"). Let S ⊆ [n] be the indices i for which c′_i = ⊥. Run the decoding algorithm on c′; if it succeeds and outputs a codeword c such that ∆_S(c, c′) ≤ 4δn then output "T_δµ", and otherwise output "ν".

We can prove correctness as follows. If the true distribution is ν then Lemma 5.3 (applied with constant 1/3 < 1/e) guarantees |S| ≤ 2n/3 (w.h.p.). Since the non-erased coordinates of c′ are uniformly random, Lemma 5.2 ensures there is no codeword c with ∆_S(c, c′) ≤ 4δn (w.h.p.), provided we choose δ ≤ 1/(96e), and so our algorithm outputs "ν" (w.h.p.).

Now suppose the true distribution is T_δµ. In addition to the ≤ 2n/3 erasures caused by non-unique j_i's sampled from µ, each coordinate resampled by T_δ can create up to 2 additional erasures and can also create up to 2 errors (i.e., coordinates i for which c′_i ≠ ⊥ but c′_i ≠ ˜c_i). Since at most 2δn coordinates get resampled (w.h.p.), this means we have a total of (up to) 4δn errors and 2n/3 + 4δn erasures. This means decoding succeeds and outputs ˜c (i.e., the true codeword used to sample µ), and furthermore, ∆_S(˜c, c′) ≤ 4δn.
5.2 Proof of Theorem 3.4

The idea of the proof is similar to the previous proof, and somewhat simpler (since we do not need S_n-invariance). We take ν to be the uniform distribution on binary strings and take µ to be the uniform distribution on codewords, using the binary code from Proposition 4.5. The decoding algorithm is able to correct the errors caused by T_δ.

Proof of Theorem 3.4. Let C be the code from Proposition 4.5. Each codeword c ∈ C is an element of F^n, which can be identified with {0, 1}^n = Ω^n. Let µ be the uniform distribution over codewords. Since the dual distance of C is at least ζn, we have from Proposition 4.2 that µ is Ω(n)-wise independent with respect to the uniform distribution ν. We also know that C admits efficient decoding from ζn/2 errors. Let δ = min{1/(16e), ζ/8}. The algorithm to distinguish T_δµ and ν, given a sample c′, is to run the decoding algorithm on c′; if decoding succeeds and outputs a codeword c such that ∆(c, c′) ≤ 2δn then output "T_δµ", and otherwise output "ν".

We can prove correctness as follows. If c′ is drawn from T_δµ then c′ is separated from some codeword c by at most 2δn ≤ ζn/4 errors (w.h.p.), so decoding succeeds and outputs this codeword c. If instead c′ is drawn from ν then (since δ ≤ 1/(16e)) by Lemma 5.2, there is no codeword within Hamming distance 2δn of c′ (w.h.p.).

Acknowledgments
We thank Sam Hopkins and Tim Kunisky for comments on an earlier draft.

References

[BB19] Matthew Brennan and Guy Bresler. Average-case lower bounds for learning sparse mixtures, robust estimation and semirandom adversaries. arXiv preprint arXiv:1908.06130, 2019.

[BCL+19] Boaz Barak, Chi-Ning Chou, Zhixian Lei, Tselil Schramm, and Yueqi Sheng. (Nearly) efficient algorithms for the graph matching problem on correlated random graphs. In Advances in Neural Information Processing Systems, pages 9186–9194, 2019.

[BFJ+94] Avrim Blum, Merrick L. Furst, Jeffrey C. Jackson, Michael J. Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the Twenty-Sixth Annual ACM Symposium on Theory of Computing (STOC), pages 253–262. ACM, 1994.

[BHK+19] Boaz Barak, Samuel Hopkins, Jonathan Kelner, Pravesh K. Kothari, Ankur Moitra, and Aaron Potechin. A nearly tight sum-of-squares lower bound for the planted clique problem. SIAM Journal on Computing, 48(2):687–735, 2019.

[BKW19] Afonso S. Bandeira, Dmitriy Kunisky, and Alexander S. Wein. Computational hardness of certifying bounds on constrained PCA problems. arXiv preprint arXiv:1902.07324, 2019.

[BR13] Quentin Berthet and Philippe Rigollet. Computational lower bounds for sparse PCA. arXiv preprint arXiv:1304.0828, 2013.

[CHK+19] Yeshwanth Cherapanamjeri, Samuel B. Hopkins, Tarun Kathuria, Prasad Raghavendra, and Nilesh Tripuraneni. Algorithms for heavy-tailed statistics: Regression, covariance estimation, and beyond. arXiv preprint arXiv:1912.11071, 2019.

[DKWB19] Yunzi Ding, Dmitriy Kunisky, Alexander S. Wein, and Afonso S. Bandeira. Subexponential-time algorithms for sparse PCA. arXiv preprint arXiv:1907.11635, 2019.

[DMM09] David L. Donoho, Arian Maleki, and Andrea Montanari. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 106(45):18914–18919, 2009.

[FR93] Gui-Liang Feng and Thammavarapu R. N. Rao. Decoding algebraic-geometric codes up to the designed minimum distance. IEEE Transactions on Information Theory, 39(1):37–45, 1993.

[GS98] Venkatesan Guruswami and Madhu Sudan. Improved decoding of Reed-Solomon and algebraic-geometric codes. In Proceedings of the 39th Annual Symposium on Foundations of Computer Science (FOCS), pages 28–37. IEEE, 1998.

[GS01] Venkatesan Guruswami and Madhu Sudan. On representations of algebraic-geometry codes. IEEE Transactions on Information Theory, 47(4):1610–1613, 2001.

[HKP+17] Samuel B. Hopkins, Pravesh K. Kothari, Aaron Potechin, Prasad Raghavendra, Tselil Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures. In 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 720–731. IEEE, 2017.

[Hop18] Samuel Hopkins. Statistical Inference and the Sum of Squares Method. PhD thesis, Cornell University, 2018.

[HS17] Samuel B. Hopkins and David Steurer. Efficient Bayesian estimation from few samples: community detection and related problems. In 58th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 379–390. IEEE, 2017.

[IP01] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367–375, 2001.

[Jer92] Mark Jerrum. Large cliques elude the Metropolis process. Random Structures & Algorithms, 3(4):347–359, 1992.

[Kea93] Michael J. Kearns. Efficient noise-tolerant learning from statistical queries. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing (STOC), pages 392–401. ACM, 1993.

[Kho02] Subhash Khot. On the power of unique 2-prover 1-round games. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing (STOC), pages 767–775, 2002.

[Kho05] Subhash Khot. On the unique games conjecture. In FOCS, volume 5, page 3, 2005.

[KWB19] Dmitriy Kunisky, Alexander S. Wein, and Afonso S. Bandeira. Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio. arXiv preprint arXiv:1907.11636, 2019.

[LKZ15] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborová. MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel. In 53rd Annual Allerton Conference on Communication, Control, and Computing, pages 680–687. IEEE, 2015.

[MRX19] Sidhanth Mohanty, Prasad Raghavendra, and Jeff Xu. Lifting sum-of-squares lower bounds: Degree-2 to degree-4. arXiv preprint arXiv:1911.01411, 2019.

[Reg09] Oded Regev. On lattices, learning with errors, random linear codes, and cryptography. Journal of the ACM, 56(6):34, 2009.