The Kikuchi Hierarchy and Tensor PCA
aa r X i v : . [ c s . D S ] O c t The Kikuchi Hierarchy and Tensor PCA
Alexander S. Wein ∗ , Ahmed El Alaoui ‡ , and Cristopher Moore § Department of Mathematics, Courant Institute of Mathematical Sciences, NYU Departments of Electrical Engineering and Statistics, Stanford University Santa Fe InstituteOctober 2, 2019
Abstract
For the tensor PCA (principal component analysis) problem, we propose a new hierarchy ofincreasingly powerful algorithms with increasing runtime. Our hierarchy is analogous to the sum-of-squares (SOS) hierarchy but is instead inspired by statistical physics and related algorithmssuch as belief propagation and AMP (approximate message passing). Our level- ℓ algorithm canbe thought of as a linearized message-passing algorithm that keeps track of ℓ -wise dependenciesamong the hidden variables. Specifically, our algorithms are spectral methods based on the Kikuchi Hessian , which generalizes the well-studied Bethe Hessian to the higher-order Kikuchifree energies.It is known that AMP, the flagship algorithm of statistical physics, has substantially worseperformance than SOS for tensor PCA. In this work we ‘redeem’ the statistical physics approachby showing that our hierarchy gives a polynomial-time algorithm matching the performance ofSOS. Our hierarchy also yields a continuum of subexponential-time algorithms, and we provethat these achieve the same (conjecturally optimal) tradeoff between runtime and statisticalpower as SOS. Our proofs are much simpler than prior work, and also apply to the relatedproblem of refuting random k -XOR formulas. The results we present here apply to tensor PCAfor tensors of all orders, and to k -XOR when k is even.Our methods suggest a new avenue for systematically obtaining optimal algorithms forBayesian inference problems, and our results constitute a step toward unifying the statisticalphysics and sum-of-squares approaches to algorithm design. ∗ Email: [email protected] . Partially supported by NSF grant DMS-1712730 and by the Simons Collaborationon Algorithms and Geometry. ‡ Email: [email protected] . Partially supported by IIS-1741162, and ONR N00014-18-1-2729. § Email: [email protected] . Partially supported by NSF grant BIGDATA-1838251. Introduction
High-dimensional Bayesian inference problems are widely studied, including planted clique [Jer92,AKS98], sparse PCA [JL04], and community detection [DKMZ11b, DKMZ11a], just to name afew. For these types of problems, two general strategies, or meta-algorithms , have emerged. Thefirst is rooted in statistical physics and includes the belief propagation (BP) algorithm [Pea86,YFW03] along with variants such as approximate message passing (AMP) [DMM09], and relatedspectral methods such as linearized BP [KMM +
13, BLM15], and the Bethe Hessian [SKZ14]. Thesecond meta-algorithm is the sum-of-squares (SOS) hierarchy [Sho87, Par00, Las01], a hierarchy ofincreasingly powerful semidefinite programming relaxations to polynomial optimization problems,along with spectral methods inspired by it [HSS15, HSSS16]. Both of these meta-algorithms areknown to achieve statistically-optimal performance for many problems. Furthermore, when theyfail to perform a task, this is often seen as evidence that no polynomial-time algorithm can succeed.Such reasoning takes the form of free energy barriers in statistical physics [LKZ15a, LKZ15b] orSOS lower bounds (e.g., [BHK + efficient algorithms.A fundamental question is whether we can unify statistical physics and SOS, showing thatthe two approaches yield, or at least predict, the same performance on a large class of problems.However, one barrier to this comes from the tensor principal component analysis (PCA) prob-lem [RM14], on which the two meta-algorithms seem to have very different performance. For aninteger p ≥
2, in the order- p tensor PCA or spiked tensor problem we observe a p -fold n × n × · · · × n tensor Y = λx ⊗ p ∗ + G where the parameter λ ≥ x ∗ ∈ R n is a planted signal withnormalization k x ∗ k = √ n drawn from a simple prior such as the uniform distribution on {± } n ,and G is a symmetric noise tensor with N (0 ,
1) entries. Information-theoretically, it is possible torecover x ∗ given Y (in the limit n → ∞ , with p fixed) when λ ≫ n (1 − p ) / [RM14, LML + A ≫ B can be understood to mean A ≥ B polylog( n ).) However, thisinformation-theoretic threshold corresponds to exhaustive search. We would also like to understandthe computational threshold, i.e., for what values of λ there is an efficient algorithm.The sum-of-squares hierarchy gives a polynomial-time algorithm to recover x ∗ when λ ≫ n − p/ [HSS15], and SOS lower bounds suggest that no polynomial-time algorithm can do better [HSS15,HKP + p ≥ λ ≫ n − / [RM14]. Various other “local” algorithms such as the tensor powermethod, Langevin dynamics, and gradient descent also fail below this “local” threshold λ ∼ n − / [RM14, AGJ18]. This casts serious doubts on the optimality of the statistical physics approach.In this paper we resolve this discrepancy and “redeem” the statistical physics approach. The Bethe free energy associated with AMP is merely the first level of a hierarchy of
Kikuchi free ener-gies [Kik51, Kik94, YFW03]. From these Kikuchi free energies, we derive a hierarchy of increasinglypowerful algorithms for tensor PCA, similar in spirit to generalized belief propagation [YFW03].Roughly speaking, our level- ℓ algorithm can be thought of as an iterative message-passing algorithmthat reasons about ℓ -wise dependencies among the hidden variables. As a result, it has time andspace complexity n O ( ℓ ) . Specifically, the level- ℓ algorithm is a spectral method on a n O ( ℓ ) × n O ( ℓ ) submatrix of (a first-order approximation of) the Kikuchi Hessian , i.e., the matrix of second deriva-tives of the Kikuchi free energy. This generalizes the
Bethe Hessian spectral method, which hasbeen successful in the setting of community detection [SKZ14]. We note that the Ph.D. dissertationof Saade [Saa16] proposed the Kikuchi Hessian as a direction for future research.2or order- p tensor PCA with p even, we show that level ℓ = p/ λ ∼ n − p/ , closing the gap between SOS andstatistical physics. Furthermore, by taking ℓ = n δ levels for various values of δ ∈ (0 , + We obtain similar results when p is odd, by combining a matrix related to the KikuchiHessian with a construction similar to [CGL04]; see Appendix F.2.Our approach also applies to the problem of refuting random k -XOR formulas when k is even,showing that we can strongly refute random formulas with n variables and m ≫ n k/ clauses inpolynomial time, and with a continuum of subexponential-time algorithms that succeed at lowerdensities. This gives a much simpler proof of the results of [RRS17], using only the matrix Chernoffbound instead of intensive moment calculations; see Appendix F.1. We leave for future work theproblem of giving a similar simplification of [RRS17] when k is odd.Our results redeem the statistical physics approach to algorithm design and give hope thatthe Kikuchi hierarchy provides a systematic way to derive optimal algorithms for a large class ofBayesian inference problems. We see this as a step toward unifying the statistical physics and SOSapproaches. Indeed, we propose the following informal meta-conjecture: for high-dimensional in-ference problems with planted solutions (and related problems such as refuting random constraintsatisfaction problems) the SOS hierarchy and the Kikuchi hierarchy both achieve the optimal trade-off between runtime and statistical power.After the initial appearance of this paper, some related independent work has appeared. Ahierarchy of algorithms similar to ours is proposed by [Has19], but with a different motivationbased on a system of quantum particles. Also, [BCR19] gives an alternative “redemption” of localalgorithms based on replicated gradient descent . Our asymptotic notation (e.g., O ( · ) , o ( · ) , Ω( · ) , ω ( · )) pertains to the limit n → ∞ (large dimension)and may hide constants depending on p (tensor order), which we think of as fixed. We say an eventoccurs with high probability if it occurs with probability 1 − o (1).A tensor T ∈ ( R n ) ⊗ p is an n × n × · · · × n ( p times) multi-array with entries denoted by T i ,...,i p , where i k ∈ [ n ] := { , , . . . , n } . We call p the order of T and n the dimension of T .For a vector u ∈ R n , the rank-1 tensor u ⊗ p is defined by ( u ⊗ p ) i ,...,i p = Q pk =1 u i k . A tensor T is symmetric if T i ,...,i p = T i π (1) ,...,i π ( p ) for any permutation π ∈ S p . For a symmetric tensor, if E = { i , . . . , i p } ⊆ [ n ], we will often write T E := T i ,...,i p . A general formulation of the spiked tensor model is as follows. For an integer p ≥
2, let e G ∈ ( R n ) ⊗ p be an asymmetric tensor with entries i.i.d. N (0 , G , G := 1 √ p ! X π ∈ S p e G π , The strongest SOS results only apply to a variant of the spiked tensor model with Rademacher observations, butwe do not expect this difference to be important; see Section 2.6. S p is the symmetric group of permutations of [ p ], and e G πi ,...,i p := e G i π (1) ,...,i π ( p ) . Note that if i , . . . , i p are distinct then G i ,...,i p ∼ N (0 , x ∗ ∈ R n froma prior distribution P x supported on the sphere S n − = { x ∈ R n : k x k = √ n } . Then we let Y ∈ ( R n ) ⊗ p be the tensor Y = λx ⊗ p ∗ + G . (1)We will mostly focus on the Rademacher-spiked model where x ∗ is uniform in {± } n , i.e., P x =2 − n Q i (cid:0) δ ( x i −
1) + δ ( x i + 1) (cid:1) . We will sometimes state results without specifying the prior P x , inwhich case the result holds for any prior normalized so that k x ∗ k = √ n . Let P λ denote the law ofthe tensor Y . The parameter λ = λ ( n ) may depend on n . We will consider the limit n → ∞ with p held fixed.Our algorithms will depend only on the entries Y i ,...,i p where the indices i , . . . , i p are distinct:that is, on the collection (cid:8) Y E = λx E ∗ + G E : E ⊆ [ n ] , | E | = p (cid:9) , where G E ∼ N (0 ,
1) and for a vector x ∈ R n we write x E = Q i ∈ E x i .Perhaps one of the simplest statistical tasks is binary hypothesis testing. In our case thisamounts to, given a tensor Y as input with the promise that it was sampled from P λ with λ ∈ { , ¯ λ } ,determining whether λ = 0 or λ = ¯ λ . We refer to P λ for λ > planted distribution, and P as the null distribution. Definition 2.1.
We say that an algorithm (or test) f : ( R n ) ⊗ p → { , } achieves strong detection between P and P λ iflim n →∞ P λ ( f ( Y ) = 1) = 1 and lim n →∞ P ( f ( Y ) = 0) = 1 . Additionally we say that f achieves weak detection between P and P λ if the sum of Type-I andType-II errors remains strictly below 1:lim sup n →∞ (cid:8) P ( f ( Y ) = 1) + P λ ( f ( Y ) = 0) (cid:9) < . An additional goal is to recover the planted vector x ∗ . Note that when p is even, x ∗ and − x ∗ have the same posterior probability. Thus, our goal is to recover x ∗ up to a sign. Definition 2.2.
The normalized correlation between vectors ˆ x, x ∈ R n iscorr(ˆ x, x ) = |h ˆ x, x i|k ˆ x kk x k . Definition 2.3.
An estimator ˆ x = ˆ x ( Y ) achieves weak recovery if corr(ˆ x, x ∗ ) is lower-bounded bya strictly positive constant—and we write corr(ˆ x, x ∗ ) = Ω(1)—with high probability, and achieves strong recovery if corr(ˆ x, x ∗ ) = 1 − o (1) with high probability.We expect that strong detection and weak recovery are generally equally difficult, although formalimplications are not known in either direction. We will see in Section 2.5 that in some regimes,weak recovery and strong recovery are equivalent.4 he matrix case. When p = 2, the spiked tensor model reduces to the spiked Wigner model . Weknow from random matrix theory that when λ = ˆ λ/ √ n with ˆ λ >
1, strong detection is possible bythresholding the maximum eigenvalue of Y , and weak recovery is achieved by PCA, i.e., taking theleading eigenvector [FP07, BGN11]. For many spike priors including Rademacher, strong detectionand weak recovery are statistically impossible when ˆ λ < λ = 1. (Note that weak detection is still possible below ˆ λ = 1 [EKJ18].)A more sophisticated algorithm is AMP (approximate message passing) [DMM09, FR18, LKZ15a,DAM15], which can be thought of as a modification of the matrix power method which uses certainnonlinear transformations to exploit the structure of the spike prior. For many spike priors includingRademacher, AMP is known to achieve the information-theoretically optimal correlation with thetrue spike [DAM15, DMK + λ <
1. For certain spike priors (e.g., sparse priors), statistical-computational gaps can appearin which it is information-theoretically possible to succeed for some ˆ λ < + The tensor case.
The tensor case p ≥ P x including Rademacher, if λ = ˆ λn (1 − p ) / then weak recovery is possiblewhen ˆ λ > λ c and impossible when ˆ λ < λ c for a particular constant λ c = λ c ( p, P x ) depending on p and P x [LML + λ > λ c and impossible otherwise [Che17, CHL18, JLM18] (see also [PWB16]). In fact, it is shown inthese works that even weak detection is impossible below λ c , in sharp contrast with the matrixcase. There are polynomial-time algorithms (e.g., SOS) that succeed at both strong detection andstrong recovery when λ ≫ n − p/ for any spike prior [RM14, HSS15, HSSS16], which is above theinformation-theoretic threshold by a factor that diverges with n . There are also SOS lower boundssuggesting that (for many priors) no polynomial-time algorithm can succeed at strong detection orweak recovery when λ ≪ n − p/ [HSS15, HKP + n (1 − p ) / ≪ λ ≪ n − p/ . Various algorithms have been proposed and analyzed for tensor PCA [RM14, HSS15, HSSS16,ADGM16, AGJ18]. We will present two such algorithms that are simple, representative, and willbe relevant to the discussion. The first is the tensor power method [AGH +
14, RM14, AGJ17].
Algorithm 2.4. (Tensor Power Method) For a vector u ∈ R n and a tensor Y ∈ ( R n ) ⊗ p , let Y { u } ∈ R n denote the vector Y { u } i = X j ,...,j p − Y i,j ,...,j p − u j · · · u j p − . The tensor power method begins with an initial guess u ∈ R n (e.g., chosen at random) and repeat-edly iterates the update rule u ← Y { u } until u/ k u k converges.The tensor power method appears to only succeed when λ ≫ n − / [RM14], which is worse thanthe SOS threshold λ ∼ n − p/ . The AMP algorithm of [RM14] is a more sophisticated variant of thetensor power method, but AMP also fails unless λ ≫ n − / [RM14]. Two other related algorithms,gradient descent and Langevin dynamics, also fail unless λ ≫ n − / [AGJ18]. Following [AGJ18], werefer to all of these algorithms (tensor power method, AMP, gradient descent, Langevin dynamics)5s local algorithms , and we refer to the corresponding threshold λ ∼ n − / as the local threshold .Here “local” is not a precise notion, but roughly speaking, local algorithms keep track of a currentguess for x ∗ and iteratively update it to a nearby vector that is more favorable in terms of e.g. thelog-likelihood. This discrepancy between local algorithms and SOS is what motivated the currentwork. We have seen that local algorithms do not seem able to reach the SOS threshold. Let us nowdescribe one of the simplest algorithms that does reach this threshold: tensor unfolding . Tensorunfolding was first proposed by [RM14], where it was shown to succeed when λ ≫ n −⌊ p/ ⌋ / andconjectured to succeed when λ ≫ n − p/ (the SOS threshold). For the case p = 3, the samealgorithm was later reinterpreted as a spectral relaxation of SOS, and proven to succeed when λ ≫ n − / = n − p/ [HSS15], confirming the conjecture of [RM14]. We now present the tensorunfolding method, restricting to the case p = 3 for simplicity. There is a natural extension toall p [RM14], and (a close variant of) this algorithm will in fact appear as level ℓ = ⌊ p/ ⌋ in ourhierarchy of algorithms (see Section 3 and Appendix C). Algorithm 2.5. (Tensor Unfolding) Given an order-3 tensor Y ∈ ( R n ) ⊗ , flatten it to an n × n matrix M , i.e., let M i,jk = Y ijk . Compute the leading eigenvector of M M ⊤ .If we use the matrix power method to compute the leading eigenvector, we can restate thetensor unfolding method as an iterative algorithm: keep track of state vectors u ∈ R n and v ∈ R n ,initialize u randomly, and alternate between applying the update steps v ← M ⊤ u and u ← M v .We will see later (see Section 4.1) that this can be interpreted as a message-passing algorithmbetween singleton indices, represented by u , and pairs of indices, represented by v . Thus, tensorunfolding is not “local” in the sense of Section 2.3 because it keeps a state of size O ( n ) (keepingtrack of pairwise information) instead of size O ( n ). We can, however, think of it as local on a“lifted” space, and this allows it to surpass the local threshold.Other methods have also been shown to achieve the SOS threshold λ ∼ n − p/ , including SOSitself and various spectral methods inspired by it [HSS15, HSSS16]. One fundamental difference between the matrix case ( p = 2) and tensor case ( p ≥
3) is the followingboosting property. The following result, implicit in [RM14], shows that for p ≥
3, if λ is substantiallyabove the information-theoretic threshold (i.e., λ ≫ n (1 − p ) / ) then weak recovery can be boostedto strong recovery via a single power iteration. We give a proof in Appendix D. Proposition 2.6.
Let Y ∼ P λ with any spike prior P x supported on S n − . Suppose we have aninitial guess u ∈ R n satisfying corr( u, x ∗ ) ≥ τ . Obtain b x from u via a single iteration of the tensorpower method: b x = Y { u } . There exists a constant c = c ( p ) > such that with high probability, corr( b x, x ∗ ) ≥ − cλ − τ − p n (1 − p ) / . In particular, if τ > is any constant and λ = ω ( n (1 − p ) / ) then corr( b x, x ) = 1 − o (1) . The analysis of [HSS15] applies to a close variant of the spiked tensor model in which the noise tensor is asym-metric. We do not expect this difference to be important. p ≥
3, since we do not expect polynomial-time algorithms to succeed when λ = O ( n (1 − p ) / ),this implies an “all-or-nothing” phenomenon: for a given λ = λ ( n ), the optimal polynomial-timealgorithm will either achieve correlation that is asymptotically 0 or asymptotically 1. This is instark contrast to the matrix case where, for λ = ˆ λ/ √ n , the optimal correlation is a constant (in[0 , λ and the spike prior P x .This boosting result substantially simplifies things when p ≥ p = 2), one needs to use AMPin order to achieve optimal correlation, but one can achieve the optimal threshold using linearizedAMP, which boils down to computing the top eigenvector. In the related setting of communitydetection in the stochastic block model, one needs to use belief propagation to achieve optimalcorrelation [DKMZ11b, DKMZ11a, MNS14], but one can achieve the optimal threshold using alinearized version of belief propagation, which is a spectral method on the non-backtracking walkmatrix [KMM +
13, BLM15] or the related
Bethe Hessian [SKZ14]. Our spectral methods for tensorPCA are based on the
Kikuchi Hessian , which is a generalization of the Bethe Hessian.
The degree- ℓ sum-of-squares algorithm is a large semidefinite program that requires runtime n O ( ℓ ) to solve. Oftentimes the regime of interest is when ℓ is constant, so that the algorithm runs inpolynomial time. However, one can also explore the power of subexponential-time algorithms byletting ℓ = n δ for δ ∈ (0 , n δ . Results of thistype are known for tensor PCA [RRS17, BGG +
16, BGL16]. The strongest such results are for adifferent variant of tensor PCA, which we now define.
Definition 2.7.
In the order- p discrete spiked tensor model with spike prior P x (normalized sothat k x ∗ k = √ n ) and SNR parameter λ ≥
0, we draw a spike x ∗ ∼ P x and then for each 1 ≤ i ≤· · · ≤ i p ≤ n , we independently observe a {± } -valued random variable Y i ,...,i p with E [ Y i ,...,i p ] = λ ( x ⊗ p ∗ ) i ,...,i p .This model differs from our usual one in that the observations are conditionally Rademacher insteadof Gaussian, but we do not believe this makes an important difference. However, for technicalreasons, the known SOS results are strongest in this discrete setting. Theorem 2.8 ([BGL16, HSS15]) . For any ≤ ℓ ≤ n , there is an algorithm with runtime n O ( ℓ ) that achieves strong detection and strong recovery in the order- p discrete spiked tensor model (withany spike prior) whenever λ ≥ ℓ / − p/ n − p/ polylog( n ) . The work of [BGL16] shows how to certify an upper bound on the injective norm of a random {± } -valued tensor, which immediately implies the algorithm for strong detection. When combined with[HSS15], this can also be made into an algorithm for strong recovery (see Lemma 4.4 of [HSS15]).Similar (but weaker) SOS results are also known for the standard spiked tensor model (see [RRS17]and arXiv version 1 of [BGL16]), and we expect that Theorem 2.8 also holds for this case.When ℓ = n δ for δ ∈ (0 , n O ( n δ ) = 2 n δ + o (1) that succeeds when λ ≫ n δ/ − pδ/ − p/ . Note that this interpolates smoothlybetween the polynomial-time threshold ( λ ∼ n − p/ ) when δ = 0, and the information-theoretic7hreshold ( λ ∼ n (1 − p ) / ) when δ = 1. We will prove (for p even) that our algorithms achieve thissame tradeoff, and we expect this tradeoff to be optimal . In this section we present our main results about detection and recovery in the spiked tensor model.We propose a hierarchy of spectral methods, which are directly derived from the hierarchy of
Kikuchifree energies . Specifically, the symmetric difference matrix defined below appears (approximately)as a submatrix of the Hessian of the Kikuchi free energy. The full details of this derivation aregiven in Section 4 and Appendix E. For now we simply state the algorithms and results.We will restrict our attention to the Rademacher-spiked tensor model, which is the setting inwhich we derived our algorithms. However, we show in Appendix B that the same algorithm worksfor a large class of priors (at least for strong detection). Furthermore, we show in Appendix F.1that the same algorithm can also be used for refuting random k -XOR formulas (when k is even).We will also restrict to the case where the tensor order p is even. The case of odd p is discussedin Appendix C, where we give an algorithm derived from the Kikuchi Hessian and conjecture thatit achieves optimal performance. We are unable to prove this, but we are able to prove that optimalresults are attained by a related algorithm (see Appendix F.2).Our approach requires the introduction of two matrices: The symmetric difference matrix of order ℓ . Let p be even and let Y ∈ ( R n ) ⊗ p be theobserved order- p symmetric tensor. We will only use the entries Y i ,...,i p for which the indices i , . . . , i p are distinct; we denote such entries by Y E where E ⊆ [ n ] with | E | = p . Fix an integer ℓ ∈ [ p/ , n − p/ (cid:0) nℓ (cid:1) × (cid:0) nℓ (cid:1) matrix M indexed by sets S ⊆ [ n ] of size ℓ , having entries M S,T = ( Y S △ T if | S △ T | = p, . (2)Here △ denotes the symmetric difference between sets. The leading eigenvector of M is intended tobe an estimate of ( x S ∗ ) | S | = ℓ where x S := Q i ∈ S x i . The following voting matrix is a natural roundingscheme to extract an estimate of x ∗ from such a vector. The voting matrix.
To a vector v ∈ R ( nℓ ) we associate the following symmetric n × n ‘voting’matrix V ( v ) having entries V ii ( v ) = 0 ∀ i ∈ [ n ] , and V ij ( v ) = 12 X S,T ∈ ( [ n ] ℓ ) v S v T S △ T = { i,j } ∀ i = j. (3)Let us define the important quantity d ℓ := (cid:18) n − ℓp/ (cid:19)(cid:18) ℓp/ (cid:19) . (4)This is the number of sets T of size ℓ such that | S △ T | = p for a given set S of size ℓ . Now we arein position to formulate our algorithms for detection and recovery. One form of evidence suggesting that this tradeoff is optimal is based on the low-degree likelihood ratio ;see [KWB19]. lgorithm 3.1 (Detection for even p ) .
1. Compute the top eigenvalue λ max ( M ) of the symmetric difference matrix M .2. Reject the null hypothesis P (i.e., return ‘1’) if λ max ( M ) ≥ λd ℓ / Algorithm 3.2 (Recovery for even p ) .
1. Compute a (unit-norm) leading eigenvector v top ∈ R ( nℓ ) of M .2. Form the associated voting matrix V ( v top ).3. Compute a leading eigenvector b x of V ( v top ), and output b x .The next two theorems characterize the performance of Algorithms 3.1 and 3.2 for the strongdetection and recovery tasks, respectively. The proofs can be found in Appendix A. Theorem 3.3.
Consider the Rademacher-spiked tensor model with p even. For all λ ≥ and ℓ ∈ [ p/ , n − p/ , we have P (cid:16) λ max ( M ) ≥ λd ℓ (cid:17) ∨ P λ (cid:16) λ max ( M ) ≤ λd ℓ (cid:17) ≤ n ℓ e − λ d ℓ / . Therefore, Algorithm 3.1 achieves strong detection between P and P λ if λ d ℓ − ℓ log n → + ∞ as n → + ∞ . Theorem 3.4.
Consider the Rademacher-spiked tensor model with p even. Let b x ∈ R n be theoutput of Algorithm 3.2. There exists an absolute constant c > such that for all ǫ > and δ ∈ (0 , , if ℓ ≤ nǫ and λ ≥ c ǫ − q log( n ℓ /δ ) (cid:14) d ℓ , then corr( b x, x ∗ ) ≥ − c ǫ with probability atleast − δ . Remark 3.5. If ℓ = o ( n ), we have d ℓ = Θ( n p/ ℓ p/ ), and so the above theorems imply that strongdetection and strong recovery are possible as soon as λ ≫ ℓ − ( p − / n − p/ √ log n . Comparing withTheorem 2.8, this scaling coincides with the guarantees achieved by the level- ℓ SOS algorithmof [BGL16], up to a possible discrepancy in logarithmic factors.Due to the particularly simple structure of the symmetric difference matrix M (in particular,the fact that its entries are simply entries of Y ), the proof of detection (Theorem 3.3) follows froma straightforward application of the matrix Chernoff bound. In contrast, the corresponding SOSresults [RRS17, BGG +
16, BGL16] work with more complicated matrices involving high powers ofthe entries of Y , and the analysis is much more involved.Our proof of recovery is unusual in that the signal component of M , call it X , is not rank-one;it even has a vanishing spectral gap when ℓ ≫
1. Thus, the leading eigenvector of M does notcorrelate well with the leading eigenvector of X . While this may seem to render recovery hopelessat first glance, this is not the case, due to the fact that many eigenvectors (actually, eigenspaces)of X contain non-trivial information about the spike x ∗ , as opposed to only the top one. We provethis by exploiting the special structure of X through the Johnson scheme , and using tools fromFourier analysis on a slice of the hypercube, in particular a Poincar´e-type inequality by [Fil16]. We define a leading eigenvector to be an eigenvector whose eigenvalue is maximal (although our results still holdfor an eigenvector whose eigenvalue is maximal in absolute value). emoving the logarithmic factor Both Theorem 3.3 and Theorem 3.4 involve a logarithmicfactor in n in the lower bound on SNR λ . These log factors are an artifact of the matrix Chernoffbound, and we believe they can be removed. (The analysis of [HSS15] removes the log factors forthe tensor unfolding algorithm, which is essentially the case p = 3 and ℓ = 1 of our algorithm.)This suggests the following precise conjecture on the power of polynomial-time algorithms. Conjecture 3.6.
Fix p and let ℓ be constant (not depending on n ). There exists a constant c p ( ℓ ) > with c p ( ℓ ) → as ℓ → ∞ (with p fixed) such that if λ ≥ c p ( ℓ ) n − p/ then Algorithm 3.1 andAlgorithm 3.2 (which run in time n O ( ℓ ) ) achieve strong detection and strong recovery, respectively. Specifically, we expect c p ( ℓ ) ∼ ℓ − ( p − / for large ℓ . In this section we motivate the symmetric difference matrices used in our algorithms. In Section 4.1we give some high-level intuition, including an explanation of how our algorithms can be thoughtof as iterative message-passing procedures among subsets of size ℓ . In Section 4.2 we give a moreprincipled derivation based on the Kikuchi Hessian, with many of the calculations deferred toAppendix E. As stated previously, our algorithms will choose to ignore the entries Y i ,...,i p for which i , . . . , i p are not distinct; these entries turn out to be unimportant asymptotically. We restrict to theRademacher-spiked tensor model, as this yields a clean and simple derivation. The posterior dis-tribution for the spike x ∗ given the observed tensor Y is P ( x | Y ) ∝ exp n − X i < ···
S, T ⊆ [ n ] with | S | = | T | = ℓ and | S △ T | = p ,and where N = d ℓ (cid:0) nℓ (cid:1) / (cid:0) np (cid:1) is the number of terms ( S, T ) with a given symmetric difference E .A natural message-passing algorithm to maximize the log-likelihood is the following. For each S ⊆ [ n ] of size | S | = ℓ , keep track of a variable u S ∈ R , which is intended to be an estimate of x S ∗ := Q i ∈ S ( x ∗ ) i . Note that there are consistency constraints that ( x S ∗ ) | S | = ℓ must obey, such as x S ∗ x T ∗ x V ∗ = 1 when S △ T △ V = ∅ ; we will relax the problem and will not require our vector u = ( u S ) | S | = ℓ to obey such constraints. Instead we simply attempt to maximize1 k u k X | S △ T | = p Y S △ T u S u T (7)10ver all u ∈ R ( nℓ ). To do this, we iterate the update equations u S ← X T : | S △ T | = p Y S △ T u T . (8)We call S and T neighbors if | S △ T | = p . Intuitively, each neighbor T of S sends a message m T → S := Y S △ T u T to S , indicating T ’s opinion about u S . We update u S to be the sum of allincoming messages.Now note that the sum in (7) is simply k u k − u ⊤ M u where M is the symmetric differencematrix, and (8) can be written as u ← M u . Thus our natural message-passing scheme is precisely power iteration against M , and so we shouldtake the leading eigenvector v top of M as our estimate of ( x S ∗ ) | S | = ℓ (up to a scaling factor). Finally,defining our voting matrix V ( v top ) and taking its leading eigenvector is a natural method forrounding v top to a vector of the form u x where u xS = x S , thus restoring the consistency constraintswe ignored before.Indeed, if we carry out this procedure on all subsets S ⊆ [ n ] then this works as intended, andno rounding is necessary: consider the 2 n × n matrix M S,T = Y S △ T | S △ T | = p . It is easy to verifythat the eigenvectors of M are precisely the Fourier basis vectors on the hypercube, namely vectorsof the form u x where u xS = x S and x ∈ {± } n . Moreover, the eigenvalue associated to u x is12 n ( u x ) ⊤ M u x = 12 n X S,T ⊆ [ n ] : | S △ T | = p Y S △ T x S x T = X | E | = p Y E x E . This is the expression in the log-likelihood in (5). Thus the leading eigenvector of M is exactly u x where x is the maximum-likelihood estimate of x ∗ .This procedure succeeds all the way down to the information-theoretic threshold λ ∼ n (1 − p ) / ,but takes exponential time. Our contribution can be viewed as showing that even when we restrictto the submatrix M of M supported on subsets of size ℓ , the leading eigenvector still allows usto recover x ∗ whenever the SNR is sufficiently large. Proving this requires us to perform Fourieranalysis over a slice of the hypercube rather than the simpler setting of the entire hypercube, whichwe do by appealing to Johnson schemes and some results of [Fil16]. We now introduce the
Kikuchi approximations to the free energy (or simply the
Kikuchi free ener-gies ) of the above posterior (5) [Kik51, Kik94], the principle from which our algorithms are derived.For concreteness we restrict to the case of the Rademacher-spiked tensor model, but the Kikuchifree energies can be defined for general graphical models [YFW03].The posterior distribution in (5) is a Gibbs distribution P ( x | Y ) ∝ e − βH ( x ) with randomHamiltonian H ( x ) := − λ P i < ··· H : Ω → R . Considerthe optimization problem inf µ F ( µ ) , (9)11here the supremum is over probability distributions µ supported on Ω, and define the free energyfunctional F of µ by F ( µ ) := E x ∼ µ [ H ( x )] − β S ( µ ) , (10)where S ( µ ) is the Shannon entropy of µ , i.e., S ( µ ) = − P x ∈ Ω µ ( x ) log µ ( x ). Then the uniqueoptimizer of (9) is the Gibbs distribution µ ∗ ( x ) ∝ exp( − βH ( x )). If we specialize this statement toour setting, µ ∗ = P ( ·| Y ) and F n (1; Y ) = F ( µ ∗ ). We refer to [WJ08] for more background.In light of the above variational characterization, a natural algorithmic strategy to learn theposterior distribution is to minimize the free energy functional F ( µ ) over distributions µ . However,this is a priori intractable because (for a high-dimensional domain such as Ω = {± } n ) an exponen-tial number of parameters are required to represent µ . The idea underlying the belief propagation algorithm [Pea86, YFW03] is to work only with locally-consistent marginals, or beliefs , instead ofa complete distribution µ . Standard belief propagation works with beliefs on singleton variablesand on pairs of variables. The Bethe free energy is a proxy for the free energy that only dependson these beliefs, and belief propagation is a certain procedure that iteratively updates the beliefsin order to locally minimize the Bethe free energy. The level- r Kikuchi free energy is a generaliza-tion of the
Bethe free energy that depends on r -wise beliefs and gives (in principle) increasinglybetter approximations of F ( µ ∗ ) as r increases. Our algorithms are based on the principle of locallyminimizing Kikuchi free energy, which we define next.We now define the level- r Kikuchi approximation to the free energy. We require r ≥ p , i.e.,the Kikuchi level needs to be at least as large as the interactions present in the data (although the r < p case could be handled by defining a modified graphical model with auxiliary variables). TheBethe free energy is the case r = 2.For S ⊆ [ n ] with 0 < | S | ≤ r , let b S : {± } S → R denote the belief on S , which is a probabilitymass function over {± } | S | representing our belief about the joint distribution of x S := { x i } i ∈ S .Let b = { b S : S ∈ (cid:0) [ n ] ≤ r (cid:1) } denote the set of beliefs on s -wise interactions for all s ≤ r . Following[YFW03], the Kikuchi free energy is a real-valued functional K of b having the form E − β S (in ourcase, β = 1). Here the first term is the ‘energy’ term E ( b ) = − λ X | S | = p Y S X x S ∈{± } S b S ( x S ) x S . where, recall, x S := Q i ∈ S x i . (This is a proxy for the term E x ∼ µ [ H ( x )] in (10).) The second termin K is the ‘entropy’ term S ( b ) = X < | S |≤ r c S S S ( b ) , S S ( b ) = − X x S ∈{± } S b S ( x S ) log b S ( x S ) , where the overcounting numbers are c S := P T ⊇ S, | T |≤ r ( − | T \ S | . These are defined so that for any S ⊆ [ n ] with 0 < | S | ≤ r , X T ⊇ S, | T |≤ r c T = 1 , (11)which corrects for overcounting. Notice that E and S each take the form of an “expectation” withrespect to the beliefs b S ; these would be actual expectations were the beliefs the marginals of anactual probability distribution. This situation is to be contrasted with the notion of a pseudo-expectation , which plays a central role in the theory underlying the sum-of-squares algorithm.Our algorithms are based on he Kikuchi Hessian , a generalization of the Bethe Hessian matrixthat was introduced in the setting of community detection [SKZ14]. The Bethe Hessian is the12essian of the Bethe free energy with respect to the moments of the beliefs, evaluated at beliefpropagation’s so-called “uninformative fixed point.” The bottom eigenvector of the Bethe Hessianis a natural estimator for the planted signal because it represents the best direction for localimprovement of the Bethe free energy, starting from belief propagation’s uninformative startingpoint. We generalize this method and compute the analogous Kikuchi Hessian matrix. The fullderivation is given in Appendix E. The order- ℓ symmetric difference matrix (2) (approximately)appears as a submatrix of the level- r Kikuchi Hessian whenever r ≥ ℓ + p/ We have presented a hierarchy of spectral algorithms for tensor PCA, inspired by variational in-ference and statistical physics. In particular, the core idea of our approach is to locally minimizethe Kikuchi free energy. We specifically implemented this via the Kikuchi Hessian, but there maybe many other viable approaches to minimizing the Kikuchi free energy such as generalized beliefpropagation [YFW03]. Broadly speaking, we conjecture that for many average-case problems, al-gorithms based on Kikuchi free energy and algorithms based on sum-of-squares should both achievethe optimal tradeoff between runtime and statistical power. 
One direction for further work is toverify that this analogy holds for problems other than tensor PCA; in particular, we show here thatit also applies to refuting random k -XOR formulas when k is even.Perhaps one benefit of the Kikuchi hierarchy over the sum-of-squares hierarchy is that it hasallowed us to systematically obtain spectral methods, simply by computing a certain Hessian matrix.Furthermore, the algorithms we obtained are simpler than their SOS counterparts. We are hopefulthat the Kikuchi hierarchy will provide a roadmap for systematically deriving simple and optimalalgorithms for a large class of problems. Acknowledgments
We thank Alex Russell for suggesting the matrix Chernoff bound (Theorem A.4). For helpfuldiscussions, we thank Afonso Bandeira, Sam Hopkins, Pravesh Kothari, Florent Krzakala, TselilSchramm, Jonathan Shi, and Lenka Zdeborov´a. This project started during the workshop
SpinGlasses and Related Topics held at the Banff International Research Station (BIRS) in the Fall of2018. We thank our hosts at BIRS as well as the workshop organizers: Antonio Auffinger, Wei-KuoChen, Dmitry Panchenko, and Lenka Zdeborov´a.
References [AB ˇC13] Antonio Auffinger, G´erard Ben Arous, and Jiˇr´ı ˇCern`y. Random matrices and complex-ity of spin glasses.
Communications on Pure and Applied Mathematics , 66(2):165–201,2013.[ADGM16] Anima Anandkumar, Yuan Deng, Rong Ge, and Hossein Mobahi. Homotopy analysisfor tensor PCA. arXiv preprint arXiv:1610.09322 , 2016.[AG89] Richard Arratia and Louis Gordon. Tutorial on large deviations for the binomialdistribution.
Bulletin of mathematical biology , 51(1):125–131, 1989.13AGH +
14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Tel-garsky. Tensor decompositions for learning latent variable models.
The Journal ofMachine Learning Research , 15(1):2773–2832, 2014.[AGJ17] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Analyzing tensor powermethod dynamics in overcomplete regime.
The Journal of Machine Learning Research ,18(1):752–791, 2017.[AGJ18] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Algorithmic thresholdsfor tensor PCA. arXiv preprint arXiv:1808.00921 , 2018.[AKS98] Noga Alon, Michael Krivelevich, and Benny Sudakov. Finding a large hidden cliquein a random graph.
Random Structures & Algorithms , 13(3-4):457–466, 1998.[AOW15] Sarah R Allen, Ryan O’Donnell, and David Witmer. How to refute a random CSP.In , pages689–708. IEEE, 2015.[BCR19] Giulio Biroli, Chiara Cammarota, and Federico Ricci-Tersenghi. How to iron outrough landscapes and get optimal performances: Replicated gradient descent and itsapplication to tensor PCA. arXiv preprint arXiv:1905.12294 , 2019.[BGG +
16] Vijay VSP Bhattiprolu, Mrinalkanti Ghosh, Venkatesan Guruswami, Euiwoong Lee,and Madhur Tulsiani. Multiplicative approximations for polynomial optimization overthe unit sphere. In
Electronic Colloquium on Computational Complexity (ECCC) ,volume 23, page 1, 2016.[BGL16] Vijay Bhattiprolu, Venkatesan Guruswami, and Euiwoong Lee. Sum-of-squares certifi-cates for maxima of random tensors on the sphere. arXiv preprint arXiv:1605.00903 ,2016.[BGN11] Florent Benaych-Georges and Raj Rao Nadakuditi. The eigenvalues and eigenvectorsof finite, low rank perturbations of large random matrices.
Advances in Mathematics ,227(1):494–521, 2011.[BHK +
16] Boaz Barak, Samuel B Hopkins, Jonathan Kelner, Pravesh Kothari, Ankur Moitra,and Aaron Potechin. A nearly tight sum-of-squares lower bound for the planted cliqueproblem. In , pages 428–437. IEEE, 2016.[BLM15] Charles Bordenave, Marc Lelarge, and Laurent Massouli´e. Non-backtracking spectrumof random graphs: community detection and non-regular ramanujan graphs. In , pages 1347–1357.IEEE, 2015.[BMV +
18] Jess Banks, Cristopher Moore, Roman Vershynin, Nicolas Verzelen, and Jiaming Xu.Information-theoretic bounds and phase transitions in clustering, sparse PCA, andsubmatrix localization.
IEEE Transactions on Information Theory , 64(7):4872–4894,2018.[Bur17] Amanda Burcroff. Johnson schemes and certain matrices with integral eigenvalues.Technical report, University of Michigan, 2017.14CGL04] Amin Coja-Oghlan, Andreas Goerdt, and Andr´e Lanka. Strong refutation heuristicsfor random k-sat. In
Approximation, Randomization, and Combinatorial Optimiza-tion. Algorithms and Techniques , pages 310–321. Springer, 2004.[Che17] Wei-Kuo Chen. Phase transition in the spiked random tensor with rademacher prior. arXiv preprint arXiv:1712.01777 , 2017.[CHL18] Wei-Kuo Chen, Madeline Handschy, and Gilad Lerman. Phase transition in randomtensors with multiple spikes. arXiv preprint arXiv:1809.06790 , 2018.[DAM15] Yash Deshpande, Emmanuel Abbe, and Andrea Montanari. Asymptotic mutual infor-mation for the two-groups stochastic block model. arXiv preprint arXiv:1507.08685 ,2015.[DKMZ11a] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborov´a. Asymp-totic analysis of the stochastic block model for modular networks and its algorithmicapplications.
Physical Review E , 84(6):066106, 2011.[DKMZ11b] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborov´a. Inferenceand phase transitions in the detection of modules in sparse networks.
Physical ReviewLetters , 107(6):065701, 2011.[DMK +
16] Mohamad Dia, Nicolas Macris, Florent Krzakala, Thibault Lesieur, and Lenka Zde-borov´a. Mutual information for symmetric rank-one matrix estimation: A proof ofthe replica formula. In
Advances in Neural Information Processing Systems , pages424–432, 2016.[DMM09] David L Donoho, Arian Maleki, and Andrea Montanari. Message-passing algo-rithms for compressed sensing.
Proceedings of the National Academy of Sciences ,106(45):18914–18919, 2009.[EKJ18] Ahmed El Alaoui, Florent Krzakala, and Michael I Jordan. Fundamental limits ofdetection in the spiked wigner model. arXiv preprint arXiv:1806.09588 , 2018.[Fil16] Yuval Filmus. An orthogonal basis for functions over a slice of the Boolean hypercube.
The Electronic Journal of Combinatorics , 23(1):P1–23, 2016.[FP07] Delphine F´eral and Sandrine P´ech´e. The largest eigenvalue of rank one deformationof large wigner matrices.
Communications in mathematical physics , 272(1):185–228,2007.[FR18] Alyson K Fletcher and Sundeep Rangan. Iterative reconstruction of rank-one matricesin noise.
Information and Inference: A Journal of the IMA , 7(3):531–562, 2018.[Gri01] Dima Grigoriev. Linear lower bound on degrees of Positivstellensatz calculus proofsfor the parity.
Theoretical Computer Science , 259(1-2):613–622, 2001.[GS10] Chris Godsil and Sung Y Song. Association schemes.
Handbook of CombinatorialDesigns, , pages 325–330, 2010.[Has19] Matthew B Hastings. Classical and quantum algorithms for tensor principal compo-nent analysis. arXiv preprint arXiv:1907.12724 , 2019.15HKP +
17] Samuel B Hopkins, Pravesh K Kothari, Aaron Potechin, Prasad Raghavendra, TselilSchramm, and David Steurer. The power of sum-of-squares for detecting hidden struc-tures. In , pages 720–731. IEEE, 2017.[HSS15] Samuel B Hopkins, Jonathan Shi, and David Steurer. Tensor principal componentanalysis via sum-of-square proofs. In
Conference on Learning Theory , pages 956–1006,2015.[HSSS16] Samuel B Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spec-tral algorithms from sum-of-squares proofs: tensor decomposition and planted sparsevectors. In
Proceedings of the forty-eighth annual ACM symposium on Theory ofComputing , pages 178–191. ACM, 2016.[Jer92] Mark Jerrum. Large cliques elude the metropolis process.
Random Structures &Algorithms , 3(4):347–359, 1992.[JL04] Iain M Johnstone and Arthur Yu Lu. Sparse principal components analysis.
Unpub-lished manuscript , 7, 2004.[JLM18] Aukosh Jagannath, Patrick Lopatto, and Leo Miolane. Statistical thresholds for tensorPCA. arXiv preprint arXiv:1812.03403 , 2018.[Kik51] Ryoichi Kikuchi. A theory of cooperative phenomena.
Phys. Rev. , 81:988, 1951.[Kik94] Ryoichi Kikuchi. Special issue in honor of R. Kikuchi.
Progr. Theor. Phys. Suppl , 115,1994.[KMM +
13] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, LenkaZdeborov´a, and Pan Zhang. Spectral redemption in clustering sparse networks.
Pro-ceedings of the National Academy of Sciences , 110(52):20935–20940, 2013.[KWB19] Dmitriy Kunisky, Alexander S Wein, and Afonso S Bandeira. Notes on computationalhardness of hypothesis testing: Predictions using the low-degree likelihood ratio. arXivpreprint arXiv:1907.11636 , 2019.[Las01] Jean B Lasserre. An explicit exact SDP relaxation for nonlinear 0-1 programs. In
International Conference on Integer Programming and Combinatorial Optimization ,pages 293–303. Springer, 2001.[LKZ15a] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborov´a. MMSE of probabilistic low-rank matrix estimation: Universality with respect to the output channel. In ,pages 680–687. IEEE, 2015.[LKZ15b] Thibault Lesieur, Florent Krzakala, and Lenka Zdeborov´a. Phase transitions in sparsePCA. In , pages1635–1639. IEEE, 2015.[LML +
17] Thibault Lesieur, L´eo Miolane, Marc Lelarge, Florent Krzakala, and Lenka Zdeborov´a.Statistical and computational phase transitions in spiked tensor estimation. In , pages 511–515. IEEE,2017. 16MNS14] Elchanan Mossel, Joe Neeman, and Allan Sly. Belief propagation, robust reconstruc-tion and optimal recovery of block models. In
Conference on Learning Theory , pages356–370, 2014.[MRZ15] Andrea Montanari, Daniel Reichman, and Ofer Zeitouni. On the limitation of spectralmethods: From the gaussian hidden clique problem to rank-one perturbations of gaus-sian tensors. In
Advances in Neural Information Processing Systems , pages 217–225,2015.[Oli10] Roberto Oliveira. Sums of random Hermitian matrices and an inequality by Rudelson.
Electronic Communications in Probability , 15:203–212, 2010.[Par00] Pablo A Parrilo.
Structured semidefinite programs and semialgebraic geometry methodsin robustness and optimization . PhD thesis, California Institute of Technology, 2000.[Pea86] Judea Pearl. Fusion, propagation, and structuring in belief networks.
Artificial intel-ligence , 29(3):241–288, 1986.[PWB16] Amelia Perry, Alexander S Wein, and Afonso S Bandeira. Statistical limits of spikedtensor models. arXiv preprint arXiv:1612.07728 , 2016.[PWBM18] Amelia Perry, Alexander S Wein, Afonso S Bandeira, and Ankur Moitra. Optimalityand sub-optimality of PCA I: Spiked random matrix models.
The Annals of Statistics ,46(5):2416–2451, 2018.[RM14] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. In
Advancesin Neural Information Processing Systems , pages 2897–2905, 2014.[RRS17] Prasad Raghavendra, Satish Rao, and Tselil Schramm. Strongly refuting randomCSPs below the spectral threshold. In
Proceedings of the 49th Annual ACM SIGACTSymposium on Theory of Computing , pages 121–131. ACM, 2017.[Saa16] Alaa Saade. Spectral inference methods on sparse graphs: theory and applications. arXiv preprint arXiv:1610.04337 , 2016.[Sch79] Alexander Schrijver. A comparison of the Delsarte and Lov´asz bounds.
IEEE Trans.Information Theory , 25(4):425–429, 1979.[Sch08] Grant Schoenebeck. Linear level Lasserre lower bounds for certain k-CSPs. In , pages 593–602.IEEE, 2008.[Sho87] Naum Z Shor. Class of global minimum bounds of polynomial functions.
Cybernetics ,23(6):731–734, 1987.[SKZ14] Alaa Saade, Florent Krzakala, and Lenka Zdeborov´a. Spectral clustering of graphswith the Bethe Hessian. In
Advances in Neural Information Processing Systems , pages406–414, 2014.[Tro12] Joel A Tropp. User-friendly tail bounds for sums of random matrices.
Foundations ofcomputational mathematics , 12(4):389–434, 2012.[Wat90] William C Waterhouse. The absolute-value estimate for symmetric multilinear forms.
Linear Algebra and its Applications , 128:97–105, 1990.17WJ08] Martin J Wainwright and Michael I Jordan. Graphical models, exponential families,and variational inference.
Foundations and Trends R (cid:13) in Machine Learning , 1(1–2):1–305, 2008.[YFW03] Jonathan S Yedidia, William T Freeman, and Yair Weiss. Understanding belief propa-gation and its generalizations. Exploring artificial intelligence in the new millennium ,8:236–239, 2003.
A Analysis of Symmetric Difference and Voting Matrices
We adopt the notation x S := Q i ∈ S x i for x ∈ {± } n and S ⊆ [ n ]. Recall the matrix M indexedby sets S ⊆ [ n ] of size ℓ , having entries M S,T = Y S △ T | S △ T | = p where Y S △ T = λx S △ T ∗ + G S △ T . (12)First, observe that we can restrict our attention to the case where the spike is the all-onesvector x ∗ = without loss of generality. To see this, conjugate M by a diagonal matrix D with diagonal entries D S,S = x S ∗ and obtain ( M ′ ) S,T = ( D − M D ) S,T = Y ′ S △ T | S △ T | = p where Y ′ S △ T = x S ∗ x T ∗ Y S △ T = x S △ T ∗ Y S △ T = λ + g ′ S △ T where g ′ S △ T = x S ∗ x T ∗ g S △ T . By symmetry of theGaussian distribution, ( g ′ E ) | E | = p are i.i.d. N (0 ,
1) random variables. Therefore, the two matri-ces have the same spectrum and the eigenvectors of M can be obtained from those of M ′ bypre-multiplying by D . So from now on we write M = λ X + Z , (13)where X S,T = | S △ T | = p and Z S,T = g S △ T | S △ T | = p , where ( g E ) | E | = p is a collection of i.i.d. N (0 , A.1 Structure of X The matrix X is the adjacency matrix of a regular graph J n,ℓ,p on (cid:0) nℓ (cid:1) vertices, where vertices arerepresented by sets, and two sets S and T are connected by an edge if | S △ T | = p , or equivalently | S ∩ T | = ℓ − p/
2. This matrix belongs to the Bose-Mesner algebra of the ( n, ℓ )-Johnson associationscheme (see for instance [Sch79, GS10]). This is the algebra of (cid:0) nℓ (cid:1) × (cid:0) nℓ (cid:1) real- or complex-valuedsymmetric matrices where the entry X S,T depends only on the size of the intersection | S ∩ T | . Inaddition to this set of matrices being an algebra, it is a commutative algebra, which means that allsuch matrices are simultaneously diagonalizable and share the same eigenvectors.Filmus [Fil16] provides a common eigenbasis for this algebra: for 0 ≤ m ≤ ℓ , let ϕ =( a , b , . . . , a m , b m ) be a sequence of 2 m distinct elements of [ n ]. Let | ϕ | = 2 m denote its totallength. Now define a vector u ϕ ∈ R ( nℓ ) having coordinates u ϕS = m Y i =1 ( a i ∈ S − b i ∈ S ) , | S | = ℓ . In the case m = 0, ϕ is the empty sequence ∅ and we have u ∅ = (the all-ones vector). Proposition A.1.
Each u ϕ is an eigenvector of X . Furthermore, the linear space Y m := span { u ϕ : | ϕ | = 2 m } for ≤ m ≤ ℓ is an eigenspace of X (i.e., all vectors u ϕ with sequences ϕ of lengthof m have the same eigenvalue µ m ). Lastly R ( nℓ ) = L ℓm =0 Y m , and dim Y m = (cid:0) nm (cid:1) − (cid:0) nm − (cid:1) . (Byconvention, (cid:0) n − (cid:1) = 0 .) roof. The first two statements are the content of Lemma 4.3 in [Fil16]. The dimension of Y m isgiven in Lemma 2.1 in [Fil16].We note that ( u ϕ ) | ϕ | =2 m are not linearly independent; an orthogonal basis, called the Youngbasis, consisting of linear combinations of the u ϕ ’s is given explicitly in [Fil16].We see from the above proposition that X has ℓ + 1 distinct eigenvalues µ ≥ µ ≥ · · · ≥ µ ℓ ,each one corresponding to the eigenspace Y m . The first eigenvalue is the degree of the graph J n,ℓ,p : µ = d ℓ = (cid:18) ℓp/ (cid:19)(cid:18) n − ℓp/ (cid:19) . (14)We provide an explicit formula for all the remaining eigenvalues: Lemma A.2.
The eigenvalues of X are as follows: µ m = min( m,p/ X s =0 ( − s (cid:18) ms (cid:19)(cid:18) ℓ − mp/ − s (cid:19)(cid:18) n − ℓ − mp/ − s (cid:19) , ≤ m ≤ ℓ . (15) Proof.
These are the so-called Eberlein polynomials, which are polynomials in m of degree p (see,e.g., [Sch79]). We refer to [Bur17] for formulae in more general contexts, but we give a proof herefor completeness. Let A = { a , . . . , a m } and B = { b , . . . , b m } . Note that u ϕS is nonzero if and onlyif | S ∩ { a i , b i }| = 1 for each 1 ≤ i ≤ m . By symmetry, we can assume that A ⊆ S and S ∩ B = ∅ .Then µ m is the sum over all sets T , such that | S △ T | = p and | T ∩ { a i , b i }| = 1 for each i , of( − s where s = | T ∩ B | . For each s there are (cid:0) ms (cid:1) choices of this set of indices, giving the firstbinomial. Adding these b i to T and removing these a i from S contributes 2 s to | S △ T | . To achieve | S △ T | = p , we also need to remove p/ − s elements of S \ A from S , giving the second binomial.We also need to add p/ − s elements of S ∪ B to T , giving the third binomial. Finally, we have s ≤ m and s ≤ p/ m . Lemma A.3.
Let ≤ p ≤ √ n and let ℓ < n/p . For all ≤ m ≤ ℓ it holds that | µ m | µ ≤ max (cid:26)(cid:16) − mℓ (cid:17) p/ , pn (cid:27) . Proof.
The terms in (15) have alternating signs. We will show that they decrease in absolute valuebeyond the first nonzero term, so that it gives a bound on µ m . We consider two cases. First,suppose m ≤ ℓ − p/ s = 0 term is positive. Then the ( s + 1)st term divided by the s thterm is, in absolute value, (cid:0) ms +1 (cid:1)(cid:0) ℓ − mp/ − s − (cid:1)(cid:0) n − ℓ − mp/ − s − (cid:1)(cid:0) ms (cid:1)(cid:0) ℓ − mp/ − s (cid:1)(cid:0) n − ℓ − mp/ − s (cid:1) = (cid:18) m − ss + 1 (cid:19) (cid:18) p/ − sℓ − m − p/ s + 1 (cid:19) (cid:18) p/ − sn − ℓ − m − p/ s + 1 (cid:19) ≤ m ( p/ p/ ℓ − m − p/ n − ℓ − m − p/ ≤ ( ℓ − p/ p/ p/ n − ℓ + 1 since m ≤ ℓ − p/ ≤ ℓp / n − ℓ < .
19t follows that the s = 0 term is an upper bound, µ m ≤ (cid:18) ℓ − mp/ (cid:19)(cid:18) n − ℓ − mp/ (cid:19) , and so µ m µ ≤ (cid:18) ℓ − mp/ (cid:19)(cid:30)(cid:18) ℓp/ (cid:19) ≤ (cid:18) ℓ − mℓ (cid:19) p/ . Next we consider the case m > ℓ − p/
2, so that the first nonzero term has s = p/ − ℓ + m ≥ µ m by at least a factor of n , and we will show this is the case. For theterms with s ≥ p/ − ℓ + m , the ratio of absolute values is again bounded by (cid:0) ms +1 (cid:1)(cid:0) ℓ − mp/ − s − (cid:1)(cid:0) n − ℓ − mp/ − s − (cid:1)(cid:0) ms (cid:1)(cid:0) ℓ − mp/ − s (cid:1)(cid:0) n − ℓ − mp/ − s (cid:1) = (cid:18) m − ss + 1 (cid:19) (cid:18) p/ − sℓ − m − p/ s + 1 (cid:19) (cid:18) p/ − sn − ℓ − m − p/ s + 1 (cid:19) < ( ℓ − p/ p/ p/ p/ − ℓ + m + 1)( n − ℓ + 1) ≤ ℓp / n − ℓ < . It follows that the s = p/ − ℓ + m term gives a bound on the absolute value, | µ m | ≤ (cid:18) mℓ − p/ (cid:19)(cid:18) n − ℓ − mℓ − m (cid:19) ≤ (cid:18) ℓp/ (cid:19)(cid:18) n − ℓℓ − m (cid:19) , in which case | µ m | µ ≤ (cid:0) n − ℓℓ − m (cid:1)(cid:0) n − ℓp/ (cid:1) = ( p/ n − ℓ − p/ ℓ − m )!( n − ℓ + m )! ≤ (cid:18) p/ n − ℓ − p/ (cid:19) p/ − ℓ + m ≤ (cid:16) pn (cid:17) p/ − ℓ + m ≤ pn . Combining these two cases gives the stated result.
A.2 Proof of Strong Detection
Here we prove our strong detection result, Theorem 3.3. The proof doesn’t exploit the full detailsof the structure exhibited above. Instead, the proof is a straightforward application of the matrixChernoff bound for Gaussian series [Oli10] (see also Theorem 4.1.1 of [Tro12]):
Theorem A.4.
Let { A i } be a finite sequence of fixed symmetric d × d matrices, and let { ξ i } beindependent N (0 , random variables. Let Σ = P i ξ i A i . Then, for all t ≥ , P ( k Σ k op ≥ t ) ≤ de − t / σ where σ = (cid:13)(cid:13) E [ Σ ] (cid:13)(cid:13) op . Let us first write M in the form of a Gaussian series. For a set E ∈ (cid:0) [ n ] p (cid:1) , define the (cid:0) nℓ (cid:1) × (cid:0) nℓ (cid:1) matrix A E as ( A E ) S,T = S △ T = E . It is immediate that for λ = 0, M = Z = P | E | = p g E A E where ( g E ) | E | = p is a collection of i.i.d. N (0 ,
1) random variables. The second moment of this random matrix is E [ M ] = X E : | E | = p A E = d ℓ I , A E is the diagonal matrix with ( A E ) S,S = | S △ E | = p , and summing over all E gives d ℓ on thediagonal. The operator norm of the second moment is then d ℓ . It follows that for all t ≥ P (cid:0) λ max ( M ) ≥ t (cid:1) ≤ (cid:18) nℓ (cid:19) e − t / d ℓ ≤ n ℓ e − t / d ℓ . (16)Now letting t = λd ℓ yields the first statement of the theorem.As for the second statement, we have λ max ( M ) ≥ k X k op − k Z k op = λd ℓ − k Z k op where Z isdefined in (13). Applying the same bound we have P λ (cid:16) λ max ( M ) ≤ λd ℓ (cid:17) ≤ P (cid:16) k Z k op ≤ λd ℓ (cid:17) ≤ n ℓ e − λ d ℓ / . A.3 Proof of Strong Recovery
Here we prove our strong recovery result, Theorem 3.4. Let v = v top ( M ) be a unit-norm leadingeigenvector of M . For a fixed m ∈ [ ℓ ] (to be determined later on), we write the orthogonaldecomposition v = v ( m ) + v ⊥ , where v ( m ) ∈ L s ≤ m Y s , and v ⊥ in the orthogonal complement. Thegoal is to first show that if m is proportional to ℓ then v ⊥ has small Euclidean norm, so that v and v ( m ) have high inner product. The second step of the argument is to approximate the votingmatrix V ( v ) by V ( v ( m ) ), and then use Fourier-analytic tools to reason about the latter.Let us start with the first step. Lemma A.5.
With $Z$ defined as in (13), we have
$$\left\|v^\perp\right\|^2 \le \frac{2\,\|Z\|_{\mathrm{op}}}{\lambda\,(\mu_0 - \mu_{m+1})}.$$

Proof.
Let us absorb the factor $\lambda$ into the definition of the matrix $X$. Let $\{u_1, \dots, u_d\}$ be a set of eigenvectors of $X$ which also form an orthogonal basis for $\bigoplus_{s \le m} Y_s$, with $u_1$ being the top eigenvector of $X$ ($u_1$ is the normalized all-ones vector). We start with the inequality $u_1^\top M u_1 \le v^\top M v$. The left-hand side of the inequality is $\mu_0 + u_1^\top Z u_1$. The right-hand side is $v^\top X v + v^\top Z v$. Moreover $v^\top X v = v^\top X v^{(m)} + v^\top X v^\perp$. Since $v^{(m)} \in \bigoplus_{s\le m} Y_s$, by Proposition A.1, $X v^{(m)}$ belongs to this space as well, and therefore $v^\top X v^{(m)} = v^{(m)\top} X v^{(m)}$. Similarly $v^\top X v^\perp = (v^\perp)^\top X v^\perp$, so $v^\top X v = v^{(m)\top} X v^{(m)} + (v^\perp)^\top X v^\perp$. Therefore the inequality becomes
$$\mu_0 + u_1^\top Z u_1 \le v^{(m)\top} X v^{(m)} + (v^\perp)^\top X v^\perp + v^\top Z v.$$
Since $v^\perp$ is orthogonal to the top $m$ eigenspaces of $X$ we have $(v^\perp)^\top X v^\perp \le \mu_{m+1}\|v^\perp\|^2$. Moreover, $v^{(m)\top} X v^{(m)} \le \mu_0\|v^{(m)}\|^2$, hence
$$\mu_0 + u_1^\top Z u_1 \le \mu_0\left\|v^{(m)}\right\|^2 + \mu_{m+1}\left\|v^\perp\right\|^2 + v^\top Z v.$$
By rearranging (using $\|v^{(m)}\|^2 = 1 - \|v^\perp\|^2$) and applying the triangle inequality we get
$$(\mu_0 - \mu_{m+1})\left\|v^\perp\right\|^2 \le |v^\top Z v| + |u_1^\top Z u_1| \le 2\,\|Z\|_{\mathrm{op}}.$$
Combining this fact with Lemma A.3, recalling that $\mu_0 = d_\ell$, we obtain

Lemma A.6. For any $\epsilon > 0$ and $\delta \in (0,1)$, if $\lambda \ge \epsilon^{-1}\sqrt{8\log(2n^\ell/\delta)/d_\ell}$, then
$$\left\|v^\perp\right\|^2 \le \epsilon\,\frac{\ell}{m},$$
with probability at least $1 - \delta$.

Proof. Lemma A.3 implies $\mu_{m+1} \le \mu_0\left(1 - \frac{m+1}{\ell}\right)^{p/2} \le \mu_0\left(1 - \frac{m}{\ell}\right)$, so that $\mu_0 - \mu_{m+1} \ge \mu_0\cdot\frac{m}{\ell}$. Therefore, Lemma A.5 implies
$$\left\|v^\perp\right\|^2 \le \frac{2\,\|Z\|_{\mathrm{op}}}{\lambda\, d_\ell}\cdot\frac{\ell}{m}.$$
The operator norm of the noise can be bounded by a matrix Chernoff bound [Oli10, Tro12], similarly to our argument in the proof of detection: for all $t \ge 0$,
$$\mathbb{P}\left(\|Z\|_{\mathrm{op}} \ge t\right) \le 2n^\ell e^{-t^2/2d_\ell}.$$
Therefore, letting $\lambda \ge \epsilon^{-1}\sqrt{8\log(2n^\ell/\delta)/d_\ell}$ we obtain the desired result.
A.3.1 Analysis of the Voting Matrix

Recall that the voting matrix $V(v)$ of a vector $v \in \mathbb{R}^{\binom{n}{\ell}}$ has zeros on the diagonal, and off-diagonal entries
$$V_{ij}(v) = \frac{1}{2}\sum_{S,T \in \binom{[n]}{\ell}} v_S\, v_T\, \mathbb{1}[S \triangle T = \{i,j\}] = \sum_{S \in \binom{[n]}{\ell}} v_S\, v_{S\triangle\{i,j\}}\, \mathbb{1}[i \in S,\ j \notin S], \qquad 1 \le i \ne j \le n.$$
It will be more convenient in our analysis to work with $V(v^{(m)})$ instead of $V(v)$. To this end we produce the following approximation result:

Lemma A.7.
Let $u, e \in \mathbb{R}^{\binom{n}{\ell}}$ and $v = u + e$. Then
$$\|V(v) - V(u)\|_F \le \sqrt{3}\,\ell\,\|e\|\left(2\|u\|^2 + \|e\|^2\right)^{1/2}.$$
In particular, $\left\|V(v) - V(v^{(m)})\right\|_F \le 3\ell\left\|v^\perp\right\|$.

Proof.
Let us introduce the shorthand notation $\langle u, v\rangle_{ij} := \sum_{S\in\binom{[n]}{\ell}} u_S\, v_{S\triangle\{i,j\}}\,\mathbb{1}[i\in S,\ j\notin S]$. We have
$$\|V(v) - V(u)\|_F^2 = \sum_{i,j}\left(V_{ij}(u+e) - V_{ij}(u)\right)^2 = \sum_{i,j}\left(\langle u+e, u+e\rangle_{ij} - \langle u,u\rangle_{ij}\right)^2 = \sum_{i,j}\left(\langle u,e\rangle_{ij} + \langle e,u\rangle_{ij} + \langle e,e\rangle_{ij}\right)^2$$
$$\le 3\sum_{i,j}\left(\langle u,e\rangle_{ij}^2 + \langle e,u\rangle_{ij}^2 + \langle e,e\rangle_{ij}^2\right), \quad (17)$$
where the last step uses the bound $(a+b+c)^2 \le 3(a^2+b^2+c^2)$. Now we expand
$$\sum_{i,j}\langle u,e\rangle_{ij}^2 = \sum_{i,j}\Bigg(\sum_{S:\,|S|=\ell,\ i\in S,\ j\notin S} u_S\, e_{S\triangle\{i,j\}}\Bigg)^2 \le \sum_{i,j}\Bigg(\sum_{S:\,|S|=\ell,\ i\in S,\ j\notin S} u_S^2\Bigg)\Bigg(\sum_{S:\,|S|=\ell,\ i\in S,\ j\notin S} e_{S\triangle\{i,j\}}^2\Bigg) \quad\text{(Cauchy–Schwarz)}$$
$$\le \sum_{i,j}\Bigg(\sum_{S:\,|S|=\ell,\ i\in S} u_S^2\Bigg)\Bigg(\sum_{T:\,|T|=\ell,\ j\in T} e_T^2\Bigg) = \Bigg(\sum_i\sum_{S:\,|S|=\ell,\ i\in S} u_S^2\Bigg)\Bigg(\sum_j\sum_{T:\,|T|=\ell,\ j\in T} e_T^2\Bigg) = \ell\|u\|^2\cdot\ell\|e\|^2 = \ell^2\|u\|^2\|e\|^2. \quad (18)$$
Plugging this back into (17) yields the desired result. To obtain $\|V(v) - V(v^{(m)})\|_F \le 3\ell\|v^\perp\|$ we just bound $2\|v^{(m)}\|^2 + \|v^\perp\|^2$ by 3.

Let us also state the following lemma, which will be needed later on:

Lemma A.8.
For $u \in \mathbb{R}^{\binom{n}{\ell}}$, $\|V(u)\|_F \le \ell\,\|u\|^2$.

Proof. Note that $\|V(u)\|_F^2 = \sum_{i,j} V_{ij}(u)^2 = \sum_{i,j}\langle u,u\rangle_{ij}^2$, and so the desired result follows immediately from (18).

Next, in the main technical part of the proof, we show that $V(v^{(m)})$ is close to a multiple of the all-ones matrix in Frobenius norm:

Proposition A.9.
Let $\hat{x} = \mathbf{1}/\sqrt{n}$, $\alpha = \ell\,\|v^{(m)}\|^2$ and $\eta = 2\left(\frac{m}{\ell} + \frac{\ell}{n}\right)$. Then
$$\left\|V(v^{(m)}) - \alpha\,\hat{x}\hat{x}^\top\right\|_F \le \sqrt{\eta}\,\alpha.$$
Before proving the above proposition, let us put the results together and prove our recovery result.
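As a companion to the detection sketch in Appendix A.2, here is the rounding step at toy scale (our code; it reuses `n`, `l`, `x`, `M`, `itertools`, and `np` from that snippet, and the helper name is again ours): form the voting matrix of the top eigenvector of $M$ and read off its leading eigenvector.

```python
def voting_matrix(v, n, l):
    """V_ij(v) = sum over S with i in S, j not in S of v_S * v_{S xor {i,j}}."""
    subsets = list(itertools.combinations(range(n), l))
    index = {S: k for k, S in enumerate(subsets)}
    V = np.zeros((n, n))
    for k, S in enumerate(subsets):
        for i in S:
            for j in range(n):
                if j not in S:
                    T = tuple(sorted((set(S) - {i}) | {j}))
                    V[i, j] += v[k] * v[index[T]]
    return V

v = np.linalg.eigh(M)[1][:, -1]        # unit-norm leading eigenvector of M
V = voting_matrix(v, n, l)
xhat = np.linalg.eigh(V)[1][:, -1]     # leading eigenvector of the voting matrix
corr = abs(xhat @ x) / np.sqrt(n)      # |<xhat, x*>| / (||xhat|| ||x*||)
print("overlap with the spike: %.3f" % corr)
```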
A.3.2 Proof of Theorem 3.4
For $\epsilon, \delta > 0$, assume $\lambda \ge \epsilon^{-1}\sqrt{8\log(2n^\ell/\delta)/d_\ell}$. By Lemma A.6 and Lemma A.7 we have
$$\left\|V(v) - V(v^{(m)})\right\|_F \le 3\ell\sqrt{\frac{\ell}{m}\epsilon},$$
with probability at least $1-\delta$. Moreover, by Proposition A.9, we have $\left\|V(v^{(m)}) - \alpha\hat x\hat x^\top\right\|_F \le \sqrt{\eta}\,\alpha$, where $\alpha = \ell\|v^{(m)}\|^2 \le \ell$ and $\eta = 2\left(\frac m\ell + \frac\ell n\right)$. Therefore, by a triangle inequality we have
$$\left\|V(v) - \alpha\,\hat x\hat x^\top\right\|_F \le 3\ell\sqrt{\frac{\ell\epsilon}{m}} + \sqrt{\eta}\,\alpha \le \ell\left(3\sqrt{\frac{\ell\epsilon}{m}} + \sqrt{2}\sqrt{\frac{m}{\ell} + \frac{\ell}{n}}\right),$$
with probability at least $1-\delta$. Now let us choose a value of $m$ that achieves a good tradeoff between the above two error terms: $m = \ell\sqrt\epsilon$. Let us also use the inequality $\sqrt{a+b}\le\sqrt a+\sqrt b$ for positive $a,b$, to obtain
$$\left\|V(v) - \alpha\,\hat x\hat x^\top\right\|_F \le \ell\left(5\,\epsilon^{1/4} + \sqrt{\frac{2\ell}{n}}\right), \quad (19)$$
under the same event. Now let $\widehat x$ be a leading eigenvector of $V(v)$, and let $R = V(v) - \alpha\hat x\hat x^\top$. Since $\widehat x^\top V(v)\,\widehat x \ge \hat x^\top V(v)\,\hat x$, we have
$$\alpha\,\langle\widehat x,\hat x\rangle^2 + \widehat x^\top R\,\widehat x \ge \alpha + \hat x^\top R\,\hat x.$$
Therefore
$$\langle\widehat x,\hat x\rangle^2 \ge 1 - \frac{2\,\|R\|_{\mathrm{op}}}{\alpha}.$$
Since $\alpha = \ell(1 - \|v^\perp\|^2)$, and $\|v^\perp\|^2 \le \epsilon\frac\ell m \le \sqrt\epsilon$, the bound (19) (together with $\|R\|_{\mathrm{op}} \le \|R\|_F$) implies
$$\langle\widehat x,\hat x\rangle^2 \ge 1 - \frac{2\left(5\,\epsilon^{1/4} + \sqrt{2\ell/n}\right)}{1 - \sqrt\epsilon}. \quad (20)$$
To conclude the proof of our theorem, we let $\ell \le n\sqrt\epsilon$, $\epsilon < 1/16$, and then replace $\epsilon$ by a suitable constant multiple of $\epsilon^4$: we obtain $\langle\widehat x,\hat x\rangle^2 \ge 1 - \epsilon$ with probability at least $1-\delta$ if $\lambda \ge C\,\epsilon^{-4}\sqrt{\log(2n^\ell/\delta)/d_\ell}$ for an absolute constant $C$.

A.3.3 Proof of Proposition A.9
Let $\alpha = \ell\,\|v^{(m)}\|^2$. By Lemma A.8 we have $\left\|V(v^{(m)})\right\|_F \le \alpha$. Therefore
$$\left\|V(v^{(m)}) - \alpha\hat x\hat x^\top\right\|_F^2 = \left\|V(v^{(m)})\right\|_F^2 - \frac{2\alpha}{n}\sum_{i,j=1}^n V_{ij}(v^{(m)}) + \alpha^2 \le 2\alpha^2 - \frac{2\alpha}{n}\sum_{i,j=1}^n V_{ij}(v^{(m)}). \quad (21)$$
Now we need a lower bound on $\sum_{i,j=1}^n V_{ij}(v^{(m)})$. This will crucially rely on the fact that $v^{(m)}$ lies in the span of the top $m$ eigenspaces of $X$:

Lemma A.10.
For a fixed $m \le \ell$, let $v \in \bigoplus_{s=0}^m Y_s$. Then
$$\sum_{i,j=1}^n V_{ij}(v) \ge \left((\ell-m)\,n - \ell^2\right)\|v\|^2.$$
We plug the result of the above lemma into (21) to obtain
$$\left\|V(v^{(m)}) - \alpha\hat x\hat x^\top\right\|_F^2 \le 2\alpha^2\left(1 - \frac{\ell-m}{\ell} + \frac{\ell}{n}\right) = 2\alpha^2\left(\frac{m}{\ell} + \frac{\ell}{n}\right),$$
as desired.

A.3.4 A Poincaré Inequality on a Slice of the Hypercube

To prove Lemma A.10, we need some results on Fourier analysis on the slice of the hypercube $\binom{[n]}{\ell}$. Following [Fil16], we define the following. First, given a function $f : \binom{[n]}{\ell} \to \mathbb{R}$, we define its expectation as its average value over all sets of size $\ell$, and write
$$\mathbb{E}_{|S|=\ell}[f(S)] := \frac{1}{\binom{n}{\ell}}\sum_{|S|=\ell} f(S).$$
We also define its variance as $\mathbb{V}[f] := \mathbb{E}_{|S|=\ell}[f(S)^2] - \mathbb{E}_{|S|=\ell}[f(S)]^2$. Moreover, we identify a vector $u \in \mathbb{R}^{\binom{n}{\ell}}$ with a function on sets of size $\ell$ in the obvious way: $f(S) = u_S$.

Definition A.11.
For $u \in \mathbb{R}^{\binom{n}{\ell}}$ and $i,j \in [n]$, let $u^{(ij)}$ denote the vector having coordinates
$$u^{(ij)}_S = \begin{cases} u_{S\triangle\{i,j\}} & \text{if } |S \cap \{i,j\}| = 1 \\ u_S & \text{otherwise.}\end{cases}$$
(The operation $u \mapsto u^{(ij)}$ exchanges the roles of $i$ and $j$ whenever possible.)

Definition A.12.
For $u \in \mathbb{R}^{\binom{n}{\ell}}$ and $i,j \in [n]$, define the influence of the pair $(i,j)$ as
$$\mathrm{Inf}_{ij}[u] := \frac{1}{2}\,\mathbb{E}_{|S|=\ell}\left[\left(u^{(ij)}_S - u_S\right)^2\right],$$
and the total influence as $\mathrm{Inf}[u] := \frac{1}{n}\sum_{i<j}\mathrm{Inf}_{ij}[u]$.

The key fact we need, which follows from the Fourier-theoretic results of [Fil16], is that low-degree functions on the slice have small total influence: if $v \in \bigoplus_{s=0}^m Y_s$ then $\mathrm{Inf}[v] \le m\,\mathbb{V}[v]$.

Proof of Lemma A.10. Expanding the square in the definition of $\mathrm{Inf}_{ij}$, only sets with $|S\cap\{i,j\}| = 1$ contribute, and for such sets $u^{(ij)}_S = u_{S\triangle\{i,j\}}$; therefore
$$\mathrm{Inf}_{ij}[v] = \binom{n}{\ell}^{-1}\Bigg(\sum_{S:\,|S\cap\{i,j\}|=1} v_S^2 - 2V_{ij}(v)\Bigg).$$
Summing over pairs, each set $S$ has exactly $\ell(n-\ell)$ pairs $(i,j)$ with $|S\cap\{i,j\}|=1$, so
$$m\,\mathbb{E}_{|S|=\ell}\left[v_S^2\right] \ge m\,\mathbb{V}[v] \ge \mathrm{Inf}[v] = \frac{1}{n}\binom{n}{\ell}^{-1}\Bigg(\ell(n-\ell)\,\|v\|^2 - \sum_{i,j=1}^n V_{ij}(v)\Bigg).$$
Since $\mathbb{E}_{|S|=\ell}[v_S^2] = \binom{n}{\ell}^{-1}\|v\|^2$, rearranging gives
$$\sum_{i,j=1}^n V_{ij}(v) \ge \left(\ell(n-\ell) - mn\right)\|v\|^2 \ge \left((\ell-m)\,n - \ell^2\right)\|v\|^2,$$
as claimed.

B General Spike Priors

While we have mainly focused on the Rademacher-spiked tensor model, we now show that our algorithm works just as well (at least for detection) for a much larger class of spike priors.

Theorem B.1. Let $p$ be even. Consider the spiked tensor model with a spike prior $\mathcal P_x$ that draws the entries of $x^*$ i.i.d. from some distribution $\pi$ on $\mathbb R$ (which does not depend on $n$), normalized so that $\mathbb E[\pi^2] = 1$. There is a constant $C$ (depending on $p$ and $\pi$) such that if $\lambda \ge C\,\ell^{1/2}\, d_\ell^{-1/2}\sqrt{\log n}$ then Algorithm 3.1 achieves strong detection.

Proof. From (16), we have $\|Z\|_{\mathrm{op}} = O(\sqrt{\ell\, d_\ell\log n})$ with high probability, and so it remains to give a lower bound on $\|X\|_{\mathrm{op}}$. Letting $u_S = \prod_{i\in S}\mathrm{sgn}((x^*)_i)$ for $|S| = \ell$,
$$\|X\|_{\mathrm{op}} \ge \frac{u^\top X u}{\|u\|^2} = \binom{n}{\ell}^{-1} u^\top X u$$
where
$$u^\top X u = \sum_{|S\triangle T| = p} u_S\, X_{S,T}\, u_T = \sum_{|S\triangle T|=p} x_*^{S\triangle T}\prod_{i\in S}\mathrm{sgn}((x^*)_i)\prod_{i\in T}\mathrm{sgn}((x^*)_i) = \sum_{|S\triangle T|=p}\left|x_*^{S\triangle T}\right|.$$
We have
$$\mathbb{E}\left[u^\top X u\right] = \binom{n}{\ell} d_\ell\,(\mathbb{E}|\pi|)^p =: C_1(\pi,p)\binom{n}{\ell}d_\ell, \quad (22)$$
and
$$\mathrm{Var}\left[u^\top X u\right] = \mathrm{Var}\Bigg[\sum_{|S\triangle T|=p}\left|x_*^{S\triangle T}\right|\Bigg] = \sum_{|S\triangle T|=p}\ \sum_{|S'\triangle T'|=p}\mathrm{Cov}\left(\left|x_*^{S\triangle T}\right|, \left|x_*^{S'\triangle T'}\right|\right). \quad (23)$$
We have
$$\mathrm{Cov}\left(\left|x_*^{S\triangle T}\right|,\left|x_*^{S'\triangle T'}\right|\right) \le \sqrt{\mathrm{Var}\left(\left|x_*^{S\triangle T}\right|\right)\mathrm{Var}\left(\left|x_*^{S'\triangle T'}\right|\right)} = \mathrm{Var}\left(\left|x_*^{S\triangle T}\right|\right) \le \mathbb{E}\left[\left|x_*^{S\triangle T}\right|^2\right] = (\mathbb{E}[\pi^2])^p = 1.$$
Also, $\mathrm{Cov}(|x_*^{S\triangle T}|,|x_*^{S'\triangle T'}|) = 0$ unless $S\triangle T$ and $S'\triangle T'$ have nonempty intersection. Using Lemma B.2 (below), the fraction of terms in (23) that are nonzero is at most $p^2/n$ and so
$$\mathrm{Var}\left[u^\top X u\right] \le \left[\binom{n}{\ell}d_\ell\right]^2\frac{p^2}{n}. \quad (24)$$
By Chebyshev's inequality, it follows from (22) and (24) that $u^\top X u \ge \frac{C_1(\pi,p)}{2}\binom n\ell d_\ell$ with probability at least $1 - \frac{4p^2}{C_1(\pi,p)^2\, n}$. This implies $\|X\|_{\mathrm{op}} \ge \frac{C_1(\pi,p)}{2}\,d_\ell$ with the same probability, and so we have strong detection provided $\lambda \ge c\,\ell^{1/2} d_\ell^{-1/2}\sqrt{\log n}$ for a particular constant $c = c(\pi,p)$.

Above, we made use of the following lemma.

Lemma B.2. Fix $A\subseteq[n]$ with $|A| = a$. Let $B$ be chosen uniformly at random from all subsets of $[n]$ of size $b$. Then $\mathbb P(A\cap B \ne \emptyset) \le \frac{ab}{n}$.

Proof. Each element of $A$ lies in $B$ with probability $b/n$, so the result follows by a union bound over the elements of $A$.

C The Odd-$p$ Case

When $p$ is odd, the Kikuchi Hessian still gives rise to a spectral algorithm. While we conjecture that this algorithm is optimal, we unfortunately only know how to prove sub-optimal results for it. (However, we can prove optimal results for a related algorithm; see Appendix F.2.) We now state the algorithm and its conjectured performance.

Let $p$ be odd and fix an integer $\ell \in [\lfloor p/2\rfloor,\, n - \lceil p/2\rceil]$. Consider the symmetric difference matrix $M \in \mathbb{R}^{\binom n\ell\times\binom n{\ell+1}}$ with entries
$$M_{S,T} = \begin{cases} Y_{S\triangle T} & \text{if } |S\triangle T| = p \\ 0 & \text{otherwise,}\end{cases}$$
for $S, T\subseteq[n]$ with $|S| = \ell$ and $|T| = \ell+1$.

Algorithm C.1 (Recovery for odd $p$). Let $u$ be a (unit-norm) top left-singular vector of $M$ and let $v = M^\top u$ be the corresponding top right-singular vector (up to normalization). Output $\widehat x = \widehat x(Y)\in\mathbb R^n$, defined by
$$\widehat x_i = \sum_{S\in\binom{[n]}{\ell},\ T\in\binom{[n]}{\ell+1}} u_S\, v_T\,\mathbb{1}\left[S\triangle T = \{i\}\right], \qquad i\in[n].$$
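A toy run of Algorithm C.1 is sketched below (our code; sizes and signal strength are arbitrary, and the rounding loop exploits that $S\triangle T = \{i\}$ forces $T = S\cup\{i\}$).

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, l, p, lam = 10, 2, 3, 3.0
x = rng.choice([-1, 1], size=n)                      # Rademacher spike x*
Y = {E: lam * np.prod(x[list(E)]) + rng.normal()     # Y_E = lam * x^E + N(0,1)
     for E in itertools.combinations(range(n), p)}

rows = list(itertools.combinations(range(n), l))     # |S| = l
cols = list(itertools.combinations(range(n), l + 1)) # |T| = l + 1
M = np.zeros((len(rows), len(cols)))
for i, S in enumerate(rows):
    for j, T in enumerate(cols):
        E = tuple(sorted(set(S) ^ set(T)))
        if len(E) == p:
            M[i, j] = Y[E]

U, s, Vt = np.linalg.svd(M)
u, v = U[:, 0], s[0] * Vt[0, :]                      # v = M^T u
col_index = {T: j for j, T in enumerate(cols)}
xhat = np.zeros(n)
for k, S in enumerate(rows):                         # xhat_i sums u_S v_T over T = S + {i}
    for i in range(n):
        if i not in S:
            T = tuple(sorted(S + (i,)))
            xhat[i] += u[k] * v[col_index[T]]
corr = abs(xhat @ x) / (np.linalg.norm(xhat) * np.sqrt(n))
print("overlap with the spike: %.3f" % corr)
```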
Notice that the rounding step, which extracts an $n$-dimensional vector $\widehat x$ from the singular vectors of $M$, is slightly simpler than in the even-$p$ case, in that it does not require forming a voting matrix. We conjecture that, as in the even case, this algorithm matches the performance of SOS.

Conjecture C.2. Consider the Rademacher-spiked tensor model with $p \ge 3$ odd. If $\lambda \gg \ell^{-(p-2)/4}\, n^{-p/4}$ then (i) there is a threshold $\tau = \tau(n,p,\ell,\lambda)$ such that strong detection can be achieved by thresholding the top singular value of $M$ at $\tau$, and (ii) Algorithm C.1 achieves strong recovery.

Similarly to the proof of Theorem 3.4, the matrix Chernoff bound (Theorem A.4) can be used to show that strong recovery is achievable when $\lambda \gg \ell^{-(p-1)/4}\, n^{-(p-1)/4}$, which is weaker than SOS when $\ell \ll n$. We now explain the difficulties involved in improving this. We can decompose $M$ into a signal part and a noise part: $M = \lambda X + Z$. In the regime of interest, $\ell^{-(p-2)/4} n^{-p/4} \ll \lambda \ll \ell^{-(p-1)/4} n^{-(p-1)/4}$, the signal term is smaller in operator norm than the noise term, i.e., $\lambda\|X\|_{\mathrm{op}} \ll \|Z\|_{\mathrm{op}}$. While at first sight this would seem to suggest that detection and recovery are hopeless, we actually expect that $\lambda X$ still affects the top singular value and singular vectors of $M$. This phenomenon is already present in the analysis of tensor unfolding (the case $p = 3$, $\ell = 1$) [HSS15], but it seems that new ideas are required to extend the analysis beyond this case.

D Proof of Boosting

Definition D.1. For a tensor $G \in (\mathbb R^n)^{\otimes p}$, the injective tensor norm is
$$\|G\|_{\mathrm{inj}} := \max_{\|u^{(1)}\| = \cdots = \|u^{(p)}\| = 1}\ \sum_{i_1,\dots,i_p} G_{i_1,\dots,i_p}\, u^{(1)}_{i_1}\cdots u^{(p)}_{i_p},$$
where $u^{(j)} \in \mathbb R^n$. For a symmetric tensor $G$, it is known [Wat90] that, equivalently,
$$\|G\|_{\mathrm{inj}} = \max_{\|u\| = 1}\ \Bigg|\sum_{i_1,\dots,i_p} G_{i_1,\dots,i_p}\, u_{i_1}\cdots u_{i_p}\Bigg|.$$

Proof of Proposition 2.6. Write $\hat x = \lambda\langle u, x^*\rangle^{p-1} x^* + \Delta$ where $\|\Delta\| \le \|G\|_{\mathrm{inj}}\,\|u\|^{p-1}$. We have
$$|\langle\hat x, x^*\rangle| \ge \lambda|\langle u, x^*\rangle|^{p-1}\|x^*\|^2 - \|\Delta\|\,\|x^*\| \quad\text{and}\quad \|\hat x\| \le \lambda|\langle u,x^*\rangle|^{p-1}\|x^*\| + \|\Delta\|,$$
so that
$$\mathrm{corr}(\hat x, x^*) = \frac{|\langle\hat x,x^*\rangle|}{\|\hat x\|\,\|x^*\|} \ge \frac{\lambda|\langle u,x^*\rangle|^{p-1}\|x^*\| - \|\Delta\|}{\lambda|\langle u,x^*\rangle|^{p-1}\|x^*\| + \|\Delta\|} = 1 - \frac{2\|\Delta\|}{\lambda|\langle u,x^*\rangle|^{p-1}\|x^*\| + \|\Delta\|}$$
$$\ge 1 - \frac{2\|\Delta\|}{\lambda|\langle u,x^*\rangle|^{p-1}\|x^*\|} \ge 1 - \frac{2\,\|G\|_{\mathrm{inj}}\,\|u\|^{p-1}}{\lambda|\langle u,x^*\rangle|^{p-1}\|x^*\|} \ge 1 - \frac{2\,\|G\|_{\mathrm{inj}}}{\lambda\,\tau^{p-1}\,\|x^*\|^p}.$$
Our prior $\mathcal P_x$ is supported on the sphere of radius $\sqrt n$, so $\|x^*\| = \sqrt n$. We need to control the injective norm of the tensor $G$. To this end we use Theorem 2.12 in [ABČ13] (see also Lemma 2.1 of [RM14]): there exists a constant $c(p) > 0$ (given by $E_0(p)$ in [ABČ13]) such that for all $\epsilon > 0$,
$$\mathbb P\left(\sqrt{\frac{p}{n}}\,\|G\|_{\mathrm{inj}} \ge c(p) + \epsilon\right) \underset{n\to\infty}{\longrightarrow} 0.$$
Letting $\epsilon = c(p)$ we obtain
$$\mathrm{corr}(\hat x, x^*) \ge 1 - \frac{4\,c(p)}{\sqrt p}\cdot\frac{n^{(1-p)/2}}{\lambda\,\tau^{p-1}},$$
with probability tending to 1 as $n\to\infty$.

E Computing the Kikuchi Hessian

In Section 4 we defined the Kikuchi free energy and explained the high-level idea of how the symmetric difference matrices are derived from the Kikuchi Hessian. We now carry out the Kikuchi Hessian computation in full detail. This is a heuristic (non-rigorous) computation, but we believe these methods are important as we hope they will be useful for systematically obtaining optimal spectral methods for a wide variety of problems.

E.1 Derivatives of Kikuchi Free Energy

Following [SKZ14], we parametrize the beliefs in terms of the moments $m_S = \mathbb E[x^S]$. Specifically,
$$b_S(x_S) = \frac{1}{2^{|S|}}\sum_{T\subseteq S} m_T\, x^T, \quad (25)$$
where we use the convention $m_\emptyset = 1$. We imagine the $m_T$ are close enough to zero so that $b_S$ is a positive measure. One can check that these beliefs indeed have the prescribed moments: for $T\subseteq S$,
$$\sum_{x_S} b_S(x_S)\, x^T = m_T.$$
We now view $K$ as a function of the moments $\{m_S\}_{0 < |S| \le r}$. This parametrization forces the beliefs to be consistent, i.e., if $T\subseteq S$ then the marginal distribution $b_{S|T}$ is equal to $b_T$.

We now compute first and second derivatives of $K = E - S$ with respect to the moments $m_S$. First, the energy term:
$$E = -\lambda\sum_{|S|=p} Y_S\, m_S, \qquad \frac{\partial E}{\partial m_S} = \begin{cases}-\lambda Y_S & \text{if } |S| = p\\ 0 & \text{otherwise,}\end{cases} \qquad \frac{\partial^2 E}{\partial m_S\,\partial m_{S'}} = 0.$$
Now the entropy term:
$$\frac{\partial S_S}{\partial b_S(x_S)} = -\log b_S(x_S) - 1.$$
From (25), for $\emptyset\subset T\subseteq S$, $\frac{\partial b_S(x_S)}{\partial m_T} = \frac{x^T}{2^{|S|}}$ and so
$$\frac{\partial S_S}{\partial m_T} = \sum_{x_S}\frac{\partial S_S}{\partial b_S(x_S)}\cdot\frac{\partial b_S(x_S)}{\partial m_T} = -2^{-|S|}\sum_{x_S} x^T\left[\log b_S(x_S) + 1\right] = -2^{-|S|}\sum_{x_S} x^T\log b_S(x_S). \quad (26)$$
For $\emptyset\subset T\subseteq S$ and $\emptyset\subset T'\subseteq S$,
$$\frac{\partial^2 S_S}{\partial m_T\,\partial m_{T'}} = -2^{-|S|}\sum_{x_S} x^T\, b_S(x_S)^{-1}\cdot\frac{\partial b_S(x_S)}{\partial m_{T'}} = -4^{-|S|}\sum_{x_S} x^T x^{T'}\, b_S(x_S)^{-1} = -4^{-|S|}\sum_{x_S} x^{T\triangle T'}\, b_S(x_S)^{-1}.$$
Finally, if $T\nsubseteq S$ then $\frac{\partial S_S}{\partial m_T} = 0$.

E.2 The Case $r = p$

We first consider the simplest case, where $r$ is as small as possible: $r = p$. (We need to require $r\ge p$ in order to express the energy term in terms of the beliefs.)

E.2.1 Trivial Stationary Point

There is a "trivial stationary point" of the Kikuchi free energy where the beliefs only depend on local information. Specifically, if $|S| < p$ then $b_S$ is the uniform distribution over $\{\pm1\}^{|S|}$, and if $|S| = p$ then $b_S(x_S)\propto\exp(\lambda Y_S x^S)$, i.e.,
$$b_S(x_S) = \frac{1}{Z_S}\exp\left(\lambda Y_S\, x^S\right) \quad (27)$$
where $Z_S = \sum_{x_S}\exp(\lambda Y_S x^S)$. Note that these beliefs are consistent (if $T\subseteq S$ with $|S|\le p$ then $b_{S|T} = b_T$) and so there is a corresponding set of moments $\{m_S\}_{|S|\le p}$.

We now check that this is indeed a stationary point of the Kikuchi free energy. Using (26) and (27) we have, for $\emptyset\subset T\subseteq S$ and $|S|\le p$,
$$\frac{\partial S_S}{\partial m_T} = -2^{-|S|}\sum_{x_S} x^T\log b_S(x_S) = -2^{-|S|}\sum_{x_S} x^T\left[-\log Z_S + \lambda\,\mathbb{1}[|S|=p]\, Y_S\, x^S\right] = \begin{cases}-\lambda Y_T & \text{if } T = S,\ |T| = p\\ 0 & \text{otherwise.}\end{cases}$$
Hence if $|T| < p$ we have $\frac{\partial K}{\partial m_T} = 0$, and if $|T| = p$ we have
$$\frac{\partial K}{\partial m_T} = \frac{\partial}{\partial m_T}\Bigg(E - \sum_{0<|S|\le p} c_S\, S_S\Bigg) = -\lambda Y_T + c_T\,\lambda Y_T = 0,$$
since the Kikuchi coefficient of a maximal set satisfies $c_T = 1$. This confirms that we indeed have a stationary point.

E.2.2 Hessian

We now compute the Kikuchi Hessian, the matrix indexed by subsets $0 < |T| \le p$ with entries $H_{T,T'} = \frac{\partial^2 K}{\partial m_T\,\partial m_{T'}}$, evaluated at the trivial stationary point. Similarly to the Bethe Hessian [SKZ14], we expect the bottom eigenvector of the Kikuchi Hessian to be a good estimate for the (moments of) the true signal. This is because this bottom eigenvector indicates the best local direction for improving the Kikuchi free energy, starting from the trivial stationary point. If all eigenvalues of $H$ are positive then the trivial stationary point is a local minimum, and so an algorithm acting locally on the beliefs should not be able to escape from it, and should not learn anything about the signal. On the other hand, a negative eigenvalue (or even an eigenvalue close to zero) indicates a (potential) direction for improvement.

Remark E.1. When $p$ is odd, we cannot hope for a substantially negative eigenvalue because $x^*$ and $-x^*$ are not equally-good solutions, and so the Kikuchi free energy should be locally cubic instead of quadratic. Still, we believe that the bottom eigenvector of the Kikuchi Hessian (which will have eigenvalue close to zero) yields a good algorithm.
For instance, we will see in the next section that this method yields a close variant of tensor unfolding when $r = p = 3$.

Recall that for $\emptyset\subset T\subseteq S$ and $\emptyset\subset T'\subseteq S$,
$$\frac{\partial^2 S_S}{\partial m_T\,\partial m_{T'}} = -4^{-|S|}\sum_{x_S} x^{T\triangle T'}\, b_S(x_S)^{-1}.$$
If $|S| < p$ then $b_S$ is uniform on $\{\pm1\}^{|S|}$ (at the trivial stationary point) and so
$$\frac{\partial^2 S_S}{\partial m_T\,\partial m_{T'}} = -2^{-|S|}\sum_{x_S} x^{T\triangle T'} = -\mathbb{1}[T = T'].$$
If $|S| = p$ then $b_S(x_S) = \frac{1}{Z_S}\exp(\lambda Y_S x^S)$ where $Z_S = \sum_{x_S}\exp(\lambda Y_S x^S) = 2^{|S|}\cosh(\lambda Y_S)$, and so
$$\frac{\partial^2 S_S}{\partial m_T\,\partial m_{T'}} = -4^{-|S|}\sum_{x_S} x^{T\triangle T'}\, Z_S\exp(-\lambda Y_S x^S) = -2^{-|S|}\sum_{x_S} x^{T\triangle T'}\cosh(\lambda Y_S)\left(\cosh(\lambda Y_S) - \sinh(\lambda Y_S)\, x^S\right),$$
where we used $\exp(-\lambda Y_S x^S) = \cosh(\lambda Y_S) - \sinh(\lambda Y_S)\,x^S$ for $x^S\in\{\pm1\}$ (recall $\cosh(x) = 1 + \frac{x^2}{2!} + \cdots$ and $\sinh(x) = x + \frac{x^3}{3!} + \cdots$). Hence
$$\frac{\partial^2 S_S}{\partial m_T\,\partial m_{T'}} = \begin{cases} -\cosh^2(\lambda Y_S) & \text{if } T = T',\ T\subseteq S\\ \cosh(\lambda Y_S)\sinh(\lambda Y_S) & \text{if } T\sqcup T' = S\\ 0 & \text{otherwise,}\end{cases}$$
where $\sqcup$ denotes disjoint union. (Note that we have replaced $\triangle$ with $\sqcup$ because, due to the restriction $T, T'\subseteq S$, the condition $T\triangle T' = S$ forces $T$ and $T'$ to be disjoint.)
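Since the derivation is heuristic, a direct numerical check is reassuring. The snippet below (our code) takes a single set $S$ with $|S| = p = 2$, parametrizes $b_S$ by its moments as in (25), and compares a finite-difference Hessian of the entropy $S_S$ at the trivial stationary point (27) against the $-\cosh^2$ and $\cosh\sinh$ values above.

```python
import itertools
import numpy as np

lam_Y = 0.3                                   # the value lambda * Y_S for one set S
states = list(itertools.product([-1, 1], repeat=2))
Tsets = [(0,), (1,), (0, 1)]                  # nonempty subsets T of S = {0, 1}

def belief(m, x):
    """b_S(x) from (25): 2^{-|S|} * sum_T m_T x^T, with m_empty = 1."""
    return 0.25 * (1 + sum(mT * np.prod([x[i] for i in T])
                           for mT, T in zip(m, Tsets)))

def entropy(m):
    return -sum(belief(m, x) * np.log(belief(m, x)) for x in states)

m0 = np.array([0.0, 0.0, np.tanh(lam_Y)])     # moments of the trivial point (27)
h = 1e-4
H = np.zeros((3, 3))
for a in range(3):
    for b in range(3):
        ea, eb = np.eye(3)[a] * h, np.eye(3)[b] * h
        H[a, b] = (entropy(m0 + ea + eb) - entropy(m0 + ea - eb)
                   - entropy(m0 - ea + eb) + entropy(m0 - ea - eb)) / (4 * h * h)

c, s = np.cosh(lam_Y), np.sinh(lam_Y)
print(np.round(H, 4))                  # diagonal: -cosh^2; ({0},{1}) entry: cosh*sinh
print("predicted: %.4f and %.4f" % (-c * c, c * s))
```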
We can now compute the Hessian:
$$H_{T,T'} = \frac{\partial^2 K}{\partial m_T\,\partial m_{T'}} = -\sum_{\substack{S\supseteq T\cup T'\\ |S|\le p}} c_S\,\frac{\partial^2 S_S}{\partial m_T\,\partial m_{T'}} \quad (28)$$
$$= \begin{cases}\displaystyle\sum_{\substack{S\supseteq T\\ |S|<p}} c_S + \sum_{\substack{S\supseteq T\\ |S|=p}}\cosh^2(\lambda Y_S) & \text{if } T = T'\\[2mm] -\cosh(\lambda Y_{T\sqcup T'})\sinh(\lambda Y_{T\sqcup T'}) & \text{if } |T\sqcup T'| = p\\ 0 & \text{otherwise.}\end{cases}$$
If $\lambda\ll1$ and $T = T'$ then, using the $\cosh$ Taylor series, we have the leading-order approximation
$$H_{T,T}\approx\sum_{\substack{S\supseteq T\\|S|=p}}\lambda^2 Y_S^2\approx\binom{n-|T|}{p-|T|}\lambda^2\,\mathbb E[Y_S^2]\approx\frac{n^{p-|T|}}{(p-|T|)!}\,\lambda^2.$$
This means $H\approx\tilde H$ where
$$\tilde H_{T,T'} = \begin{cases}1\vee\frac{n^{p-|T|}}{(p-|T|)!}\lambda^2 & \text{if } T=T'\\ -\lambda\,Y_{T\sqcup T'} & \text{if } |T\sqcup T'| = p\\ 0 & \text{otherwise.}\end{cases}$$

E.2.3 The Case $r = p = 3$

We now restrict to the case $r = p = 3$ and show that the Kikuchi Hessian recovers (a close variant of) the tensor unfolding method. Recall that in this case the computational threshold is $\lambda\sim n^{-3/4}$ and so we can assume $\lambda\ll n^{-1/2}$ (or else the problem is easy). We have
$$\tilde H_{T,T'} = \begin{cases}\frac{n^2\lambda^2}{2} & \text{if } T = T' \text{ with } |T| = 1\\ 1 & \text{if } T = T' \text{ with } |T|\in\{2,3\}\\ -\lambda\,Y_{T\sqcup T'} & \text{if } |T\sqcup T'| = 3\\ 0 & \text{otherwise.}\end{cases}$$
This means we can write
$$\tilde H = \begin{pmatrix}\alpha I & -\lambda M & 0\\ -\lambda M^\top & I & 0\\ 0 & 0 & I\end{pmatrix}$$
where $\alpha = \frac{n^2\lambda^2}{2}$ and $M$ is the $n\times\binom n2$ flattening of $Y$, i.e., $M_{i,\{j,k\}} = Y_{ijk}\,\mathbb{1}[i,j,k\text{ distinct}]$.

Since we are looking for the minimum eigenvalue of $\tilde H$, we can restrict ourselves to the submatrix $\tilde H_{\le2}$ indexed by sets of size 1 and 2. We have
$$\tilde H_{\le2} = \begin{pmatrix}\alpha I & -\lambda M\\ -\lambda M^\top & I\end{pmatrix}.$$
An eigenvector $[u\ v]^\top$ of $\tilde H_{\le2}$ with eigenvalue $\beta$ satisfies $\alpha u - \lambda Mv = \beta u$ and $-\lambda M^\top u + v = \beta v$, which implies $(1-\beta)v = \lambda M^\top u$ and so $\lambda^2 MM^\top u = (\alpha-\beta)(1-\beta)u$. This means either $u$ is an eigenvector of $\lambda^2MM^\top$ with eigenvalue $(\alpha-\beta)(1-\beta)$, or $u = 0$ and $\beta\in\{1,\alpha\}$. Conversely, if $u$ is an eigenvector of $\lambda^2MM^\top$ with eigenvalue $(\alpha-\beta)(1-\beta)\ne0$, then $[u\ v]^\top$ with $v = (1-\beta)^{-1}\lambda M^\top u$ is an eigenvector of $\tilde H_{\le2}$ with eigenvalue $\beta$. Letting $\mu_1 > \cdots > \mu_n > 0$ denote the eigenvalues of $\lambda^2MM^\top$, $\tilde H_{\le2}$ has $2n$ eigenvalues of the form
$$\frac{\alpha + 1\pm\sqrt{(\alpha-1)^2 + 4\mu_i}}{2},$$
with the remaining eigenvalues equal to $\alpha$ or 1. Thus, the $u$-part of the bottom eigenvector of $\tilde H_{\le2}$ is precisely the leading eigenvector of $MM^\top$. This is a close variant of the tensor unfolding spectral method (see Section 2.4), and we expect that its performance is essentially identical.

E.2.4 The General Case: $r\ge p$

One difficulty when $r > p$ is that there is no longer a trivial stationary point that we can write down in closed form. There is, however, a natural guess for "uninformative" beliefs that only depend on the local information: for $0 < |S|\le r$,
$$b_S(x_S) = \frac{1}{Z_S}\exp\Bigg(\lambda\sum_{\substack{U\subseteq S\\ |U| = p}} Y_U\, x^U\Bigg)$$
for the appropriate normalizing factor $Z_S$. Unfortunately, these beliefs are not quite consistent, so we need separate moments for each set $S$:
$$m^{(S)}_T = \mathbb E_{x_S\sim b_S}[x^T].$$
Provided $\lambda\ll1$, we can check that $m^{(S)}_T\approx m^{(S')}_T$ to first order, and so the above beliefs are at least approximately consistent:
$$Z_S = \sum_{x_S}\exp\Bigg(\lambda\sum_{\substack{U\subseteq S\\|U|=p}} Y_U\,x^U\Bigg) \approx \sum_{x_S}\Bigg(1 + \lambda\sum_{\substack{U\subseteq S\\|U|=p}} Y_U\,x^U\Bigg) = 2^{|S|}$$
and so
$$m^{(S)}_T = \frac{1}{Z_S}\sum_{x_S} x^T\exp\Bigg(\lambda\sum_{\substack{U\subseteq S\\|U|=p}} Y_U\,x^U\Bigg) \approx 2^{-|S|}\sum_{x_S} x^T\Bigg(1 + \lambda\sum_{\substack{U\subseteq S\\|U|=p}} Y_U\,x^U\Bigg) = \begin{cases}\lambda Y_T & \text{if } |T| = p\\ 0 & \text{otherwise,}\end{cases}$$
which does not depend on $S$. Thus we will ignore the slight inconsistencies and carry on with the derivation. As above, the important calculation is, for $T, T'\subseteq S$,
$$\frac{\partial^2 S_S}{\partial m^{(S)}_T\,\partial m^{(S)}_{T'}} = -4^{-|S|}\sum_{x_S} x^{T\triangle T'}\, b_S(x_S)^{-1} = -4^{-|S|}\sum_{x_S} x^{T\triangle T'}\, Z_S\exp\Bigg(-\lambda\sum_{\substack{U\subseteq S\\|U|=p}}Y_U\,x^U\Bigg)$$
$$\approx -2^{-|S|}\sum_{x_S} x^{T\triangle T'}\Bigg(1 - \lambda\sum_{\substack{U\subseteq S\\|U|=p}}Y_U\,x^U\Bigg) = \begin{cases}-1 & \text{if } T = T'\\ \lambda\,Y_{T\triangle T'} & \text{if } |T\triangle T'| = p\\ 0 & \text{otherwise,}\end{cases}$$
and we define
$$H_{T,T'} := -\sum_{\substack{S\supseteq T\cup T'\\|S|\le r}} c_S\,\frac{\partial^2 S_S}{\partial m^{(S)}_T\,\partial m^{(S)}_{T'}}.$$
If we fix $\ell_1, \ell_2$, the submatrix $H^{(\ell_1,\ell_2)} = (H_{T,T'})_{|T| = \ell_1,\, |T'| = \ell_2}$ takes the form
$$H^{(\ell_1,\ell_2)} \approx a(\ell_1,\ell_2)\,\mathbb{1}[\ell_1 = \ell_2]\, I - b(\ell_1,\ell_2)\, M^{(\ell_1,\ell_2)}$$
for certain scalars $a(\ell_1,\ell_2)$ and $b(\ell_1,\ell_2)$, where $I$ is the identity matrix and $M^{(\ell_1,\ell_2)}\in\mathbb R^{\binom{[n]}{\ell_1}\times\binom{[n]}{\ell_2}}$ is the symmetric difference matrix
$$M^{(\ell_1,\ell_2)}_{T,T'} = \begin{cases} Y_{T\triangle T'} & \text{if } |T\triangle T'| = p\\ 0 & \text{otherwise.}\end{cases}$$
Our algorithms are based on $M^{(\ell,\ell)}$, which (when $p$ is even) appears as a diagonal block of the Kikuchi Hessian whenever $r\ge\ell + p/2$ (this is needed so that there exist sets $|T| = |T'| = \ell$ with $|T\cup T'|\le r$ and $|T\triangle T'| = p$). Our theoretical results (see Section 3) show that indeed $M^{(\ell,\ell)}$ yields algorithms matching the (conjectured optimal) performance of sum-of-squares. When $p$ is odd, $M^{(\ell,\ell)} = 0$ and so we propose to instead focus on $M^{(\ell,\ell+1)}$; see Appendix C.

F Extensions

F.1 Refuting Random $k$-XOR Formulas for $k$ Even

Our symmetric difference matrices can be used to give a simple algorithm and proof for a related problem: strongly refuting random $k$-XOR formulas (see [AOW15, RRS17] and references therein). This is essentially a variant of the spiked tensor problem with sparse Rademacher observations instead of Gaussian ones. It is known [RRS17] that this problem exhibits a smooth tradeoff between subexponential runtime and the number of constraints required, but the proof of [RRS17] involves intensive moment calculations. When $k$ is even, we will give a simple algorithm and a simple proof using the matrix Chernoff bound that achieves the same tradeoff. SOS lower bounds suggest that this tradeoff is optimal [Gri01, Sch08].

When $k$ is odd, we expect that the construction in Section F.2 should achieve the optimal tradeoff, but we do not have a proof for this case.

F.1.1 Setup

Let $x_1,\dots,x_n$ be $\{\pm1\}$-valued variables. A $k$-XOR formula $\Phi$ with $m$ constraints is specified by a sequence of subsets $U_1,\dots,U_m$ with $U_i\subseteq[n]$ and $|U_i| = k$, along with values $b_1,\dots,b_m$ with $b_i\in\{\pm1\}$. For $1\le i\le m$, constraint $i$ is satisfied if $x^{U_i} = b_i$, where $x^{U_i} := \prod_{j\in U_i} x_j$. We write $P_\Phi(x)$ for the number of constraints satisfied by $x$. We will consider a uniformly random $k$-XOR formula in which each $U_i$ is chosen uniformly and independently from the $\binom nk$ possible $k$-subsets, and each $b_i$ is chosen uniformly and independently from $\{\pm1\}$.

Given a formula $\Phi$, the goal of strong refutation is to certify an upper bound on the number of constraints that can be satisfied. In other words, our algorithm should output a bound $B = B(\Phi)$ such that for every formula $\Phi$, $\max_{x\in\{\pm1\}^n} P_\Phi(x)\le B(\Phi)$. (Note that this must be satisfied always, not merely with high probability.) At the same time, we want the bound $B$ to be small with high probability over a random $\Phi$. Since a random assignment $x$ will satisfy roughly half the constraints, the best bound we can hope for is $B = \frac m2(1+\varepsilon)$ with $\varepsilon > 0$ small.

F.1.2 Algorithm

Let $k$ be even and fix $\ell$ with $k/2\le\ell\le n - k/2$. Given a $k$-XOR formula $\Phi$, construct the order-$\ell$ symmetric difference matrix $M\in\mathbb R^{\binom{[n]}\ell\times\binom{[n]}\ell}$ as follows. For $S,T\subseteq[n]$ with $|S| = |T| = \ell$, let
$$M^{(i)}_{S,T} = \begin{cases} b_i & \text{if } S\triangle T = U_i\\ 0 & \text{otherwise,}\end{cases} \qquad\text{and}\qquad M = \sum_{i=1}^m M^{(i)}.$$
Define the parameter
$$d_\ell := \binom{n-\ell}{k/2}\binom{\ell}{k/2},$$
which, for any fixed $|S| = \ell$, is the number of sets $|T| = \ell$ such that $|S\triangle T| = k$. For an assignment $x\in\{\pm1\}^n$, let $u^x\in\mathbb R^{\binom{[n]}\ell}$ be defined by $u^x_S = x^S$ for all $|S| = \ell$.
We have
$$\|M\| \ge \frac{(u^x)^\top M\, u^x}{\|u^x\|^2} = \binom{n}{\ell}^{-1}\sum_{i=1}^m\sum_{S\triangle T = U_i} x^{U_i}\, b_i = d_\ell\binom{n}{k}^{-1}\left(2P_\Phi(x) - m\right),$$
since for any fixed $U_i$ (of size $k$), the number of $(S,T)$ pairs such that $S\triangle T = U_i$ is $\binom n\ell d_\ell\binom nk^{-1}$. Thus we can perform strong refutation by computing $\|M\|$:
$$P_\Phi(x) \le \frac{m}{2} + \frac{1}{2}\,d_\ell^{-1}\binom{n}{k}\|M\|. \quad (29)$$

Theorem F.1. Let $k$ be even and let $k/2\le\ell\le n - k/2$. Let $\beta\in(0,1)$. If
$$m \ge \frac{4e^2\binom nk\log\binom n\ell}{\beta^2\, d_\ell} \quad (30)$$
then $\|M\|$ certifies $P_\Phi(x)\le\frac m2(1+\beta)$ with probability at least $1 - 3\binom n\ell^{-1}$ over a uniformly random $k$-XOR formula $\Phi$ with $m$ constraints.

If $k$ is constant and $\ell = n^\delta$ with $\delta\in(0,1)$, this requires $m \ge O(\beta^{-2}\, n^{k/2}\,\ell^{1-k/2}\log n) = O(\beta^{-2}\, n^{k/2 + \delta(1-k/2)}\log n)$, matching the result of [RRS17]. In fact, our result is tighter by polylog factors.

F.1.3 Binomial Tail Bound

The main ingredients in the proof of Theorem F.1 will be the matrix Chernoff bound (Theorem A.4) and the following standard binomial tail bound.

Proposition F.2. Let $X\sim\mathrm{Binomial}(n,p)$. For $p < \frac un < 1$,
$$\mathbb P(X\ge u) \le \exp\left[-u\left(\log\left(\frac{u}{pn}\right) - 1\right)\right].$$

Proof. We begin with the standard binomial tail bound [AG89]
$$\mathbb P(X\ge u)\le\exp\left(-n\,D\left(\frac un\,\Big\|\, p\right)\right)$$
for $p < \frac un < 1$, where
$$D(a\,\|\,p) := a\log\left(\frac ap\right) + (1-a)\log\left(\frac{1-a}{1-p}\right).$$
Since $\log(x)\ge1 - 1/x$,
$$(1-a)\log\left(\frac{1-a}{1-p}\right)\ge(1-a)\left(1 - \frac{1-p}{1-a}\right) = p - a\ge -a,$$
and the desired result follows.

F.1.4 Proof

Proof of Theorem F.1. We need to bound $\|M\|$ with high probability over a uniformly random $k$-XOR formula $\Phi$. First, fix the subsets $U_1,\dots,U_m$ and consider the randomness of the signs $b_i$. We can write $M$ as a Rademacher series
$$M = \sum_{i=1}^m b_i\, A^{(i)} \qquad\text{where}\qquad A^{(i)}_{S,T} = \mathbb{1}[S\triangle T = U_i].$$
By the matrix Chernoff bound (Theorem A.4, which holds verbatim for Rademacher series [Tro12]),
$$\mathbb P\left(\|M\|\ge t\right)\le2\binom n\ell e^{-t^2/2\sigma^2} = 2\exp\left(\log\binom n\ell - \frac{t^2}{2\sigma^2}\right) \qquad\text{where}\qquad \sigma^2 = \Bigg\|\sum_{i=1}^m\left(A^{(i)}\right)^2\Bigg\|.$$
In particular,
$$\mathbb P\left(\|M\|\ge2\sigma\sqrt{\log\binom n\ell}\right)\le2\binom n\ell^{-1}. \quad (31)$$
Now we will bound $\sigma^2$ with high probability over the random choice of $U_1,\dots,U_m$. We have
$$\sum_{i=1}^m\left(A^{(i)}\right)^2 = \mathrm{diag}(D)$$
where $D_S$ is the number of $i$ for which $|S\triangle U_i| = \ell$. This means $\sigma^2 = \max_{|S|=\ell} D_S$. For fixed $S\subseteq[n]$ with $|S| = \ell$, the number of sets $U\subseteq[n]$ with $|U| = k$ such that $|S\triangle U| = \ell$ is $d_\ell$, and so $D_S\sim\mathrm{Binomial}(m, p)$ with $p := d_\ell\binom nk^{-1}$. Using the binomial tail bound (Proposition F.2) and a union bound over $S$,
$$\mathbb P\left(\sigma^2\ge u\right)\le\binom n\ell\exp\left[-u\left(\log\left(\frac u{pm}\right)-1\right)\right] = \exp\left[\log\binom n\ell - u\left(\log\left(\frac u{pm}\right)-1\right)\right].$$
Provided
$$\frac u{pm}\ge e^2 \qquad\text{and}\qquad u\ge2\log\binom n\ell, \quad (32)$$
we have $\mathbb P(\sigma^2\ge u)\le\binom n\ell^{-1}$.

Let $\beta\in(0,1)$. To certify $P_\Phi(x)\le\frac m2(1+\beta)$ it suffices to have $\|M\|\le\beta m\, d_\ell\binom nk^{-1} = \beta pm$. Therefore, from (31), it suffices to have $\sigma^2\le\frac{\beta^2p^2m^2}{4\log\binom n\ell}$. From (32), this will occur provided
$$\frac{\beta^2p^2m^2}{4\log\binom n\ell}\ge e^2\, pm \quad\Longleftrightarrow\quad m\ge\frac{4e^2\log\binom n\ell}{\beta^2\, p} \quad (33)$$
and
$$\frac{\beta^2p^2m^2}{4\log\binom n\ell}\ge2\log\binom n\ell \quad\Longleftrightarrow\quad m\ge\frac{2\sqrt2\,\log\binom n\ell}{\beta\, p}. \quad (34)$$
Note that (34) is subsumed by (33). This completes the proof.
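For concreteness, the procedure can be exercised end to end on a toy formula (our code; it enumerates the $(S,T)$ pairs with $S\triangle T = U$ directly, and the certified bound should land visibly below $m$ while exceeding $m/2$).

```python
import itertools
import numpy as np
from math import comb

rng = np.random.default_rng(2)
n, k, l, m = 12, 4, 2, 2000
subsets = list(itertools.combinations(range(n), l))
index = {S: i for i, S in enumerate(subsets)}
M = np.zeros((len(subsets),) * 2)
for _ in range(m):                                    # random k-XOR constraints
    U = tuple(sorted(rng.choice(n, size=k, replace=False).tolist()))
    b = rng.choice([-1, 1])
    rest = [v for v in range(n) if v not in U]
    for A in itertools.combinations(U, k // 2):       # S and T split U in half ...
        for W in itertools.combinations(rest, l - k // 2):   # ... and agree outside U
            S = tuple(sorted(A + W))
            T = tuple(sorted(tuple(set(U) - set(A)) + W))
            M[index[S], index[T]] += b
d_l = comb(l, k // 2) * comb(n - l, k // 2)
B = m / 2 + comb(n, k) / (2 * d_l) * np.linalg.norm(M, 2)   # the bound (29)
print("certified: at most %.0f of %d constraints satisfiable" % (B, m))
```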
F.2 Odd-Order Tensors

When the tensor order $p$ is odd, we have given an algorithm for tensor PCA based on the Kikuchi Hessian (see Appendix C) but are unfortunately unable to give a tight analysis of it. Here we present a related algorithm for which we are able to give a better analysis, matching SOS. The idea of the algorithm is to use a construction from the SOS literature that transforms an order-$p$ tensor (with $p$ odd) into an order-$2(p-1)$ tensor via the Cauchy–Schwarz inequality [CGL04]. We then apply a variant of our symmetric difference matrix to the resulting even-order tensor. A similar construction was given independently in the recent work [Has19] and shown to give optimal performance for all $\ell\le n^\delta$ for a certain constant $\delta > 0$. The proof we give here applies to the full range of $\ell$ values: $\ell\ll n$. Our proof uses a certain variant of the matrix Bernstein inequality combined with some fairly simple moment calculations.

F.2.1 Setup

For simplicity, we consider the following version of the problem. Let $p\ge3$ be odd and let $Y\in(\mathbb R^n)^{\otimes p}$ be an asymmetric tensor with i.i.d. Rademacher (uniform $\pm1$) entries. Our goal is to certify an upper bound on the Rademacher injective norm, defined as
$$\|Y\|_\pm := \max_{x\in\{\pm1\}^n/\sqrt n}\left|\langle Y, x^{\otimes p}\rangle\right|.$$
The true value is $O(\sqrt n)$ with high probability. In time $n^{O(\ell)}$ (where $\ell = n^\delta$ with $\delta\in(0,1)$) we certify $\|Y\|_\pm\le n^{p/4}\,\ell^{1/2 - p/4}\,\mathrm{polylog}(n)$, matching the results of [BGG+16, BGL16]. Such certification results can be turned into recovery results using sum-of-squares; see Lemma 4.4 of [HSS15]. To certify a bound on the injective norm instead of the Rademacher injective norm (where $x$ is constrained to the sphere instead of the hypercube), one should use the basis-invariant version of the symmetric difference matrices given by [Has19] (but we do not do this here).

F.2.2 Algorithm

We will use a trick from [CGL04] which is often used in the sum-of-squares literature. For any $\|x\| = 1$, we have by the Cauchy–Schwarz inequality,
$$\langle Y, x^{\otimes p}\rangle^2\le\|x\|^2\,\langle T, x^{\otimes 4q}\rangle = \langle T, x^{\otimes 4q}\rangle$$
where $p = 2q+1$ and
$$T_{abcd} := \sum_{e\in[n]} Y_{ace}\, Y_{bde}, \qquad a,b,c,d\in[n]^q.$$
We have $\mathbb E[T]_{abcd} = n\cdot\mathbb{1}[ac = bd]$ and so $\langle\mathbb E[T], x^{\otimes4q}\rangle = n\sum_{a,c}(x_a x_c)^2 = n$. Let $\tilde T = T - \mathbb E[T]$; since the entries of $Y$ are $\pm1$ we have $T_{abcd} = n$ whenever $ac = bd$, i.e., $\tilde T_{abcd} = T_{abcd}\cdot\mathbb{1}[ac\ne bd]$.

Define the $n^\ell\times n^\ell$ matrix $M$ as follows. For $S,T\in[n]^\ell$,
$$M_{S,T} := \sum_{abcd}\tilde T_{abcd}\, N^{-1}_{ab,cd}\cdot\mathbb{1}\left[S\overset{ab,cd}{\longleftrightarrow}T\right]$$
where $S\overset{ab,cd}\longleftrightarrow T$ roughly means that $S$ is obtained from $T$ by replacing $ab$ by $cd$, or $cd$ by $ab$; the formal definition is given below. Also, $N_{ab,cd}$ denotes the number of $(S,T)$ pairs for which $S\overset{ab,cd}\longleftrightarrow T$.

Definition F.3. For $S,T\in[n]^\ell$ and $a,b,c,d\in[n]^q$, we write $S\overset{ab,cd}\longleftrightarrow T$ if there are distinct indices $i_1,\dots,i_{2q}\in[\ell]$ such that either: (i) $S_{i_j} = (ab)_j$ and $T_{i_j} = (cd)_j$ for all $j\in[2q]$, the values in $a,b,c,d$ do not appear anywhere else in $S$ or $T$, and $S,T$ are identical otherwise: $S_i = T_i$ for all $i\notin\{i_1,\dots,i_{2q}\}$; or (ii) the same holds but with $ab$ and $cd$ interchanged. (Here $ab$ denotes concatenation.)

Note that
$$N_{ab,cd}\ge\bar N := \binom{\ell}{2q}(n - 4q)^{\ell-2q}. \quad (35)$$
The above construction ensures that
$$n^\ell\,(x^{\otimes\ell})^\top M\,(x^{\otimes\ell}) = n^{2q}\,\langle\tilde T, x^{\otimes4q}\rangle \qquad\text{for all } x\in\{\pm1\}^n/\sqrt n.$$
This means we can certify an upper bound on $\|Y\|_\pm$ by computing $\|M\|$:
$$\|Y\|_\pm\le\sqrt{\langle T, x^{\otimes4q}\rangle}\le\sqrt{\langle\mathbb E[T],x^{\otimes4q}\rangle} + \sqrt{\left|\langle\tilde T,x^{\otimes4q}\rangle\right|}\le n^{1/2} + n^{\ell/2-q}\,\|M\|^{1/2}.$$

Theorem F.4. Let $p\ge3$ be odd and let $p - 1\le\ell\le\min\left\{n - (p-1),\ \frac{n}{4(p-1)},\ \frac{n}{8\log n}\right\}$.
Then $\|M\|$ certifies
$$\|Y\|_\pm\le n^{1/2} + 8p^p\,\ell^{1/2 - p/4}\, n^{p/4}\,(\log n)^{1/4}$$
with probability at least $1 - n^{-\ell}$ over an i.i.d. Rademacher $Y$.

F.2.3 Proof

We will use the following variant of the matrix Bernstein inequality; this is a special case ($A_k = R\cdot I$) of [Tro12], Theorem 6.2.

Theorem F.5 (Matrix Bernstein). Consider a finite sequence $\{X_i\}$ of independent random symmetric $d\times d$ matrices. Suppose $\mathbb E[X_i] = 0$ and $\|\mathbb E[X_i^r]\|\le\frac{r!}2R^r$ for $r = 2,3,4,\dots$. Then
$$\Pr\Bigg(\Bigg\|\sum_{i=1}^n X_i\Bigg\|\ge t\Bigg)\le d\cdot\exp\left(\frac{-t^2/2}{nR^2 + Rt}\right).$$

For $e\in[n]$, let
$$M^{(e)}_{S,T} := \sum_{abcd} Y_{ace}\, Y_{bde}\, N^{-1}_{ab,cd}\cdot\mathbb{1}\left[S\overset{ab,cd}\longleftrightarrow T\right]\cdot\mathbb{1}[ac\ne bd].$$
We will apply Theorem F.5 to the sum $M = \sum_e M^{(e)}$. Note that $\mathbb E[M^{(e)}] = 0$. To bound the moments $\|\mathbb E[(M^{(e)})^r]\|$, we will use the following basic fact.

Lemma F.6. If $A$ is a symmetric matrix,
$$\|A\|\le\max_j\sum_i |A_{ij}|.$$

Proof. Let $v$ be a leading eigenvector of $A$, so that $Av = \lambda v$ where $\|A\| = |\lambda|$, and normalize $v$ so that $\|v\|_\infty = 1$. Pick $i^*$ with $|v_{i^*}| = 1$. Then
$$\|A\| = |\lambda|\,|v_{i^*}| = \Bigg|\sum_j A_{i^*j}\, v_j\Bigg|\le\sum_j|A_{i^*j}|\,|v_j|\le\max_i\sum_j|A_{ij}|,$$
and the claim follows since $A$ is symmetric.

Proof of Theorem F.4. For any fixed $e$, we have by Lemma F.6,
$$\left\|\mathbb E[(M^{(e)})^r]\right\|\le\max_S\sum_T\left|\mathbb E[(M^{(e)})^r]_{S,T}\right| =: \max_S h(r,e,S).$$
Let $\pi$ denote a "path" of the form $\pi = (S_0, a_1,b_1,c_1,d_1, S_1, a_2,b_2,c_2,d_2, S_2,\dots,a_r,b_r,c_r,d_r,S_r)$ such that $S_0 = S$, $(a_i,c_i)\ne(b_i,d_i)$, and $S_{i-1}\overset{a_ib_i,c_id_i}\longleftrightarrow S_i$. Then we have
$$h(r,e,S)\le\sum_\pi\Bigg|\mathbb E\Bigg[\prod_{i=1}^r Y_{a_ic_ie}\, Y_{b_id_ie}\Bigg]\Bigg|\prod_{i=1}^r N^{-1}_{a_ib_i,c_id_i}.$$
Among tuples of the form $(a_i,c_i)$ and $(b_i,d_i)$, each must occur an even number of times (or else the term associated with $\pi$ is 0). There are $2r$ such tuples, so there are at most $\binom{2r}{r} r!\,2^{-r}$ ways to pair them up. Once $S_{i-1}$ is chosen, there are at most $2(\ell n)^q$ choices for $(a_i,c_i)$, and the same is true for $(b_i,d_i)$. Once $S_{i-1}, a_i, b_i, c_i, d_i$ are chosen, there are at most $(2q)!$ possible choices for $S_i$. This means
$$h(r,e,S)\le\binom{2r}{r} r!\,2^{-r}\left[2(\ell n)^q\,(2q)!\right]^r\bar N^{-r},$$
where $\bar N$ is defined in (35). Since $\binom{2r}{r}\le4^r$, we can apply Theorem F.5 with $R = 8\,(2q)!\,(\ell n)^q\,\bar N^{-1}$. This yields
$$\Pr\left\{\|M\|\ge t\right\}\le n^\ell\cdot\exp\left(\frac{-t^2/2}{nR^2+Rt}\right).$$
Let $t = R\sqrt{8\ell n\log n}$. Provided $\ell\le n/(8\log n)$ we have $Rt\le nR^2$, and so
$$\Pr\left\{\|M\|\ge R\sqrt{8\ell n\log n}\right\}\le\exp\left(\ell\log n - \frac{t^2}{4nR^2}\right) = \exp(-\ell\log n) = n^{-\ell}.$$
Thus with high probability we certify
$$\|Y\|_\pm\le n^{1/2} + n^{\ell/2-q}\,\|M\|^{1/2}\le n^{1/2} + n^{\ell/2-q}\, R^{1/2}\,(8\ell n\log n)^{1/4} = n^{1/2} + 8^{3/4}\sqrt{(2q)!}\;n^{\ell/2-q}\,\bar N^{-1/2}\,(\ell n)^{1/4 + q/2}\,(\log n)^{1/4}.$$
We have the following bound on $\bar N$:
$$\bar N = \binom{\ell}{2q}(n - 4q)^{\ell-2q}\ge\frac{\ell^{2q}}{(2q)^{2q}}\,(n-4q)^{\ell-2q} = \frac{n^\ell}{(2q)^{2q}}\left(1 - \frac{4q}{n}\right)^{\ell-2q}\left(\frac{\ell}{n}\right)^{2q}$$
$$\ge\frac{n^\ell}{p^p}\left(1 - \frac{4q(\ell - 2q)}{n}\right)\left(\frac\ell n\right)^{2q}\ge\frac{n^\ell}{p^p}\left(1 - \frac{4q\ell}{n}\right)\left(\frac\ell n\right)^{2q}\ge\frac{n^\ell}{2p^p}\left(\frac\ell n\right)^{2q},$$
provided $\ell\le n/(8q)$. Therefore we certify
$$\|Y\|_\pm\le n^{1/2} + 8^{3/4}\sqrt{2\,(2q)!}\;p^{p/2}\,\ell^{1/4 - q/2}\, n^{1/4 + q/2}\,(\log n)^{1/4}\le n^{1/2} + 8p^p\,\ell^{1/2 - p/4}\, n^{p/4}\,(\log n)^{1/4}.$$
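Finally, the Cauchy–Schwarz step in F.2.2 is easy to check numerically for $p = 3$, $q = 1$ (our code; small $n$, with $T$ built exactly as above):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
Y = rng.choice([-1.0, 1.0], size=(n, n, n))        # i.i.d. Rademacher 3-tensor
x = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)   # hypercube point, unit norm

lhs = np.einsum('ace,a,c,e->', Y, x, x, x) ** 2    # <Y, x^{(x)3}>^2
T = np.einsum('ace,bde->abcd', Y, Y)               # T_abcd = sum_e Y_ace Y_bde
rhs = np.einsum('abcd,a,b,c,d->', T, x, x, x, x)   # <T, x^{(x)4}>
assert lhs <= rhs + 1e-9                           # the Cauchy-Schwarz inequality
print("lhs = %.3f <= rhs = %.3f; <E[T], x^{(x)4}> = n = %d" % (lhs, rhs, n))
```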