Estimating Rank-One Spikes from Heavy-Tailed Noise via Self-Avoiding Walks
Jingqiu Ding ∗ ETH Zurich [email protected]
Samuel B. Hopkins ∗ UC Berkeley [email protected]
David Steurer ∗ ETH Zurich [email protected]
Abstract
We study symmetric spiked matrix models with respect to a general class of noise distributions. Given a rank-1 deformation of a random noise matrix, whose entries are independently distributed with zero mean and unit variance, the goal is to estimate the rank-1 part. For the case of Gaussian noise, the top eigenvector of the given matrix is a widely-studied estimator known to achieve optimal statistical guarantees, e.g., in the sense of the celebrated BBP phase transition. However, this estimator can fail completely for heavy-tailed noise.

In this work, we exhibit an estimator that works for heavy-tailed noise up to the BBP threshold that is optimal even for Gaussian noise. We give a non-asymptotic analysis of our estimator which relies only on the variance of each entry remaining constant as the size of the matrix grows: higher moments may grow arbitrarily fast or even fail to exist. Previously, it was only known how to achieve these guarantees if higher-order moments of the noise are bounded by a constant independent of the size of the matrix.

Our estimator can be evaluated in polynomial time by counting self-avoiding walks via a color coding technique. Moreover, we extend our estimator to spiked tensor models and establish analogous results.
Principal component analysis (PCA) and other spectral methods are ubiquitous in machine learning. They are useful for dimensionality reduction, denoising, matrix completion, clustering, data visualization, and much more. However, spectral methods can break down in the face of egregiously noisy data: a few unusually large entries of an otherwise well-behaved matrix can have an outsized effect on its eigenvectors and eigenvalues.

In this paper, we revisit the single-spike recovery problem, a simple and extensively-studied statistical model for the core task addressed by spectral methods, in the setting of heavy-tailed noise, where the above shortcomings of PCA and eigenvector-based methods are readily apparent [Joh01]. We develop and analyze algorithms for this problem whose provable guarantees improve over traditional eigenvector-based methods. Our main problem is:
Problem 1.1 (Generalized spiked Wigner model, recovery). Given a realization of a symmetric random matrix of the form Y = λxx⊤ + W, where x ∈ R^n is an unknown fixed vector with ‖x‖ = √n, λ > 0, and the upper triangular off-diagonal entries of W ∈ R^{n×n} are independently (but not necessarily identically) distributed with zero mean and unit variance E W_ij² = 1, estimate x.

The main question about the spiked Wigner model is: how large should the signal-to-noise ratio λ > 0 be in order to achieve constant correlation with x? The standard algorithmic approach to solve the spiked Wigner recovery problem is PCA, using the top eigenvector of the matrix Y as an estimator for x. This approach has been extensively studied (e.g., in [BBAP05, PRS13]), usually under stronger assumptions on the distribution of the entries of W.

Assuming boundedness of higher moments, i.e., E|W_ij|^k ≤ O(1) for every constant k, a clear picture has emerged: the problem is information-theoretically impossible for λ√n < 1, and for λ√n > 1 the top eigenvector of Y is an optimal estimator for x – this is the celebrated BBP phase transition [BBAP05, PRS13]. If we weaken the assumption to bounded fourth moments, it is well known that E‖W‖ ≤ O(√n), so PCA still estimates x nontrivially once λ√n exceeds a large enough constant. However, many natural random matrices do not satisfy these conditions – consider for instance random sparse matrices or matrices with heavy-tailed entries.

Our setting allows for much nastier noise distributions: we assume only that the entries of W have unit variance – higher moments of W_ij may grow arbitrarily fast with n, or even fail to exist. Under such weak assumptions, the top eigenvector of Y may be completely uncorrelated with the planted spike x, even for λ√n = O(1). In this paper, we ask:

Main Question: For which λ > 0 is recovery possible in the spiked Wigner model via an efficient algorithm under heavy-tailed noise distributions?

∗ Equal contribution. Preprint. Under review.
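To fix ideas, Problem 1.1 and the vanilla PCA estimator can be sketched in a few lines. This is our own minimal illustration, not code from the paper: Gaussian noise is used only for the demo (the model allows arbitrary zero-mean, unit-variance entries), and all function names are ours.

```python
import numpy as np

def sample_spiked_wigner(x, lam, rng):
    """Sample Y = lam * x x^T + W with symmetric Gaussian noise W.

    Gaussian noise is only an illustration; the model allows any zero-mean,
    unit-variance entry distributions, including heavy-tailed ones.
    """
    n = x.shape[0]
    G = rng.normal(size=(n, n))
    W = (G + G.T) / np.sqrt(2)      # symmetric, unit-variance off-diagonal
    return lam * np.outer(x, x) + W

def pca_estimate(Y):
    """Vanilla PCA: the top eigenvector of Y, rescaled to norm sqrt(n)."""
    vals, vecs = np.linalg.eigh(Y)
    v = vecs[:, np.argmax(vals)]
    return v * np.sqrt(Y.shape[0])

rng = np.random.default_rng(0)
n = 300
x = rng.choice([-1.0, 1.0], size=n)          # ||x|| = sqrt(n)
Y = sample_spiked_wigner(x, lam=5 / np.sqrt(n), rng=rng)
xhat = pca_estimate(Y)
corr = abs(xhat @ x) / (np.linalg.norm(xhat) * np.linalg.norm(x))
```

With λ√n = 5, well above the BBP threshold, the Gaussian-noise correlation `corr` is close to 1; the point of the paper is that for heavy-tailed W this PCA step can fail completely at the same signal-to-noise ratio.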
A natural strategy to deal with heavy-tailed noise is to truncate unusually large entries before performing vanilla PCA. However, truncation-based algorithms can fail dramatically if the distributions of the noise entries are adversarially chosen, as our random matrix model allows. We provide counterexamples to truncation-based algorithms in Section 2.2.
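For reference, the truncation-plus-PCA baseline just described (a minimal version, in our own code, of the class analyzed as Algorithm 2.9 in Section 2.2: clip large entries, recenter, then take the top eigenvector) can be sketched as:

```python
import numpy as np

def truncated_pca(Y, tau):
    """Truncation baseline: replace Y_ij by sign(Y_ij)*tau when |Y_ij| > tau,
    subtract the mean entry, and return the top eigenvector of the result."""
    Yp = np.clip(Y, -tau, tau)
    Ypp = Yp - Yp.mean()
    vals, vecs = np.linalg.eigh(Ypp)
    return vecs[:, np.argmax(vals)]

# On Gaussian noise (no heavy tails) a generous threshold clips nothing,
# so this reduces to vanilla PCA and succeeds.
rng = np.random.default_rng(1)
n = 300
x = rng.choice([-1.0, 1.0], size=n)
G = rng.normal(size=(n, n))
Y = (5 / np.sqrt(n)) * np.outer(x, x) + (G + G.T) / np.sqrt(2)
v = truncated_pca(Y, tau=10.0)
corr = abs(v @ x) / np.linalg.norm(x)
```

The counterexamples of Section 2.2 show that no single choice of `tau` makes this baseline work across all unit-variance noise distributions.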
In this work, we develop and analyze computationally efficient algorithms based on self-avoiding walks.
PCA or eigenvector methods can be thought of as computing a power Y^ℓ of the input matrix, for ℓ → ∞. The polynomial Y^ℓ in the entries of Y can be expanded in terms of length-ℓ walks in the complete graph on n vertices. Our algorithms, by contrast, are based on a different degree-ℓ polynomial in the entries of Y, which can be expanded in terms of length-ℓ self-avoiding walks. We describe the main ideas more thoroughly below, turning for now to our results.

Spiked Matrices with Heavy-Tailed Noise:
The first result addresses the main question above, demonstrating that our self-avoiding walk algorithm remedies some of the shortcomings of PCA and eigenvector-based methods for the spiked Wigner recovery problem in the heavy-tailed setting.
Theorem 1.2.
For every δ > 0, there is a polynomial-time algorithm such that for every x ∈ R^n with ‖x‖ = √n and ‖x‖_∞ ≤ n^{1/2−δ}, and every λ with n^{1/2}λ ≥ 1 + δ, given Y = λxx⊤ + W distributed as in the spiked Wigner model, the algorithm returns x̂ such that E⟨x̂, x⟩ ≥ δ^{O(1)} · ‖x‖ · (E‖x̂‖²)^{1/2}.

To interpret the result, we note that even if the entries of W are Gaussian, when λ√n < 1 no estimator x̂ achieves nontrivial correlation with x [PWBM18], so the assumption λ√n ≥ 1 + δ is the weakest one can hope for. Furthermore, under this assumption, when δ is close to 0, it is information-theoretically impossible to find x̂ such that ⟨x, x̂⟩/(‖x‖‖x̂‖) → 1. The guarantee we achieve, that x̂ is nontrivially correlated with x, is the best one can hope for. (For the regime λ√n → ∞, our algorithm does achieve correlation going to 1. Improving the δ^{O(1)} term to be quantitatively optimal is an interesting open question.)

Spiked Tensors with Heavy-Tailed Noise:
The self-avoiding walk approach to algorithm design is quite flexible, and in particular is not limited to spiked matrices. We also study an analogous problem for spiked tensors. The single-spike tensor model is the analogue of the spiked Wigner model above, but for the task of recovering information from noisy multi-modal data, which has many applications across machine learning [AGH+14, RM14].
Theorem 1.3. For every c > 0 and δ < 1, there is a polynomial-time algorithm with the following guarantees. Let x ∈ R^n be a random vector with independent, mean-zero entries having E x_i² = 1 and Γ = E x_i⁴ ≤ n^{o(1)}. Let λ > 0. Let Y = λ · x^{⊗3} + W, where W ∈ R^{n×n×n} has independent, mean-zero entries with E W_ijk² = 1. Then if λ ≥ c n^{−3/4}, the algorithm finds x̂ ∈ R^n such that E⟨x, x̂⟩ ≥ δ · (E‖x‖²)^{1/2} · (E‖x̂‖²)^{1/2}. (The theorem is stated for planted vectors sampled from an independent zero-mean prior distribution; for a fixed planted vector, similar guarantees can be obtained using nearly the same techniques as in the spiked matrix model.)

Under the additional assumption that all entries of W have bounded constant-order moments, a slightly modified algorithm finds x̂ such that ⟨x, x̂⟩ ≥ (1 − o(1)) (E‖x‖²)^{1/2} · (E‖x̂‖²)^{1/2}, as shown in appendix B.4. (We have not made an effort to optimize the constants; some improvement may be possible.) The results are stated for order-3 tensors for simplicity; there is no difficulty in extending them to the higher-order case. (See appendix C.)

Prior work considers the spiked tensor model only in the case that W has either Gaussian or discrete entries [HSS15, WAM19, BCRT19, Has19, BGL16b]; our analysis allows W to be heavy-tailed. The requirement λ ≥ Ω(n^{−3/4}) is widely believed to be necessary for polynomial-time algorithms [HKP+]. Subexponential-time algorithms are known to recover x successfully even for λ ≤ n^{−3/4−Ω(1)} in Gaussian and discrete settings [BGL16b, WAM19, Has19, RRS17] – we show that a sub-exponential time version of our algorithm achieves many of the same guarantees while still allowing for heavy-tailed noise. Concretely, we extend Theorem 1.3 as follows:
Theorem 1.4.
In the same setting as Theorem 1.3, for any c ≥ n^{−1/4} Γ^{1/2} polylog(n) and δ < 1, there is an n^{O(1/c)}-time algorithm such that E⟨x, x̂⟩ ≥ δ · (E‖x‖²)^{1/2} · (E‖x̂‖²)^{1/2}.

In particular, the tradeoff we obtain between running time and signal-to-noise ratio λ matches lower bounds in the low-degree model for the (easier) case of Gaussian noise [KWB19], for c ≥ n^{−o(1)}.

Numerical Experiments:
We test our algorithms on synthetic data – random matrices (and tensors) with hundreds of rows and columns – empirically demonstrating the improvement over vanilla PCA.
We now offer an overview of the self-avoiding walk technique we use to prove Theorems 1.2 and 1.3. For this exposition, we focus on the case of spiked matrices (Theorem 1.2).

Our techniques are inspired by recent literature on sparse stochastic block models, in particular the study of nonbacktracking random walks in sparse random graphs [Abb17]. We remark further below on the relationship with this literature, but note for now that a self-avoiding walk algorithm closely related to the one we present here appeared in [HS17] in the context of the sparse stochastic block model with overlapping communities. In the present work we give a refined analysis of this algorithm to obtain Theorem 1.2, and extend the algorithm to spiked tensors to obtain Theorem 1.3.

Recall that given a spiked random matrix Y = λxx⊤ + W, our goal is to estimate the vector x. For simplicity of exposition, we suppose x ∈ {±1}^n. To estimate x up to sign, we will in fact aim to estimate each entry of the matrix xx⊤. Our starting point is the observation that any sequence i_0, i_1, …, i_ℓ ∈ [n] without repeated indices (i.e., a length-ℓ self-avoiding walk in the complete graph on [n]) gives an estimator of x_{i_0} x_{i_ℓ} as follows:

E_W ∏_{j<ℓ} Y_{i_j, i_{j+1}} = λ^ℓ x_{i_0} x_{i_1}² ⋯ x_{i_{ℓ−1}}² x_{i_ℓ} = λ^ℓ x_{i_0} x_{i_ℓ}.   (1)

To aggregate these estimators into a single estimator for xx⊤, we relate them to self-avoiding walks in the complete graph on [n]. We denote by SAW_ℓ(i, j) the set of length-ℓ self-avoiding walks between i, j on the vertex set [n]. Then we associate the polynomial ∏_{j<ℓ} Y_{i_j, i_{j+1}} to α = (i_0, i_1, …, i_ℓ) ∈ SAW_ℓ(i, j), where i_0 = i and i_ℓ = j, and we denote this polynomial as χ_α(Y). We define the self-avoiding walk matrix:

Definition 1.5 (Self-avoiding walk matrix). Let P(Y) ∈ R^{n×n} be given by

P_ij(Y) = Σ_{α∈SAW_ℓ(i,j)} χ_α(Y).

Our estimator for x_i x_j will simply be P_ij(Y) / (λ^ℓ |SAW_ℓ(i,j)|).
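As a sanity check, Definition 1.5 and the estimator above can be implemented directly by brute-force enumeration (feasible only for very small n and ℓ; the polynomial-time evaluation via color coding is discussed later). This sketch is ours, with illustrative parameter choices:

```python
import itertools
import numpy as np

def saw_estimator(Y, i, j, ell, lam):
    """Estimate x_i * x_j as P_ij(Y) / (lam^ell * |SAW_ell(i,j)|), where
    P_ij(Y) sums the products of Y-entries along all length-ell
    self-avoiding walks from i to j (Definition 1.5)."""
    n = Y.shape[0]
    interior = [v for v in range(n) if v not in (i, j)]
    total, count = 0.0, 0
    # A length-ell SAW from i to j has ell - 1 distinct interior vertices.
    for mid in itertools.permutations(interior, ell - 1):
        walk = (i,) + mid + (j,)
        prod = 1.0
        for a in range(ell):
            prod *= Y[walk[a], walk[a + 1]]
        total += prod
        count += 1
    return total / (lam ** ell * count)

# With no noise (W = 0) and x in {-1,1}^n, every walk term equals
# lam^ell * x_i * x_j, so the estimator recovers x_i * x_j exactly,
# mirroring equation (1).
rng = np.random.default_rng(2)
n, ell, lam = 8, 3, 0.5
x = rng.choice([-1.0, 1.0], size=n)
Y = lam * np.outer(x, x)
est = saw_estimator(Y, 0, 1, ell, lam)
```

Averaging over all of SAW_ℓ(i,j), rather than a single walk, is what drives the variance bound below.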
By (1), P_ij(Y) / (λ^ℓ |SAW_ℓ(i,j)|) is an unbiased estimator for x_i x_j. The crucial step is to bound the variance of P_ij(Y). Our key insight is: because we average only over self-avoiding walks, P_ij(Y) is multilinear in the entries of W, so E P_ij(Y)² can be controlled under only the assumption of unit variance for each entry of W. Our technical analysis shows that E P_ij(Y)² is small enough to provide a nontrivial estimator of x_i x_j when (a) λ√n ≥ 1 + δ and (b) ℓ ≥ O_δ(log n), for any δ > 0.

Rounding algorithm:
Once we have P ( Y ) achieving constant correlation with xx ⊤ , the followingtheorem, proved in [HS17], gives a polynomial time algorithm for extracting an estimator ˆ x for x . Theorem 1.6.
Let Y be a symmetric random matrix and x a vector. Suppose we have a matrix-valued function P(Y) such that

E⟨P(Y), xx⊤⟩ / ( (E‖P(Y)‖_F²)^{1/2} · ‖x‖² ) = δ.

Then, with probability δ^{O(1)}, a random unit vector x̂ in the span of the top δ^{−O(1)} eigenvectors of P(Y) achieves ⟨x, x̂⟩ ≥ δ^{O(1)} ‖x‖.

Prior-free estimation for general x: A significant innovation of our work over prior work such as [HS17] investigating estimators based on self-avoiding walks is that we avoid the assumption of a prior distribution on the planted vector x; instead we assume only a mild bound on the ℓ_∞ norm of x. While in the setting of Gaussian W one can always assume that x is random by applying a random rotation to the input matrix Y (which preserves W if it is Gaussian), in our setting working with fixed x presents technical challenges.

In the foregoing discussion we assumed x to be ±1-valued – to drop this assumption, we must forego (1) and give up on the hope that each self-avoiding walk from i to j is an unbiased estimator of x_i x_j. Instead, we are able to use the weak ℓ_∞ bound to control the bias of an average self-avoiding walk as an estimator for x_i x_j, and hence control the bias of the estimator P_ij(Y). Compared to [HS17], which studies the cases of random or ±1-valued x, our calculation of the variance of P_ij(Y) is also significantly more intricate, again because we cannot rely on either randomness or ±1-ness of x.

Polynomial time via color coding:
The techniques described already yield an algorithm for the spiked Wigner model running in quasipolynomial time n^{O_δ(log n)}, simply by evaluating all of the self-avoiding walk polynomials. We use the color coding technique of [AYZ95] (previously used in the context of the stochastic block model by [HS17]) to improve the running time to n^{O_δ(1)}. Briefly, color coding speeds up the computation of the self-avoiding walk estimators P_ij(Y) with a clever combination of randomization and dynamic programming.

Extension to spiked tensors:
The tensor analogue of the PCA algorithm for spiked matrices is the tensor unfolding method, where an n×n×n input tensor Y = λx^{⊗3} + W is unfolded to an n×n² matrix, and then the top n-dimensional (left) singular vector of this matrix is used to estimate x. This strategy is successful in the case of Gaussian noise for λ ≫ n^{−3/4}. To prove Theorem 1.3, we adapt the self-avoiding walk method above to handle this form of rectangular matrix. To prove Theorem 1.4, we combine the self-avoiding walk method with higher-order spectral methods previously used to obtain subexponential-time algorithms for the spiked tensor model [RSS18, BGL16b].

Relationship to PCA and Non-Backtracking Walks
To provide some further context for our techniques, it is helpful to observe the following relationship to PCA. Given a symmetric matrix Y, PCA will extract the top eigenvector of Y. Often, this is implemented via the power method – that is, PCA will (implicitly) compute the matrix Y^ℓ for ℓ ≈ log n. Notice that the entries of Y^ℓ can be expanded as

(Y^ℓ)_{ij} = Σ_{k_1,…,k_{ℓ−1} ∈ [n]} Y_{i,k_1} · ( ∏_{a=1}^{ℓ−2} Y_{k_a,k_{a+1}} ) · Y_{k_{ℓ−1},j},

which is a sum over all length-ℓ walks from i to j in the complete graph. Our estimator P(Y) can be viewed as removing some problematic (high-variance) terms from this sum, leaving only the self-avoiding walks.

This approach is inspired by recent developments in the study of sparse random graphs, where vertices of unusually high degree spoil the spectrum of the adjacency matrix (indeed, this is morally a special case of the heavy-tailed noise setting we consider). In particular, inspired by statistical physics, nonbacktracking walks were developed as a technique to learn communities in the stochastic block model [MNS18, AS18, SLKZ15, DKMZ11, KMM+13, BLM15]. A k-nonbacktracking walk i_0, …, i_ℓ does not repeat any indices i ∈ [n] among any consecutive k steps; as k increases from 1 to ℓ this interpolates between naïve PCA and our self-avoiding walk estimator.

The k-nonbacktracking algorithm for k < ℓ is also a natural approach in the setting we study. (Our approach corresponds to k = ℓ.) Indeed, there are some advantages to choosing k = O(1): the O(1)-nonbacktracking-based estimator can be computed much more efficiently than the self-avoiding walk-based estimator. Furthermore, in numerical experiments we observe that even constant-step non-backtracking gives performance comparable with fully self-avoiding walks. However, rigorous analysis of the O(1)-non-backtracking walk estimator in our distribution-independent setting appears to be a major technical challenge – even establishing rigorous guarantees in the stochastic block model was a major breakthrough [BLM15, MNS18]. An advantage of our estimator is that it comes with a relatively simple and highly adaptable rigorous analysis.

In section 2, we discuss algorithms for the spiked matrix model, proving Theorem 1.2 and providing counterexamples to naïve truncation-based algorithms. In section 2.3, we discuss results of numerical experiments for spiked random matrices. In section 3, we describe our algorithm for the spiked tensor model, deferring the analysis to supplementary material.
Here we prove Theorem 1.2 by analyzing the self-avoiding walk estimator. (Some details are deferred to supplementary material.) We focus for now on the following main lemma, putting together the proof of Theorem 1.2 at the end of this section.
Lemma 2.1.
In the spiked matrix model Y = λxx⊤ + W, with ‖x‖ = √n and the upper triangular entries of the symmetric matrix W independently sampled with zero mean and unit variance, we assume ‖x‖_∞ = n^{1/2−Ω(1)}. Then if λn^{1/2} = 1 + δ = 1 + Ω(1), setting ℓ = O_δ(log n), we have:

E⟨P(Y), xx⊤⟩ / ( n · (E‖P(Y)‖_F²)^{1/2} ) = δ^{O(1)},

where P(Y) is the length-ℓ self-avoiding walk matrix (Definition 1.5).

For Lemma 2.1, we will repeatedly need the following technical bound, which we prove in Appendix A.3.
Lemma 2.2.
Let V ⊆ [n], ‖x‖ = √n, and t_1, t_2 ∈ N. We define the quantity S_{t_1,t_2,V} as follows:

S_{t_1,t_2,V} = E_{(v_1,…,v_{t_1+t_2}) ⊆ [n]\V} [ ∏_{i=1}^{t_1} x_{v_i} · ∏_{i=t_1+1}^{t_1+t_2} x_{v_i}² ],

where (v_1, v_2, …, v_{t_1+t_2}) is uniformly sampled from all size-(t_1 + t_2) ordered subsets of [n]\V (without repeating elements). Then, assuming |V|, t_1, t_2 = O(log n) and ‖x‖_∞ = n^{1/2−Ω(1)}, we have S_{t_1,t_2,V} ≤ (1 + n^{−Ω(1)}) ‖x‖_∞^{t_1}. Further, if t_1 = 0, we have S_{t_1,t_2,V} ≥ 1 − n^{−Ω(1)}.

From the case t_1 = 0, one can easily deduce the following bound on E⟨P(Y), xx⊤⟩.

Lemma 2.3.
Under the same setting as Lemma 2.1, we have E⟨P(Y), xx⊤⟩ = (1 ± o(1)) λ^ℓ n^{ℓ+1}.

Proof. We have

E P_ij(Y) = Σ_{α∈SAW_ℓ(i,j)} ∏_{t=0}^{ℓ−1} λ x_{α_t} x_{α_{t+1}} = λ^ℓ (n−2)(n−3)⋯(n−ℓ) · x_i x_j · E_{α∈SAW_ℓ(i,j)} [ ∏_{t=1}^{ℓ−1} x_{α_t}² ] = (1 + n^{−Ω(1)}) λ^ℓ x_i x_j n^{ℓ−1} E_{α∈SAW_ℓ(i,j)} [ ∏_{t=1}^{ℓ−1} x_{α_t}² ],

where the expectation is taken uniformly over α ∈ SAW_ℓ(i, j). For simplicity of notation, we denote E_{α∈SAW_ℓ(i,j)} [ ∏_{t=1}^{ℓ−1} x_{α_t}² ] as S_ij. Then, according to Lemma 2.2, we have S_ij = 1 ± o(1). Therefore we have E⟨P(Y), xx⊤⟩ = (1 ± o(1)) λ^ℓ n^{ℓ+1}.

To prove Lemma 2.1, the remaining task is to bound the second moment E‖P(Y)‖_F². We can expand the second moment in terms of pairs of self-avoiding walks. For α, β ∈ SAW_ℓ(i, j) and corresponding polynomials χ_α(Y), χ_β(Y), there is a close relationship between E[χ_α(Y) χ_β(Y)] and the number of shared vertices and edges of α, β. Specifically,

E[χ_α(Y) χ_β(Y)] = E[ ∏_{(u,v)∈α∩β} Y_uv² · ∏_{(u,v)∈α∆β} Y_uv ] = ∏_{(u,v)∈α∆β} λ x_u x_v · ∏_{(u,v)∈α∩β} (1 + λ² x_u² x_v²) = λ^{2(ℓ−k)} ∏_{u∈deg(α∆β,2)} x_u² · ∏_{u∈deg(α∆β,4)} x_u⁴ · ∏_{(u,v)∈α∩β} (1 + λ² x_u² x_v²),

where k is the number of shared edges between α, β and deg(α∆β, j) is the set of vertices with degree j in the graph α∆β. The size of deg(α∆β, 4) is equal to the number of shared vertices which are not incident to any shared edge. Thus, for the analysis of E P_ij(Y)² = Σ_{α,β∈SAW_ℓ(i,j)} E[χ_α(Y) χ_β(Y)], we classify pairs α, β ∈ SAW_ℓ(i, j) according to the numbers of shared edges and vertices between α, β. The following graph-theoretic lemma is needed for bounding the number of such pairs in each class; we will prove it in appendix A.3.

Lemma 2.4.
Let α = (α_0, α_1, …, α_ℓ) and β = (β_0, β_1, …, β_ℓ) be two length-ℓ self-avoiding walks in the complete graph on [n], with α_0 = β_0 = i and α_ℓ = β_ℓ = j. Let k be the number of shared edges between α, β, let r be the number of shared vertices between α, β excluding i, j, and let s be the number of shared vertices which are not i, j and not incident to shared edges. Further, we denote the number of connected components in α∩β not containing i, j as p. Then for α ≠ β we have the relation p ≤ r − s − k, and for α = β we have p = s = 0 and r = k − 1.

We note that for self-avoiding walks α, β, the connected components of α∩β are all self-avoiding walks. A simple corollary of Lemma 2.2 turns out to be helpful, which we prove in appendix A.3.

Lemma 2.5.
Suppose we have x ∈ R^n with norm √n. If V ⊆ [n] has |V| = O(log n), and if we average over size-h directed self-avoiding walks ξ on the vertices [n]\V, then for h = O(log n) we have the bounds

E_{ξ⊆[n]\V} [ x_{ξ_0}² x_{ξ_h}² ∏_{(u,v)∈ξ} (1 + λ² x_u² x_v²) ] ≤ (1 + n^{−Ω(1)}) ‖x‖_∞² (1 + λ² ‖x‖_∞²)^h,
E_{ξ⊆[n]\V} [ x_{ξ_h}² ∏_{(u,v)∈ξ} (1 + λ² x_u² x_v²) ] ≤ (1 + n^{−Ω(1)}) (1 + λ² ‖x‖_∞²)^h,
E_{ξ⊆[n]\V} [ x_{ξ_0}² ∏_{(u,v)∈ξ} (1 + λ² x_u² x_v²) ] ≤ (1 + n^{−Ω(1)}) (1 + λ² ‖x‖_∞²)^h,
E_{ξ⊆[n]\V} [ ∏_{(u,v)∈ξ} (1 + λ² x_u² x_v²) ] ≤ (1 + n^{−Ω(1)}) (1 + λ² ‖x‖_∞²)^h,

where ξ_0 is the label of the starting vertex of ξ and ξ_h is the label of the end vertex of ξ.

These bounds hold since we can expand the product into a sum of monomials and apply Lemma 2.2 to each monomial. Now, for self-avoiding walk pairs (α, β) intersecting on a given number of edges and vertices, we bound the correlation of the corresponding polynomials and hence the contribution to the variance of P. For simple expressions, we take λ = O(n^{−1/2}).

Definition 2.6.
On the complete graph K_n, for pairs of self-avoiding walks (α, β) and (γ, ξ), we say that (α, β) is isomorphic to (γ, ξ) if there is a permutation π : [n] → [n] fixing i, j such that π(α) = γ and π(β) = ξ. We partition all pairs of length-ℓ self-avoiding walks between vertices i, j into isomorphism classes. We denote the set of all isomorphism classes containing pairs of length-ℓ self-avoiding walks between vertices i, j sharing r vertices and k edges as shape(k, r, i, j). We note that r < k is only possible when r + 1 = k = ℓ (that is, the two walks are identical).

Lemma 2.7 (Self-avoiding walk polynomial correlation). In the spiked Wigner model Y = λxx⊤ + W, where x has norm √n and W is symmetric with entries independently sampled with zero mean and unit variance, for any isomorphism class S ∈ shape(k, r, i, j), we have

E_{(α,β)∼S} E_W[χ_α(Y) χ_β(Y)] ≤ (1 + n^{−Ω(1)}) λ^{2(ℓ−k)} ‖x‖_∞^{2(r−k)} (1 + λ² ‖x‖_∞²)^k   if r ≥ k,
E_{(α,β)∼S} E_W[χ_α(Y) χ_β(Y)] ≤ (1 + n^{−Ω(1)}) (1 + λ² ‖x‖_∞²)^ℓ   if r + 1 = k = ℓ,

where (α, β) ∼ S is taken uniformly over the isomorphism class S and χ_α(Y) = ∏_{(u,v)∈α} Y_{u,v}.

Proof. We first consider the case r ≥ k, where α ≠ β. For each α, β intersecting on k edges and r vertices, we have the bound

E[χ_α(Y) χ_β(Y)] = λ^{2(ℓ−k)} ∏_{u∈deg(α∆β,2)} x_u² ∏_{u∈deg(α∆β,4)} x_u⁴ ∏_{(u,v)∈α∩β} (1 + λ² x_u² x_v²),

where deg(α∆β, j) is the set of vertices with degree j in the graph α∆β. For any subgraph G of K_n, we denote by V(G) the set of vertices incident to edges in G. We denote |deg(α∆β, 4)| as s and the number of shared vertices between α, β excluding i, j as r. We denote the number of connected components in α∩β not containing i, j as p. Then we have the relation p ≤ r − s − k for α ≠ β, according to Lemma 2.4.
We note that α∩β can be decomposed into a set of disjoint self-avoiding walks, which we denote as SAW(α∩β).

Now we take the expectation over (α, β) in the isomorphism class S. This is equivalent to taking a uniform expectation over the labeling of the 2(ℓ−1) − r vertices of α, β which are not equal to i or j. Then we have

E_{(v_1,v_2,…,v_{2(ℓ−1)−r})} [ ∏_{u∈deg(α∆β,2)} x_u² ∏_{u∈deg(α∆β,4)} x_u⁴ ∏_{ξ∈SAW(α∩β)} ∏_{(u,v)∈ξ} (1 + λ² x_u² x_v²) ]
≤ (1 + n^{−Ω(1)}) ‖x‖_∞^{2s} E_{(v_1,v_2,…,v_{2(ℓ−1)−r})} [ ∏_{u∈deg(α∆β,2)∪deg(α∆β,4)} x_u² ∏_{ξ∈SAW(α∩β)} ∏_{(u,v)∈ξ} (1 + λ² x_u² x_v²) ]
≤ (1 + n^{−Ω(1)}) ‖x‖_∞^{2p+2s} (1 + λ² ‖x‖_∞²)^k
≤ (1 + n^{−Ω(1)}) ‖x‖_∞^{2(r−k)} (1 + λ² ‖x‖_∞²)^k,

where we use Lemma 2.2 in the first inequality, Lemma 2.5 in the second inequality, and Lemma 2.4 in the last inequality. This proves the first claim.

For any isomorphism class S ∈ shape(k, r, i, j) with k = r + 1 = ℓ and (α, β) ∈ S, we have α = β. In this case we have

E_W[χ_α(Y) χ_β(Y)] = ∏_{(u,v)∈α} (1 + λ² x_u² x_v²) = ∏_{t∈[ℓ]} (1 + λ² x_{v_{t−1}}² x_{v_t}²).

By Lemma 2.5, taking the expectation over the labeling of the ℓ − 1 vertices in α which are not equal to i, j, we have

E_{v_1,v_2,…,v_{ℓ−1}} [ ∏_{t∈[ℓ]} (1 + λ² x_{v_{t−1}}² x_{v_t}²) ] ≤ (1 + n^{−Ω(1)}) (1 + λ² ‖x‖_∞²)^ℓ.

This proves the second claim.

Now we finish the proof of Lemma 2.1.
Proof of Lemma 2.1.
We bound the variance of the estimator P_ij(Y). As stated above,

E P_ij(Y)² = Σ_{α,β∈SAW_ℓ(i,j)} λ^{2(ℓ−k)} ∏_{u∈deg(α∆β,2)} x_u² ∏_{u∈deg(α∆β,4)} x_u⁴ ∏_{(u,v)∈α∩β} (1 + λ² x_u² x_v²),   (2)

where k is the number of shared edges between α, β and deg(α∆β, j) is the set of vertices with degree j in the graph α∆β.

We note that for fixed i, j, r, k there are at most n^{2(ℓ−1)−r} ℓ^{O(r−k)} pairs α, β. For fixed k, r, we apply Lemma 2.7. For k ≤ r, the contribution to the summation (2) is bounded by

n^{2(ℓ−1)−r} ℓ^{O(r−k)} E_{S∼shape(k,r,i,j)} [ λ^{2(ℓ−k)} ‖x‖_∞^{2(r−k)} (1 + λ² ‖x‖_∞²)^k ] = n^{−2} · n^{2ℓ} · λ^{2ℓ} · n^{−r} ℓ^{O(r−k)} λ^{−2k} ‖x‖_∞^{2(r−k)} (1 + λ² ‖x‖_∞²)^k,

where S is sampled with some distribution over all shapes in shape(k, r, i, j).

For k = r + 1 = ℓ, if we take ℓ = C log_{λ√n} n with the constant C large enough, then the contribution to the summation (2) is bounded by

n^{ℓ−1} (1 + λ² ‖x‖_∞²)^ℓ ≤ n^{−Ω(1)} n^{2ℓ−2} λ^{2ℓ}.

Combining all possible k, r, we have the summation (2) bounded by

n^{2ℓ−2} λ^{2ℓ} [ n^{−Ω(1)} + Σ_{k=0}^{ℓ−1} ( n^{−1} λ^{−2} (1 + λ² ‖x‖_∞²) )^k Σ_{r=k}^{ℓ−1} ( ℓ^{O(r−k)} ‖x‖_∞^{2(r−k)} n^{k−r} ) ].

Since λ² ‖x‖_∞² = n^{−Ω(1)}, we have n^{−1} λ^{−2} (1 + λ² ‖x‖_∞²) ≤ 1 − δ/2. Thus Σ_{k=0}^{ℓ−1} ( n^{−1} λ^{−2} (1 + λ² ‖x‖_∞²) )^k ≤ δ^{−O(1)}. On the other hand, since ℓ^{O(1)} ‖x‖_∞² n^{−1} = n^{−Ω(1)} by the assumption on ‖x‖_∞, we have Σ_{r=k}^{ℓ−1} ( ℓ^{O(r−k)} ‖x‖_∞^{2(r−k)} n^{k−r} ) ≤ 1 + n^{−Ω(1)}. Thus the summation (2) is bounded by δ^{−O(1)} λ^{2ℓ} n^{2ℓ−2}.

Summing over the n² pairs i, j, we have E‖P(Y)‖_F² ≤ δ^{−O(1)} (n^ℓ λ^ℓ)². Combining with Lemma 2.3, we have E⟨P(Y), xx⊤⟩ / ( n (E‖P(Y)‖_F²)^{1/2} ) = δ^{O(1)} = Ω(1), and the lemma follows.

Finally, using the color-coding method, the degree-O(log n) polynomial P(Y) can be well approximated in polynomial time, which we prove in appendix A.4.

Lemma 2.8 (Formally stated in appendix A.4). For δ = λn^{1/2} − 1 > 0 and ℓ = O_δ(log n), P(Y) can be accurately evaluated in n^{δ^{−O(1)}} time.

The evaluation procedure (Algorithm 1) is based on the idea of the color-coding method [AYZ95]. A similar algorithm has already appeared and been analyzed in the literature [HS17].
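To make the color-coding idea concrete before the formal pseudocode, here is a minimal executable sketch. This is our own simplified variant: it estimates a single entry P_ij(Y) by dynamic programming over subsets of used colors, rather than via the matrices H, M, N of Algorithm 1, and all names and parameter choices are ours.

```python
import itertools
import math
import numpy as np

def colorful_walk_sum(Y, i, j, ell, coloring):
    """Sum of products of Y-entries over all length-ell walks from i to j
    whose visited vertices receive pairwise distinct colors. Colorful walks
    never revisit a vertex. DP state: (bitmask of used colors, endpoint)."""
    n = Y.shape[0]
    f = {(1 << coloring[i], i): 1.0}
    for _ in range(ell):
        g = {}
        for (used, v), val in f.items():
            for u in range(n):
                bit = 1 << coloring[u]
                if used & bit:
                    continue                     # color already used: skip
                key = (used | bit, u)
                g[key] = g.get(key, 0.0) + val * Y[v, u]
        f = g
    return sum(val for (used, v), val in f.items() if v == j)

def saw_entry_estimate(Y, i, j, ell, trials, rng):
    """Unbiased estimate of P_ij(Y): a length-ell SAW has ell + 1 vertices,
    which are colorful with probability L!/L^L under L = ell + 1 colors."""
    L = ell + 1
    p_colorful = math.factorial(L) / L ** L
    acc = 0.0
    for _ in range(trials):
        c = rng.integers(L, size=Y.shape[0])
        acc += colorful_walk_sum(Y, i, j, ell, c)
    return acc / (trials * p_colorful)

# Check against brute force on a tiny instance: with an injective coloring,
# every self-avoiding walk is colorful, so the DP computes P_ij(Y) exactly.
rng = np.random.default_rng(3)
n, ell, i, j = 5, 3, 0, 4
G = rng.normal(size=(n, n))
Y = (G + G.T) / 2
exact = 0.0
for mid in itertools.permutations([v for v in range(n) if v not in (i, j)], ell - 1):
    walk = (i,) + mid + (j,)
    p = 1.0
    for a in range(ell):
        p *= Y[walk[a], walk[a + 1]]
    exact += p
dp = colorful_walk_sum(Y, i, j, ell, list(range(n)))
```

The dynamic program runs in time 2^{O(ℓ)} · n², and averaging over exp(O(ℓ)) independent colorings drives down the variance, which is the source of the polynomial running time for ℓ = O(log n).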
Algorithm 1:
Algorithm for evaluating self-avoiding walk matrix
Data:
Given Y ∈ R^{n×n} s.t. Y = λxx⊤ + W.
Result: P(Y) ∈ R^{n×n}, where P_ij(Y) is the sum of the multilinear monomials corresponding to length-ℓ self-avoiding walks between i, j (up to accuracy n^{−Ω(1)}).
C ← exp(100ℓ);
for t ← 1 to C do
  Sample a coloring c_t : [n] → [ℓ] uniformly at random;
  Construct a matrix M ∈ R^{2^ℓ n × 2^ℓ n}, with rows and columns indexed by (v, S), where v ∈ [n] and S is a subset of [ℓ];
  a matrix H ∈ R^{n × 2^ℓ n}, with rows indexed by [n] and columns indexed by (v, S), where v ∈ [n] and S is a subset of [ℓ];
  a matrix N ∈ R^{2^ℓ n × n}, with rows indexed by (v, S), where v ∈ [n] and S is a subset of [ℓ], and columns indexed by [n];
  Record the matrix p_{c_t} = H M^{ℓ−2} N;
Return (ℓ^ℓ/ℓ!) · Σ_{t=1}^C p_{c_t} / C
We describe how to construct the matrices
H, M, N used in Algorithm 1, given a coloring c : [n] → [ℓ]. For the matrix M, the entry M_{(v_1,S),(v_2,T)} = Y_{v_1,v_2} if T = S ∪ {c(v_2)} and c(v_2) ∉ S; otherwise M_{(v_1,S),(v_2,T)} = 0. For the matrix H, the entry H_{v_0,(v_1,S)} = Y_{v_0,v_1} if S = {c(v_1)}; otherwise H_{v_0,(v_1,S)} = 0. For the matrix N, the entry N_{(v_1,S),v_2} = Y_{v_1,v_2} if c(v_2) ∉ S and S ∪ {c(v_2)} = [ℓ]; otherwise N_{(v_1,S),v_2} = 0.

The critical observation is that for a coloring c : [n] → [ℓ] sampled uniformly at random, each self-avoiding walk is counted by p_c exactly when its vertices after the starting vertex receive pairwise distinct colors, which happens with probability ℓ!/ℓ^ℓ; hence, after rescaling, E_c p_c(Y) yields an unbiased estimator of P(Y). By averaging over many such random colorings, we obtain an unbiased estimator with low variance. The proof is deferred to appendix A.4.

Combining Theorem 1.6 and Lemmas 2.1 and 2.8, we obtain Theorem 1.2.

In this section, we show that while truncating entries at a threshold τ(n) can help on many occasions, it can fail for some noise distributions we consider. The class of truncation algorithms we consider can be described as follows:
Given matrix Y ∈ R n × n , set truncation threshold τ = τ ( n ) . We first obtain Y ′ by truncating the entries Y ij with magnitude larger than τ to sgn ( Y ij ) τ . Then, we obtain Y ′′ bysubtracting the average value of all entries in Y ′ . Finally we extract the top eigenvector of Y ′′ .9irst we show that for many long tail distributions, PCA algorithm can be saved by such truncation.(We defer the proof to appendix A.1.) Theorem 2.10.
Consider problem 1.1 such that the signal-to-noise ratio satisfies ε = n / λ − , the upper triangular entries of W are identically distributed, the entries of X = λxx ⊤ arebounded by o (1) , and the entries of x sum to . Then, for τ = min ( ε , , the algorithm 2.9 outputsunit norm estimator ˆ x ∈ R n s.t h x, ˆ x i = Ω( n ) : However, as illustrated by the following examples, this truncation strategy can fail inherently whenthe noise entries are not identically distributed and their distributions are adversarily chosen (depend-ing on the vector x ). We show that there is no choice of truncation level τ for which Algorithm 2.9outputs a vector whose correlation with the planted vector x is nonvanishing for all choices of noisematrix W whose entries are independently sampled with zero mean and unit variance.First truncating at τ = Ω( √ n ) fails the following example Example 2.11.
For d = ω(1), W_ij equals −√((n−d)/d) with probability d/n and √(d/(n−d)) with probability 1 − d/n.

This is just the normalized and centered adjacency matrix of an Erdős–Rényi random graph, the spectrum of which is well studied in the literature [BGBK17, MS16]. For superconstant d, the spectral norm is of order ω(√n), much larger than the spectral norm of λxx⊤. Therefore, the leading eigenvector will not be correlated with the hidden vector x as we desire.

Then we only need to consider τ = o(√n). For simplicity, we analyze an alternative strategy where entries with |Y_ij| > τ are truncated to 0. Similar results for truncation to τ sgn(Y_ij) are in appendix A.2. We consider the example below.

Example 2.12.
For i + j even, we let W ij sampled as in example 2.11. For i + j odd, we let − W ij distributed the same as above.For d = o ( n/τ ) , only entries perturbed by noise ± q dn − d are preserved. Then Y ′ ij = λx i x j + q dn − d ( − i + j with probability − dn and with probability d/n . Therefore the leading eigenvectorof Y ′ will be well correlated with h rather than x .Since Y ′ has zero mean, Y ′′ − Y ′ has smallFrobenius norm, thus the leading eigenvector of Y ′′ is close to Y ′ . For comparing the performance of algorithms proposed, we conduct experiments with several typicaldistributions of noise: (1) the noise is distributed as example 2.11. (2) the noise is distributed asexample 2.12 (3) entry W ij is distributed as N (0 , when i + j is even and as example 2.11 when i + j is odd. In each case planted vector x is randomly sampled from N (0 , Id n ) . In these examples,the smaller parameter d corresponds to the more heavy tailed noise distribution.In experiments with size n = 10 − , self-avoiding walk estimator shows better performance thannaive PCA and truncation PCA algorithm. Furthermore, the non-backtracking algorithm achievesperformance no worse than self-avoiding walk estimator under many settings. The results are shownin figure 1. For proving theorem 1.3, we use the sum of multilinear polynomials corresponding to a variant ofself-avoiding walk. Here we only describe a simple special case of the algorithm, which providesestimation guarantee when λ > n − / . Definition 3.1 (Polynomial time estimator for spiked tensor recovery) . Given tensor Y ∈ R n × n × n ,we have estimator P ( Y ) ∈ R n where each entry is degree ℓ − polynomial given by P i ( Y ) = There is trivial algorithm for this specific noise distribution, but it breaks down easily for other noisedistribution included in the class we consider. −1 parameter d Sq u a r e d c o rr e l a t i o n naivetruncateNBWSAWworst (a) n = 200 , λ ′ = 1 . , d ∈ [0 . 
Figure 1: (a), (b) The performance of the non-backtracking walk estimator with ℓ = 10 is no worse than that of the self-avoiding walk estimator with ℓ = 7 under distributions (2), (3); both drastically beat the naive PCA algorithm (n = 200, varying d). (c) The performance of the non-backtracking walk estimator with length ℓ = 17 can be much better than PCA under distribution (1) (n = 400, varying d). (d) Truncating at τ = 5 can fail drastically under distribution (2) (n = 2000, varying λ′, d = 30). Each data point is the result of averaging over repeated trials. For notation, λ′ = λn^{1/2}, and the y-axis represents the mean of the squared correlation ⟨x̂, x⟩²/(‖x̂‖²‖x‖²). The line "worst" represents the optimal guarantee in the case of Gaussian noise with the same λ, while the line "NBW" represents the experimental results of the non-backtracking walk algorithm.

Σ_{α∈S_{ℓ,i}} χ_α(Y), where χ_α(Y) is the multilinear polynomial basis element χ_α(Y) = Π_{(i,j,k)∈α} Y_ijk and S_{ℓ,i} is the set of directed hypergraphs associated with vertex i generated in the following way:
• We construct ℓ levels of distinct vertices. Level 0 is the single vertex i; the intermediate levels 0 < t < ℓ − 1 contain one or two vertices each, and level ℓ − 1 contains two vertices.
• We connect a hyperedge between the adjacent levels t − 1, t. Each hyperedge is directed from level t − 1 to level t.
• For the vertex u which lies in level ℓ − 2 and the vertices v, v′ which lie in level ℓ − 1, we add the hyperedge (u, v, v′).
An illustration of a self-avoiding walk α ∈ S_{ℓ,i} is given in figure 2. In appendix B.2 we show that, by introducing width to the levels, estimation under smaller SNR λ is possible by exploiting more computational power. In appendix B.3 we show that P(Y) can be evaluated in n^{O(v)} time using the color coding method.
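As a concrete (and entirely illustrative) sketch of the color coding idea in the simpler matrix setting — our own toy code, not the paper's implementation — the sum over colorful self-avoiding walks can be computed by dynamic programming over pairs (current vertex, set of used colors) and checked against brute-force enumeration. A walk whose vertex colors are all distinct is automatically self-avoiding, so the DP never needs to store the set of visited vertices:

```python
import itertools
import random

def colorful_saw_sum(Y, color, ell, i, j):
    """Sum over length-ell walks from i to j whose vertex colors are all
    distinct (hence self-avoiding) of the product of Y-entries, via DP over
    states (current vertex, frozenset of used colors)."""
    n = len(Y)
    states = {(i, frozenset([color[i]])): 1.0}
    for _ in range(ell):
        nxt = {}
        for (v, used), val in states.items():
            for w in range(n):
                if w != v and color[w] not in used:
                    key = (w, used | {color[w]})
                    nxt[key] = nxt.get(key, 0.0) + val * Y[v][w]
        states = nxt
    return sum(val for (v, _), val in states.items() if v == j)

def brute_force_sum(Y, color, ell, i, j):
    """Direct enumeration of all length-ell colorful walks from i to j."""
    n = len(Y)
    total = 0.0
    for mid in itertools.product(range(n), repeat=ell - 1):
        walk = (i,) + mid + (j,)
        if len({color[v] for v in walk}) == ell + 1:  # all colors distinct
            prod = 1.0
            for a, b in zip(walk, walk[1:]):
                prod *= Y[a][b]
            total += prod
    return total

rng = random.Random(0)
n, ell = 6, 3
Y = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
color = [rng.randrange(ell + 1) for _ in range(n)]  # ell+1 colors for ell+1 vertices
a = colorful_saw_sum(Y, color, ell, 0, 5)
b = brute_force_sum(Y, color, ell, 0, 5)
assert abs(a - b) < 1e-9
```

The DP is the conceptual core of the matrix products H M^{ℓ−2} N used by the evaluation algorithms: each transfer step extends a walk by one edge while tracking only the color set.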
These lead to the proofs of theorems 1.3 and 1.4.

Figure 2: Illustration of a self-avoiding walk α ∈ S_{ℓ,i} for tensor estimation. Each colored area corresponds to a hyperedge.

We provide an algorithm which nontrivially estimates rank-one spikes of Wigner matrices for signal-to-noise ratios λ approaching the sharp threshold λ√n → 1, even in the setting of heavy-tailed noise (having only two finite moments) with an unknown, adversarially-chosen distribution. For future work, it would be intriguing to obtain strengthened guarantees along (at least) two axes. First, [PWBM18] give an algorithm which recovers rank-one spikes for even smaller values of λ when (a) the distribution of the entries of W is known, and (b) a large constant number of moments of the entries of W are O(1). Relaxing either of the assumptions (a) or (b) while keeping λ√n ≪ 1 would be very interesting.

In a different direction, our experiments suggest that the non-backtracking walk estimator performs as well as the self-avoiding walk estimator, which we are able to analyze rigorously. Rigorously establishing similar guarantees for the non-backtracking walk estimator – or finding counterexamples – would be of great interest.

S.B.H. is supported by a Miller Postdoctoral Fellowship. J.D. and D.S. are supported by an ERC consolidator grant.
References

[Abb17] Emmanuel Abbe. Community detection and stochastic block models: recent developments. The Journal of Machine Learning Research, 18(1):6446–6531, 2017.
[AGH+14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.
[AS18] Emmanuel Abbe and Colin Sandon. Proof of the achievability conjectures for the general stochastic block model. Communications on Pure and Applied Mathematics, 71(7):1334–1406, 2018.
[AYZ95] Noga Alon, Raphael Yuster, and Uri Zwick. Color-coding. Journal of the ACM (JACM), 42(4):844–856, 1995.
[BBAP05] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab., 33(5):1643–1697, 2005.
[BCRT19] Giulio Biroli, Chiara Cammarota, and Federico Ricci-Tersenghi. How to iron out rough landscapes and get optimal performances: Replicated gradient descent and its application to tensor PCA, 2019.
[BGBK17] Florent Benaych-Georges, Charles Bordenave, and Antti Knowles. Spectral radii of sparse random matrices, 2017.
[BGL+15] , pages 1347–1357. IEEE, 2015.
[DKMZ11] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6):066106, 2011.
[Has19] M. B. Hastings. Classical and quantum algorithms for tensor principal component analysis, 2019.
[HKP+17] Samuel B. Hopkins, Pravesh K. Kothari, Aaron Potechin, Prasad Raghavendra, Tselil Schramm, and David Steurer. The power of sum-of-squares for detecting hidden structures. In FOCS, pages 720–731. IEEE, 2017.
[HS17] S. B. Hopkins and D. Steurer. Efficient Bayesian estimation from few samples: Community detection and related problems. In FOCS, pages 379–390, 2017.
[HSS15] Samuel B. Hopkins, Jonathan Shi, and David Steurer. Tensor principal component analysis via sum-of-squares proofs. In Conference on Learning Theory, pages 956–1006, 2015.
[Joh01] Iain M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327, 2001.
[KMM+13] Florent Krzakala, Cristopher Moore, Elchanan Mossel, Joe Neeman, Allan Sly, Lenka Zdeborová, and Pan Zhang. Spectral redemption in clustering sparse networks. Proceedings of the National Academy of Sciences, 110(52):20935–20940, 2013.
[KWB19] Dmitriy Kunisky, Alexander S. Wein, and Afonso S. Bandeira. Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio, 2019.
[MNS18] Elchanan Mossel, Joe Neeman, and Allan Sly. A proof of the block model threshold conjecture. Combinatorica, 38(3):665–708, 2018.
[MS16] Andrea Montanari and Subhabrata Sen. Semidefinite programs on sparse random graphs and their application to community detection. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, STOC '16, pages 814–827, New York, NY, USA, 2016. ACM.
[PRS13] Alessandro Pizzo, David Renfrew, and Alexander Soshnikov. On finite rank deformations of Wigner matrices. Ann. Inst. H. Poincaré Probab. Statist., 49(1):64–94, 2013.
[PWBM18] Amelia Perry, Alexander S. Wein, Afonso S. Bandeira, and Ankur Moitra. Optimality and sub-optimality of PCA I: Spiked random matrix models. The Annals of Statistics, 46(5):2416–2451, 2018.
[RM14] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. In Advances in Neural Information Processing Systems, pages 2897–2905, 2014.
[RRS17] Prasad Raghavendra, Satish Rao, and Tselil Schramm. Strongly refuting random CSPs below the spectral threshold. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 121–131. Association for Computing Machinery, 2017.
[RSS18] Prasad Raghavendra, Tselil Schramm, and David Steurer. High-dimensional estimation via sum-of-squares proofs. International Congress of Mathematicians, 2018.
[SLKZ15] Alaa Saade, Marc Lelarge, Florent Krzakala, and Lenka Zdeborová. Spectral detection in the censored block model. In ISIT, pages 1184–1188, 2015.
[WAM19] Alexander S. Wein, Ahmed El Alaoui, and Cristopher Moore. The Kikuchi hierarchy and tensor PCA. In FOCS, pages 1446–1468, 2019.
A Spiked matrix model
A.1 Proof of Theorem 2.10
For the proof of theorem 2.10, we need a result available in previous literature about the universality of the spiked matrix model.
Theorem A.1 (Theorem 1.1 in [PRS13]). In the spiked matrix model Y = λxx⊤ + W, where x ∈ R^n has norm √n and W ∈ R^{n×n} is a symmetric random matrix of i.i.d. entries with zero mean and variance bounded by 1, if a large enough constant moment of the entries of W is bounded by O(1), then the following guarantee holds w.h.p.:

λ_max(Y) ≥ (1 − o(1)) (λn + 1/λ).

We also need a simple observation about the deterministic relation between the leading eigenvalue and the leading eigenvector in the spiked matrix model.
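The observation below (lemma A.2) is easy to check numerically. The following sketch — ours, with arbitrary illustrative parameters — builds N = γxx⊤ + M for a small diagonal M, approximates the leading eigenvector ξ of N by power iteration, and verifies the inequality λ_max(N) ≤ λ_max(M) + γ⟨ξ, x⟩² that drives the proof:

```python
import random

def power_iteration(matvec, n, iters=500, seed=0):
    """Leading eigenvalue (Rayleigh quotient) and unit eigenvector of a
    symmetric PSD matrix given only through matrix-vector products."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(n)]
    lam = 0.0
    for _ in range(iters):
        w = matvec(v)
        lam = sum(a * b for a, b in zip(v, w))  # v is unit after 1st round
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]
    return lam, v

n = 16
x = [1.0 if k % 2 == 0 else -1.0 for k in range(n)]  # ||x||^2 = n
gamma = 1.0
diag = [0.05 * k for k in range(n)]                   # M = diag(...), lam_max(M) = 0.75

def matvec(v):
    xv = sum(a * b for a, b in zip(x, v))
    return [gamma * x[k] * xv + diag[k] * v[k] for k in range(n)]

lam_N, xi = power_iteration(matvec, n)
lam_M = max(diag)
corr = sum(a * b for a, b in zip(xi, x)) ** 2  # <xi, x>^2 for unit-norm xi
# the proof's inequality: lam_max(N) <= lam_max(M) + gamma * <xi, x>^2
assert lam_N <= lam_M + gamma * corr + 1e-6
# a large eigenvalue gap forces constant correlation: <xi, x>^2 = Omega(n)
assert corr > 0.9 * n
```

Here the gap λ_max(N) − λ_max(M) ≈ nγ is as large as possible, so ⟨ξ, x⟩² comes out close to its maximum value n.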
Lemma A.2.
For a matrix M ∈ R^{n×n} and the matrix N = γxx⊤ + M (where γ > 0 and x ∈ R^n has norm √n), if the leading eigenvalue λ_max(N) is larger than λ_max(M) by Ω(nγ), then the unit-norm leading eigenvector ξ of N achieves constant correlation with x: ⟨ξ, x⟩² ≥ Ω(n).

Proof.
We have λ_max(N) = ξ⊤(γxx⊤ + M)ξ ≤ λ_max(M) + γ⟨ξ, x⟩². Since λ_max(N) − λ_max(M) = Ω(nγ), we have ⟨ξ, x⟩² = Ω(n).

Proof of Theorem 2.10.
By definition we have Y′_ij = Y_ij·1{|Y_ij| ≤ τ} + τ·sgn(Y_ij)·1{|Y_ij| ≥ τ}. Given the assumption |λx_i x_j| = o(τ), one can observe that this can be decomposed into Y′ = λxx⊤ + T + M + ∆, where

T_ij = W_ij·1{|W_ij| ≤ τ} + τ·sgn(W_ij)·1{|W_ij| > τ},
M_ij = −λx_i x_j·1{|W_ij| ≥ τ},
∆_ij = (τ·sgn(Y_ij) − Y_ij)·(1{|λx_i x_j + W_ij| ≥ τ} − 1{|W_ij| ≥ τ}).

Further, we denote Y′ − Y′′ by H. Then we have Y′′ = λxx⊤ + (T − E T) + M + ∆ − (H − E T). Next we analyze the terms in the decomposition of Y′′. Specifically, we want to show that with constant probability the largest eigenvalue of Y′′ is larger than that of Y′′ − λxx⊤ by Ω(λn). If this is proved, then the leading unit eigenvector x̂ of Y′′ must satisfy ⟨x̂, x⟩² = Ω(n) with constant probability by lemma A.2.

First, for the matrix T − E T, the variance of each entry is bounded by 1, and each entry is bounded by 2τ. According to theorem A.1, the largest eigenvalue of the matrix λxx⊤ + T − E T is at least (1 − o(1))(λn + 1/λ) and the largest eigenvalue of the matrix T − E T is at most (2 + o(1))√n with high probability.

For the matrix M, we have E 1{|W_ij| ≥ τ} ≤ 1/τ² because the variance of the entries of W is bounded by 1. Therefore the expectation E‖M‖²_F is bounded by λ²n²/τ². For the non-zero entries (i, j) of the matrix ∆, we must have |Y_ij − τ·sgn(Y_ij)| ≤ |λx_i x_j|; therefore these non-zero entries ∆_ij are bounded by 2|λx_i x_j|. Further, each entry of ∆ is non-zero with probability bounded by O(1/τ²). Therefore E‖∆‖²_F is bounded by O(λ²n²/τ²).

Finally, we have H = h̄·J, where J is the all-ones matrix and h̄ = Σ_{i,j} Y′_ij / n². Since the T_ij are i.i.d. for i ≤ j, we have

E[ ( Σ_{ij} T_ij / n² − E[T_ij] )² ] ≤ O(1/n²).
By linearity of expectation, E‖H − E T‖²_F ≤ O(1) ≤ O(λ²n²/τ²). In all, we have E‖Y′′ − λxx⊤ − T + E T‖²_F ≤ O(λ²n²/τ²). By Markov's inequality, with probability 9/10 we have ‖Y′′ − λxx⊤ − T + E T‖²_F ≤ O(λ²n²/τ²). As stated above, the largest eigenvalue of the matrix λxx⊤ + T − E T is at least (1 − o(1))(λn + 1/λ) and the largest eigenvalue of the matrix T − E T is at most (2 + o(1))√n with high probability. If we take τ to be a large enough constant, then with probability Ω(1) the spectral norm of Y′′ is larger than λn + 1/λ − c·min(ε, 1)λn and the spectral norm of Y′′ − λxx⊤ is smaller than 2√n + c·min(ε, 1)λn, for a small constant c > 0. Therefore, for λ√n = 1 + ε with ε = Ω(1), the spectral norm of Y′′ is larger than the spectral norm of Y′′ − λxx⊤ by Ω(λn) with constant probability. As a result, with constant probability the leading eigenvector of Y′′ must achieve Ω(1) correlation with the hidden vector x by lemma A.2.

A.2 Example of failure for truncation algorithm
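As a numerical companion to this appendix (our own sketch; all parameters are illustrative): the code below first checks that the two-point noise distribution reconstructed from example 2.11 has zero mean and unit variance while its negative support point blows up as d shrinks, and then verifies on the toy surrogate Y′ ≈ λxx⊤ + c·hh⊤ that the leading eigenvector aligns with the alternating-sign vector h rather than with the planted x:

```python
import math

def example_noise_support(n, d):
    """Two-point distribution (as reconstructed from example 2.11):
    -sqrt((n-d)/d) w.p. d/n and sqrt(d/(n-d)) w.p. 1 - d/n."""
    lo, p_lo = -math.sqrt((n - d) / d), d / n
    hi, p_hi = math.sqrt(d / (n - d)), 1.0 - d / n
    mean = lo * p_lo + hi * p_hi
    var = lo * lo * p_lo + hi * hi * p_hi - mean * mean
    return lo, mean, var

def leading_eigvec(matvec, n, iters=300):
    v = [1.0 + 0.01 * k for k in range(n)]  # deterministic generic start
    for _ in range(iters):
        w = matvec(v)
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]
    return v

n = 40
h = [(-1.0) ** k for k in range(n)]                  # alternating signs
x = [1.0 if k < n // 2 else -1.0 for k in range(n)]  # orthogonal to h
lam, c = 0.05, 1.0                                    # weak spike, strong hh^T part

def matvec(v):
    xv = sum(a * b for a, b in zip(x, v))
    hv = sum(a * b for a, b in zip(h, v))
    return [lam * x[k] * xv + c * h[k] * hv for k in range(n)]

lo, mean, var = example_noise_support(10 ** 6, 10)
xi = leading_eigvec(matvec, n)
corr_h = sum(a * b for a, b in zip(xi, h)) ** 2 / n
corr_x = sum(a * b for a, b in zip(xi, x)) ** 2 / n
```

Since c·n ≫ λn, power iteration locks onto the h direction: corr_h is close to 1 while corr_x is close to 0, which is exactly the failure mode analyzed in this appendix.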
For example 2.12, in the main article we only explained why truncating to 0 can fail. Next we show that truncating to τ·sgn(Y_ij) fails as well for d between ω(1) and o(n/τ²).

We still denote by h ∈ {±1}^n the Rademacher vector with alternating signs, and x ∈ {±1}^n is orthogonal to h. For d = o(n/τ²) and d = ω(1), only the entries perturbed by the noise value ±√(d/(n−d)) are not truncated. Then Y′_ij = λx_i x_j + √(d/(n−d))·(−1)^{i+j} with probability 1 − d/n, and ±τ with probability d/n. Therefore Y′ can be decomposed into

Y′ = λxx⊤ + √(d/(n−d))·hh⊤ + ∆,

where E‖∆‖ = o(√(nd)). Near the computational threshold λ = O(n^{−1/2}), the spectral norm of λxx⊤ is O(√n) = o(√(nd)). Further, the matrix H = Y′′ − Y′ also has spectral norm o(√(nd)) by the central limit theorem.

For the unit-norm leading eigenvector ξ of the matrix Y′′, we suppose that E⟨ξ, x⟩² ≥ Ω(n) and argue by contradiction. Because ξ is the leading eigenvector, we have E⟨ξξ⊤, Y′′⟩ ≥ E⟨hh⊤/n, Y′′⟩ ≥ (1 − o(1))√(nd). However, we have ⟨ξ, h⟩² ≤ n − ⟨ξ, x⟩². Therefore ⟨ξξ⊤, Y′′⟩ ≤ (1 − Ω(1))√(nd) + o(√(nd)). This leads to a contradiction.

A.3 Proof of lemma 2.2, lemma 2.4 and lemma 2.5
We first prove lemma 2.2.
Proof of Lemma 2.2.
First, by the bound on the infinity norm of x, we have

S_{t1,t2,V} ≤ ‖x‖_∞^{t1} · E_{(v1,...,v_{t2}) ⊆ [n]∖V} [ Π_{i=1}^{t2} x_{v_i} ].

Denote t1 + t2 = t, and now take the marginal over x_{v_{t2}}. The marginal is given by

E_{(v1,...,v_{t2}) ⊆ [n]∖V} [ Π_{i=1}^{t2} x_{v_i} ] = E_{(v1,...,v_{t2−1}) ⊆ [n]∖V} [ ( Π_{i=1}^{t2−1} x_{v_i} ) · ( −Σ_{i=1}^{t2−1} x_{v_i} − Σ_{i∈V} x_i ) / ( n − |V| − (t2 − 1) ) ]
= ( 1 ± O( (|V| + t)‖x‖²_∞ / n ) ) · E_{(v1,...,v_{t2−1}) ⊆ [n]∖V} [ Π_{i=1}^{t2−1} x_{v_i} ].

By induction this is (1 ± n^{−Ω(1)})^t. In all, we have S_{t1,t2,V} bounded by (1 + o(1))‖x‖_∞^{t1}.

Next we prove lemma 2.4.

Proof of Lemma 2.4. For self-avoiding walks α, β ∈ SAW(i, j), the connected components of α ∩ β are all self-avoiding walks. We consider the quantity g = r − k. Each of the p connected components of α ∩ β containing neither i nor j contributes 1 to g. Since α ≠ β, the other connected components of α ∩ β can contain only one of i, j; such connected components contribute 0 to g. Further, each shared vertex not incident to any shared edge contributes 1 to g, and the total number of such vertices is s. Therefore we have r − k = p + s.

Finally we prove lemma 2.5.
Proof of Lemma 2.5.
We prove the first bound. We represent ξ as an ordered set of vertices (ξ_0, ξ_1, ..., ξ_h); then the product in the expectation can be expanded into a sum of monomials:

E_{ξ ⊆ [n]∖V} [ x_{ξ_0} x_{ξ_h} Π_{i∈[h]} (1 + λ² x_{ξ_i} x_{ξ_{i−1}}) ] = E_{ξ ⊆ [n]∖V} [ Σ_{S⊆[h]} x_{ξ_0} x_{ξ_h} Π_{i∈S} λ² x_{ξ_i} x_{ξ_{i−1}} ].

Since for a fixed set S ⊆ [h] the number of variables x_u with degree 1 in the monomial is bounded by |S| + 1, by lemma 2.2 we have

E_{ξ_0,ξ_1,...,ξ_h} [ x_{ξ_0} x_{ξ_h} Π_{i∈S} ( λ² x_{ξ_i} x_{ξ_{i−1}} ) ] ≤ (1 + n^{−Ω(1)}) ‖x‖²_∞ ( λ²‖x‖²_∞ )^{|S|}.

On the other hand, we have (1 + λ²‖x‖²_∞)^h = Σ_{S⊆[h]} ( λ²‖x‖²_∞ )^{|S|}. Therefore

E_{ξ ⊆ [n]∖V} [ x_{ξ_0} x_{ξ_h} Π_{i∈[h]} (1 + λ² x_{ξ_i} x_{ξ_{i−1}}) ] ≤ (1 + n^{−Ω(1)}) ‖x‖²_∞ (1 + λ²‖x‖²_∞)^h.

The other three bounds can be proved in very similar ways.
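The counting identity in lemma 2.4 can be checked empirically. In the sketch below (ours; we read the garbled statement as: r counts shared vertices other than the endpoints i, j, p counts shared path components containing neither endpoint, and s counts shared vertices incident to no shared edge), random pairs of distinct self-avoiding walks always satisfy r − k = p + s:

```python
import random

def random_saw(rng, n, ell, i, j):
    """A random self-avoiding walk with ell edges from i to j: endpoints
    fixed, interior vertices distinct and different from i, j."""
    interior = rng.sample([v for v in range(n) if v not in (i, j)], ell - 1)
    return [i] + interior + [j]

def check_identity(alpha, beta, i, j):
    ea = {frozenset(e) for e in zip(alpha, alpha[1:])}
    eb = {frozenset(e) for e in zip(beta, beta[1:])}
    shared_e = ea & eb
    shared_v = (set(alpha) & set(beta)) - {i, j}
    r, k = len(shared_v), len(shared_e)
    # adjacency of the shared-edge graph, then its connected components
    adj = {}
    for e in shared_e:
        u, w = tuple(e)
        adj.setdefault(u, []).append(w)
        adj.setdefault(w, []).append(u)
    seen, comps = set(), []
    for u in adj:
        if u not in seen:
            stack, comp = [u], set()
            while stack:
                v = stack.pop()
                if v not in comp:
                    comp.add(v)
                    seen.add(v)
                    stack.extend(adj[v])
            comps.append(comp)
    p = sum(1 for comp in comps if i not in comp and j not in comp)
    s = sum(1 for v in shared_v if v not in adj)  # shared but edge-isolated
    return r - k == p + s

rng = random.Random(0)
n, ell, i, j = 12, 5, 0, 11
for _ in range(500):
    a = random_saw(rng, n, ell, i, j)
    b = random_saw(rng, n, ell, i, j)
    if a == b:
        continue  # the lemma assumes alpha != beta
    assert check_identity(a, b, i, j)
ok = True
```

The identity follows because every shared-edge component of two self-avoiding walks is a path, each contributing exactly one more vertex than edge, and components touching an endpoint lose that surplus.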
A.4 Evaluation of self-avoiding walk estimator
In the spiked matrix model Y = λxx⊤ + W, we denote δ = n^{1/2}λ − 1. To evaluate the degree-ℓ self-avoiding walk polynomial, with ℓ = O((log n)/δ),

P_ij(Y) = Σ_{α∈SAW(i,j)} Π_{(u,v)∈α} Y_uv,

we use a color coding strategy very similar to that of [HS17]. The algorithm and the construction of the matrices have already been described in the main body; we restate it as algorithm 2 for the reader's convenience.

On the complete graph K_n, for a specific coloring c : [n] → [ℓ], we say that a length-ℓ self-avoiding walk α = (v_1, v_2, ..., v_ℓ) is colorful if the colors of its vertices are all different. Then a critical observation is that p_{c,i,j}(Y) = Σ_{α∈SAW_ℓ(i,j)} F_{c,α} χ_α(Y), where F_{c,α} is the 0-1 indicator of the random event that α is colorful. Taking the expectation over a uniformly random coloring c, we have the following relation:

Lemma A.3.
In the algorithm 1, for C ≥ exp(100ℓ), we have

E_{Y,c_1,...,c_C} [ ( (1/C) Σ_t p_{c_t,i,j}(Y) − E_c p_{c,i,j}(Y) )² ] ≤ exp(−O(ℓ)) · E_Y [ ( E_c p_{c,i,j}(Y) )² ],

where the random colorings c_1, c_2, ..., c_C : [n] → [ℓ] are independently sampled uniformly at random, and the expectation over the random coloring c : [n] → [ℓ] on the right-hand side is taken uniformly at random.

Algorithm 2: Algorithm for evaluating the self-avoiding walk matrix
Data:
Given Y ∈ R^{n×n} s.t. Y = λxx⊤ + W.
Result: P(Y) ∈ R^{n×n}, where P_ij(Y) is the sum of multilinear monomials corresponding to length-ℓ self-avoiding walks between i, j (up to accuracy n^{−Ω(1)}).
for t ← 1 to C do
  Sample a coloring c_t : [n] → [ℓ] uniformly at random;
  Construct a matrix M ∈ R^{2^ℓ n × 2^ℓ n}, where rows and columns are indexed by (v, S), with v ∈ [n] and S a subset of [ℓ];
  a matrix H ∈ R^{n × 2^ℓ n}, where each row is indexed by [n] and each column is indexed by (v, S), with v ∈ [n] and S a subset of [ℓ];
  a matrix N ∈ R^{2^ℓ n × n}, where each row is indexed by (v, S), with S a subset of [ℓ], and each column is indexed by [n];
  Record p_{c_t} = H M^{ℓ−2} N;
Return (1/C) Σ_{t=1}^C p_{c_t}

Proof.
For a fixed path α, the probability that F_{c,α} = 1 is ℓ!/ℓ^ℓ ≥ exp(−O(ℓ)). Therefore we have

E[ p_{c,i,j}(Y)² ] = E[ Σ_{α,β∈SAW_ℓ(i,j)} χ_α(Y) χ_β(Y) F_{c,α} F_{c,β} ]   (3a)
≤ E_Y[ Σ_{α,β∈SAW_ℓ(i,j)} χ_α(Y) χ_β(Y) ]   (3b)
≤ exp(O(ℓ)) E_Y[ ( Σ_{α∈SAW_ℓ(i,j)} E_c[ χ_α(Y) F_{c,α} ] )² ]   (3c)
= exp(O(ℓ)) E_Y[ ( E_c[ p_{c,i,j}(Y) ] )² ].   (3d)

For steps (3b) and (3c), we use the facts that 0 ≤ F_{c,α} ≤ 1 and E[χ_α(Y)χ_β(Y)] ≥ 0 for all α, β ∈ SAW_ℓ(i, j). For step (3c), we also use the fact that E F_{c,α} ≥ exp(−O(ℓ)). Therefore

E_Y E_c[ ( p_{c,i,j}(Y) − E_c p_{c,i,j}(Y) )² ] ≤ exp(O(ℓ)) E_Y[ ( E_c p_{c,i,j}(Y) )² ].

By averaging p_{c,i,j}(Y) over C independent random colorings, the variance is reduced and we have

E_{Y,c_1,...,c_C}[ ( (1/C) Σ_t p_{c_t,i,j}(Y) − E_c p_{c,i,j}(Y) )² ] ≤ (1/C) exp(O(ℓ)) E_Y[ ( E_c p_{c,i,j}(Y) )² ].

Therefore, letting C = exp(100ℓ), the lemma is proved.

This lemma implies that the average p̄(Y) of p_c(Y) over n^{δ^{−O(1)}} independent random colorings gives an accurate approximation of P(Y). The following simple corollary implies that this matrix p̄(Y) achieves the same correlation with xx⊤ as P(Y).

Lemma A.4 (Formal statement of Lemma 2.8). Algorithm 2 runs in n^{δ^{−O(1)}} time when ℓ = O((log n)/δ). For the matrix returned by algorithm 2, we have

E_{c_1,...,c_C}[ (1/C) Σ_t p_{c_t,i,j}(Y) ] = (ℓ!/ℓ^ℓ) P_ij(Y),
E_{Y,c_1,...,c_C}[ ( (1/C) Σ_t p_{c_t,i,j}(Y) )² ] ≤ ( 1 + n^{−Ω(1)} ) E_Y[ ( (ℓ!/ℓ^ℓ) P_ij(Y) )² ].

Proof of Lemma A.4.
First we note that for ℓ = O((log n)/δ), algorithm 2 runs in time n^{δ^{−O(1)}}. For any random coloring c and length-ℓ self-avoiding walk α, the probability that F_{c,α} = 1 is ℓ!/ℓ^ℓ; thus E F_{c,α} = ℓ!/ℓ^ℓ. Since p_{c,i,j} = Σ_{α∈SAW_ℓ(i,j)} F_{c,α} χ_α(Y), by linearity of expectation we get the first equality.

By lemma A.3, we have

E_{Y,c_1,...,c_C}[ ( (1/C) Σ_t p_{c_t,i,j}(Y) − E_c p_{c,i,j}(Y) )² ] ≤ exp(−O(ℓ)) E_Y[ ( E_c p_{c,i,j}(Y) )² ].

Therefore

E_{Y,c_1,...,c_C}[ ( (1/C) Σ_t p_{c_t,i,j}(Y) )² ] ≤ ( 1 + n^{−Ω(1)} ) E_Y[ ( E_c p_{c,i,j}(Y) )² ].

Further, as stated above, E_c p_{c,i,j}(Y) = (ℓ!/ℓ^ℓ) P_ij(Y). Thus we get the inequality.

Now the proof of theorem 1.2 is self-evident.

Proof of Theorem 1.2.
We denote (1/C) Σ_{t=1}^C p_{c_t}(Y) by p̄(Y). Then by lemma A.4 and lemma 2.1, we have

E_{c_1,...,c_C,Y} ⟨p̄(Y), xx⊤⟩ / ( n ( E_{c_1,...,c_C,Y} ‖p̄(Y)‖²_F )^{1/2} ) = δ^{O(1)} = Ω(1).

By the same rounding procedure as in [HS17], we obtain theorem 1.2 by extracting a random vector in the span of the top 1/δ^{O(1)} eigenvectors of p̄(Y).

A.5 Algorithm for evaluating non-backtracking walk estimator
In experiments, we use an estimator closely related to non-backtracking walks and the color coding method. On the complete graph K_n, for vertex labels i, j ∈ [n], we define the set of length-ℓ, k-step non-backtracking walks (i, v_1, v_2, ..., j) as NBW_ℓ(i, j). For a non-backtracking walk α and a random coloring c : [n] → [ℓ], we denote by F_{c,α} the 0-1 indicator of the random event that each length-k chunk of the walk α is colorful (i.e., does not contain repeated colors). For a fixed path α, the probability that F_{c,α} = 1 is at least (1 − k/ℓ)^ℓ ≥ exp(−O(k)).

Then we use the following non-backtracking walk estimator P(Y) ∈ R^{n×n}:

P_ij(Y) = Σ_{α∈NBW_ℓ(i,j)} χ_α(Y) E_c F_{c,α},   (4)

where χ_α(Y) = Π_{(u,v)∈α} Y_uv and the expectation over the coloring c is taken uniformly. The algorithm for approximating P(Y) is given below. We now describe how to construct the matrices H, M, N.

For the matrix M, the entry corresponding to the index ((v_1, S), (v_2, T)) is given by Y_{v_1,v_2} if
• the color of v_1 is not contained in S and the color of v_2 is not contained in T, and
• the ordered set T is the concatenation of the color of v_2 and the first k − 1 elements of S.
Otherwise the entry is 0.

For the matrix H, the entry corresponding to (v_0, (v_1, S)) is given by Y_{v_0,v_1} if S consists of the single element given by the color of v_1 and the color of v_0 is different from the color of v_1; it is 0 otherwise.

For the matrix N, the entry corresponding to ((v_1, S), v_2) is given by Y_{v_1,v_2} if

Algorithm 3: Algorithm for evaluating the color-coding non-backtracking walk matrix
Data:
Given Y ∈ R^{n×n} s.t. Y = λxx⊤ + W.
Result:
Approximation of P(Y) ∈ R^{n×n}, where P_ij(Y) = Σ_{α∈NBW_ℓ(i,j)} p̂_α χ_α(Y), NBW_ℓ(i, j) is the set of length-ℓ non-backtracking walks between i, j, and p̂_α = E_c F_{c,α}.
for t ← 1 to C do
  Sample a random coloring c_t : [n] → [ℓ];
  Construct a matrix M ∈ R^{(n Σ_{s=0}^k ℓ^s) × (n Σ_{s=0}^k ℓ^s)}, with rows and columns indexed by (v, S), where v ∈ [n] and S is an ordered subset of [ℓ] with size bounded by k;
  a matrix H ∈ R^{n × (n Σ_{s=0}^k ℓ^s)}, with rows indexed by [n] and each column indexed by (v, S), where v ∈ [n] and S is an ordered subset of [ℓ] with size bounded by k;
  a matrix N ∈ R^{(n Σ_{s=0}^k ℓ^s) × n}, with columns indexed by [n] and each row indexed by (v, S), where v ∈ [n] and S is an ordered subset of [ℓ] with size bounded by k;
  Record p_{c_t} = H M^{ℓ−2} N;
Return (1/C) Σ_t p_{c_t}

• S contains k colors and the color of v_2 is different from the last color of S, and
• the color of v_2 is different from the color of v_1 and from the first k − 1 elements of S,
and the entry is given by 0 otherwise.

Conditioning on some assumptions, we can sample a random vector in the top-δ^{−O(1)} span of such a matrix in quasilinear time.

Theorem A.5 (Evaluation of the color-coding non-backtracking walk estimator). In the spiked matrix model Y = λxx⊤ + W, the vector x ∈ R^n has norm √n and the entries of W ∈ R^{n×n} are independently sampled with zero mean and unit variance. Considering ℓ = O((log n)/δ) with δ = Ω(1), we assume that the distribution satisfies the following:

Σ_{α,β∈NBW_ℓ(i,j)} E[ χ_α(Y) χ_β(Y) F_{c,α} F_{c,β} ] = exp(O(k)) Σ_{α,β∈NBW_ℓ(i,j)} E[ χ_α(Y) χ_β(Y) · E_c F_{c,α} · E_c F_{c,β} ].

Then, for P(Y) defined in equation (4), a unit-norm random vector in the span of the top δ^{−O(1)} eigenvectors of P(Y) can be sampled in time O(n log^{k+1}(n) δ^{−O(1)} exp(O(k))) if ℓ = O(log n). If we assume that the k-step non-backtracking walk matrix achieves correlation δ with xx⊤, then this random vector ξ achieves constant correlation with x: ⟨ξ, x⟩² = δ^{O(1)} n.
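For intuition about what this estimator computes, here is a minimal prototype of the plain (uncolored) non-backtracking walk statistic — our own sketch; the actual algorithm 3 additionally reweights walks by the chunk-colorful probabilities p̂_α and uses the sparse H, M, N factorization. A transfer recursion over directed edges sums Π_{(u,v)∈α} Y_uv over all length-ℓ walks that never immediately reverse a step, checked against brute force:

```python
import itertools
import random

def nbw_sum(Y, ell, i, j):
    """Sum over length-ell non-backtracking walks i -> j of the product of
    Y-entries, via DP over directed edges (came_from, at)."""
    n = len(Y)
    states = {(i, v): Y[i][v] for v in range(n) if v != i}
    for _ in range(ell - 1):
        nxt = {}
        for (u, v), val in states.items():
            for w in range(n):
                if w != v and w != u:  # no self-loops, no immediate backtrack
                    nxt[(v, w)] = nxt.get((v, w), 0.0) + val * Y[v][w]
        states = nxt
    return sum(val for (u, v), val in states.items() if v == j)

def nbw_brute(Y, ell, i, j):
    """Direct enumeration of all length-ell non-backtracking walks i -> j."""
    n = len(Y)
    total = 0.0
    for mid in itertools.product(range(n), repeat=ell - 1):
        walk = (i,) + mid + (j,)
        if all(walk[t] != walk[t + 1] for t in range(ell)) and \
           all(walk[t] != walk[t + 2] for t in range(ell - 1)):
            prod = 1.0
            for a, b in zip(walk, walk[1:]):
                prod *= Y[a][b]
            total += prod
    return total

rng = random.Random(0)
n, ell = 5, 4
Y = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
val = nbw_sum(Y, ell, 0, 3)
ref = nbw_brute(Y, ell, 0, 3)
assert abs(val - ref) < 1e-9
```

The directed-edge state space has size n(n − 1), so each transfer step costs O(n³) naively; the color-coding bookkeeping in algorithm 3 is what keeps the state space quasilinear.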
Remark: The assumption will be satisfied if, for all α, β ∈ NBW_ℓ(i, j), E[χ_α(Y) χ_β(Y)] ≥ 0.

Proof.
First, given a single random coloring c : [n] → [ℓ], the algorithm evaluates the matrix polynomial p_c(Y) ∈ R^{n×n} with each entry given by

p_{c,i,j}(Y) = Σ_{α∈NBW_ℓ(i,j)} χ_α(Y) F_{c,α},

where F_{c,α} is the indicator of the random event that each length-k chunk of α is colorful. First we note that p_{c,i,j}(Y) is an unbiased estimator of P_ij(Y). Therefore we have

E[ p_{c,i,j}(Y)² ] = E[ Σ_{α,β∈NBW_ℓ(i,j)} χ_α(Y) χ_β(Y) F_{c,α} F_{c,β} ] ≤ exp(O(k)) E_Y[ ( Σ_{α∈NBW_ℓ(i,j)} E_c[ χ_α(Y) F_{c,α} ] )² ],

and hence

E_{Y,c}[ ( p_{c,i,j}(Y) − E_c p_{c,i,j}(Y) )² ] ≤ exp(O(k)) E_Y[ ( E_c p_{c,i,j}(Y) )² ].

By averaging p_{c,i,j}(Y) over C = ω(exp(O(k))) random colorings, we have

E_{Y,c_1,...,c_C}[ ( (1/C) Σ_t p_{c_t,i,j}(Y) − E_c p_{c,i,j}(Y) )² ] ≤ (1/C) exp(O(k)) E_Y[ ( E_c p_{c,i,j}(Y) )² ].

Therefore, letting p̄_ij(Y) = (1/C) Σ_t p_{c_t,i,j}(Y), we have

E[ p̄_ij(Y)² ] = (1 + o(1)) E_Y[ ( E_c p_{c,i,j}(Y) )² ] = (1 + o(1)) E_Y[ P_ij(Y)² ].

As a result, we can use p̄_ij(Y) as a substitute for P_ij(Y). For extracting the span of the top δ^{−O(1)} eigenvectors, we apply the power method. Since p̄(Y) can be represented as a sum of chain products of matrices, we can iteratively apply matrix–vector products rather than forming p̄(Y) explicitly. Since the matrices H, M, N have at most n log^{k+1}(n) non-zero elements each, the resulting complexity is O(n log^{k+1}(n) δ^{−O(1)} exp(O(k))).

B Order-3 spiked tensor model
B.1 Strong detection algorithm for spiked tensor model
For the spiked tensor model, we also consider the strong detection problem, which is closely related to the weak recovery problem. Specifically, given a tensor Y sampled from the general spiked tensor model, we want to detect with high probability whether it was sampled with λ = 0 or with large λ.

Definition B.1 (Strong detection). Given a tensor Y sampled from the planted distribution P or the null distribution Q with equal probability, we need to find a function of the entries of Y, f(Y) ∈ {0, 1}, such that

(1/2) P[f(Y) = 1] + (1/2) Q[f(Y) = 0] = 1 − o(1).

It has not been explicitly stated in previous literature how to obtain a strong detection algorithm via the low degree method. The following self-evident fact provides a systematic way of doing so.
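Before stating it, the thresholding logic can be illustrated with a toy scalar statistic (entirely hypothetical numbers for intuition, not the polynomial used in the actual algorithm): under the null the statistic concentrates near 0, under the planted distribution near a large mean, so thresholding in between decides correctly with high probability:

```python
import random

def statistic(sample):
    # toy "low degree polynomial": the normalized sum of the entries
    return sum(sample) / len(sample) ** 0.5

def detect(sample, threshold):
    return 1 if statistic(sample) > threshold else 0

rng = random.Random(0)
n, trials, shift = 400, 200, 10.0
threshold = shift / 2.0  # halfway between the two means (0 and shift)
errors = 0
for _ in range(trials):
    null = [rng.choice((-1.0, 1.0)) for _ in range(n)]  # Y ~ Q
    planted = [shift / n ** 0.5 + rng.choice((-1.0, 1.0)) for _ in range(n)]  # Y ~ P
    errors += detect(null, threshold)          # false positive
    errors += 1 - detect(planted, threshold)   # false negative
accuracy = 1.0 - errors / (2 * trials)
```

Under Q the statistic is a normalized Rademacher sum with variance 1, while under P its mean is `shift` with the same fluctuations, so the error probability is exponentially small in `shift` — the same Chebyshev-style separation used in the proof below.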
Theorem B.2 (Low degree polynomial thresholding algorithm). Given Y sampled from P and Q with equal probability, consider the polynomial P(Y) = Σ_{α∈S_ℓ} μ̂_α χ_α(Y), where {χ_α(Y) : α ∈ S_ℓ} is a set of polynomial basis elements orthonormal under the measure Q, and μ̂_α is the Fourier coefficient of the likelihood ratio μ(Y) = P(Y)/Q(Y) between the planted and null distributions corresponding to the basis element χ_α(Y): μ̂_α = E_P χ_α(Y). If the low degree likelihood ratio diverges, Σ_{α∈S_ℓ} μ̂_α² = ω(1), and the concentration property E_P P(Y)² = (1 + o(1)) (E_P P(Y))² holds, then thresholding the polynomial P(Y) gives a strong detection algorithm.

Proof. Since Σ_{α∈S_ℓ} μ̂_α² = ω(1), we have E_P P(Y) = E_Q P(Y)² = ω(1). By Chebyshev's inequality, for Y ∼ P we have w.h.p. P(Y) = (1 ± o(1)) E_P P(Y) = ω(√(E_Q P(Y)²)), while for Y ∼ Q we have w.h.p. P(Y) = O(√(E_Q P(Y)²)).

Our guarantee for strong detection in the spiked tensor model can be stated as follows:
Theorem B.3.
Let x ∈ R^n be a random vector with independent, mean-zero entries having E x_i² = 1 and E x_i⁴ ≤ n^{o(1)}. Let λ > 0. Let Y = λ·x^{⊗3} + W, where W ∈ R^{n×n×n} has independent, mean-zero entries with E W_ijk² = 1. Then for c ≥ n^{−1/4+o(1)} and λ ≥ c·n^{−3/4}, there is an n^{O(1/c)}-time algorithm achieving strong detection.

For the strong detection algorithm, the thresholding polynomial we use is given by the following.

Figure 3: An example of possible directed hyperedge connections between the adjacent layers t − 1, t, t + 1 for a hypergraph α ∈ S_{ℓ,v}. (1) When t ∈ [1, ℓ − 1], each hyperedge is directed from layer t − 1 to layer t or from layer t to layer t + 1. (2) When t = 0, the relevant layers are ℓ − 1, 0, 1 by periodic indexing, and the hyperedges are directed analogously.

Definition B.4 (Thresholding polynomial for strong detection). On the directed complete 3-uniform hypergraph with n vertices, we define S_{ℓ,v} as the set of all copies of 2-regular hypergraphs generated in the following way:
• We construct ℓ levels of distinct vertices labeled by 0, 1, ..., ℓ − 1. Level t contains v vertices if t is even and 2v vertices if t is odd.
• Then we construct a perfect matching of hyperedges between the levels t, t − 1 for t ∈ [ℓ − 1] and between the levels 0, ℓ − 1. For each hyperedge, 1 vertex comes from an even level while 2 vertices come from an odd level.
The hyperedges are directed from level t to level t + 1 for t ∈ [0, ℓ − 2], and from level ℓ − 1 back to level 0. (An example of this construction is illustrated in figure 3.)

Given a tensor Y ∈ R^{n×n×n}, the degree-ℓv polynomial P(Y) is given by P(Y) = Σ_{α∈S_{ℓ,v}} χ_α(Y), where χ_α(Y) is the corresponding multilinear polynomial basis element χ_α(Y) = Π_{(i,j,k)∈α} Y_ijk. For simplicity of the formulation, we use periodic indexing below (i.e., by level t = −1 we mean level ℓ − 1).

For proving the strong detection guarantee, we first need two hypergraph properties.

Lemma B.5.
On the directed complete hypergraph with n vertices, the number of hypergraphs contained in the set S_{ℓ,v} defined in B.4 is given by

|S_{ℓ,v}| = (1 − o(1)) ( (n choose v)·(n choose 2v)·(2v)!/2^v )^ℓ.

Proof.
For fixed vertices, between level t and level t + 1 there are v!·(2v)!/(2^v v!) = (2v)!/2^v ways of connecting the hyperedges; therefore we have |S_{ℓ,v}| = (1 − o(1)) ( (n choose v)·(n choose 2v)·(2v)!/2^v )^ℓ.

Lemma B.6.
For a pair of hypergraphs α, β ∈ S ℓ,v , we denote the number of shared vertices as r , the number of shared hyperedges as k , and the number of shared vertices with degree or in α ∩ β as s . Then we have relation r − s ≥ k .Further if r = 3 k , then • either α, β are disjoint k = r = 0 • or for all levels t ∈ [0 , ℓ − , k t ≥ and are equal to the same value.Proof. In -uniform sub-hypergraph α ∩ β , there are k hyperedges, r − s vertices with degree andat most s vertices with degree . By degree constraint, we have k ≤ r − s ) + s = 2 r − s .22hen r = 3 k , each shared vertice between α, β has degree in α ∩ β . However if there exists t ∈ [0 , ℓ − such that k t − = k t , then there are shared vertices at level t − with degree or insubgraph α ∩ β . Thus either α, β are disjoint( k t = 0 ), or for all levels t , k t ≥ and are all equal.We consider the set of hypergraph pairs α, β ∈ S ℓ,v such that in α • at level t there are r t vertices shared with β • between level t + 1 and level t , there are k t hyperedges shared with β .We denote such set of hypergraph pairs as S ℓ,v,k,r , where k = P k t , r = P r t . Then r is just thenumber of shared vertices between α, β as r and k is just the number of shared hyperedges between α, β . Although we abuse the notations(since the set S ℓ,v,k,r is related to k t , r t ), by the followinglemma we can bound the size of such set only using k, r, ℓ, v . Lemma B.7.
On directed complete hypergraph with n vertices, for any set S ℓ,v,k,r with ℓ = O (log n ) , v = o ( n ) , k, r ≤ ℓv , the number of hypergraph pairs contained in the set S ℓ,v,k,r isbounded by | S ℓ,v,k,r | ≤ | S ℓ,v | n − r v r − k ℓ r − k v k/ exp( O ( r )) Proof.
We generate pairs of hypergraph α, β ∈ S ℓ,v in the following way: we first choose α ∩ β andshared vertices as a subgraph of hypergraph in S ℓ,v , and then choose remaining graph respectivelyfor α, β . In hypergraph α , suppose there are r t shared vertices in level t and k t shared hyperedgesbetween level t + 1 and level t .We define parity function δ ( t ) = 2 if t is odd and if t is even. Then we have relation r t ≥ δ ( t ) max ( k t − , k t ) . For choice of α ∩ β ,there are N α ∩ β = ℓ Y t =0 (cid:18) nr t (cid:19)(cid:18) nr t +1 (cid:19)(cid:18) r t k t (cid:19)(cid:18) r t k t − (cid:19)(cid:18) r t +1 k t (cid:19)(cid:18) r t +1 k t +1 (cid:19) (2 k t )!(2 k t +1 )! ≤ n r v k/ exp( O ( r )) such subgraphs. On the other hand, the number of choices for the remaining hypergraph of α isbounded by N α − β = ℓ − Y t =0 (cid:18) nv − r t (cid:19)(cid:18) n v − r t +1 (cid:19) (2( v − k t ))!(2( v − k t +1 ))!= | S ℓ,v | n − r v r − k exp( O ( r )) Then we consider the number of β . Denote the number of degree- vertices in α ∩ β as s and thenumber of shared vertices not contained in α ∩ β as s . Let s = s + s , then there are at most ℓ s ways of embedding α ∩ β in β . Choosing the remaining hypergraph of β is also bounded by | S ℓ,v | n − r v r − k exp( − O ( k )) . Therefore, with respect to fixed number of vertices and hyperedges r t , k t in each level of α ∩ β , the total number of such hypergraph pairs S r,k,ℓ,v is bounded by r − k X s =0 " | S ℓ,v | n − r v r − k ℓ s ℓ − Y t =0 r t ! k t ! exp( O ( r )) ≤ | S ℓ,v | n − r v r − k ℓ r − k v k/ exp( O ( r )) Next we prove the strong detection through lemma B.8 and B.11.
Lemma B.8.
When γ = 0.1 n^{3/4} v^{1/4} λ = 1 + Ω(1), ℓ = Ω(log_γ n) and n = ω(v poly(ℓΓ)), the projection of the likelihood ratio μ(Y) onto S_{ℓ,v} diverges, i.e. ∑_{α∈S_{ℓ,v}} \hat{μ}_α² = ω(1).
Proof. For any hypergraph α ∈ S_{ℓ,v}, we have |α| = 2ℓv and \hat{μ}_α = E χ_α(Y) = λ^{2ℓv}. By Lemma B.5, we have
∑_{α∈S_{ℓ,v}} \hat{μ}_α² ≥ (1 − o(1)) (\binom{n}{v} \binom{n}{2v} ((2v)!)²)^ℓ λ^{4ℓv} ≥ n^{3ℓv} v^{ℓv} λ^{4ℓv} exp(−O(ℓv)).
Since γ = 1 + Ω(1) and ℓ = Ω(log_γ n), we have ∑_{α∈S_{ℓ,v}} \hat{μ}_α² = n^{Ω(1)}. For proving E P(Y)² = (1 + o(1))(E P(Y))², we first prove several preliminary lemmas. First we bound the expectation of P(Y). Lemma B.9.
In the spiked tensor model Y = λ x^{⊗3} + W, the entries of x ∈ R^n are sampled independently with zero mean and unit variance, and the entries of W ∈ R^{n×n×n} are sampled independently with zero mean and unit variance. Then for P(Y) defined in B.4, we have E P(Y) = (1 − o(1)) λ^{2ℓv} (\binom{n}{v} \binom{n}{2v} ((2v)!)²)^ℓ.
Proof. First, for each α ∈ S_{ℓ,v}, we have E[χ_α(Y)] = λ^{2ℓv}. By Lemma B.5, we have |S_{ℓ,v}| = (1 − o(1)) (\binom{n}{v} \binom{n}{2v} ((2v)!)²)^ℓ. Since ∑_{α∈S_{ℓ,v}} E[χ_α(Y)] = λ^{2ℓv} |S_{ℓ,v}|, we get the lemma. Finally we need a lemma for bounding the summation over all possible k_t, r_t. Lemma B.10.
For t ∈ {0, 1, . . . , 2ℓ − 1}, we define r_t, k_t ∈ {0, 1, . . . , v} satisfying r_t ≥ δ(t) max(k_{t−1}, k_t), where the parity function δ(t) = 2 if t is odd and 1 if t is even. We denote k = ∑_{t=0}^{2ℓ−1} k_t and r = ∑_{t=0}^{2ℓ−1} r_t. We take a scalar η = ω(ℓ²) and a constant ψ > 1. Then for 2r ≥ 3k + 1, we have:
∑_{k_t ≥ 0} ∑_{r_t ≥ δ(t) max(k_{t−1}, k_t), 2r ≥ 3k+1} η^{−r+3k/2} ψ^{−k} ≤ o(1).
Proof.
We note that given the k_t for t ∈ [2ℓ], we have at most ℓ^{r−3k/2} choices for the r_t. Further, we denote k_∆ = ∑_t |k_{t+1} − k_t|. Then given k_∆, all the k_t can take at most k_∆ different values. As a result, fixing these k_∆ different values, there are ℓ^{k_∆} choices for the k_t, t ∈ [2ℓ]. Further, we have r − 3k/2 ≥ k_∆/2. Therefore the summation is bounded by
∑_{k_∆ ≥ 1} (η/ℓ²)^{−k_∆/2} ∏_{t=1}^{k_∆} ∑_{k_t ≥ 1} ψ^{−k_t} = o(1).
Lemma B.11.
Denote γ = 0.1 n^{3/4} v^{1/4} λ and take ℓ = Ω(log_γ n) in the above estimator P(Y). If we have γ = 1 + Ω(1) and n = ω(v poly(ℓΓ)), then E P(Y)² = (1 + o(1))(E P(Y))².
Proof. We need to show that
(∑_{α∈S_{ℓ,v}} E[χ_α(Y)])² = (1 − o(1)) ∑_{α,β∈S_{ℓ,v}} E[χ_α(Y) χ_β(Y)].
For the left-hand side, we already have Lemma B.9. Thus we only need to bound ∑_{α,β∈S_{ℓ,v}} E[χ_α(Y) χ_β(Y)]. First, by direct computation we have
E[χ_α(Y) χ_β(Y)] = (1 + n^{−Ω(1)}) λ^{4ℓv−2k} E ∏_{i∈α∆β} x_i^{deg(i, α∆β)} ≤ λ^{4ℓv−2k} Γ^{2r−3k},
where deg(i, α∆β) is the degree of vertex i in the hypergraph α∆β, r is the number of shared vertices, k is the number of shared hyperedges, and Γ = E[x_i⁴] = n^{o(1)} according to our assumptions. Using Lemma B.7 and Lemma B.9, for any set S_{ℓ,v,k,r} we have
∑_{α,β∈S_{ℓ,v,k,r}} E[χ_α(Y) χ_β(Y)] / (E P(Y))² ≤ n^{−r} v^{r−k} ℓ^{r−k} λ^{−2k} Γ^{2r−3k} v^{k/2} exp(cr),
where c is a large enough constant. Summing over the different r_t, k_t, and combining the fact that if 2r = 3k then k ≥ 2ℓ and k_t ≥ 1 for all t, we have
∑_{α,β∈S_{ℓ,v}} E[χ_α(Y) χ_β(Y)] / (E P(Y))² = ∑_{r_t,k_t} c^{3k/2} n^{−3k/2} v^{−k/2} λ^{−2k} (n / (c v Γ^{O(1)}))^{−r+3k/2} ℓ^{r−k}
= ∑_{k_t ≥ 1} ∑_{r_t ≥ δ(t) max(k_{t−1},k_t), 2r ≥ 3k+1} (n / (c v Γ^{O(1)} ℓ^{O(1)}))^{−r+3k/2} (c^{3/2} n^{−3/2} λ^{−2} v^{−1/2})^k + ∑_{k_t ≥ 1} (c^{3/2} n^{−3/2} λ^{−2} v^{−1/2})^{∑_t k_t} + 1,
where the parity function δ(t) = 2 if t is odd and 1 if t is even. The term 1 comes from the case r = k = 0. When γ = 0.1 n^{3/4} λ v^{1/4} > 1 and ℓ = C log_γ n with constant C large enough, the second term is bounded by n^{−Ω(1)}. Since n = ω(c v Γ^{O(1)} poly(ℓ)), by Lemma B.10 the first term is bounded by o(1). Thus we get the lemma. Proof of Theorem B.3.
Combining Lemma B.1, Lemma B.11, and Theorem B.2, thresholding P(Y) leads to a strong detection algorithm. B.2 Proof of weak recovery in the spiked tensor model
We define the notions of weak recovery and strong recovery in the spiked tensor model.
Definition B.12.
In the spiked tensor model Y = λ x^{⊗p} + W for a random vector x ∈ R^n and a random tensor W ∈ (R^n)^{⊗p}, we say that an estimator x̂(Y) ∈ R^n achieves weak recovery if
E⟨x̂(Y), x⟩ ≥ Ω((E‖x̂(Y)‖²)^{1/2} (E‖x‖²)^{1/2}).
Further, we say that x̂(Y) ∈ R^n achieves strong recovery if
E⟨x̂(Y), x⟩ ≥ (1 − o(1)) (E‖x̂(Y)‖²)^{1/2} (E‖x‖²)^{1/2}.
The estimator P(Y) ∈ R^n we take is defined as follows. Definition B.13 (Polynomial estimator for weak recovery). On the directed complete 3-uniform hypergraph with n vertices and for i ∈ [n], we define S_{ℓ,v,i} as the set of all copies of hypergraphs generated in the following way: • We construct 2ℓ levels of distinct vertices. Level 0 contains vertex i and (v − 1)/2 vertices in addition. For 0 < t < ℓ, level 2t contains v vertices and level 2t − 1 contains 2v vertices. Level 2ℓ − 1 contains 2v vertices. • We construct a perfect matching between levels t and t + 1 for t ∈ [2ℓ − 2]. For each hyperedge, 1 vertex comes from an even level while 2 vertices come from an odd level. Each hyperedge is directed from level t to level t + 1. (They are connected in the same way as S_{ℓ,v} of the strong detection case, which is demonstrated in Figure 3.) • Levels 0 and 1 are bipartitely connected s.t. each vertex in level 0 excluding i has degree 2, while vertex i and the vertices in level 1 have degree 1. Level 2ℓ − 2 and level 2ℓ − 1 are bipartitely connected s.t. the vertices in level 2ℓ − 2 have degree 2 while the vertices in level 2ℓ − 1 have degree 1.
Then given a tensor Y ∈ R^{n×n×n}, we have the estimator P(Y) ∈ R^n where each entry is a degree-(2ℓ−1)v polynomial in the entries of Y. For i ∈ [n], the i-th entry is given by P_i(Y) = ∑_{α∈S_{ℓ,v,i}} χ_α(Y), where χ_α(Y) is the multilinear polynomial basis χ_α(Y) = ∏_{(i,j,k)∈α} Y_{ijk} and S_{ℓ,v,i} is the set of hypergraphs defined above. We prove that the estimator P(Y) defined in B.13 achieves constant correlation with the hidden vector x.
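A direct (exponential-time) reading of the definition above: each α is a set of directed hyperedges, χ_α(Y) multiplies the corresponding tensor entries, and P_i(Y) sums χ_α(Y) over α ∈ S_{ℓ,v,i}. A minimal sketch, assuming a toy hand-picked family of hypergraphs in place of the actual S_{ℓ,v,i} (enumerating which is exactly what the color-coding algorithms of Appendix B.3 avoid):

```python
import numpy as np

def chi(Y, alpha):
    """Multilinear basis: product of tensor entries over the hyperedges of alpha."""
    out = 1.0
    for (i, j, k) in alpha:
        out *= Y[i, j, k]
    return out

def estimator_entry(Y, hypergraph_family):
    """P_i(Y) = sum of chi_alpha(Y) over a family of hypergraphs rooted at i."""
    return sum(chi(Y, alpha) for alpha in hypergraph_family)

# Toy usage: two tiny "hypergraphs", each a list of directed hyperedges.
rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 4, 4))
family = [[(0, 1, 2)], [(0, 1, 2), (1, 2, 3)]]
print(estimator_entry(Y, family))  # equals Y[0,1,2] + Y[0,1,2]*Y[1,2,3]
```

The family here is purely illustrative; in the paper the sum runs over the structured multi-level hypergraphs of Definition B.13.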
The proof is very similar to the proof for the strong detection algorithm.
Lemma B.14. In the spiked tensor model Y = λ x^{⊗3} + W, where the entries of x ∈ R^n and W ∈ R^{n×n×n} are independently sampled with zero mean and unit variance, we consider the estimator P(Y) ∈ R^n defined above with γ = 0.1 n^{3/4} v^{1/4} λ = 1 + Ω(1) and ℓ = O(log_γ n). Then we have
E[P_i(Y) x_i] = (1 − o(1)) λ^{(2ℓ−1)v} (\binom{n}{v} \binom{n}{2v})^{ℓ−1} \binom{n}{(v−1)/2} \binom{n}{2v} ((2v)!)^{2ℓ−1} ((3v−1)/2)!.
Proof.
Since ∑_{α∈S_{ℓ,v,i}} E[χ_α(Y) x_i] = λ^{(2ℓ−1)v} |S_{ℓ,v,i}|, we only need to bound the size of S_{ℓ,v,i}. Applying combinatorial arguments to the generating process of S_{ℓ,v,i}, we have
|S_{ℓ,v,i}| = (1 − o(1)) (\binom{n}{v} \binom{n}{2v})^{ℓ−1} \binom{n}{(v−1)/2} \binom{n}{2v} ((2v)!)^{2ℓ−1} ((3v−1)/2)!.
Lemma B.15.
On the directed complete 3-uniform hypergraph, for i ∈ [n] and the set of simple hypergraphs S_{ℓ,v,i}, we consider any hypergraphs α, β ∈ S_{ℓ,v,i}. Between α, β, we denote the number of shared vertices (excluding vertex i) in level t of α by r_t, and the number of shared hyperedges between level t and level t + 1 of α by k_t. Further, we denote by r = ∑_t r_t the total number of shared vertices excluding i and by k = ∑_t k_t the total number of shared hyperedges. Then one of the following relations must hold: • 2r ≥ 3k + 1, • 2r = 3k and α ∩ β is a hyperpath starting from vertex i, or empty, • 2r = 3k − 1 and k_t ≥ 1 for all t.
Proof.
Suppose we have 2r ≤ 3k − 1. Then by the degree constraint, in the hypergraph α ∩ β, excluding vertex i, all other vertices have degree 2. This is only possible if, for all levels t, the vertices contained in α ∩ β are connected to two hyperedges, implying that k_t, k_{t−1} ≥ 1. Further, we have 2r = 3k − 1 in this case.
Suppose we have 2r = 3k ≠ 0 and there is t ∈ [2ℓ] such that k_t = 0. Then in α ∩ β, exactly one vertex (excluding vertex i) has degree 1 and all the other vertices have degree 2. Thus there is only one level t′ such that k_{t′} = 0, and all shared vertices in level t′ have degree at most 1. This implies that there is exactly one shared vertex in level t′. This is only possible if α ∩ β is a hyperpath starting from vertex i. Lemma B.16.
For any α, β ∈ S_{ℓ,v,i} sharing k hyperedges, we have
E[χ_α(Y) χ_β(Y)] = (1 + n^{−Ω(1)}) λ^{2(2ℓ−1)v−2k} E ∏_{j∈α∆β} x_j^{deg(j, α∆β)} ≤ λ^{−2k} Γ^{O(2r−3k)} E[χ_α(Y) x_i] E[χ_β(Y) x_i],
where deg(j, α∆β) represents the degree of vertex j in the hypergraph α∆β.
Proof. This follows from the same computation as B.7.
We consider the set of hypergraph pairs α, β ∈ S_{ℓ,v,i} such that in α • at level t there are r_t vertices (excluding vertex i) shared with β, • between level t + 1 and level t there are k_t hyperedges shared with β. We denote this set of hypergraph pairs by S_{ℓ,v,i,k,r}, where k = ∑_t k_t, r = ∑_t r_t. Then r is just the number of shared vertices (excluding i) between α, β and k is just the number of shared hyperedges between α, β. Although we abuse notation (the set S_{ℓ,v,i,k,r} depends on the individual k_t, r_t), by the following lemma we can bound the size of this set using only k, r, ℓ, v.
Lemma B.17. On the directed complete hypergraph with n vertices, for any set S_{ℓ,v,i,k,r} with ℓ = O(log n), v = o(n), k, r ≤ 2ℓv, the number of hypergraph pairs contained in the set S_{ℓ,v,i,k,r} is bounded by
|S_{ℓ,v,i,k,r}| ≤ |S_{ℓ,v,i}|² n^{−r} v^{r−k/2} ℓ^{r−k} exp(O(r)).
Proof.
We first choose α ∩ β and the shared vertices as a subgraph of a hypergraph α ∈ S_{ℓ,v,i}, and then complete the remaining hypergraphs α∖β and β∖α. If in α there are r_t shared vertices (excluding i) in level t, and k_t shared hyperedges between level t and level t + 1 for t = 0, 1, . . . , 2ℓ − 2, then the number of choices for the shared vertices and α ∩ β is bounded by
N_{α∩β} ≤ \binom{n}{r_0} \binom{n}{r_{2ℓ−1}} \binom{r_0 + 1}{k_0} \binom{r_{2ℓ−1}}{k_{2ℓ−2}} ∏_{t=1}^{2ℓ−2} [\binom{n}{r_t} \binom{r_t}{k_t} \binom{r_t}{k_{t−1}}] ∏_{t=0}^{2ℓ−2} (2k_t)!.
This is upper bounded by ∏_{t=0}^{2ℓ−2} (2k_t)! ∏_{t=0}^{2ℓ−1} \binom{n}{r_t} exp(O(r)) ≤ n^r v^{k/2} exp(O(r)). Next we choose the remaining hypergraphs α∖β and β∖α respectively. For α∖β, we have
N_{α∖β} = \binom{n}{(v−1)/2 − r_0} \binom{n}{2v − r_{2ℓ−1}} ∏_{t=1}^{2ℓ−2} \binom{n}{|V_t| − r_t} ∏_{t=0}^{2ℓ−2} (2(v − k_t))! ≤ |S_{ℓ,v,i}| n^{−r} v^{r−k} exp(O(r)),
where |V_t| ∈ {v, 2v} is the number of vertices of α in level t. Suppose there are s_1 degree-1 vertices in α ∩ β and s_2 vertices shared between α, β but not contained in α ∩ β; denoting s = s_1 + s_2, there are ℓ^s ways of placing α ∩ β and the shared vertices in the hypergraph β, and the count of the remaining hypergraph of β is also bounded by |S_{ℓ,v,i}| n^{−r} v^{r−k} exp(O(r)). Multiplying these together we get the claim.
Lemma B.18 (Recovery for general spiked model). In the spiked tensor model Y = λ x^{⊗3} + W with the same setting as Theorem 1.3, taking γ = 0.1 n^{3/4} v^{1/4} λ = 1 + Ω(1) and ℓ = O(log_γ n) in the estimator above, if n = ω(v poly(Γℓ)) then we have E⟨P(Y), x⟩ / (E‖P(Y)‖² E‖x‖²)^{1/2} = Ω(1).
Proof. We need to show that the estimator P(Y) ∈ R^n above achieves constant correlation with the hidden vector x.
Equivalently, we want to show that for each i ∈ [n],
(∑_{α∈S_{ℓ,v,i}} E[χ_α(Y) x_i])² = Ω(∑_{α,β∈S_{ℓ,v,i}} E[χ_α(Y) χ_β(Y)]).
For the left-hand side, we can simply apply B.14. For the right-hand side, we have
E[χ_α(Y) χ_β(Y)] ≤ λ^{−2k} Γ^{O(2r−3k)} E[χ_α(Y) x_i] E[χ_β(Y) x_i].
By Lemma B.17, the contribution to E[P_i(Y)²] / (E[P_i(Y) x_i])² from specific r_t, k_t is bounded by
(n^{−r} v^{r−k})² n^r v^{k/2} ℓ^s Γ^{O(2r−3k)} λ^{−2k} exp(O(r)) ≤ (n / (c v ℓ² Γ^{O(1)}))^{−r+3k/2} (c n^{−3/2} v^{−1/2} λ^{−2})^k,
where c is a constant. For n v^{−1} = ω(poly(Γℓ)), using an argument very similar to Lemma B.10, the dominating terms are given by 2r ≤ 3k. For 2r = 3k − 1, by Lemma B.15 we must have k ≥ 2ℓ − 1; therefore, for c n^{−3/2} v^{−1/2} λ^{−2} < 1 and ℓ = C log n with constant C large enough, the contribution is n^{−Ω(1)}. For 2r = 3k, by Lemma B.15 either k ≥ 2ℓ − 1 or α ∩ β is a hyperpath starting from vertex i. The first case can be treated in the same way as 2r = 3k − 1. For the second case, the contribution is bounded by ∑_{k=0}^{2ℓ−1} (c n^{−3/2} v^{−1/2} λ^{−2})^k ≤ 1 / (1 − c n^{−3/2} v^{−1/2} λ^{−2}).
Therefore, in all, we have (E P_i(Y) x_i)² / E P_i(Y)² ≥ 1 − c n^{−3/2} v^{−1/2} λ^{−2}. This is Ω(1) when we have c n^{−3/2} v^{−1/2} λ^{−2} = 1 − Ω(1) and n = ω(v poly(Γℓ)). Therefore, the polynomial estimator P(Y) ∈ R^n achieves weak recovery under the given conditions.
B.3 Color coding method for polynomial evaluation in the order-3 spiked tensor model
B.3.1 Strong detection polynomial
For constant v, although the thresholding polynomial and the polynomial estimator have degree O(log n), these polynomials can actually be evaluated in polynomial time via the color coding method, generalizing a result of [HS17]. In the same way, the color coding method also improves the running time of the sub-exponential time algorithms. We first describe the evaluation algorithm for the scalar polynomial, as shown in Algorithm 4. Algorithm 4:
Algorithm for evaluating the thresholding polynomial
Data:
Given Y ∈ R^{n×n×n} s.t. Y = λ x^{⊗3} + W Result: P(Y) ∈ R, which is the sum of the multilinear monomials corresponding to hypergraphs in S_{ℓ,v} (up to accuracy n^{−Ω(1)})
for i ← 1 to C do
Sample a coloring c_i : [n] → [3ℓv] uniformly at random;
Construct a matrix M ∈ R^{2^{3ℓv−1} n^v × 2^{3ℓv−1} n^{2v}}; the rows and columns of M are indexed by (V_1, S) and (V_2, T), where V_1 ∈ [n]^v and V_2 ∈ [n]^{2v} are sets of vertices while S, T ⊊ [3ℓv] are subsets of colors;
Construct matrices Q, N ∈ R^{2^{3ℓv−1} n^{2v} × 2^{3ℓv−1} n^v}, the rows and columns of which are indexed by (V_1, S) and (V_2, T), where V_1 ∈ [n]^{2v} and V_2 ∈ [n]^v are sets of vertices while S, T ⊆ [3ℓv] are non-empty subsets of colors;
Record p_{c_i} = (3ℓv)^{3ℓv} / (3ℓv)! · trace((MN)^{ℓ−1} MQ);
Return (1/C) ∑_{i=1}^C p_{c_i}
Next we describe the construction of the matrices
M, N, Q. We have M_{(V_1,S),(V_2,T)} = 0 if S ∪ {c(u) : u ∈ V_2} ≠ T or if {c(u) : u ∈ V_2} and S are not disjoint. Otherwise M_{(V_1,S),(V_2,T)} is given by ∑_{γ∈S_{V_1,V_2}} χ_γ(Y), where S_{V_1,V_2} is the set of perfect matchings induced by V_1 and V_2 (each hyperedge in S_{V_1,V_2} is directed from 1 vertex of V_1 to 2 vertices of V_2).
In the same way, we have N_{(V_1,S),(V_2,T)} = 0 if S ∪ {c(u) : u ∈ V_2} ≠ T or if {c(u) : u ∈ V_2} and S are not disjoint. Otherwise N_{(V_1,S),(V_2,T)} is given by ∑_{γ∈S_{V_1,V_2}} χ_γ(Y), where S_{V_1,V_2} is the set of perfect matchings induced by V_1 and V_2 (each hyperedge in S_{V_1,V_2} is directed from 2 vertices of V_1 to 1 vertex of V_2).
For the matrix Q, the indexing and the locations of the non-zero entries are the same as for N. However, the non-zero entries are given by ∑_{γ∈S_{V_1,V_2}} χ_γ(Y), where S_{V_1,V_2} is the set of perfect matchings induced by V_1 and V_2 (each hyperedge in S_{V_1,V_2} is directed from 1 vertex of V_2 to 2 vertices of V_1).
Lemma B.19 (Evaluation of thresholding polynomial). There exists an n^{O(v)} exp(O(ℓv))-time algorithm that, given a coloring c : [n] → [3ℓv] (where 3ℓv is the number of vertices in a hypergraph α ∈ S_{ℓ,v}) and a tensor Y ∈ R^{n×n×n}, evaluates in polynomial time the degree-2ℓv polynomial
p_c(Y) = ∑_{α∈S_{ℓ,v}} χ_α(Y) F_{c,α} (5)
F_{c,α} = (3ℓv)^{3ℓv} / (3ℓv)! · 1{c(α) = [3ℓv]} (6)
When the thresholding polynomial P(Y) defined in B.4 satisfies (E P(Y))² = (1 − o(1)) E P(Y)², we can take exp(O(ℓv)) random colorings and give an accurate estimate of the thresholding polynomial by averaging p_c(Y).
Proof.
A critical observation is that for each given random coloring c, the algorithm above evaluates p_c(Y). We prove that averaging p_c(Y) over random colorings gives an accurate estimate of P(Y). This follows from the same reasoning as in the matrix case. First we note that E_c p_c(Y) = P(Y). Next, for a single coloring we have
E p_c(Y)² = ∑_{α,β∈S_{ℓ,v}} E[F_{c,α} F_{c,β} χ_α(Y) χ_β(Y)] ≤ exp(O(ℓv)) E P(Y)² ≤ exp(O(ℓv)) (E p_c(Y))²,
where we use the result that E P(Y)² = (1 + o(1))(E P(Y))². Therefore, by averaging L = exp(O(ℓv)) random colorings, the variance can be reduced such that ∑_{t=1}^{L} p_{c_t}(Y) / L = (1 ± o(1)) P(Y) w.h.p. B.3.2 Evaluation of the estimator for weak recovery
Next we discuss the evaluation of the polynomial estimator for weak recovery. Besides the matrices
M, N as defined above, we need to construct two additional matrices A ∈ R^{n^{(v+1)/2} × 2^{ℓ_v} n^{2v}} and B ∈ R^{2^{ℓ_v} n^{2v} × n^{2v}}, where ℓ_v = 3ℓv − (v−1)/2 is the number of vertices in each hypergraph contained in the set S_{ℓ,v,i}. Then we describe how to construct the matrices A, B. For i ∈ [n], a set of Algorithm 5:
Algorithm for evaluating estimation polynomial vector
Data:
Given Y ∈ R^{n×n×n} s.t. Y = λ x^{⊗3} + W Result: P(Y) ∈ R^n, the i-th entry of which is the sum of the multilinear monomials corresponding to hypergraphs in S_{ℓ,v,i} (up to accuracy n^{−Ω(1)})
for i ← 1 to C do
Sample a coloring c_i : [n] → [ℓ_v] uniformly at random;
Construct matrices M, N as in the algorithm for strong detection;
Construct a matrix A ∈ R^{n^{(v+1)/2} × 2^{ℓ_v} n^{2v}}, where the rows are indexed by (i, V) (i ∈ [n] and V ∈ [n]^{(v−1)/2}), and the columns are indexed by (V_1, S) (V_1 ∈ [n]^{2v} and S ⊆ [ℓ_v]);
Construct a matrix B ∈ R^{2^{ℓ_v} n^{2v} × n^{2v}}, where the rows are indexed by (V_1, S), and the columns are indexed by V_2;
Record p_{c_i} = A (NM)^{ℓ−1} N B 1;
Return (1/C) ∑_{i=1}^C p_{c_i}
vertices V in level 0, a set of vertices V_1 in level 1, and a set of colors T: denoting by S_{i,V,V_1} the set of all possible connections between levels 0 and 1, the entry A_{(i,V),(V_1,T)} is given by ∑_{α∈S_{i,V,V_1}} χ_α(Y) if T = {c(u) : u ∈ V ∪ V_1 ∪ {i}}, and 0 otherwise. In the same way, denoting by L_{V_1,V_2} the set of all possible connections between levels 2ℓ−2 and 2ℓ−1, the entry B_{(V_1,S),V_2} is given by ∑_{α∈L_{V_1,V_2}} χ_α(Y) if S ∪ {c(u) : u ∈ V_1 ∪ V_2} = [ℓ_v] and S ∩ {c(u) : u ∈ V_1 ∪ V_2} = ∅, and zero otherwise.
Lemma B.20 (Evaluation of polynomial estimator). Denote the number of vertices in any hypergraph contained in S_{ℓ,v,i} by ℓ_v, so that ℓ_v = 3ℓv − (v−1)/2. Then there exists an n^{O(v)} exp(O(ℓ_v))-time algorithm that, given a coloring c : [n] → [ℓ_v] and a tensor Y ∈ R^{n×n×n}, evaluates the vector p_c(Y) ∈ R^n, with each entry p_{c,i}(Y) a polynomial in the entries of Y:
p_{c,i}(Y) = ∑_{α∈S_{ℓ,v,i}} χ_α(Y) F_{c,α}, F_{c,α} = ℓ_v^{ℓ_v} / ℓ_v! · 1{c(α) = [ℓ_v]}.
For the polynomial estimator defined in 3.1, if 0.1 λ n^{3/4} v^{1/4} > 1, we can take exp(O(ℓv)) random colorings and give an accurate estimate of the estimation polynomial P_i(Y) by averaging p_{c,i}(Y). When λ n^{3/4} = ω(1), we have an n^{1+o(1)}-time algorithm for the evaluation.
Proof.
The critical observation is that p_{c,i}(Y) can be obtained from the vector ξ = A (NM)^{ℓ−1} N B 1 (where 1 ∈ R^{n^{2v}} is the all-ones vector), by summing up all rows of ξ indexed by (i, ·).
By the same argument as in the strong detection algorithm, we can obtain an accurate estimate of P(Y) by averaging exp(O(ℓ_v)) random colorings when weak recovery is achieved. Therefore the estimator can be evaluated in time n^{O(v)} when 0.1 λ n^{3/4} v^{1/4} > 1.
Moreover, when λ n^{3/4} = ω(1), it is enough to take v = 1 and ℓ = o(log n). Thus ξ = A (NM)^{ℓ−1} N B 1 can be evaluated in n exp(O(ℓ)) = n^{1+o(1)} time by recursively executing matrix-vector multiplications. Therefore, the polynomial estimator achieving strong recovery can be evaluated in time n^{1+o(1)}. Proof of Theorem 1.3.
When c n^{−3/2} v^{−1/2} λ^{−2} = 1 − Ω(1) and n = ω(v poly(Γℓ)), by Lemma B.18 the normalized estimator P(Y)/‖P(Y)‖ achieves Ω(1) correlation with x/‖x‖. Further, by the color-coding method, according to Lemma B.20, P(Y) can be evaluated in time n^{O(v)}. This proves the claim of Theorem 1.3. B.4 Equivalence between strong and weak recovery
Under some mild conditions, combining a concentration argument and an 'all or nothing' amplification, we can actually obtain a strong recovery algorithm when n^{−3/2} v^{−1/2} λ^{−2} = Θ(1). For this we need an assumption on the tensor injective norm. The injective norm of an order-p tensor W ∈ (R^n)^{⊗p} is defined as
‖W‖_inj = max_{‖u^{(1)}‖ = ··· = ‖u^{(p)}‖ = 1} ∑_{i_1,...,i_p} W_{i_1,...,i_p} u^{(1)}_{i_1} ··· u^{(p)}_{i_p}.
Theorem B.21 (Strong recovery). In the general spiked tensor model Y = λ x^{⊗3} + W, we take the estimation vector y ∈ R^n by setting γ = 0.1 n^{3/4} v^{1/4} λ = 1 + Ω(1) and ℓ = O(log_γ n) in estimator 3.1. If the injective norm of W is o(n^{3/4}) w.h.p., then for constant v, if n = ω(v poly(ℓΓ)), the vector x̂ ∈ R^n with x̂_i = ∑_{j_1,j_2} Y_{i,j_1,j_2} y_{j_1} y_{j_2} achieves strong recovery, i.e. we have ⟨x̂/‖x̂‖, x/‖x‖⟩ = 1 − o(1) with high probability.
We use the assumption that the injective norm of W is o(n^{3/4}) w.h.p. Now we interpret this assumption. For a Gaussian tensor, the injective norm of W is O(√n) = o(n^{3/4}) with high probability. For a general tensor, this assumption is weaker than having bounded higher-order moments. Lemma B.22.
For a tensor W ∈ (R^n)^{⊗3}, the injective norm is O(B√n) = o(n^{3/4}) with high probability if the absolute values of the entries of W are all bounded by B = o(n^{1/4}) with high probability. Remark: If the entries W_{ijk} have bounded q-th moments for a large enough constant q, then by Markov's inequality and a union bound, the entries of W are all bounded by B = o(n^{1/4}) with high probability.
Proof.
For fixed unit vectors x, y, z and T = ∑_{ijk} W_{ijk} x_i y_j z_k, by the Hoeffding bound we have Pr[T ≥ tB′] ≤ exp(−ct²), where c is a constant and B′ is the maximum absolute value of the entries of W. We denote the event B′ ≤ B by A. By assumption, A happens with high probability. An ε-net S_{ε,n} of the unit sphere has size at most (O(1)/ε)^n (see e.g. Lemma 2.3.4 in Tao's random matrix book). Thus the size of the set B = {(x, y, z) | x, y, z ∈ S_{ε,n}} is bounded by (O(1)/ε)^{3n}. Taking a union bound over this set, we have
Pr[max_{(x,y,z)∈B} ⟨W, x ⊗ y ⊗ z⟩ ≥ Bt] ≤ exp(c₂ n) exp(−ct²) + o(1),
where c₂ is a constant. Taking t = C√n with constant C large enough, it follows that max_{(x,y,z)∈B} ⟨W, x ⊗ y ⊗ z⟩ = O(B√n) = o(n^{3/4}) with high probability.
Finally, we have ‖W‖_inj = max_{‖x‖=‖y‖=‖z‖=1} ⟨W, x ⊗ y ⊗ z⟩ = ⟨W, x* ⊗ y* ⊗ z*⟩, where x*, y*, z* is the maximizer. By the definition of an ε-net, we can find (x̃, ỹ, z̃) ∈ B such that ‖x̃ − x*‖, ‖ỹ − y*‖, ‖z̃ − z*‖ ≤ ε. Thus ⟨W, x* ⊗ y* ⊗ z* − x̃ ⊗ ỹ ⊗ z̃⟩ ≤ 3ε ⟨W, x* ⊗ y* ⊗ z*⟩. For small ε, we have ‖W‖_inj ≤ 2 max_{(x,y,z)∈B} ⟨W, x ⊗ y ⊗ z⟩ = o(n^{3/4}) with high probability. This proves the claim.
The proof of the theorem naturally follows from the weak recovery result above and the following two lemmas.
Lemma B.23 (All or nothing phenomenon). In the general spiked tensor model Y = λ x^{⊗3} + W, if the injective norm of the tensor W is o(n^{3/4}), and we have a unit-norm estimator y ∈ R^n satisfying ⟨y, x/‖x‖⟩ = Ω(1) w.h.p., then, letting x̂ ∈ R^n with x̂_i = ∑_{j_1,j_2} Y_{i,j_1,j_2} y_{j_1} y_{j_2}, we have w.h.p. ⟨x̂/‖x̂‖, x/‖x‖⟩ = 1 − o(1).
This lemma follows by the same proof as Appendix D in [WAM19].
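The amplification step in Lemma B.23 is a single contraction of Y against y ⊗ y. A minimal numerical sketch (the noise below is scaled down so the toy problem sizes sit in the 'signal wins' regime; all parameter values are illustrative, not those of the theorem):

```python
import numpy as np

def amplify(Y, y):
    """One 'all or nothing' step: x_hat_i = sum_{j,k} Y[i,j,k] * y[j] * y[k]."""
    x_hat = np.einsum('ijk,j,k->i', Y, y, y)
    return x_hat / np.linalg.norm(x_hat)

# Toy demo: planted unit-norm spike x, mild noise, crude initial guess y.
rng = np.random.default_rng(2)
n, lam = 80, 1.0
x = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)
W = rng.normal(size=(n, n, n)) / n            # scaled-down noise for the demo
Y = lam * np.einsum('i,j,k->ijk', x, x, x) + W
y = x + 0.8 * rng.normal(size=n) / np.sqrt(n)
y /= np.linalg.norm(y)
print(abs(np.dot(amplify(Y, y), x)))          # correlation close to 1
```

The point of the lemma is exactly this jump: any initial estimator with Ω(1) correlation is boosted to 1 − o(1) correlation by one contraction, provided the noise's injective norm is small.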
Lemma B.24 (Concentration property). For the above estimator P(Y) ∈ R^n in Definition 3.1, when we take ℓ, v as described in Lemma B.18, we have E⟨P(Y), x⟩² = (1 + o(1)) (E⟨P(Y), x⟩)².
We now prove Lemma B.24. The proof is very similar to the proof of strong detection in Appendix B.1.
Proof of Lemma B.24.
We denote S̄_{ℓ,v} = ∪_{i∈[n]} S_{ℓ,v,i}. For α ∈ S̄_{ℓ,v}, we denote x_α = x_i if α ∈ S_{ℓ,v,i}. Then, equivalently, we want to show
∑_{α,β∈S̄_{ℓ,v}} E[χ_α(Y) χ_β(Y) x_α x_β] ≤ (1 + o(1)) ∑_{α,β∈S̄_{ℓ,v}} E[χ_α(Y) x_α] E[χ_β(Y) x_β].
Since (∑_{α∈S̄_{ℓ,v}} E[χ_α(Y) x_α])² = λ^{2(2ℓ−1)v} |S̄_{ℓ,v}|², we only need to bound the size of S̄_{ℓ,v}. Applying combinatorial arguments to the generating process of S̄_{ℓ,v}, we have
|S̄_{ℓ,v}| = (1 − o(1)) (\binom{n}{v} \binom{n}{2v})^{ℓ−1} \binom{n}{(v+1)/2} \binom{n}{2v} ((2v)!)^{2ℓ−1} ((3v−1)/2)!.
Now we bound the left-hand side. First, in the case that α, β are disjoint (r = 0), we have E[χ_α(Y) χ_β(Y)] = E[χ_α(Y)] E[χ_β(Y)], so such pairs contribute at most (1 + o(1)) ∑_{α,β} E[χ_α(Y) x_α] E[χ_β(Y) x_β]. For each pair α, β ∈ S̄_{ℓ,v} sharing k hyperedges and r vertices, we have
E[χ_α(Y) χ_β(Y)] = (1 + n^{−Ω(1)}) λ^{2(2ℓ−1)v−2k} E ∏_{j∈α∆β} x_j^{deg(j, α∆β)} ≤ λ^{−2k} Γ^{O(2r−3k)} E[χ_α(Y)] E[χ_β(Y)],
where deg(j, α∆β) represents the degree of vertex j in the hypergraph α∆β. Next we bound the number of hypergraph pairs (α, β) sharing specified vertices and hyperedges. For this, we first choose α ∩ β and the shared vertices as a subgraph of a hypergraph α contained in S̄_{ℓ,v}. We consider the following case: • in level t ∈ [2ℓ − 1] of α there are r_t shared vertices, • between the levels t and t + 1 of α there are k_t shared hyperedges. Then the number of choices for these shared hyperedges and vertices is bounded by
N_{α∩β} ≤ \binom{n}{r_0} \binom{n}{r_{2ℓ−1}} \binom{r_0}{k_0} \binom{r_{2ℓ−1}}{k_{2ℓ−2}} ∏_{t=1}^{2ℓ−2} \binom{n}{r_t} \binom{r_t}{k_t} \binom{r_t}{k_{t−1}} ∏_{t=0}^{2ℓ−2} (2k_t)!
By Stirling's approximation, this is upper bounded by ∏_{t=0}^{2ℓ−2} (2k_t)! ∏_{t=0}^{2ℓ−1} \binom{n}{r_t} exp(O(r)) ≤ n^r v^{k/2} exp(O(r)). Next we choose the remaining hypergraphs α∖β and β∖α respectively. For α∖β, we have
N_{α∖β} = \binom{n}{(v+1)/2 − r_0} \binom{n}{2v − r_{2ℓ−1}} ∏_{t=1}^{2ℓ−2} \binom{n}{|V_t| − r_t} ∏_{t=0}^{2ℓ−2} (2(v − k_t))! ≤ |S̄_{ℓ,v}| n^{−r} v^{r−k} exp(O(r)),
where |V_t| ∈ {v, 2v} is the number of vertices of α in level t. For bounding the number of choices for β, suppose α, β share s vertices with degree 0 or 1 in the subgraph α ∩ β. Then there are ℓ^s ways of embedding α ∩ β and the shared vertices in the hypergraph β. Further, the number of choices for the remaining hypergraph is also bounded by |S̄_{ℓ,v}| n^{−r} v^{r−k} exp(O(r)). Therefore, the contribution to E[⟨P(Y), x⟩²] / (E[⟨P(Y), x⟩])² with respect to specific r_t, k_t is bounded by
(n^{−r} v^{r−k})² n^r v^{k/2} ℓ^s Γ^{O(2r−3k)} λ^{−2k} exp(O(r)) ≤ (n / (c v ℓ² Γ^{O(1)}))^{−r+3k/2} (c n^{−3/2} v^{−1/2} λ^{−2})^k,
where c is a constant. As in the proof of B.18, when n v^{−1} = ω(poly(Γℓ)) the dominating terms are given by 2r = 3k. In this case, we must have k ≥ 2ℓ − 1. For c n^{−3/2} v^{−1/2} λ^{−2} < 1 and ℓ = C log n with constant C large enough, the contribution is n^{−Ω(1)}.
Therefore, in all, we have (E⟨P(Y), x⟩)² / E⟨P(Y), x⟩² = 1 − o(1) when we have c n^{−3/2} v^{−1/2} λ^{−2} = 1 − Ω(1) and n = ω(v poly(Γℓ)). Proof of Theorem B.21.
As a result of Lemma B.24, by Chebyshev's inequality we have ⟨P(Y), x⟩ = (1 ± o(1)) E⟨P(Y), x⟩ w.h.p. Combined with the correlation in expectation and Markov's inequality, we have ⟨P(Y), x⟩ / (‖x‖ ‖P(Y)‖) ≥ ε · Ω(1) with probability 1 − ε. We set ε = o(1) and take the estimator z ∈ R^n with z_k = ∑_{i,j} Y_{ijk} P_i(Y) P_j(Y). Then, according to Lemma B.23, we get the strong recovery guarantee. C Higher order general spiked tensor model
For clarity, we discussed above the algorithms for the order-3 spiked tensor model and the spiked Wigner model. These claims can be generalized to higher-order tensors without difficulty.
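The color-coding evaluations (Algorithms 4–6) instantiate a classical idea: restrict the sum to "colorful" copies under a random coloring, count those by dynamic programming over (endpoint, used-color-set) states, and rescale by the probability that a fixed copy is colorful. A toy analogue in the ordinary graph setting, assuming self-avoiding walks in place of the paper's structured hypergraphs:

```python
import math
import random

def colorful_walks(adj, coloring, L):
    """Count walks with L edges whose vertices all receive distinct colors.
    dp maps (endpoint, frozenset of used colors) -> number of such walks."""
    n = len(adj)
    dp = {(v, frozenset([coloring[v]])): 1 for v in range(n)}
    for _ in range(L):
        nxt = {}
        for (v, used), cnt in dp.items():
            for w in range(n):
                if adj[v][w] and coloring[w] not in used:
                    key = (w, used | {coloring[w]})
                    nxt[key] = nxt.get(key, 0) + cnt
        dp = nxt
    return sum(dp.values())

def estimate_saw(adj, L, trials=2000, seed=0):
    """Unbiased estimate of the number of (ordered) self-avoiding walks with
    L edges: a fixed walk on m = L+1 vertices is colorful w.p. m!/m^m."""
    n, m = len(adj), L + 1
    rng = random.Random(seed)
    scale = m ** m / math.factorial(m)
    total = sum(
        colorful_walks(adj, [rng.randrange(m) for _ in range(n)], L)
        for _ in range(trials)
    )
    return scale * total / trials

# 4-cycle: the ordered self-avoiding walks with 2 edges number exactly 8.
adj = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]
print(estimate_saw(adj, L=2))
```

The transfer-matrix algorithms in the appendix play the role of this dynamic program, with matchings between consecutive levels in place of single edges.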
Theorem C.1.
Let x ∈ R^n be a random vector with independent, mean-zero entries having E x_i² = 1 and Γ = E x_i⁴ ≤ n^{o(1)}. Let λ > 0 and Y = λ · x^{⊗p} + W, where W ∈ (R^n)^{⊗p} has independent, mean-zero entries with E W²_{i_1,i_2,...,i_p} = 1. Then for v = o(n / poly(Γ log n)) and λ ≥ c_p n^{−p/4} v^{−(p−2)/4}, there is an n^{O(pv)}-time algorithm achieving strong detection. Theorem C.2.
Let x ∈ R^n be a random vector with independent, mean-zero entries having E x_i² = 1 and Γ = E x_i⁴ ≤ n^{o(1)}. Let λ > 0 and Y = λ · x^{⊗p} + W, where W ∈ (R^n)^{⊗p} has independent, mean-zero entries with E W²_{i_1,i_2,...,i_p} = 1. Then for v = o(n / poly(Γ log n)) and λ ≥ c_p n^{−p/4} v^{−(p−2)/4}, there is an n^{O(pv)}-time algorithm giving a unit-norm estimator x̂ s.t. ⟨x̂, x/‖x‖⟩ ≥ Ω(1).
Specifically, this leads to a polynomial-time algorithm when λ = Ω(n^{−p/4}). When the order p is odd, the analysis is very similar to the one for the case p = 3. Therefore we mainly discuss the case where the order p is even and prove the guarantees of the theorems. C.1 Strong detection algorithm for even p
For strong detection we propose the following thresholding polynomial
Definition C.3 (thresholding polynomial for even p). On the directed complete p-uniform hypergraph with n vertices, we define S_{ℓ,v} as the set of all copies of 2-regular hypergraphs generated in the following way: • We construct ℓ levels of vertices, with each level containing pv/2 vertices. • For t ∈ [ℓ − 1], we connect a perfect matching with v hyperedges between level t and level t + 1. Each hyperedge is directed from p/2 vertices in level t to p/2 vertices in level t + 1. • Finally, we similarly connect a perfect matching with v hyperedges between level ℓ − 1 and level 0. Each hyperedge is directed from p/2 vertices in level 0 to p/2 vertices in level ℓ − 1.
The thresholding polynomial is given by P(Y) = ∑_{α∈S_{ℓ,v}} χ_α(Y), where χ_α(Y) is the Fourier basis associated with the hypergraph α: χ_α(Y) = ∏_{(i_1,i_2,...,i_p)∈α} Y_{i_1,i_2,...,i_p}. Lemma C.4.
Suppose we have γ = c_p v^{(p−2)/4} λ n^{p/4} = 1 + Ω(1), with c_p a small enough constant depending on p. Then, taking ℓ = O(log_γ n), the projection of the likelihood ratio \hat{μ}(Y) onto S_{ℓ,v} is ω(1) when n/v = ω(poly(ℓ)).
Proof. Given fixed vertices in level t and level t + 1, we have ((pv/2)!)² / v! choices for the hyperedges between level t and level t + 1. Therefore we have
|S_{ℓ,v}| = (1 − o(1)) ((pv/2)! / v!)^ℓ n^{ℓpv/2}.
By Stirling's approximation, this implies |S_{ℓ,v}| ≥ v^{(p−2)ℓv/2} n^{ℓpv/2} exp(−O(ℓv)). Therefore we have
∑_{α∈S_{ℓ,v}} \hat{μ}_α² ≥ λ^{2ℓv} n^{ℓpv/2} v^{(p−2)ℓv/2} exp(−O(ℓv)) = ω(1)
when we have ℓ, v as described. Lemma C.5.
Suppose we have γ = c_p v^{(p−2)/4} λ n^{p/4} = 1 + Ω(1) and ℓ = O(log_γ n), with c_p a small enough constant depending on p. If n/v = ω(poly(Γℓ)), then we have the following concentration property: E P(Y)² = (1 + o(1))(E P(Y))².
Proof.
We consider α, β ∈ S_{ℓ,v}. We first choose α ∩ β and the shared vertices as a subgraph of α, and then select the remaining hypergraphs of α, β. As before, we have E[χ_α(Y) χ_β(Y)] ≤ λ^{2ℓv−2k} Γ^{2r−pk}, where r is the number of shared vertices, k is the number of shared hyperedges, and Γ = E[x_i⁴] = n^{o(1)}. Considering the shared vertices and hyperedges in α, if there are r_t shared vertices in level t and k_t shared hyperedges between level t and level t + 1, then there are
N_{α∩β} ≤ ∏_{t=0}^{ℓ−1} \binom{n}{r_t} \binom{r_t}{p k_t / 2} \binom{r_t}{p k_{t−1} / 2} ((p k_t / 2)!)² / k_t! ≤ n^r v^{(p−2)k/2} exp(O(r))
such subgraphs. On the other hand, the number of choices for the remaining hypergraph of α is bounded by
N_{α∖β} = ∏_{t=0}^{ℓ−1} \binom{n}{pv/2 − r_t} ((p(v − k_t)/2)!)² / (v − k_t)! = |S_{ℓ,v}| n^{−r} v^{r−pk+k} exp(O(r)).
Then we consider the choices for β. Denote the number of degree-1 vertices in α ∩ β by s_1 and the number of shared vertices not contained in α ∩ β by s_2. Let s = s_1 + s_2; then there are at most ℓ^s ways of embedding α ∩ β in β. By the same reasoning, the number of ways of choosing the remaining hypergraph of β is also bounded by |S_{ℓ,v}| n^{−r} v^{r−pk+k} exp(O(r)). Therefore, for fixed numbers of shared vertices and hyperedges r_t, k_t in each level of α, the total number of such hypergraph pairs in S_{r,k,ℓ,v} is bounded by
|S_{ℓ,v}|² n^{−r} v^{r−pk+2k} ℓ^s v^{(p−2)k/2} exp(O(r)).
Therefore, the corresponding contribution is given by
∑_{α,β∈S_{r,k,ℓ,v}} E[χ_α(Y) χ_β(Y)] / (E P(Y))² ≤ n^{−r} v^{r−pk+2k} ℓ^s v^{(p−2)k/2} λ^{−2k} Γ^{2r−pk} exp(O(r)).
Since 2r − s ≥ pk by the degree constraints, this is bounded by
c^{pk/2} n^{−pk/2} v^{−(p−2)k/2} λ^{−2k} ℓ^s (n / (c v² Γ²))^{−r+pk/2},
where c is a constant.
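The summation that follows is controlled by a double geometric series, as in Lemma B.10: once 2r ≥ pk + 1, terms decay like η^{−(r−pk/2)} ψ^{−k} for η = n/(cv²Γ²) and ψ > 1. A numeric sanity check of that convergence behaviour (a toy with an illustrative (2ℓ)^{r−pk/2} multiplicity for the choices of r_t, not part of the proof):

```python
import math

def series_upper_bound(eta, psi, ell, p=4, kmax=150, rmax=300):
    """Crude upper bound on the sum over 2r >= p*k + 1 of
    (2*ell/eta)^(r - p*k/2) * psi^(-k): a toy check that the
    double geometric series vanishes once eta >> ell."""
    total = 0.0
    for k in range(kmax + 1):
        r_min = math.ceil((p * k + 1) / 2)
        for r in range(r_min, rmax + 1):
            excess = r - p * k / 2      # >= 1/2 by the constraint
            total += (2 * ell / eta) ** excess * psi ** (-k)
    return total

# As eta grows relative to ell (i.e. n >> v^2 * poly(Gamma, ell)), the sum -> 0.
for eta in (1e2, 1e3, 1e4):
    print(eta, series_upper_bound(eta, psi=2.0, ell=5))
```

The parameters eta, psi, ell, p here are placeholders; the proof's actual base is n/(cv² poly(Γℓ)) with ψ^{−1} = c^{p/2} n^{−p/2} λ^{−2} v^{−(p−2)/2}.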
Summing over α, β with different values of r_t, k_t, and using the fact that if r = pk/2 then k ≥ ℓ, we split the sum into the configurations with 2r ≥ pk + 1 and those with 2r = pk:

Σ_{r_t, k_t} c^{pk/2} n^{−pk/2} v^{−(p−2)k/2} λ^{−2k} ℓ^{2s} ( n/(cvΓ²) )^{−(r − pk/2)}
= Σ_{k_t ≥ 1} Σ_{2r ≥ pk+1} c^{pk/2} n^{−pk/2} v^{−(p−2)k/2} λ^{−2k} ℓ^{2s} ( n/(cvΓ²) )^{−(r − pk/2)} + Σ_{k_t ≥ 1} ( c^{p/2} n^{−p/2} λ^{−2} v^{−(p−2)/2} )^{Σ_t k_t}.

The summand of the second term is γ^{−Σ_t k_t} up to the choice of c_p, and 2r = pk forces Σ_t k_t ≥ ℓ; hence, when γ = 1 + Ω(1) and ℓ = C log_γ n with a large enough constant C, the second term is bounded by n^{−Ω(1)}. For the first term we note that, given k_t for t ∈ [ℓ], there are at most ℓ^{2(r − pk/2)} choices for the r_t. Denote k_Δ = Σ_t |k_{t+1} − k_t|. Then k_Δ = O(r − pk/2), and given k_Δ there are at most k_Δ distinct values among the k_t; fixing these values, there are at most ℓ^{k_Δ} choices for the sequence (k_t)_{t ∈ [ℓ]}. Therefore the first term is bounded by

Σ_{k_Δ ≥ 1} ( n / (cv · poly(Γ, ℓ)) )^{−max(1/2, k_Δ)} Π_{t=1}^{k_Δ} Σ_{k_t ≥ 1} ( c^{p/2} n^{−p/2} λ^{−2} v^{−(p−2)/2} )^{k_t} = o(1).

In all, we have Σ_{α,β ∈ S_{ℓ,v}} E[χ_α(Y) χ_β(Y)] ≤ (1 + o(1)) (E[P(Y)])².

Next we show that the running time can be improved using the color-coding method; the evaluation procedure is given in Algorithm 6. We describe the matrices
M, N used in the evaluation algorithm below.

Algorithm 6:
Algorithm for evaluating the thresholding polynomial
Data:
Given Y ∈ (R^n)^{⊗p} s.t. Y = λ x^{⊗p} + W. Result: P(Y) ∈ R, the sum of multilinear monomials corresponding to hypergraphs in S_{ℓ,v} (up to accuracy n^{−Ω(1)}).
for i ← 1 to C do
  Sample a coloring c_i : [n] → [pℓv] uniformly at random;
  Construct matrices M, N ∈ R^{(2^{pℓv} − 1) n^{pv/2} × (2^{pℓv} − 1) n^{pv/2}};
  Record p_{c_i} = ( (pℓv)^{pℓv} / (pℓv)! ) · Tr(M^{ℓ−1} N);
Return (1/C) Σ_{i=1}^{C} p_{c_i}

The rows and columns of M are indexed by pairs (V₁, S) and (V₂, T), where V₁ ∈ [n]^{pv/2} and V₂ ∈ [n]^{pv/2} correspond to sets of vertex labels, while S, T ⊊ [pℓv] correspond to subsets of colors. We set M_{(V₁,S),(V₂,T)} = 0 if S ∪ {c(v) : v ∈ V₁} ≠ T, or if {c(v) : v ∈ V₁} and S are not disjoint. Otherwise M_{(V₁,S),(V₂,T)} is given by Σ_{γ ∈ S_{V₁,V₂}} χ_γ(Y), where S_{V₁,V₂} is the set of perfect matchings induced by V₁ and V₂ (each hyperedge in S_{V₁,V₂} is directed from p/2 vertices of V₁ to p/2 vertices of V₂).

For the matrix N, the indexing is the same as for M. The entry N_{(V₁,S),(V₂,T)} = 0 if S ∪ {c(v) : v ∈ V₁} ≠ T, or if {c(v) : v ∈ V₁} and S are not disjoint. Otherwise N_{(V₁,S),(V₂,T)} is given by Σ_{γ ∈ S_{V₂,V₁}} χ_γ(Y), where S_{V₂,V₁} is the set of perfect matchings induced by V₂ and V₁ (each hyperedge in a hypergraph α ∈ S_{V₂,V₁} is directed from p/2 vertices of V₂ to p/2 vertices of V₁).

Lemma C.6 (Evaluation of thresholding polynomial). There exists an n^{O(v)}-time algorithm that, given a coloring c : [n] → [pℓv] (where pℓv is the number of vertices in a hypergraph α ∈ S_{ℓ,v}) and a tensor Y ∈ (R^n)^{⊗p}, evaluates in polynomial time the degree-ℓv polynomial

p_c(Y) = Σ_{α ∈ S_{ℓ,v}} χ_α(Y) F_{c,α} (7)

F_{c,α} = ( (pℓv)^{pℓv} / (pℓv)! ) · 1[c(α) = [pℓv]]. (8)

When the thresholding polynomial P(Y) defined in C.3 satisfies (E[P(Y)])² = (1 − o(1)) E[P(Y)²], we can take exp(O(ℓv)) random colorings and obtain an accurate estimate of the thresholding polynomial by averaging p_c(Y).

Proof. The observation is that p_c(Y) is just given by (pℓv)^{pℓv} / (pℓv)!
times the trace of M^{ℓ−1} N. This can be done in time n^{pv} exp(O(ℓv)).

Next we prove that averaging p_c(Y) over random colorings gives an accurate estimate of P(Y) in the detection algorithm. First, note that E_c[p_c(Y)] = P(Y). Next, for a single coloring we have

E[p_c(Y)²] = Σ_{α,β ∈ S_{ℓ,v}} E[F_{c,α} F_{c,β} χ_α(Y) χ_β(Y)] ≤ exp(O(ℓv)) E[P(Y)²] ≤ exp(O(ℓv)) (E[p_c(Y)])²,

where we use the result that E[P(Y)²] = (1 + o(1)) (E[P(Y)])². Therefore, by averaging over L = exp(O(ℓv)) random colorings, the variance can be reduced so that (1/L) Σ_{t=1}^{L} p_{c_t}(Y) = (1 ± o(1)) P(Y) w.h.p.

Remark: When λ = Ω(n^{−p/4}), we can take c_p λ² n^{p/2} v^{(p−2)/2} > 1 with c_p a small enough constant depending on the order p. This leads to an O(n^{pv})-time algorithm with constant v. When λ = ω(n^{−p/4}), a simpler thresholding polynomial P(Y) suffices.
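The normalization in F_{c,α} can be sanity-checked directly: a uniformly random coloring of the m = pℓv vertices of a fixed hypergraph α with m colors is colorful (all colors distinct) with probability exactly m!/m^m, so multiplying the colorful indicator by m^m/m! makes E_c[F_{c,α}] = 1 and hence E_c[p_c(Y)] = P(Y). A minimal sketch verifying this by exhaustive enumeration for small m (illustrative only; the function name is ours, not the paper's):

```python
from itertools import product
from math import factorial

def colorful_probability(m):
    """Exact probability that a uniform coloring c: [m] -> [m] uses all m colors."""
    total = m ** m
    colorful = sum(1 for c in product(range(m), repeat=m) if len(set(c)) == m)
    return colorful / total

for m in range(1, 6):
    prob = colorful_probability(m)
    # The colorful probability is m!/m^m ...
    assert abs(prob - factorial(m) / m ** m) < 1e-12
    # ... so the F_{c,alpha} factor m^m/m! makes the indicator unbiased:
    assert abs((m ** m / factorial(m)) * prob - 1.0) < 1e-12
```

For m = pℓv this probability is exp(−O(ℓv)) by Stirling's approximation, which is exactly why exp(O(ℓv)) independent colorings suffice to control the variance of the averaged estimate.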
Combining the concentration property proved in Lemma C.5 and the running time bound proved in Lemma C.6, we obtain the claim.
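For intuition, in the graph case (p = 2, v = 1) the hypergraphs in S_{ℓ,v} degenerate to self-avoiding walks, and the color-coding evaluation above reduces to the classical Alon–Yuster–Zwick scheme: color the vertices at random, sum only over walks whose vertices receive distinct colors (computable by dynamic programming over pairs of an endpoint and a set of used colors), and rescale by the inverse colorful probability. A self-contained sketch of this simplest case, with hypothetical function names and a brute-force check (not the paper's implementation):

```python
import itertools, math, random

def count_colorful_walks(A, coloring, length):
    """DP over (endpoint, set of used colors): number of walks with `length` edges
    whose vertices all receive distinct colors under `coloring`."""
    n = len(A)
    k = length + 1  # number of colors = number of vertices on the walk
    # dp maps (vertex, frozenset of used colors) -> number of colorful walks ending there
    dp = {(u, frozenset([coloring[u]])): 1 for u in range(n)}
    for _ in range(length):
        nxt = {}
        for (u, used), cnt in dp.items():
            for w in range(n):
                if A[u][w] and coloring[w] not in used:
                    key = (w, used | {coloring[w]})
                    nxt[key] = nxt.get(key, 0) + cnt
        dp = nxt
    return sum(cnt for (u, used), cnt in dp.items() if len(used) == k)

def estimate_saw_count(A, length, trials=200, seed=0):
    """Average color-coding estimate of the number of self-avoiding walks."""
    rng = random.Random(seed)
    k = length + 1
    correction = k ** k / math.factorial(k)  # inverse colorful probability
    total = 0.0
    for _ in range(trials):
        coloring = [rng.randrange(k) for _ in range(len(A))]
        total += correction * count_colorful_walks(A, coloring, length)
    return total / trials

def count_saw_brute(A, length):
    """Exact count of (directed) self-avoiding walks with `length` edges."""
    n = len(A)
    return sum(
        all(A[p[i]][p[i + 1]] for i in range(length))
        for p in itertools.permutations(range(n), length + 1)
    )
```

Per coloring the DP runs in 2^{O(ℓ)} n² time instead of the n^{ℓ} cost of direct enumeration, mirroring how Algorithm 6 replaces the sum over S_{ℓ,v} by a trace of structured matrices; the estimator is unbiased because colorful walks survive with probability exactly (ℓ+1)!/(ℓ+1)^{ℓ+1}.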
C.2 Weak recovery algorithm for even p
For weak recovery we want to propose an estimator P(Y) ∈ R^{n×n} such that

(E⟨P(Y), xx^⊤⟩)² = Ω( E[‖P(Y)‖_F²] · E[‖xx^⊤‖_F²] ).

Any even p can be decomposed into two odd numbers p₁ and p₂ with p = p₁ + p₂; let p₁, p₂ be such a pair of odd numbers minimizing |p₁ − p₂|. We then define the estimation matrix for weak recovery as follows:

Definition C.7 (Estimator for even-p weak recovery). On the p-uniform directed complete hypergraph on n vertices, we define the following set of subgraphs. Given a tensor Y ∈ (R^n)^{⊗p}, the estimator P(Y) ∈ R^{n×n} has entries P_{ij}(Y), each a degree-(2ℓ−1)v polynomial given by Σ_{α ∈ S_{ℓ,v,i,j}} χ_α(Y), where S_{ℓ,v,i,j} is the set of hypergraphs generated in the following way:
• We construct 2ℓ levels of vertices. Level 0 contains vertex i and (p₂v − 1)/2 additional vertices, and level 2ℓ−1 contains vertex j and (p₁v − 1)/2 additional vertices. For 0 < t < 2ℓ−1, level t contains p₁v vertices if t is odd and p₂v vertices if t is even. All vertices are distinct.
• We construct a perfect matching between levels t and t+1 for t ∈ [1, 2ℓ−3]; each hyperedge is directed from p₂ vertices in the even level to p₁ vertices in the odd level.
• Levels 0 and 1 are connected as a bipartite hypergraph such that each vertex in level 0 excluding i has degree 2, while vertex i and the vertices in level 1 have degree 1. Levels 2ℓ−2 and 2ℓ−1 are connected as a bipartite hypergraph such that each vertex in level 2ℓ−1 excluding j has degree 2, while the vertices in level 2ℓ−2 and vertex j have degree 1.

Lemma C.8. Take γ = c_p n^{p/2} v^{(p−2)/2} λ² = 1 + Ω(1) (where c_p is a small enough constant depending on p) and ℓ = O(log_γ n) in the estimator above. Then, if n = ω(v poly(ℓΓ)), we have

E⟨P(Y), xx^⊤⟩ / ( E[‖P(Y)‖_F²] · E[‖x‖⁴] )^{1/2} = Ω(1).

Proof.
We need to show that the estimator P(Y) ∈ R^{n×n} above achieves constant correlation with the hidden vector x. Equivalently, we want to show that for each i, j ∈ [n] we have

( Σ_{α ∈ S_{ℓ,v,i,j}} E[χ_α(Y) x_i x_j] )² = Ω( Σ_{α,β ∈ S_{ℓ,v,i,j}} E[χ_α(Y) χ_β(Y)] ).

Since ( Σ_{α ∈ S_{ℓ,v,i,j}} E[χ_α(Y) x_i x_j] )² = λ^{2(2ℓ−1)v} |S_{ℓ,v,i,j}|², we only need to bound the size of S_{ℓ,v,i,j}. Applying combinatorial arguments to the generating process of S_{ℓ,v,i,j}, we have

|S_{ℓ,v,i,j}| = (1 − o(1)) ( binom(n, p₁v) binom(n, p₂v) )^{ℓ−1} binom(n, (p₁v − 1)/2) binom(n, (p₂v − 1)/2) · ( (p₁v)! (p₂v)! )^{ℓ−1} (v!)^{−(2ℓ−1)} ((p₁v − 1)/2)! ((p₂v − 1)/2)!.

On the other hand, first choose α ∩ β and the shared vertices (excluding i and j) as a subgraph of the hypergraph α ∈ S_{ℓ,v,i,j}. For the shared vertices and hyperedges contained in α, if there are r_t shared vertices in level t ∈ [0, 2ℓ−1] and k_t shared hyperedges between levels t and t+1, then the number of such intersections is bounded by

N_{α∩β} ≤ binom(n, r₀) binom(n, r_{2ℓ−1}) binom(r₀ + 1, p₂k₀) binom(r_{2ℓ−1} + 1, p₁k_{2ℓ−2}) Π_{t=1}^{2ℓ−2} binom(n, r_t) binom(r_t, p₂k_t) binom(r_t, p₁k_{t−1}) Π_{t=0}^{2ℓ−2} (p₁k_t)! (p₂k_t)! / k_t!.

This is upper bounded by

Π_{t=0}^{2ℓ−2} (p₁k_t)! (p₂k_t)! / k_t! · Π_{t=0}^{2ℓ−1} binom(n, r_t) · exp(O(r)).

Next we choose the remaining hypergraphs α ∖ β and β ∖ α respectively. For α ∖ β, we have

N_{α∖β} = binom(n, (p₂v − 1)/2 − r₀) binom(n, (p₁v − 1)/2 − r_{2ℓ−1}) Π_{t=1}^{2ℓ−2} binom(n, p₁v − r_t) binom(n, p₂v − r_t) Π_{t=0}^{2ℓ−2} (p₁(v − k_t))! (p₂(v − k_t))! / (v − k_t)!
≤ |S_{ℓ,v,i,j}| n^{−r} v^{r − (p−1)k} exp(O(r)).

Suppose there are s₁ degree-2 vertices in α ∩ β and s₂ vertices shared between α and β but not contained in α ∩ β; denoting s = s₁ + s₂, there are ℓ^{2s} ways of placing α ∩ β and the shared vertices inside the hypergraph β, and the count of its remaining hypergraph is also bounded by |S_{ℓ,v,i,j}| n^{−r} v^{r − (p−1)k} exp(O(r)). Moreover, we have

E[χ_α(Y) χ_β(Y)] = (1 + n^{−Ω(1)}) λ^{2(2ℓ−1)v − 2k} E[ Π_{j ∈ α∆β} x_j^{deg(j, α∆β)} ] ≤ λ^{−2k} Γ^{O(2r − pk)} E[χ_α(Y)] E[χ_β(Y)],

where deg(j, α∆β) denotes the degree of vertex j in the hypergraph α∆β. Therefore the contribution to E[P_{ij}(Y)²] / (E[P_{ij}(Y) x_i x_j])² for a fixed choice of r_t, k_t is bounded by

( n^{−r} v^{r − (p−1)k} )² n^{r} v^{(p−2)k/2} ℓ^{2s} Γ^{O(2r − pk)} λ^{−2k} exp(O(r)).

By the degree constraints we have 2(r − s) ≥ pk, and we study the terms according to whether 2(r − s) > pk, 2(r − s) = pk, or α ∩ β contains more hyperedges than this accounting allows. For the case 2(r − s) > pk, the sum of the contributions is o(1) by the same reasoning as in the proof of detection. For the case 2(r − s) = pk, α ∩ β consists of hyperpaths starting from i and from j respectively, and each shared vertex is contained in α ∩ β. In this case the contribution is bounded by

Σ_{ℓ₁,ℓ₂} c^{p(ℓ₁+ℓ₂)/2} n^{−p(ℓ₁+ℓ₂)/2} v^{−(p−2)(ℓ₁+ℓ₂)/2} λ^{−2(ℓ₁+ℓ₂)} ≤ ( Σ_{ℓ₃} c^{ℓ₃} n^{−pℓ₃/2} v^{−(p−2)ℓ₃/2} λ^{−2ℓ₃} )² ≤ ( 1 − v^{−(p−2)/2} n^{−p/2} λ^{−2} )^{−2}.

In the remaining case, α ∩ β contains more than ℓ − 1 hyperedges; for ℓ = O(log_γ n) with the hidden constant large enough, the contribution is also o(1).
Therefore, in all, we have

(E⟨P(Y), xx^⊤⟩)² / ( E[‖P(Y)‖_F²] · E[‖xx^⊤‖_F²] ) ≥ 1 − O( v^{−(p−2)/2} n^{−p/2} λ^{−2} ) − o(1) = Ω(1).

Therefore, when n = ω(cvΓ polylog(n)) and γ = 1 + Ω(1), we obtain a weak recovery algorithm by taking a random vector in the span of the top (1 − γ^{−1})^{−O(1)} eigenvectors of the matrix P(Y) (as shown in [HS17]). When γ = ω(1), taking the leading eigenvector of P(Y) gives a strong recovery guarantee.

Next we evaluate the polynomial estimator for weak recovery using the color-coding method, as shown in Algorithm 7. We denote by ℓ_v = p(2ℓ − 1)v/2 + 1 the number of vertices of a hypergraph in S_{ℓ,v,i,j}.

Algorithm 7:
Algorithm for evaluating estimation matrix
Data:
Given Y ∈ (R^n)^{⊗p} s.t. Y = λ x^{⊗p} + W. Result: P(Y) ∈ R^{n×n}, with P_{ij}(Y) = Σ_{α ∈ S_{ℓ,v,i,j}} χ_α(Y) (up to accuracy n^{−Ω(1)}).
C ← exp(100 ℓv);
for i ← 1 to C do
  Sample a coloring c_i : [n] → [ℓ_v] uniformly at random;
  Construct matrices M, N ∈ R^{(2^{ℓ_v} − 1) n^{p₁v} × (2^{ℓ_v} − 1) n^{p₂v}};
  Construct matrices A ∈ R^{n^{(p₂v + 1)/2} × (2^{ℓ_v} − 1) n^{p₁v}} and B ∈ R^{(2^{ℓ_v} − 1) n^{p₂v} × n^{(p₁v + 1)/2}};
  Construct matrices L^{(1)} ∈ R^{n × n^{(p₂v + 1)/2}} and L^{(2)} ∈ R^{n^{(p₁v + 1)/2} × n};
  Record the matrix p_{c_i} = L^{(1)} A (NM)^{ℓ−2} N B L^{(2)};
Return (1/C) Σ_{i=1}^{C} p_{c_i}

Next we describe how to construct the matrices
M, N, A, B. The rows and columns of M are indexed by pairs (V₁, S) and (V₂, T), where V₁ ∈ [n]^{p₁v} and V₂ ∈ [n]^{p₂v} are sets of vertices while S, T ⊊ [ℓ_v] are subsets of colors. We have M_{(V₁,S),(V₂,T)} = 0 if S ∪ {c(v) : v ∈ V₁} ≠ T, or if {c(v) : v ∈ V₁} and S are not disjoint. Otherwise M_{(V₁,S),(V₂,T)} is given by Σ_{γ ∈ S_{V₁,V₂}} χ_γ(Y), where S_{V₁,V₂} is the set of perfect matchings induced by V₁ and V₂ (each hyperedge in S_{V₁,V₂} is directed from p₂ vertices of V₂ to p₁ vertices of V₁).

For the matrix N, the indexing is the same as for M. We have N_{(V₁,S),(V₂,T)} = 0 if S ∪ {c(v) : v ∈ V₁} ≠ T, or if {c(v) : v ∈ V₁} and S are not disjoint. Otherwise N_{(V₁,S),(V₂,T)} is given by Σ_{γ ∈ S_{V₂,V₁}} χ_γ(Y), where S_{V₂,V₁} is the set of perfect matchings induced by V₂ and V₁ (each hyperedge in a hypergraph α ∈ S_{V₂,V₁} is directed from p₂ vertices of V₂ to p₁ vertices of V₁).

We consider the subset of hypergraphs contained in S_{ℓ,v,i,j} with i, j ∈ [n], denoted H_{i,j,V₀,V₁}. The set of vertices in level 0 of these hypergraphs is fixed to be {i} ∪ V₀, where V₀ ⊆ [n], and the set of vertices in level 1 is fixed to be V₁ ⊆ [n]. We denote by S_{i,V₀,V₁} the following set of spanning subgraphs: a hypergraph α ∈ S_{i,V₀,V₁} if and only if there exists a hypergraph β ∈ H_{i,j,V₀,V₁} such that the hyperedge set of α is the same as the set of hyperedges between levels 0 and 1 of β.

In the same way, we consider the subset of hypergraphs contained in S_{ℓ,v,i,j}, denoted L_{i,j,V₂,V₃}, with the vertices in levels 2ℓ−2 and 2ℓ−1 fixed. We denote by S_{V₂,V₃,j} the following set of spanning subgraphs: a hypergraph α ∈ S_{V₂,V₃,j} if and only if there exists a hypergraph β ∈ L_{i,j,V₂,V₃} such that the hyperedge set of α is the same as the set of hyperedges between levels 2ℓ−2 and 2ℓ−1 of β.

By these definitions, the entry A_{(i,V₀),(V₁,T)} is given by Σ_{α ∈ S_{i,V₀,V₁}} χ_α(Y) if T = {c(v) : v ∈ {i} ∪ V₀ ∪ V₁}, and 0 otherwise.
The entry B_{(V₂,S),V₃} is given by Σ_{α ∈ S_{V₂,V₃,j}} χ_α(Y) if S ∪ {c(v) : v ∈ V₂ ∪ V₃} = [ℓ_v] and S ∩ {c(v) : v ∈ V₂ ∪ V₃} = ∅, and zero otherwise.

Finally, we construct the deterministic matrices L^{(1)}, L^{(2)}. The columns of L^{(1)} are indexed by (j, V), where j ∈ [n] and V is a size-(p₂v − 1)/2 subset of [n]; the rows are indexed by i ∈ [n]. The entry L^{(1)}_{i,(j,V)} = 1 if j = i and 0 otherwise. The transpose of L^{(2)} is indexed in the same way, and the entry L^{(2)}_{(j,V),i} = 1 if i = j and 0 otherwise.

Lemma C.9 (Evaluation of polynomial estimator). Denote by ℓ_v the number of vertices in any hypergraph contained in S_{ℓ,v,i,j}, so that ℓ_v = p(2ℓ − 1)v/2 + 1. Given sampled colorings c_i : [n] → [ℓ_v] and a tensor Y ∈ (R^n)^{⊗p}, Algorithm 7 returns a matrix p(Y, c₁, …, c_C) ∈ R^{n×n} in time n^{O(pv)} exp(O(ℓ_v)). When

E⟨P(Y), xx^⊤⟩ / ( n (E[‖P(Y)‖_F²])^{1/2} ) = δ = Ω(1),

we have

E⟨p(Y, c₁, …, c_C), xx^⊤⟩ / ( n (E[‖p(Y, c₁, …, c_C)‖_F²])^{1/2} ) ≥ (1 − o(1)) δ = Ω(1).

Remark: When λ = ω(n^{−p/4}), using the power method for extracting the leading eigenvector, we obtain an n^{p + o(1)}-time algorithm for evaluating the leading eigenvector of the matrix returned by the algorithm.

Proof.
The critical observation is that Σ_{α ∈ S_{ℓ,v,i,j}} χ_α(Y) F_{c,α} can be obtained from the matrix H = A (NM)^{ℓ−2} N B appearing in the algorithm by summing all entries of H indexed by rows (i, ·) and columns (j, ·). Thus, given a random coloring c, the algorithm evaluates a matrix p_c(Y) satisfying

p_{c,i,j}(Y) = Σ_{α ∈ S_{ℓ,v,i,j}} χ_α(Y) F_{c,α}, where F_{c,α} = ( ℓ_v^{ℓ_v} / ℓ_v! ) · 1[c(α) = [ℓ_v]].

Thus p_c(Y) = L^{(1)} A (NM)^{ℓ−2} N B L^{(2)}.
By the same argument as for the strong detection algorithm, we can obtain an accurate estimate of P(Y) by averaging over exp(O(ℓv)) random colorings when γ = c_p λ² n^{p/2} v^{(p−2)/2} > 1, where c_p is a small enough constant depending on p. Therefore, taking a random vector in the span of the leading δ^{−O(1)} eigenvectors of D(Y) = (1/L) Σ_{t=1}^{L} p_{c_t}(Y) generates an estimator achieving weak recovery. Since ℓ = O(log_γ n), the polynomial can be evaluated in time O(n^{pv}) when c_p λ² n^{p/2} v^{(p−2)/2} > 1 and n/v = ω(Γ^ℓ). This leads to a polynomial-time algorithm when v = O(1).
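The final rounding step above turns a matrix that merely correlates with xx^⊤ into a vector estimate. When γ = ω(1) the leading eigenvector suffices and can be extracted by power iteration, as in the remark after Lemma C.9; when the correlation δ is only Ω(1), one instead takes a random unit vector in the span of the top δ^{−O(1)} eigenvectors. A minimal sketch of the power-iteration variant on a synthetic stand-in for D(Y) (the matrix D below and all parameters are illustrative assumptions, not the paper's construction):

```python
import numpy as np

def leading_eigvec(D, iters=200, seed=0):
    """Power iteration for the leading eigenvector of a symmetric matrix D."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(D.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(iters):
        u = D @ u
        u /= np.linalg.norm(u)
    return u

# Hypothetical demo: D correlates strongly with xx^T (signal eigenvalue 1,
# noise operator norm well below 1), mimicking the gamma = omega(1) regime.
rng = np.random.default_rng(1)
n = 300
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
D = np.outer(x, x) + 0.1 / np.sqrt(n) * rng.standard_normal((n, n))
D = (D + D.T) / 2  # symmetrize
u = leading_eigvec(D)
correlation = abs(u @ x)  # close to 1 in this regime (up to the sign of u)
```

Each power-iteration step is a matrix–vector product, which is what yields the n^{p+o(1)} running time mentioned in the remark when the product with the implicitly represented matrix is computed directly from Y.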