Efficient Tensor Decomposition
Aravindan Vijayaraghavan

Abstract
This chapter studies the problem of decomposing a tensor into a sum of constituent rank-one tensors. While tensor decompositions are very useful in designing learning algorithms and data analysis, they are NP-hard in the worst case. We will see how to design efficient algorithms with provable guarantees under mild assumptions, and using beyond worst-case frameworks like smoothed analysis.
∗ Chapter 19 of the book Beyond the Worst-Case Analysis of Algorithms [Rou20].
† Department of Computer Science, Northwestern University. Supported by the National Science Foundation (NSF) under Grants No. CCF-1652491, CCF-1637585 and CCF-1934931. [email protected].

1 Introduction to Tensors
Tensors are multi-dimensional arrays, and constitute natural generalizations of matrices. Tensors are fundamental linear algebraic entities, widely used in physics, scientific computing and signal processing to represent multi-dimensional data or capture multi-wise correlations. The different dimensions of the array are called the modes, and the order of a tensor is the number of dimensions or modes of the array, as shown in Figure 1. The order of a tensor also corresponds to the number of indices needed to specify an entry: for an order-3 tensor, every (i_1, i_2, i_3) ∈ [n_1] × [n_2] × [n_3] specifies an entry of the tensor T, denoted by T(i_1, i_2, i_3).

Figure 1: shows a matrix M, which is a tensor of order 2, and a tensor T of order 3 with n_1 = 7, n_2 = 6, n_3 = 5.

We start with the definition of a rank-one tensor. An order-ℓ tensor T ∈ R^{n_1 × ··· × n_ℓ} is rank one if and only if it can be written as an outer product v_1 ⊗ v_2 ⊗ ··· ⊗ v_ℓ for some vectors v_1 ∈ R^{n_1}, ..., v_ℓ ∈ R^{n_ℓ}, i.e.,

    T(i_1, i_2, ..., i_ℓ) = v_1(i_1) v_2(i_2) ··· v_ℓ(i_ℓ)   ∀ (i_1, ..., i_ℓ) ∈ [n_1] × ··· × [n_ℓ].

Note that when ℓ = 2, this corresponds to being expressible as v_1 v_2^T.

Definition 1.1 (Rank-k decomposition). A tensor T is said to have a decomposition of rank k iff it is expressible as the sum of k rank-one tensors, i.e., there exist {u_i^{(j)} | i ∈ [k], j ∈ [ℓ]} such that

    T = Σ_{i=1}^k u_i^{(1)} ⊗ u_i^{(2)} ⊗ ··· ⊗ u_i^{(ℓ)}.

Moreover, T has rank k if and only if k is the smallest natural number for which T has a rank-k decomposition.

The vectors {u_i^{(j)} : i ∈ [k], j ∈ [ℓ]} are called the factors of the decomposition. To keep track of how the factors across different modes are grouped, we will use U^{(j)} = (u_i^{(j)} : i ∈ [k]) for j ∈ [ℓ] to represent the factors. These "factor matrices" all have k columns, one per term of the decomposition.
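To make these definitions concrete, here is a small numpy sketch (my own illustration, not from the chapter) that builds a rank-k order-3 tensor from its factor matrices and checks the entrywise formula for a rank-one term:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3, k = 4, 5, 6, 3

# Factor matrices U^(1), U^(2), U^(3): one column per rank-one term.
U1 = rng.standard_normal((n1, k))
U2 = rng.standard_normal((n2, k))
U3 = rng.standard_normal((n3, k))

# T = sum_i u_i^(1) ⊗ u_i^(2) ⊗ u_i^(3), built with einsum.
T = np.einsum('ai,bi,ci->abc', U1, U2, U3)

# A rank-one tensor: entry (a, b, c) is v1(a) * v2(b) * v3(c).
v1, v2, v3 = U1[:, 0], U2[:, 0], U3[:, 0]
R = np.einsum('a,b,c->abc', v1, v2, v3)
assert np.allclose(R[1, 2, 3], v1[1] * v2[2] * v3[3])
print(T.shape)  # (4, 5, 6)
```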
Finally, we will also consider symmetric tensors: a tensor T of order ℓ is symmetric iff T(i_1, i_2, ..., i_ℓ) = T(i_{σ(1)}, i_{σ(2)}, ..., i_{σ(ℓ)}) for every permutation σ over {1, 2, ..., ℓ} (see Exercise 1 for an exercise about decompositions of symmetric tensors).

Differences from matrix algebra and pitfalls.
Observe that these definitions of rank and low-rank decompositions specialize to the standard notions for matrices (ℓ = 2). However, it is dangerous to use intuition developed from matrix algebra to reason about tensors, because of several fundamental differences. Firstly, an equivalent definition for the rank of a matrix is the dimension of the row space, or column space. This is not true for tensors of order 3 and above. In fact, for an order-ℓ tensor in R^{n × n × ··· × n}, the rank as we defined it could be as large as n^{ℓ−1}, while the dimension of the span of the n-dimensional vectors along any of the modes can be at most n. The definition that we study in Definition 1.1 (as opposed to other notions like Tucker decompositions) is motivated by its applications to statistics and machine learning.

Secondly, much of the spectral theory for matrices involving eigenvectors and eigenvalues does not extend to tensors of higher order. For matrices, we know that the best rank-k approximation consists of the leading k terms of the SVD. However, this is not the case for tensor decompositions. The best rank-1 approximation may not be a factor in the best rank-2 approximation. Finally, and most importantly, the algorithmic problem of finding the best rank-k approximation of a tensor is NP-hard in the worst case, particularly for large k; for matrices, this is of course solved using singular value decompositions (SVD). In fact, this worst-case NP-hardness for higher order tensors is true for most tensor problems, including computing the rank, computing the spectral norm etc. [Hås90, HL13].

For all of the reasons listed above, and more, it is natural to ask: why bother with tensor decompositions at all? We will now see a striking property (uniqueness) satisfied by low-rank decompositions of most higher order tensors (but not satisfied by matrices), that also motivates many interesting uses of tensor decompositions.

Uniqueness of low-rank decompositions.
A remarkable property of higher order tensors is that (under certain conditions that hold typically) their minimum rank decompositions are unique up to trivial scaling and permutation. This is in sharp contrast to matrix decompositions. For any matrix M with a rank-k decomposition M = UV^T = Σ_{i=1}^k u_i v_i^T (k ≥ 2), there exist several other rank-k decompositions M = U′(V′)^T, where U′ = UO and V′ = VO for any rotation matrix O, i.e., OO^T = I_k; in particular, the SVD is one of them. This rotation problem is a common issue when using matrix decompositions in factor analysis (since we can only find the factors up to a rotation).

The first uniqueness result for tensor decompositions was due to Harshman [Har70] (who in turn credits it to Jennrich), assuming what is known as the "full rank condition". In particular, if T ∈ R^{n × n × n} has a decomposition

    T = Σ_{i=1}^k u_i ⊗ u_i ⊗ u_i,   s.t. {u_i : i ∈ [k]} ⊂ R^n are linearly independent

(or the factor matrix U is full rank), then this is the unique decomposition of rank k up to permuting the terms. (The statement is actually a little more general and also handles non-symmetric tensors; see Theorem 3.1.) Note that the full rank condition requires k ≤ n (moreover, it holds when the vectors are in general position in n ≥ k dimensions). What makes the above result even more surprising is that the proof is algorithmic! We will in fact see the algorithm and proof in Section 3.1. This will serve as the workhorse for most of the algorithmic results in this chapter. Kruskal [Kru77] gave a more general condition that guarantees uniqueness up to rank 3n/2 − 1, using a beautiful non-algorithmic proof. Uniqueness is also known to hold for generic tensors of rank k = Ω(n²) (here "generic" means all except a measure zero set of rank-k tensors). We will now see how this remarkable property of uniqueness will be very useful for applications like learning latent variable models.

[Footnote: For small k, there are algorithms that find approximately optimal rank-k approximations in time exponential in k (see e.g., [BCV14, SWZ19]).]
[Footnote: There are other definitional issues with the rank: there are tensors of a certain rank that can be arbitrarily well-approximated by tensors of much smaller rank, i.e., the "limit rank" (or formally, the border rank) may not be equal to the rank of the tensor. See Exercise 2 for an example.]

A common approach in unsupervised learning is to assume that the data (input) that is given to us is drawn from a probabilistic model with some latent variables and/or unknown parameters θ̄ that is appropriate for the task at hand, i.e., the structure we want to find. This includes mixture models like mixtures of Gaussians, topic models for document classification etc. A central learning problem is the efficient estimation of such latent model parameters from observed data.

A necessary step towards efficient learning is to show that the parameters are indeed identifiable after observing polynomially many samples. The method of moments approach, pioneered by Pearson, infers model parameters from empirical moments such as means, pairwise correlations and other higher order correlations. In general, very high order moments may be needed for this approach to succeed, and the unreliability of empirical estimates of these moments leads to large sample complexity (see e.g., [MV10, BS10]). In fact, for latent variable models like mixtures of k Gaussians, an exponential sample complexity of exp(Ω(k)) is necessary, if we make no additional assumptions.

On the computational side, maximum likelihood estimation, i.e., argmax_θ P_θ[data], is NP-hard for many latent variable models (see e.g., [TD18]). Moreover, iterative heuristics like expectation maximization (EM) tend to get stuck in local optima. Efficient tensor decompositions, when possible, present an algorithmic framework that is both statistically and computationally efficient for recovering the parameters.
The method-of-moments is the general approach of inferring parameters of a distribution by computing empirical moments of the distribution and solving for the unknown parameters. The moments of a distribution over R^n are naturally represented by tensors. The covariance or second moment is an n × n matrix, the third moment is represented by a tensor of order 3 in R^{n × n × n} (the (i_1, i_2, i_3)-th entry is E[x_{i_1} x_{i_2} x_{i_3}]), and in general the ℓ-th moment is a tensor of order ℓ. More crucially, for many latent variable models D(θ̄) with parameters θ̄, the moment tensor or a suitable modification of it has a low-rank decomposition (perhaps up to some small error) in terms of the unknown parameters θ̄ of the model. Low-rank decompositions of the tensor can then be used to implement the general method-of-moments approach, with both statistical and computational implications. The uniqueness of the tensor decomposition then immediately implies identifiability of the model parameters (in particular, it implies a unique solution for the parameters)! Moreover, a computationally efficient algorithm for recovering the factors of the tensor gives an efficient algorithm for recovering the parameters θ̄.

General Recipe.
This suggests the following algorithmic framework for parameter estimation. Consider a latent variable model with model parameters θ̄ = (θ_1, θ_2, ..., θ_k). These could be one parameter each for the k possible values of the latent variable (for example, in a mixture of k Gaussians, the θ_i could represent the mean of the i-th Gaussian component of unit variance).

1. Define an appropriate statistic T of the distribution (typically based on moments) such that the expected value of T has a low-rank decomposition

    T = E_{D(θ̄)}[T] = Σ_{i=1}^k λ_i θ_i^{⊗ℓ},

for some ℓ ∈ N, and (known) scalars {λ_i : i ∈ [k]}.
2. Obtain an estimate T̃ of the tensor T = E[T] from the data (e.g., from empirical moments) up to small error (denoted by the error tensor E).

3. Use tensor decompositions to solve for the parameters θ̄ = (θ_1, ..., θ_k) in the system Σ_{i=1}^k λ_i θ_i^{⊗ℓ} ≈ T̃, to obtain estimates θ̂_1, ..., θ̂_k of the parameters.

The last step involving tensor decompositions is the technical workhorse of the above approach, both for showing identifiability and for getting efficient algorithms. Many of the existing algorithmic guarantees for tensor decompositions (which hold under certain natural conditions on the decomposition, e.g., Theorems 3.1 and 4.1) provably recover the rank-k decomposition, thereby giving algorithmic proofs of uniqueness as well. However, the first step of designing the right statistic T with a low-rank decomposition requires a lot of ingenuity and creativity. In Section 2.2 we will see two important latent variable models that will serve as our case studies. You will see another application in the next chapter on topic modeling.

Need for robustness to errors.
So far, we have completely ignored sample complexity considerations by assuming access to the exact expectation T = E[T], so that the error E = 0 (this requires infinitely many samples). In polynomial time, the algorithm can only access a polynomial number of samples. Estimating even a simple one-dimensional statistic up to accuracy ε = 1/poly(n) typically requires Ω(1/ε²) samples, and the ℓ-th moment of a distribution requires n^{O(ℓ)} samples to estimate up to inverse polynomial error (in Frobenius norm, say). Hence, to obtain polynomial time guarantees for parameter estimation, it is vital for the tensor decomposition guarantees to be noise tolerant, i.e., robust up to inverse polynomial error (this is even assuming no model mis-specification). Fortunately, such robust guarantees do exist: in Section 3.1, we will show a robust analogue of Harshman's uniqueness theorem and related algorithms (see also [BCV14] for a robust version of Kruskal's uniqueness theorem). Obtaining robust analogues of known uniqueness and algorithmic results is quite non-trivial and open in many cases (see Section 6).

2.2 Case studies

Case Study 1: Mixtures of Spherical Gaussians.
Our first case study is mixtures of Gaussians, perhaps the most widely studied latent variable model in machine learning, used for clustering and modeling heterogeneous populations. We are given random samples, where each sample point x ∈ R^n is drawn independently from one of k Gaussian components according to mixing weights w_1, w_2, ..., w_k, where each Gaussian component j ∈ [k] has a mean μ_j ∈ R^n and a covariance σ_j² I ∈ R^{n × n}. The goal is to estimate the parameters {(w_j, μ_j, σ_j²) : j ∈ [k]} up to required accuracy ε > 0, ideally in time polynomial in k, n, 1/ε. Existing algorithms based on the method of moments have sample complexity and running time that is exponential in k in general [MV10, BS10]. However, we will see that as long as certain non-degeneracy conditions are satisfied, tensor decompositions can be used to get tractable algorithms that have only a polynomial dependence on k (in Theorem 3.3 and Corollary 4.9).

For the sake of exposition, we will restrict our attention to the uniform case, when the mixing weights are all equal and the variances σ_i² = 1 for all i ∈ [k]. Most of these ideas also apply in the more general setting [HK13]. For the first step of the recipe, we will design a statistic that has a low-rank decomposition in terms of the means {μ_i : i ∈ [k]}.

Proposition 2.1.
For any integer ℓ ≥ 2, one can efficiently compute a statistic T from the first ℓ moments such that E[T] = T_ℓ := (1/k) Σ_{i=1}^k μ_i^{⊗ℓ}.

Let η ∼ N(0, I) denote a standard Gaussian random variable. The expected value of the statistic x^{⊗ℓ} is

    Mom_ℓ := E[x^{⊗ℓ}] = Σ_i w_i E_η[(μ_i + η)^{⊗ℓ}] = (1/k) Σ_{i=1}^k Σ_{x_j ∈ {μ_i, η} ∀ j ∈ [ℓ]} E_η[ ⊗_{j=1}^ℓ x_j ].   (1)

Now the first term in the inner expansion (where every x_j = μ_i) is the one we are interested in, so we will try to "subtract" out the other terms using the first (ℓ − 1) moments of the distribution. Let us consider the case ℓ = 3 to gain more intuition. As the odd moments of η are zero, we have

    Mom_3 := E[x^{⊗3}] = (1/k) Σ_{i=1}^k ( μ_i^{⊗3} + E_η[μ_i ⊗ η ⊗ η] + E_η[η ⊗ η ⊗ μ_i] + E_η[η ⊗ μ_i ⊗ η] )
           = T_3 + ( Mom_1 ⊗ I + two other known terms ).

Hence, we can obtain the required tensor T_3 using a combination of Mom_3 and Mom_1; the corresponding statistic is x^{⊗3} − (x ⊗ I + two other known terms). We can use a similar inductive approach for obtaining T_ℓ (or use Isserlis' identity, which expresses higher moments of a Gaussian in terms of the mean and covariance).
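The ℓ = 3 identity above can be checked numerically using exact expectations (no sampling). The following sketch is my own illustration for a uniform mixture of spherical Gaussians: it forms Mom_3 analytically, subtracts the known lower-order terms, and compares against T_3 = (1/k) Σ_i μ_i^{⊗3}:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 3
mu = rng.standard_normal((n, k))          # columns are the component means mu_i

# Exact Mom_3 = E[x ⊗ x ⊗ x] for a uniform mixture of N(mu_i, I): using that odd
# moments of eta vanish and E[eta_b eta_c] = delta_{bc}.
I = np.eye(n)
Mom1 = mu.mean(axis=1)                    # E[x] = (1/k) sum_i mu_i
Mom3 = np.zeros((n, n, n))
for i in range(k):
    m = mu[:, i]
    Mom3 += (np.einsum('a,b,c->abc', m, m, m)
             + np.einsum('a,bc->abc', m, I)       # E[mu_i ⊗ eta ⊗ eta]
             + np.einsum('b,ac->abc', m, I)       # E[eta ⊗ mu_i ⊗ eta]
             + np.einsum('c,ab->abc', m, I)) / k  # E[eta ⊗ eta ⊗ mu_i]

# Subtract the known lower-order terms to isolate T_3 = (1/k) sum_i mu_i^{⊗3}.
correction = (np.einsum('a,bc->abc', Mom1, I)
              + np.einsum('b,ac->abc', Mom1, I)
              + np.einsum('c,ab->abc', Mom1, I))
T3 = Mom3 - correction
T3_direct = sum(np.einsum('a,b,c->abc', mu[:, i], mu[:, i], mu[:, i])
                for i in range(k)) / k
assert np.allclose(T3, T3_direct)
```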
[Footnote: An alternate trick to obtain a statistic T_ℓ that only loses constant factors in the dimension involves looking at an off-diagonal block of the tensor Mom_ℓ, after partitioning the n coordinates into ℓ equal-sized blocks.]

Case study 2: Learning Hidden Markov Models (HMMs). Our next example is HMMs, which are extensively used for data with a sequential structure. In an HMM, there is a hidden state sequence Z_1, Z_2, ..., Z_m taking values in [k] that forms a stationary Markov chain Z_1 → Z_2 → ··· → Z_m with transition matrix P and initial distribution w = {w_j}_{j ∈ [k]} (assumed to be the stationary distribution). The observation X_t is represented by a vector x^{(t)} ∈ R^n. Given the state Z_t at time t, X_t is conditionally independent of all other observations and states. The observation matrix is denoted by O ∈ R^{n × k}; the columns of O represent the means of the observation X_t ∈ R^n conditioned on the hidden state Z_t, i.e., E[X_t | Z_t = i] = O_i, where O_i represents the i-th column of O. We also assume that X_t satisfies strong enough concentration bounds for empirical estimates to be usable. The parameters are P, O, w.

We now define appropriate statistics, following [AMR09]. Let m = 2ℓ + 1 for some ℓ to be chosen later. The statistic T is the outer product X_1 ⊗ X_2 ⊗ ··· ⊗ X_{2ℓ+1}. We can also view this (2ℓ + 1)-th moment tensor as a 3-tensor of shape n^ℓ × n × n^ℓ: the first mode corresponds to X_ℓ ⊗ X_{ℓ−1} ⊗ ··· ⊗ X_1, the second mode is X_{ℓ+1}, and the third mode is X_{ℓ+2} ⊗ X_{ℓ+3} ⊗ ··· ⊗ X_{2ℓ+1}. Why does it have a low-rank decomposition? We can think of the hidden state Z_{ℓ+1} as the latent variable, which takes k possible values.

Proposition 2.2.
The above statistic T has a low-rank decomposition Σ_{i=1}^k w_i A_i ⊗ B_i ⊗ C_i with factor matrices A ∈ R^{n^ℓ × k}, B ∈ R^{n × k}, and C ∈ R^{n^ℓ × k}, such that for all i ∈ [k],

    A_i = E[ ⊗_{j=ℓ}^{1} X_j | Z_{ℓ+1} = i ],   B_i = E[ X_{ℓ+1} | Z_{ℓ+1} = i ],   and   C_i = E[ ⊗_{j=ℓ+2}^{2ℓ+1} X_j | Z_{ℓ+1} = i ].

Moreover, O, P and w can be recovered from A, B, C.

For ℓ = 1, C = OP, B = O, and A = OP′, where P′ = diag(w) P^T diag(w)^{−1} is the reverse transition matrix. Tensor decompositions will allow for efficient recovery of O, P, w in Theorem 3.4 and Section 4.4. We leave the proof of Proposition 2.2 as Exercise 4; see [AMR09] for more details.
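The ℓ = 1 case can be verified numerically with exact expectations. In this sketch (mine, not from the chapter) two conventions are assumptions on my part: P is taken to be column-stochastic, P[j, i] = Pr[Z_{t+1} = j | Z_t = i], which makes C = OP and P′ = diag(w) P^T diag(w)^{−1} come out as stated, and the mixing weights w_i appear explicitly on the rank-one terms:

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 3, 4
# Column-stochastic transition matrix (assumed convention): P[j, i] = Pr[Z_{t+1}=j | Z_t=i].
P = rng.random((k, k))
P /= P.sum(axis=0, keepdims=True)
O = rng.standard_normal((n, k))           # observation means, one column per state

# Stationary distribution w: right eigenvector of P for eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(P)
w = np.real(evecs[:, np.argmax(np.real(evals))])
w /= w.sum()

# Exact E[X1 ⊗ X2 ⊗ X3] (means only), summing over hidden paths (z1, z2, z3).
T = np.zeros((n, n, n))
for z1 in range(k):
    for z2 in range(k):
        for z3 in range(k):
            pr = w[z1] * P[z2, z1] * P[z3, z2]
            T += pr * np.einsum('a,b,c->abc', O[:, z1], O[:, z2], O[:, z3])

# Proposition 2.2 with ell = 1: A = O P', B = O, C = O P, weights w_i = Pr[Z_2 = i].
Pr_rev = np.diag(w) @ P.T @ np.diag(1.0 / w)   # reverse transition matrix P'
A, B, C = O @ Pr_rev, O, O @ P
T_lowrank = sum(w[i] * np.einsum('a,b,c->abc', A[:, i], B[:, i], C[:, i])
                for i in range(k))
assert np.allclose(T, T_lowrank)
```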
We now study Jennrich’s algorithm (first described in [Har70]), that gives theoretical guarantees forfinding decompositions of third-order tensors under a natural non-degeneracy condition called thefull-rank setting. Moreover this algorithm also has reasonable robustness properties, and can beused as a building block to handle more general settings and for many machine learning applications.Consider a third-order tensor T ∈ R n × m × p that has a decomposition of rank k : T = k X i =1 u i ⊗ v i ⊗ w i . Our algorithmic goal is to recover the unknown factors
U, V, W . Of course, we only hope to recoverthe factors up to some trivial scaling of vectors (within a rank-one term) and permuting terms. Notethat our algorithmic goal here is much stronger than usual. This is possible because of uniquenessof tensor decompositions – in fact, the proof of correctness of the algorithm also proves uniqueness!The algorithm considers two matrices M a , M b that are formed by taking random linear combina-tions of the slices of the tensor as shown in Figure 2. We will show later in (2) that M a , M b both havelow-rank decompositions in terms of the unknown factors { u i , v i } . Hence, the algorithm reducesthe problem of decomposing one third-order tensor into the problem of obtaining a “simultaneous”decomposition of the two matrices M a , M b (this is also called simultaneous diagonalization).7igure 2: shows a tensor T , and a particular matrix slice high-lighted in orange (corresponding to i = 2). The linear combi-nation of the slices T ( · , · , a ) takes a linear combination of thesematrix slices weighted according to a ∈ R p . The algorithm con-siders two matrices M a = T ( · , · , a ) , M b = T ( · , · , b ) for two ran-domly chosen vectors a, b ∈ R p .In the following algorithm, M † refers to the pseudoinverse or the Moore-Penrose inverse of M (if a rank- k matrix M has a singular value decomposition M = U Σ V T where Σ is a k × k diagonalmatrix, then M † = V Σ − U T ). Algorithm 1
Jennrich’s Algorithm
Input: Tensor T ∈ R^{n × m × p}.

1. Draw a, b ∈ R^p independently with i.i.d. N(0, 1/p) coordinates. Set M_a = T(·, ·, a), M_b = T(·, ·, b).
2. Set {u_i : i ∈ [k]} to be the eigenvectors corresponding to the k largest (in magnitude) eigenvalues of M_a(M_b)^†. Similarly, let {v_i : i ∈ [k]} be the eigenvectors corresponding to the k largest (in magnitude) eigenvalues of ((M_a)^† M_b)^T.
3. Pair up u_i, v_i if their corresponding eigenvalues are (approximately) reciprocals of each other.
4. Solve the linear system T = Σ_{i=1}^k u_i ⊗ v_i ⊗ w_i for the vectors w_i.
5. Return factor matrices U ∈ R^{n × k}, V ∈ R^{m × k}, W ∈ R^{p × k}.

In what follows, ||T||_F denotes the Frobenius norm of the tensor (||T||_F² is the sum of the squares of all the entries), and the condition number κ of a matrix U ∈ R^{n × k} is given by κ(U) = σ_1(U)/σ_k(U), where σ_1 ≥ σ_2 ≥ ··· ≥ σ_k are the top k singular values of U; the condition number is finite only if the matrix has rank k (full rank).

Theorem 3.1.
Suppose we are given a tensor T̃ = T + E ∈ R^{n × m × p}, where T has a decomposition T = Σ_{i=1}^k u_i ⊗ v_i ⊗ w_i satisfying the following conditions:

1. The matrices U = (u_i : i ∈ [k]), V = (v_i : i ∈ [k]) have condition number at most κ;
2. For all i ≠ j, || w_i/||w_i|| − w_j/||w_j|| || ≥ δ;
3. Each entry of E is bounded by ε · ||T||_F / poly(κ, max{n, m, p}, 1/δ).

Then Algorithm 1 on input T̃ runs in polynomial time and returns a decomposition {(ũ_i, ṽ_i, w̃_i) : i ∈ [k]} such that there is a permutation π : [k] → [k] with

    ∀ i ∈ [k],   || ũ_i ⊗ ṽ_i ⊗ w̃_i − u_{π(i)} ⊗ v_{π(i)} ⊗ w_{π(i)} ||_F ≤ ε ||T||_F.
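Before the analysis, here is a minimal numpy sketch of the noise-free case (E = 0) of Algorithm 1; the function name and the least-squares implementation of steps 3-4 are my own choices, and none of the robustness machinery of Theorem 3.1 is attempted:

```python
import numpy as np

def jennrich(T, k, rng):
    """Recover rank-one factors of T = sum_i u_i ⊗ v_i ⊗ w_i (noise-free sketch)."""
    n, m, p = T.shape
    a, b = rng.standard_normal(p), rng.standard_normal(p)
    Ma = np.einsum('abc,c->ab', T, a)          # M_a = T(·,·,a): random slice combination
    Mb = np.einsum('abc,c->ab', T, b)

    # Eigenvectors of M_a M_b^+ give the u_i; eigenvectors of (M_a^+ M_b)^T give the v_i.
    lu, U = np.linalg.eig(Ma @ np.linalg.pinv(Mb))
    lv, V = np.linalg.eig((np.linalg.pinv(Ma) @ Mb).T)
    iu = np.argsort(-np.abs(lu))[:k]           # keep the k largest (in magnitude)
    iv = np.argsort(-np.abs(lv))[:k]
    lu, U = np.real(lu[iu]), np.real(U[:, iu])
    lv, V = np.real(lv[iv]), np.real(V[:, iv])

    # Pair u_i with the v_j whose eigenvalue is (approximately) the reciprocal.
    order = [int(np.argmin(np.abs(lu[i] * lv - 1.0))) for i in range(k)]
    V = V[:, order]

    # Solve the linear system T = sum_i u_i ⊗ v_i ⊗ w_i for the w_i (least squares).
    A = np.stack([np.outer(U[:, i], V[:, i]).ravel() for i in range(k)], axis=1)
    W, *_ = np.linalg.lstsq(A, T.reshape(n * m, p), rcond=None)
    return U, V, W.T

rng = np.random.default_rng(3)
n, m, p, k = 6, 7, 8, 4
Ut = rng.standard_normal((n, k))
Vt = rng.standard_normal((m, k))
Wt = rng.standard_normal((p, k))
T = np.einsum('ai,bi,ci->abc', Ut, Vt, Wt)
U, V, W = jennrich(T, k, rng)
T_rec = np.einsum('ai,bi,ci->abc', U, V, W)
assert np.linalg.norm(T_rec - T) < 1e-6 * np.linalg.norm(T)
```

Note that each u_i and v_i is only recovered up to scale (eigenvectors are normalized arbitrarily); the least-squares step absorbs the missing scalars into w_i, so only the rank-one terms are compared.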
We start with a simple claim that leverages the randomness in the Gaussian linear combinations a, b (in fact, this is the only step of the argument that uses randomization). Let D_a := diag(a^T w_1, a^T w_2, ..., a^T w_k) and D_b := diag(b^T w_1, b^T w_2, ..., b^T w_k).

Lemma 3.2. With high probability over the randomness in a, b, the diagonal entries of D_a D_b^{−1} are separated from each other, and from 0, i.e.,

    ∀ i ∈ [k]:  | ⟨w_i, a⟩ / ⟨w_i, b⟩ | > 1/poly(p),   and   ∀ i ≠ j:  | ⟨w_i, a⟩/⟨w_i, b⟩ − ⟨w_j, a⟩/⟨w_j, b⟩ | > 1/poly(p).

The proof just uses simple anti-concentration of Gaussians and a union bound. We now proceed to the proof of Theorem 3.1.
Proof of Theorem 3.1.
We first prove that when E = 0, the algorithm recovers the decomposition exactly. The robust guarantee when E ≠ 0 uses perturbation bounds for eigenvalues and eigenvectors.

No-noise setting (E = 0). Recall that T has a rank-k decomposition in terms of the factors U, V, W. Hence

    M_a = Σ_{i=1}^k ⟨a, w_i⟩ u_i v_i^T = U D_a V^T,   and similarly   M_b = U D_b V^T.   (2)

Moreover, U, V are full rank by assumption, and the diagonal matrices D_a, D_b have full rank k with high probability (Lemma 3.2). Hence

    M_a (M_b)^† = U D_a V^T (V^T)^† D_b^† U^† = U D_a D_b^† U^†,   and   M_b^T (M_a^T)^† = V D_b D_a^† V^†.

Moreover, from Lemma 3.2, the entries of D_a D_b^† are distinct and non-zero with high probability. Hence the column vectors of U are eigenvectors of M_a(M_b)^† with eigenvalues (⟨w_i, a⟩/⟨w_i, b⟩ : i ∈ [k]). Similarly, the columns of V are eigenvectors of ((M_a)^† M_b)^T with eigenvalues (⟨w_i, b⟩/⟨w_i, a⟩ : i ∈ [k]). Hence, the eigendecompositions of M_a(M_b)^† and ((M_a)^† M_b)^T are unique (up to scaling of the eigenvectors), with the corresponding eigenvalues being reciprocals of each other.

Finally, once we know {u_i, v_i : i ∈ [k]} (up to scaling), step 4 solves a linear system in the unknowns {w_i : i ∈ [k]}. A simple claim shows that the corresponding coefficient matrix, given by {u_i v_i^T : i ∈ [k]}, has "full" rank, i.e., rank k. Hence the linear system has a unique solution W, and the algorithm recovers the decomposition.

Robust guarantees (E ≠ 0). When E ≠ 0, we need to analyze how much the eigenvectors of M := M_a M_b^† can change under the (worst-case) perturbation E. The proof uses perturbation bounds for eigenvectors of matrices (which are much more brittle than eigenvalues) to carry out this analysis. We now give a high-level description of the approach, while pointing out a couple of subtle issues and difficulties. The primary issue comes from the fact that the matrix M = M_a M_b^† is not a symmetric matrix (for symmetric matrices one could use the Davis-Kahan theorem for singular vectors). In our case, while we know that M is diagonalizable, there is no such guarantee about M′ = M_a M_b^† + E′, where E′ is the error matrix that arises at this step due to E. The key property that helps us here is Lemma 3.2, which ensures that all of the non-zero eigenvalues of M are separated. In this case, we know the matrix M′ is also diagonalizable, using a standard strengthening of the Gershgorin disc theorem. One can then use the separation in the eigenvalues of M to argue that the eigenvectors of M and M′ are close, using ideas from the perturbation theory of invariant subspaces (see Chapter 5 of [SS90]). See also [GVX14, BCMV14] for a self-contained proof of Theorem 3.1.

These efficient algorithms that (uniquely) recover the factors of a low-rank tensor decomposition give polynomial time guarantees for learning non-degenerate instances of several latent variable models, using the general recipe given in Section 2.1. This approach has been used for several problems including, but not limited to, parameter estimation of hidden Markov models, phylogeny models, mixtures of Gaussians, independent component analysis, topic models, mixed community membership models, ranking models, crowdsourcing models, and even certain neural networks (see [AGH+14, Moi18] for excellent expositions on this topic). For illustration, we give the implications for our two case studies. For Gaussian mixtures, the k means are assumed to be linearly independent (hence n ≥ k). We apply Theorem 3.1 to the ℓ = 3 order tensor obtained from Proposition 2.1.

Theorem 3.3. [HK13] Given samples from a mixture of k spherical Gaussians, there is an algorithm that learns the parameters up to ε error in poly(n, 1/ε, 1/σ_k(M)) time (and samples), where M is the n × k matrix of means.

For hidden Markov models, we assume that the columns of the observation matrix O and the transition matrix P are linearly independent (hence n ≥ k). We apply Theorem 3.1 to the ℓ = 3 order tensor obtained from Proposition 2.2.

Theorem 3.4. [MR06, HKZ12] Given samples with m = 3 consecutive observations (corresponding to any fixed window of length 3) from an HMM model as in Section 2.2, with σ_k(O) ≥ 1/poly(n) and σ_k(P) ≥ 1/poly(n), we can recover P, O up to ε error in poly(n, 1/ε) time (and samples).

The tensor decomposition algorithm we have seen in the previous section requires that the factor matrices have full column rank. As we have seen in Section 3.2, this gives polynomial time algorithms for learning a broad variety of latent variable models under the full-rank assumption. However, there are many applications in unsupervised learning where it is crucial that the hidden representation has much higher dimension (or number of factors k) than the dimension of the feature space n. Obtaining polynomial time guarantees for these problems using tensor decompositions requires polynomial time algorithmic guarantees when the rank is much larger than the dimension (recall that in the full-rank setting k ≤ n, even when the k factors are random or in general position in R^n). Can we hope to obtain provable guarantees when the rank k ≫ n?

This challenging setting, when the rank is larger than the dimension, is often referred to as the overcomplete setting.
Tensor decompositions in the overcomplete setting are NP-hard in general. However, for tensors of higher order, we will see in the rest of this section how Jennrich's algorithm can be adapted to get polynomial time guarantees, even in very overcomplete settings, for non-degenerate instances; this will be formalized using smoothed analysis.

4.1 Smoothed analysis model

The smoothed analysis model for tensor decompositions models the situation when the factors in the decomposition are not worst-case.

• An adversary chooses a tensor T = Σ_{i=1}^k u_i^{(1)} ⊗ u_i^{(2)} ⊗ ··· ⊗ u_i^{(ℓ)}.
• Each vector u_i^{(j)} is randomly "ρ-perturbed" using an independent Gaussian N(0, ρ²/n)^n with mean 0 and variance ρ²/n in each direction.
• Let T̃ = Σ_{i=1}^k ũ_i^{(1)} ⊗ ũ_i^{(2)} ⊗ ··· ⊗ ũ_i^{(ℓ)}.
• The input instance is T̂ = T̃ + E, where E is some small, potentially adversarial noise.

Our goal is to recover (approximately, when E ≠ 0) the ℓ sets of factors U^{(1)}, ..., U^{(ℓ)} (up to rescaling and relabeling), where U^{(j)} = (ũ_i^{(j)} : i ∈ [k]). The parameter setting of interest is ρ being at least some inverse polynomial in n, and the maximum entry of E being smaller than some sufficiently small inverse polynomial 1/poly(n, 1/ρ). We will also assume that the Euclidean lengths of the factors {u_i^{(j)}} are polynomially upper bounded. We remark that when k ≤ n (as in the full-rank setting), Theorem 3.1 already gives smoothed polynomial time guarantees when ε < ρ/poly(n), since the condition number κ ≤ poly(n)/ρ with high probability.

Remarks.
There is an alternate smoothed analysis model where the random perturbation is applied to each entry of the tensor itself, as opposed to randomly perturbing the factors of a decomposition. The two random perturbations are very different in flavor. When the whole tensor is randomly perturbed, we have n^ℓ "bits" of randomness, whereas when only the factors are perturbed we have kℓn "bits" of randomness. On the other hand, the model where the whole tensor is randomly perturbed is unlikely to be easy from a computational standpoint, since this would likely imply randomized algorithms with good worst-case approximation guarantees.

Why do we study perturbations to the factors? In most applications each factor represents a parameter, e.g., a component mean in Gaussian mixture models. The intuition is that if these parameters of the model are not chosen in a worst-case configuration, we can potentially obtain vastly improved learning algorithms with such smoothed analysis guarantees.

The smoothed analysis model can also be seen as the quantitative analog of "genericity" results that are inspired by results from algebraic geometry, particularly when we need robustness to noise. Results of this generic flavor give guarantees for all except a set of instances of measure zero. However, such results are far from being quantitative; as we will see later, we typically need robustness to inverse polynomial error with high probability for polynomial time guarantees.

We will give an algorithm in the smoothed analysis setting for overcomplete tensor decompositions with polynomial time guarantees. In the following theorem, we consider the model in Section 4.1 where the low-rank tensor is T̃ = Σ_{i=1}^k ũ_i^{(1)} ⊗ ũ_i^{(2)} ⊗ ··· ⊗ ũ_i^{(ℓ)}, and the factors {ũ_i^{(j)}} are ρ-perturbations of the vectors {u_i^{(j)}}, whose lengths we assume are bounded by some polynomial in n. The input tensor is T̃ + E, where E represents adversarial noise.
[Footnote: Many of the results in this section also hold for other forms of random perturbations, as long as the distribution satisfies a weak anti-concentration property, similar to the setting in Chapters 13-15; see [ADM+18] for details.]

Theorem 4.1. Let k ≤ n^{⌊(ℓ−1)/2⌋} for some constant ℓ ∈ N, and ε ∈ [0, 1]. There is an algorithm that takes as input a tensor T̂ = T̃ + E as described above, with every entry of E at most ε/(n/ρ)^{O(ℓ)} in magnitude, runs in time (n/ρ)^{O(ℓ)}, and recovers all the rank-one terms {⊗_{j=1}^ℓ ũ_i^{(j)} : i ∈ [k]} up to an additive ε error measured in Frobenius norm, with probability at least 1 − exp(−Ω(n)).

To describe the main algorithmic idea, let us consider an order-5 tensor T ∈ R^{n × n × n × n × n}. We can "flatten" T to get an order-3 tensor:

    T = Σ_{i=1}^k ( u_i^{(1)} ⊗ u_i^{(2)} ) ⊗ ( u_i^{(3)} ⊗ u_i^{(4)} ) ⊗ u_i^{(5)},

where the three parenthesized groups form the three factors of the flattened tensor. This gives us an order-3 tensor T′ of size n² × n² × n. The effect of the "flattening" operation on the factors can be described succinctly using the following operation.

Definition 4.2 (Khatri-Rao product). The Khatri-Rao product of A ∈ R^{m × k} and B ∈ R^{n × k} is the mn × k matrix A ⊙ B whose i-th column is a_i ⊗ b_i (flattened into a vector).

Our new order-3 tensor T′ also has a rank-k decomposition, with factor matrices U′ = U^{(1)} ⊙ U^{(2)}, V′ = U^{(3)} ⊙ U^{(4)} and W′ = U^{(5)}, respectively. Note that the columns of U′ and V′ are in n² dimensions (in general they will be n^{⌊(ℓ−1)/2⌋}-dimensional). We could now hope that the assumptions on the condition number of U′, V′ in Theorem 3.1 are satisfied for k = ω(n). This is not true in the worst case (see Exercise 3 for a counterexample). However, we will prove that it is true w.h.p. in the smoothed analysis model!

As the factors in U^{(1)}, ..., U^{(ℓ)} are all polynomially upper bounded, the maximum singular value is also at most a polynomial in n. The following proposition shows high-confidence lower bounds on the minimum singular value after taking the Khatri-Rao product of a subset of the factor matrices; this of course implies that the condition number has a polynomial upper bound with high probability.
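The flattening identity can be verified directly in numpy. This sketch (with made-up small dimensions) also illustrates that the Khatri-Rao product of two matrices, each of rank at most n, can have rank k > n:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: the i-th column is kron(a_i, b_i)."""
    m, k = A.shape
    n, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(m * n, k)

rng = np.random.default_rng(5)
n, k = 3, 5                                  # overcomplete: k > n
Us = [rng.standard_normal((n, k)) for _ in range(5)]

# Order-5 rank-k tensor, then flattened to an order-3 tensor of shape (n^2, n^2, n).
T5 = np.einsum('ai,bi,ci,di,ei->abcde', *Us)
T3 = T5.reshape(n * n, n * n, n)

# Its factor matrices are Khatri-Rao products of the original ones.
U_ = khatri_rao(Us[0], Us[1])
V_ = khatri_rao(Us[2], Us[3])
W_ = Us[4]
T3_direct = np.einsum('ai,bi,ci->abc', U_, V_, W_)
assert np.allclose(T3, T3_direct)
print(np.linalg.matrix_rank(U_))  # generically k = 5 > n, even though rank(U^(j)) <= 3
```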
Let δ ∈ (0, 1) be a constant such that k ≤ (1 − δ)n^ℓ. Given any U^{(1)}, U^{(2)}, . . . , U^{(ℓ)} ∈ R^{n×k}, for their random ρ-perturbations we have

P[ σ_k( Ũ^{(1)} ⊙ Ũ^{(2)} ⊙ · · · ⊙ Ũ^{(ℓ)} ) < c₁(ℓ) ρ^ℓ / n^ℓ ] ≤ k exp(−c₂(ℓ) δn),

where c₁(ℓ), c₂(ℓ) are constants that depend only on ℓ.

The proposition implies that the conditions of Theorem 3.1 hold for the flattened order-3 tensor T′; in particular, the condition number of the factor matrices is now polynomially upper bounded with high probability. Hence running Jennrich's algorithm on the order-3 tensor T′ recovers the rank-one factors w.h.p., as required in Theorem 4.1. The rest of the section outlines the proof of Proposition 4.3.

Failure probability.
We remark on a technical requirement about the failure probability (that is satisfied by the above proposition) for smoothed analysis guarantees. We need our bounds on the condition number or σ_min to hold with a sufficiently small failure probability, say n^{−ω(1)} or even exponentially small (over the randomness in the perturbations). This is important because in smoothed analysis applications, the failure probability essentially describes the fraction of points around any given point that are bad for the algorithm. In many applications, the time/sample complexity has an inverse polynomial dependence on the least singular value. For example, if we only have a guarantee that σ_min ≥ γ with probability at least 1 − γ^{1/2}, then the probability of the running time exceeding T (upon perturbation) is at most 1/√T. Such a guarantee does not suffice to show that the expected running time is polynomial (also called polynomial smoothed complexity).

Note that our matrix Ũ^{(1)} ⊙ · · · ⊙ Ũ^{(ℓ)} is a random matrix with highly dependent entries: there are only kℓn independent variables but kn^ℓ matrix entries. This presents very different challenges compared to well-studied settings in random matrix theory, where every entry is independent. While the least singular value can be hard to handle directly, it is closely related to the leave-one-out distance, which is often much easier to deal with.

Definition 4.4.
Given a matrix M ∈ R^{n×k} with columns M₁, . . . , M_k, the leave-one-out distance of M is

ℓ(M) = min_{i∈[k]} ‖Π⊥_{−i} M_i‖,

where Π⊥_{−i} is the projection matrix orthogonal to span({M_j : j ≠ i}).

The leave-one-out distance is closely related to the least singular value, up to a factor polynomial in the number of columns of M, by the following simple lemma.

Lemma 4.5.
For any matrix M ∈ R^{n×k}, we have

ℓ(M)/√k ≤ σ_min(M) ≤ ℓ(M).    (3)

The following (more general) core lemma, which lower bounds the projection of a randomly perturbed rank-one tensor onto any given subspace, implies Proposition 4.3.

Lemma 4.6.
Let ℓ ∈ N and δ ∈ (0, 1/ℓ) be constants, and let W ⊆ R^{n^ℓ} be an arbitrary subspace of dimension at least δn^ℓ. Given any x₁, . . . , x_ℓ ∈ R^n, their random ρ-perturbations x̃₁, . . . , x̃_ℓ satisfy

P[ ‖Π_W (x̃₁ ⊗ x̃₂ ⊗ · · · ⊗ x̃_ℓ)‖ < c₁(ℓ) ρ^ℓ / n^ℓ ] ≤ exp(−c₂(ℓ) δn),

where c₁(ℓ), c₂(ℓ) are constants that depend only on ℓ.

The polynomial in n in the exponent of the failure probability is tight; however, it is unclear what the right polynomial dependence on n in the least singular value bound should be, and likewise the right dependence on ℓ. The above lemma can be used to lower bound the least singular value of the matrix Ũ^{(1)} ⊙ · · · ⊙ Ũ^{(ℓ)} in Proposition 4.3 as follows: we lower bound the leave-one-out distance of Lemma 4.5 by applying Lemma 4.6 for each column i ∈ [k], with W being the subspace corresponding to Π⊥_{−i} and x₁, . . . , x_ℓ being u_i^{(1)}, . . . , u_i^{(ℓ)}; a union bound over the k columns then gives Proposition 4.3. The first version of this lemma was proven by Bhaskara et al. [BCMV14], with worse polynomial dependencies both in the lower bound on the condition number and in the exponent of the failure probability. The improved statement presented here and the proof sketched in Section 4.3 are based on Anari et al. [ADM+18].

Relation to anti-concentration of polynomials. We now briefly describe a connection to anti-concentration bounds for low-degree polynomials, and describe a proof strategy that yields a weaker version of Lemma 4.6. Anti-concentration inequalities (e.g., the Carbery-Wright inequality) for a degree-ℓ polynomial g : R^n → R with ‖g‖ ≥ η, and x ∼ N(0, 1)^n, are of the form

P_{x∼N(0,1)^n}[ |g(x) − t| < εη ] ≤ O(ℓ) · ε^{1/ℓ}.    (4)

This can be used to derive a weaker version of Lemma 4.6 with an inverse polynomial failure probability, by considering a polynomial whose coefficients "lie" in the subspace W.
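For intuition, a bound of the shape of (4) is easy to probe by simulation. The following sketch (our own illustration, not code from the chapter) compares two degree-3 polynomials: g₁(x) = x₁³, for which the ε^{1/ℓ} rate is essentially tight, and g₂(x) = x₁x₂x₃, which is far less concentrated near zero.

```python
import numpy as np

# Monte Carlo check of a Carbery-Wright-style bound for degree-3 polynomials.
# g1(x) = x_1^3 is essentially tight: P[|g1(x)| < eps] ≈ P[|x_1| < eps^(1/3)].
# g2(x) = x_1 x_2 x_3 has a much smaller small-ball probability at the same eps.
rng = np.random.default_rng(0)
trials, eps = 200_000, 1e-3
x = rng.standard_normal((trials, 3))

p1 = np.mean(np.abs(x[:, 0] ** 3) < eps)
p2 = np.mean(np.abs(x[:, 0] * x[:, 1] * x[:, 2]) < eps)
print(p1, p2, eps ** (1 / 3))
```

Both empirical probabilities stay below the ε^{1/3} ≈ 0.1 ceiling that (4) predicts for degree 3, with g₁ nearly saturating it.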
As we discussed in the previous section, this failure probability does not suffice for expected polynomial running time (or polynomial smoothed complexity). On the other hand, Lemma 4.6 manages to get an inverse polynomial lower bound with exponentially small failure probability, by considering n^{Ω(1)} different polynomials. In fact, one can flip this around and use Lemma 4.6 to show a vector-valued variant of the Carbery-Wright anti-concentration bounds: if we have m ≥ δn^ℓ "sufficiently different" polynomials g₁, g₂, . . . , g_m : R^n → R, each of degree ℓ, then the failure probability improves to ε^{c(ℓ)δn}, where c(ℓ) > 0 depends only on ℓ; compared with (4), we gain a δn factor in the exponent on account of having a vector-valued function with m co-ordinates. See [BCPV19] for a precise statement and proof.

The proof of Lemma 4.6 is a careful inductive proof. We sketch the proof for ℓ ≤ 2, and refer to [ADM+18] for the complete proof. For convenience, let x̃ := x̃^{(1)} and ỹ := x̃^{(2)}. The high level outline is the following. We will show that there exist n × n matrices M₁, M₂, . . . , M_r ∈ W of bounded length measured in Frobenius norm (for general ℓ these would be order-ℓ tensors of Frobenius norm at most n^{ℓ/2}) which additionally satisfy certain "orthogonality" properties; here r = Ω_ℓ(δn^ℓ). We will use the orthogonality properties and the random perturbations to extract enough "independence" across {⟨M_i, x̃ ⊗ ỹ⟩ : i ∈ [r]}; this will allow us to conclude that at least one of these r inner products is Ω((ρ/√n)^ℓ) in magnitude with probability at least 1 − exp(−Ω(δn)).

What orthogonality property do we want?
Case ℓ = 1. Let us start with ℓ = 1. In this case we have a subspace W ⊆ R^n of dimension at least δn. Here we could just choose the r vectors v₁, . . . , v_r ∈ R^n to be an orthonormal basis for W to conclude the lemma, since the inner products ⟨v_i, x̃⟩ are then independent. However, let us consider a slightly different construction, in which v₁, . . . , v_r are not orthonormal, that will allow us to generalize to higher ℓ > 1.

Claim 4.7 (for ℓ = 1). There exists a set of vectors v₁, . . . , v_r ∈ W and a set of distinct indices i₁, i₂, . . . , i_r ∈ [n], with r = dim(W), such that for all j ∈ {1, 2, . . . , r}: (a) ‖v_j‖_∞ ≤ 1, (b) |v_j(i_j)| = 1, (c) v_j(i_{j′}) = 0 for all j′ < j.

Hence, each of the vectors v_j has a non-negligible component orthogonal to the span of v_{j+1}, . . . , v_r. This will give us sufficient independence across the random variables ⟨v₁, x̃⟩, . . . , ⟨v_r, x̃⟩. Consider the r inner products in reverse order, i.e., ⟨v_r, x̃⟩, ⟨v_{r−1}, x̃⟩, . . . , ⟨v₁, x̃⟩. Let x̃ = x + z where z ∼ N(0, ρ²/n)^n. First, ⟨v_r, x̃⟩ = ⟨v_r, x⟩ + ⟨v_r, z⟩, where ⟨v_r, z⟩ is a Gaussian with variance at least ρ²/n, due to the rotational invariance of Gaussians. Hence, for some absolute constant c > 0, simple Gaussian anti-concentration gives |⟨v_r, x̃⟩| < cρ/√n with probability at most 1/2. Now, let us analyze the event that ⟨v_j, x̃⟩ is small, after conditioning on the values of ⟨v_{j+1}, x̃⟩, . . . , ⟨v_r, x̃⟩. By construction, |v_j(i_j)| = 1, whereas v_{j+1}(i_j) = · · · = v_r(i_j) = 0. Hence

P[ |⟨v_j, x̃⟩| < cρ/√n | ⟨v_{j+1}, x̃⟩, . . . , ⟨v_r, x̃⟩ ] ≤ sup_{t∈R} P[ |z(i_j) − t| < cρ/√n ] ≤ 1/2.

Hence P[ ∀j ∈ [r], |⟨v_j, x̃⟩| < cρ/√n ] ≤ 2^{−r} = exp(−Ω(r)), as required.

Proof of Claim 4.7.
We will construct the vectors iteratively. For the first vector, pick any vector v₁ in W and rescale it so that ‖v₁‖_∞ = 1; let i₁ ∈ [n] be an index where |v₁(i₁)| = 1. For the second vector, consider the restricted subspace {x ∈ W : x(i₁) = 0}. This has dimension at least dim(W) − 1, so we can again pick an arbitrary vector in it and rescale it to get the necessary v₂. We can repeat this until we get r = dim(W) vectors (at which point the restricted subspace may become empty).

Proof sketch for ℓ = 2. We can use a similar argument to come up with an analogous set of matrices M₁, . . . , M_r inductively. It will be convenient to identify each of these matrices M_j with a (row, column) index pair I_j = (i_j, i′_j) ∈ [n] × [n]. We will also have a total order among all of the index pairs, as follows. We first have an ordering among all the valid row indices R = {i_j : j ∈ [r]} (say i₁ ≺ i₂ ≺ · · ·). Moreover, among all index pairs in the same row i* (i.e., R_{i*} := {I_j = (i*, i′_j)}), we have a total ordering (note that the orderings of column indices within two different rows, say i* = 2 and i* = 3, could be different).
7) and(3 , ≺ (3 , i ∗ = 2 and i ∗ = 3 could be different). Claim 4.8 (for ℓ = 2) . Given any subspace W ⊂ R n × n of dimension dim( W ) ≥ δn , there exists r many (row,column) index pairs I ≺ I ≺ · · · ≺ I r as outlined above, and a set of associated matrices M , M . . . , M r such that for all j ∈ { , , . . . , r } : (a) k M j k ∞ ≤ , (b) | M j ( I j ) | = 1 ,(c) M j ( I j ′ ) = 0 for all j ′ < j and M j ( i , i ) = 0 for any i ≺ i j , i ∈ [ n ] where I j = ( i j , i ′ j ) .Further there are at least | R | = Ω( δn ) valid row indices, and each of these indices has Ω( δn ) indexpairs associated with it. The approach to proving the above claim is broadly similar to that of Claim 4.7. The proofrepeatedly treats the vectors in W as vectors in R n and applies Claim 4.7 to extract a valid rowwith Ω( δn ) valid column indices, and iterates. We leave the formal proof as Exercise 5.Once we have Claim 4.8, the argument for Lemma 4.6 is as follows. Firstly, k M j k ≤ n since k M j k ∞ ≤
1. Hence, we just need to show that there exists j ∈ [ r ] s.t. |h M j , ˜ x ⊗ ˜ y i| ≥ cρ/n inmagnitude with probability ≥ − exp( − Ω( δn )). Consider the vectors { M ˜ y, M ˜ y, . . . , M r ˜ y } ⊂ R n obtained by applying just ˜ y . For each valid row i ∗ ∈ R , consider only the corresponding vectors withrow index i ∗ from { M j ˜ y : j ∈ [ r ] } and set v i ∗ to be the vector with the largest magnitude entry incoordinate i ∗ . By our argument for ℓ = 1 we can see that with probability at least 1 − exp( − Ω( δn )), | v i ∗ ( i ∗ ) | > τ := cρ/ √ n , for some constant c >
0. Now by scaling these vectors { v i : i ∈ [ R ] } by atmost 1 /τ each, we see that they satisfy Claim 4.7. Hence, using the argument for ℓ = 1 again, weget Lemma 4.6. Extending this argument to higher ℓ > The smoothed polynomial time guarantees for overcomplete tensor decompositions in turn implypolynomial time smoothed analysis guarantees for several learning problems. In the smoothedanalysis model for these parameter estimation problems, the unknown parameters θ of the modelare randomly perturbed to give ˜ θ , and samples are drawn from the model with parameters ˜ θ .15owever, as we alluded to earlier, the corresponding tensor decomposition problems that arise,e.g., from Proposition 2.1 and 2.2 do not always fit squarely in the smoothed analysis model inSection 4.1. For example, the random perturbations to the factors { u ( j ) i : i ∈ [ k ] , j ∈ [ ℓ ] } may notall be independent. In learning mixtures of spherical Gaussians, the factors of the decompositionare ˜ µ ⊗ ℓi for some appropriate ℓ >
1, where μ̃_i is the mean of the ith component. In learning hidden Markov models (HMMs), each factor is a sum of appropriate monomials of the form ã_{i₁} ⊗ ã_{i₂} ⊗ · · · ⊗ ã_{i_ℓ}, where i₁i₂ . . . i_ℓ correspond to length-ℓ paths in a graph.

Fortunately, the bounds in Proposition 4.3 can be used to derive similar high confidence lower bounds on the least singular value for random matrices that arise from such applications, using decoupling inequalities. For example, one can prove such bounds (as in Proposition 4.3) for the n^ℓ × k matrix whose ith column is μ̃_i^{⊗ℓ} (as required for mixtures of spherical Gaussians). Such bounds also hold for other broad classes of random matrices that are useful for other applications like hidden Markov models; see [BCPV19] for details.

In the smoothed analysis model for mixtures of spherical Gaussians, the means {μ_i : i ∈ [k]} are randomly perturbed. The following corollary gives polynomial time smoothed analysis guarantees for estimating the means of a mixture of k spherical Gaussians. See [BCMV14, ABG+14] for details.
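The phenomenon behind Proposition 4.3 is easy to probe numerically. The sketch below (our own illustration, with assumed parameters) starts from a worst-case factor matrix with all columns identical, applies an entrywise ρ-perturbation, and checks that the kth singular value of the Khatri-Rao square Ũ ⊙ Ũ is bounded away from zero even in the overcomplete regime k > n.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (A ⊙ B)[:, i] = kron(A[:, i], B[:, i])."""
    return np.einsum('ir,jr->ijr', A, B).reshape(A.shape[0] * B.shape[0], -1)

rng = np.random.default_rng(0)
n, k, rho = 10, 40, 0.1               # overcomplete: n < k <= (1 - delta) n^2
U = np.ones((n, k))                    # worst case: all columns identical
U_t = U + rho / np.sqrt(n) * rng.standard_normal((n, k))  # rho-perturbation

M = khatri_rao(U_t, U_t)               # n^2 x k matrix with columns u_i ⊗ u_i
sigma_k = np.linalg.svd(M, compute_uv=False)[-1]
print(sigma_k)                         # small but nonzero, as the smoothed bound predicts
```

Without the perturbation, σ_k of U ⊙ U is exactly zero here; with it, σ_k is polynomially bounded away from zero with high probability.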
Corollary 4.9 (Mixture of k spherical Gaussians in n ≥ k^ε dimensions). For any ε > 0 and η > 0, there is an algorithm that, in the smoothed analysis setting, learns the means of a mixture of k spherical Gaussians in n ≥ k^ε dimensions up to accuracy η, with running time and sample complexity poly(n, 1/η, 1/ρ)^{O(1/ε)}, and succeeds with probability at least 1 − exp(−Ω(n)).

In the smoothed analysis setting for hidden Markov models (HMMs), the model is generated using a randomly perturbed observation matrix Õ, obtained by adding independent Gaussian random vectors drawn from N(0, ρ²/n)^n to each column of O. These techniques also give similar smoothed analysis guarantees for learning HMMs in the overcomplete setting when n ≥ k^ε (using O(1/ε) consecutive observations), and under sufficient sparsity of the transition matrix. See [BCPV19] for details. Smoothed analysis results have also been obtained for other problems like overcomplete ICA [GVX14], learning mixtures of general Gaussians [GHK15], other algorithms for higher-order tensor decompositions [MSS16, BCPV19], and recovering assemblies of neurons [ADM+18].

The algorithm we have seen (based on simultaneous diagonalization) has provable guarantees in the quite general smoothed analysis setting. However, there are other considerations, like running time and noise tolerance, for which the algorithm is sub-optimal; for example, iterative heuristics like alternating least-squares or alternating minimization are more popular in practice because of faster running times [KB09]. There are several other algorithmic approaches for tensor decompositions that work under different conditions on the input. The natural considerations are the generality of the assumptions and the running time of the algorithm. The other important consideration is the robustness of the algorithm to noise or errors. I will briefly describe a selection of these algorithms, and comment along these axes. As we will discuss in the next section, the different algorithms are incomparable because of their different strengths and weaknesses along these three axes.
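For concreteness, here is a minimal sketch of the simultaneous diagonalization (Jennrich) approach in the noiseless setting. The function and variable names are ours, and this is an illustration only; a practical implementation would need the perturbation analysis and condition-number assumptions discussed earlier.

```python
import numpy as np

def jennrich(T, k, rng):
    """Recover the first factor matrix of T = sum_i u_i ⊗ v_i ⊗ w_i
    via simultaneous diagonalization of two random slices."""
    n3 = T.shape[2]
    a, b = rng.standard_normal(n3), rng.standard_normal(n3)
    Ta = np.einsum('ijk,k->ij', T, a)      # = U diag(W^T a) V^T
    Tb = np.einsum('ijk,k->ij', T, b)      # = U diag(W^T b) V^T
    # Columns of U are eigenvectors of Ta Tb^+, with (almost surely)
    # distinct eigenvalues (W^T a)_i / (W^T b)_i.
    vals, vecs = np.linalg.eig(Ta @ np.linalg.pinv(Tb))
    top = np.argsort(-np.abs(vals))[:k]    # keep the k nonzero eigenvalues
    return np.real(vecs[:, top])           # u_i's, up to permutation and scaling

rng = np.random.default_rng(1)
n, k = 8, 5
U, V, W = (rng.standard_normal((n, k)) for _ in range(3))
T = np.einsum('ir,jr,kr->ijk', U, V, W)
U_hat = jennrich(T, k, rng)
```

The same routine applied to permutations of the modes recovers the other two factor matrices; pairing the recovered columns across modes then yields the full decomposition.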
Tensor Power Method.
The tensor power method gives an alternate algorithm for symmetric tensors in the full-rank setting that is inspired by the matrix power method. The algorithm is designed for symmetric tensors T ∈ R^{n×n×n} with an orthogonal decomposition of rank k ≤ n of the form Σ_{i=1}^{k} λ_i v_i^{⊗3}, where the vectors v₁, . . . , v_k are orthonormal. Note that not all symmetric tensors have such an orthogonal decomposition. However, in many learning applications (where we have access to the second moment matrix), one can use a trick called whitening to reduce to the orthogonal decomposition case by a simple basis transformation.

The main component of the tensor power method is an iterative algorithm to find one term in the decomposition, which repeats the following power iteration update (after initializing randomly) until convergence:

z ← T(·, z, z) / ‖T(·, z, z)‖.

Here T(·, z, z) is the vector u with u(i) = Σ_{i₂,i₃} T(i, i₂, i₃) z(i₂) z(i₃). The algorithm then removes this component and recurses on the remaining tensor. This method is also known to be robust to inverse polynomial noise, and is known to converge quickly after the whitening. See [AGH+14] for such guarantees.
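A bare-bones version of this update-and-deflate loop (our own illustrative code, assuming an exactly orthogonal decomposition with positive λ_i and no noise) looks as follows.

```python
import numpy as np

def power_step(T, z):
    """One tensor power iteration: z <- T(., z, z) / ||T(., z, z)||."""
    u = np.einsum('ijk,j,k->i', T, z, z)
    return u / np.linalg.norm(u)

def orthogonal_decompose(T, k, rng, iters=50):
    """Recover T = sum_i lam_i v_i^{⊗3} (v_i orthonormal, lam_i > 0)
    by repeated power iteration plus deflation."""
    lams, vs = [], []
    for _ in range(k):
        z = rng.standard_normal(T.shape[0])
        z /= np.linalg.norm(z)
        for _ in range(iters):
            z = power_step(T, z)
        lam = np.einsum('ijk,i,j,k->', T, z, z, z)      # T(z, z, z)
        lams.append(lam)
        vs.append(z)
        T = T - lam * np.einsum('i,j,k->ijk', z, z, z)  # deflate this component
    return np.array(lams), np.array(vs)

rng = np.random.default_rng(0)
n, k = 6, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
V = Q[:, :k]                                   # orthonormal factors v_1, v_2, v_3
lam = np.array([3.0, 2.0, 1.0])
T = np.einsum('r,ir,jr,kr->ijk', lam, V, V, V)
lam_hat, V_hat = orthogonal_decompose(T, k, rng)
```

Each power iteration converges quadratically to one of the v_i; which component is found first depends on the random initialization, which is why we deflate and repeat k times.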
FOOBI algorithm and variants.
In a series of works, Cardoso and others [Car91, DLCC07] devised an algorithm, popularly called the FOOBI algorithm, for symmetric decompositions of overcomplete tensors of order 4 and above. At a technical level, the FOOBI algorithm finds rank-one tensors in a linear subspace by designing a "rank-1 detecting gadget". Recently, the FOOBI algorithm and its generalizations have been shown to be robust to inverse polynomial error in the smoothed analysis setting for order-2ℓ tensors up to rank k ≤ n^ℓ (see [MSS16, HSS19] for order 4 and [BCPV19] for higher even orders).

Alternating Minimization and Iterative Algorithms.
Recently, Anandkumar et al. [AGJ17] analyzed popular iterative heuristics like alternating minimization for overcomplete tensors of order 3, and gave some sufficient conditions for both local convergence and global convergence. Finally, a closely related non-convex problem is that of computing the "spectral norm", i.e., maximizing ⟨T, x^{⊗ℓ}⟩ subject to ‖x‖ = 1; under certain conditions one can show that the global maximizers are exactly the underlying factors. The optimization landscape of this problem for tensors has also been studied recently (see [GM17]). But these results all mainly apply to the case when the factors of the decomposition are randomly chosen, which is much less general than the smoothed analysis setting.

Sum-of-squares algorithms.
The sum-of-squares (SoS) hierarchy, or the Lasserre hierarchy, is a powerful family of algorithms based on semidefinite programming. Algorithms based on SoS typically consider a related polynomial optimization problem with polynomial inequalities. A key step in these arguments is to give a low-degree sum-of-squares proof of uniqueness; this is then "algorithmicized" using the SoS hierarchy. SoS-based algorithms are known to give guarantees that extend to overcomplete settings even for order-3 tensors (when the factors are random), and are known to have higher noise tolerance. In particular, they can handle order-3 symmetric tensors of rank k = Õ(n^{1.5}) when the factors are drawn randomly from the unit sphere [MSS16]. The SoS hierarchy also gives robust variants of the FOOBI algorithm, and gets quasi-polynomial time guarantees under other incomparable conditions [MSS16]. SoS-based algorithms are too slow in practice because of large polynomial running times. Some recent works explore an interesting middle ground; they design spectral algorithms that are inspired by these SoS hierarchies, but have faster running times (see e.g., [HSSS16]).

Discussion and Open questions
The different algorithms for tensor decompositions are incomparable because of their different strengths and weaknesses. A major advantage of SoS-based algorithms is their significantly better noise tolerance; in some settings it can go up to constant error measured in spectral norm (of an appropriate matrix flattening), while other algorithms can achieve inverse polynomial error tolerance at best. This is important particularly in learning applications, since there are significant modeling errors in practice. However, many of these results mainly work in restrictive settings where the factors are random (or incoherent). On the other hand, the algorithms based on simultaneous decompositions and variants of the FOOBI algorithm work in the significantly more general smoothed analysis setting, but their error tolerance is much poorer. Finally, iterative heuristics like alternating minimization are the most popular in practice because of their significantly faster running times; however, their known theoretical guarantees are significantly worse than those of the other methods.

Another direction where there is a large gap in our understanding concerns the conditions and limits for efficient recovery. This is particularly interesting under conditions that guarantee that the low-rank decomposition is (robustly) unique, as they imply learning guarantees. We list a few open questions in this space.

For the special case of 3-tensors in R^{n×n×n}, recall that Jennrich's algorithm needs the factors to be linearly independent, hence k ≤ n. On the other hand, Kruskal's uniqueness theorem (and its robust analogue) guarantees uniqueness even up to rank 3n/2 − 1. Kruskal in fact gave a more general sufficient condition for uniqueness in terms of what is known as the
Kruskal rank of a set of vectors [Kru77]. But there is no known algorithmic proof!
Open Problem.
Is there a (robust) algorithm for decomposing a 3-tensor T under the conditions of Kruskal's uniqueness theorem?

We also do not know if there is any smoothed polynomial time algorithm that works for rank (1 + ε)n for any constant ε > 0. Moreover, we know powerful statements, using ideas from algebraic geometry, that generic tensors of order 3 have unique decompositions up to rank Ω(n²) [CO12], with similar statements for higher order-ℓ tensors. Most known algorithmic results for tensor decompositions also end up recovering the decomposition (thereby proving uniqueness). However, even for order-3 tensors with random factors, there is a large gap between the conditions that guarantee uniqueness and the conditions that ensure tractability.

Open Problem.
Is there a (robust) algorithm for decomposing a 3-tensor T with random factors for rank k = ω(n^{3/2})?

Acknowledgments
I thank Aditya Bhaskara for many discussions related to the chapter, and Tim Roughgarden, Aidao Chen, Rong Ge and Paul Valiant for their comments on a preliminary draft of this chapter.
References

[ABG+14] Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, and James R. Voss. The more, the merrier: the blessing of dimensionality for learning large Gaussian mixtures. In Conference on Learning Theory, 2014.

[ADM+18] Nima Anari, Constantinos Daskalakis, Wolfgang Maass, Christos Papadimitriou, Amin Saberi, and Santosh Vempala. Smoothed analysis of discrete tensor decomposition and assemblies of neurons. In Advances in Neural Information Processing Systems, 2018.

[AGH+14] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15, 2014.

[AGJ17] Animashree Anandkumar, Rong Ge, and Majid Janzamin. Analyzing tensor power method dynamics in overcomplete regime. Journal of Machine Learning Research, 18, 2017.

[AMR09] Elizabeth S. Allman, Catherine Matias, and John A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics, 37, 2009.

[BCMV14] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. In
Symposium on the Theory of Computing (STOC), 2014.

[BCPV19] Aditya Bhaskara, Aidao Chen, Aidan Perreault, and Aravindan Vijayaraghavan. Smoothed analysis in unsupervised learning via decoupling. In Foundations of Computer Science (FOCS), 2019.

[BCV14] Aditya Bhaskara, Moses Charikar, and Aravindan Vijayaraghavan. Uniqueness of tensor decompositions with applications to polynomial identifiability. In Conference on Learning Theory, 2014.

[BS10] Mikhail Belkin and Kaushik Sinha. Polynomial learning of distribution families. In Foundations of Computer Science (FOCS). IEEE, 2010.

[Car91] J.-F. Cardoso. Super-symmetric decomposition of the fourth-order cumulant tensor. Blind identification of more sources than sensors. In Proceedings of ICASSP'91, 1991.

[CO12] L. Chiantini and G. Ottaviani. On generic identifiability of 3-tensors of small rank. SIAM Journal on Matrix Analysis and Applications, 33, 2012.

[DLCC07] L. De Lathauwer, J. Castaing, and J. Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Trans. on Signal Processing, 55, 2007.

[GHK15] Rong Ge, Qingqing Huang, and Sham M. Kakade. Learning mixtures of Gaussians in high dimensions. In Symposium on Theory of Computing, 2015.

[GM17] Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions. In Advances in Neural Information Processing Systems 30, 2017.

[GVX14] Navin Goyal, Santosh Vempala, and Ying Xiao. Fourier PCA and robust tensor decomposition. In
Symposium on Theory of Computing, 2014.

[Har70] Richard A. Harshman. Foundations of the PARAFAC procedure: models and conditions for an explanatory multimodal factor analysis. 1970.

[Hås90] Johan Håstad. Tensor rank is NP-complete. Journal of Algorithms, 11(4), 1990.

[HK13] Daniel Hsu and Sham M. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, 2013.

[HKZ12] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. Journal of Computer and System Sciences, 78(5), 2012.

[HL13] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. J. ACM, 60, 2013.

[HSS19] Samuel B. Hopkins, Tselil Schramm, and Jonathan Shi. A robust spectral algorithm for overcomplete tensor decomposition. In Proceedings of the Thirty-Second Conference on Learning Theory, 2019.

[HSSS16] Samuel B. Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: Tensor decomposition and planted sparse vectors. In STOC, 2016.

[KB09] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3), 2009.

[Kru77] Joseph B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2), 1977.

[Moi18] Ankur Moitra.
Algorithmic Aspects of Machine Learning. Cambridge University Press, 2018.

[MR06] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden Markov models. The Annals of Applied Probability, 2006.

[MSS16] Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. In IEEE Symposium on the Foundations of Computer Science, 2016.

[MV10] Ankur Moitra and Gregory Valiant. Settling the polynomial learnability of mixtures of Gaussians. In Foundations of Computer Science (FOCS), 2010.

[Rou20] Tim Roughgarden. Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press, 2020.

[SS90] G. W. Stewart and Ji-guang Sun. Matrix Perturbation Theory. Academic Press, 1990.

[SWZ19] Zhao Song, David P. Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In Symposium on Discrete Algorithms (SODA), 2019.

[TD18] Christopher Tosh and Sanjoy Dasgupta. Maximum likelihood estimation for mixtures of spherical Gaussians is NP-hard. Journal of Machine Learning Research, 18, 2018.
Exercises
1. The symmetric rank of a symmetric tensor T is the smallest integer r > 0 such that T can be expressed as T = Σ_{i=1}^{r} u_i^{⊗ℓ} for some {u_i}_{i=1}^{r}. Prove that for any symmetric tensor of order ℓ, the symmetric rank is at most 2^ℓ ℓ! times the rank of the tensor. Hint: For ℓ = 2, if u_i ⊗ v_i is a term in the decomposition, we can express u_i ⊗ v_i + v_i ⊗ u_i = ½ (u_i + v_i)^{⊗2} − ½ (u_i − v_i)^{⊗2}.
2. Let u, v ∈ R^n be two orthonormal vectors, and consider the tensor A = u ⊗ u ⊗ v + v ⊗ u ⊗ u + u ⊗ v ⊗ u. Prove that it has rank 3. Also prove that it can be arbitrarily well approximated by a rank-2 tensor. Hint: Try to express A as a difference of two symmetric rank one tensors with large entries (Frobenius norm Θ(m)), so that the error term is O(1/m).
3. Construct an example of a matrix U such that the Kruskal-rank of U ⊙ U is at most twice the Kruskal-rank of U. Hint: Express the identity matrix as Σ_i u_i u_i^T for two different orthonormal bases.
4. Prove Proposition 2.2.

5. Complete the proof of Claim 4.8, and hence Lemma 4.6, for ℓ = 2.

(Comon's conjecture asks if for every symmetric tensor, the symmetric rank is equal to the rank. A counterexample was shown recently by Shitov. It is open what the best gap between these two ranks can be as a function of ℓ.)
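The phenomenon in Exercise 2 (a rank-3 tensor that is a limit of rank-2 tensors, i.e., has border rank 2) can be checked numerically. The following sketch, with our own choice of u and v, implements the construction suggested by the hint.

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
outer3 = lambda a, b, c: np.einsum('i,j,k->ijk', a, b, c)
A = outer3(u, u, v) + outer3(v, u, u) + outer3(u, v, u)   # a rank-3 tensor

def rank2_approx(m):
    """Difference of two rank-one tensors approximating A up to O(1/m):
    m (u + v/m)^{⊗3} - m u^{⊗3} = A + O(1/m)."""
    a = u + v / m
    return m * outer3(a, a, a) - m * outer3(u, u, u)

errs = [np.linalg.norm(A - rank2_approx(m)) for m in (10, 100, 1000)]
print(errs)   # decreases roughly like sqrt(3)/m
```

Expanding m(u + v/m)^{⊗3} shows the first-order cross terms reproduce A exactly, leaving an error of Frobenius norm about √3/m.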