Randomized Alternating Least Squares for Canonical Tensor Decompositions: Application to a PDE with Random Data
MATTHEW J. REYNOLDS, ALIREZA DOOSTAN, AND GREGORY BEYLKIN

Abstract.
This paper introduces a randomized variation of the alternating least squares (ALS) algorithm for rank reduction of canonical tensor formats. The aim is to address the potential numerical ill-conditioning of least squares matrices at each ALS iteration. The proposed algorithm, dubbed randomized ALS, mitigates large condition numbers via projections onto random tensors, a technique inspired by well-established randomized projection methods for solving overdetermined least squares problems in a matrix setting. A probabilistic bound on the condition numbers of the randomized ALS matrices is provided, demonstrating reductions relative to their standard counterparts. Additionally, results are provided that guarantee comparable accuracy of the randomized ALS solution at each iteration. The performance of the randomized algorithm is studied with three examples, including manufactured tensors and an elliptic PDE with random inputs. In particular, for the latter, tests illustrate not only improvements in condition numbers, but also improved accuracy of the iterative solver for the PDE solution represented in a canonical tensor format.

1. Introduction
The approximation of multivariate functions is an essential tool for numerous applications, including computational chemistry [1, 24], data mining [2, 9, 24], and, recently, uncertainty quantification [11, 23]. For seemingly reasonable numbers of variables $d$, e.g. $O(10)$, reconstructing a function, for instance from its sampled values, requires computational costs that may be prohibitive. This is related to the so-called "curse of dimensionality." To mitigate this phenomenon, we require such functions to have special structure that can be exploited by carefully crafted algorithms. One such structure is that the function of interest $u(z_1, z_2, \dots, z_d)$, depending on variables $z_1, z_2, \dots, z_d$, admits a separated representation [3, 4, 24] of the form

(1.1)  $u(z_1, z_2, \dots, z_d) = \sum_{l=1}^{r} \sigma_l \, u_1^l(z_1)\, u_2^l(z_2) \cdots u_d^l(z_d).$

The number of terms, $r$, is called the separation rank of $u$ and is assumed to be small.
Key words and phrases. Separated representations; tensor decomposition; randomized projection; alternating least squares; canonical tensors; stochastic PDE.

This material is based upon work supported by the U.S. Department of Energy Office of Science, Office of Advanced Scientific Computing Research, under Award Number DE-SC0006402, and NSF grants DMS-1228359, DMS-1320919, and CMMI-1454601.

Any discretization of the univariate functions $u_j^l(z_j)$ in (1.1), with $u_{i_j j}^l = u_j^l\big(z_j^{i_j}\big)$, $i_j = 1, \dots, M_j$ and $j = 1, \dots, d$, leads to a Canonical Tensor Decomposition, or CTD,

(1.2)  $\mathbf{U} = U(i_1, \dots, i_d) = \sum_{l=1}^{r} \sigma_l\, u_{i_1}^l u_{i_2}^l \cdots u_{i_d}^l.$

The functions $u_j^l(z_j)$ in (1.1) and the corresponding vectors $u_{i_j}^l$ in (1.2) are normalized to unit norm so that the magnitude of the terms is carried by their positive s-values, $\sigma_l$. It is well understood that when the separation rank $r$ is independent of $d$, the computation costs and storage requirements of standard algebraic operations in separated representations scale linearly in $d$ [4]. For this reason, such representations are widely used for approximating high-dimensional functions. To keep the computation of CTDs manageable, it is crucial to maintain the separation rank as small as possible. Common operations involving CTDs, e.g. summations, lead to CTDs with separation ranks that may be larger than necessary. Therefore, a standard practice is to reduce the separation rank of a given CTD without sacrificing much accuracy, for which the workhorse algorithm is Alternating Least Squares (ALS); see, e.g., [3, 4, 6, 7, 21, 24, 31]. This algorithm optimizes the separated representation (in Frobenius norm) one direction at a time by solving least squares problems for each direction. The linear systems for each direction are obtained as normal equations by contracting over all tensor indices, $i = 1, \dots, d$, except those in the direction of optimization $k$.

It is well known that forming normal equations increases the condition number of the least squares problem; see, e.g., [16]. In this paper we investigate the behavior of the condition numbers of the linear systems that arise in the ALS algorithm, and propose an alternative formulation in order to avoid potential ill-conditioning. As we shall see later, the normal equations in the ALS algorithm are formed via the Hadamard (entry-wise) product of matrices for individual directions. We show that in order for the resulting matrix to be ill-conditioned, the matrices for all directions have to be ill-conditioned, and we obtain estimates of these condition numbers. To improve the conditioning of the linear systems, we propose a randomized version of ALS, called randomized ALS, where instead of contracting a tensor with itself (in all directions but one), we contract it with a tensor composed of random entries. We show that this random projection improves the conditioning of the linear systems. However, its straightforward use does not ensure monotonicity in error reduction, unlike in standard ALS. In order to restore monotonicity, we simply accept only random projections that do not increase the error.

Our interest in using CTDs stems from the efficiency of such representations in tackling the curse of dimensionality arising from the solution of PDEs with random data, as studied in the context of Uncertainty Quantification (UQ). In the probabilistic framework, uncertainties are represented via a finite number of random variables $z_j$ specified using, for example, available experimental data or expert opinion.
An important task is then to quantify the dependence of quantities of interest $u(z_1, \dots, z_d)$ on these random inputs. For this purpose, approximation techniques based on separated representations have recently been studied in [12, 25, 26, 11, 27, 13, 18, 23, 10, 17, 19] and proven effective in alleviating the curse of dimensionality.
The paper is organized as follows. In Section 2, we introduce our notation and provide background information on tensors, the standard ALS algorithm, and the random matrix theory used in this paper. In Section 3, we introduce randomized ALS and provide analysis of the algorithm's convergence and of the conditioning of the matrices used. Section 4 contains demonstrations of randomized ALS and comparisons with standard ALS on three examples. The most important of these examples provides background on uncertainty quantification and demonstrates the application of randomized ALS-based reduction as a step in finding the fixed point solution of a stochastic PDE. We conclude with a discussion of our new algorithm and future work in Section 5.
2. Notation and Background
2.1. Notation.
Our notation for tensors, i.e. $d$-directional arrays of numbers, is boldfaced uppercase letters, e.g. $\mathbf{F} \in \mathbb{R}^{M_1 \times \cdots \times M_d}$. These tensors are assumed to be in the CTD format,
$$\mathbf{F} = \sum_{l=1}^{r_F} s_l^F\, F_1^l \circ \cdots \circ F_d^l,$$
where the factors $F_k^l \in \mathbb{R}^{M_k}$ are vectors with a subscript denoting the directional index and a superscript the rank index, and $\circ$ denotes the standard vector outer product. We write operators in dimension $d$ as $\mathbb{A} = A(j_1, j_1'; \dots; j_d, j_d')$, while for standard matrices we use uppercase letters, e.g. $A \in \mathbb{R}^{N \times M}$. Vectors are represented using boldfaced lowercase letters, e.g. $\mathbf{c} \in \mathbb{R}^N$, while scalars are represented by lowercase letters. We perform three operations on CTDs: addition, inner product, and the application of a $d$-dimensional operator.

• When two CTDs are added together, all terms are joined into a single list and simply re-indexed. In such a case the nominal separation rank is the sum of the ranks of the components, i.e., if the CTDs are of ranks $\tilde r$ and $\hat r$, the output has rank $\tilde r + \hat r$.

• The inner product of two tensors in CTD format, $\tilde{\mathbf{F}}$ and $\hat{\mathbf{F}}$, is defined as
$$\big\langle \tilde{\mathbf{F}}, \hat{\mathbf{F}} \big\rangle = \sum_{\tilde l=1}^{\tilde r} \sum_{\hat l=1}^{\hat r} \tilde s_{\tilde l}\, \hat s_{\hat l}\, \big\langle \tilde F_1^{\tilde l}, \hat F_1^{\hat l} \big\rangle \cdots \big\langle \tilde F_d^{\tilde l}, \hat F_d^{\hat l} \big\rangle,$$
where the inner product $\langle \cdot, \cdot \rangle$ operating on vectors is the standard vector dot product.

• When applying a $d$-dimensional operator to a tensor in CTD format, we have
$$\mathbb{A}\mathbf{F} = \sum_{\hat l=1}^{r_{\mathbb{A}}} \sum_{\tilde l=1}^{r_F} s_{\hat l}^{\mathbb{A}}\, s_{\tilde l}^F\, \big(A_1^{\hat l} F_1^{\tilde l}\big) \circ \cdots \circ \big(A_d^{\hat l} F_d^{\tilde l}\big).$$

We use the symbol $\|\cdot\|$ to denote the standard spectral norm for matrices, as well as the Frobenius norm for tensors, $\|\mathbf{F}\|^2 = \langle \mathbf{F}, \mathbf{F}\rangle$, and $\|\cdot\|_1$ and $\|\cdot\|_2$ to denote the standard Euclidean $\ell_1$ and $\ell_2$ vector norms.

For analysis involving matrices we use three different types of multiplication in addition to the standard matrix multiplication. The Hadamard, or entry-wise, product of two matrices $A$ and $B$ is denoted by $A * B$. The Kronecker product of two matrices $A \in \mathbb{R}^{N_A \times M_A}$ and $B \in \mathbb{R}^{N_B \times M_B}$ is denoted by $A \otimes B$,
$$A \otimes B = \begin{pmatrix} A(1,1)B & \dots & A(1, M_A)B \\ \vdots & \ddots & \vdots \\ A(N_A, 1)B & \dots & A(N_A, M_A)B \end{pmatrix}.$$
The final type of matrix product we use, the Khatri-Rao product of two matrices $A \in \mathbb{R}^{N_A \times M}$ and $B \in \mathbb{R}^{N_B \times M}$, is denoted by $A \odot B$,
$$A \odot B = \big[\, A(:,1) \otimes B(:,1) \;\; A(:,2) \otimes B(:,2) \;\; \dots \;\; A(:,M) \otimes B(:,M) \,\big].$$
We also frequently use the maximal and minimal (non-zero) singular values of a matrix, denoted $\sigma_{\max}$ and $\sigma_{\min}$, respectively.
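To make the inner-product formula above concrete, the following minimal NumPy sketch (ours, not from the paper) evaluates $\langle \tilde{\mathbf{F}}, \hat{\mathbf{F}}\rangle$ by accumulating the entry-wise product of directional Gram matrices. It assumes, as a storage convention for illustration only, that the factors of each direction are collected as columns of a matrix.

```python
import numpy as np

def ctd_inner(s_F, F, s_G, G):
    """Inner product of two tensors in CTD format.

    s_F, s_G : 1-D arrays of s-values (lengths r_F, r_G).
    F, G     : lists of factor matrices; F[k] has shape (M_k, r_F),
               with column l holding the direction-k factor of term l.
    Uses <F, G> = sum_{l,m} s_F[l] s_G[m] prod_k <F_k^l, G_k^m>.
    """
    W = np.ones((len(s_F), len(s_G)))   # accumulates prod_k <F_k^l, G_k^m>
    for Fk, Gk in zip(F, G):
        W *= Fk.T @ Gk                  # Gram matrix between direction-k factors
    return s_F @ W @ s_G

def ctd_norm(s_F, F):
    """Frobenius norm of a CTD, ||F|| = sqrt(<F, F>)."""
    return np.sqrt(ctd_inner(s_F, F, s_F, F))
```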
2.2. ALS algorithm. Operations on tensors in CTD format lead to an increase of the nominal separation rank. This separation rank is not necessarily the smallest possible rank needed to represent the resulting tensor for a given accuracy. The ALS algorithm attempts to find an approximation to the tensor with minimal (or near minimal) separation rank. Specifically, given a tensor $\mathbf{G}$ in CTD format with separation rank $r_G$,
$$\mathbf{G} = \sum_{l=1}^{r_G} s_l^G\, G_1^l \circ \cdots \circ G_d^l,$$
and an acceptable error $\epsilon$, we attempt to find a representation
$$\mathbf{F} = \sum_{\tilde l=1}^{r_F} s_{\tilde l}^F\, F_1^{\tilde l} \circ \cdots \circ F_d^{\tilde l}$$
with lower separation rank, $r_F < r_G$, such that $\|\mathbf{F} - \mathbf{G}\| / \|\mathbf{G}\| < \epsilon$.

The standard ALS algorithm starts from an initial guess $\mathbf{F}$ with a small separation rank, e.g. $r_F = 1$. A sequence of least squares problems, one per direction, is then constructed and solved to update the representation. Given a direction $k$, we freeze the factors in all other directions to produce a least squares problem for the factors in direction $k$. This process is then repeated for all directions $k$. One cycle through all $k$ is called an ALS sweep. These ALS sweeps continue until the improvement in the residual $\|\mathbf{F} - \mathbf{G}\| / \|\mathbf{G}\|$ either drops below a certain threshold or reaches the desired accuracy, i.e. $\|\mathbf{F} - \mathbf{G}\| / \|\mathbf{G}\| < \epsilon$. If the residual is still above the target accuracy $\epsilon$, the separation rank $r_F$ is increased and we repeat the previous steps for constructing the representation with the new separation rank.

Specifically, as discussed in [4], the construction of the normal equations for direction $k$ can be thought of as taking the derivatives of the Frobenius norm $\|\mathbf{F} - \mathbf{G}\|$ with respect to the factors $F_k^{\tilde l}$, $\tilde l = 1, \dots, r_F$, and setting these derivatives to zero. This yields the normal equations

(2.1)  $B_k\, \mathbf{c}_{j_k} = \mathbf{b}_{j_k},$

where $j_k$ corresponds to the $j$-th entry of $F_k^{\tilde l}$ and $\mathbf{c}_{j_k} = c_{j_k}(\tilde l)$ is a vector indexed by $\tilde l$. Alternatively, the normal system (2.1) can be obtained by contracting all directions except the optimization direction $k$, so that the entries of the matrix $B_k$ are the Hadamard product of Gram matrices,

(2.2)  $B_k(\hat l, \tilde l) = \prod_{i \ne k} \big\langle F_i^{\tilde l}, F_i^{\hat l} \big\rangle,$

and, accordingly, the right-hand side is
$$\mathbf{b}_{j_k}(\hat l) = \sum_{l=1}^{r_G} s_l^G\, G_k^l(j_k) \prod_{i \ne k} \big\langle G_i^l, F_i^{\hat l} \big\rangle.$$
We solve (2.1) for $\mathbf{c}_{j_k}$ and use the solution to update $F_k^{\tilde l}$. Pseudocode for the ALS algorithm is provided in Algorithm 1, where max_rank and max_iter denote the maximum separation rank and the limit on the number of iterations. The threshold $\delta$ is used to decide whether the separation rank needs to be increased.
Algorithm 1: Alternating least squares algorithm for rank reduction

input: $\epsilon > 0$, $\delta > 0$, $\mathbf{G}$ with rank $r_G$, max_rank, max_iter
initialize a rank $r_F = 1$ tensor $\mathbf{F} = F_1^1 \circ \cdots \circ F_d^1$ with randomly generated $F_k^1$
while $r_F \le$ max_rank do
    iter = 1
    if $r_F > 1$ then
        add a random rank-one contribution to $\mathbf{F}$: $\mathbf{F} = \mathbf{F} + F_1^{r_F} \circ \cdots \circ F_d^{r_F}$
    end
    res $= \|\mathbf{F} - \mathbf{G}\| / \|\mathbf{G}\|$
    while iter $\le$ max_iter do
        res_old = res
        for $k = 1, \dots, d$ do
            solve $B_k \mathbf{c}_{j_k} = \mathbf{b}_{j_k}$ for every $j_k$ in direction $k$
            define $\mathbf{v}^{\tilde l} = \big(c_1(\tilde l), \dots, c_{M_k}(\tilde l)\big)$ for $\tilde l = 1, \dots, r_F$
            $s_{\tilde l}^F = \|\mathbf{v}^{\tilde l}\|_2$ for $\tilde l = 1, \dots, r_F$
            $F_k^{\tilde l}(j_k) = c_{j_k}(\tilde l) / s_{\tilde l}^F$ for $\tilde l = 1, \dots, r_F$
        end
        res $= \|\mathbf{F} - \mathbf{G}\| / \|\mathbf{G}\|$
        if res $< \epsilon$ then
            return $\mathbf{F}$
        else if $|$res $-$ res_old$| < \delta$ then
            break
        else
            iter = iter + 1
        end
    end
    $r_F = r_F + 1$
end
return $\mathbf{F}$

A potential pitfall of the ALS algorithm is poor conditioning of the matrix $B_k$, since the construction of normal equations squares the condition number, as is well known for matrix problems. An alternative that avoids the normal equations is mentioned in the review paper [24], but it is not feasible for problems with even moderately large dimension (e.g. $d = 5$).
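For illustration, a minimal NumPy sketch of one ALS sweep (the inner loop over $k$ in Algorithm 1) follows. It is ours, not the paper's implementation; it uses the factor-matrix storage convention of the earlier sketch and omits the rank-increase and stopping logic.

```python
import numpy as np

def als_sweep(s_F, F, s_G, G):
    """One standard ALS sweep via the normal equations (2.1)-(2.2).

    F[k], G[k] have shapes (M_k, r_F) and (M_k, r_G); s_F, s_G are s-value arrays.
    Updates s_F and F in place, one direction at a time.
    """
    d = len(F)
    for k in range(d):
        r_F = len(s_F)
        # B_k: Hadamard product of Gram matrices over all directions i != k.
        B = np.ones((r_F, r_F))
        # P(l, lhat) accumulates prod_{i != k} <G_i^l, F_i^lhat> for the RHS.
        P = np.ones((len(s_G), r_F))
        for i in range(d):
            if i == k:
                continue
            B *= F[i].T @ F[i]
            P *= G[i].T @ F[i]
        b = G[k] @ (s_G[:, None] * P)       # b[j_k, lhat], cf. (2.2) and below
        C = np.linalg.solve(B, b.T).T       # solve B_k c_{j_k} = b_{j_k} for all j_k
        s_F[:] = np.linalg.norm(C, axis=0)  # new s-values
        F[k] = C / s_F                      # normalized direction-k factors
    return s_F, F
```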
2.3. Estimate of condition numbers of least squares matrices.
It is an empirical observation that the condition number of the matrices $B_k$ is sometimes significantly better than the condition numbers of some of the Gram matrices comprising the Hadamard product in (2.2). In fact, we have the following lemma.
Lemma 1. Let $A$ and $B$ be Gram matrices with all diagonal entries equal to $1$. Then we have
$$\sigma_{\min}(B) \le \sigma_{\min}(A * B) \le \sigma_{\max}(A * B) \le \sigma_{\max}(B).$$
If the matrix $B$ is positive definite, then $\kappa(A * B) \le \kappa(B)$.

Since Gram matrices are symmetric non-negative definite, the proof of Lemma 1 follows directly from [22, Theorem 5.3.4]. This estimate implies that it is sufficient for only one of the matrices to be well conditioned to assure that the Hadamard product is also well conditioned. In other words, it is necessary for all directional Gram matrices to be ill-conditioned to cause ill-conditioning of the Hadamard product. Clearly, this situation can occur, and we address it in this paper.
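A small self-contained numerical illustration of Lemma 1 (ours, not from the paper) follows: an extremely ill-conditioned unit-diagonal Gram matrix $A$ is multiplied entry-wise with a well-conditioned one $B$, and the condition number of the product stays bounded by that of $B$.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_correlation(n, spread):
    # Gram matrix of n nearly collinear (small spread) or generic vectors,
    # normalized so that all diagonal entries equal 1.
    X = np.ones((n, n)) + spread * rng.standard_normal((n, n))
    G = X.T @ X
    dinv = 1.0 / np.sqrt(np.diag(G))
    return dinv[:, None] * G * dinv[None, :]

A = random_correlation(8, 1e-4)   # nearly rank one: very ill-conditioned
B = random_correlation(8, 1.0)    # generic: well-conditioned

# Consistent with Lemma 1, cond(A*B) does not exceed cond(B),
# even though cond(A) is huge.
print(np.linalg.cond(A), np.linalg.cond(B), np.linalg.cond(A * B))
```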
2.4. Modification of normal equations: motivation for randomized methods.
We motivate our approach by first considering an alternative to forming normal equations for ordinary matrices (excluding the QR factorization, which can easily be used for matrices). Given a matrix $A \in \mathbb{R}^{N \times n}$, $N \ge n$, we can multiply $A\mathbf{x} = \mathbf{b}$ by a matrix $R \in \mathbb{R}^{n' \times N}$ with independent random entries and solve

(2.3)  $RA\mathbf{x} = R\mathbf{b}$

instead (see, e.g., [20, 29, 30, 33]). The solution of this system, given that $R$ is of appropriate size (i.e., $n'$ is large enough), will be close to the least squares solution [29, Lemma 2]. In [29], (2.3) is used to form a preconditioner and an initial guess for solving $\min \|A\mathbf{x} - \mathbf{b}\|_2$ via a preconditioned conjugate gradient method. However, for our application we are interested in using equations of the form (2.3) in the Hadamard product in (2.2). We observe that $RA$ typically has a smaller condition number than $A^T A$. To see why, recall that for full-rank, square matrices $A$ and $B$, a bound on the condition number is $\kappa(AB) \le \kappa(A)\,\kappa(B)$. However, for rectangular full-rank matrices $A \in \mathbb{R}^{r' \times N}$ and $B \in \mathbb{R}^{N \times r}$, $r \le r' \le N$, this inequality does not necessarily hold. Instead, we have the inequality

(2.4)  $\kappa(AB) \le \kappa(A)\, \dfrac{\sigma_{\max}(B)}{\sigma_{\min}\big(P_{A^T}(B)\big)},$

where $P_{A^T}(B)$ is the projection of $B$ onto the row space of $A$ (for a proof of this inequality, see Appendix A). If $A$ has a small condition number (for example, when $A$ is a Gaussian random matrix; see [8, 14, 15]) and we were to assume $\sigma_{\min}(P_{A^T}(B))$ is close to $\sigma_{\min}(B)$, we obtain condition numbers smaller than $\kappa(B)^2$. The assumption that $\sigma_{\min}(P_{A^T}(B))$ is close to $\sigma_{\min}(B)$ is the same as assuming the columns of $B$ lie within the subspace spanned by the rows of $A$. This is achieved by choosing $r'$ to be larger than $r$ when $A$ is a randomized matrix.
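The following minimal sketch (ours) compares the conditioning of the normal-equations matrix $A^T A$ with that of a sketched system $RA$ as in (2.3), using a signed Bernoulli $R$ with $n'$ a small multiple of $n$; the specific sizes and the factor of 3 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

N, n = 2000, 50
# Build A with rapidly decaying singular values so that A is ill-conditioned.
U, _ = np.linalg.qr(rng.standard_normal((N, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -6, n)) @ V.T

n_prime = 3 * n
R = rng.choice([-1.0, 1.0], size=(n_prime, N))   # signed Bernoulli sketch

print("cond(A)     =", np.linalg.cond(A))
print("cond(A^T A) =", np.linalg.cond(A.T @ A))   # ~ cond(A)**2
print("cond(R A)   =", np.linalg.cond(R @ A))     # typically close to cond(A)
```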
2.5. Definitions and random matrix theory.
The main advantage of our approach is an improved condition number of the linear system solved at every step of the ALS algorithm. We use a particular type of random matrix to derive bounds on the condition number: the rows are independently distributed random vectors, but the columns are not (instead of the standard case where all entries are i.i.d.). Such matrices were studied extensively by Vershynin [32], and we rely heavily on this work for our estimates. To proceed, we need the following definitions from [32].
Remark. Definitions involving random variables, and vectors composed of random variables, are not consistent with the notation in the rest of the paper, outlined in Section 2.1.
Definition 3 ([32, Definition 5.7]). Let $\mathbb{P}\{\cdot\}$ denote the probability of an event and $\mathbb{E}$ the mathematical expectation operator. Also, let $X$ be a random variable that satisfies one of the three following equivalent properties:

1. $\mathbb{P}\{|X| > t\} \le \exp\big(1 - t^2/K_1^2\big)$ for all $t \ge 0$;
2. $\big(\mathbb{E}|X|^p\big)^{1/p} \le K_2\sqrt{p}$ for all $p \ge 1$;
3. $\mathbb{E}\exp\big(X^2/K_3^2\big) \le e$,

where the constants $K_i$, $i = 1, 2, 3$, differ from each other by at most an absolute constant factor (see [32, Lemma 5.5] for a proof of the equivalence of these properties). Then $X$ is called a sub-Gaussian random variable. The sub-Gaussian norm of $X$ is defined as the smallest $K_2$ in property 2, i.e.,
$$\|X\|_{\psi_2} = \sup_{p \ge 1} p^{-1/2}\big(\mathbb{E}|X|^p\big)^{1/p}.$$
Examples of sub-Gaussian random variables include Gaussian and Bernoulli random variables. We also present definitions for sub-Gaussian random vectors and their norm.
Definition 4 ([32, Definition 5.7]). A random vector $X \in \mathbb{R}^n$ is called a sub-Gaussian random vector if $\langle X, \mathbf{x}\rangle$ is a sub-Gaussian random variable for all $\mathbf{x} \in \mathbb{R}^n$. The sub-Gaussian norm of $X$ is subsequently defined as
$$\|X\|_{\psi_2} = \sup_{\mathbf{x} \in \mathcal{S}^{n-1}} \|\langle X, \mathbf{x}\rangle\|_{\psi_2},$$
where $\mathcal{S}^{n-1}$ is the unit Euclidean sphere.

Definition 5 ([32, Definition 5.19]). A random vector $X \in \mathbb{R}^n$ is called isotropic if its second moment matrix, $\Sigma = \Sigma(X) = \mathbb{E}\big[XX^T\big]$, is equal to the identity, i.e. $\Sigma(X) = I$. This definition is equivalent to $\mathbb{E}\langle X, \mathbf{x}\rangle^2 = \|\mathbf{x}\|_2^2$ for all $\mathbf{x} \in \mathbb{R}^n$.

The following theorem from [32] provides bounds on the condition numbers of matrices whose rows are independent sub-Gaussian isotropic random vectors.
Theorem 6. [32, Theorem 5.38]
Let $A$ be an $N \times n$ matrix whose rows $A(i,:)$ are independent, sub-Gaussian isotropic random vectors in $\mathbb{R}^n$. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-ct^2)$, one has

(2.5)  $\sqrt{N} - C\sqrt{n} - t \le \sigma_{\min}(A) \le \sigma_{\max}(A) \le \sqrt{N} + C\sqrt{n} + t.$

Here $C = C_K$ and $c = c_K > 0$ depend only on the sub-Gaussian norm $K = \max_i \|A(i,:)\|_{\psi_2}$ of the rows.
An outline of the proof of Theorem 6 will be useful for deriving our own results, so we provide a sketch in Appendix A. The following lemma is used to prove Theorem 6 and will also be useful later in the paper. We later modify it to prove a version of Theorem 6 that works for sub-Gaussian, non-isotropic random vectors.
Lemma 7. [32, Lemma 5.36]
Consider a matrix $B$ that satisfies
$$\big\|B^T B - I\big\| < \max\big(\delta, \delta^2\big)$$
for some $\delta > 0$. Then

(2.6)  $1 - \delta \le \sigma_{\min}(B) \le \sigma_{\max}(B) \le 1 + \delta.$

Conversely, if $B$ satisfies (2.6) for some $\delta > 0$, then $\big\|B^T B - I\big\| < 3\max\big(\delta, \delta^2\big)$.

3. Randomized ALS algorithm
3.1. Alternating least squares algorithm using random matrices.
We propose the following alternative to using the normal equations in ALS algorithms: instead of (2.2), define the entries of $B_k$ via randomized projections,

(3.1)  $B_k(\hat l, \tilde l) = \prod_{i \ne k} \big\langle F_i^{\tilde l}, R_i^{\hat l} \big\rangle,$

where $R_i^{\hat l}$ is the $\hat l$-th column of a matrix $R_i \in \mathbb{R}^{M_i \times r'}$, $r' > r$, with random entries corresponding to direction $i$. The choice of $r' > r$ is made to reduce the condition number of $B_k$. As shown in Section 3.3, as $r/r' \to 0$ the bound on $\kappa(B_k)$ goes to $\kappa(B_k) \le \kappa\big((B_k^{\mathrm{ALS}})^{1/2}\big)$, where $B_k^{\mathrm{ALS}}$ is the $B_k$ matrix for standard ALS, i.e. (2.2). In this paper we consider independent signed Bernoulli random variables, i.e., $R_i(j_k, \hat l)$ is either $-1$ or $1$, each with probability $1/2$. We have also had some success using standard Gaussian random variables in our experiments. The proposed change also alters the right-hand side of the normal equations (2.1),

(3.2)  $\mathbf{b}_{j_k}(\hat l) = \sum_{l=1}^{r_G} s_l^G\, G_k^l(j_k) \prod_{i \ne k} \big\langle G_i^l, R_i^{\hat l} \big\rangle.$

Equivalently, $B_k$ may be written as $B_k = \prod_{i \ne k} R_i^T F_i$, where the product is understood entry-wise (Hadamard). Looking ahead, we choose the random matrices $R_i$ such that $B_k$ is a tall, rectangular matrix. Solving the linear system (2.1) with rectangular $B_k$ requires a pseudo-inverse, computed via either the singular value decomposition (SVD) or a QR algorithm.

To further contrast the randomized ALS algorithm with the standard ALS algorithm, we highlight two differences. First, randomized ALS trades the monotonic reduction of approximation error (a property of the standard ALS algorithm) for better conditioning. To adjust, we use a simple tactic: if a randomized ALS sweep (over all directions) decreases the error, we keep the resulting approximation; otherwise, we discard the sweep, generate independent random matrices $R_i$, and rerun the sweep. Second, the randomized ALS algorithm can be more computationally expensive than the standard one. This is due to the rejection scheme outlined above and to the fact that $B_k$ in the randomized algorithm has a larger number of rows than its standard counterpart, i.e., $r' > r$. Pseudocode of our new algorithm is presented in Algorithm 2; a small code sketch of the randomized construction is given immediately below.
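The sketch below (ours, not the paper's implementation) shows one randomized sweep built from (3.1)-(3.2); it reuses the factor-storage convention of the earlier sketches, and the multiple c_mult = 25 mirrors the $(25\,r_F) \times r_F$ matrices used in the experiments of Section 4. In Algorithm 2 such a sweep is accepted only if it does not increase $\|\mathbf{F} - \mathbf{G}\|/\|\mathbf{G}\|$; otherwise new random matrices $R_i$ are drawn and the sweep is rerun.

```python
import numpy as np

def randomized_als_sweep(s_F, F, s_G, G, c_mult=25, rng=None):
    """One randomized ALS sweep: Gram matrices are replaced by projections
    onto signed Bernoulli matrices R_i with r' = c_mult * r_F columns."""
    if rng is None:
        rng = np.random.default_rng()
    d, r_F = len(F), len(s_F)
    r_prime = c_mult * r_F
    # One random matrix per direction, shared across the whole sweep.
    R = [rng.choice([-1.0, 1.0], size=(Fk.shape[0], r_prime)) for Fk in F]
    for k in range(d):
        B = np.ones((r_prime, r_F))
        P = np.ones((len(s_G), r_prime))
        for i in range(d):
            if i == k:
                continue
            B *= R[i].T @ F[i]          # entry-wise product, cf. (3.1)
            P *= G[i].T @ R[i]
        b = G[k] @ (s_G[:, None] * P)   # right-hand sides, cf. (3.2)
        # Rectangular B (r' x r_F): pseudo-inverse via least squares.
        C = np.linalg.lstsq(B, b.T, rcond=None)[0].T
        s_F[:] = np.linalg.norm(C, axis=0)
        F[k] = C / s_F
    return s_F, F
```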
Algorithm 2: Randomized alternating least squares algorithm for rank reduction

input: $\epsilon > 0$, $\mathbf{G}$ with rank $r_G$, max_tries, max_rank, max_iter
initialize a rank $r_F = 1$ tensor $\mathbf{F} = F_1^1 \circ \cdots \circ F_d^1$ with randomly generated $F_k^1$
while $r_F \le$ max_rank do
    tries = 1
    iter = 1
    construct the randomized tensor $\mathbf{R}$
    if $r_F > 1$ then
        add a random rank-one contribution to $\mathbf{F}$: $\mathbf{F} = \mathbf{F} + F_1^{r_F} \circ \cdots \circ F_d^{r_F}$
    end
    while iter $\le$ max_iter and tries $\le$ max_tries do
        $\mathbf{F}_{\mathrm{old}} = \mathbf{F}$
        for $k = 1, \dots, d$ do
            construct $B_k$ using (3.1)
            solve $B_k \mathbf{c}_{j_k} = \mathbf{b}_{j_k}$ for every $j_k$ in direction $k$
            define $\mathbf{v}^{\tilde l} = \big(c_1(\tilde l), \dots, c_{M_k}(\tilde l)\big)$ for $\tilde l = 1, \dots, r_F$
            $s_{\tilde l}^F = \|\mathbf{v}^{\tilde l}\|_2$ for $\tilde l = 1, \dots, r_F$
            $F_k^{\tilde l}(j_k) = c_{j_k}(\tilde l) / s_{\tilde l}^F$ for $\tilde l = 1, \dots, r_F$
        end
        if $\|\mathbf{F} - \mathbf{G}\| / \|\mathbf{G}\| < \epsilon$ then
            return $\mathbf{F}$
        else if $\|\mathbf{F}_{\mathrm{old}} - \mathbf{G}\| / \|\mathbf{G}\| < \|\mathbf{F} - \mathbf{G}\| / \|\mathbf{G}\|$ then
            $\mathbf{F} = \mathbf{F}_{\mathrm{old}}$
            tries = tries + 1
            iter = iter + 1
        else
            tries = 1
            iter = iter + 1
        end
    end
    $r_F = r_F + 1$
end
return $\mathbf{F}$

Remark. We have explored an alternative approach using projections onto random tensors, different from Algorithm 2. Instead of using $B_k$ in (3.1) to solve for $\mathbf{c}_{j_k}$, we use the QR factorization of $B_k$ to form a preconditioner matrix, similar to the approach of [29] for solving overdetermined least squares problems in a matrix setting. This preconditioner is used to improve the condition number of $B_k$ in (2.2). The approach is different from Algorithm 2: we solve the same equations as the standard ALS algorithm, but in a better conditioned manner. Solving the same equations preserves the monotone error reduction property of standard ALS.
With Algorithm 2 the equations we solve are different but, as shown in Section 3.2, the solutions of each least squares problem are close to those obtained by the standard ALS algorithm.

We provide convergence results and theoretical bounds on the condition number of $B_k$ with entries (3.1) in Sections 3.2 and 3.3, respectively. Additionally, in Section 4, we empirically demonstrate the superior conditioning properties of $B_k$ defined in (3.1) relative to those given by the standard ALS in (2.2).

3.2. Convergence of the randomized ALS algorithm.
Before deriving bounds on the condition number of (3.1), it is important to discuss the convergence properties of our algorithm. To do so for our tensor algorithm, we derive a convergence result similar to [33, Lemma 4.8]. In this analysis, we flatten our tensors into large matrices and use results from random matrix theory to show convergence. First, we construct the large matrices used in this section from (3.1). Writing the inner product as a sum allows us to group all the summations together; we have
$$B_k(\hat l, \tilde l) = \prod_{i \ne k} \sum_{j_i=1}^{M_i} F_i^{\tilde l}(j_i)\, R_i^{\hat l}(j_i)
= \sum_{j_1=1}^{M_1} \cdots \sum_{j_{k-1}=1}^{M_{k-1}} \sum_{j_{k+1}=1}^{M_{k+1}} \cdots \sum_{j_d=1}^{M_d} \big( F_1^{\tilde l}(j_1) \cdots \big)\big( R_1^{\hat l}(j_1) \cdots \big),$$
where we have expanded the product to obtain the sum of the products of individual entries. Introducing a multi-index $j = (j_1, \dots, j_{k-1}, j_{k+1}, \dots, j_d)$, we define two matrices, $A_k \in \mathbb{R}^{M \times r}$ and $R_k \in \mathbb{R}^{M \times r'}$, where $M = \prod_{i \ne k} M_i$ is large, i.e., we write
$$A_k(j, \tilde l) = F_1^{\tilde l}(j_1) \cdots F_{k-1}^{\tilde l}(j_{k-1})\, F_{k+1}^{\tilde l}(j_{k+1}) \cdots F_d^{\tilde l}(j_d),$$
$$R_k(j, \hat l) = R_1^{\hat l}(j_1) \cdots R_{k-1}^{\hat l}(j_{k-1})\, R_{k+1}^{\hat l}(j_{k+1}) \cdots R_d^{\hat l}(j_d).$$
We note that these matrices can also be written as Khatri-Rao products,

(3.3)  $A_k = F_1 \odot \cdots \odot F_{k-1} \odot F_{k+1} \odot \cdots \odot F_d, \qquad R_k = R_1 \odot \cdots \odot R_{k-1} \odot R_{k+1} \odot \cdots \odot R_d.$

Since $M$ is large, $M \gg r' > r$, both $A_k$ and $R_k$ are rectangular matrices. Similarly, we rewrite the vector $\mathbf{b}$ in (3.2),
$$\mathbf{b}_{j_k}(\hat l) = \sum_{l=1}^{r_G} s_l^G\, G_k^l(j_k) \sum_{j_1=1}^{M_1} \cdots \sum_{j_{k-1}=1}^{M_{k-1}} \sum_{j_{k+1}=1}^{M_{k+1}} \cdots \sum_{j_d=1}^{M_d} \big( G_1^l(j_1) \cdots \big)\big( R_1^{\hat l}(j_1) \cdots \big),$$
using the multi-index $j$, as
$$\mathbf{b}_k(j) = \sum_{l=1}^{r_G} s_l^G\, G_k^l(j_k)\, \big( G_1^l(j_1) \cdots G_{k-1}^l(j_{k-1})\, G_{k+1}^l(j_{k+1}) \cdots G_d^l(j_d) \big).$$
Using the introduced notation, $A_k$, $R_k$, and $\mathbf{b}_k$, we rewrite the normal equations (2.1) for direction $k$ and coordinate $j_k$ as

(3.4)  $A_k^T A_k \mathbf{c}_k = A_k^T \mathbf{b},$

and the randomized version of those equations as

(3.5)  $R_k^T A_k \mathbf{c}_k = R_k^T \mathbf{b}_k.$

We highlight a notable difference between the random matrix $R_k$ above and those found in the usual matrix settings, for instance in randomized least squares regression [20, 29, 30, 33]. Specifically, in the former, the entries of $R_k$ are not statistically independent and are products of random variables, whereas in the latter the entries are often i.i.d. realizations of single random variables. In the present work we utilize signed Bernoulli random variables, for which the entries of $R_k$ are dependent but still follow a signed Bernoulli distribution. Whether there exists an optimal choice of distribution for setting the entries of $R_k$ in the tensor case requires a careful examination that is beyond the scope of this paper.

Next, we present a convergence result showing that the solution to the least squares problem at each iteration of randomized ALS is close to the solution we would obtain using standard ALS. For ease of notation, we drop the subscript $k$ from $A$ and $R$.

Lemma 9.
Given $A \in \mathbb{R}^{M \times r}$ and $R \in \mathbb{R}^{M \times r'}$, where $r \le r' \le M$, and assuming that $\mathbf{x} \in \mathbb{R}^r$ is the solution that minimizes $\|R^T A\hat{\mathbf{x}} - R^T\mathbf{b}\|_2$ and $\mathbf{y} \in \mathbb{R}^r$ is the solution that minimizes $\|A\hat{\mathbf{y}} - \mathbf{b}\|_2$, then

(3.6)  $\|A\mathbf{x} - \mathbf{b}\|_2 \le \kappa\big(R^T Q\big)\, \|A\mathbf{y} - \mathbf{b}\|_2,$

where $Q \in \mathbb{R}^{M \times r_Q}$, $r_Q \le r + 1$, is a matrix with orthonormal columns from the QR factorization of the augmented matrix $[A \,|\, \mathbf{b}]$, and where $R^T Q$ is assumed to have full rank.

Proof. We form the augmented matrix $[A \,|\, \mathbf{b}]$ and find its QR decomposition, $[A \,|\, \mathbf{b}] = QT$, where $T = [T_A \,|\, T_b]$, $T_A \in \mathbb{R}^{r_Q \times r}$ and $T_b \in \mathbb{R}^{r_Q}$, and $Q \in \mathbb{R}^{M \times r_Q}$ has orthonormal columns. Therefore, we have $A = QT_A$ and $\mathbf{b} = QT_b$. Using these decompositions of $A$ and $\mathbf{b}$, we define a matrix $\Theta$ such that $\Theta R^T A = A$ and $\Theta R^T \mathbf{b} = \mathbf{b}$, and arrive at
$$\Theta = Q\big((R^T Q)^T (R^T Q)\big)^{-1} (R^T Q)^T,$$
assuming that $(R^T Q)^T (R^T Q)$ is invertible. We address this additional assumption when discussing the bounds on the extreme singular values of $R^T Q$ in Theorem 12. Starting from the left-hand side of (3.6), we have
$$\|A\mathbf{x} - \mathbf{b}\|_2 = \big\|\Theta R^T A\mathbf{x} - \Theta R^T\mathbf{b}\big\|_2 \le \|\Theta\|\, \big\|R^T A\mathbf{x} - R^T\mathbf{b}\big\|_2 \le \|\Theta\|\, \big\|R^T A\mathbf{y} - R^T\mathbf{b}\big\|_2.$$
Since multiplication by $A$ maps a vector to the column space of $A$, there exists $T_y \in \mathbb{R}^{r_Q}$ such that $A\mathbf{y} = QT_y$. Hence, we obtain
$$\|A\mathbf{x} - \mathbf{b}\|_2 \le \|\Theta\|\, \big\|R^T Q T_y - R^T Q T_b\big\|_2 \le \|\Theta\|\, \big\|R^T Q\big\|\, \|T_y - T_b\|_2 \le \|\Theta\|\, \big\|R^T Q\big\|\, \|A\mathbf{y} - \mathbf{b}\|_2,$$
where in the last step we used the orthonormality of the columns of $Q$.

Next we estimate the norms $\|\Theta\|$ and $\|R^T Q\|$. First, we decompose $R^T Q$ using the singular value decomposition, $R^T Q = U\Sigma V^T$. From the definition of the spectral norm we know $\|R^T Q\| = \sigma_{\max}(R^T Q)$. Using the SVD of $R^T Q$ and the definition of $\Theta$, along with our assumption of invertibility of $(R^T Q)^T(R^T Q)$, we write
$$\|\Theta\| = \Big\|Q\big((R^T Q)^T(R^T Q)\big)^{-1}(R^T Q)^T\Big\| = \big\|(V\Sigma^2 V^T)^{-1} V\Sigma U^T\big\| = \big\|V\Sigma^{-2}V^T V\Sigma U^T\big\| = \big\|V\Sigma^{-1}U^T\big\|.$$
Hence $\|\Theta\| = 1/\sigma_{\min}(R^T Q)$, and the bound is
$$\|A\mathbf{x} - \mathbf{b}\|_2 \le \frac{\sigma_{\max}(R^T Q)}{\sigma_{\min}(R^T Q)}\, \|A\mathbf{y} - \mathbf{b}\|_2 = \kappa\big(R^T Q\big)\, \|A\mathbf{y} - \mathbf{b}\|_2. \qquad \square$$

We later use results from [32] to bound $\kappa(R^T Q)$, since $R^T Q$ is a random matrix whose rows are independent from one another but whose columns are not. To use this machinery, specifically Theorem 6, we require the following lemma.

Lemma 10. $R^T Q$ is a random matrix with isotropic rows.

Proof. Using the second moment matrix, we show that the rows of $R^T Q$ are isotropic. Given a row of $R^T Q$ written in column form, $\big(R(:,\hat l)^T Q\big)^T = Q^T R(:,\hat l)$, we form the second moment matrix,
$$\mathbb{E}\big[Q^T R(:,\hat l)\, R(:,\hat l)^T Q\big] = Q^T\, \mathbb{E}\big[R(:,\hat l)\, R(:,\hat l)^T\big]\, Q,$$
and show that $\mathbb{E}\big[R(:,\hat l)\, R(:,\hat l)^T\big] = I_{M \times M}$. Hence $\mathbb{E}\big[Q^T R(:,\hat l)\, R(:,\hat l)^T Q\big] = Q^T Q = I_{r_Q \times r_Q}$, and the rows of $R^T Q$ are isotropic. From the Khatri-Rao product definition of the matrix $R$ in (3.3), we write a column of $R$ as
$$R(:,\hat l) = \bigotimes_{\substack{i=1 \\ i \ne k}}^{d} R_i(:,\hat l).$$
Therefore, using properties of the Kronecker product (see, e.g., [24, equation (2.2)]), we can switch the order of the regular matrix product and the Kronecker products,
$$R(:,\hat l)\, R(:,\hat l)^T = \bigotimes_{\substack{i=1 \\ i \ne k}}^{d} R_i(:,\hat l)\, R_i(:,\hat l)^T.$$
Taking the expectation and moving it inside the Kronecker product gives
$$\mathbb{E}\big[R(:,\hat l)\, R(:,\hat l)^T\big] = \bigotimes_{\substack{i=1 \\ i \ne k}}^{d} \mathbb{E}\big[R_i(:,\hat l)\, R_i(:,\hat l)^T\big] = \bigotimes_{\substack{i=1 \\ i \ne k}}^{d} I_{M_i \times M_i} = I_{M \times M}. \qquad \square$$

Since $R^T Q$ is a tall rectangular matrix ($R^T Q \in \mathbb{R}^{r' \times r_Q}$) with independent sub-Gaussian isotropic rows, we may use Theorem 5.39 from Vershynin to bound its extreme singular values.

Lemma 11.
For every $t \ge 0$, with probability at least $1 - 2\exp(-ct^2)$, we have

(3.7)  $\kappa\big(R^T Q\big) \le \dfrac{1 + C\sqrt{(r+1)/r'} + t/\sqrt{r'}}{1 - C\sqrt{(r+1)/r'} - t/\sqrt{r'}},$

where $C = C_K$ and $c = c_K > 0$ depend only on the sub-Gaussian norm $K = \max_i \big\|R(:,i)^T Q\big\|_{\psi_2}$ of the rows of $R^T Q$.

Proof. Using Lemma 10 and Theorem 6, we have the following bound on the extreme singular values of $R^T Q \in \mathbb{R}^{r' \times r_Q}$: for every $t \ge 0$, with probability at least $1 - 2\exp(-ct^2)$,
$$\sqrt{r'} - C\sqrt{r_Q} - t \le \sigma_{\min}\big(R^T Q\big) \le \sigma_{\max}\big(R^T Q\big) \le \sqrt{r'} + C\sqrt{r_Q} + t,$$
where $C = C_K$ and $c = c_K > 0$ depend only on the sub-Gaussian norm $K = \max_i \big\|(R^T Q)(i,:)\big\|_{\psi_2}$ of the rows of $R^T Q$. Since $r_Q \le r + 1$, we have
$$\sqrt{r'} - C\sqrt{r+1} - t \le \sigma_{\min}\big(R^T Q\big) \le \sigma_{\max}\big(R^T Q\big) \le \sqrt{r'} + C\sqrt{r+1} + t,$$
with the same probability. $\square$

We now state the convergence result.

Theorem 12.
Given $A \in \mathbb{R}^{M \times r}$ and $R \in \mathbb{R}^{M \times r'}$, where $r \le r' \le M$, and assuming that $\mathbf{x} \in \mathbb{R}^r$ is the solution that minimizes $\|R^T A\hat{\mathbf{x}} - R^T\mathbf{b}\|_2$ and $\mathbf{y} \in \mathbb{R}^r$ is the solution that minimizes $\|A\hat{\mathbf{y}} - \mathbf{b}\|_2$, then for every $t \ge 0$, with probability at least $1 - 2\exp(-ct^2)$, we have
$$\|A\mathbf{x} - \mathbf{b}\|_2 \le \frac{1 + C\sqrt{(r+1)/r'} + t/\sqrt{r'}}{1 - C\sqrt{(r+1)/r'} - t/\sqrt{r'}}\, \|A\mathbf{y} - \mathbf{b}\|_2,$$
where $C = C_K$ and $c = c_K > 0$ depend only on the sub-Gaussian norm $K = \max_i \big\|R(:,i)^T Q\big\|_{\psi_2}$. The matrix $Q \in \mathbb{R}^{M \times r_Q}$, $r_Q \le r + 1$, is composed of orthonormal columns from the QR factorization of the augmented matrix $[A \,|\, \mathbf{b}]$, and $R^T Q$ is assumed to have full rank.

Proof. The proof follows from bounding $\kappa(R^T Q)$ in Lemma 9 with Lemma 11 via Lemma 10. $\square$

3.3. Bounding the condition number of $B_k$. To bound the condition number of $B_k$, we use a modified version of Theorem 6. If the rows of $B_k$ were isotropic, then we could use Theorem 6 directly. However, unlike for $R^T Q$ in Lemma 10, this is not the case for $B_k$. While the second moment matrix $\Sigma$ is not the identity, it plays a special role in the bound of the condition number, since it is the matrix $B_k$ from the standard ALS algorithm. To see this, we take the $\hat l$-th row of $B_k$,
$$B_k(\hat l, :) = \Big\{\prod_{i \ne k} \big\langle F_i^{\tilde l}, R_i^{\hat l}\big\rangle\Big\}_{\tilde l = 1, \dots, n},$$
and form its second moment matrix,
$$\Sigma(l, l') = \mathbb{E}\Big[\prod_{i \ne k} \big\langle F_i^{l}, R_i^{\hat l}\big\rangle \big\langle F_i^{l'}, R_i^{\hat l}\big\rangle\Big]
= \prod_{i \ne k} \mathbb{E}\Big[\big(F_i^{l}\big)^T R_i^{\hat l}\big(R_i^{\hat l}\big)^T F_i^{l'}\Big]
= \prod_{i \ne k} \big(F_i^{l}\big)^T\, \mathbb{E}\Big[R_i^{\hat l}\big(R_i^{\hat l}\big)^T\Big]\, F_i^{l'}.$$
Since $R_i^{\hat l}$ is a vector composed of either signed Bernoulli or standard Gaussian random variables, $\mathbb{E}\big[R_i^{\hat l}(R_i^{\hat l})^T\big] = I$. Therefore, we are left with $\Sigma(l, l') = \prod_{i \ne k} (F_i^{l})^T F_i^{l'}$, which is exactly (2.2).

We need to modify Theorem 6 for matrices that have independent, non-isotropic rows. In [32, Remark 5.40] it is noted, for a random matrix $A \in \mathbb{R}^{N \times n}$ with non-isotropic rows, that we can apply Theorem 6 to $A\Sigma^{-1/2}$ instead of $A$. The matrix $A\Sigma^{-1/2}$ has isotropic rows and, thus, we obtain the following inequality that holds with probability at least $1 - 2\exp(-ct^2)$:

(3.8)  $\Big\|\dfrac{1}{N}A^T A - \Sigma\Big\| \le \max\big(\delta, \delta^2\big)\,\|\Sigma\|, \qquad \delta = C\sqrt{\dfrac{n}{N}} + \dfrac{t}{\sqrt{N}},$

and $C = C_K$, $c = c_K > 0$.

To clarify how (3.8) changes the bounds on the singular values $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ of a matrix $A$ with non-isotropic rows, we modify Lemma 7.

Lemma 13.
Consider matrices $B \in \mathbb{R}^{N \times n}$ and $\Sigma^{-1/2} \in \mathbb{R}^{n \times n}$ (non-singular) that satisfy

(3.9)  $1 - \delta \le \sigma_{\min}\big(B\Sigma^{-1/2}\big) \le \sigma_{\max}\big(B\Sigma^{-1/2}\big) \le 1 + \delta$

for some $\delta > 0$. Then we have the following bounds on the extreme singular values of $B$:
$$\sigma_{\min}\big(\Sigma^{1/2}\big)\,(1 - \delta) \le \sigma_{\min}(B) \le \sigma_{\max}(B) \le \sigma_{\max}\big(\Sigma^{1/2}\big)\,(1 + \delta).$$
The proof of Lemma 13 is included in Appendix A. Using Lemma 13, we observe that the bound on the condition number of a matrix $B$ satisfying (3.9) has the following form:
$$\kappa(B) \le \frac{1 + \delta}{1 - \delta}\, \kappa\big(\Sigma^{1/2}\big).$$
Using Lemma 13, we prove an extension of Theorem 6 for matrices with non-isotropic rows.
Theorem 14.
Let $A$ be an $N \times n$ matrix whose rows $A(i,:)$ are independent, sub-Gaussian random vectors in $\mathbb{R}^n$. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-ct^2)$, one has
$$\sigma_{\min}\big(\Sigma^{1/2}\big)\big(\sqrt{N} - C\sqrt{n} - t\big) \le \sigma_{\min}(A) \le \sigma_{\max}(A) \le \sigma_{\max}\big(\Sigma^{1/2}\big)\big(\sqrt{N} + C\sqrt{n} + t\big).$$
Here $C = C_K$ and $c = c_K > 0$ depend only on the sub-Gaussian norm $K = \max_i \|A(i,:)\|_{\psi_2}$ and on the norm of $\Sigma^{-1/2}$.

Proof. We form the second moment matrix $\Sigma$ using the rows $A(i,:)$ and apply Theorem 6 to the matrix $A\Sigma^{-1/2}$, which has isotropic rows. Therefore, for every $t \ge 0$, with probability at least $1 - 2\exp(-ct^2)$, we have

(3.10)  $\sqrt{N} - C\sqrt{n} - t \le \sigma_{\min}\big(A\Sigma^{-1/2}\big) \le \sigma_{\max}\big(A\Sigma^{-1/2}\big) \le \sqrt{N} + C\sqrt{n} + t,$

where $C = C_{\tilde K}$ and $c = c_{\tilde K} > 0$ depend only on the sub-Gaussian norm $\tilde K = \max_i \big\|\Sigma^{-1/2}A(i,:)^T\big\|_{\psi_2}$. Applying Lemma 13 to (3.10), with $B = A/\sqrt{N}$ and $\delta = C\sqrt{n/N} + t/\sqrt{N}$, results in the bound
$$\sigma_{\min}\big(\Sigma^{1/2}\big)\big(\sqrt{N} - C\sqrt{n} - t\big) \le \sigma_{\min}(A) \le \sigma_{\max}(A) \le \sigma_{\max}\big(\Sigma^{1/2}\big)\big(\sqrt{N} + C\sqrt{n} + t\big),$$
with the same probability as (3.10).

To move $\Sigma^{-1/2}$ outside the sub-Gaussian norm, we bound $\tilde K$ from above using the sub-Gaussian norm of $A$, $K = \max_i \|A(i,:)\|_{\psi_2}$:
$$\big\|\Sigma^{-1/2}A(i,:)^T\big\|_{\psi_2} = \sup_{\mathbf{x} \in \mathcal{S}^{n-1}} \big\|\big\langle \Sigma^{-1/2}A(i,:)^T, \mathbf{x}\big\rangle\big\|_{\psi_2}
= \sup_{\mathbf{x} \in \mathcal{S}^{n-1}} \big\|\big\langle A(i,:)^T, \Sigma^{-1/2}\mathbf{x}\big\rangle\big\|_{\psi_2}\, \frac{\|\Sigma^{-1/2}\mathbf{x}\|_2}{\|\Sigma^{-1/2}\mathbf{x}\|_2}
\le \sup_{\mathbf{y} \in \mathcal{S}^{n-1}} \big\|\big\langle A(i,:)^T, \mathbf{y}\big\rangle\big\|_{\psi_2} \sup_{\mathbf{x} \in \mathcal{S}^{n-1}} \big\|\Sigma^{-1/2}\mathbf{x}\big\|_2
= \big\|A(i,:)^T\big\|_{\psi_2}\, \big\|\Sigma^{-1/2}\big\|,$$
hence $\tilde K \le K\,\|\Sigma^{-1/2}\|$. Using this inequality, we bound the probability in (6.4) for the case of Theorem 6 applied to $A\Sigma^{-1/2}$:
$$\mathbb{P}\Big\{\max_{\mathbf{x} \in \mathcal{N}} \Big|\frac{1}{N}\big\|A\Sigma^{-1/2}\mathbf{x}\big\|_2^2 - 1\Big| \ge \epsilon\Big\}
\le 9^n \cdot 2\exp\Big[-\frac{c_1}{\tilde K^4}\big(C^2 n + t^2\big)\Big]
\le 9^n \cdot 2\exp\Big[-\frac{c_1}{K^4\|\Sigma^{-1/2}\|^4}\big(C^2 n + t^2\big)\Big]
\le 2\exp\Big[-\frac{c_1 t^2}{K^4\|\Sigma^{-1/2}\|^4}\Big].$$
The last step, as in the proof of Theorem 6, comes from choosing $C$ large enough, for example $C = K^2\|\Sigma^{-1/2}\|^2\sqrt{\ln(9)/c_1}$. $\square$

The combination of Lemma 13, the fact that $\Sigma$ for (3.1) is the same matrix as (2.2) (denoted below as $B_k^{\mathrm{ALS}}$), and Theorem 14 leads to our bound on the condition number of $B_k$ in (3.1): for every $t \ge 0$, with probability at least $1 - 2\exp(-ct^2)$,

(3.11)  $\kappa(B_k) \le \dfrac{1 + C\sqrt{r/r'} + t/\sqrt{r'}}{1 - C\sqrt{r/r'} - t/\sqrt{r'}}\; \kappa\Big(\big(B_k^{\mathrm{ALS}}\big)^{1/2}\Big),$

where the definitions of $C$ and $c$ are the same as in Theorem 14.

Remark. In both (3.7) and (3.11) the ratios $r/r'$ and $t/\sqrt{r'}$ are present.
As both ratios go to zero, our bound on the condition number of $B_k$ goes to $\kappa(B_k) \le \kappa\big((B_k^{\mathrm{ALS}})^{1/2}\big)$, and the bound on the condition number of $R^T Q$ goes to $\kappa(R^T Q) \le 1$. These properties explain our choice to set $r'$ as a constant multiple of $r$ in the randomized ALS algorithm. As with similar bounds for randomized matrix algorithms, these bounds are pessimistic; hence $r'$ does not have to be very large with respect to $r$ in order to get acceptable results.

Another reason to choose $r'$ as a constant multiple of $r$ is the construction of $R_k \in \mathbb{R}^{M \times r'}$ in (3.3) and our choice to use Bernoulli random numbers. If $r'$ is too small, there is a danger of the matrix $R_k$ becoming singular due to the size of $M$ and repeated rows. Choosing the constant multiple that defines $r'$ large enough helps mitigate this problem.

4. Examples
4.1. Sine function.
Our first test of the randomized ALS algorithm is to reduce a CTD generated from samples of the multivariate function $\sin(z_1 + \cdots + z_d)$. This reduction problem was studied in [4], where the output of the standard ALS algorithm suggested a new trigonometric identity yielding a rank $d$ separated representation of $\sin(z_1 + \cdots + z_d)$. As input, we use standard trigonometric identities to produce an initial CTD.

We ran tests using both standard ALS and the new randomized algorithm to reduce the separation rank of a CTD of samples of $\sin(z_1 + \cdots + z_d)$. The tests differed in that each one had a different random initial guess with separation rank $r_F = 1$. In this example we chose $d = 5$ and sampled each variable $z_i$, $i = 1, \dots, d$, with $M = 64$ equispaced samples in the interval $[0, \pi]$. Our input CTD for both algorithms was generated via a standard trigonometric identity. The reduction tolerance $\epsilon$ and the maximum number of iterations per rank, i.e. max_iter in Algorithms 1 and 2, were the same for both algorithms; for tests involving the standard ALS algorithm we additionally used a stuck tolerance $\delta$. To test the randomized ALS algorithm we used $B_k$ matrices of size $(25\,r_F) \times r_F$ and a fixed value of max_tries in Algorithm 2.

According to Lemma 2.4 in [4], there exist exact rank-$d$ separated representations of $\sin(z_1 + \cdots + z_d)$. We were able to find rank-$d$ approximations with both standard ALS and our randomized ALS whose relative errors were less than the requested $\epsilon$ (for a histogram of the residuals of the tests, see Figure 4.1). Due to the random initial guess $\mathbf{F}$ and our choices of parameters (in particular the stuck tolerance and max_tries), both algorithms had a small number of runs that did not find rank-$d$ approximations with the requested tolerance $\epsilon$. The randomized ALS algorithm produced fewer of these outcomes than standard ALS.

Large differences in the maximum condition numbers of the ALS solves are illustrated in Figure 4.2, where we compare tests of the standard and randomized ALS algorithms. We observe that the maximum condition numbers produced by the randomized ALS algorithm are much smaller than those from the standard ALS algorithm. This is consistent with our theory.

Furthermore, as shown in Figure 4.3, the number of iterations required for randomized ALS to converge was smaller than the number required by standard ALS. It is important to remember that the number of iterations required by the standard ALS algorithm to reduce a CTD can be optimized by adjusting the tolerance, stuck tolerance, and maximum number of iterations per rank. In these experiments we chose the stuck tolerance and maximum number of iterations to reduce the number of tests of the standard ALS algorithm that did not meet the requested tolerance $\epsilon$.
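As an illustration of how such an input CTD can be formed, the following sketch (ours, not necessarily the construction used in the paper) expands $\sin(z_1 + \cdots + z_d)$ via the angle-addition formula into a separated representation with $2^{d-1}$ terms and spot-checks it against direct evaluation.

```python
import numpy as np
from itertools import combinations

d, M = 5, 64
z = [np.linspace(0.0, np.pi, M) for _ in range(d)]

# sin(z1+...+zd) = Im prod_k exp(i z_k)
#   = sum over odd-size subsets S of (-1)^((|S|-1)/2)
#       * prod_{k in S} sin(z_k) * prod_{k not in S} cos(z_k),
# a separated representation with 2^(d-1) terms.
s_vals, factors = [], []            # factors[l] is a list of d direction vectors
for m in range(1, d + 1, 2):        # odd subset sizes
    for S in combinations(range(d), m):
        s_vals.append((-1.0) ** ((m - 1) // 2))
        factors.append([np.sin(z[k]) if k in S else np.cos(z[k]) for k in range(d)])

# Spot-check the CTD at a few random grid points.
rng = np.random.default_rng(2)
for _ in range(5):
    idx = rng.integers(0, M, size=d)
    ctd_val = sum(s * np.prod([f[k][idx[k]] for k in range(d)])
                  for s, f in zip(s_vals, factors))
    print(abs(ctd_val - np.sin(sum(z[k][idx[k]] for k in range(d)))))
```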
4.2. A manufactured tensor. Our next test is to compare the performance of the standard and randomized ALS algorithms on a manufactured random tensor example. To construct this example we generate factors by drawing $M = 128$ random samples from the standard Gaussian distribution. We chose $d = 10$ and set the separation rank of the input tensor to $r = 50$. We then normalized the factors and set the s-values of the tensor equal to $s_l = e^{-l}$, $l = 0, \dots, r - 1$, where $r$ was predetermined such that $s_{r-1}$ is small.

Similar to the sine example, we ran 500 experiments and requested the same accuracy $\epsilon$ from both algorithms, with a prescribed maximum number of iterations and, for standard ALS, a stuck tolerance. We used the following parameters for the randomized ALS algorithm: the $B_k$ matrices were of size $(25\,r_F) \times r_F$, and the repetition parameter max_tries in Algorithm 2 was fixed. We started all tests from randomized guesses with rank $r_F = 9$. This value was chosen because in all previous runs the reduced separation rank never fell below $r_F = 10$. Such an experiment allows us to compare how the algorithms perform when the initial approximation has rank greater than one.
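A sketch of the stated construction (ours) follows: $d = 10$ directions, $M = 128$ Gaussian samples per factor, separation rank $r = 50$, unit-norm factors, and s-values $s_l = e^{-l}$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, M, r = 10, 128, 50

factors = []
for _ in range(d):
    Fk = rng.standard_normal((M, r))
    Fk /= np.linalg.norm(Fk, axis=0)            # unit-norm columns
    factors.append(Fk)
s_vals = np.exp(-np.arange(r, dtype=float))     # s_0 = 1, ..., s_{r-1} ~ 5e-22

# Frobenius norm via the Gram-matrix formula used earlier.
W = np.ones((r, r))
for Fk in factors:
    W *= Fk.T @ Fk
print(np.sqrt(s_vals @ W @ s_vals))
```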
Figure 4.1.
Histograms displaying ALS reduction residuals, in log scale, for reducing the length of a CTD of samples of $\sin(z_1 + \cdots + z_5)$. The experiments shown in (a) used randomized ALS, whereas the experiments shown in (b) used standard ALS. We note that both algorithms produced a small number of results with approximation errors worse than the requested tolerance. However, the randomized ALS method produced fewer results that did not meet our requested tolerance.

We show in Figure 4.4 the output separation ranks from tests of both the randomized and standard ALS algorithms. The CTD outputs from randomized ALS had, on average, lower separation ranks than those from standard ALS. Furthermore, as seen in Figure 4.4, some of the output CTDs from the standard ALS algorithm had separation rank above $r_F = 35$. In these instances, standard ALS failed to reduce the separation rank of the input CTD, because simple truncation to $r_F = 35$ would have given double precision accuracy. These failures did not occur with the randomized ALS algorithm. We can also see the contrast in performance in Figure 4.5: all tests of the randomized ALS algorithm produced CTDs with reduced separation rank whose relative reduction errors were less than the accuracy $\epsilon$. Also in Figure 4.5, we observe instances where the standard ALS algorithm failed to output a reduced separation rank CTD with relative error less than $\epsilon$.
Figure 4.2.
Histogram showing the maximum condition numbers from our experiments reducing the length of a CTD of samples of $\sin(z_1 + \cdots + z_5)$. The condition numbers are shown in log scale; the solid gray pattern represents condition numbers from standard ALS, while the hatch pattern represents condition numbers from the randomized ALS algorithm.

Figure 4.3.
Histogram showing the number of iterations required by randomized ALS (hatch pattern) and the standard ALS algorithm (gray pattern) to reduce the length of a CTD of samples of $\sin(z_1 + \cdots + z_5)$.
There was a significant difference in the maximum condition numbers of the matrices used in the two algorithms. In Figure 4.6, we see that matrices produced by standard ALS had much larger condition numbers than their counterparts in the randomized ALS algorithm. Such large condition numbers may explain the failures of the standard ALS algorithm to output reduced separation rank CTDs with relative errors less than $\epsilon$.

From Figure 4.7, we see that in many cases standard ALS required fewer iterations than the randomized ALS algorithm to converge. However, relative to randomized ALS, standard ALS required a large number of iterations for a considerable fraction of the experiments.

4.3. Elliptic PDE with random coefficient.
As the key application of the randomized ALS algorithm, we consider the separated representation of the solution $u(\mathbf{x}, \mathbf{z})$ of the linear elliptic PDE

(4.1)  $-\nabla\cdot\big(a(\mathbf{x}, \mathbf{z})\nabla u(\mathbf{x}, \mathbf{z})\big) = 1, \quad \mathbf{x} \in \mathcal{D}, \qquad u(\mathbf{x}, \mathbf{z}) = 0, \quad \mathbf{x} \in \partial\mathcal{D},$

defined on the unit square $\mathcal{D} = (0,1) \times (0,1)$ with boundary $\partial\mathcal{D}$. The diffusion coefficient $a(\mathbf{x}, \mathbf{z})$ is considered random and is modeled by

(4.2)  $a(\mathbf{x}, \mathbf{z}) = \bar a + \sigma_a \sum_{k=1}^{d} \sqrt{\zeta_k}\, \varphi_k(\mathbf{x})\, z_k,$

where $\mathbf{z} = (z_1, \dots, z_d)$ and the random variables $z_k$ are independent and uniformly distributed over the interval $[-1, 1]$; we set $d = 5$ and choose fixed values of $\bar a$ and $\sigma_a$. In (4.2), $\{\zeta_k\}_{k=1}^d$ are the $d$ largest eigenvalues associated with $\{\varphi_k\}_{k=1}^d$, the $L_2(\mathcal{D})$-orthonormalized eigenfunctions of the exponential covariance function

(4.3)  $C_{aa}(\mathbf{x}_1, \mathbf{x}_2) = \exp\Big(-\dfrac{\|\mathbf{x}_1 - \mathbf{x}_2\|_1}{l_c}\Big),$

where $l_c$ denotes the correlation length. Given the choices of the parameters $\bar a$, $\sigma_a$, $d$, and $l_c$, the model in (4.2) leads to strictly positive realizations of $a(\mathbf{x}, \mathbf{z})$.

We discretize (4.1) in the spatial domain $\mathcal{D}$ via triangular finite elements of size $h$. This discretization, along with the affine representation of $a(\mathbf{x}, \mathbf{z})$ in $z_k$, yields the random linear system of equations

(4.4)  $\Big(K_0 + \sum_{k=1}^{d} K_k z_k\Big)\, \mathbf{u}(\mathbf{z}) = \mathbf{f}$

for the approximate vector of nodal solutions $\mathbf{u}(\mathbf{z}) \in \mathbb{R}^N$. The sparse matrices $K_0$ and $K_k$ are obtained from the finite element discretization of the differential operator in (4.1) with $a(\mathbf{x}, \mathbf{z})$ replaced by $\bar a$ and $\sigma_a\sqrt{\zeta_k}\,\varphi_k(\mathbf{x})$, respectively.

To fully discretize (4.4), we consider the representation of $\mathbf{u}(\mathbf{z})$ on a tensor-product grid $\{(z_1(j_1), \dots, z_d(j_d)) : j_k = 1, \dots, M_k\}$, where, for each $k$, the grid points $z_k(j_k)$ are selected to be Gauss-Legendre abscissas. In our numerical experiments we used the same number of abscissas, $M_k = M = 8$, for all $k = 1, \dots, d$. The discrete representation of (4.4) is then given by the tensor system of equations

(4.5)  $\mathbf{K}\mathbf{U} = \mathbf{F},$

where the linear operation $\mathbf{K}\mathbf{U}$ is defined as
$$\mathbf{K}\mathbf{U} = \sum_{\hat l = 0}^{d} \sum_{\tilde l = 1}^{r_U} s_{\tilde l}^U\, \big(K_0^{\hat l} U_0^{\tilde l}\big) \circ \big(K_1^{\hat l} U_1^{\tilde l}\big) \circ \cdots \circ \big(K_d^{\hat l} U_d^{\tilde l}\big), \qquad K_0^{\hat l} = K_{\hat l},$$
and, for $k = 1, \dots, d$,
$$K_k^{\hat l} = \begin{cases} D & \hat l = k \\ I_M & \hat l \ne k, \end{cases} \qquad D = \begin{pmatrix} z_k(1) & & 0 \\ & \ddots & \\ 0 & & z_k(M) \end{pmatrix},$$
where $I_M$ is the $M \times M$ identity matrix. The tensor $\mathbf{F}$ in (4.5) is defined as $\mathbf{F} = \mathbf{f} \circ \mathbf{1}_M \circ \cdots \circ \mathbf{1}_M$, where $\mathbf{1}_M$ is an $M$-vector of ones. We seek to approximate $\mathbf{U}$ in (4.5) with a CTD,

(4.6)  $\mathbf{U} = \sum_{\tilde l=1}^{r_U} s_{\tilde l}^U\, U_0^{\tilde l} \circ U_1^{\tilde l} \circ \cdots \circ U_d^{\tilde l},$

where the separation rank $r_U$ will be determined by a target accuracy. In (4.6), $U_0^{\tilde l} \in \mathbb{R}^N$ and $U_k^{\tilde l} \in \mathbb{R}^M$, $k = 1, \dots, d$. To solve (4.5), we use a fixed point iteration similar to those used for solving matrix equations and recently employed to solve tensor equations in [23]. In detail, the iteration starts with an initial tensor $\mathbf{U}_0$ of the form in (4.6). At each iteration $i$, $\mathbf{U}$ is updated according to $\mathbf{U}_{i+1} = (\mathbf{I} - \mathbf{K})\mathbf{U}_i + \mathbf{F}$, which requires $\|\mathbf{I} - \mathbf{K}\| < 1$. To assure this requirement is satisfied, we solve

(4.7)  $\mathbf{U}_{i+1} = c\,(\mathbf{F} - \mathbf{K}\mathbf{U}_i) + \mathbf{U}_i,$

where $c$ is chosen such that $\|\mathbf{I} - c\mathbf{K}\| < 1$. We compute the operator norm $\|\mathbf{I} - c\mathbf{K}\|$ via the power method; see, e.g., [4, 5].

One aspect of applying such an iteration to a CTD is an increase of the output separation rank. For example, if we take a tensor $\mathbf{U}$ of separation rank $r_U$ and use it as input for (4.7), one iteration would increase the rank to $r_F + (d+2)\,r_U$. Therefore, we require a reduction algorithm to decrease the separation rank as we iterate. This is where either the standard or the randomized ALS algorithm is required: to truncate the separated representation after each iteration. Both ALS methods work with a user-supplied truncation accuracy $\epsilon$, so we denote the reduction operator as $\tau_\epsilon$. Including this operator in our iteration, we have

(4.8)  $\mathbf{U}_{i+1} = \tau_\epsilon\big(c\,(\mathbf{F} - \mathbf{K}\mathbf{U}_i) + \mathbf{U}_i\big).$

Pseudocode for our fixed point algorithm is shown in Algorithm 3.
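A minimal sketch of the iteration (4.8) follows (ours, complementing Algorithm 3 below). The helper routines apply_K, reduce_rank, ctd_add, ctd_scale, and ctd_norm are hypothetical names for the separated operator application, the ALS-based truncation $\tau_\epsilon$ (standard or randomized), and the elementary CTD operations of Section 2.1.

```python
def fixed_point_solve(F, apply_K, reduce_rank, ctd_add, ctd_scale, ctd_norm,
                      c, mu, eps, max_iter=100):
    """Sketch of the fixed point iteration (4.7)-(4.8) for K U = F.

    apply_K(U)       : K U in CTD form (separation rank grows to (d+1)*rank(U)).
    reduce_rank(U, e): truncation operator tau_e based on (randomized) ALS.
    ctd_add/scale/norm : elementary CTD operations.
    c                : scaling chosen so that ||I - c K|| < 1 (power method).
    mu               : target relative residual ||F - K U|| / ||F||.
    """
    U = reduce_rank(F, eps)                          # simple initial guess
    for _ in range(max_iter):
        D = ctd_add(F, ctd_scale(apply_K(U), -1.0))  # residual D = F - K U
        if ctd_norm(D) / ctd_norm(F) < mu:
            break
        # U <- tau_eps( c*D + U ); truncation keeps the separation rank in check.
        U = reduce_rank(ctd_add(ctd_scale(D, c), U), eps)
    return U
```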
Figure 4.4.
Histograms showing output ranks from experiments in reducing the length of CTDs. (a) shows ranks of CTDs output by the randomized ALS algorithm; (b) shows ranks of CTDs output by the standard ALS algorithm. The CTDs output by the randomized ALS method typically have a smaller separation rank. In many examples the standard ALS algorithm required 40 terms, i.e. it failed, since truncation of the input tensor to $r_F = 35$ should give double precision.
Figure 4.5.
Histograms displaying ALS reduction errors, in log scale, for reduced-rank CTDs of the random tensor example. (a) shows that in our 500 tests the randomized ALS method always produced a result that met the required tolerance; (b) shows how the standard ALS algorithm fared with the same problem. Note that the standard ALS algorithm failed to reach the requested tolerance in a significant number of tests.
Figure 4.6.
Histogram showing the maximum condition numbers from experiments in reducing the length of CTDs of the random tensor example. The condition numbers of $B_k$ are shown in log scale; solid gray represents condition numbers from standard ALS, while the hatch pattern represents condition numbers from the randomized ALS algorithm. Similar to the sine example, the condition numbers from randomized ALS are much smaller than those from the standard ALS algorithm.
Figure 4.7.
Histograms showing iterations required to produce reduced-length CTDs for the random tensor example. (a) shows iterations required by randomized ALS, while (b) shows the iterations required by the standard ALS algorithm. As seen in (b), many examples using the standard ALS algorithm required large numbers of iterations to output CTDs.
Remark. In this example, the separation rank of $\mathbf{K}$ is directly related to the problem dimension $d$, i.e. $r_K = d + 1$, which is a consequence of using a Karhunen-Loeve-type expansion for the finite-dimensional noise representation of $a(\mathbf{x}, \mathbf{z})$. This will increase the computational cost of the algorithm to more than linear with respect to $d$, e.g. quadratic in $d$ when an iterative solver is used and $N \gg M$. Alternatively, one can obtain the finite-dimensional noise representation of $a(\mathbf{x}, \mathbf{z})$ by applying the separated rank reduction technique of this study to the stochastic differential operator itself, to possibly achieve $r_K < d$. The interested reader is referred to [3, 4] for more details.
Algorithm 3:
Fixed point iteration algorithm for solving (4.8)

input: $\epsilon > 0$, $\mu > 0$, operator $\mathbf{K}$, $\mathbf{F}$, $c$, max_iter, max_rank, $\delta > 0$ (for standard ALS), max_tries (for randomized ALS)
initialize a rank $r_U = 1$ tensor $\mathbf{U}_0 = U_1^1 \circ \cdots \circ U_d^1$, with either randomly generated $U_k^1$ or $U_k^1$ generated from the solution of the nominal equations
$\mathbf{D}_0 = \mathbf{F} - \mathbf{K}\mathbf{U}_0$
res $= \|\mathbf{D}_0\| / \|\mathbf{F}\|$
iter = 0
while res $> \mu$ do
    iter = iter + 1
    $\mathbf{U}_{\mathrm{iter}} = c\,\mathbf{D}_{\mathrm{iter}-1} + \mathbf{U}_{\mathrm{iter}-1}$
    $\mathbf{U}_{\mathrm{iter}} = \tau_\epsilon(\mathbf{U}_{\mathrm{iter}})$
    $\mathbf{D}_{\mathrm{iter}} = \mathbf{F} - \mathbf{K}\mathbf{U}_{\mathrm{iter}}$
    res $= \|\mathbf{D}_{\mathrm{iter}}\| / \|\mathbf{F}\|$
end
return $\mathbf{U}_{\mathrm{iter}}$

First, we examine the convergence of the iterative algorithm given a fixed ALS reduction tolerance in Figure 4.8. The randomized ALS method converges to more accurate solutions in all of these tests (see Table 1). However, the ranks of the randomized ALS solutions are larger than the ranks required for solutions produced by the standard ALS algorithm.

In Figure 4.9, we observe different behavior in the relative residuals when using fixed ranks instead of fixed accuracies. For these experiments the ALS-based linear solve using the standard algorithm outperforms the randomized version, except in the rank $r = 30$ case (see Table 2). In this case, the standard ALS algorithm has issues reaching the requested ALS reduction tolerance, leading to convergence problems in the iterative linear solve. The randomized ALS algorithm does not have the same difficulty with the rank $r = 30$ example. The difference in performance between standard and randomized ALS for this example corresponds to a significant difference between the maximum condition numbers of $B_k$: for the $r = 30$ case, the maximum condition number of the $B_k$ matrices generated by randomized ALS was substantially smaller than the maximum condition number of the $B_k$ matrices generated by standard ALS.

5. Discussion and conclusions
Figure 4.8. Residual error vs. iteration for the linear solvers. The black lines represent linear solve residuals where standard ALS was used for reduction, while the gray lines represent linear solve residuals where randomized ALS was used for reduction. In the three examples shown, the ALS tolerances for both standard and randomized ALS were set to a different value for the curves labeled (a), (b), and (c).

Figure 4.9. Relative residuals of the linear solves vs. fixed point iteration number. (a) Two linear solves are shown: solves for fixed ranks r = 10 and r = 20. Gray lines are residuals corresponding to reductions with randomized ALS, and black lines correspond to reductions with the standard ALS algorithm. (b) One experiment with r = 30 and the same color scheme as in (a).

ALS type      ALS tol   max κ(B_k)   max rank   rank   residual
standard          ·          ·           ·        ·        ·
                  ·          ·          13       11        ·
                  ·          ·          37       34        ·
randomized        ·          ·           ·        ·        ·
                  ·          ·          22       19        ·
                  ·          ·          57       54        ·

Table 1. Ranks, maximum condition numbers, and final relative residual errors of the experiments with fixed ALS tolerance.

ALS type      ALS tol   max κ(B_k)   rank   residual
standard          ·          ·        10        ·
                  ·          ·        20        ·
                  ·          ·        30        ·
randomized        ·          ·        10        ·
                  ·          ·        20        ·
                  ·          ·        30        ·

Table 2. Maximum condition numbers and final relative residual errors of the experiments with fixed separation ranks.

5. Discussion and conclusions

We have proposed a new ALS algorithm for reducing the rank of tensors in canonical format that relies on projections onto random tensors. Tensor rank reduction is one of the primary operations for approximations with tensors. Additionally, we have presented a general framework for the analysis of this new algorithm. The benefit of using such random projections is the improved conditioning of the matrices associated with the least squares problem at each ALS iteration. While significant reductions of condition numbers may be achieved, unlike in the standard ALS, the application of random projections results in a loss of monotonic error reduction. In order to restore monotonicity, we have employed a simple rejection approach, wherein several random tensors are applied and only those that do not increase the error are accepted. This, however, comes at the expense of additional computational cost as compared to the standard ALS algorithm. Finally, a set of numerical experiments has been studied to illustrate the efficiency of the randomized ALS in improving the numerical properties of its standard counterpart.

The optimal choice of random variables to use in the context of projecting onto random tensors is a question to be addressed in future work. In our examples we have used signed Bernoulli random variables, a choice that worked well with both our numerical experiments and analysis. On the other hand, the limitations of such a construction of random tensors have been discussed, which motivates further investigation. Another topic of interest for future work is the extension of the proposed randomized framework to other tensor formats, including the Tucker [24] and tensor-train [28] formats.

Finally, we have suggested an alternative approach to using projections onto random tensors that merits further examination. This approach uses the QR factorization to construct a preconditioner for the least squares problem at each ALS iteration. Hence it solves the same equations as the standard ALS, but the matrices have better conditioning. Also, because it solves the same equations, the monotonic error reduction property is preserved. This is an important distinction from randomized ALS, which solves different linear systems, but whose solutions are close to the solutions from standard ALS.
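For a generic least squares problem min_x ||Ax − b||, the preconditioning idea can be sketched as follows: rather than solving the normal equations A^T A x = A^T b directly, the R factor of a QR factorization of A is used to precondition them, which drives the condition number of the preconditioned matrix toward one. This is a generic illustration of that idea, not the ALS-specific construction of the paper.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10)) @ np.diag(np.logspace(0, 6, 10))  # ill-conditioned columns
b = rng.standard_normal(200)

B = A.T @ A                                   # normal-equations matrix (analogue of B_k)
print(np.linalg.cond(B))                      # roughly cond(A)^2, very large

Q, R = np.linalg.qr(A)                        # thin QR of the design matrix
Rinv = np.linalg.inv(R)
B_pre = Rinv.T @ B @ Rinv                     # preconditioned matrix, condition number near 1
print(np.linalg.cond(B_pre))

y = np.linalg.solve(B_pre, Rinv.T @ (A.T @ b))
x = Rinv @ y                                  # same least squares solution as the normal equations
print(np.linalg.norm(A.T @ (A @ x - b)) / np.linalg.norm(A.T @ b))  # small relative residual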
6. Appendix A
First, we prove (2.4).
Proof.
To bound the condition number of AB, we bound σ_max(AB) from above and σ_min(AB) from below. The bound on σ_max(AB) is straightforward; it follows from the properties of the 2-norm,

    σ_max(AB) ≤ σ_max(A) σ_max(B).

To bound σ_min(AB), we first note that A A^T is nonsingular and write σ_min(AB) as

    σ_min(AB) = || A A^T (A A^T)^{−1} A B x* ||,

where x*, with ||x*|| = 1, is the vector at which the minimum of the norm is attained (see the minimax definition of singular values, e.g., [22, Theorem 3.1.2]). If we define y = A^T (A A^T)^{−1} A B x*, then

    σ_min(AB) = (||A y|| / ||y||) · ||y|| ≥ σ_min(A) · ||y||,

again from the minimax definition of singular values. To bound ||y||, we observe that A^T (A A^T)^{−1} A B is the projection of B onto the row space of A. Denoting this projection by P_{A^T}(B), we have

    ||y|| = || P_{A^T}(B) x* || ≥ σ_min( P_{A^T}(B) ),

since ||x*|| = 1. Combining the bounds on the largest and smallest singular values gives the bound on the condition number,

    κ(AB) ≤ σ_max(A) σ_max(B) / ( σ_min(A) σ_min(P_{A^T}(B)) ) = κ(A) · σ_max(B) / σ_min(P_{A^T}(B)).   □
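As a quick numerical sanity check of this bound (not part of the original argument), one can compare both sides for random matrices with A of full row rank:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))               # full row rank, so A A^T is nonsingular
B = rng.standard_normal((8, 5))

def cond(M):
    s = np.linalg.svd(M, compute_uv=False)
    return s[0] / s[-1]

P = A.T @ np.linalg.solve(A @ A.T, A @ B)     # projection of B onto the row space of A
lhs = cond(A @ B)
rhs = cond(A) * np.linalg.svd(B, compute_uv=False)[0] / np.linalg.svd(P, compute_uv=False)[-1]
print(lhs <= rhs, lhs, rhs)                   # the bound (2.4) holds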
The proof of Theorem 6 is broken down into three steps in order to control ||A x|| for all x on the unit sphere: an approximation step, where the unit sphere is covered using a finite epsilon-net N (see [32, Section 5.2.2] for background on nets); a concentration step, where tight bounds are applied to ||A x|| for every x ∈ N; and a final step, where a union bound is taken over all the vectors x ∈ N.

Proof. (of Theorem 6) Vershynin observes that if we set B in Lemma 7 to A/√N, the bounds on the extreme singular values σ_min(A) and σ_max(A) in (2.5) are equivalent to

(6.1)    || (1/N) A^T A − I || ≤ max(δ, δ^2) =: ε,    where δ = C √(n/N) + t/√N.

In the approximation step of the proof, he chooses a 1/4-net N to cover the unit sphere S^{n−1}. Evaluating the operator norm in (6.1) on the net N, it is sufficient to show

    max_{x ∈ N} | (1/N) ||A x||^2 − 1 | < ε/2,

with the required probability, to prove the theorem.

Starting the concentration step, [32] defines Z_i = ⟨A_i, x⟩, where A_i is the i-th row of A and ||x|| = 1. Hence, the vector norm may be written as

(6.2)    ||A x||^2 = Σ_{i=1}^{N} Z_i^2.

Using an exponential deviation inequality to control (6.2), and the fact that K ≥ 1/√2, the following probabilistic bound holds for a fixed x ∈ S^{n−1}:

(6.3)    P{ | (1/N) ||A x||^2 − 1 | ≥ ε/2 } = P{ | (1/N) Σ_{i=1}^{N} Z_i^2 − 1 | ≥ ε/2 }
             ≤ 2 exp[ −(c_1/K^4) min(ε^2, ε) N ] = 2 exp[ −(c_1/K^4) δ^2 N ]
             ≤ 2 exp[ −(c_1/K^4) (C^2 n + t^2) ],

where c_1 is an absolute constant.

Finally, (6.3) is applied to every vector x ∈ N, resulting in the union bound

(6.4)    P{ max_{x ∈ N} | (1/N) ||A x||^2 − 1 | ≥ ε/2 } ≤ 9^n · 2 exp[ −(c_1/K^4) (C^2 n + t^2) ] ≤ 2 exp( −c_1 t^2 / K^4 ),

where we arrive at the second inequality by choosing a sufficiently large C = C_K ([32] gives the example C = K^2 √(ln(9)/c_1)).   □
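Although not used in the proof, the quantity controlled in (6.1) is easy to probe numerically. For a tall matrix with independent ±1 entries, the signed Bernoulli variables used in the paper's experiments, (1/N) A^T A concentrates around the identity as N grows, consistent with δ = C √(n/N) + t/√N:

import numpy as np

rng = np.random.default_rng(0)
n = 20
for N in (100, 1000, 10000, 100000):
    A = rng.choice([-1.0, 1.0], size=(N, n))          # signed Bernoulli entries
    dev = np.linalg.norm(A.T @ A / N - np.eye(n), 2)  # the quantity bounded in (6.1)
    print(N, dev)                                     # decays roughly like sqrt(n/N)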
We now prove Lemma 13.

Proof. To prove this lemma we use the following inequality, derived from (3.9),

    (1 − δ) ≤ σ_min( Σ^{−1/2} B^T B Σ^{−1/2} ) ≤ σ_max( Σ^{−1/2} B^T B Σ^{−1/2} ) ≤ (1 + δ).

First we bound σ_max(B)^2 from above:

    σ_max(B)^2 ≤ || Σ^{1/2} || · || Σ^{−1/2} B^T B Σ^{−1/2} || · || Σ^{1/2} ||
             ≤ || Σ || · σ_max( Σ^{−1/2} B^T B Σ^{−1/2} )
             ≤ σ_max( Σ ) · (1 + δ),

implying σ_max(B) ≤ σ_max( Σ^{1/2} ) · (1 + δ)^{1/2}. Second we bound σ_min(B)^2 from below:

(6.5)    σ_min(B)^2 = σ_min( Σ^{1/2} Σ^{−1/2} B^T B Σ^{−1/2} Σ^{1/2} )
                   ≥ σ_min( Σ ) · σ_min( Σ^{−1/2} B^T B Σ^{−1/2} )
                   ≥ σ_min( Σ ) · (1 − δ),

implying σ_min(B) ≥ σ_min( Σ^{1/2} ) · (1 − δ)^{1/2}. The first inequality in (6.5) is from [22, prob. 3.3.12]. Finally, using properties of singular values, we combine the inequalities:

    σ_min( Σ^{1/2} ) · (1 − δ)^{1/2} ≤ σ_min(B) ≤ σ_max(B) ≤ σ_max( Σ^{1/2} ) · (1 + δ)^{1/2}.   □

References
[1] C.J. Appellof and E.R. Davidson. Strategies for analyzing data from video fluorometric monitoring of liquid chromatographic effluents. Analytical Chemistry, 53(13):2053–2056, 1982.
[2] B.W. Bader, M.W. Berry, and M. Browne. Discussion tracking in Enron email using PARAFAC. In Survey of Text Mining II, pages 147–163. Springer London, 2008.
[3] G. Beylkin and M. J. Mohlenkamp. Numerical operator calculus in higher dimensions. Proc. Natl. Acad. Sci. USA, 99(16):10246–10251, August 2002.
[4] G. Beylkin and M. J. Mohlenkamp. Algorithms for numerical analysis in high dimensions. SIAM J. Sci. Comput., 26(6):2133–2159, July 2005.
[5] D. J. Biagioni, D. Beylkin, and G. Beylkin. Randomized interpolative decomposition of separated representations. Journal of Computational Physics, 281:116–134, 2015.
[6] R. Bro. PARAFAC. Tutorial & applications. In Chemom. Intell. Lab. Syst., Special Issue 2nd Internet Conf. in Chemometrics (INCINC'96), volume 38, pages 149–171, 1997.
[7] J. D. Carroll and J. J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of Eckart-Young decomposition. Psychometrika, 35:283–320, 1970.
[8] Z. Chen and J. Dongarra. Condition number of Gaussian random matrices. SIAM J. Matrix Anal. Appl., 27(3):603–620, 2005.
[9] P.A. Chew, B.W. Bader, T.G. Kolda, and A. Abdelali. Cross-language information retrieval using PARAFAC2. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07, pages 143–152, New York, NY, USA, 2007. ACM.
[10] F. Chinesta, P. Ladeveze, and E. Cueto. A short review on model order reduction based on proper generalized decomposition. Archives of Computational Methods in Engineering, 18(4):395–404, 2011.
[11] A. Doostan and G. Iaccarino. A least-squares approximation of partial differential equations with high-dimensional random inputs. Journal of Computational Physics, 228(12):4332–4345, 2009.
[12] A. Doostan, G. Iaccarino, and N. Etemadi. A least-squares approximation of high-dimensional uncertain systems. Technical Report Annual Research Brief, Center for Turbulence Research, Stanford University, 2007.
[13] A. Doostan, A. Validi, and G. Iaccarino. Non-intrusive low-rank separated approximation of high-dimensional stochastic models. Comput. Methods Appl. Mech. Engrg., 263:42–55, 2013.
[14] A. Edelman. Eigenvalues and condition numbers of random matrices. SIAM J. Matrix Anal. Appl., 9(4):543–560, October 1988.
[15] A. Edelman. Eigenvalues and condition numbers of random matrices. Ph.D. thesis, Massachusetts Institute of Technology, 1989.
[16] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 3rd edition, 1996.
[17] L. Grasedyck, D. Kressner, and C. Tobler. A literature survey of low-rank tensor approximation techniques. CoRR, abs/1302.7121, 2013.
[18] W. Hackbusch. Tensor Spaces and Numerical Tensor Calculus, volume 42. Springer, 2012.
[19] M. Hadigol, A. Doostan, H. Matthies, and R. Niekamp. Partitioned treatment of uncertainty in coupled domain problems: A separated representation approach. Computer Methods in Applied Mechanics and Engineering, 274:103–124, 2014.
[20] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[21] R. A. Harshman. Foundations of the PARAFAC procedure: model and conditions for an "explanatory" multi-mode factor analysis. Working Papers in Phonetics 16, UCLA, 1970. http://publish.uwo.ca/~harshman/wpppfac0.pdf.
[22] R. A. Horn and C. R. Johnson. Topics in Matrix Analysis. Cambridge Univ. Press, Cambridge, 1994.
[23] B. Khoromskij and C. Schwab. Tensor-structured Galerkin approximation of parametric and stochastic elliptic PDEs. SIAM Journal on Scientific Computing, 33(1):364–385, 2011.
[24] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[25] A. Nouy. A generalized spectral decomposition technique to solve a class of linear stochastic partial differential equations. Computer Methods in Applied Mechanics and Engineering, 196(37-40):4521–4537, 2007.
[26] A. Nouy. Generalized spectral decomposition method for solving stochastic finite element equations: Invariant subspace problem and dedicated algorithms. Computer Methods in Applied Mechanics and Engineering, 197:4718–4736, 2008.
[27] A. Nouy. Proper generalized decompositions and separated representations for the numerical solution of high dimensional stochastic problems. Archives of Computational Methods in Engineering, 17:403–434, 2010.
[28] I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
[29] V. Rokhlin and M. Tygert. A fast randomized algorithm for overdetermined linear least-squares regression. Proceedings of the National Academy of Sciences, 105(36):13212–13217, September 2008.
[30] T. Sarlos. Improved approximation algorithms for large matrices via random projections. In Foundations of Computer Science, 2006. FOCS '06. 47th Annual IEEE Symposium on, pages 143–152, 2006.
[31] G. Tomasi and R. Bro. A comparison of algorithms for fitting the PARAFAC model. Comput. Statist. Data Anal., 50(7):1700–1734, 2006.
[32] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, Theory and Applications, chapter 5, pages 210–268. Cambridge University Press, 2012.
[33] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert. A fast randomized algorithm for the approximation of matrices. Appl. Comput. Harmon. Anal., 25(3):335–366, 2008.
∗ Department of Aerospace Engineering Sciences, 429 UCB, University of Colorado at Boulder, Boulder, CO 80309