ISLET: Fast and Optimal Low-rank Tensor Regression via Importance Sketching
Anru Zhang, Yuetian Luo, Garvesh Raskutti, and Ming Yuan

Department of Statistics, University of Wisconsin-Madison ([email protected], [email protected], [email protected]); Department of Statistics, Columbia University ([email protected])

Abstract
In this paper, we develop a novel procedure for low-rank tensor regression, namely
Importance Sketching Low-rank Estimation for Tensors (ISLET). The central idea behind ISLET is importance sketching, i.e., carefully designed sketches based on both the responses and the low-dimensional structure of the parameter of interest. We show that the proposed method is sharply minimax optimal in terms of the mean-squared error under low-rank Tucker assumptions and under a randomized Gaussian ensemble design. In addition, if a tensor is low-rank with group sparsity, our procedure also achieves minimax optimality. Further, we show through numerical studies that ISLET achieves comparable or better mean-squared error performance than existing state-of-the-art methods while having substantial storage and run-time advantages, including capabilities for parallel and distributed computing. In particular, our procedure performs reliable estimation with tensors of dimension p = O(10 ) and is one or two orders of magnitude faster than baseline methods.

Key words: dimension reduction, high-order orthogonal iteration, minimax optimality, sketching, tensor regression.
1 Introduction

The past decades have seen a large body of work on tensors or multiway arrays [65, 107, 32, 71]. Tensors arise in numerous applications involving multiway data (e.g., brain imaging [143], hyperspectral imaging [76], or recommender system design [11]). In addition, tensor methods have been applied to many problems in statistics and machine learning where the observations are not necessarily tensors, such as topic and latent variable models [2], additive index models [5], and high-order interaction pursuit [55], among others. In many of these settings, the tensor of interest is high-dimensional in that the ambient dimension, i.e., the dimension of the target parameter, is substantially larger than the sample size. However, in practice the tensor parameter often has intrinsic dimension-reduced structure, such as low-rankness and sparsity [65, 112, 121], which makes inference possible. How to exploit such structure for tensors poses new statistical and computational challenges [103].

From a statistical perspective, a key question is how many samples are required to learn the suitable dimension-reduced structure and what the optimal mean-squared error rates are. Prior work has developed various tensor-based methods with theoretical guarantees based on regularization approaches [73, 91, 103, 117], the spectral method and projected gradient descent [29], alternating gradient descent [75, 113, 143], stochastic gradient descent [47], and power iteration methods [2]. However, a number of these methods are not statistically optimal. Furthermore, some of these methods rely on evaluation of a full gradient, which is typically costly in the high-dimensional setting. This leads to computational challenges including both the storage of tensors and the run time of the algorithm.

From a computational perspective, one approach to addressing both the storage and run-time challenges is randomized sketching. Sketching methods have been widely studied (see, e.g., [3, 4, 8, 14, 33, 34, 35, 37, 38, 56, 82, 92, 97, 99, 100, 102, 110, 111, 114, 118, 125, 126]). Many of these prior works on matrix or tensor sketching mainly focused on the relative approximation error [14, 34, 92, 102] after randomized sketching, which either may not yield optimal mean-squared error rates under statistical settings [102] or requires multiple sketching iterations [100, 101].

In this article, we address both computational and statistical challenges by developing a novel sketching-based estimation procedure for tensor regression. The proposed procedure is provably fast and sharply minimax optimal in terms of mean-squared error under randomized Gaussian design. The central idea lies in constructing specifically designed structural sketches, namely importance sketching. In contrast with randomized sketching methods, importance sketching utilizes both the response and the structure of the target tensor parameter, and it reduces the dimension of the parameters (i.e., the number of columns) instead of the samples (i.e., the number of rows), which leads to statistical optimality while maintaining the computational advantages of many randomized sketching methods. See Section 1.3 for more comparison between importance sketching in this work and sketching in the prior literature.
Specifically, we focus on the following low-rank tensor regression model,
$$y_j = \langle \mathcal{X}_j, \mathcal{A} \rangle + \varepsilon_j, \quad j = 1, \ldots, n, \qquad (1)$$
where $y_j$ and $\varepsilon_j$ are responses and observation noise, respectively; $\{\mathcal{X}_j\}_{j=1}^n$ are tensor covariates with randomized design; and $\mathcal{A} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$ is the order-$d$ tensor with parameters aligned in $d$ ways. Here $\langle \cdot, \cdot \rangle$ stands for the usual vectorized inner product. The goal is to recover $\mathcal{A}$ based on observations $\{y_j, \mathcal{X}_j\}_{j=1}^n$. In particular, when $d = 2$, this becomes a low-rank matrix regression problem, which has been widely studied in recent years [25, 68, 104]. The main focus of this paper is solving the underdetermined equation system, where the sample size $n$ is much smaller than the number of coefficients $\prod_{i=1}^d p_i$, because many applications belong to this regime. In particular, in the real data example discussed later, one MRI image is 121-by-145-by-121, which involves 2,122,945 parameters; typically we can collect far fewer MRI images in practice.

The general regression model (1) includes specific problem instances with different choices of design $\mathcal{X}$. Examples include matrix/tensor regression with general random or deterministic design [29, 77, 103, 143], matrix trace regression [6, 25, 43, 45, 68, 104], and matrix sparse recovery [132]. Another example is matrix/tensor recovery via rank-1 projections [18, 30, 55], which arises by setting $\mathcal{X}_j = \mathbf{u}_j \circ \mathbf{v}_j \circ \mathbf{w}_j$, where $\mathbf{u}_j, \mathbf{v}_j, \mathbf{w}_j$ are random vectors and "$\circ$" represents the outer product; this includes phase retrieval [16, 23] as a special case. The very popular matrix/tensor completion example [27, 78, 90, 127, 128, 134] arises by setting $\mathcal{X}_j = \mathbf{e}_{a_j} \circ \mathbf{e}_{b_j} \circ \mathbf{e}_{c_j}$, where $\mathbf{e}_j$ is the $j$th canonical vector and $\{a_j, b_j, c_j\}_{j=1}^n$ are randomly selected integers from $\{1, \ldots, p_1\} \times \{1, \ldots, p_2\} \times \{1, \ldots, p_3\}$. Specific applications of this low-rank tensor regression model include neuroimaging analysis [52, 75, 143], longitudinal relational data analysis [58], 3D image processing [53], etc.

For convenience of presentation, we specialize the discussion to order-3 tensors later, while the results can be extended to general order-$d$ tensors. In the modern high-dimensional setting, a variety of matrix/tensor data satisfy intrinsic structural assumptions, such as low-rankness [121] or sparsity [143], which makes accurate estimation of $\mathcal{A}$ possible even if the sample size $n$ is smaller than the number of coefficients in the target tensor $\mathcal{A}$. We thus focus on the low Tucker rank $(r_1, r_2, r_3)$ tensor $\mathcal{A}$ with the following Tucker decomposition [120]:
$$\mathcal{A} = [\![\mathcal{S}; \mathbf{U}_1, \mathbf{U}_2, \mathbf{U}_3]\!] := \mathcal{S} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3, \qquad (2)$$
where $\mathcal{S}$ is an $r_1$-by-$r_2$-by-$r_3$ core tensor and $\mathbf{U}_k$ is a $p_k$-by-$r_k$ matrix with orthonormal columns for $k = 1, 2, 3$.
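To make the setup concrete, the following small numpy sketch (not part of the original paper; the dimensions, ranks, and noise level are arbitrary illustrative choices) builds a low Tucker rank tensor via (2) and simulates data from model (1) under the Gaussian ensemble design.

```python
import numpy as np

def mode_product(A, U, k):
    """k-mode product A x_k U: contracts U's columns with mode k of A (0-based k)."""
    return np.moveaxis(np.tensordot(U, A, axes=(1, k)), 0, k)

rng = np.random.default_rng(0)
p, r, n, sigma = (10, 10, 10), (2, 2, 2), 400, 0.5

# Tucker decomposition (2): A = S x_1 U1 x_2 U2 x_3 U3 with orthonormal U_k
S = rng.standard_normal(r)
U = [np.linalg.qr(rng.standard_normal((p[k], r[k])))[0] for k in range(3)]
A = S.copy()
for k in range(3):
    A = mode_product(A, U[k], k)

# Model (1): y_j = <X_j, A> + eps_j with i.i.d. standard normal design entries
X = rng.standard_normal((n,) + p)
y = X.reshape(n, -1) @ A.ravel() + sigma * rng.standard_normal(n)
```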
The rigorous definition of the Tucker rank of a tensor and more discussion of tensor algebra are postponed to Section 2.1. In addition, canonical polyadic (CP) low-rank tensors have also been widely considered in recent literature [55, 56, 113, 143]. Since any CP-rank-$r$ tensor $\mathcal{A} = \sum_{i=1}^r \lambda_i \mathbf{a}_i \circ \mathbf{b}_i \circ \mathbf{c}_i$ has the Tucker decomposition $\mathcal{A} = [\![\mathcal{L}; \mathbf{A}, \mathbf{B}, \mathbf{C}]\!]$, where $\mathcal{L}$ is the $r$-by-$r$-by-$r$ diagonal tensor with diagonal entries $\lambda_1, \ldots, \lambda_r$, $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_r]$, and likewise for $\mathbf{B}, \mathbf{C}$ [65], our results naturally adapt to low CP-rank tensor regression. Also, with a slight abuse of notation, we will refer to low-rank and low Tucker rank interchangeably throughout the paper. Moreover, we also consider a sparse setting where there may exist a subset of modes, say $J_s \subseteq \{1, 2, 3\}$, such that $\mathcal{A}$ is sparse along these modes, i.e.,
$$\mathcal{A} = [\![\mathcal{S}; \mathbf{U}_1, \mathbf{U}_2, \mathbf{U}_3]\!], \qquad \|\mathbf{U}_k\|_0 = \sum_{i=1}^{p_k} 1_{\{(\mathbf{U}_k)_{[i,:]} \neq 0\}} \leq s_k, \quad k \in J_s. \qquad (3)$$

We make the following major contributions to low-rank tensor regression in this article. First, we introduce the main algorithm,
Importance Sketching Low-rank Estimation for Tensors (ISLET). Our algorithm has three steps: (i) first we use the tensor technique high-order orthogonal iteration (HOOI) [36] or sparse tensor alternating thresholding singular value decomposition (STAT-SVD) [136] to determine the importance sketching directions. Here HOOI and STAT-SVD are regular and sparse tensor low-rank decomposition methods, respectively, whose explanations are postponed to Sections 2.2 and 2.3; (ii) using the sketching directions from the first step, we perform importance sketching and then evaluate the dimension-reduced regression using the sketched tensors/matrices (to incorporate sparsity, we add a group-sparsity regularizer); (iii) we construct the final tensor estimator using the sketched components. Although the focus of this work is on low-rank tensor regression, we point out that our three-step procedure applies to general high-dimensional statistics problems with low-dimensional structure, provided that we can find a suitable projection operator in step (i) and an inverse projection operator in step (iii).

One of the main advantages of ISLET is the scalability of the algorithm. The proposed procedure is computationally efficient due to the dimension reduction by importance sketching. Most importantly, ISLET only requires access to the full data twice, which significantly saves run time in large-scale settings when it is not possible to store all samples in core memory. We also show that our algorithm can be naturally distributed across multiple machines, which can significantly reduce computation time.

Second, we prove a deterministic oracle inequality for the ISLET procedure under the low-Tucker-rank assumption and general noise and design (Theorems 2 and 3). We additionally show that ISLET achieves the optimal mean-squared error (with the optimal constant for nonsparse ISLET) under randomized Gaussian design (Theorems 4, 5, 6, and 7). The following informal statement summarizes two of the main results of the article.
Theorem 1 (ISLET for tensor regression: informal). Consider the regular tensor regression problem with Gaussian ensemble design, where $\mathcal{A}$ is Tucker rank-$(r_1, r_2, r_3)$, $\mathcal{X}_j$ has i.i.d. standard normal entries, $\varepsilon_j \overset{i.i.d.}{\sim} N(0, \sigma^2)$, and $\varepsilon_j, \mathcal{X}_j$ are independent.

(a) Under regularity conditions, ISLET achieves the following optimal rate of convergence with the matching constant,
$$\mathbb{E}\big\|\hat{\mathcal{A}} - \mathcal{A}\big\|_{\rm HS}^2 = (1 + o(1))\,\frac{m\sigma^2}{n},$$
where $m = r_1 r_2 r_3 + r_1(p_1 - r_1) + r_2(p_2 - r_2) + r_3(p_3 - r_3)$ is exactly the degrees of freedom of all Tucker rank-$(r_1, r_2, r_3)$ tensors in $\mathbb{R}^{p_1 \times p_2 \times p_3}$ and $\|\cdot\|_{\rm HS}$ is the Hilbert-Schmidt norm to be defined in Section 2.1.

(b) If, in addition, (3) holds with sparsity levels $s_k$, then under regularity conditions ISLET achieves the following optimal rate of convergence:
$$\mathbb{E}\big\|\hat{\mathcal{A}} - \mathcal{A}\big\|_{\rm HS}^2 \asymp \frac{m_s\sigma^2}{n},$$
where $m_s = r_1 r_2 r_3 + \sum_{k \in J_s} s_k(r_k + \log(p_k/s_k)) + \sum_{k \notin J_s} p_k r_k$ and "$\asymp$" denotes asymptotic equivalence between two number series (see a more formal definition in Section 2.1).

To the best of our knowledge, we are the first to develop matching-constant optimal rate results for regular tensor regression under randomized Gaussian ensemble design, even for the low-rank matrix recovery case, since it is not clear whether prior approaches (e.g., nuclear norm minimization) achieve sharp constants. We are also the first to develop optimal rate results for tensor regression under the sparsity condition (3).

Third, proving the optimal mean-squared error bound presents a number of technical challenges, and we introduce novel proof ideas to overcome these difficulties. In particular, one major difficulty lies in the analysis of reduced-dimensional regressions (see (7) in Section 2) since we analyze sketched regression models. To this end, we introduce partial linear models for these reduced-dimensional regressions, from which we develop estimation error upper bounds.

The final and most important computational contribution is to display through numerical studies the advantages of our ISLET algorithms. Compared to state-of-the-art tensor estimation algorithms including nonconvex projected gradient descent (PGD) [29], Tucker regression [143], and convex regularization [116], we show that our ISLET algorithm achieves comparable statistical performance with substantially faster computation. In particular, the run time is 1-3 orders of magnitude faster than existing methods. In the most prominent example, our ISLET procedure can efficiently solve an ultrahigh-dimensional tensor regression with covariates of 7.68 terabytes. For the order-2 case, i.e., low-rank matrix regression, our simulation studies show that ISLET outperforms the classic nuclear norm minimization estimator. We also provide a real data application where we study the association between attention-deficit/hyperactivity disorder and high-dimensional MRI image tensors. We show that the proposed procedure provides significantly better prediction performance in much less time compared to state-of-the-art methods.
Our work is related to a broad range of literature from a number of communities, including scientific computing, computer science, signal processing, applied mathematics, and statistics. Here we make an attempt to discuss existing results from these various communities; however, we do not claim that our literature survey is exhaustive.

Large-scale linear systems whose solutions admit a low-rank tensor structure commonly arise after discretizing high-dimensional partial differential equations [59, 60, 80], and various methods have been proposed. For example, [12] developed algebraic and Gauss-Newton methods to solve the linear system with a CP low-rank tensor solution. [7, 10] proposed iterative projection methods to solve large-scale linear systems with Kronecker-product-type design matrices. [48] introduced a greedy approach. [69, 70] considered Riemannian optimization methods and tensor Krylov subspace methods, respectively. The readers are referred to [51] for a recent survey. Different from these works, our proposed ISLET is a one-step procedure that only involves solving a simple least squares regression after performing dimension reduction on the covariates by importance sketching (see Steps 1 and 2 in Section 2.2). Moreover, many prior works mainly focused on computational aspects of their proposed methods [7, 13, 42, 48, 51], while we show that ISLET is not only computationally efficient (see more discussion and comparison on computational complexity in the Computation and Implementation part of Section 2.2) but also has optimal theoretical guarantees in terms of mean squared error under the statistical setting.

In addition, sketching methods play an important role in computation acceleration and have been widely considered in the previous literature. For example, [34, 89, 92] provided accurate approximation algorithms based on sketching with novel embedding matrices, where the run time is proportional to the number of nonzero entries of the input matrix. Sketching methods have also been studied in robust $\ell_1$ low-rank matrix approximation [85, 86, 88, 110, 141], general $\ell_p$ low-rank matrix approximation [8, 31], low-rank tensor approximation [111], etc. In the regression context, the sketching method has been considered for least squares regression [34, 37, 92, 101, 102], $\ell_p$ regression [34, 89, 92], Kronecker product regression [37], ridge regression [3, 124], regularized kernel regression [22, 140], etc. Various types of random sketching matrices have been developed, including random sub-Gaussian [101], random sampling [39, 40], CountSketch [28, 33], and the sparse Johnson-Lindenstrauss transformation [64], among many others. The readers are also referred to the survey papers on sketching by Mahoney [82] and Woodruff [126]. The proposed method in this paper is different from these previous works in various aspects. First, many randomized sketching methods in the literature focus on relative approximation error [82, 126], and their sketching matrices are constructed based only on the covariates [39, 40, 64, 101, 102]. In contrast, we explicitly construct "supervised" sketching matrices based on both the response $y_j$ and covariates $\mathcal{X}_j$ and obtain optimal bounds in mean squared error under the statistical setting. Second, essentially speaking, our proposed importance sketching scheme reduces the number of columns (parameters) instead of the number of rows (samples) in the linear equation system.
Third, different from sketching for an overdetermined least squares system [34, 37, 92, 101, 102], we mainly focus on the high-dimensional setting where the number of samples can be significantly smaller than the number of coefficients.

In Section 2.1 we introduce important notation; then we present our ISLET procedure under the nonsparse and sparse settings in Sections 2.2 and 2.3, respectively, and illustrate the procedure from a sketching perspective in Section 2.4. In Section 3 we provide general theoretical guarantees for our procedure which make no assumptions on the design or the noise distribution; in Section 4 we specialize our bounds to tensor regression with low Tucker rank and assume the design is independent Gaussian; a simulation study showing the substantial computational benefits of our algorithm is provided in Section 5. Additional notation, discussion of general-order ISLET, simulation results, an application to attention-deficit/hyperactivity disorder (ADHD) MRI imaging data analysis, and all technical proofs are provided in the supplementary materials [137], linked from the main article webpage.
Here we introduce the general procedure of Importance Sketching Low-rank Estimation for Tensors (ISLET). Although for ease of presentation we will focus on order-3 tensors, the procedure for the general order-$d$ case can be treated similarly. Details for matrices and tensors of order greater than 3 are provided in Section C of the supplementary materials [137].

The following notation will be used throughout this article. Additional definitions can be found in Section A of the supplementary materials. Lowercase letters (e.g., $a, b$), lowercase boldface letters (e.g., $\mathbf{u}, \mathbf{v}$), uppercase boldface letters (e.g., $\mathbf{U}, \mathbf{V}$), and boldface calligraphic letters (e.g., $\mathcal{A}, \mathcal{X}$) are used to denote scalars, vectors, matrices, and order-3-or-higher tensors, respectively. For simplicity, we denote $\mathcal{X}_j$ as the tensor indexed by $j$ in a sequence of tensors $\{\mathcal{X}_j\}$. For any two series of numbers, say $\{a_i\}$ and $\{b_i\}$, denote $a \asymp b$ if there exist uniform constants $c, C > 0$ such that $c a_i \leq b_i \leq C a_i$ for all $i$, and $a = \Omega(b)$ if there exists a uniform constant $c > 0$ such that $a_i \geq c b_i$ for all $i$. We use bracket subscripts to denote subvectors, submatrices, and subtensors. For example, $\mathbf{v}_{[2:r]}$ is the vector with the 2nd to $r$th entries of $\mathbf{v}$; $\mathbf{D}_{[i_1, i_2]}$ is the entry of $\mathbf{D}$ in the $i_1$th row and $i_2$th column; $\mathbf{D}_{[(r+1):p_1, :]}$ contains the $(r+1)$th to the $p_1$th rows of $\mathbf{D}$; $\mathcal{A}_{[1:s_1, 1:s_2, 1:s_3]}$ is the $s_1$-by-$s_2$-by-$s_3$ subtensor of $\mathcal{A}$ with index set $\{(i_1, i_2, i_3): 1 \leq i_1 \leq s_1, 1 \leq i_2 \leq s_2, 1 \leq i_3 \leq s_3\}$. For any vector $\mathbf{v} \in \mathbb{R}^p$, define its $\ell_q$ norm as $\|\mathbf{v}\|_q = (\sum_i |v_i|^q)^{1/q}$. For any matrix $\mathbf{D} \in \mathbb{R}^{p_1 \times p_2}$, let $\sigma_k(\mathbf{D})$ be the $k$th singular value of $\mathbf{D}$. In particular, the least nontrivial singular value of $\mathbf{D}$, defined as $\sigma_{\min}(\mathbf{D}) = \sigma_{p_1 \wedge p_2}(\mathbf{D})$, will be used extensively in later analysis. We also denote $\mathrm{SVD}_r(\mathbf{D}) = [\mathbf{u}_1 \cdots \mathbf{u}_r]$ and $\mathrm{QR}(\mathbf{D})$ as the subspace composed of the leading $r$ left singular vectors and the Q part of the QR orthogonalization of $\mathbf{D}$, respectively. The matrix Frobenius and spectral norms are defined as $\|\mathbf{D}\|_F = (\sum_{i_1, i_2} \mathbf{D}_{[i_1,i_2]}^2)^{1/2} = (\sum_{i=1}^{p_1 \wedge p_2}\sigma_i^2(\mathbf{D}))^{1/2}$ and $\|\mathbf{D}\| = \max_{\mathbf{u}} \|\mathbf{D}\mathbf{u}\|_2/\|\mathbf{u}\|_2 = \sigma_1(\mathbf{D})$. In addition, $\mathbf{I}_r$ represents the $r$-by-$r$ identity matrix. Let $\mathbb{O}_{p,r} = \{\mathbf{U}: \mathbf{U}^\top\mathbf{U} = \mathbf{I}_r\}$ be the set of all $p$-by-$r$ matrices with orthonormal columns. For any $\mathbf{U} \in \mathbb{O}_{p,r}$, $P_{\mathbf{U}} = \mathbf{U}\mathbf{U}^\top$ represents the projection matrix onto the column space of $\mathbf{U}$; we also use $\mathbf{U}_\perp \in \mathbb{O}_{p,p-r}$ to represent the orthonormal complement of $\mathbf{U}$. For any event $A$, let $\mathbb{P}(A)$ be the probability that $A$ occurs.

For any matrix $\mathbf{D} \in \mathbb{R}^{p_1 \times p_2}$ and order-$d$ tensor $\mathcal{A} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$, let $\mathrm{vec}(\mathbf{D})$ and $\mathrm{vec}(\mathcal{A})$ be the vectorizations of $\mathbf{D}$ and $\mathcal{A}$, respectively. The matricization $\mathcal{M}_k(\cdot)$ is the operation that unfolds or flattens the order-$d$ tensor $\mathcal{A} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$ into the matrix $\mathcal{M}_k(\mathcal{A}) \in \mathbb{R}^{p_k \times \prod_{j \neq k} p_j}$ for $k = 1, \ldots, d$. Since the formal entrywise definitions of matricization and vectorization are rather tedious, we leave them to Section A in the supplementary materials [137]. The Hilbert-Schmidt norm is defined as $\|\mathcal{A}\|_{\rm HS} = (\sum_{i_1, \ldots, i_d}\mathcal{A}_{[i_1,\ldots,i_d]}^2)^{1/2}$. An order-$d$ tensor is rank-one if it can be written as the outer product of $d$ nonzero vectors. The CP rank of any tensor $\mathcal{A}$ is defined as the minimal number $r$ such that $\mathcal{A}$ can be decomposed as $\mathcal{A} = \sum_{i=1}^r \mathcal{B}_i$ for rank-1 tensors $\mathcal{B}_i$. The Tucker rank (or multilinear rank) of a tensor $\mathcal{A}$ is defined as a $d$-tuple $(r_1, \ldots, r_d)$, where $r_k = \mathrm{rank}(\mathcal{M}_k(\mathcal{A}))$.
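As an illustration of this notation, the following sketch (illustrative only; the entrywise ordering conventions here are row-major and may differ from the formal definitions in the supplement) computes the vectorization, matricizations, Hilbert-Schmidt norm, and Tucker rank of a synthetic tensor.

```python
import numpy as np

def matricize(A, k):
    """Mode-k matricization M_k(A): p_k x prod_{j != k} p_j (0-based k, row-major columns)."""
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1)

rng = np.random.default_rng(1)
# a random Tucker rank-(2, 2, 2) tensor in R^{6 x 7 x 8}
S = rng.standard_normal((2, 2, 2))
U = [np.linalg.qr(rng.standard_normal((p, 2)))[0] for p in (6, 7, 8)]
A = np.einsum('abc,ia,jb,kc->ijk', S, U[0], U[1], U[2])

vecA = A.reshape(-1)                      # vectorization
hs_norm = np.sqrt((A ** 2).sum())         # Hilbert-Schmidt norm
tucker_rank = [np.linalg.matrix_rank(matricize(A, k)) for k in range(3)]
print(hs_norm, tucker_rank)               # tucker_rank == [2, 2, 2]
```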
The $k$-mode product of $\mathcal{A} \in \mathbb{R}^{p_1 \times \cdots \times p_d}$ with a matrix $\mathbf{U} \in \mathbb{R}^{p_k \times r_k}$ is denoted by $\mathcal{A} \times_k \mathbf{U}$ and is of size $p_1 \times \cdots \times p_{k-1} \times r_k \times p_{k+1} \times \cdots \times p_d$, such that
$$(\mathcal{A} \times_k \mathbf{U})_{[i_1, \ldots, i_{k-1}, j, i_{k+1}, \ldots, i_d]} = \sum_{i_k=1}^{p_k} \mathcal{A}_{[i_1, i_2, \ldots, i_d]}\,\mathbf{U}_{[i_k, j]}.$$
For convenience of presentation, all mode indices $(\cdot)_k$ of an order-3 tensor are in the sense of modulo-3, e.g., $r_4 = r_1$, $s_4 = s_1$, $p_4 = p_1$, $\mathcal{X} \times_4 \mathbf{U} = \mathcal{X} \times_1 \mathbf{U}$. For any matrices $\mathbf{U} \in \mathbb{R}^{p_1 \times p_2}$ and $\mathbf{V} \in \mathbb{R}^{m_1 \times m_2}$, let
$$\mathbf{U} \otimes \mathbf{V} = \begin{bmatrix} \mathbf{U}_{[1,1]}\cdot\mathbf{V} & \cdots & \mathbf{U}_{[1,p_2]}\cdot\mathbf{V} \\ \vdots & & \vdots \\ \mathbf{U}_{[p_1,1]}\cdot\mathbf{V} & \cdots & \mathbf{U}_{[p_1,p_2]}\cdot\mathbf{V} \end{bmatrix} \in \mathbb{R}^{(p_1 m_1) \times (p_2 m_2)}$$
be the Kronecker product.
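Identities of this type (cf. Lemma 1 in the supplementary materials) can be checked numerically. The sketch below reuses mode_product and matricize from the snippets above; its Kronecker factor ordering reflects the row-major matricization convention used there, which may differ from the ordering in the paper's Lemma 1.

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.standard_normal((2, 3, 4))
U = [rng.standard_normal((p, r)) for p, r in zip((5, 6, 7), (2, 3, 4))]

# A = S x_1 U1 x_2 U2 x_3 U3, using the helpers defined in the earlier snippets
A = S
for k in range(3):
    A = mode_product(A, U[k], k)

# matricization/Kronecker identity under the row-major convention used here
lhs = matricize(A, 0)
rhs = U[0] @ matricize(S, 0) @ np.kron(U[1], U[2]).T
assert np.allclose(lhs, rhs)

# vectorization identity: vec(A) = (U1 kron U2 kron U3) vec(S), same convention
assert np.allclose(A.ravel(), np.kron(np.kron(U[0], U[1]), U[2]) @ S.ravel())
```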
Some intrinsic identities among the Kronecker product, vectorization, and matricization, which will be used later in this paper, are summarized in Lemma 1 in the supplementary materials [137]. Readers can refer to [65] for a more comprehensive introduction to tensor algebra. Finally, we use $C, C_1, C_2, c$ and other variations to represent large and small constants, whose actual values may vary from line to line.
We first consider the tensor regression model (1), where $\mathcal{A}$ is low-rank (2) without sparsity assumptions. The proposed ISLET algorithm is divided into three steps, and a pictorial illustration is provided in Figures 1-3 for the readers' better understanding. The pseudocode is provided in Algorithm 1.

Step 1 (Probing importance sketching directions). We first probe the importance sketching directions. When the covariates satisfy $\mathbb{E}\,\mathrm{vec}(\mathcal{X}_j)\mathrm{vec}(\mathcal{X}_j)^\top = \mathbf{I}_{p_1 p_2 p_3}$, we evaluate
$$\tilde{\mathcal{A}} = \frac{1}{n}\sum_{j=1}^n y_j\,\mathcal{X}_j. \qquad (4)$$
$\tilde{\mathcal{A}}$ is essentially the covariance tensor between $y$ and $\mathcal{X}$. Since $\mathcal{A} = [\![\mathcal{S}; \mathbf{U}_1, \mathbf{U}_2, \mathbf{U}_3]\!]$ has low Tucker rank, we perform high-order orthogonal iteration (HOOI) on $\tilde{\mathcal{A}}$ to obtain $\tilde{\mathbf{U}}_k \in \mathbb{O}_{p_k, r_k}$, $k = 1, 2, 3$, as estimates of $\mathbf{U}_k$. Here HOOI is a classic method for tensor decomposition that can be traced back to De Lathauwer, De Moor, and Vandewalle [36]. The central idea of HOOI is power-iterated singular value thresholding. The outcome of HOOI, $\{\tilde{\mathbf{U}}_k\}_{k=1}^3$, then yields the following low-rank approximation of $\mathcal{A}$:
$$\mathcal{A} \approx [\![\tilde{\mathcal{S}}; \tilde{\mathbf{U}}_1, \tilde{\mathbf{U}}_2, \tilde{\mathbf{U}}_3]\!], \quad \text{where } \tilde{\mathcal{S}} = [\![\tilde{\mathcal{A}}; \tilde{\mathbf{U}}_1^\top, \tilde{\mathbf{U}}_2^\top, \tilde{\mathbf{U}}_3^\top]\!] \in \mathbb{R}^{r_1 \times r_2 \times r_3}. \qquad (5)$$
We further evaluate $\tilde{\mathbf{V}}_k := \mathrm{QR}\big(\mathcal{M}_k^\top(\tilde{\mathcal{S}})\big) \in \mathbb{O}_{r_{k+1}r_{k+2}, r_k}$, $k = 1, 2, 3$. The $\{\tilde{\mathbf{U}}_k, \tilde{\mathbf{V}}_k\}_{k=1}^3$ obtained here are regarded as the importance sketching directions. As we will further illustrate in Section 3.1, the combinations of $\tilde{\mathbf{U}}_k$ and $\tilde{\mathbf{V}}_k$ provide approximations for the singular subspaces of $\mathcal{M}_k(\mathcal{A})$.

Step 2 (Linear regression on sketched covariates). Next, we perform sketching to reduce the dimension of the original regression model (1). To be specific, we project the original high-dimensional covariates onto the dimension-reduced subspace "that is important in the covariance between $y$ and $\mathcal{X}$" and construct the following importance sketching covariates,
$$\tilde{\mathbf{X}} = [\tilde{\mathbf{X}}_B \ \ \tilde{\mathbf{X}}_{D_1} \ \ \tilde{\mathbf{X}}_{D_2} \ \ \tilde{\mathbf{X}}_{D_3}] \in \mathbb{R}^{n \times m},$$
$$\tilde{\mathbf{X}}_B \in \mathbb{R}^{n \times m_B}, \quad (\tilde{\mathbf{X}}_B)_{[i,:]} = \mathrm{vec}\big(\mathcal{X}_i \times_1 \tilde{\mathbf{U}}_1^\top \times_2 \tilde{\mathbf{U}}_2^\top \times_3 \tilde{\mathbf{U}}_3^\top\big),$$
$$\tilde{\mathbf{X}}_{D_k} \in \mathbb{R}^{n \times m_{D_k}}, \quad (\tilde{\mathbf{X}}_{D_k})_{[i,:]} = \mathrm{vec}\big(\tilde{\mathbf{U}}_{k\perp}^\top\mathcal{M}_k\big(\mathcal{X}_i \times_{k+1} \tilde{\mathbf{U}}_{k+1}^\top \times_{k+2} \tilde{\mathbf{U}}_{k+2}^\top\big)\tilde{\mathbf{V}}_k\big), \qquad (6)$$
where $m_B = r_1 r_2 r_3$, $m_{D_k} = (p_k - r_k)r_k$ for $k = 1, 2, 3$, and $m = m_B + m_{D_1} + m_{D_2} + m_{D_3}$.
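As a concrete illustration of Step 1 (equations (4)-(5)), the following sketch reuses X, y, n, r, mode_product, and matricize from the earlier snippets; the hooi routine below is a bare-bones HOSVD-plus-alternating-updates implementation written only for illustration, not the authors' implementation.

```python
import numpy as np

def hooi(T, ranks, n_iter=10):
    """Basic HOOI: returns factors (U1, U2, U3) and core S of a Tucker approximation."""
    # HOSVD initialization
    Ufac = [np.linalg.svd(matricize(T, k))[0][:, :ranks[k]] for k in range(3)]
    for _ in range(n_iter):
        for k in range(3):
            Z = T
            for j in range(3):
                if j != k:
                    Z = mode_product(Z, Ufac[j].T, j)   # project the other two modes
            Ufac[k] = np.linalg.svd(matricize(Z, k))[0][:, :ranks[k]]
    S = T
    for k in range(3):
        S = mode_product(S, Ufac[k].T, k)
    return Ufac, S

# Step 1: covariance tensor (4), HOOI, and sketching directions
A_tilde = np.tensordot(y, X, axes=(0, 0)) / n                    # (1/n) sum_j y_j X_j
U_t, S_t = hooi(A_tilde, r)
V_t = [np.linalg.qr(matricize(S_t, k).T)[0] for k in range(3)]   # V_k = QR(M_k(S)^T)
```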
Then, we evaluate the least-squares estimator of the submodel with importance sketching covariates $\tilde{\mathbf{X}}$,
$$\hat{\gamma} = \arg\min_{\gamma \in \mathbb{R}^m}\big\|y - \tilde{\mathbf{X}}\gamma\big\|_2^2. \qquad (7)$$
The dimension of the sketched covariate regression (7) is $m$, which is significantly smaller than the dimension of the original tensor regression model, $p_1 p_2 p_3$. Consequently, the computational cost can be significantly reduced.

Step 3 (Assembling the final estimate). Then $\hat{\gamma}$ is divided into four segments according to the blockwise structure of $\tilde{\mathbf{X}} = [\tilde{\mathbf{X}}_B, \tilde{\mathbf{X}}_{D_1}, \tilde{\mathbf{X}}_{D_2}, \tilde{\mathbf{X}}_{D_3}]$,
$$\mathrm{vec}(\hat{\mathcal{B}}) = \hat{\gamma}_{[1:m_B]}, \quad \mathrm{vec}(\hat{\mathbf{D}}_1) = \hat{\gamma}_{[(m_B+1):(m_B+m_{D_1})]}, \quad \mathrm{vec}(\hat{\mathbf{D}}_2) = \hat{\gamma}_{[(m_B+m_{D_1}+1):(m_B+m_{D_1}+m_{D_2})]},$$
$$\mathrm{vec}(\hat{\mathbf{D}}_3) = \hat{\gamma}_{[(m_B+m_{D_1}+m_{D_2}+1):(m_B+m_{D_1}+m_{D_2}+m_{D_3})]}. \qquad (8)$$
Finally, we construct the regression estimator $\hat{\mathcal{A}}$ for the original problem (1) using the regression estimator $\hat{\gamma}$ for the submodel (8): let $\hat{\mathbf{B}}_k = \mathcal{M}_k(\hat{\mathcal{B}})$ and calculate
$$\hat{\mathbf{L}}_k = \big(\tilde{\mathbf{U}}_k\hat{\mathbf{B}}_k\tilde{\mathbf{V}}_k + \tilde{\mathbf{U}}_{k\perp}\hat{\mathbf{D}}_k\big)\big(\hat{\mathbf{B}}_k\tilde{\mathbf{V}}_k\big)^{-1}, \quad k = 1, 2, 3, \qquad \hat{\mathcal{A}} = [\![\hat{\mathcal{B}}; \hat{\mathbf{L}}_1, \hat{\mathbf{L}}_2, \hat{\mathbf{L}}_3]\!]. \qquad (9)$$
More interpretation of (9) is given in Section 3.1.

Remark 1 (Alternative construction of $\tilde{\mathcal{A}}$ in Step 1). When $\mathbb{E}\,\mathrm{vec}(\mathcal{X})\mathrm{vec}(\mathcal{X})^\top \neq \mathbf{I}_{p_1 p_2 p_3}$, we could consider the following alternative ways to construct the initial estimate $\tilde{\mathcal{A}}$. First, in some cases the construction can depend on the covariance structure of $\mathcal{X}$. For example, in the framework of tensor recovery via rank-one sketching (discussed in the introduction), we have $\mathcal{X}_j = \mathbf{u}_j \circ \mathbf{u}_j \circ \mathbf{u}_j$, where $\mathbf{u}_j \in \mathbb{R}^p$ has i.i.d. $N(0,1)$ entries. By the high-order Stein identity [63], one can show that
$$\tilde{\mathcal{A}} = \frac{1}{6n}\sum_{j=1}^n y_j\,\mathbf{u}_j \circ \mathbf{u}_j \circ \mathbf{u}_j - \sum_{j=1}^p\big(\mathbf{w}\circ\mathbf{e}_j\circ\mathbf{e}_j + \mathbf{e}_j\circ\mathbf{w}\circ\mathbf{e}_j + \mathbf{e}_j\circ\mathbf{e}_j\circ\mathbf{w}\big)$$
is a proper initial unbiased estimator of $\mathcal{A}$ [55, Lemma 4]. Here, $\mathbf{w} = \frac{1}{n}\sum_{i=1}^n y_i\mathbf{u}_i$ and $\mathbf{e}_j$ is the $j$th canonical basis vector in $\mathbb{R}^p$. Another commonly used setting in data analysis is the high-order Kronecker covariance structure: $\mathbb{E}(\mathrm{vec}(\mathcal{X}_j)\mathrm{vec}(\mathcal{X}_j)^\top) = \boldsymbol{\Sigma}_1 \otimes \boldsymbol{\Sigma}_2 \otimes \boldsymbol{\Sigma}_3$, where $\boldsymbol{\Sigma}_k \in \mathbb{R}^{p_k \times p_k}$, $k = 1, 2, 3$, are covariance matrices along the three modes, respectively [57, 81, 84, 98, 144]. Under this assumption, we can first apply existing approaches to obtain estimators $\hat{\boldsymbol{\Sigma}}_k$ of $\boldsymbol{\Sigma}_k$, then whiten the covariates by replacing $\mathcal{X}_j$ with $[\![\mathcal{X}_j; \hat{\boldsymbol{\Sigma}}_1^{-1/2}, \hat{\boldsymbol{\Sigma}}_2^{-1/2}, \hat{\boldsymbol{\Sigma}}_3^{-1/2}]\!]$. After this preprocessing step, the other steps of ISLET still follow. Moreover, it remains an open question how to perform the initialization if $\mathcal{X}$ has a more general, unstructured, and unknown design.

Remark 2 (Alternative methods to HOOI). In addition to high-order orthogonal iteration (HOOI), a variety of methods have been proposed in the literature to compute the low-rank tensor approximation, such as Newton-type optimization methods on manifolds [41, 61, 62, 106], black box approximation [9, 21, 83, 94, 95, 135], generalizations of the Krylov subspace method [49, 105], and the greedy approximation method [48], among many others. Further, black box approximation methods [9, 21, 94, 95, 135] can be applied even if the initial estimator $\tilde{\mathcal{A}}$ does not fit into core memory.
When the tensor is further approximately CP low-rank, we can also apply the randomized compression method [108, 109] or randomized block sampling [123] to obtain a CP low-rank tensor approximation. Although the rest of our discussion will focus on the HOOI procedure for initialization, these alternative methods can also be applied to obtain an initialization for the ISLET algorithm.
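Continuing the illustration, the sketch below implements Steps 2 and 3 (equations (6)-(9)) on top of the earlier snippets; it assumes X, y, A, U_t, V_t, mode_product, and matricize defined above and is a bare-bones version of the procedure, not the authors' released code.

```python
import numpy as np

# Step 2: build the importance sketching covariates (6)
rk = tuple(Ut.shape[1] for Ut in U_t)
p_dims = X.shape[1:]
U_perp = [np.linalg.qr(U_t[k], mode='complete')[0][:, rk[k]:] for k in range(3)]

def sketch_row(Xi):
    body = Xi
    for k in range(3):
        body = mode_product(body, U_t[k].T, k)          # X_i x_k U_k^T for all modes
    parts = [body.ravel()]
    for k in range(3):
        Z = Xi
        for j in range(3):
            if j != k:
                Z = mode_product(Z, U_t[j].T, j)        # project the other two modes
        Dk = U_perp[k].T @ matricize(Z, k) @ V_t[k]     # (p_k - r_k) x r_k "arm" block
        parts.append(Dk.ravel())
    return np.concatenate(parts)

X_sk = np.stack([sketch_row(X[i]) for i in range(X.shape[0])])   # n x m sketched design

# least squares (7) and partition (8)
gamma_hat, *_ = np.linalg.lstsq(X_sk, y, rcond=None)
mB = rk[0] * rk[1] * rk[2]
B_hat = gamma_hat[:mB].reshape(rk)
D_hat, idx = [], mB
for k in range(3):
    mD = (p_dims[k] - rk[k]) * rk[k]
    D_hat.append(gamma_hat[idx:idx + mD].reshape(p_dims[k] - rk[k], rk[k]))
    idx += mD

# Step 3: assemble the final estimator (9)
A_hat = B_hat
for k in range(3):
    Bk = matricize(B_hat, k)
    Lk = (U_t[k] @ Bk @ V_t[k] + U_perp[k] @ D_hat[k]) @ np.linalg.inv(Bk @ V_t[k])
    A_hat = mode_product(A_hat, Lk, k)

print(np.linalg.norm(A_hat - A) / np.linalg.norm(A))   # relative estimation error vs. truth
```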
Computation and implementation. We briefly discuss computational complexity and implementation aspects of the ISLET procedure here. It is noteworthy that ISLET accesses the sample only twice, for constructing the covariance tensor (Step 1) and the importance sketching covariates (Step 2), respectively. In large-scale cases where it is difficult to store the whole dataset in random-access memory (RAM), this advantage can greatly reduce the computational costs.

In addition, in the order-3 tensor case, when each mode shares the same dimension $p_k = p$ and rank $r_k = r$, the total number of observed values is $O(np^3)$ and the time complexity of ISLET is $O(np^3 r + nr^6 + Tp^4)$, where $T$ is the number of HOOI iterations. For general order-$d$ tensor regression, the time complexity of ISLET is $O(np^d r + nr^{2d} + Tp^{d+1})$. In contrast, the time complexity of the nonconvex PGD [29] is $O(T'(np^d + rp^{d+1}))$, where $T'$ is the number of gradient descent iterations; [13] introduced an optimization-based method with time complexity $O(T' d np^d r)$, where $T'$ is the number of iterations of the Gauss-Newton method. We can see that if $T' \geq r$, a typical situation in practice, ISLET is significantly faster than these previous methods.

It is worth pointing out that the computing time of ISLET is still high when the tensor parameter has a large order $d$.
[Figure 1: Illustration for Step 1 of ISLET. Panels: (a) construct the covariance tensor $\tilde{\mathcal{A}}$; (b) perform HOOI on $\tilde{\mathcal{A}}$ to obtain the sketching directions; (c) the sketching directions yield low-rank approximations for $\mathcal{M}_k(\tilde{\mathcal{A}})$.]

[Figure 2: Illustration for Step 2 of ISLET. Panels: (a) construct importance sketching covariates by projections; (b) perform regression of the submodel with importance sketching covariates.]
[Figure 3: Illustration for Step 3 of ISLET.]

In fact, without any structural assumption on the design tensors $\mathcal{X}_j$, such a time cost may be unavoidable, since reading in all data requires $O(np^d)$ operations. If there is extra structure on the design tensor, e.g., a Kronecker product structure [7, 59, 60, 80] or low separation rank [10, 48], the computing time can be significantly reduced by applying methods from that body of literature. Here we mainly focus on the setting where $\mathcal{X}_j$ does not satisfy a clear structural assumption, since in many real data applications, e.g., the neuroimaging data example studied in this and many other works [1, 77, 113, 143], the design tensors $\mathcal{X}_j$ may not have a clear known structure.

Moreover, in the order-3 tensor case, instead of storing all $\{\mathcal{X}_j\}_{j=1}^n$ in memory, which requires $O(np^3)$ RAM, ISLET only requires $O(p^3 + n(pr + r^3))$ RAM space if one chooses to access the samples from hard disks rather than storing them in RAM. This makes large-scale computing possible. We empirically investigate the computation cost by simulation studies in Section 5.

The proposed ISLET procedure also allows convenient parallel computing. Suppose we distribute all $n$ samples across $B$ machines: $\{(\mathcal{X}_{bi}, y_{bi})\}_{i=1}^{B_b}$, $b = 1, \ldots, B$, where $B_b \approx n/B$. To evaluate the covariance tensor in Step 1, we can calculate $\tilde{\mathcal{A}}_b = \sum_{i=1}^{B_b} y_{bi}\mathcal{X}_{bi}$ in each machine, then summarize them as $\tilde{\mathcal{A}} = \frac{1}{n}\sum_{b=1}^B\tilde{\mathcal{A}}_b$; to construct the sketching covariates and perform the partial regression in Step 2, we calculate
$$\mathbf{y}_b = (y_{b1}, \ldots, y_{bB_b})^\top \in \mathbb{R}^{B_b}, \qquad (10)$$
$$\tilde{\mathbf{X}}_{bi} = [\tilde{\mathbf{X}}_{B,bi} \ \ \tilde{\mathbf{X}}_{D_1,bi} \ \ \tilde{\mathbf{X}}_{D_2,bi} \ \ \tilde{\mathbf{X}}_{D_3,bi}] \in \mathbb{R}^m, \quad \tilde{\mathbf{X}}_{B,bi} = \mathrm{vec}\big(\mathcal{X}_{bi}\times_1\tilde{\mathbf{U}}_1^\top\times_2\tilde{\mathbf{U}}_2^\top\times_3\tilde{\mathbf{U}}_3^\top\big),$$
$$\tilde{\mathbf{X}}_{D_k,bi} = \mathrm{vec}\big(\tilde{\mathbf{U}}_{k\perp}^\top\mathcal{M}_k\big(\mathcal{X}_{bi}\times_{k+1}\tilde{\mathbf{U}}_{k+1}^\top\times_{k+2}\tilde{\mathbf{U}}_{k+2}^\top\big)\tilde{\mathbf{V}}_k\big), \qquad (11)$$
$$\tilde{\mathbf{G}}_b = \sum_{i=1}^{B_b}\tilde{\mathbf{X}}_{bi}^\top\tilde{\mathbf{X}}_{bi}, \qquad \tilde{\mathbf{z}}_b = \sum_{i=1}^{B_b}\tilde{\mathbf{X}}_{bi}^\top y_{bi} \qquad (12)$$
in each machine. Then we combine the outcomes to obtain
$$\hat{\gamma} = \Big(\sum_{b=1}^B\tilde{\mathbf{G}}_b\Big)^{-1}\Big(\sum_{b=1}^B\tilde{\mathbf{z}}_b\Big).$$
The computational complexity can be reduced to $O\big(\frac{np^3 r + nr^6}{B} + Tp^4\big)$ via this parallel scheme. In the large-scale simulation presented in this article, we implement this parallel scheme for speed-up.

To implement the proposed procedure, the Tucker rank inputs are required as tuning parameters. When they are unknown in practice, we can perform cross-validation or an adaptive rank selection scheme. A more detailed description and numerical results are postponed to Section D in the supplementary materials [137].

2.3 Sparse Low-rank Tensor Recovery

When the target tensor $\mathcal{A}$ is simultaneously low-rank and sparse, in the sense that (3) holds for a subset $J_s \subseteq \{1, 2, 3\}$ known a priori, we introduce the following sparse ISLET procedure. The pseudocode for sparse ISLET is summarized in Algorithm 2.

Step 1 (Probing sketching directions). When $\mathbb{E}\,\mathrm{vec}(\mathcal{X})\mathrm{vec}(\mathcal{X})^\top = \mathbf{I}_{p_1 p_2 p_3}$, we still evaluate the covariance tensor $\tilde{\mathcal{A}}$ as in Equation (4). Noting that $\mathcal{A} = [\![\mathcal{S}; \mathbf{U}_1, \mathbf{U}_2, \mathbf{U}_3]\!]$ and $\{\mathbf{U}_k\}_{k \in J_s}$ are row-wise sparse, we apply the sparse tensor alternating thresholding SVD (STAT-SVD) [136] on $\tilde{\mathcal{A}}$ to obtain $\tilde{\mathbf{U}}_k \in \mathbb{O}_{p_k, r_k}$, $k = 1, 2, 3$, as estimates of $\mathbf{U}_k$.
Here, STAT-SVD is a sparse tensor decomposition method proposed by [136] whose central ideas are a double projection-and-thresholding scheme and power iteration. Via STAT-SVD, we obtain the following sparse and low-rank approximation of $\mathcal{A}$,
$$\mathcal{A} \approx [\![\tilde{\mathcal{S}}; \tilde{\mathbf{U}}_1, \tilde{\mathbf{U}}_2, \tilde{\mathbf{U}}_3]\!], \quad \tilde{\mathbf{U}}_k \in \mathbb{O}_{p_k,r_k}, \quad \tilde{\mathcal{S}} = [\![\tilde{\mathcal{A}}; \tilde{\mathbf{U}}_1^\top, \tilde{\mathbf{U}}_2^\top, \tilde{\mathbf{U}}_3^\top]\!] \in \mathbb{R}^{r_1\times r_2\times r_3}.$$
We further evaluate $\tilde{\mathbf{V}}_k = \mathrm{QR}\big(\mathcal{M}_k^\top(\tilde{\mathcal{S}})\big) \in \mathbb{O}_{r_{k+1}r_{k+2}, r_k}$.

Step 2 (Group Lasso on sketched covariates). We perform sketching and construct the following importance sketching covariates based on $\{\tilde{\mathbf{U}}_k, \tilde{\mathbf{V}}_k\}_{k=1}^3$,
$$\tilde{\mathbf{X}}_B \in \mathbb{R}^{n\times(r_1 r_2 r_3)}, \quad (\tilde{\mathbf{X}}_B)_{[i,:]} = \mathrm{vec}\big(\mathcal{X}_i\times_1\tilde{\mathbf{U}}_1^\top\times_2\tilde{\mathbf{U}}_2^\top\times_3\tilde{\mathbf{U}}_3^\top\big),$$
$$\tilde{\mathbf{X}}_{E_k} \in \mathbb{R}^{n\times p_k r_k}, \quad (\tilde{\mathbf{X}}_{E_k})_{[i,:]} = \mathrm{vec}\big(\mathcal{M}_k\big(\mathcal{X}_i\times_{k+1}\tilde{\mathbf{U}}_{k+1}^\top\times_{k+2}\tilde{\mathbf{U}}_{k+2}^\top\big)\tilde{\mathbf{V}}_k\big). \qquad (13)$$
Then we perform regression on the submodels with these reduced-dimensional covariates $\tilde{\mathbf{X}}_B$ and $\tilde{\mathbf{X}}_{E_k}$, respectively, using least squares and the group Lasso [46, 133],
$$\hat{\mathcal{B}} \in \mathbb{R}^{r_1\times r_2\times r_3}, \quad \mathrm{vec}(\hat{\mathcal{B}}) = \arg\min_{\gamma\in\mathbb{R}^{r_1 r_2 r_3}}\|y - \tilde{\mathbf{X}}_B\gamma\|_2^2, \qquad (14)$$
$$\hat{\mathbf{E}}_k \in \mathbb{R}^{p_k\times r_k}, \quad \mathrm{vec}(\hat{\mathbf{E}}_k) = \begin{cases}\arg\min_\gamma\|y - \tilde{\mathbf{X}}_{E_k}\gamma\|_2^2, & \text{if } k\notin J_s;\\[2pt] \arg\min_\gamma\|y - \tilde{\mathbf{X}}_{E_k}\gamma\|_2^2 + \eta_k\sum_{j=1}^{p_k}\|\gamma_{G_{kj}}\|_2, & \text{if } k\in J_s.\end{cases} \qquad (15)$$
Here, $\{\eta_k\}_{k\in J_s}$ are the penalization levels and
$$G_{kj} = \{j,\, j+p_k,\, \ldots,\, j+p_k(r_k-1)\}, \quad j = 1, \ldots, p_k, \qquad (16)$$
form a partition of $\{1, \ldots, p_k r_k\}$ that is induced by the construction of $\tilde{\mathbf{X}}_{E_k}$ (details on why the group Lasso is used can be found in Section 3.2).

Step 3 (Constructing the final estimator). $\hat{\mathcal{A}}$ can be constructed using the regression coefficients $\hat{\mathcal{B}}$ and the $\hat{\mathbf{E}}_k$'s in the submodels (14) and (15),
$$\hat{\mathcal{A}} = [\![\hat{\mathcal{B}}; \hat{\mathbf{E}}_1(\tilde{\mathbf{U}}_1^\top\hat{\mathbf{E}}_1)^{-1}, \hat{\mathbf{E}}_2(\tilde{\mathbf{U}}_2^\top\hat{\mathbf{E}}_2)^{-1}, \hat{\mathbf{E}}_3(\tilde{\mathbf{U}}_3^\top\hat{\mathbf{E}}_3)^{-1}]\!]. \qquad (17)$$
More interpretation of (17) can be found in Section 3.2.

Algorithm 1 Importance Sketching Low-rank Estimation for Tensors (ISLET): Order-3 Case
Input: sample $\{y_j, \mathcal{X}_j\}_{j=1}^n$, Tucker rank $\mathbf{r} = (r_1, r_2, r_3)$.
1. Calculate $\tilde{\mathcal{A}} = \frac{1}{n}\sum_{j=1}^n y_j\mathcal{X}_j$.
2. Apply HOOI on $\tilde{\mathcal{A}}$ and obtain initial estimates $\tilde{\mathbf{U}}_1, \tilde{\mathbf{U}}_2, \tilde{\mathbf{U}}_3$. Let $\tilde{\mathcal{S}} = [\![\tilde{\mathcal{A}}; \tilde{\mathbf{U}}_1^\top, \tilde{\mathbf{U}}_2^\top, \tilde{\mathbf{U}}_3^\top]\!]$. Evaluate the sketching directions $\tilde{\mathbf{V}}_k = \mathrm{QR}[\mathcal{M}_k(\tilde{\mathcal{S}})^\top]$, $k = 1, 2, 3$.
3. Construct $\tilde{\mathbf{X}} = [\tilde{\mathbf{X}}_B \ \tilde{\mathbf{X}}_{D_1} \ \tilde{\mathbf{X}}_{D_2} \ \tilde{\mathbf{X}}_{D_3}] \in \mathbb{R}^{n\times m}$, where
$$\tilde{\mathbf{X}}_B \in \mathbb{R}^{n\times m_B}, \quad (\tilde{\mathbf{X}}_B)_{[i,:]} = \mathrm{vec}\big(\mathcal{X}_i\times_1\tilde{\mathbf{U}}_1^\top\times_2\tilde{\mathbf{U}}_2^\top\times_3\tilde{\mathbf{U}}_3^\top\big),$$
$$\tilde{\mathbf{X}}_{D_k} \in \mathbb{R}^{n\times m_{D_k}}, \quad (\tilde{\mathbf{X}}_{D_k})_{[i,:]} = \mathrm{vec}\big(\tilde{\mathbf{U}}_{k\perp}^\top\mathcal{M}_k\big(\mathcal{X}_i\times_{k+1}\tilde{\mathbf{U}}_{k+1}^\top\times_{k+2}\tilde{\mathbf{U}}_{k+2}^\top\big)\tilde{\mathbf{V}}_k\big),$$
for $m_B = r_1 r_2 r_3$, $m_{D_k} = (p_k - r_k)r_k$, and $k = 1, 2, 3$.
4. Solve $\hat{\gamma} = \arg\min_{\gamma\in\mathbb{R}^m}\|y - \tilde{\mathbf{X}}\gamma\|_2^2$.
5. Partition $\hat{\gamma}$ and assign each part to $\hat{\mathcal{B}}, \hat{\mathbf{D}}_1, \hat{\mathbf{D}}_2, \hat{\mathbf{D}}_3$, respectively:
$$\mathrm{vec}(\hat{\mathcal{B}}) := \hat{\gamma}_B = \hat{\gamma}_{[1:m_B]}, \quad \mathrm{vec}(\hat{\mathbf{D}}_k) := \hat{\gamma}_{D_k} = \hat{\gamma}_{[(m_B+\sum_{k'=1}^{k-1}m_{D_{k'}}+1):(m_B+\sum_{k'=1}^{k}m_{D_{k'}})]}, \quad k = 1, 2, 3.$$
6. Let $\hat{\mathbf{B}}_k = \mathcal{M}_k(\hat{\mathcal{B}})$. Evaluate
$$\hat{\mathcal{A}} = [\![\hat{\mathcal{B}}; \hat{\mathbf{L}}_1, \hat{\mathbf{L}}_2, \hat{\mathbf{L}}_3]\!], \quad \hat{\mathbf{L}}_k = \big(\tilde{\mathbf{U}}_k\hat{\mathbf{B}}_k\tilde{\mathbf{V}}_k + \tilde{\mathbf{U}}_{k\perp}\hat{\mathbf{D}}_k\big)\big(\hat{\mathbf{B}}_k\tilde{\mathbf{V}}_k\big)^{-1}, \quad k = 1, 2, 3.$$

While one of the main focuses of this article is low-rank tensor regression, from a sketching perspective ISLET can be seen as a special case of a more general algorithm that broadly applies to high-dimensional statistical problems with dimension-reduced structure. In fact, the three steps of the ISLET procedure are completely general and are summarized informally here:

Step 1 (Probing projection directions). For the tensor regression problem, we use the HOOI [36] or STAT-SVD [136] approach to find the informative low-rank subspaces along which we project/sketch. More generally, if we let $\tilde{\mathcal{A}} = \frac{1}{n}\sum_{j=1}^n y_j\mathcal{X}_j$, where $\mathcal{X}_j$ has ambient dimension $p$, we can define a general projection operator (with a slight abuse of notation) $P_m(\cdot): \mathbb{R}^p \to \mathbb{R}^m$ indexed by a low dimension $m$ and let $S(\tilde{\mathcal{A}})$ be the $m$-dimensional subspace of $\mathbb{R}^p$ determined by performing $P_m(\tilde{\mathcal{A}})$.

Algorithm 2 Sparse Importance Sketching Low-rank Estimation for Tensors (Sparse ISLET): Order-3 Case
Input: sample $\{y_j, \mathcal{X}_j\}_{j=1}^n$, Tucker rank $\mathbf{r} = (r_1, r_2, r_3)$, sparsity index $J_s \subseteq \{1, 2, 3\}$.
1. Evaluate $\tilde{\mathcal{A}} = \frac{1}{n}\sum_{j=1}^n y_j\mathcal{X}_j$.
2. Apply STAT-SVD on $\tilde{\mathcal{A}}$ with sparsity index $J_s$. Let the outcome be $\tilde{\mathbf{U}}_1, \tilde{\mathbf{U}}_2, \tilde{\mathbf{U}}_3$. Let $\tilde{\mathcal{S}} = [\![\tilde{\mathcal{A}}; \tilde{\mathbf{U}}_1^\top, \tilde{\mathbf{U}}_2^\top, \tilde{\mathbf{U}}_3^\top]\!]$ and evaluate the probing directions $\tilde{\mathbf{V}}_k = \mathrm{QR}[\mathcal{M}_k(\tilde{\mathcal{S}})^\top]$, $k = 1, 2, 3$.
3. Construct
$$\tilde{\mathbf{X}}_B \in \mathbb{R}^{n\times(r_1 r_2 r_3)}, \quad (\tilde{\mathbf{X}}_B)_{[i,:]} = \mathrm{vec}\big(\mathcal{X}_i\times_1\tilde{\mathbf{U}}_1^\top\times_2\tilde{\mathbf{U}}_2^\top\times_3\tilde{\mathbf{U}}_3^\top\big),$$
$$\tilde{\mathbf{X}}_{E_k} \in \mathbb{R}^{n\times(p_k r_k)}, \quad (\tilde{\mathbf{X}}_{E_k})_{[i,:]} = \mathrm{vec}\big(\mathcal{M}_k\big(\mathcal{X}_i\times_{k+1}\tilde{\mathbf{U}}_{k+1}^\top\times_{k+2}\tilde{\mathbf{U}}_{k+2}^\top\big)\tilde{\mathbf{V}}_k\big).$$
4. Solve
$$\hat{\mathcal{B}} \in \mathbb{R}^{r_1\times r_2\times r_3}, \quad \mathrm{vec}(\hat{\mathcal{B}}) = \arg\min_{\gamma\in\mathbb{R}^{r_1 r_2 r_3}}\|y - \tilde{\mathbf{X}}_B\gamma\|_2^2;$$
$$\hat{\mathbf{E}}_k \in \mathbb{R}^{p_k\times r_k}, \quad \mathrm{vec}(\hat{\mathbf{E}}_k) = \begin{cases}\arg\min_\gamma\|y - \tilde{\mathbf{X}}_{E_k}\gamma\|_2^2 + \lambda_k\sum_{j=1}^{p_k}\|\gamma_{G_{kj}}\|_2, & k\in J_s;\\[2pt] \arg\min_\gamma\|y - \tilde{\mathbf{X}}_{E_k}\gamma\|_2^2, & k\notin J_s.\end{cases}$$
5. Evaluate $\hat{\mathcal{A}} = [\![\hat{\mathcal{B}}; \hat{\mathbf{E}}_1(\tilde{\mathbf{U}}_1^\top\hat{\mathbf{E}}_1)^{-1}, \hat{\mathbf{E}}_2(\tilde{\mathbf{U}}_2^\top\hat{\mathbf{E}}_2)^{-1}, \hat{\mathbf{E}}_3(\tilde{\mathbf{U}}_3^\top\hat{\mathbf{E}}_3)^{-1}]\!]$.

Step 2 (Estimation in subspaces). The second step involves first projecting the data $\mathcal{X}$ onto the subspace $S(\tilde{\mathcal{A}})$, specifically $\tilde{\mathbf{X}} = P_{S(\tilde{\mathcal{A}})}(\mathcal{X}) \in \mathbb{R}^{n\times m}$.
Then we perform regression or other procedures of choice using the sketched data $\tilde{\mathbf{X}}$ to determine the dimension-reduced parameter $\hat{\gamma} \in \mathbb{R}^m$.

Step 3 (Embedding into the high-dimensional space). Finally, we need to project the estimator back to the high-dimensional space $\mathbb{R}^p$ by applying an equivalent of the inverse of the projection operator, $P^{-1}_{S(\tilde{\mathcal{A}})}: \mathbb{R}^m \to \mathbb{R}^p$. For low-rank tensor regression we use the formula (9).

The description above illustrates that the idea of ISLET is applicable to more general high-dimensional problems with dimension-reduced structure. In fact, the well-regarded sure independence screening in high-dimensional sparse linear regression [44, 129] can be seen as a special case of this idea. To be specific, consider the high-dimensional linear regression model $y_i = \mathbf{X}_{[i,:]}\beta + \varepsilon_i$, $i = 1, \ldots, n$, where $\beta$ is the $m$-sparse vector of interest and $y_i \in \mathbb{R}$ and $\mathbf{X}_{[i,:]}^\top \in \mathbb{R}^p$ are the observable response and covariate. Then the $m$-dimensional subspace $S(\tilde{\beta})$ in Step 1 can be the set of coordinates corresponding to the $m$ largest entries of $\tilde{\beta} = \sum_{i=1}^n\mathbf{X}_{[i,:]}^\top y_i$; Step 2 corresponds to the dimension-reduced least squares in sure independence screening; the inverse operator in Step 3 simply fills in 0's in the coordinates that do not correspond to $S(\tilde{\beta})$. In addition, this idea applies more broadly to problems such as matrix and tensor completion. One of the novel contributions of this article is finding suitable projection and inverse operators for low-rank tensors.

We can also contrast this approach with prior approaches that involve randomized sketching [38, 100, 102]. These prior works showed that randomized sketching may lose data substantially, increase the variance, and yield suboptimal results for many statistical problems. There are two key differences in how we exploit sketching in our context: (1) we sketch along the parameter directions of $\mathcal{X}$, reducing the data from $\mathbb{R}^{n\times p}$ to $\mathbb{R}^{n\times m}$, whereas the approaches in [38, 100, 102] sketch along the sample directions, reducing the data from $\mathbb{R}^{n\times p}$ to $\mathbb{R}^{m\times p}$, which reduces the effective sample size from $n$ to $m$; (2) second and most importantly, rather than using randomized sketching that is unsupervised, i.e., constructed without the response $y$, our importance sketching is supervised, that is, obtained using both the response $y$ and the covariates $\mathcal{X}$. We then sketch along the subspace $S(\tilde{\mathcal{A}})$, which contains information on the low-dimensional structure of the parameter $\mathcal{A}$. This is why our general procedure has both desirable statistical and computational properties.

In this section, we provide general oracle inequalities without focusing on a specific design, which gives a general guideline for the theoretical analyses of our ISLET procedure. We first introduce a quantification of the errors in the sketching directions obtained in the first step of ISLET. Let $\mathbf{V}_k \in \mathbb{O}_{r_{k+1}r_{k+2}, r_k}$ be the right singular subspace of $\mathcal{M}_k(\mathcal{S})$, where $\mathcal{S}$ is the core tensor in the Tucker decomposition of $\mathcal{A}$: $\mathcal{A} = [\![\mathcal{S}; \mathbf{U}_1, \mathbf{U}_2, \mathbf{U}_3]\!]$. By Lemma 1 in the supplementary materials [137],
$$\mathbf{W}_k := (\mathbf{U}_{k+2}\otimes\mathbf{U}_{k+1})\mathbf{V}_k \in \mathbb{O}_{p_{k+1}p_{k+2},\,r_k}, \quad k = 1, 2, 3, \qquad (18)$$
are the right singular subspaces of $\mathcal{M}_1(\mathcal{A})$, $\mathcal{M}_2(\mathcal{A})$, and $\mathcal{M}_3(\mathcal{A})$, respectively. Recall that we initially estimate $\mathbf{U}_k$ and $\mathbf{V}_k$ by $\tilde{\mathbf{U}}_k$ and $\tilde{\mathbf{V}}_k$, respectively, in Step 1 of ISLET.
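The relation (18) can be checked numerically. The sketch below (reusing matricize from the earlier snippets; because of its row-major matricization convention, the Kronecker factors appear in ascending mode order, which may differ from the factor ordering in (18)) verifies that the columns of the Kronecker-product construction span the right singular subspace of each matricization.

```python
import numpy as np

rng = np.random.default_rng(3)
S = rng.standard_normal((2, 3, 2))
U = [np.linalg.qr(rng.standard_normal((p, r)))[0] for p, r in zip((8, 9, 7), (2, 3, 2))]
A = np.einsum('abc,ia,jb,kc->ijk', S, U[0], U[1], U[2])

for k in range(3):
    others = [j for j in range(3) if j != k]
    # V_k: orthonormal basis for the row space of M_k(S)
    V_k = np.linalg.qr(matricize(S, k).T)[0]
    W_k = np.kron(U[others[0]], U[others[1]]) @ V_k
    # right singular subspace of M_k(A)
    r_k = S.shape[k]
    Vt = np.linalg.svd(matricize(A, k), full_matrices=False)[2][:r_k].T
    # the two projection matrices agree iff the subspaces coincide
    assert np.allclose(W_k @ W_k.T, Vt @ Vt.T)
```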
Define $\tilde{\mathbf{W}}_k = (\tilde{\mathbf{U}}_{k+2}\otimes\tilde{\mathbf{U}}_{k+1})\tilde{\mathbf{V}}_k$, $k = 1, 2, 3$, in parallel to (18). Intuitively speaking, $\{\tilde{\mathbf{U}}_k, \tilde{\mathbf{W}}_k\}_{k=1}^3$ can be seen as initial sample approximations of $\{\mathbf{U}_k, \mathbf{W}_k\}_{k=1}^3$. Therefore, we quantify the sketching direction error by
$$\theta := \max_{k=1,2,3}\big\{\|\sin\Theta(\tilde{\mathbf{U}}_k, \mathbf{U}_k)\|,\ \|\sin\Theta(\tilde{\mathbf{W}}_k, \mathbf{W}_k)\|\big\}. \qquad (19)$$
Next, we provide oracle inequalities via $\theta$ for ISLET under the regular and sparse settings, respectively, in the next two subsections.

In order to study the theoretical properties of the proposed procedure, we need to introduce another representation of the original model (1). Decompose the vectorized parameter $\mathcal{A}$ as follows:
$$\begin{aligned}\mathrm{vec}(\mathcal{A}) &= P_{\tilde{\mathbf{U}}}\,\mathrm{vec}(\mathcal{A}) + P_{\tilde{\mathbf{U}}_\perp}\mathrm{vec}(\mathcal{A})\\ &= P_{\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3}\mathrm{vec}(\mathcal{A}) + P_{R_1(\tilde{\mathbf{W}}_1\otimes\tilde{\mathbf{U}}_{1\perp})}\mathrm{vec}(\mathcal{A}) + P_{R_2(\tilde{\mathbf{W}}_2\otimes\tilde{\mathbf{U}}_{2\perp})}\mathrm{vec}(\mathcal{A}) + P_{R_3(\tilde{\mathbf{W}}_3\otimes\tilde{\mathbf{U}}_{3\perp})}\mathrm{vec}(\mathcal{A}) + P_{\tilde{\mathbf{U}}_\perp}\mathrm{vec}(\mathcal{A})\\ &= (\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3)\mathrm{vec}(\tilde{\mathcal{B}}) + R_1(\tilde{\mathbf{W}}_1\otimes\tilde{\mathbf{U}}_{1\perp})\mathrm{vec}(\tilde{\mathbf{D}}_1) + R_2(\tilde{\mathbf{W}}_2\otimes\tilde{\mathbf{U}}_{2\perp})\mathrm{vec}(\tilde{\mathbf{D}}_2) + R_3(\tilde{\mathbf{W}}_3\otimes\tilde{\mathbf{U}}_{3\perp})\mathrm{vec}(\tilde{\mathbf{D}}_3) + P_{\tilde{\mathbf{U}}_\perp}\mathrm{vec}(\mathcal{A}).\end{aligned} \qquad (20)$$
(See the proof of Theorem 2 for a detailed derivation of (20).) Here,
$$\tilde{\mathbf{U}} = \big[\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3 \ \ R_1(\tilde{\mathbf{W}}_1\otimes\tilde{\mathbf{U}}_{1\perp}) \ \ R_2(\tilde{\mathbf{W}}_2\otimes\tilde{\mathbf{U}}_{2\perp}) \ \ R_3(\tilde{\mathbf{W}}_3\otimes\tilde{\mathbf{U}}_{3\perp})\big],$$
$$\tilde{\mathcal{B}} := [\![\mathcal{A}; \tilde{\mathbf{U}}_1^\top, \tilde{\mathbf{U}}_2^\top, \tilde{\mathbf{U}}_3^\top]\!] \in \mathbb{R}^{r_1\times r_2\times r_3}, \qquad \tilde{\mathbf{D}}_k := \tilde{\mathbf{U}}_{k\perp}^\top\mathcal{M}_k(\mathcal{A})\tilde{\mathbf{W}}_k \in \mathbb{R}^{(p_k-r_k)\times r_k}$$
are the singular subspace of the "Cross structure" and the low-dimensional projections of $\mathcal{A}$ onto the "body" and "arms" formed by the sketching directions $\{\tilde{\mathbf{U}}_k, \tilde{\mathbf{V}}_k\}_{k=1}^3$, respectively (see Figure 4 for an illustration of $\tilde{\mathbf{U}}$, $\tilde{\mathcal{B}}$, and $\tilde{\mathbf{V}}_k$). Due to different alignments, the $i$th row of $\{\tilde{\mathbf{W}}_k\otimes\tilde{\mathbf{U}}_{k\perp}\}_{k=1}^3$ does not necessarily correspond to the $i$th entry of $\mathrm{vec}(\mathcal{A})$ for all $1 \leq i \leq p_1p_2p_3$. We thus permute the rows of $\{\tilde{\mathbf{W}}_k\otimes\tilde{\mathbf{U}}_{k\perp}\}_{k=1}^3$ to match each row of $R_k(\tilde{\mathbf{W}}_k\otimes\tilde{\mathbf{U}}_{k\perp})$ to the corresponding entry of $\mathrm{vec}(\mathcal{A})$. The formal definition of the rowwise permutation operator $R_k$ is rather clunky and is postponed to Section A in the supplementary materials. Intuitively speaking, $P_{\tilde{\mathbf{U}}}\mathrm{vec}(\mathcal{A})$ represents the projection of $\mathcal{A}$ onto the Cross structure, and $P_{\tilde{\mathbf{U}}_\perp}\mathrm{vec}(\mathcal{A})$ can be seen as a residual. If the estimates $\{\tilde{\mathbf{U}}_k, \tilde{\mathbf{W}}_k\}_{k=1}^3$ are close enough to $\{\mathbf{U}_k, \mathbf{W}_k\}_{k=1}^3$, i.e., $\theta$ defined in (19) is small, we expect that the residual $P_{\tilde{\mathbf{U}}_\perp}\mathrm{vec}(\mathcal{A})$ has small amplitude.

Based on (20), we can rewrite the original regression model (1) into the following partial regression model:
$$y_j = (\tilde{\mathbf{X}}_B)_{[j,:]}\mathrm{vec}(\tilde{\mathcal{B}}) + \sum_{k=1}^3(\tilde{\mathbf{X}}_{D_k})_{[j,:]}\mathrm{vec}(\tilde{\mathbf{D}}_k) + \mathrm{vec}(\mathcal{X}_j)^\top P_{\tilde{\mathbf{U}}_\perp}\mathrm{vec}(\mathcal{A}) + \varepsilon_j = \tilde{\mathbf{X}}_{[j,:]}\tilde{\gamma} + \tilde{\varepsilon}_j, \quad j = 1, \ldots, n. \qquad (21)$$
(See the proof of Theorem 2 for a detailed derivation of (21).)
[Figure 4: Illustration of decomposition (20). Here we assume $\tilde{\mathbf{U}}_k^\top = [\mathbf{I}_{r_k}\ \mathbf{0}_{r_k\times(p_k-r_k)}]$, $k = 1, 2, 3$, for better visualization. The gray, green, blue, and red cubes represent the subspaces of $\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3$, $\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_{3\perp}$, $\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_{2\perp}\otimes\tilde{\mathbf{U}}_3$, and $\tilde{\mathbf{U}}_{1\perp}\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3$. The gray cube also corresponds to the projected parameter $\tilde{\mathcal{B}}$; matricizations of the green, blue, and red cubes correspond to the projected parameters $\tilde{\mathbf{U}}_{3\perp}^\top\mathcal{M}_3(\mathcal{A})(\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2)$, $\tilde{\mathbf{U}}_{2\perp}^\top\mathcal{M}_2(\mathcal{A})(\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_3)$, and $\tilde{\mathbf{U}}_{1\perp}^\top\mathcal{M}_1(\mathcal{A})(\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3)$, respectively. The three planes in the right panel correspond to the subspaces of $\tilde{\mathbf{V}}_1$, $\tilde{\mathbf{V}}_2$, and $\tilde{\mathbf{V}}_3$, respectively.]

Here,
• $\tilde{\varepsilon}_j = \mathrm{vec}(\mathcal{X}_j)^\top P_{\tilde{\mathbf{U}}_\perp}\mathrm{vec}(\mathcal{A}) + \varepsilon_j$ is the oracle noise; $\tilde{\varepsilon} = (\tilde{\varepsilon}_1, \ldots, \tilde{\varepsilon}_n)^\top$;
• $\tilde{\mathbf{X}}_B, \tilde{\mathbf{X}}_{D_k}$ are the sketching covariates introduced in Equation (6);
• $\tilde{\gamma} = [\mathrm{vec}(\tilde{\mathcal{B}})^\top, \mathrm{vec}(\tilde{\mathbf{D}}_1)^\top, \mathrm{vec}(\tilde{\mathbf{D}}_2)^\top, \mathrm{vec}(\tilde{\mathbf{D}}_3)^\top]^\top = \tilde{\mathbf{U}}^\top\mathrm{vec}(\mathcal{A}) \in \mathbb{R}^m$ is the dimension-reduced parameter.

Equation (21) reveals the essence of the least squares estimator (7) in the ISLET procedure: the outcomes of (7) and (8), i.e., $\hat{\mathcal{B}}$ and $\hat{\mathbf{D}}_k$, are sample-based estimates of $\tilde{\mathcal{B}}$ and $\tilde{\mathbf{D}}_k$. Finally, based on the detailed algebraic calculation in Step 3 and the proof of Theorem 2,
$$\mathcal{A} = [\![\tilde{\mathcal{B}}; \tilde{\mathbf{L}}_1, \tilde{\mathbf{L}}_2, \tilde{\mathbf{L}}_3]\!], \qquad \tilde{\mathbf{L}}_k = \big(\tilde{\mathbf{U}}_k\tilde{\mathbf{B}}_k\tilde{\mathbf{V}}_k + \tilde{\mathbf{U}}_{k\perp}\tilde{\mathbf{D}}_k\big)\big(\tilde{\mathbf{B}}_k\tilde{\mathbf{V}}_k\big)^{-1}. \qquad (22)$$
Equation (22) is essentially a higher-order version of the Schur complement formula (also see [20]). Finally, we apply the plug-in estimator to obtain the final estimator $\hat{\mathcal{A}}$ (Equation (9) in Step 3 of the ISLET procedure).

Based on the previous discussion, it can be seen that the estimation error of the original tensor regression is driven by the error of the least squares estimator $\hat{\gamma}$, i.e., $\|(\tilde{\mathbf{X}}^\top\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\top\tilde{\varepsilon}\|_2$. We have the following oracle inequality for the proposed ISLET procedure.

Theorem 2 (Oracle inequality for regular tensor estimation: order-3 case). Suppose $\mathcal{A} \in \mathbb{R}^{p_1\times p_2\times p_3}$ is a Tucker rank-$(r_1, r_2, r_3)$ tensor and $\hat{\mathcal{A}}$ is the outcome of Algorithm 1. Assume the sketching directions $\{\tilde{\mathbf{U}}_k, \tilde{\mathbf{V}}_k\}_{k=1}^3$ satisfy $\theta < 1/2$ (see (19) for the definition of $\theta$) and $\|\hat{\mathbf{D}}_k(\hat{\mathbf{B}}_k\tilde{\mathbf{V}}_k)^{-1}\| \leq \rho$. We do not impose other specific assumptions on $\mathcal{X}_i$ and $\varepsilon_i$. Then we have
$$\big\|\hat{\mathcal{A}} - \mathcal{A}\big\|_{\rm HS} \leq (1 + C(\theta + \rho))\big\|(\tilde{\mathbf{X}}^\top\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\top\tilde{\varepsilon}\big\|_2$$
for a uniform constant $C > 0$ that does not rely on any other parameters.

Proof. See Appendix F.1 for a complete proof. In particular, the proof contains three major steps.
After introducing a number of notations, we first transform the original regression model into the partial regression model (21) and then rewrite the upper bound $\|(\tilde{\mathbf{X}}^\top\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^\top\tilde{\varepsilon}\|_2$ in terms of $\|\hat{\mathcal{B}} - \tilde{\mathcal{B}}\|_{\rm HS}$ and $\sum_{k=1}^3\|\hat{\mathbf{D}}_k - \tilde{\mathbf{D}}_k\|_F$. Next, we introduce a factorization of $\mathcal{A}$ in parallel with the one of $\hat{\mathcal{A}}$, based on which the loss $\|\hat{\mathcal{A}} - \mathcal{A}\|_{\rm HS}$ is decomposed into eight terms. Finally, we introduce a novel deterministic error bound for the "Cross scheme" (Lemma 3 in the supplementary materials [137]; also see [135]), carefully analyze each term in the decomposition of $\|\hat{\mathcal{A}} - \mathcal{A}\|_{\rm HS}$, and finalize the proof.

Theorem 2 shows that once the sketching directions $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$ are reasonably accurate, the estimation error of $\hat{\mathcal{A}}$ will be close to the error of the partial linear regression in Equation (21). This bound is general and deterministic, and it can be used as a key step in more specific settings of low-rank tensor regression.

Next, we study the oracle performance of the proposed procedure for sparse tensor regression, where $\mathcal{A}$ further satisfies the sparsity constraint (3). As in the previous section, we decompose the vectorized parameter as
$$\mathrm{vec}(\mathcal{A}) = P_{\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3}\mathrm{vec}(\mathcal{A}) + P_{(\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3)_\perp}\mathrm{vec}(\mathcal{A}) = (\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3)\mathrm{vec}(\tilde{\mathcal{B}}) + P_{(\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3)_\perp}\mathrm{vec}(\mathcal{A}); \qquad (23)$$
$$\mathrm{vec}(\mathcal{A}) = P_{R_k(\tilde{\mathbf{W}}_k\otimes\mathbf{I}_{p_k})}\mathrm{vec}(\mathcal{A}) + P_{R_k(\tilde{\mathbf{W}}_k\otimes\mathbf{I}_{p_k})_\perp}\mathrm{vec}(\mathcal{A}) = R_k(\tilde{\mathbf{W}}_k\otimes\mathbf{I}_{p_k})\mathrm{vec}(\tilde{\mathbf{E}}_k) + P_{R_k(\tilde{\mathbf{W}}_k\otimes\mathbf{I}_{p_k})_\perp}\mathrm{vec}(\mathcal{A}), \quad k = 1, 2, 3. \qquad (24)$$
Here,
$$\tilde{\mathcal{B}} := [\![\mathcal{A}; \tilde{\mathbf{U}}_1^\top, \tilde{\mathbf{U}}_2^\top, \tilde{\mathbf{U}}_3^\top]\!] \in \mathbb{R}^{r_1\times r_2\times r_3}; \qquad \tilde{\mathbf{E}}_k := \mathcal{M}_k\big(\mathcal{A}\times_{k+1}\tilde{\mathbf{U}}_{k+1}^\top\times_{k+2}\tilde{\mathbf{U}}_{k+2}^\top\big)\tilde{\mathbf{V}}_k \in \mathbb{R}^{p_k\times r_k}, \quad k = 1, 2, 3, \qquad (25)$$
are the low-dimensional projections of $\mathcal{A}$ onto the importance sketching directions. Since $\{\mathbf{U}_k, \mathbf{W}_k\}$ are the left and right singular subspaces of $\mathcal{M}_k(\mathcal{A})$, we can show that $P_{(\mathbf{U}_1\otimes\mathbf{U}_2\otimes\mathbf{U}_3)_\perp}\mathrm{vec}(\mathcal{A})$ and $P_{R_k(\mathbf{W}_k\otimes\mathbf{I}_{p_k})_\perp}\mathrm{vec}(\mathcal{A})$ are zero. Thus, if the estimates $\{\tilde{\mathbf{U}}_k, \tilde{\mathbf{W}}_k\}_{k=1}^3$ are sufficiently accurate, i.e., $\theta$ defined in Eq. (19) is small, we can expect that the residuals $P_{(\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3)_\perp}\mathrm{vec}(\mathcal{A})$ and $P_{R_k(\tilde{\mathbf{W}}_k\otimes\mathbf{I}_{p_k})_\perp}\mathrm{vec}(\mathcal{A})$ have small amplitudes. Then, based on a more detailed calculation in the proof of Theorem 3, the sparse and low-rank tensor regression model $y_j = \langle\mathcal{X}_j, \mathcal{A}\rangle + \varepsilon_j$ can be rewritten as the following partial linear regressions,
$$y_j = (\tilde{\mathbf{X}}_B)_{[j,:]}\mathrm{vec}(\tilde{\mathcal{B}}) + (\tilde{\varepsilon}_B)_j, \qquad (26)$$
$$y_j = (\tilde{\mathbf{X}}_{E_k})_{[j,:]}\mathrm{vec}(\tilde{\mathbf{E}}_k) + (\tilde{\varepsilon}_{E_k})_j, \quad k = 1, 2, 3. \qquad (27)$$
Here, $\tilde{\mathbf{X}}_B$ and $\tilde{\mathbf{X}}_{E_k}$ are the covariates defined in Equation (13), and $\tilde{\varepsilon}_B = ((\tilde{\varepsilon}_B)_1, \ldots, (\tilde{\varepsilon}_B)_n)^\top$, $\tilde{\varepsilon}_{E_k} = ((\tilde{\varepsilon}_{E_k})_1, \ldots, (\tilde{\varepsilon}_{E_k})_n)^\top$ are oracle noises defined as
$$(\tilde{\varepsilon}_B)_j = \big\langle\mathrm{vec}(\mathcal{X}_j),\, P_{(\tilde{\mathbf{U}}_1\otimes\tilde{\mathbf{U}}_2\otimes\tilde{\mathbf{U}}_3)_\perp}\mathrm{vec}(\mathcal{A})\big\rangle + \varepsilon_j \quad \text{and} \quad (\tilde{\varepsilon}_{E_k})_j = \big\langle\mathrm{vec}(\mathcal{X}_j),\, P_{(R_k(\tilde{\mathbf{W}}_k\otimes\mathbf{I}_{p_k}))_\perp}\mathrm{vec}(\mathcal{A})\big\rangle + \varepsilon_j. \qquad (28)$$
Therefore, Step 2 of sparse ISLET can be interpreted as the estimation of $\tilde{\mathcal{B}}$ and $\tilde{\mathbf{E}}_k$. We apply regular least squares to estimate $\tilde{\mathcal{B}}$ and the $\tilde{\mathbf{E}}_k$ for $k \notin J_s$. For any sparse mode $k \in J_s$, $\tilde{\mathbf{E}}_k$ is group sparse due to the definition (25) and the assumption that $\mathbf{U}_k$ is row-wise sparse. Specifically, $\tilde{\mathbf{E}}_k$ satisfies
$$\big\|\mathrm{vec}(\tilde{\mathbf{E}}_k)\big\|_{0,2} := \sum_{i=1}^{p_k} 1_{\{(\mathrm{vec}(\tilde{\mathbf{E}}_k))_{G_{ki}} \neq 0\}} \leq s_k, \qquad (29)$$
where
$$G_{ki} = \{i,\, i+p_k,\, \ldots,\, i+p_k(r_k-1)\}, \quad i = 1, \ldots, p_k, \quad \forall k \in J_s,$$
is a partition of $\{1, \ldots, p_k r_k\}$ (see the proof of Theorem 3 for a more detailed argument for (29)). By the detailed calculations in Step 3 of the proof of Theorem 2, one can verify that
$$\mathcal{A} = [\![\tilde{\mathcal{B}}; \tilde{\mathbf{E}}_1(\tilde{\mathbf{U}}_1^\top\tilde{\mathbf{E}}_1)^{-1}, \tilde{\mathbf{E}}_2(\tilde{\mathbf{U}}_2^\top\tilde{\mathbf{E}}_2)^{-1}, \tilde{\mathbf{E}}_3(\tilde{\mathbf{U}}_3^\top\tilde{\mathbf{E}}_3)^{-1}]\!].$$
Then the final sparse ISLET estimator $\hat{\mathcal{A}}$ in (17) can be seen as the plug-in estimator. To ensure that the group Lasso estimator in (15) provides a stable estimate for the proposed procedure, we introduce the following group restricted isometry condition, which can also be seen as an extension of the restricted isometry property (RIP), a commonly used condition in the compressed sensing and high-dimensional linear regression literature [26].

Condition 1.
Condition 1. We say a matrix $X\in\mathbb{R}^{n\times p}$ satisfies the group restricted isometry property (GRIP) with respect to the partition $G_1,\ldots,G_m\subseteq\{1,\ldots,p\}$ if there exists $\delta>0$ such that

$n(1-\delta)\|v\|_2^2 \le \|Xv\|_2^2 \le n(1+\delta)\|v\|_2^2$ (30)

for all group-wise sparse vectors $v$ satisfying $\sum_{k=1}^m 1\{v_{G_k}\neq 0\}\le s$.
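To make the group structure above concrete, the following small numpy sketch (illustrative only; sizes and variable names are ours, not from the paper's code) builds the groups $G_{ki}=\{i,\,i+p_k,\ldots,\,i+p_k(r_k-1)\}$ from (29), using 0-based indexing, and checks that $\mathrm{vec}(\tilde{E}_k)$ is group sparse in the sense of (29) whenever $\tilde{E}_k$ has at most $s_k$ nonzero rows — exactly the kind of vector to which Condition 1 is applied.

```python
import numpy as np

p_k, r_k, s_k = 8, 2, 3          # illustrative sizes (not from the paper)
rng = np.random.default_rng(0)

# E_k with only s_k nonzero rows, mimicking a row-wise sparse loading in (25)
E_k = np.zeros((p_k, r_k))
support = rng.choice(p_k, size=s_k, replace=False)
E_k[support, :] = rng.standard_normal((s_k, r_k))

v = E_k.reshape(-1, order="F")   # vec() stacks columns, matching vec(E_k)

# Groups G_{ki} = {i, i + p_k, ..., i + p_k (r_k - 1)}: entries of row i across columns
groups = [np.arange(i, p_k * r_k, p_k) for i in range(p_k)]

# Number of groups with a nonzero block, i.e., the "l_{0,2}" count in (29)
nonzero_groups = sum(np.linalg.norm(v[g]) > 0 for g in groups)
print(nonzero_groups <= s_k)     # True: vec(E_k) is group sparse
```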
We still use $\theta$ defined in Eq. (19) to characterize the sketching direction errors. The following oracle inequality holds for sparse tensor regression with importance sketching.

Theorem 3 (Oracle Inequality for Sparse Tensor Regression: Order-3 Case). Consider the sparse low-rank tensor regression (1)(3). Suppose $\theta$ is sufficiently small and the importance sketching covariates $\tilde{X}_B$ and $\tilde{X}_{E_k}$ ($k\notin J_s$) are nonsingular. For any $k\in J_s$, suppose $\tilde{X}_{E_k}$ satisfies the group restricted isometry property (Condition 1) with respect to the partition $G_{k1},\ldots,G_{kp_k}$ in (16) with a sufficiently small constant $\delta$. We apply the proposed Algorithm 2 with group Lasso penalty $\eta_k = C\max_{i=1,\ldots,p_k}\big\|(\tilde{X}_{E_k,[:,G_{ki}]})^\top\tilde{\varepsilon}_{E_k}\big\|_2$ for $k\in J_s$ and a sufficiently large constant $C$. We also assume $\|\tilde{U}_{k\perp}^\top\hat{E}_k(\tilde{U}_k^\top\hat{E}_k)^{-1}\| \le \rho$. Then,

$\big\|\hat{\mathcal{A}}-\mathcal{A}\big\|_{\rm HS}^2 \le \big(1 + Cs(\theta+\rho)\big)\bigg(\big\|(\tilde{X}_B^\top\tilde{X}_B)^{-1}\tilde{X}_B^\top\tilde{\varepsilon}_B\big\|_2^2 + \sum_{k\notin J_s}\big\|(\tilde{X}_{E_k}^\top\tilde{X}_{E_k})^{-1}\tilde{X}_{E_k}^\top\tilde{\varepsilon}_{E_k}\big\|_2^2 + C\sum_{k\in J_s}s_k\cdot\max_{i=1,\ldots,p_k}\big\|(\tilde{X}_{E_k,[:,G_{ki}]})^\top\tilde{\varepsilon}_{E_k}/n\big\|_2^2\bigg).$ (31)

Proof.
See Appendix F.2.
Remark 3.
In the oracle error bound (31), $\|(\tilde{X}_B^\top\tilde{X}_B)^{-1}\tilde{X}_B^\top\tilde{\varepsilon}_B\|_2^2$, $\|(\tilde{X}_{E_k}^\top\tilde{X}_{E_k})^{-1}\tilde{X}_{E_k}^\top\tilde{\varepsilon}_{E_k}\|_2^2$, and $s_k\max_{i=1,\ldots,p_k}\|(\tilde{X}_{E_k,[:,G_{ki}]})^\top\tilde{\varepsilon}_{E_k}/n\|_2^2$ correspond to the estimation errors of $\hat{\mathcal{B}}$, of $\hat{E}_k$ for the nonsparse modes, and of $\hat{E}_k$ for the sparse modes, respectively. When the group restricted isometry property (Condition 1) is replaced by a group restricted eigenvalue condition (see, e.g., [79]), a similar result to Theorem 3 can be derived.

We further study low-rank tensor regression with Gaussian ensemble design, i.e., $\mathcal{X}_i$ has i.i.d. standard normal entries. This has been considered a benchmark setting in the low-rank tensor/matrix recovery literature [25, 29]. For convenience, we denote $\mathbf{p} = (p_1,p_2,p_3)$, $\mathbf{r} = (r_1,r_2,r_3)$, $p = \max\{p_1,p_2,p_3\}$, and $r = \max\{r_1,r_2,r_3\}$. We discuss the regular low-rank and the sparse low-rank tensor regression in the next two subsections, respectively. We have the following theoretical guarantee for ISLET under Gaussian ensemble design.
Theorem 4 (Upper bound for tensor regression via ISLET). Consider the tensor regression model (1), where $\mathcal{A}\in\mathbb{R}^{p_1\times p_2\times p_3}$ is Tucker rank-$(r_1,r_2,r_3)$, $\mathcal{X}_i$ has i.i.d. standard normal entries, and $\varepsilon_i\overset{i.i.d.}{\sim}N(0,\sigma^2)$. Denote $\tilde{\sigma}^2 = \|\mathcal{A}\|_{\rm HS}^2 + \sigma^2$, $\lambda = \min_k\lambda_k$, $\lambda_k = \sigma_{r_k}(\mathcal{M}_k(\mathcal{A}))$, $\kappa = \max_k\|\mathcal{M}_k(\mathcal{A})\|/\sigma_{r_k}(\mathcal{M}_k(\mathcal{A}))$, and $m = r_1r_2r_3 + \sum_{k=1}^3(p_k-r_k)r_k$. If $n_1\wedge n_2 \ge C\tilde{\sigma}^2(p^{3/2}+\kappa p r)/\lambda^2$, then the sample-splitting ISLET estimator (see the forthcoming Remark 5) satisfies

$\big\|\hat{\mathcal{A}}-\mathcal{A}\big\|_{\rm HS}^2 \le \frac{m}{n_2}\left(\sigma^2 + \frac{C\tilde{\sigma}^2 m p}{n_1\lambda^2}\right)\left(1 + C\sqrt{\frac{\log p}{m}} + C\sqrt{\frac{m\tilde{\sigma}^2}{(n_1\wedge n_2)\lambda^2}}\right)$

with probability at least $1-p^{-C}$.

Proof. See Section F.3 for details. Specifically, we first derive estimation error upper bounds for the sketching directions $\tilde{U}_k$ via the deterministic error bound of HOOI [138]. Then we apply concentration inequalities to obtain upper bounds for $\big\|(\tilde{X}^\top\tilde{X})^{-1}\tilde{X}^\top\tilde{\varepsilon}\big\|_2$ and $\|\hat{D}_k(\hat{B}_k\tilde{V}_k)^{-1}\|$ for $k = 1, 2, 3$.
Finally, the oracle inequality of Theorem 2 leads to the desired upper bound.
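As a quick numerical illustration of why regression on the $m$ importance-sketched covariates yields squared error of order $\sigma^2 m/n$ — the benchmark matched by the bounds in this section — the following numpy snippet compares the Monte Carlo risk of least squares with $m$ i.i.d. Gaussian covariates to $\sigma^2 m/(n-m-1)$. The sizes are arbitrary illustrative choices, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, sigma = 2000, 150, 2.0              # illustrative sizes
reps, err = 200, 0.0

for _ in range(reps):
    X = rng.standard_normal((n, m))       # Gaussian sketched covariates
    gamma = rng.standard_normal(m)        # low-dimensional parameter
    y = X @ gamma + sigma * rng.standard_normal(n)
    gamma_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    err += np.sum((gamma_hat - gamma) ** 2) / reps

# empirical risk vs. the exact Gaussian-design value sigma^2 m / (n - m - 1)
print(err, sigma**2 * m / (n - m - 1))
```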
Remark 4 (Sample Complexity). In Theorem 4, we show that as long as the sample size satisfies $n = \Omega(p^{3/2}r + pr^2)$, ISLET achieves consistent estimation under regularity conditions. This sample complexity outperforms many computationally feasible algorithms in the previous literature, e.g., $n = \Omega(p^2r\,\mathrm{polylog}(p))$ for projected gradient descent [29], sum-of-nuclear-norm minimization [117], and square norm minimization [91]. To the best of our knowledge, ISLET is the first computationally efficient algorithm that achieves this sample complexity.

On the other hand, [91] showed that direct nonconvex Tucker rank minimization, a computationally infeasible method, achieves exact recovery with $O(pr + r^3)$ linear measurements in the noiseless setting. [13] showed that if the tensor parameter $\mathcal{A}$ is CP rank-$r$, the linear system $y_j = \langle\mathcal{A},\mathcal{X}_j\rangle$, $j = 1,\ldots,n$, has a unique solution with probability one given $O(pr)$ measurements. It remains an open question whether the sample complexity $n = \Omega(p^{3/2}r + pr^2)$ is necessary for all computationally efficient procedures.

Remark 5 (Sample splitting). The direct analysis of the proposed ISLET in Algorithm 1 is technically involved; one major difficulty is the dependence between the sketching directions $\tilde{U}_k$ obtained in Step 1 and the regression noise $\tilde{\varepsilon}$ in Step 2. To overcome this difficulty, we choose to analyze a modified procedure with a sample splitting scheme: we randomly split all $n$ samples into two sets with cardinalities $n_1$ and $n_2$, respectively. We then use the first set of $n_1$ samples to construct the covariance tensor $\tilde{\mathcal{A}}$ (Step 1) and the second set of $n_2$ samples to evaluate the importance sketching covariates (Step 2). As illustrated by the numerical studies in Section 5, this scheme is mainly for technical purposes and is not necessary in practice. Simulations suggest that it is preferable to use all samples $\{y_i,\mathcal{X}_i\}_{i=1}^n$ both for constructing the initial estimate $\tilde{\mathcal{A}}$ and for performing linear regression on the sketching covariates.

We further consider the statistical limits of low-rank tensor regression with Gaussian ensemble design. Consider the following class of general low-rank tensors,

$\mathcal{A}_{\mathbf{p},\mathbf{r}} = \big\{\mathcal{A}\in\mathbb{R}^{p_1\times p_2\times p_3}: \text{Tucker rank}(\mathcal{A}) \le (r_1,r_2,r_3)\big\}.$ (32)

The following minimax lower bound holds for all low-rank tensors in $\mathcal{A}_{\mathbf{p},\mathbf{r}}$.

Theorem 5 (Minimax Lower Bound). If $n > m+1$, the following nonasymptotic lower bound on the estimation error holds,

$\inf_{\hat{\mathcal{A}}}\sup_{\mathcal{A}\in\mathcal{A}_{\mathbf{p},\mathbf{r}}}\mathbb{E}\big\|\hat{\mathcal{A}}-\mathcal{A}\big\|_{\rm HS}^2 \ge \frac{m}{n-m-1}\cdot\sigma^2.$ (33)

If $n \le m+1$,

$\inf_{\hat{\mathcal{A}}}\sup_{\mathcal{A}\in\mathcal{A}_{\mathbf{p},\mathbf{r}}}\mathbb{E}\big\|\hat{\mathcal{A}}-\mathcal{A}\big\|_{\rm HS}^2 = +\infty.$ (34)

Proof.
See Appendix F.4.

Combining Theorems 4 and 5, we can see that as long as the sample size satisfies $\frac{m\tilde{\sigma}^2}{n_1\lambda^2} = o(1)$, $\frac{m(p_1+p_2+p_3)\tilde{\sigma}^2}{n_1n_2\lambda^2} = o(\sigma^2)$, and $n_2 = (1+o(1))n$, the statistical loss of the proposed method is sharp with matching constant to the lower bound.

Remark 6 (Matrix ISLET vs. Previous Matrix Recovery Methods). If the order of the tensor reduces to two, tensor regression becomes the well-studied low-rank matrix recovery problem [25, 104]: $y_i = \langle X_i, A\rangle + \varepsilon_i$, $i = 1,\ldots,n$. Here, $A\in\mathbb{R}^{p_1\times p_2}$ is the unknown rank-$r$ target matrix, $\{X_i\}_{i=1}^n$ are design matrices, and $\varepsilon_i\sim N(0,\sigma^2)$ are noises. Low-rank matrix recovery, including instances such as phase retrieval [23], has been widely considered in the recent literature. Various methods, such as nuclear norm minimization [24, 104], projected gradient descent [115], singular value thresholding [15], and Procrustes flow [119], have been introduced, and both their theoretical and computational performance has been extensively studied. By a proof similar to that of Theorem 4, the following upper bound for the matrix ISLET estimator $\hat{A}$ (Algorithm 4 in the supplementary materials [137]),

$\big\|\hat{A}-A\big\|_{\rm F}^2 \le \frac{m}{n_2}\left(\sigma^2 + \frac{C\tilde{\sigma}^2 m p}{n_1\lambda^2}\right)\left(1 + C\sqrt{\frac{\log p}{m}} + C\sqrt{\frac{m\tilde{\sigma}^2}{(n_1\wedge n_2)\lambda^2}}\right),$

can be established with high probability. Here, $m = (p_1+p_2-r)r$, $\lambda = \sigma_r(A)$, and $\tilde{\sigma}^2 = \|A\|_{\rm F}^2 + \sigma^2$. A lower bound similar to Theorem 5 also holds.

4.2 Sparse Tensor Regression with Importance Sketching

We further consider simultaneously sparse and low-rank tensor regression with Gaussian ensemble design. We have the following theoretical guarantee for sparse ISLET. For the same reason as for regular ISLET (see Remark 5), the sample splitting scheme is introduced in our technical analysis.
Theorem 6 (Upper Bounds for Sparse Tensor Regression via ISLET). Consider the tensor regression model (1), where $\mathcal{A}$ is simultaneously low-rank and sparse (3), $\mathcal{X}_i$ has i.i.d. standard Gaussian entries, and $\varepsilon_i\overset{i.i.d.}{\sim}N(0,\sigma^2)$. Denote $\lambda = \min_k\sigma_{r_k}(\mathcal{M}_k(\mathcal{A}))$, $s_k = p_k$ if $k\notin J_s$, $m_s = r_1r_2r_3 + \sum_{k\in J_s}s_k(r_k+\log p_k) + \sum_{k\notin J_s}p_kr_k$, and $\kappa = \max_k\|\mathcal{M}_k(\mathcal{A})\|/\sigma_{r_k}(\mathcal{M}_k(\mathcal{A}))$. We apply the proposed Algorithm 2 with the sample splitting scheme (see Remark 5) and group Lasso penalty $\eta_k = C\tilde{\sigma}\sqrt{n(r_k+\log(p_k))}$. If $\log(p_1)\asymp\log(p_2)\asymp\log(p_3)\asymp\log(p)$,

$n_1 \ge \frac{C\kappa\tilde{\sigma}^2}{\lambda^2}\bigg(s_1s_2s_3\log(p) + \sum_{k=1}^3\big(s_kr_k + r_{k+1}r_{k+2}\big)\bigg), \qquad n_2 \ge \frac{Cm_s\kappa\tilde{\sigma}^2}{\lambda^2},$

then the output $\hat{\mathcal{A}}$ of sparse ISLET satisfies

$\big\|\hat{\mathcal{A}}-\mathcal{A}\big\|_{\rm HS}^2 \le \frac{Cm_s}{n_2}\left(\sigma^2 + \frac{Cm_s\kappa\tilde{\sigma}^2}{n_1}\right)$ (35)

with probability at least $1-p^{-C}$.

Proof. See Appendix F.5.

We further consider the following class of simultaneously sparse and low-rank tensors,

$\mathcal{A}_{\mathbf{p},\mathbf{r},\mathbf{s}} = \big\{\mathcal{A} = [\![\,\mathcal{S};U_1,U_2,U_3\,]\!]: U_k\in\mathbb{O}_{p_k,r_k},\ \|U_k\|_{2,0}\le s_k,\ k\in J_s\big\}.$ (36)

The following minimax lower bound on the estimation risk holds over this class.

Theorem 7 (Lower Bounds). There exists a constant
$C > 0$ such that whenever $m_s \ge C$, the following lower bound holds for any estimator $\hat{\mathcal{A}}$ based on $\{\mathcal{X}_i,y_i\}_{i=1}^n$,

$\inf_{\hat{\mathcal{A}}}\sup_{\mathcal{A}\in\mathcal{A}_{\mathbf{p},\mathbf{r},\mathbf{s}}}\mathbb{E}\big\|\hat{\mathcal{A}}-\mathcal{A}\big\|_{\rm HS}^2 \ge \frac{c\,m_s}{n}\sigma^2.$ (37)

Proof.
See Appendix F.6.

Combining Theorems 6 and 7, we can see that the proposed procedure achieves the optimal rate of convergence if $\frac{m_s\|\mathcal{A}\|_{\rm HS}^2}{n\sigma^2} = O(1)$ and $n_1\asymp n_2$.

5 Numerical Analysis
In this section, we conduct a simulation study to investigate the numerical performance of ISLET. In each study, we construct sensing tensors $\mathcal{X}_j\in\mathbb{R}^{p\times p\times p}$ with independent standard normal entries. In the nonsparse settings, we generate, via the Tucker decomposition, a core tensor $\mathcal{S}\in\mathbb{R}^{r\times r\times r}$ and loadings $E_k\in\mathbb{R}^{p\times r}$ with i.i.d. Gaussian entries, and set the coefficient tensor $\mathcal{A} = [\![\,\mathcal{S};E_1,E_2,E_3\,]\!]$; in the sparse settings, we construct $\mathcal{S}$ and $\mathcal{A}$ in the same way but generate $E_k$ as

$(E_k)_{[i,:]} = \begin{cases}(\bar{E}_k)_{[j,:]}, & i\in\Omega_k \text{ and } i \text{ is the } j\text{-th element of } \Omega_k;\\ 0, & i\notin\Omega_k,\end{cases}$

where $\Omega_k$ is a uniform random subset of $\{1,\ldots,p\}$ with cardinality $s_k$ and $\bar{E}_k$ has $s_k$-by-$r$ i.i.d. Gaussian entries. Finally, we let the response be $y_j = \langle\mathcal{X}_j,\mathcal{A}\rangle + \varepsilon_j$, $j = 1,2,\ldots,n$, where $\varepsilon_j\overset{i.i.d.}{\sim}N(0,\sigma^2)$. We report both the average root mean-squared error (RMSE) $\|\hat{\mathcal{A}}-\mathcal{A}\|_{\rm HS}/\|\mathcal{A}\|_{\rm HS}$ and the run time for each setting. Unless otherwise noted, the reported results are averages over 100 repetitions, computed on a machine with an Intel Xeon E5-2680 2.50GHz CPU. Additional simulation results on tuning-free ISLET and approximate low-rank tensor regression are collected in Sections D and E of the supplementary materials [137].

Since we propose to evaluate the sketching directions and the dimension-reduced regression (Steps 1 and 2 of Algorithm 1) both using the complete sample, but introduced a sample splitting scheme (Remark 5) to prove Theorems 4 and 6, we first investigate how the sample splitting scheme affects the numerical performance of ISLET. Let $n$ vary from 1000 to 4000, $p = 10$, $r = 3, 5$, $\sigma = 5$. In addition to the original ISLET without splitting, we also implement sample-splitting ISLET, where a random subset of $n_1\approx\{0.3n,\,0.4n,\,0.5n\}$ samples is allocated to importance direction estimation (Step 1 of ISLET) and the remaining $n-n_1$ samples to dimension-reduced regression (Step 2 of ISLET). The results plotted in Figure 5 clearly show that the no-splitting scheme yields much smaller estimation error than all sample-splitting variants. Although the sample splitting scheme brings advantages for our theoretical analysis of ISLET, it is not necessary in practice. Therefore, we only perform ISLET without sample splitting in the rest of the simulation studies.
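For concreteness, the nonsparse data-generating process described at the beginning of this section can be sketched in a few lines of numpy. This is an illustrative snippet with our own variable names and small sizes, not code from the paper.

```python
import numpy as np

p, r, n, sigma = 10, 3, 1000, 5.0
rng = np.random.default_rng(2024)

# Core tensor S and loadings E_k with i.i.d. Gaussian entries
S = rng.standard_normal((r, r, r))
E = [rng.standard_normal((p, r)) for _ in range(3)]

# A = [[ S; E1, E2, E3 ]] assembled via mode products
A = np.einsum('abc,ia,jb,kc->ijk', S, E[0], E[1], E[2])

# Gaussian ensemble design and responses y_j = <X_j, A> + eps_j
X = rng.standard_normal((n, p, p, p))
y = np.tensordot(X, A, axes=3) + sigma * rng.standard_normal(n)

print(X.shape, y.shape)   # (n, p, p, p), (n,)
```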
Figure 5: No-splitting vs. splitting ISLET. $n$ varies from 1000 to 4000, $p = 10$, $r = 3, 5$, $\sigma = 5$; Panel (a): $r = 3$, Panel (b): $r = 5$.

We also compare the performance of nonsparse ISLET with a number of contemporary methods, including nonconvex projected gradient descent (nonconvex PGD) [29], Tucker low-rank regression via alternating gradient descent (Tucker regression) [77, 143] (software package downloaded from https://hua-zhou.github.io/TensorReg/), and convex-regularization low-rank tensor recovery (convex regularization) [78, 103, 117], which minimizes $\frac{1}{2n}\sum_{i=1}^n(y_i-\langle\mathcal{X}_i,\mathcal{A}\rangle)^2 + \lambda\sum_{k=1}^3\|\mathcal{M}_k(\mathcal{A})\|_*$, where $\|\cdot\|_*$ is the matrix nuclear norm. We implement all four methods for $p = 10$, but only ISLET and nonconvex PGD for $p = 50$, as the time cost of Tucker regression and convex regularization is beyond our computational limit when $p = 50$. Results for $p = 10$ and $p = 50$ are plotted in Panels (a)(b) and Panels (c)(d) of Figure 6, respectively. Panels (a) and (c) of Figure 6 show that the RMSEs of ISLET, Tucker regression, and nonconvex PGD are close, and all of them are slightly better than the convex regularization method; Panels (b) and (d) further indicate that ISLET is much faster than the other methods, and the advantage grows significantly as $n$ and $p$ increase. In particular, ISLET is about 10 times faster than nonconvex PGD when $p = 50$, $n = 12000$. In summary, the proposed ISLET achieves similar statistical performance within a significantly shorter time compared with the other state-of-the-art methods.

Figure 6: ISLET vs. nonconvex PGD, Tucker regression, and convex regularization. Here, $\sigma = 5$; Panels (a)(b): RMSE and run time for $p = 10$; Panels (c)(d): RMSE and run time for $p = 50$.

Next, we investigate the performance of ISLET when $p$ and $n$ grow substantially. Let $p = 100, 150, 200$, $r = 3, 5$, and let $n$ grow from 8000. The average RMSE and run time are plotted in Figure 7; the estimation error decreases as $n$ grows, as the dimension $p$ decreases, or as the Tucker rank $r$ decreases.

Figure 7: Performance of ISLET when $p$ and $n$ significantly grow. Panels (a)(b): RMSE and run time (unit: hours) for $r = 3$; Panels (c)(d): RMSE and run time (unit: hours) for $r = 5$; $p = 100, 150, 200$.

We further fix $r = 2$, $n = 30000$ and let $p$ grow to 400. Now the space cost of storing $\{\mathcal{X}_i\}_{i=1}^n$ reaches $400\times400\times400\times30000$ entries, about 7.68 terabytes, which is far beyond the storage volume of most personal computing devices.
Since each sample is used only twice in ISLET, we perform this experiment in a parallel way. To be specific, on each machine $b = 1,\ldots,40$, we generate the pseudorandom covariates $\mathcal{X}^b_i$ (recording the random seeds), evaluate $y^b_i$ and $\tilde{\mathcal{A}}^b$ by the procedure in Section 2.2, and clean up the memory of $\mathcal{X}^b_i$.
40, wegenerate pseudorandom covariates X bi again using the stored random seeds, evaluate (cid:101) G b and (cid:101) X bi by (11)-(12), and clean up the memory of X bi again. The rest of the procedurefollows from Section 2.2 and the original ISLET in Algorithm 1. The average RMSE andrun time for five repeats are shown in Figure 8. We clearly see that ISLET yields goodstatistical performance within a reasonable amount of time, while the other contemporarymethods can hardly do so in such an ultrahigh-dimensional setting.In addition, we explore the numerical performance of ISLET for simultaneously sparseand low-rank tensor regression. To perform sparse ISLET (Algorithm 2), we apply the29 l l l . . . . . . . r = 3 Sample Size R M SE l p=100p=150p=200 (a) RMSE l l l l r = 3 Sample Size R unn i ng T i m e ( h ) l p=100p=150p=200 (b) Run Time (Unit: hours) l l l l . . . . r = 5 Sample Size R M SE l p=100p=150p=200 (c) RMSE l l l l r = 5 Sample Size R unn i ng T i m e ( h ) l p=100p=150p=200 (d) Run Time (Unit: hours) Figure 7: Performance of ISLET when p and n significantly grow. l l l l l l l
Figure 8: Performance of ISLET in the ultrahigh-dimensional setting: $p$ grows up to 400, $n = 30000$; Panels (a)(b): RMSE and run time (seconds).

In addition, we explore the numerical performance of ISLET for simultaneously sparse and low-rank tensor regression. To perform sparse ISLET (Algorithm 2), we apply the gglasso package [131] (available online at https://cran.r-project.org/web/packages/gglasso/index.html) for group Lasso estimation and penalty level selection. Let $n$ vary from 1500 to 4000, $p = 20, 25, 30$, $r = 3, 5$, $\sigma = 5$, and $s_1 = s_2 = s_3 = 8$. The results are shown in Figure 9. Similarly to nonsparse ISLET, the average estimation error decreases as the sample size $n$ increases or the Tucker rank $r$ decreases.

Figure 9: RMSE of ISLET for sparse and low-rank tensor recovery. Panel (a): $r = 3$; Panel (b): $r = 5$; $p = 20, 25, 30$.

We also compare sparse ISLET with the slice-sparse nonconvex PGD proposed by [29]. Let $n$ grow from 5000, $p = 50$, $r = 3, 5$, $\sigma = 5$, and $s_1 = s_2 = s_3 = 15$. From Figure 10, we can see that ISLET yields much smaller estimation error within a significantly shorter time than nonconvex PGD, and the difference between the two algorithms becomes more significant as $n$ grows.
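The group Lasso subproblems in Step 2 of sparse ISLET are solved above with the gglasso package. As a self-contained alternative, here is a minimal proximal-gradient (block soft-thresholding) sketch in numpy for $\min_{\gamma}\frac{1}{2n}\|y-X\gamma\|_2^2+\eta\sum_i\|\gamma_{G_i}\|_2$ with groups structured as in (29). It is a generic illustrative solver with simplistic step-size and stopping rules, not the implementation used in our experiments.

```python
import numpy as np

def group_lasso(X, y, groups, eta, n_iter=500):
    """Proximal gradient for (1/(2n))||y - X g||^2 + eta * sum_i ||g_{G_i}||_2."""
    n, m = X.shape
    g = np.zeros(m)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)       # 1/L with L = ||X||_2^2 / n
    for _ in range(n_iter):
        grad = X.T @ (X @ g - y) / n
        z = g - step * grad
        for G in groups:                                # block soft-thresholding
            nrm = np.linalg.norm(z[G])
            z[G] = 0.0 if nrm == 0 else max(0.0, 1 - step * eta / nrm) * z[G]
        g = z
    return g

# toy usage: p_k = 30 rows, r_k = 2 columns, i.e., m = 60 coefficients in 30 groups
rng = np.random.default_rng(3)
p_k, r_k, n = 30, 2, 200
groups = [np.arange(i, p_k * r_k, p_k) for i in range(p_k)]
gamma = np.zeros(p_k * r_k); gamma[groups[0]] = 3.0; gamma[groups[5]] = -2.0
X = rng.standard_normal((n, p_k * r_k))
y = X @ gamma + 0.5 * rng.standard_normal(n)
print(np.round(group_lasso(X, y, groups, eta=0.1), 2)[:10])
```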
Figure 10: ISLET vs. nonconvex PGD for sparse tensor regression. Panels (a)(b): RMSE and run time; $p = 50$, $r = 3, 5$.

Finally, if the tensor is of order 2, tensor regression becomes the classic low-rank matrix recovery problem [25, 104]. Among existing approaches to low-rank matrix recovery, nuclear norm minimization (NNM), which minimizes $\sum_{i=1}^n(y_i-\langle X_i,A\rangle)^2 + \lambda\|A\|_*$ with $\|A\|_* = \sum_i\sigma_i(A)$ the matrix nuclear norm, has been proposed and extensively studied in the recent literature. The optimization of NNM is implemented by the accelerated proximal gradient method [115] using the software package available online at https://blog.nus.edu.sg/mattohkc/softwares/nnls/. We consider two specific settings: (1) $p_1 = p_2 = 50$, $r = 2$, $\sigma = 10$, $n$ growing from 2000; (2) $p_1 = p_2 = 100$, $r = 4$, $\sigma = 10$, $n$ growing from 2000. The average RMSE and run time are plotted in Figure 11.

Figure 11: ISLET vs. nuclear norm minimization for low-rank matrix recovery. Panels (a)(b): RMSE and run time for $p = 50$; Panels (c)(d): RMSE and run time for $p = 100$.
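For reference, the NNM baseline can also be solved by a plain proximal-gradient (ISTA) loop that soft-thresholds singular values after each gradient step. The sketch below is a bare-bones illustration on toy data, not the accelerated solver of [115] used in the experiments, and the penalty level and sizes are arbitrary.

```python
import numpy as np

def nnm(X_list, y, p1, p2, lam, n_iter=300):
    """ISTA for min_A sum_i (y_i - <X_i, A>)^2 + lam * ||A||_*."""
    Xmat = np.stack([Xi.ravel() for Xi in X_list])          # n x (p1*p2)
    step = 1.0 / (2 * np.linalg.norm(Xmat, 2) ** 2)         # 1/L for the squared loss
    A = np.zeros((p1, p2))
    for _ in range(n_iter):
        resid = Xmat @ A.ravel() - y
        A = A - step * 2 * (Xmat.T @ resid).reshape(p1, p2)  # gradient step
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        A = (U * np.maximum(s - step * lam, 0.0)) @ Vt       # soft-threshold singular values
    return A

# toy usage
rng = np.random.default_rng(7)
p1 = p2 = 20; r = 2; n = 400
A_true = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))
X_list = [rng.standard_normal((p1, p2)) for _ in range(n)]
y = np.array([np.sum(Xi * A_true) for Xi in X_list]) + 0.5 * rng.standard_normal(n)
A_hat = nnm(X_list, y, p1, p2, lam=20.0)
print(np.linalg.norm(A_hat - A_true) / np.linalg.norm(A_true))
```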
6 Discussion

In this article, we develop a general importance sketching algorithm for high-dimensional low-rank tensor regression. In particular, to sufficiently reduce the dimension of the higher-order structure, we propose a fast algorithm named Importance Sketching Low-rank Estimation for Tensors (ISLET). The proposed algorithm consists of three major steps: we first apply tensor decomposition approaches, such as HOOI and STAT-SVD, to obtain importance sketching directions; we then perform regression using the sketched tensors/matrices (in the sparse case, with group-sparsity regularizers); finally, we assemble the final estimator.

We establish deterministic oracle inequalities for the proposed procedure under general design and noise distributions. We also prove that ISLET achieves the optimal mean-squared error rate under Gaussian ensemble design, and that regular ISLET further achieves the optimal constant for the mean-squared error. As illustrated in the simulation studies, the proposed procedure is computationally efficient compared with contemporary methods. Although the presentation here focuses on order-3 tensors, the method and theory for general order-$d$ tensors can be developed similarly.

It is also noteworthy that the storage cost of the Tucker decomposition in the proposed procedure grows exponentially with the order $d$. Thus, if the target tensor has a large order, it is more desirable to consider low-rank approximation methods other than Tucker, such as the CP decomposition [12, 13], the Hierarchical Tucker (HT) decomposition [7, 50, 54], and the Tensor Train (TT) decomposition [93, 96]. The ISLET framework can be adapted to these structures as long as two key components are available: a sketching approach for dimension reduction and a computational inversion step that embeds the low-dimensional estimate back into the high-dimensional space (see also Section 2.4). Whether these components hold for the aforementioned decompositions remains an interesting open question.

In addition to low-rank tensor regression, the idea of ISLET can be applied to various other high-dimensional problems. First, high-order interaction pursuit is an important topic in high-dimensional statistics that targets interactions among three or more variables in the regression setting. This problem can be transformed into tensor estimation based on a number of rank-1 projections by the argument in [55]. Similarly to the analysis of tensor regression in this paper, the idea of ISLET can be used to develop an optimal and efficient procedure for high-order interaction pursuit with provable advantages over baseline methods.

Moreover, matrix/tensor completion has attracted significant attention in the recent literature [27, 78, 127, 128, 134]. The central task of matrix/tensor completion is to recover a low-rank matrix/tensor from a limited number of observed entries. Since each observed entry in matrix/tensor completion can be seen as a special rank-one projection of the original matrix/tensor, the idea behind ISLET can be used to obtain a more efficient algorithm for matrix/tensor completion with theoretical guarantees. It will be an interesting future topic to further investigate the performance of ISLET on other high-dimensional problems.
The authors would like to thank the editors and anonymous referees for the helpful sugges-tions that helped to improve the presentation of this paper.
References [1] Genevera I Allen. Regularized tensor factorizations and higher-order principal com-ponents analysis. arXiv preprint arXiv:1202.2476 , 2012.[2] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Tel-garsky. Tensor decompositions for learning latent variable models.
The Journal ofMachine Learning Research , 15(1):2773–2832, 2014.[3] Haim Avron, Kenneth L Clarkson, and David P Woodruff. Sharper boundsfor regression and low-rank approximation with regularization. arXiv preprintarXiv:1611.03225 , 6, 2016.[4] Haim Avron, Huy Nguyen, and David Woodruff. Subspace embeddings for the polyno-mial kernel. In
Advances in Neural Information Processing Systems , pages 2258–2266,2014.[5] Krishnakumar Balasubramanian, Jianqing Fan, and Zhuoran Yang. Tensor meth-ods for additive index models under discordance and heterogeneity. arXiv preprintarXiv:1807.06693 , 2018.[6] Nicolai Baldin and Quentin Berthet. Optimal link prediction with matrix logisticregression. arXiv preprint arXiv:1803.07054 , 2018.[7] Jonas Ballani and Lars Grasedyck. A projection method to solve linear systems intensor format.
Numerical linear algebra with applications , 20(1):27–43, 2013.[8] Frank Ban, Vijay Bhattiprolu, Karl Bringmann, Pavel Kolev, Euiwoong Lee, andDavid P Woodruff. A ptas for (cid:96) p -low rank approximation. In Proceedings of theThirtieth Annual ACM-SIAM Symposium on Discrete Algorithms , pages 747–766.SIAM, 2019. 349] Mario Bebendorf. Adaptive cross approximation of multivariate functions.
Construc-tive approximation , 34(2):149–179, 2011.[10] Gregory Beylkin and Martin J Mohlenkamp. Algorithms for numerical analysis inhigh dimensions.
SIAM Journal on Scientific Computing , 26(6):2133–2159, 2005.[11] Xuan Bi, Annie Qu, and Xiaotong Shen. Multilayer tensor factorization with appli-cations to recommender systems.
The Annals of Statistics , 46(6B):3308–3333, 2018.[12] M Bouss´e, I Domanov, and L De Lathauwer. Linear systems with a multilinearsingular value decomposition constrained solution.
ESAT-STADIUS, KU Leuven,Belgium, Tech. Rep , 2017.[13] Martijn Bouss´e, Nico Vervliet, Ignat Domanov, Otto Debals, and Lieven De Lath-auwer. Linear systems with a canonical polyadic decomposition constrained solu-tion: Algorithms and applications.
Numerical Linear Algebra with Applications ,25(6):e2190, 2018.[14] Christos Boutsidis and David P Woodruff. Optimal cur matrix decompositions.
SIAMJournal on Computing , 46(2):543–589, 2017.[15] Jian-Feng Cai, Emmanuel J Cand`es, and Zuowei Shen. A singular value thresholdingalgorithm for matrix completion.
SIAM Journal on Optimization , 20(4):1956–1982,2010.[16] T Tony Cai, Xiaodong Li, and Zongming Ma. Optimal rates of convergence fornoisy sparse phase retrieval via thresholded wirtinger flow.
The Annals of Statistics ,44(5):2221–2251, 2016.[17] T Tony Cai and Anru Zhang. Sparse representation of a polytope and recoveryof sparse signals and low-rank matrices.
IEEE transactions on information theory ,60(1):122–132, 2014.[18] T Tony Cai and Anru Zhang. ROP: Matrix recovery via rank-one projections.
TheAnnals of Statistics , 43(1):102–138, 2015.[19] T Tony Cai and Anru Zhang. Rate-optimal perturbation bounds for singular sub-spaces with applications to high-dimensional statistics.
The Annals of Statistics ,46(1):60–89, 2018.[20] Tianxi Cai, T. Tony Cai, and Anru Zhang. Structured matrix completion with appli-cations to genomic data integration.
Journal of the American Statistical Association ,111(514):621–633, 2016. 3521] Cesar F Caiafa and Andrzej Cichocki. Generalizing the column–row matrix decom-position to multi-way arrays.
Linear Algebra and its Applications , 433(3):557–573,2010.[22] Raffaello Camoriano, Tom´as Angles, Alessandro Rudi, and Lorenzo Rosasco. Nytro:When subsampling meets early stopping. In
Artificial Intelligence and Statistics ,pages 1403–1411, 2016.[23] Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval viawirtinger flow: Theory and algorithms.
IEEE Transactions on Information Theory ,61(4):1985–2007, 2015.[24] Emmanuel J Candes and Yaniv Plan. Matrix completion with noise.
Proceedings ofthe IEEE , 98(6):925–936, 2010.[25] Emmanuel J Candes and Yaniv Plan. Tight oracle inequalities for low-rank matrixrecovery from a minimal number of noisy random measurements.
IEEE Transactionson Information Theory , 57(4):2342–2359, 2011.[26] Emmanuel J Candes and Terence Tao. Decoding by linear programming.
IEEEtransactions on information theory , 51(12):4203–4215, 2005.[27] Emmanuel J Cand`es and Terence Tao. The power of convex relaxation: Near-optimalmatrix completion.
IEEE Transactions on Information Theory , 56(5):2053–2080,2010.[28] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items indata streams. In
International Colloquium on Automata, Languages, and Program-ming , pages 693–703. Springer, 2002.[29] Han Chen, Garvesh Raskutti, and Ming Yuan. Non-convex projected gradient descentfor generalized low-rank tensor regression. arXiv preprint arXiv:1611.10349 , 2016.[30] Yuxin Chen, Yuejie Chi, and Andrea J Goldsmith. Exact and stable covarianceestimation from quadratic sampling via convex programming.
Information Theory,IEEE Transactions on , 61(7):4034–4059, 2015.[31] Flavio Chierichetti, Sreenivas Gollapudi, Ravi Kumar, Silvio Lattanzi, Rina Pani-grahy, and David P Woodruff. Algorithms for (cid:96) p low-rank approximation. In Pro-ceedings of the 34th International Conference on Machine Learning-Volume 70 , pages806–814. JMLR. org, 2017. 3632] Andrzej Cichocki, Danilo Mandic, Lieven De Lathauwer, Guoxu Zhou, Qibin Zhao,Cesar Caiafa, and Huy Anh Phan. Tensor decompositions for signal processing ap-plications: From two-way to multiway component analysis.
IEEE Signal ProcessingMagazine , 32(2):145–163, 2015.[33] Kenneth L Clarkson and David P Woodruff. Input sparsity and hardness for robustsubspace approximation. In , pages 310–329. IEEE, 2015.[34] Kenneth L Clarkson and David P Woodruff. Low-rank approximation and regressionin input sparsity time.
Journal of the ACM (JACM) , 63(6):54, 2017.[35] Gautam Dasarathy, Parikshit Shah, Badri Narayan Bhaskar, and Robert D Nowak.Sketching sparse matrices, covariances, and graphs via tensor products.
IEEE Trans-actions on Information Theory , 61(3):1373–1388, 2015.[36] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. On the best rank-1 andrank-(r 1, r 2,..., rn) approximation of higher-order tensors.
SIAM Journal on MatrixAnalysis and Applications , 21(4):1324–1342, 2000.[37] Huaian Diao, Zhao Song, Wen Sun, and David Woodruff. Sketching for kroneckerproduct regression and p-splines. In
International Conference on Artificial Intelligenceand Statistics , pages 1299–1308, 2018.[38] Edgar Dobriban and Sifan Liu. A new theory for sketching in linear regression. arXivpreprint arXiv:1810.06089 , 2018.[39] Petros Drineas, Malik Magdon-Ismail, Michael W Mahoney, and David P Woodruff.Fast approximation of matrix coherence and statistical leverage.
Journal of MachineLearning Research , 13(Dec):3475–3506, 2012.[40] Petros Drineas and Michael W Mahoney. Effective resistances, statistical leverage,and applications to linear equation solving. arXiv preprint arXiv:1005.3097 , 2010.[41] Lars Eld´en and Berkant Savas. A newton–grassmann method for computing the bestmultilinear rank-(r 1, r 2, r 3) approximation of a tensor.
SIAM Journal on MatrixAnalysis and applications , 31(2):248–271, 2009.[42] Mike Espig, Wolfgang Hackbusch, Thorsten Rohwedder, and Reinhold Schneider.Variational calculus with sums of elementary tensors of fixed rank.
Numerische Math-ematik , 122(3):469–488, 2012. 3743] Jianqing Fan, Wenyan Gong, and Ziwei Zhu. Generalized high-dimensional traceregression via nuclear norm regularization. arXiv preprint arXiv:1710.08083 , 2017.[44] Jianqing Fan and Jinchi Lv. Sure independence screening for ultrahigh dimensionalfeature space.
Journal of the Royal Statistical Society: Series B (Statistical Method-ology) , 70(5):849–911, 2008.[45] Jianqing Fan, Weichen Wang, and Ziwei Zhu. A shrinkage principle for heavy-tailed data: High-dimensional robust low-rank matrix recovery. arXiv preprintarXiv:1603.08315 , 2016.[46] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. A note on the group lassoand a sparse group lasso. arXiv preprint arXiv:1001.0736 , 2010.[47] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle pointsonlinestochastic gradient for tensor decomposition. In
Conference on Learning Theory ,pages 797–842, 2015.[48] Irina Georgieva and Clemens Hofreither. Greedy low-rank approximation in tuckerformat of solutions of tensor linear systems.
Journal of Computational and AppliedMathematics , 358:206–220, 2019.[49] SA Goreinov, Ivan V Oseledets, and Dmitry V Savostyanov. Wedderburn rank re-duction and krylov subspace method for tensor approximation. part 1: Tucker case.
SIAM Journal on Scientific Computing , 34(1):A1–A27, 2012.[50] Lars Grasedyck. Hierarchical singular value decomposition of tensors.
SIAM Journalon Matrix Analysis and Applications , 31(4):2029–2054, 2010.[51] Lars Grasedyck, Daniel Kressner, and Christine Tobler. A literature survey of low-rank tensor approximation techniques.
GAMM-Mitteilungen , 36(1):53–78, 2013.[52] Rajarshi Guhaniyogi, Shaan Qamar, and David B Dunson. Bayesian tensor regression. arXiv preprint arXiv:1509.06490 , 2015.[53] Weiwei Guo, Irene Kotsia, and Ioannis Patras. Tensor learning for regression.
IEEETransactions on Image Processing , 21(2):816–827, 2012.[54] Wolfgang Hackbusch and Stefan K¨uhn. A new scheme for the tensor representation.
Journal of Fourier analysis and applications , 15(5):706–722, 2009.[55] Botao Hao, Anru Zhang, and Guang Cheng. Sparse and low-rank tensor estimationvia cubic sketchings. arXiv preprint arXiv:1801.09326 , 2018.3856] Jarvis Haupt, Xingguo Li, and David P Woodruff. Near optimal sketching of low-ranktensor regression. arXiv preprint arXiv:1709.07093 , 2017.[57] Shiyuan He, Jianxin Yin, Hongzhe Li, and Xing Wang. Graphical model selectionand estimation for high dimensional tensor data.
Journal of Multivariate Analysis ,128:165–185, 2014.[58] Peter D Hoff. Multilinear tensor regression for longitudinal relational data.
TheAnnals of Applied Statistics , 9(3):1169, 2015.[59] Clemens Hofreither. A black-box low-rank approximation algorithm for fast matrixassembly in isogeometric analysis.
Computer Methods in Applied Mechanics andEngineering , 333:311–330, 2018.[60] Thomas JR Hughes, John A Cottrell, and Yuri Bazilevs. Isogeometric analysis: Cad,finite elements, nurbs, exact geometry and mesh refinement.
Computer methods inapplied mechanics and engineering , 194(39-41):4135–4195, 2005.[61] Mariya Ishteva, P-A Absil, Sabine Van Huffel, and Lieven De Lathauwer. Best lowmultilinear rank approximation of higher-order tensors, based on the riemanniantrust-region scheme.
SIAM Journal on Matrix Analysis and Applications , 32(1):115–135, 2011.[62] Mariya Ishteva, Lieven De Lathauwer, P-A Absil, and Sabine Van Huffel. Differential-geometric newton method for the best rank-(r 1, r 2, r 3) approximation of tensors.
Numerical Algorithms , 51(2):179–194, 2009.[63] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Score function fea-tures for discriminative learning: Matrix and tensor framework. arXiv preprintarXiv:1412.2863 , 2014.[64] Daniel M Kane and Jelani Nelson. Sparser johnson-lindenstrauss transforms.
Journalof the ACM (JACM) , 61(1):4, 2014.[65] Tamara G Kolda and Brett W Bader. Tensor decompositions and applications.
SIAMreview , 51(3):455–500, 2009.[66] Tamara Gibson Kolda.
Multilinear operators for higher-order decompositions , vol-ume 2. United States. Department of Energy, 2006.[67] Vladimir Koltchinskii. A remark on low rank matrix recovery and noncommuta-tive bernstein type inequalities. In
From Probability to Statistics and Back: High-Dimensional Models and Processes–A Festschrift in Honor of Jon A. Wellner , pages213–226. Institute of Mathematical Statistics, 2013.3968] Vladimir Koltchinskii, Karim Lounici, and Alexandre B Tsybakov. Nuclear-normpenalization and optimal rates for noisy low-rank matrix completion.
The Annals ofStatistics , 39(5):2302–2329, 2011.[69] Daniel Kressner, Michael Steinlechner, and Bart Vandereycken. Preconditioned low-rank riemannian optimization for linear systems with tensor product structure.
SIAMJournal on Scientific Computing , 38(4):A2018–A2044, 2016.[70] Daniel Kressner and Christine Tobler. Krylov subspace methods for linear systemswith tensor product structure.
SIAM journal on matrix analysis and applications ,31(4):1688–1714, 2010.[71] Pieter M Kroonenberg.
Applied multiway data analysis , volume 702. John Wiley &Sons, 2008.[72] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functionalby model selection.
Annals of Statistics , pages 1302–1338, 2000.[73] Jason D Lee, Ben Recht, Nathan Srebro, Joel Tropp, and Ruslan R Salakhutdinov.Practical large-scale optimization for max-norm regularization. In
Advances in NeuralInformation Processing Systems , pages 1297–1305, 2010.[74] Erich L Lehmann and George Casella.
Theory of point estimation . Springer Science& Business Media, 2006.[75] Lexin Li and Xin Zhang. Parsimonious tensor response regression.
Journal of theAmerican Statistical Association , pages 1–16, 2017.[76] Nan Li and Baoxin Li. Tensor completion for on-board compression of hyperspectralimages. In , pages 517–520.IEEE, 2010.[77] Xiaoshan Li, Da Xu, Hua Zhou, and Lexin Li. Tucker tensor regression and neu-roimaging analysis.
Statistics in Biosciences , pages 1–26, 2018.[78] Ji Liu, Przemyslaw Musialski, Peter Wonka, and Jieping Ye. Tensor completion forestimating missing values in visual data.
IEEE Transactions on Pattern Analysis andMachine Intelligence , 35(1):208–220, 2013.[79] Karim Lounici, Massimiliano Pontil, Sara Van De Geer, and Alexandre B Tsybakov.Oracle inequalities and optimal inference under group sparsity.
The Annals of Statis-tics , 39(4):2164–2204, 2011. 4080] RE Lynch, JOHN R Rice, and DONALD H Thomas. Tensor product analysis of par-tial difference equations.
Bulletin of the American Mathematical Society , 70(3):378–384, 1964.[81] Xiang Lyu, Will Wei Sun, Zhaoran Wang, Han Liu, Jian Yang, and Guang Cheng.Tensor graphical model: Non-convex optimization and statistical inference.
IEEEtransactions on pattern analysis and machine intelligence , 2019.[82] Michael W Mahoney. Randomized algorithms for matrices and data.
Foundationsand Trends R (cid:13) in Machine Learning , 3(2):123–224, 2011.[83] Michael W Mahoney, Mauro Maggioni, and Petros Drineas. Tensor-cur decompo-sitions for tensor-based data. SIAM Journal on Matrix Analysis and Applications ,30(3):957–987, 2008.[84] Ameur M Manceur and Pierre Dutilleul. Maximum likelihood estimation for thetensor normal distribution: Algorithm, minimum sample size, and empirical bias anddispersion.
Journal of Computational and Applied Mathematics , 239:37–49, 2013.[85] Panos P Markopoulos, George N Karystinos, and Dimitris A Pados. Optimal algo-rithms for l { } -subspace signal processing. IEEE Transactions on Signal Processing ,62(19):5046–5058, 2014.[86] Panos P Markopoulos, Sandipan Kundu, Shubham Chamadia, and Dimitris A Pados.Efficient l1-norm principal-component analysis via bit flipping.
IEEE Transactionson Signal Processing , 65(16):4252–4264, 2017.[87] Pascal Massart.
Concentration inequalities and model selection . Springer, 2007.[88] Deyu Meng, Zongben Xu, Lei Zhang, and Ji Zhao. A cyclic weighted median methodfor l1 low-rank matrix factorization with missing entries. In
Twenty-Seventh AAAIConference on Artificial Intelligence , 2013.[89] Xiangrui Meng and Michael W Mahoney. Low-distortion subspace embeddings ininput-sparsity time and applications to robust linear regression. In
Proceedings ofthe forty-fifth annual ACM symposium on Theory of computing , pages 91–100. ACM,2013.[90] Andrea Montanari and Nike Sun. Spectral algorithms for tensor completion. arXivpreprint arXiv:1612.07866 , 2016.[91] Cun Mu, Bo Huang, John Wright, and Donald Goldfarb. Square deal: Lower boundsand improved relaxations for tensor recovery. In
ICML , pages 73–81, 2014.4192] Jelani Nelson and Huy L Nguyˆen. Osnap: Faster numerical linear algebra algorithmsvia sparser subspace embeddings. In , pages 117–126. IEEE, 2013.[93] Ivan V Oseledets. Tensor-train decomposition.
SIAM Journal on Scientific Comput-ing , 33(5):2295–2317, 2011.[94] Ivan V Oseledets, DV Savostianov, and Eugene E Tyrtyshnikov. Tucker dimension-ality reduction of three-dimensional arrays in linear time.
SIAM Journal on MatrixAnalysis and Applications , 30(3):939–956, 2008.[95] Ivan V Oseledets, Dmitry V Savostyanov, and Eugene E Tyrtyshnikov. Cross ap-proximation in tensor electron density computations.
Numerical Linear Algebra withApplications , 17(6):935–952, 2010.[96] Ivan V Oseledets and Eugene E Tyrtyshnikov. Breaking the curse of dimensionality,or how to use svd in many dimensions.
SIAM Journal on Scientific Computing ,31(5):3744–3759, 2009.[97] Rasmus Pagh. Compressed matrix multiplication.
ACM Transactions on Computa-tion Theory (TOCT) , 5(3):9, 2013.[98] Yuqing Pan, Qing Mai, and Xin Zhang. Covariate-adjusted tensor classification inhigh dimensions.
Journal of the American Statistical Association , pages 1–15, 2018.[99] Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicitfeature maps. In
Proceedings of the 19th ACM SIGKDD international conference onKnowledge discovery and data mining , pages 239–247. ACM, 2013.[100] Mert Pilanci and Martin J Wainwright. Randomized sketches of convex programswith sharp guarantees.
IEEE Transactions on Information Theory , 61(9):5096–5115,2015.[101] Mert Pilanci and Martin J Wainwright. Iterative hessian sketch: Fast and accu-rate solution approximation for constrained least-squares.
The Journal of MachineLearning Research , 17(1):1842–1879, 2016.[102] Garvesh Raskutti and Michael Mahoney. A statistical perspective on randomizedsketching for ordinary least-squares. arXiv preprint arXiv:1406.5986 , 2014.[103] Garvesh Raskutti, Ming Yuan, and Han Chen. Convex regularization for high-dimensional multi-response tensor regression. arXiv preprint arXiv:1512.01215 , 2015.42104] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-ranksolutions of linear matrix equations via nuclear norm minimization.
SIAM review ,52(3):471–501, 2010.[105] Berkant Savas and Lars Eld´en. Krylov-type methods for tensor computations i.
LinearAlgebra and its Applications , 438(2):891–918, 2013.[106] Berkant Savas and Lek-Heng Lim. Quasi-newton methods on grassmannians andmultilinear approximations of tensors.
SIAM Journal on Scientific Computing ,32(6):3352–3393, 2010.[107] Nicholas D Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos EPapalexakis, and Christos Faloutsos. Tensor decomposition for signal processing andmachine learning.
IEEE Transactions on Signal Processing , 65(13):3551–3582, 2017.[108] Nicholas D Sidiropoulos and Anastasios Kyrillidis. Multi-way compressed sensing forsparse low-rank tensors.
IEEE Signal Processing Letters , 19(11):757–760, 2012.[109] Nicholas D Sidiropoulos, Evangelos E Papalexakis, and Christos Faloutsos. Paral-lel randomly compressed cubes: A scalable distributed architecture for big tensordecomposition.
IEEE Signal Processing Magazine , 31(5):57–70, 2014.[110] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with en-trywise l 1-norm error. In
Proceedings of the 49th Annual ACM SIGACT Symposiumon Theory of Computing , pages 688–701. ACM, 2017.[111] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rankapproximation. In
Proceedings of the Thirtieth Annual ACM-SIAM Symposium onDiscrete Algorithms , pages 2772–2789. Society for Industrial and Applied Mathemat-ics, 2019.[112] Will Wei Sun and Lexin Li. Sparse low-rank tensor response regression. arXiv preprintarXiv:1609.04523 , 2016.[113] Will Wei Sun and Lexin Li. Store: sparse tensor response regression and neuroimaginganalysis.
The Journal of Machine Learning Research , 18(1):4908–4944, 2017.[114] Yiming Sun, Yang Guo, Charlene Luo, Joel Tropp, and Madeleine Udell. Low-rank tucker approximation of a tensor from streaming data. arXiv preprintarXiv:1904.10951 , 2019.[115] Kim-Chuan Toh and Sangwoon Yun. An accelerated proximal gradient algorithm fornuclear norm regularized linear least squares problems.
Pacific Journal of Optimiza-tion , 6(615-640):15, 2010. 43116] Ryota Tomioka and Taiji Suzuki. Convex tensor decomposition via structured schat-ten norm regularization. In
Advances in neural information processing systems , pages1331–1339, 2013.[117] Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, and Hisashi Kashima. Statistical perfor-mance of convex tensor decomposition. In
Advances in Neural Information ProcessingSystems , pages 972–980, 2011.[118] Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Practical sketchingalgorithms for low-rank matrix approximation.
SIAM Journal on Matrix Analysis andApplications , 38(4):1454–1485, 2017.[119] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht.Low-rank solutions of linear matrix equations via procrustes flow. In
InternationalConference on Machine Learning , pages 964–973, 2016.[120] Ledyard R Tucker. Some mathematical notes on three-mode factor analysis.
Psy-chometrika , 31(3):279–311, 1966.[121] Madeleine Udell and Alex Townsend. Why are big data matrices approximately lowrank?
SIAM Journal on Mathematics of Data Science , 1(1):144–160, 2019.[122] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027 , 2010.[123] Nico Vervliet and Lieven De Lathauwer. A randomized block sampling approach tocanonical polyadic decomposition of large-scale tensors.
IEEE Journal of SelectedTopics in Signal Processing , 10(2):284–295, 2015.[124] Jialei Wang, Jason D Lee, Mehrdad Mahdavi, Mladen Kolar, Nathan Srebro, et al.Sketching meets random projection in the dual: A provable recovery algorithm for bigand high-dimensional data.
Electronic Journal of Statistics , 11(2):4896–4944, 2017.[125] Yining Wang, Hsiao-Yu Tung, Alexander J Smola, and Anima Anandkumar. Fast andguaranteed tensor decomposition via sketching. In
Advances in Neural InformationProcessing Systems , pages 991–999, 2015.[126] David P Woodruff. Sketching as a tool for numerical linear algebra.
Foundations andTrends R (cid:13) in Theoretical Computer Science , 10(1–2):1–157, 2014.[127] Dong Xia and Ming Yuan. On polynomial time methods for exact low rank tensorcompletion. arXiv preprint arXiv:1702.06980 , 2017.44128] Dong Xia, Ming Yuan, and Cun-Hui Zhang. Statistically optimal and computa-tionally efficient low rank tensor completion from noisy entries. arXiv preprintarXiv:1711.04934 , 2017.[129] Lingzhou Xue and Hui Zou. Sure independence screening and compressed randomsensing. Biometrika , pages 371–380, 2011.[130] Dan Yang, Zongming Ma, and Andreas Buja. A sparse singular value decompositionmethod for high-dimensional data.
Journal of Computational and Graphical Statistics ,23(4):923–942, 2014.[131] Yi Yang and Hui Zou. A fast unified algorithm for solving group-lasso penalizelearning problems.
Statistics and Computing , 25(6):1129–1141, 2015.[132] Ming Yu, Zhaoran Wang, Varun Gupta, and Mladen Kolar. Recovery of simultane-ous low rank and two-way sparse coefficient matrices, a nonconvex approach. arXivpreprint arXiv:1802.06967 , 2018.[133] Ming Yuan and Yi Lin. Model selection and estimation in regression with groupedvariables.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) ,68(1):49–67, 2006.[134] Ming Yuan and Cun-Hui Zhang. On tensor completion via nuclear norm minimization.
Foundations of Computational Mathematics , pages 1–38, 2014.[135] Anru Zhang. Cross: Efficient low-rank tensor completion.
The Annals of Statistics ,47(2):936–964, 2019.[136] Anru Zhang and Rungang Han. Optimal sparse singular value decomposition forhigh-dimensional high-order data.
Journal of the American Statistical Association ,page to appear, 2018.[137] Anru Zhang, Yuetian Luo, Garvesh Raskutti, and Ming Yuan. Supplement to“ISLET: Fast and optimal low-rank tensor regression via importance sketching”, 2018.[138] Anru Zhang, Yuetian Luo, Garvesh Raskutti, and Ming Yuan. A sharp blockwisetensor perturbation bound for higher-order orthogonal iteration. preprint , 2019.[139] Anru Zhang and Dong Xia. Tensor SVD: Statistical and computational limits.
IEEETransactions on Information Theory , 64(11):7311–7338, 2018.[140] Lijun Zhang, Mehrdad Mahdavi, Rong Jin, Tianbao Yang, and Shenghuo Zhu. Ran-dom projections for classification: A recovery approach.
IEEE Transactions on In-formation Theory , 60(11):7300–7316, 2014.45141] Yinqiang Zheng, Guangcan Liu, Shigeki Sugimoto, Shuicheng Yan, and MasatoshiOkutomi. Practical low-rank matrix approximation under robust l 1-norm. In , pages 1410–1417.IEEE, 2012.[142] Hua Zhou. Matlab tensorreg toolbox version 1.0, 2017. Available online athttps://hua-zhou.github.io/TensorReg/.[143] Hua Zhou, Lexin Li, and Hongtu Zhu. Tensor regression with applications inneuroimaging data analysis.
Journal of the American Statistical Association ,108(502):540–552, 2013.[144] Shuheng Zhou. Gemini: Graph estimation with matrix variate normal instances.
The Annals of Statistics, 42(2):532–562, 2014.

Supplement to "ISLET: Fast and Optimal Low-rank Tensor Regression via Importance Sketching"
Anru Zhang, Yuetian Luo, Garvesh Raskutti, and Ming Yuan
Abstract
In this supplement, we provide additional notation and preliminaries, the ISLET procedure for general-order tensor estimation, more details on tuning parameter selection, and all proofs of the main results of the paper.
A Additional Notation and Preliminaries
To conveniently specify the dimensions of tensors, for an order-$d$ tensor $\mathcal{A}$ with dimensions $p_1\times\cdots\times p_d$, we denote $p_{-k} = p_1\cdots p_d/p_k$ for $k = 1,\ldots,d$. Then the mode-$k$ matricization of $\mathcal{A}$, denoted $\mathcal{M}_k(\mathcal{A})$, has dimension $p_k\times p_{-k}$. For any matrix $D\in\mathbb{R}^{p_1\times p_2}$ and order-$d$ tensor $\mathcal{A}$, we formally define the vectorizations as

$\mathrm{vec}(D)\in\mathbb{R}^{p_1p_2},\quad \mathrm{vec}(D)_{[i_1+(i_2-1)p_1]} = D_{[i_1,i_2]};$

$\mathrm{vec}(\mathcal{A})\in\mathbb{R}^{p_1\cdots p_d},\quad \mathrm{vec}(\mathcal{A})_{[i_1+p_1(i_2-1)+\cdots+(i_d-1)p_1\cdots p_{d-1}]} = \mathcal{A}_{[i_1,\ldots,i_d]}.$

For any tensor $\mathcal{A}\in\mathbb{R}^{p_1\times\cdots\times p_d}$, the mode-$k$ matricization is formally defined as

$\mathcal{M}_k(\mathcal{A})\in\mathbb{R}^{p_k\times p_{-k}},\quad \mathcal{A}_{[i_1,\ldots,i_d]} = (\mathcal{M}_k(\mathcal{A}))_{[i_k,j]},\quad j = 1 + \sum_{l=1,\,l\neq k}^{d}(i_l-1)\prod_{m=1,\,m\neq k}^{l-1}p_m,$

for any $1\le i_l\le p_l$, $l = 1,\ldots,d$. See also [65, Section 2.4] for more discussion of tensor matricizations.

In order to better illustrate the proposed procedure, we have introduced a row-permutation operator $R_k$ that matches the indices of $W_k\otimes V_k$ to those of $\mathrm{vec}(\mathcal{A})$. In particular, if $\mathcal{A}\in\mathbb{R}^{p_1\times p_2\times p_3}$, $W_k\in\mathbb{R}^{p_{-k}\times r_k}$, and $V_k\in\mathbb{R}^{p_k\times r_k}$, then $R_k$ is defined as follows:

$(R_1(W_1\otimes V_1))_{[i_1+(i_2-1)p_1+(i_3-1)p_1p_2,\,:]} = (W_1\otimes V_1)_{[i_1+(i_2-1)p_1+(i_3-1)p_1p_2,\,:]},$

$(R_2(W_2\otimes V_2))_{[i_1+(i_2-1)p_1+(i_3-1)p_1p_2,\,:]} = (W_2\otimes V_2)_{[i_2+(i_1-1)p_2+(i_3-1)p_1p_2,\,:]},$

$(R_3(W_3\otimes V_3))_{[i_1+(i_2-1)p_1+(i_3-1)p_1p_2,\,:]} = (W_3\otimes V_3)_{[i_3+(i_1-1)p_3+(i_2-1)p_1p_3,\,:]}$

for $1\le i_1\le p_1$, $1\le i_2\le p_2$, $1\le i_3\le p_3$.
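The following small numpy sketch, included only to make these index conventions concrete, reproduces the column-major vectorization and the mode-$k$ matricization via `reshape`/`moveaxis` and spot-checks the index formulas (0-based indices); it is illustrative and not part of the original material.

```python
import numpy as np

p = (2, 3, 4)
A = np.arange(np.prod(p)).reshape(p, order="F")    # order-3 tensor, column-major layout

# vec(A): entry (i1, i2, i3) lands at i1 + p1*i2 + p1*p2*i3 (0-based)
vecA = A.reshape(-1, order="F")

def mat_k(T, k):
    """Mode-k matricization M_k(T) of shape p_k x p_{-k}."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1, order="F")

M1 = mat_k(A, 0)
i1, i2, i3 = 1, 2, 3
# A[i1,i2,i3] = M_1(A)[i1, i2 + p2*i3] and = vec(A)[i1 + p1*i2 + p1*p2*i3]
assert A[i1, i2, i3] == M1[i1, i2 + p[1] * i3]
assert vecA[i1 + p[0] * i2 + p[0] * p[1] * i3] == A[i1, i2, i3]
print(M1.shape)                                     # (2, 12)
```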
B ADHD MRI Imaging Data Analysis

In this section, we display the value of our method for predicting attention deficit hyperactivity disorder (ADHD) using the magnetic resonance imaging (MRI) dataset provided by the Neuro Bureau (http://neurobureau.projects.nitrc.org/ADHD200/Data.html). The dataset involves 973 subjects, where each subject is associated with a 121-by-145-by-121 MRI image and several demographic variables. After removing missing values, we obtain 930 samples, among which 356 are diagnosed and 574 are control subjects.

We aim to predict the diagnosis label $y_i$ of the $i$-th observation based on its covariates: the MRI image $\mathcal{X}_i$ and the demographic variables age $x_{i1}$, gender $x_{i2}$, and handedness $x_{i3}$. To better handle the task of predicting the binary response $y_i$ while incorporating the demographic information in addition to the tensor image covariates, we apply importance sketching, the central idea of ISLET, for dimension reduction. Five-fold cross-validation is applied to examine the prediction power. Specifically, for each repetition $l$, we randomly partition the samples into five folds $\{\Omega^{(l)}_j\}_{j=1,\ldots,5}\subseteq\{1,\ldots,930\}$. For $j = 1,\ldots,5$, we assign one fold $\Omega^{(l)}_j$ and the other four folds $\Omega^{(l)}_{-j} = \cup_{j'\neq j}\Omega_{j'}$ as the testing and training sets, respectively. We apply Step 1 of sparse ISLET (described in Section 2.3) to $\{y_i,\mathcal{X}_i\}_{i\in\Omega^{(l)}_{-j}}$ to obtain $\tilde{U}_1,\tilde{U}_2,\tilde{U}_3$, construct the importance sketching covariates $\tilde{x}_i = \mathrm{vec}(\mathcal{X}_i\times_1\tilde{U}_1^\top\times_2\tilde{U}_2^\top\times_3\tilde{U}_3^\top)$, and perform logistic regression of $y_i$ on the combined covariates $[\tilde{x}_i, x_{i1}, x_{i2}, x_{i3}]$, $i\in\Omega^{(l)}_{-j}$, with a possible $\ell_1$ regularizer to obtain the estimates. We then use the estimates and $[\tilde{x}_i, x_{i1}, x_{i2}, x_{i3}]$, $i\in\Omega^{(l)}_j$, to predict the labels of the samples in the testing set $\Omega^{(l)}_j$. For comparison, we also perform Tucker regression and Tucker regression with regularizer proposed by [77, 143] under the same setting. Since it is computationally intensive to perform full Tucker regression on the complete tensor covariates of dimension 121 × 145 × 121, we apply Tucker regression to images downsized to 12 × 12 × 12 using the code available at the authors' website [142]. For all methods, we input Tucker rank $(r, r, r)$ for $r = 3, 4, 5$ and average the results over all repetitions $l$ and folds $j = 1,\ldots,5$.

Table: average prediction accuracy (standard errors in parentheses) and run time for $r = 3, 4, 5$.

Adding the $\ell_1$ regularizer provides more accurate prediction but costs more time. In addition, compared with the downsizing method of [143, 77], which deterministically relies on external downsizing of the images, the importance sketching directions used here are driven by the data.
C ISLET for General Order Tensor Estimation
For completeness, we provide the ISLET procedure for general order-$d$ low-rank tensor estimation in this section. The procedure for $d\ge 3$ is given in Algorithm 3 and the procedure for $d = 2$ (i.e., low-rank matrix estimation) is given in Algorithm 4. The sparse versions for $d\ge 3$ and $d = 2$ are provided in Algorithms 5 and 6, respectively.

D More Details on Tuning Parameter Selection
The implementation of ISLET requires the rank $\mathbf{r}$ as an input. When $\mathbf{r}$ is unknown in practice, we propose a two-stage scheme for adaptive low-rank tensor regression. First, we input a conservatively large value $\mathbf{r}_{\rm ini}$ into ISLET to obtain $\hat{\mathcal{B}},\hat{D}_k$ (regular case) or $\hat{\mathcal{B}},\hat{E}_k$ (sparse case), based on which we estimate the rank $\hat{\mathbf{r}}$ by the "Cross scheme" recently introduced by [135]. Then, we run ISLET again with $\hat{\mathbf{r}}$ to obtain the final estimates. The pseudo-codes for regular and sparse order-$d$ tensor regression are provided in Algorithms 7 and 8, respectively.

Next, we perform simulation studies to verify the proposed rank selection scheme in both the regular and sparse cases. In particular, let $p = 20$, take $\mathbf{r}_{\rm ini}$ conservatively large, let $n$ grow from 2000, $\sigma = 5$, $s = 12$, and let the actual rank be $r = 3, 5$.
We randomly generate the regular and sparse regression settings as described in Section 5 and then perform Algorithms 7 and 8. The average estimation error results are plotted in Figures 12 and 13 for the regular and sparse cases, respectively. In both cases, the estimation errors obtained with the selected rank are close to those obtained with the known rank, and the difference decreases as the sample size grows.
Algorithm 3 Order-$d$ ISLET ($d\ge 3$)

Input: $y_1,\ldots,y_n\in\mathbb{R}$, $\mathcal{X}_1,\ldots,\mathcal{X}_n\in\mathbb{R}^{p_1\times\cdots\times p_d}$, rank $\mathbf{r} = (r_1,\ldots,r_d)$.
1. Evaluate $\tilde{\mathcal{A}} = \frac{1}{n}\sum_{j=1}^n y_j\mathcal{X}_j$. Apply order-$d$ HOOI to $\tilde{\mathcal{A}}$ to obtain initial estimates $\tilde{U}_k$, $k = 1,\ldots,d$. Let $\tilde{\mathcal{S}} = [\![\,\tilde{\mathcal{A}};\tilde{U}_1^\top,\ldots,\tilde{U}_d^\top\,]\!]$ and evaluate the sketching directions $\tilde{V}_k = \mathrm{QR}\big[\mathcal{M}_k(\tilde{\mathcal{S}})^\top\big]$, $k = 1,\ldots,d$.
2. Construct $\tilde{X} = [\tilde{X}_B\ \tilde{X}_{D_1}\ \cdots\ \tilde{X}_{D_d}]\in\mathbb{R}^{n\times m}$, where $\tilde{X}_B\in\mathbb{R}^{n\times m_B}$, $(\tilde{X}_B)_{[i,:]} = \mathrm{vec}\big(\mathcal{X}_i\times_{l=1}^d\tilde{U}_l^\top\big)$, and $\tilde{X}_{D_k}\in\mathbb{R}^{n\times m_{D_k}}$, $(\tilde{X}_{D_k})_{[i,:]} = \mathrm{vec}\big(\tilde{U}_{k\perp}^\top\mathcal{M}_k\big(\mathcal{X}_i\times_{l\neq k}\tilde{U}_l^\top\big)\tilde{V}_k\big)$, for $m_B = r_1\cdots r_d$, $m_{D_k} = (p_k-r_k)r_k$, $k = 1,\ldots,d$, and $m = m_B + m_{D_1}+\cdots+m_{D_d}$.
3. Solve $\hat{\gamma} = \arg\min_{\gamma\in\mathbb{R}^m}\|y-\tilde{X}\gamma\|_2^2$ and partition $\hat{\gamma}$ into $\hat{\mathcal{B}},\hat{D}_1,\ldots,\hat{D}_d$: $\mathrm{vec}(\hat{\mathcal{B}}) := \hat{\gamma}_B = \hat{\gamma}_{[1:m_B]}$, $\mathrm{vec}(\hat{D}_k) := \hat{\gamma}_{D_k} = \hat{\gamma}_{[(m_B+\sum_{k'=1}^{k-1}m_{D_{k'}}+1):(m_B+\sum_{k'=1}^{k}m_{D_{k'}})]}$, $k = 1,\ldots,d$.
4. Let $\hat{B}_k = \mathcal{M}_k(\hat{\mathcal{B}})$ and evaluate $\hat{\mathcal{A}} = [\![\,\hat{\mathcal{B}};\hat{L}_1,\ldots,\hat{L}_d\,]\!]$ with $\hat{L}_k = \big(\tilde{U}_k\hat{B}_k\tilde{V}_k + \tilde{U}_{k\perp}\hat{D}_k\big)\big(\hat{B}_k\tilde{V}_k\big)^{-1}$, $k = 1,\ldots,d$.

Algorithm 4 Matrix ISLET

Input: $y_1,\ldots,y_n\in\mathbb{R}$, $X_1,\ldots,X_n\in\mathbb{R}^{p_1\times p_2}$, rank $r$.
1. Evaluate $\tilde{A} = \frac{1}{n}\sum_{j=1}^n y_jX_j$ and let $\tilde{U}_1 = \mathrm{SVD}_r(\tilde{A})$, $\tilde{U}_2 = \mathrm{SVD}_r(\tilde{A}^\top)$.
2. Construct $\tilde{X} = [\tilde{X}_B\ \tilde{X}_{D_1}\ \tilde{X}_{D_2}]\in\mathbb{R}^{n\times r(p_1+p_2-r)}$, where $\tilde{X}_B\in\mathbb{R}^{n\times r^2}$, $(\tilde{X}_B)_{[i,:]} = \mathrm{vec}(\tilde{U}_1^\top X_i\tilde{U}_2)$, $\tilde{X}_{D_k}\in\mathbb{R}^{n\times(p_k-r)r}$, $(\tilde{X}_{D_1})_{[i,:]} = \mathrm{vec}(\tilde{U}_{1\perp}^\top X_i\tilde{U}_2)$, $(\tilde{X}_{D_2})_{[i,:]} = \mathrm{vec}(\tilde{U}_{2\perp}^\top X_i^\top\tilde{U}_1)$.
3. Solve $\hat{\gamma} = \arg\min_{\gamma\in\mathbb{R}^m}\|y-\tilde{X}\gamma\|_2^2$ and partition $\hat{\gamma}$ into $\hat{B},\hat{D}_1,\hat{D}_2$: $\mathrm{vec}(\hat{B}) := \hat{\gamma}_{[1:r^2]}$, $\mathrm{vec}(\hat{D}_1) := \hat{\gamma}_{[(r^2+1):rp_1]}$, $\mathrm{vec}(\hat{D}_2) := \hat{\gamma}_{[(rp_1+1):(r(p_1+p_2-r))]}$.
4. Evaluate $\hat{A} = \hat{L}_1\hat{B}\hat{L}_2^\top$ with $\hat{L}_1 = \big(\tilde{U}_1\hat{B} + \tilde{U}_{1\perp}\hat{D}_1\big)\hat{B}^{-1}$ and $\hat{L}_2 = \big(\tilde{U}_2\hat{B}^\top + \tilde{U}_{2\perp}\hat{D}_2\big)\big(\hat{B}^\top\big)^{-1}$.
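To make the matrix case concrete, here is a rough numpy rendition of the steps of Algorithm 4 on simulated data. The sizes are arbitrary toy choices, and rank selection, tuning, and the sample-splitting variant are omitted.

```python
import numpy as np

rng = np.random.default_rng(11)
p1, p2, r, n, sigma = 30, 30, 2, 2000, 1.0

# simulated rank-r matrix regression data
A_true = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))
X = rng.standard_normal((n, p1, p2))
y = np.einsum('ijk,jk->i', X, A_true) + sigma * rng.standard_normal(n)

# Step 1: covariance matrix and sketching directions
A_tilde = np.einsum('i,ijk->jk', y, X) / n
U_full, _, Vt = np.linalg.svd(A_tilde)
U1, U1p = U_full[:, :r], U_full[:, r:]
V = Vt.T
U2, U2p = V[:, :r], V[:, r:]

# Step 2: importance sketching covariates (vec = column-major ravel)
XB  = np.stack([(U1.T @ Xi @ U2).ravel(order='F')   for Xi in X])
XD1 = np.stack([(U1p.T @ Xi @ U2).ravel(order='F')  for Xi in X])
XD2 = np.stack([(U2p.T @ Xi.T @ U1).ravel(order='F') for Xi in X])
X_sk = np.hstack([XB, XD1, XD2])

# Step 3: low-dimensional least squares and partition
gam, *_ = np.linalg.lstsq(X_sk, y, rcond=None)
B  = gam[:r * r].reshape(r, r, order='F')
D1 = gam[r * r: r * r + (p1 - r) * r].reshape(p1 - r, r, order='F')
D2 = gam[r * r + (p1 - r) * r:].reshape(p2 - r, r, order='F')

# Step 4: assemble the estimator
L1 = (U1 @ B + U1p @ D1) @ np.linalg.inv(B)
L2 = (U2 @ B.T + U2p @ D2) @ np.linalg.inv(B.T)
A_hat = L1 @ B @ L2.T
print(np.linalg.norm(A_hat - A_true) / np.linalg.norm(A_true))
```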
Algorithm 5 Order-$d$ Sparse ISLET
Input: $y_1,\ldots,y_n\in\mathbb{R}$, $X_1,\ldots,X_n\in\mathbb{R}^{p_1\times\cdots\times p_d}$, rank $r=(r_1,\ldots,r_d)$, sparsity index $J_s$.
1. Evaluate $\tilde A=\frac{1}{n}\sum_{j=1}^n y_jX_j$. Apply STAT-SVD on $\tilde A$ with sparsity index $J_s$; let the outcome be $\tilde U_1,\ldots,\tilde U_d$.
2. Let $\tilde S=[\![\tilde A;\tilde U_1^\top,\ldots,\tilde U_d^\top]\!]$ and evaluate the probing directions $\tilde V_k={\rm QR}\big[\mathcal{M}_k(\tilde S)^\top\big]$, $k=1,\ldots,d$.
3. Construct $\tilde X_B\in\mathbb{R}^{n\times(r_1\cdots r_d)}$, $(\tilde X_B)_{[i,:]}={\rm vec}\big(X_i\times_{l=1}^d\tilde U_l^\top\big)$; $\tilde X_{E_k}\in\mathbb{R}^{n\times(p_kr_k)}$, $(\tilde X_{E_k})_{[i,:]}={\rm vec}\big(\mathcal{M}_k\big(X_i\times_{l\ne k}\tilde U_l^\top\big)\tilde V_k\big)$, $k=1,\ldots,d$.
4. Solve
$\hat B\in\mathbb{R}^{r_1\times\cdots\times r_d}$, ${\rm vec}(\hat B)=\arg\min_{\gamma\in\mathbb{R}^{r_1\cdots r_d}}\|y-\tilde X_B\gamma\|_2^2$;
$\hat E_k\in\mathbb{R}^{p_k\times r_k}$, ${\rm vec}(\hat E_k)=\arg\min_{\gamma\in\mathbb{R}^{p_kr_k}}\|y-\tilde X_{E_k}\gamma\|_2^2+\lambda_k\sum_{j=1}^{p_k}\|\gamma_{G_{kj}}\|_2$ if $k\in J_s$; ${\rm vec}(\hat E_k)=\arg\min_{\gamma\in\mathbb{R}^{p_kr_k}}\|y-\tilde X_{E_k}\gamma\|_2^2$ if $k\notin J_s$.
5. Evaluate $\hat A=[\![\hat B;\,\hat E_1(\tilde U_1^\top\hat E_1)^{-1},\ldots,\hat E_d(\tilde U_d^\top\hat E_d)^{-1}]\!]$.

Algorithm 6 Matrix Sparse ISLET
Input: $y_1,\ldots,y_n\in\mathbb{R}$, $X_1,\ldots,X_n\in\mathbb{R}^{p_1\times p_2}$, rank $r$, sparsity index $J_s\subseteq\{1,2\}$.
1. Evaluate $\tilde A=\frac{1}{n}\sum_{j=1}^n y_jX_j$. Apply sparse matrix SVD (the two-way iterative thresholding in [130] or the order-2 version of STAT-SVD in [136]) on $\tilde A$ with sparsity index $J_s$. Let the estimated left and right subspaces be $\tilde U_1,\tilde U_2$.
2. Construct $\tilde X_B\in\mathbb{R}^{n\times r^2}$, $(\tilde X_B)_{[i,:]}={\rm vec}\big(\tilde U_1^\top X_i\tilde U_2\big)$; $\tilde X_{E_k}\in\mathbb{R}^{n\times(p_kr)}$, $(\tilde X_{E_1})_{[i,:]}={\rm vec}\big(X_i\tilde U_2\big)$, $(\tilde X_{E_2})_{[i,:]}={\rm vec}\big(\tilde U_1^\top X_i\big)$.
3. Solve
$\hat B\in\mathbb{R}^{r\times r}$, ${\rm vec}(\hat B)=\arg\min_{\gamma\in\mathbb{R}^{r^2}}\|y-\tilde X_B\gamma\|_2^2$;
$\hat E_k\in\mathbb{R}^{p_k\times r}$, ${\rm vec}(\hat E_k)=\arg\min_{\gamma\in\mathbb{R}^{p_kr}}\|y-\tilde X_{E_k}\gamma\|_2^2+\lambda_k\sum_{j=1}^{p_k}\|\gamma_{G_{kj}}\|_2$ if $k\in J_s$; ${\rm vec}(\hat E_k)=\arg\min_{\gamma\in\mathbb{R}^{p_kr}}\|y-\tilde X_{E_k}\gamma\|_2^2$ if $k\notin J_s$.
4. Evaluate $\hat A=\hat E_1(\tilde U_1^\top\hat E_1)^{-1}\hat B(\tilde U_2^\top\hat E_2)^{-\top}\hat E_2^\top$.
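For modes $k\in J_s$, Algorithms 5 and 6 replace the least-squares arm regression by a group Lasso over the row groups $G_{kj}$. The snippet below is one generic proximal-gradient solver for that subproblem, written for illustration only; it is not the solver used in the paper, and `lam` plays the role of the penalty level $\lambda_k$.

```python
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=500):
    """Minimize ||y - X g||_2^2 + lam * sum_j ||g_{G_j}||_2 by proximal gradient."""
    g = np.zeros(X.shape[1])
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)    # 1/L for the smooth part
    for _ in range(n_iter):
        z = g - step * (2 * X.T @ (X @ g - y))      # gradient step
        for G in groups:                            # block soft-thresholding
            nrm = np.linalg.norm(z[G])
            z[G] = 0.0 if nrm == 0 else max(0.0, 1 - step * lam / nrm) * z[G]
        g = z
    return g

def row_groups(p_k, r_k):
    """0-indexed analogue of the groups G_{kj}: row j of E_k occupies
    positions j, j + p_k, ..., j + p_k*(r_k - 1) of vec(E_k)."""
    return [np.arange(j, p_k * r_k, p_k) for j in range(p_k)]
```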
Algorithm 7 Order-$d$ ISLET, unknown $r$
Input: $y_1,\ldots,y_n\in\mathbb{R}$, $X_1,\ldots,X_n\in\mathbb{R}^{p_1\times\cdots\times p_d}$, rank $r_{\rm ini}=(r_{1,\rm ini},\ldots,r_{d,\rm ini})$.
1. Apply Algorithm 1, 3, or 4 with rank $r_{\rm ini}$ to obtain $\tilde U_k,\tilde V_k,\hat B$, and $\hat D_k$ for $k=1,\ldots,d$. Denote $\hat B_k=\mathcal{M}_k(\hat B)$.
2. Evaluate $U_k^{(B)}$ and $V_k^{(A)}$ via SVDs, then rotate:
$U_k^{(B)}\in\mathbb{O}_{r_{k,\rm ini}}$, the left singular vectors of $\hat B_k$; $V_k^{(A)}\in\mathbb{O}_{r_{k,\rm ini}}$, the right singular vectors of $\big(\tilde U_k\hat B_k\tilde V_k+\tilde U_{k\perp}\hat D_k\big)$;
$A_k=\big(\tilde U_k\hat B_k\tilde V_k+\tilde U_{k\perp}\hat D_k\big)V_k^{(A)}\in\mathbb{R}^{p_k\times r_{k,\rm ini}}$, $J_k=(U_k^{(B)})^\top\big(\hat B_k\tilde V_k\big)V_k^{(A)}\in\mathbb{R}^{r_{k,\rm ini}\times r_{k,\rm ini}}$.
3. For $k=1,\ldots,d$: for $s=r_{k,\rm ini}$ down to 1, if $J_{k,[1:s,1:s]}$ is non-singular and $\|A_{k,[:,1:s]}J_{k,[1:s,1:s]}^{-1}\|$ does not exceed the prescribed threshold, set $\hat r_k=s$ and break from the loop; if $\hat r_k$ is still unassigned, set $\hat r_k=0$.
4. Apply Algorithm 1 again with rank $\hat r=(\hat r_1,\ldots,\hat r_d)$. Let the final output be $\hat A$.

Algorithm 8 Order-$d$ Sparse ISLET, unknown $r$
Input: $y_1,\ldots,y_n\in\mathbb{R}$, $X_1,\ldots,X_n\in\mathbb{R}^{p_1\times\cdots\times p_d}$, rank $r_{\rm ini}$, sparsity index $J_s$.
1. Apply Algorithm 2, 5, or 6 with rank $r_{\rm ini}$ to obtain $\tilde U_k,\tilde V_k,\hat B$, and $\hat E_k$ for $k=1,\ldots,d$. Denote $\hat B_k=\mathcal{M}_k(\hat B)$.
2. Evaluate $U_k^{(B)}$ and $V_k^{(A)}$ via SVDs, then rotate: $U_k^{(B)}\in\mathbb{O}_{r_{k,\rm ini}}$, the left singular vectors of $\hat B_k$; $V_k^{(A)}\in\mathbb{O}_{r_{k,\rm ini}}$, the right singular vectors of $\hat E_k$; $A_k=\hat E_kV_k^{(A)}\in\mathbb{R}^{p_k\times r_{k,\rm ini}}$, $J_k=(U_k^{(B)})^\top\big(\hat B_k\tilde V_k\big)V_k^{(A)}\in\mathbb{R}^{r_{k,\rm ini}\times r_{k,\rm ini}}$.
3. For $k=1,\ldots,d$: for $s=r_{k,\rm ini}$ down to 1, if $J_{k,[1:s,1:s]}$ is non-singular and $\|A_{k,[:,1:s]}J_{k,[1:s,1:s]}^{-1}\|$ does not exceed the prescribed threshold, set $\hat r_k=s$ and break from the loop; if $\hat r_k$ is still unassigned, set $\hat r_k=0$.
4. Apply Algorithm 2 again with rank $\hat r=(\hat r_1,\ldots,\hat r_d)$. Let the final output be $\hat A$.

The average estimation error results are plotted in Figures 12 and 13 for the regular and sparse cases, respectively. In both cases, the estimation errors with known rank are close to those with unknown rank, and the difference shrinks as the sample size grows.
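The selection loop in Step 3 of Algorithms 7 and 8 can be written compactly as follows; `tol` stands for the spectral-norm threshold in the algorithm and is treated here as a user-supplied parameter.

```python
import numpy as np

def select_rank(A_k, J_k, tol):
    """Scan s = r_ini, ..., 1 and return the largest admissible s (0 if none).

    A_k: p_k x r_ini and J_k: r_ini x r_ini, as defined in Algorithms 7/8.
    tol is the spectral-norm threshold on ||A_k[:, :s] @ inv(J_k[:s, :s])||.
    """
    r_ini = J_k.shape[0]
    for s in range(r_ini, 0, -1):
        Js = J_k[:s, :s]
        if np.linalg.matrix_rank(Js) < s:       # singular leading block: skip
            continue
        if np.linalg.norm(A_k[:, :s] @ np.linalg.inv(Js), 2) <= tol:
            return s
    return 0
```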
E Simulation Study on Approximate Low-rank Tensor Regression

We provide simulation results on the performance of ISLET when the parameter $A$ is approximately low-rank. Specifically, we first simulate an exact low Tucker rank tensor $A_0$ in the same way as in the previous settings and simulate $Z$ as a perturbation tensor with i.i.d. standard normal entries. Then we set $A=A_0+\tau\|A_0\|_{\rm F}\,Z/p^{3/2}$. The responses $y_j$ and covariates $X_j$ are generated in the same way as in the previous settings. Let $\sigma=5$, $p=20$, $s_1=s_2=s_3=12$, the sample size $n$ range from 2000 up, and $\tau$ take four values, the smallest being 0. Here $\tau$ characterizes how close $A$ is to an exactly low-rank tensor; $A$ is exactly low-rank if $\tau=0$. We apply ISLET in both the regular and sparse regimes with the tuning parameter selection scheme described in Algorithms 7 and 8. The results are collected in Figure 14. The estimation error decreases as $\tau$ decreases or $n$ increases; generally speaking, ISLET achieves good performance under both the regular and the sparse regime when the true parameter $A$ is only approximately low-rank.

Figure 12 (two panels, RMSE against sample size; curves for $p=20,30$ with known and unknown rank): ISLET, known rank vs. unknown rank. Here $\sigma=5$ and $r_{\rm ini}=\lfloor p/\ \rfloor$. Panel (a): $r=3$; panel (b): $r=5$.

Figure 13 (two panels, RMSE against sample size; curves for $p=20,30$ with known and unknown rank): Sparse ISLET, known rank vs. unknown rank. Here $\sigma=5$, $r_{\rm ini}=\lfloor p/\ \rfloor$, $s=12$. Panel (a): $r=3$; panel (b): $r=5$.

Figure 14 (two panels, RMSE against sample size; one curve per value of $\tau$): Average estimation error of ISLET under the approximate low Tucker rank case. Left panel: regular case; right panel: sparse case. Here $\sigma=5$, $p=20$, $s_1=s_2=s_3=12$, and $\tau$ takes four values, the smallest being 0.
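A short sketch of how an approximately low Tucker rank parameter can be generated for this experiment. The division by $p^{3/2}\approx\|Z\|_{\rm F}$ makes $\tau$ roughly the relative Frobenius size of the perturbation; treat that exact normalization as an assumption of this sketch.

```python
import numpy as np

def approx_low_rank_tensor(p, r, tau, rng=None):
    """A = A0 + tau * ||A0||_F * Z / p**1.5, with A0 of Tucker rank (r, r, r)."""
    rng = np.random.default_rng(0) if rng is None else rng
    S = rng.standard_normal((r, r, r))                            # random core
    U = [np.linalg.qr(rng.standard_normal((p, r)))[0] for _ in range(3)]
    A0 = np.einsum('abc,ia,jb,kc->ijk', S, U[0], U[1], U[2])      # exact low-rank part
    Z = rng.standard_normal((p, p, p))                            # dense perturbation
    return A0 + tau * np.linalg.norm(A0) * Z / p ** 1.5
```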
F Proofs

We collect all proofs of the main technical results in this section.
F.1 Proof of Theorem 2
This theorem aims to develop a deterministic error bound for $\|\hat A-A\|_{\rm HS}$ in terms of the sketching direction error $\theta$, $\rho$, and the error term $\|(\tilde X^\top\tilde X)^{-1}\tilde X^\top\tilde\varepsilon\|_2$. Since the proof is long and technically involved, we divide the argument into six steps. In Step 1, we introduce the notation used throughout the proof. In Step 2, we relate $\|(\tilde X^\top\tilde X)^{-1}\tilde X^\top\tilde\varepsilon\|_2^2$ to $\|\hat B-\tilde B\|_{\rm HS}^2+\sum_{k=1}^3\|\hat D_k-\tilde D_k\|_{\rm F}^2$. In Step 3, we introduce the factorizations of $A$ and $\hat A$. Based on these factorizations and the properties of orthogonal projections, in Step 4 we decompose the loss $\|\hat A-A\|_{\rm HS}^2$ into eight terms. In Step 5, we bound some intermediate error terms in terms of $\theta$ and $\rho$ using properties of the spectral norm and the least singular value. In the last Step 6, we finish the proof by bounding each of the eight terms from Step 4 using the results of Steps 2 and 5 and Lemma 3.

Step 1. For simplicity, we denote
$x_j={\rm vec}(X_j)\in\mathbb{R}^{p_1p_2p_3}$, $X_{jk}=\mathcal{M}_k(X_j)\in\mathbb{R}^{p_k\times(p_{k+1}p_{k+2})}$, $a={\rm vec}(A)\in\mathbb{R}^{p_1p_2p_3}$, $A_k=\mathcal{M}_k(A)\in\mathbb{R}^{p_k\times(p_{k+1}p_{k+2})}$,
the vectorized and matricized tensor covariates and parameter. (Note that $X_{jk}$ is a matrix rather than the $(j,k)$-th entry of $X$; we use $X_{[j,k]}$ to denote the $(j,k)$-th entry of a matrix $X$ in our notation system.) All mode indices $(\cdot)_k$ are understood modulo 3, e.g., $p_4=p_1$, $A_4=A_1$, $X_{j4}=X_{j1}$, etc. Recall
$W_1=(U_3\otimes U_2)V_1$, $W_2=(U_3\otimes U_1)V_2$, $W_3=(U_2\otimes U_1)V_3$, $\tilde W_1=(\tilde U_3\otimes\tilde U_2)\tilde V_1$, $\tilde W_2=(\tilde U_3\otimes\tilde U_1)\tilde V_2$, $\tilde W_3=(\tilde U_2\otimes\tilde U_1)\tilde V_3$,
$\tilde B=[\![A;\tilde U_1^\top,\tilde U_2^\top,\tilde U_3^\top]\!]=[\![S\times_1U_1\times_2U_2\times_3U_3;\tilde U_1^\top,\tilde U_2^\top,\tilde U_3^\top]\!]\in\mathbb{R}^{r_1\times r_2\times r_3};$
$\tilde D_1=\tilde U_{1\perp}^\top\mathcal{M}_1(A\times_2\tilde U_2^\top\times_3\tilde U_3^\top)\tilde V_1=\tilde U_{1\perp}^\top A_1\tilde W_1\in\mathbb{R}^{(p_1-r_1)\times r_1},$
$\tilde D_2=\tilde U_{2\perp}^\top\mathcal{M}_2(A\times_1\tilde U_1^\top\times_3\tilde U_3^\top)\tilde V_2=\tilde U_{2\perp}^\top A_2\tilde W_2\in\mathbb{R}^{(p_2-r_2)\times r_2},$
$\tilde D_3=\tilde U_{3\perp}^\top\mathcal{M}_3(A\times_1\tilde U_1^\top\times_2\tilde U_2^\top)\tilde V_3=\tilde U_{3\perp}^\top A_3\tilde W_3\in\mathbb{R}^{(p_3-r_3)\times r_3}.$   (38)
Intuitively speaking, $\tilde B$ is the parameter core tensor lying in the singular subspace $\tilde U_1\otimes\tilde U_2\otimes\tilde U_3$, and $\tilde D_1,\tilde D_2,\tilde D_3$ are the parameter matrices corresponding to the arm-minus-body parts lying in the singular subspaces $R(\tilde W_1\otimes\tilde U_{1\perp})$, $R(\tilde W_2\otimes\tilde U_{2\perp})$, $R(\tilde W_3\otimes\tilde U_{3\perp})$.

Step 2. In this step, we introduce an important decomposition of $y_j$ and of the error term $\|(\tilde X^\top\tilde X)^{-1}\tilde X^\top\tilde\varepsilon\|_2$. In correspondence with $\hat\gamma$ in (7), we construct $\tilde\gamma$ as
$\tilde\gamma=\big({\rm vec}(\tilde B)^\top,{\rm vec}(\tilde D_1)^\top,{\rm vec}(\tilde D_2)^\top,{\rm vec}(\tilde D_3)^\top\big)^\top\in\mathbb{R}^m.$   (39)
Then, for $j=1,\ldots,n$, the response $y_j$ can be decomposed as
$y_j=\langle X_j,A\rangle+\varepsilon_j=\langle x_j,a\rangle+\varepsilon_j=\langle x_j,P_{\tilde U}a\rangle+\varepsilon_j+\langle x_j,P_{\tilde U_\perp}a\rangle$
$\quad=\big\langle x_j,P_{\tilde U_1\otimes\tilde U_2\otimes\tilde U_3}a\big\rangle+\sum_{k=1}^3\big\langle x_j,P_{R_k(\tilde U_{k\perp}\otimes\tilde W_k)}a\big\rangle+\tilde\varepsilon_j$
$\quad\overset{(38)}{=}\big\langle(\tilde U_1\otimes\tilde U_2\otimes\tilde U_3)^\top x_j,(\tilde U_1\otimes\tilde U_2\otimes\tilde U_3)^\top a\big\rangle+\sum_{k=1}^3\big\langle\tilde U_{k\perp}^\top X_{jk}\tilde W_k,\tilde U_{k\perp}^\top A_k\tilde W_k\big\rangle+\tilde\varepsilon_j$
$\quad\overset{(38)}{=}(\tilde X_B)_{[j,:]}{\rm vec}(\tilde B)+\sum_{k=1}^3(\tilde X_{D_k})_{[j,:]}{\rm vec}(\tilde D_k)+\tilde\varepsilon_j=\tilde X_{[j,:]}\,\tilde\gamma+\tilde\varepsilon_j.$   (40)
Given the definitions of $\hat B,\hat D_k$ (8) and $\hat\gamma$ (7) and the fact that $\tilde X^\top\tilde X$ is non-singular, $\hat\gamma$ can be rewritten in the following vectorized form:
$\hat\gamma=\arg\min_{\gamma\in\mathbb{R}^m}\sum_{i=1}^n\big(y_i-\tilde X_{[i,:]}\gamma\big)^2=\arg\min_{\gamma\in\mathbb{R}^m}\big\|y-\tilde X\gamma\big\|_2^2=(\tilde X^\top\tilde X)^{-1}\tilde X^\top y=(\tilde X^\top\tilde X)^{-1}\tilde X^\top(\tilde X\tilde\gamma+\tilde\varepsilon)=\tilde\gamma+(\tilde X^\top\tilde X)^{-1}\tilde X^\top\tilde\varepsilon.$
Recall $m=r_1r_2r_3+\sum_{k=1}^3(p_k-r_k)r_k$. Thus, by the definitions of $\tilde\gamma$ (39), $\hat\gamma$ (7), and $\hat B,\hat D_k$ (8), we have
$\|\hat B-\tilde B\|_{\rm HS}^2+\sum_{k=1}^3\|\hat D_k-\tilde D_k\|_{\rm F}^2=\|\hat\gamma-\tilde\gamma\|_2^2=\big\|(\tilde X^\top\tilde X)^{-1}\tilde X^\top\tilde\varepsilon\big\|_2^2:=\kappa^2.$   (41)
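Equation (41) is nothing more than the closed form of the least-squares estimator under the reduced model $y=\tilde X\tilde\gamma+\tilde\varepsilon$. A quick numerical check of that identity with generic Gaussian data (the matrix below simply plays the role of $\tilde X$; nothing ISLET-specific is assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 30
X = rng.standard_normal((n, m))           # plays the role of the sketched design
gamma_t = rng.standard_normal(m)          # plays the role of gamma_tilde
eps = rng.standard_normal(n)              # plays the role of eps_tilde
y = X @ gamma_t + eps

gamma_hat = np.linalg.lstsq(X, y, rcond=None)[0]
correction = np.linalg.solve(X.T @ X, X.T @ eps)
# gamma_hat = gamma_tilde + (X^T X)^{-1} X^T eps, so the two sides of (41) agree:
assert np.allclose(gamma_hat, gamma_t + correction)
assert np.isclose(np.linalg.norm(gamma_hat - gamma_t), np.linalg.norm(correction))
```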
Step 3. In this step, we introduce the factorization (43) of $A$. Since the left and right singular subspaces of $A_k$ are $U_k$ and $W_k$, respectively,
$\sigma_{r_k}\big(\tilde U_k^\top A_k\tilde W_k\big)=\sigma_{r_k}\big(\tilde U_k^\top P_{U_k}A_kP_{W_k}\tilde W_k\big)=\sigma_{r_k}\big((\tilde U_k^\top U_k)\,U_k^\top A_kW_k\,(W_k^\top\tilde W_k)\big)\ge\sigma_{\min}(\tilde U_k^\top U_k)\cdot\sigma_{\min}(U_k^\top A_kW_k)\cdot\sigma_{\min}(W_k^\top\tilde W_k)$
$\quad=\sqrt{1-\|\sin\Theta(\tilde U_k,U_k)\|^2}\cdot\sigma_{r_k}(A_k)\cdot\sqrt{1-\|\sin\Theta(\tilde W_k,W_k)\|^2}\ \ge\ \sigma_{r_k}(A_k)(1-\theta^2)>0.$   (42)
Here, the second-to-last equality is due to the property of the $\sin\Theta$ distance (c.f. Lemma 1 in [19]). Thus ${\rm rank}(\tilde U_k^\top A_k\tilde W_k)=r_k$, i.e., $\tilde U_k^\top A_k\tilde W_k$ is a full-rank matrix. Therefore,
$A=[\![B;U_1,U_2,U_3]\!]$
$\ =[\![\,[\![B;U_1,U_2,U_3]\!];\,U_1(\tilde U_1^\top U_1)^{-1}\tilde U_1^\top,\,U_2(\tilde U_2^\top U_2)^{-1}\tilde U_2^\top,\,U_3(\tilde U_3^\top U_3)^{-1}\tilde U_3^\top]\!]$
$\ =[\![A;\,U_1(\tilde U_1^\top U_1)^{-1}\tilde U_1^\top,\,U_2(\tilde U_2^\top U_2)^{-1}\tilde U_2^\top,\,U_3(\tilde U_3^\top U_3)^{-1}\tilde U_3^\top]\!]$
$\ =[\![A;\,A_1\tilde W_1(\tilde U_1^\top A_1\tilde W_1)^{-1}\tilde U_1^\top,\,A_2\tilde W_2(\tilde U_2^\top A_2\tilde W_2)^{-1}\tilde U_2^\top,\,A_3\tilde W_3(\tilde U_3^\top A_3\tilde W_3)^{-1}\tilde U_3^\top]\!].$   (43)
The fourth equality holds because the left and right singular subspaces of $A_k$ are $U_k$ and $W_k$, respectively. Recall $\hat A=[\![\hat B;\hat L_1,\hat L_2,\hat L_3]\!]$ with $\hat L_k=(\tilde U_k\hat B_k\tilde V_k+\tilde U_{k\perp}\hat D_k)(\hat B_k\tilde V_k)^{-1}$, $k=1,2,3$. Denote $\tilde B_k=\mathcal{M}_k(\tilde B)$ and $\hat B_k=\mathcal{M}_k(\hat B)$. In parallel with the definition of $\hat L_k$, we define
$\tilde L_1=(\tilde U_1\tilde B_1\tilde V_1+\tilde U_{1\perp}\tilde D_1)(\tilde B_1\tilde V_1)^{-1}=\big(\tilde U_1\tilde U_1^\top A_1(\tilde U_3\otimes\tilde U_2)\tilde V_1+\tilde U_{1\perp}\tilde U_{1\perp}^\top A_1(\tilde U_3\otimes\tilde U_2)\tilde V_1\big)\cdot\big(\tilde U_1^\top A_1(\tilde U_3\otimes\tilde U_2)\tilde V_1\big)^{-1}=A_1\tilde W_1\big(\tilde U_1^\top A_1\tilde W_1\big)^{-1}.$   (44)
Similarly,
$\tilde L_2=(\tilde U_2\tilde B_2\tilde V_2+\tilde U_{2\perp}\tilde D_2)(\tilde B_2\tilde V_2)^{-1}=A_2\tilde W_2(\tilde U_2^\top A_2\tilde W_2)^{-1},\qquad\tilde L_3=(\tilde U_3\tilde B_3\tilde V_3+\tilde U_{3\perp}\tilde D_3)(\tilde B_3\tilde V_3)^{-1}=A_3\tilde W_3(\tilde U_3^\top A_3\tilde W_3)^{-1}.$
Thus, in addition to $\hat A=[\![\hat B;\hat L_1,\hat L_2,\hat L_3]\!]$, we have
$A=[\![\tilde B;\tilde L_1,\tilde L_2,\tilde L_3]\!].$   (45)

Step 4. Next, we analyze the estimation error of $\hat A$. First, the error $\hat A-A$ can be decomposed into eight parts:
$\|\hat A-A\|_{\rm HS}^2=\big\|[\![\hat A-A;\,P_{\tilde U_1}+P_{\tilde U_{1\perp}},\,P_{\tilde U_2}+P_{\tilde U_{2\perp}},\,P_{\tilde U_3}+P_{\tilde U_{3\perp}}]\!]\big\|_{\rm HS}^2$
$\ =\big\|[\![\hat A-A;\tilde U_1^\top,\tilde U_2^\top,\tilde U_3^\top]\!]\big\|_{\rm HS}^2+\big\|[\![\hat A-A;\tilde U_{1\perp}^\top,\tilde U_2^\top,\tilde U_3^\top]\!]\big\|_{\rm HS}^2+\big\|[\![\hat A-A;\tilde U_1^\top,\tilde U_{2\perp}^\top,\tilde U_3^\top]\!]\big\|_{\rm HS}^2+\big\|[\![\hat A-A;\tilde U_1^\top,\tilde U_2^\top,\tilde U_{3\perp}^\top]\!]\big\|_{\rm HS}^2$
$\quad+\big\|[\![\hat A-A;\tilde U_1^\top,\tilde U_{2\perp}^\top,\tilde U_{3\perp}^\top]\!]\big\|_{\rm HS}^2+\big\|[\![\hat A-A;\tilde U_{1\perp}^\top,\tilde U_2^\top,\tilde U_{3\perp}^\top]\!]\big\|_{\rm HS}^2+\big\|[\![\hat A-A;\tilde U_{1\perp}^\top,\tilde U_{2\perp}^\top,\tilde U_3^\top]\!]\big\|_{\rm HS}^2+\big\|[\![\hat A-A;\tilde U_{1\perp}^\top,\tilde U_{2\perp}^\top,\tilde U_{3\perp}^\top]\!]\big\|_{\rm HS}^2.$   (46)
Here we used the fact that $P_{\tilde U_k}$ and $P_{\tilde U_{k\perp}}$ are orthogonal complements. We aim to apply Lemma 3 to analyze each term above in the next two steps.

Step 5. Before bounding each term of (46), we denote
$\lambda_k=\max\big\{\|\hat D_k(\hat B_k\tilde V_k)^{-1}\|,\ \|\tilde D_k(\tilde B_k\tilde V_k)^{-1}\|\big\},\qquad\pi_k=\|(\tilde B_k\tilde V_k)^{-1}\tilde B_k\|,\qquad k=1,2,3,$
and bound $\lambda_k$ and $\pi_k$ in this step. By the definition of $\tilde B_k$ and the fact that the right singular subspace of $A_k$ is $W_k$,
$\pi_1=\|(\tilde B_1\tilde V_1)^{-1}\tilde B_1\|=\big\|(\tilde U_1^\top A_1\tilde W_1)^{-1}\tilde U_1^\top A_1(\tilde U_3\otimes\tilde U_2)\big\|\le\big\|(\tilde U_1^\top A_1\tilde W_1)^{-1}\tilde U_1^\top A_1\big\|=\big\|(\tilde U_1^\top A_1W_1W_1^\top\tilde W_1)^{-1}\tilde U_1^\top A_1W_1\big\|$
$\quad\le\big\|(W_1^\top\tilde W_1)^{-1}\big\|=\sigma_{\min}^{-1}(\tilde W_1^\top W_1)=\big(1-\|\sin\Theta(\tilde W_1,W_1)\|^2\big)^{-1/2}\le(1-\theta^2)^{-1/2}.$   (48)
Similarly, the same upper bound also applies to $\pi_2$ and $\pi_3$. Based on the definitions of $\tilde D_k$ and $\tilde B_k$ and the fact that the left singular subspace of $A_k$ is $U_k$, we have
$\|\tilde D_k(\tilde B_k\tilde V_k)^{-1}\|^2+1=\|\tilde U_{k\perp}^\top A_k\tilde W_k(\tilde U_k^\top A_k\tilde W_k)^{-1}\|^2+1=\Big\|\begin{bmatrix}I_{r_k}\\ \tilde U_{k\perp}^\top A_k\tilde W_k(\tilde U_k^\top A_k\tilde W_k)^{-1}\end{bmatrix}\Big\|^2=\Big\|\begin{bmatrix}\tilde U_k^\top A_k\tilde W_k(\tilde U_k^\top A_k\tilde W_k)^{-1}\\ \tilde U_{k\perp}^\top A_k\tilde W_k(\tilde U_k^\top A_k\tilde W_k)^{-1}\end{bmatrix}\Big\|^2$
$\ =\big\|A_k\tilde W_k(\tilde U_k^\top A_k\tilde W_k)^{-1}\big\|^2=\big\|U_k^\top A_k\tilde W_k(\tilde U_k^\top U_kU_k^\top A_k\tilde W_k)^{-1}\big\|^2=\big\|U_k^\top A_k\tilde W_k(U_k^\top A_k\tilde W_k)^{-1}(\tilde U_k^\top U_k)^{-1}\big\|^2=\big\|(\tilde U_k^\top U_k)^{-1}\big\|^2=\sigma_{\min}^{-2}(\tilde U_k^\top U_k)=\big(1-\|\sin\Theta(\tilde U_k,U_k)\|^2\big)^{-1}\le\frac{1}{1-\theta^2},$   (49)
which implies $\|\tilde D_k(\tilde B_k\tilde V_k)^{-1}\|\le\sqrt{\tfrac{1}{1-\theta^2}-1}=\sqrt{\tfrac{\theta^2}{1-\theta^2}}=\tfrac{\theta}{\sqrt{1-\theta^2}}$. By the assumption of the theorem that $\|\hat D_k(\hat B_k\tilde V_k)^{-1}\|\le\rho$ and $\theta\le1/2$, we have
$\lambda_k\le\max\Big\{\rho,\ \tfrac{\theta}{\sqrt{1-\theta^2}}\Big\}\le\rho+2\theta,\qquad k=1,2,3.$   (50)

Step 6. Now we are ready to give upper bounds for all the terms in (46).
• First, by the definition of $\hat B$ and $\hat A$ (9),
$[\![\hat A;\tilde U_1^\top,\tilde U_2^\top,\tilde U_3^\top]\!]=[\![\,[\![\hat B;\hat L_1,\hat L_2,\hat L_3]\!];\,\tilde U_1^\top,\tilde U_2^\top,\tilde U_3^\top]\!]=[\![\hat B;\,\tilde U_1^\top\hat L_1,\tilde U_2^\top\hat L_2,\tilde U_3^\top\hat L_3]\!].$   (51)
Here, $\tilde U_k^\top\hat L_k=\tilde U_k^\top\big(\tilde U_k\hat B_k\tilde V_k+\tilde U_{k\perp}\hat D_k\big)(\hat B_k\tilde V_k)^{-1}=(\hat B_k\tilde V_k)(\hat B_k\tilde V_k)^{-1}=I_{r_k}$; similarly, $\tilde U_k^\top\tilde L_k=I_{r_k}$. Thus $[\![\hat A;\tilde U_1^\top,\tilde U_2^\top,\tilde U_3^\top]\!]=\hat B$. By the definition of $\tilde B$ (38), we have
$\big\|[\![\hat A-A;\tilde U_1^\top,\tilde U_2^\top,\tilde U_3^\top]\!]\big\|_{\rm HS}=\big\|[\![\hat A;\tilde U_1^\top,\tilde U_2^\top,\tilde U_3^\top]\!]-[\![A;\tilde U_1^\top,\tilde U_2^\top,\tilde U_3^\top]\!]\big\|_{\rm HS}=\|\hat B-\tilde B\|_{\rm HS}.$   (52)
• Note that, by (45) and (51),
$\big\|[\![\hat A-A;\tilde U_{1\perp}^\top,\tilde U_2^\top,\tilde U_3^\top]\!]\big\|_{\rm HS}=\big\|[\![\hat B;\tilde U_{1\perp}^\top\hat L_1,\tilde U_2^\top\hat L_2,\tilde U_3^\top\hat L_3]\!]-[\![\tilde B;\tilde U_{1\perp}^\top\tilde L_1,\tilde U_2^\top\tilde L_2,\tilde U_3^\top\tilde L_3]\!]\big\|_{\rm HS}$
$\ =\big\|[\![\hat B;\hat D_1(\hat B_1\tilde V_1)^{-1},I,I]\!]-[\![\tilde B;\tilde D_1(\tilde B_1\tilde V_1)^{-1},I,I]\!]\big\|_{\rm HS}=\big\|\hat D_1(\hat B_1\tilde V_1)^{-1}\hat B_1-\tilde D_1(\tilde B_1\tilde V_1)^{-1}\tilde B_1\big\|_{\rm F}.$   (53)
By the first part of Lemma 3,
$\big\|\hat D_1(\hat B_1\tilde V_1)^{-1}\hat B_1-\tilde D_1(\tilde B_1\tilde V_1)^{-1}\tilde B_1\big\|_{\rm F}^2\le\big(\pi_1\|\hat D_1-\tilde D_1\|_{\rm F}+\lambda_1\|\hat B_1-\tilde B_1\|_{\rm F}+\pi_1\lambda_1\|\hat B_1\tilde V_1-\tilde B_1\tilde V_1\|_{\rm F}\big)^2$
$\ \le\Big(\tfrac{1}{\sqrt{1-\theta^2}}\|\hat D_1-\tilde D_1\|_{\rm F}+(\rho+2\theta)\kappa+(\rho+2\theta)\tfrac{1}{\sqrt{1-\theta^2}}\kappa\Big)^2\le\tfrac{1}{1-\theta^2}\|\hat D_1-\tilde D_1\|_{\rm F}^2+C(\rho+\theta)\|\hat D_1-\tilde D_1\|_{\rm F}\,\kappa+C(\rho+\theta)\kappa^2$
$\ \le\|\hat D_1-\tilde D_1\|_{\rm F}^2+2\theta^2\|\hat D_1-\tilde D_1\|_{\rm F}^2+C(\rho+\theta)\|\hat D_1-\tilde D_1\|_{\rm F}\,\kappa+C(\rho+\theta)\kappa^2\le\|\hat D_1-\tilde D_1\|_{\rm F}^2+C(\rho+\theta)\kappa^2.$
Here, the last inequality uses the fact that $\|\hat D_1-\tilde D_1\|_{\rm F}\le\kappa$. Therefore,
$\big\|[\![\hat A-A;\tilde U_{1\perp}^\top,\tilde U_2^\top,\tilde U_3^\top]\!]\big\|_{\rm HS}^2\le\|\hat D_1-\tilde D_1\|_{\rm F}^2+C(\rho+\theta)\kappa^2;$
similarly,
$\big\|[\![\hat A-A;\tilde U_1^\top,\tilde U_{2\perp}^\top,\tilde U_3^\top]\!]\big\|_{\rm HS}^2\le\|\hat D_2-\tilde D_2\|_{\rm F}^2+C(\rho+\theta)\kappa^2,\qquad\big\|[\![\hat A-A;\tilde U_1^\top,\tilde U_2^\top,\tilde U_{3\perp}^\top]\!]\big\|_{\rm HS}^2\le\|\hat D_3-\tilde D_3\|_{\rm F}^2+C(\rho+\theta)\kappa^2.$   (54)
• By a similar argument to (53), we have
$\big\|[\![\hat A-A;\tilde U_{1\perp}^\top,\tilde U_{2\perp}^\top,\tilde U_3^\top]\!]\big\|_{\rm F}=\big\|[\![\hat B;\hat D_1(\hat B_1\tilde V_1)^{-1},\hat D_2(\hat B_2\tilde V_2)^{-1},I]\!]-[\![\tilde B;\tilde D_1(\tilde B_1\tilde V_1)^{-1},\tilde D_2(\tilde B_2\tilde V_2)^{-1},I]\!]\big\|_{\rm F}.$
By the second part of Lemma 3,
$\big\|[\![\hat B;\hat D_1(\hat B_1\tilde V_1)^{-1},\hat D_2(\hat B_2\tilde V_2)^{-1},I]\!]-[\![\tilde B;\tilde D_1(\tilde B_1\tilde V_1)^{-1},\tilde D_2(\tilde B_2\tilde V_2)^{-1},I]\!]\big\|_{\rm F}\le\lambda_1\lambda_2\|\hat B-\tilde B\|_{\rm F}+\sum_{k=1,2}\pi_k\tfrac{\lambda_1\lambda_2}{\lambda_k}\|\hat D_k-\tilde D_k\|_{\rm F}+\sum_{k=1,2}\pi_k\lambda_1\lambda_2\|\hat B_k\tilde V_k-\tilde B_k\tilde V_k\|_{\rm F}$
$\ \le(\lambda_1\lambda_2+\pi_1\lambda_2+\pi_2\lambda_1+\pi_1\lambda_1\lambda_2+\pi_2\lambda_1\lambda_2)\kappa\le C(\rho+\theta)\kappa.$
Thus $\big\|[\![\hat A-A;\tilde U_{1\perp}^\top,\tilde U_{2\perp}^\top,\tilde U_3^\top]\!]\big\|_{\rm F}\le C(\rho+\theta)\kappa$; similarly,
$\big\|[\![\hat A-A;\tilde U_{1\perp}^\top,\tilde U_2^\top,\tilde U_{3\perp}^\top]\!]\big\|_{\rm F}\le C(\rho+\theta)\kappa,\qquad\big\|[\![\hat A-A;\tilde U_1^\top,\tilde U_{2\perp}^\top,\tilde U_{3\perp}^\top]\!]\big\|_{\rm F}\le C(\rho+\theta)\kappa.$   (55)
• By the second part of Lemma 3 again,
$\big\|[\![\hat A-A;\tilde U_{1\perp}^\top,\tilde U_{2\perp}^\top,\tilde U_{3\perp}^\top]\!]\big\|_{\rm F}=\big\|[\![\hat B;\hat D_1(\hat B_1\tilde V_1)^{-1},\hat D_2(\hat B_2\tilde V_2)^{-1},\hat D_3(\hat B_3\tilde V_3)^{-1}]\!]-[\![\tilde B;\tilde D_1(\tilde B_1\tilde V_1)^{-1},\tilde D_2(\tilde B_2\tilde V_2)^{-1},\tilde D_3(\tilde B_3\tilde V_3)^{-1}]\!]\big\|_{\rm F}$
$\ \le\lambda_1\lambda_2\lambda_3\|\hat B-\tilde B\|_{\rm F}+\sum_{k=1,2,3}\pi_k\tfrac{\lambda_1\lambda_2\lambda_3}{\lambda_k}\|\hat D_k-\tilde D_k\|_{\rm F}+\sum_{k=1,2,3}\pi_k\lambda_1\lambda_2\lambda_3\|\hat B_k\tilde V_k-\tilde B_k\tilde V_k\|_{\rm F}\le C(\rho+\theta)\kappa.$   (56)
Combining (46), (52), (54), (55), and (56), we finally have
$\|\hat A-A\|_{\rm HS}^2\le\|\hat B-\tilde B\|_{\rm HS}^2+\sum_{k=1}^3\|\hat D_k-\tilde D_k\|_{\rm F}^2+C(\rho+\theta)\kappa^2\overset{(41)}{=}\big(1+C(\rho+\theta)\big)\kappa^2.$
In summary, we have finished the proof of this theorem. $\square$
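The starting point (46) of Step 4 uses only the fact that each mode is split by the orthogonal pair $(P_{\tilde U_k},P_{\tilde U_{k\perp}})$, so the squared Hilbert–Schmidt norm decomposes over the eight projections. A small self-contained numerical illustration with a random tensor and random orthonormal matrices (our own toy example, not ISLET data):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
p, r = (6, 5, 4), (3, 2, 2)
E = rng.standard_normal(p)                     # plays the role of A_hat - A
Q = [np.linalg.qr(rng.standard_normal((p[k], p[k])))[0] for k in range(3)]
U = [Q[k][:, :r[k]] for k in range(3)]         # "U_tilde_k"
Uperp = [Q[k][:, r[k]:] for k in range(3)]     # "U_tilde_k_perp"

def mode_prod(T, mats):
    """Tucker product [[T; M1, M2, M3]]."""
    for k, M in enumerate(mats):
        T = np.moveaxis(np.tensordot(M, T, axes=(1, k)), 0, k)
    return T

total = 0.0
for choice in product([0, 1], repeat=3):       # the eight projections in (46)
    mats = [U[k].T if c == 0 else Uperp[k].T for k, c in enumerate(choice)]
    total += np.linalg.norm(mode_prod(E, mats)) ** 2
assert np.isclose(total, np.linalg.norm(E) ** 2)
```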
F.2 Proof of Theorem 3
This theorem gives a deterministic error bound on $\|\hat A-A\|_{\rm HS}$ in terms of $\theta$, $\rho$, and $\|(\tilde X_B^\top\tilde X_B)^{-1}\tilde X_B^\top\tilde\varepsilon_B\|_2$, $\|(\tilde X_{E_k}^\top\tilde X_{E_k})^{-1}\tilde X_{E_k}^\top\tilde\varepsilon_{E_k}\|_2$, $\|(\tilde X_{E_k,[:,G_{ki}]})^\top\tilde\varepsilon_{E_k}/n\|_2$ for the sparse ISLET estimator $\hat A$ in the sparse low-rank tensor regression model. To prove the theorem, we first rewrite the original high-dimensional regression model as four dimension-reduced ones, (59) and (60). We then derive error bounds for the least-squares or group Lasso estimators, in terms of $\|\hat B-\tilde B\|_{\rm HS}$ or $\|\hat E_k-\tilde E_k\|_{\rm F}$, for each of these dimension-reduced regressions. The rest of the proof assembles the upper bound for $\|\hat A-A\|_{\rm HS}$, which essentially follows from Steps 3–6 in the proof of Theorem 2.

Denote $A_k=\mathcal{M}_k(A)$, $a={\rm vec}(A)$, $X_{jk}=\mathcal{M}_k(X_j)$, $x_j={\rm vec}(X_j)$, $1\le j\le n$, $k=1,2,3$, and
$\tilde B=[\![A;\tilde U_1^\top,\tilde U_2^\top,\tilde U_3^\top]\!];\quad\tilde E_k=\mathcal{M}_k\big(A\times_{k+1}\tilde U_{k+1}^\top\times_{k+2}\tilde U_{k+2}^\top\big)\tilde V_k=A_k\tilde W_k\in\mathbb{R}^{p_k\times r_k},\ k=1,2,3;$   (57)
$\tilde\gamma_B={\rm vec}(\tilde B)\in\mathbb{R}^{r_1r_2r_3},\qquad\tilde\gamma_{E_k}={\rm vec}(\tilde E_k)\in\mathbb{R}^{p_kr_k},\ k=1,2,3.$   (58)
Then, similarly to the argument (40) in the proof of Theorem 2, we can write down the following partial regression formulas relating $y_j$ to $(X_j,A)$:
$y_j=\langle X_j,A\rangle+\varepsilon_j=\langle x_j,a\rangle+\varepsilon_j=\big\langle x_j,P_{\tilde U_1\otimes\tilde U_2\otimes\tilde U_3}a\big\rangle+\varepsilon_j+\big\langle x_j,P_{(\tilde U_1\otimes\tilde U_2\otimes\tilde U_3)_\perp}a\big\rangle=\big\langle(\tilde U_1\otimes\tilde U_2\otimes\tilde U_3)^\top x_j,(\tilde U_1\otimes\tilde U_2\otimes\tilde U_3)^\top a\big\rangle+(\tilde\varepsilon_B)_j\overset{(57)(58)}{=}(\tilde X_B)_{[j,:]}\tilde\gamma_B+(\tilde\varepsilon_B)_j,$   (59)
$y_j=\langle X_j,A\rangle+\varepsilon_j=\big\langle X_j,P_{R_k(\tilde W_k\otimes I_{p_k})}[A]\big\rangle+\varepsilon_j+\big\langle X_j,P_{(R_k(\tilde W_k\otimes I_{p_k}))_\perp}[A]\big\rangle=\big\langle X_{jk}\tilde W_k,A_k\tilde W_k\big\rangle+(\tilde\varepsilon_{E_k})_j\overset{(57)(58)}{=}(\tilde X_{E_k})_{[j,:]}\tilde\gamma_{E_k}+(\tilde\varepsilon_{E_k})_j$   (60)
for $j=1,\ldots,n$ and $k=1,2,3$. We discuss the estimation errors of $\hat\gamma_{E_k}$ ($k\in J_s$), $\hat\gamma_{E_k}$ ($k\notin J_s$), and $\hat B$ separately below.
• For any $k\in J_s$, by the definition $\tilde\gamma_{E_k}={\rm vec}(\tilde E_k)$, $\tilde E_k=A_k\tilde W_k$, and the fact that the left singular vectors $U_k$ of $A_k$ satisfy $\sum_{i=1}^{p_k}1_{\{(U_k)_{[i,:]}\ne0\}}\le s_k$, the vector $\tilde\gamma_{E_k}$ is correspondingly group-wise sparse. More specifically, let $G_{ik}=\{i,\,i+p_k,\ldots,\,i+p_k(r_k-1)\}$, $i=1,\ldots,p_k$, be a partition of $\{1,\ldots,p_kr_k\}$. Then
$\tilde\gamma_{E_k}^i:=(\tilde\gamma_{E_k})_{G_{ik}}\in\mathbb{R}^{r_k},\qquad\sum_{i=1}^{p_k}1_{\{\tilde\gamma_{E_k}^i\ne0\}}\le s_k.$   (61)
Accordingly, $\tilde X_{E_k}\in\mathbb{R}^{n\times(p_kr_k)}$ has grouped covariates with respect to $\{G_{1k},\ldots,G_{p_kk}\}$:
$\tilde X_{E_k}^i=(\tilde X_{E_k})_{[:,G_{ik}]}\in\mathbb{R}^{n\times r_k},\qquad i=1,\ldots,p_k.$   (62)
(A small indexing sketch of this grouping is given right after this proof.) Recall that $\hat\gamma_{E_k}$ is the group Lasso estimator,
$\hat\gamma_{E_k}=\arg\min_{\gamma\in\mathbb{R}^{p_kr_k}}\|y-\tilde X_{E_k}\gamma\|_2^2+\eta_k\sum_{i=1}^{p_k}\|\gamma_{G_{ik}}\|_2.$
By the group-wise sparsity structure (61)–(62), the partial linear regression model (60), the assumption that $\tilde X_{E_k}\in\mathbb{R}^{n\times(p_kr_k)}$ satisfies the GRIP assumption with $\delta<1/4$, and $\eta_k=C\max_{1\le i\le p_k}\|(\tilde X_{E_k}^i)^\top\tilde\varepsilon_{E_k}\|_2$ for a constant $C\ge3$, Lemma 11 yields
$\|\hat E_k-\tilde E_k\|_{\rm F}=\|\hat\gamma_{E_k}-\tilde\gamma_{E_k}\|_2\le C\sqrt{s_k}\,\eta_k/n\le C\sqrt{s_k}\max_{1\le i\le p_k}\big\|(\tilde X_{E_k}^i)^\top\tilde\varepsilon_{E_k}/n\big\|_2,\qquad\forall k\in J_s.$   (63)
• For $k\notin J_s$, recall that $\hat E_k$ is evaluated via the least-squares estimator,
${\rm vec}(\hat E_k)=\hat\gamma_{E_k},\qquad\hat\gamma_{E_k}=\arg\min_{\gamma\in\mathbb{R}^{p_kr_k}}\big\|y-\tilde X_{E_k}\gamma\big\|_2^2.$
By the linear regression model (60) and the definition of the least-squares estimator,
$\|\hat E_k-\tilde E_k\|_{\rm F}=\|\hat\gamma_{E_k}-\tilde\gamma_{E_k}\|_2=\big\|(\tilde X_{E_k}^\top\tilde X_{E_k})^{-1}\tilde X_{E_k}^\top\tilde\varepsilon_{E_k}\big\|_2.$   (64)
• In addition, recall ${\rm vec}(\hat B)=\hat\gamma_B$, $\hat\gamma_B=\arg\min_{\gamma\in\mathbb{R}^{r_1r_2r_3}}\|y-\tilde X_B\gamma\|_2^2$. By the linear regression model (59) and the definition of the least-squares estimator $\hat\gamma_B$,
$\|\hat B-\tilde B\|_{\rm HS}=\|\hat\gamma_B-\tilde\gamma_B\|_2=\big\|(\tilde X_B^\top\tilde X_B)^{-1}\tilde X_B^\top\tilde\varepsilon_B\big\|_2.$   (65)
Given $\theta=\max_k\{\|\sin\Theta(\tilde U_k,U_k)\|,\|\sin\Theta(\tilde W_k,W_k)\|\}\le1/2$, similarly to the proof of Theorem 2, one can show that $\tilde U_k^\top\tilde E_k$ is non-singular. Therefore,
$\|\hat B-\tilde B\|_{\rm HS}^2+\sum_{k=1}^3\|\hat E_k-\tilde E_k\|_{\rm F}^2\le\big\|(\tilde X_B^\top\tilde X_B)^{-1}\tilde X_B^\top\tilde\varepsilon_B\big\|_2^2+C\sum_{k\in J_s}s_k\max_{1\le i\le p_k}\big\|(\tilde X_{E_k}^i)^\top\tilde\varepsilon_{E_k}/n\big\|_2^2+\sum_{k\notin J_s}\big\|(\tilde X_{E_k}^\top\tilde X_{E_k})^{-1}\tilde X_{E_k}^\top\tilde\varepsilon_{E_k}\big\|_2^2.$
The rest of the proof directly follows from Steps 3–6 in the proof of Theorem 2. $\square$
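The grouping (61)–(62) is pure index bookkeeping: under column-major vectorization, the entries of row $i$ of $\tilde E_k$ occupy positions $i,\,i+p_k,\ldots,\,i+p_k(r_k-1)$ of ${\rm vec}(\tilde E_k)$, so a zero row (row sparsity of $U_k$) becomes a zero group. A 0-indexed illustration (the paper's $G_{ik}$ are the 1-indexed analogues):

```python
import numpy as np

p_k, r_k = 5, 3
E = np.arange(p_k * r_k, dtype=float).reshape(p_k, r_k)
E[2, :] = 0.0                                   # a zero row, as row-sparsity gives
g = E.flatten(order='F')                        # column-major vec(E)

# group of row i: indices i, i + p_k, ..., i + p_k*(r_k - 1)
groups = [np.arange(i, p_k * r_k, p_k) for i in range(p_k)]
assert all(np.array_equal(g[groups[i]], E[i, :]) for i in range(p_k))
assert not g[groups[2]].any()                   # the zero row gives a zero group
```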
F.3 Proof of Theorem 4
The goal of Theorem 4 is to give a probabilistic error bound for regular tensor regression via ISLET. The high-level idea is to first derive an error bound for the importance sketching directions via a perturbation bound for the HOOI outcome (Theorem 1 in [138]), and then apply the oracle inequality of Theorem 2 to obtain the final estimation error rate. For a better presentation, we divide the long proof into six steps. First, in Step 1, we bound the initialization error of $\tilde U_k^{(0)}$ using perturbation theory [19] and concentration inequalities (Lemmas 2 and 4). Then, in Step 2, we apply Theorem 1 in [138] to obtain an error bound for the importance sketching directions $\tilde U_k$. The central goal of Step 3 is to prove an error bound for $\theta$. In Step 4, we move on to the second batch of samples and derive error bounds for a few intermediate terms. In Step 5, we evaluate the key quantities $\rho$ and $\|(\tilde X^\top\tilde X)^{-1}\tilde X^\top\tilde\varepsilon\|_2$ appearing in Theorem 2. Finally, we plug all quantities into Theorem 2 and finish the proof.

We begin the proof by introducing some notation. Throughout the proof, the mode indices $(\cdot)_k$ are taken modulo 3, e.g., $U_4=U_1$, $V_5=V_2$. For convenience, we denote
$\tilde\sigma^2=\|A\|_{\rm HS}^2+\sigma^2,\quad A_k=\mathcal{M}_k(A),\quad\tilde A_k=\mathcal{M}_k(\tilde A),\quad X_{ik}=\mathcal{M}_k(X_i)\ \ {\rm for}\ k=1,2,3,\qquad p=\max\{p_1,p_2,p_3\},\quad r=\max\{r_1,r_2,r_3\}.$
To avoid repeating similar notation consecutively, throughout the proof of this theorem we slightly abuse notation and write
$U_{k+2}\otimes U_{k+1}=U_3\otimes U_2\ (k=1),\quad U_3\otimes U_1\ (k=2),\quad U_2\otimes U_1\ (k=3),$
without ambiguity. Other related notations, e.g., $(U_{k+2,\perp}V)\otimes U_{k+1}$, are defined in a similar fashion. The rest of the proof of Theorem 4 is divided into six steps.

Step 1. We first develop the error bound for $\tilde U_1^{(0)},\tilde U_2^{(0)},\tilde U_3^{(0)}$. In particular, we aim to show that
$\mathbb{P}\Big(\big\|\sin\Theta(\tilde U_k^{(0)},U_k)\big\|\le\Big(\frac{C\tilde\sigma\sqrt{p_k/n_1}}{\lambda_k}+\frac{C\tilde\sigma^2\sqrt{p_1p_2p_3}}{n_1\lambda_k^2}\Big)\wedge1,\ k=1,2,3\Big)\ge1-p^{-C}.$   (66)
We only focus on $\tilde U_1^{(0)}$, as the conclusions for $\tilde U_2^{(0)}$ and $\tilde U_3^{(0)}$ follow similarly. Recall the baseline unbiased estimator
$\tilde A=\frac{1}{n_1}\sum_{i=1}^{n_1}y_i^{(1)}X_i^{(1)}=\frac{1}{n_1}\sum_{i=1}^{n_1}\big(\langle X_i^{(1)},A\rangle+\varepsilon_i^{(1)}\big)X_i^{(1)}\in\mathbb{R}^{p_1\times p_2\times p_3}.$
Since the left and right singular subspaces of $A_1$ are $U_1$ and $W_1$, respectively, we further have $\tilde A_1\in\mathbb{R}^{p_1\times(p_2p_3)}$ and
$\tilde A_1=\mathcal{M}_1(\tilde A)=\frac{1}{n_1}\sum_{i=1}^{n_1}y_i^{(1)}X_{i1}^{(1)}=\frac{1}{n_1}\sum_{i=1}^{n_1}\big(\langle X_{i1}^{(1)},A_1\rangle+\varepsilon_i^{(1)}\big)X_{i1}^{(1)}=\frac{1}{n_1}\sum_{i=1}^{n_1}\big(\langle X_{i1}^{(1)},P_{U_1}A_1P_{W_1}\rangle+\varepsilon_i^{(1)}\big)X_{i1}^{(1)}$
$\ =\frac{1}{n_1}\sum_{i=1}^{n_1}\big({\rm tr}\big((X_{i1}^{(1)})^\top U_1U_1^\top A_1W_1W_1^\top\big)+\varepsilon_i^{(1)}\big)X_{i1}^{(1)}=\frac{1}{n_1}\sum_{i=1}^{n_1}\big(\langle U_1^\top X_{i1}^{(1)}W_1,\,U_1^\top A_1W_1\rangle+\varepsilon_i^{(1)}\big)X_{i1}^{(1)}.$
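Under the Gaussian ensemble design, $\mathbb{E}[y_iX_i]=A$, which is why $\tilde A$ is referred to as the baseline unbiased estimator and why the leading left singular vectors of $\tilde A_1$ track $U_1$. A small Monte Carlo illustration of this point (a toy example of ours, not the simulation settings of the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
p, r, sigma = 6, 2, 1.0
S = 3.0 * rng.standard_normal((r, r, r))                       # random core
U = [np.linalg.qr(rng.standard_normal((p, r)))[0] for _ in range(3)]
A = np.einsum('abc,ia,jb,kc->ijk', S, U[0], U[1], U[2])        # Tucker rank (r, r, r)

def sin_theta(U_hat, U_true):
    """Largest principal-angle sine between the two column spaces."""
    s = np.linalg.svd(U_hat.T @ U_true, compute_uv=False)
    return np.sqrt(max(0.0, 1.0 - s.min() ** 2))

for n in (1000, 20000):
    X = rng.standard_normal((n, p, p, p))
    y = np.einsum('ijkl,jkl->i', X, A) + sigma * rng.standard_normal(n)
    A_tilde = np.einsum('i,ijkl->jkl', y, X) / n               # (1/n) sum_i y_i X_i
    U_init = np.linalg.svd(A_tilde.reshape(p, -1))[0][:, :r]   # SVD_r of mode-1 unfolding
    print(n, round(sin_theta(U_init, U[0]), 3))                # the distance shrinks with n
```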
Since (cid:101) U (0)1 = SVD r ( (cid:101) A ), the one-sided perturbation bound [19, Proposition 1] yields (cid:13)(cid:13)(cid:13) sin Θ (cid:16) (cid:101) U (0)1 , U (cid:17)(cid:13)(cid:13)(cid:13) ≤ σ r ( U (cid:62) (cid:101) A ) (cid:107) U (cid:62) ⊥ (cid:101) A P ( U (cid:62) (cid:101) A ) (cid:62) (cid:107) σ r ( U (cid:62) (cid:101) A ) − σ r +1 ( (cid:101) A ) ∧ σ (cid:16) U (cid:62) (cid:101) A (cid:17) , σ r +1 ( (cid:101) A ), and (cid:107) U (cid:62) ⊥ (cid:101) A P ( U (cid:62) (cid:101) A ) (cid:62) (cid:107) , respectively. • σ (cid:16) U (cid:62) (cid:101) A (cid:17) Lemma 2 ≥ σ (cid:16) U (cid:62) (cid:101) A W (cid:17) + σ (cid:16) U (cid:62) (cid:101) A ( W ) ⊥ (cid:17) = σ (cid:32) n n (cid:88) i =1 (cid:16) (cid:104) U (cid:62) X (1) i W , U (cid:62) A W (cid:105) + ε (1) i (cid:17) U (cid:62) X (1) i W (cid:33) + σ (cid:32) n n (cid:88) i =1 (cid:16) (cid:104) U (cid:62) X (1) i W , U (cid:62) A W (cid:105) + ε (1) i (cid:17) U (cid:62) X (1) i ( W ) ⊥ (cid:33) . By Lemma 4, U (cid:62) A W ∈ R r × r , and n ≥ Cp / r , we have σ min (cid:32) n n (cid:88) i =1 (cid:16) (cid:104) U (cid:62) X (1) i W , U (cid:62) A W (cid:105) + ε (1) i (cid:17) U (cid:62) X (1) i W (cid:33) ≥ σ min ( U (cid:62) A W ) − (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 (cid:16) (cid:104) U (cid:62) X (1) i W , U (cid:62) A W (cid:105) + ε (1) i (cid:17) U (cid:62) X (1) i W − U (cid:62) A W (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Lemma 4 ≥ σ r ( A ) − C (cid:114) log pn (cid:0) r (cid:107) A (cid:107) F + σ (cid:1) ≥ (1 − c ) σ r ( A )with probability at least 1 − p − c . When X (1) i has i.i.d. Gaussian entries and W is fixedorthogonal matrix, U (cid:62) X (1) i ( W ) ⊥ ∈ R r × ( p − − r ) and (cid:16) (cid:104) U (cid:62) X (1) i W , U (cid:62) A W (cid:105) + ε i (cid:17) ∈ R are independently Gaussian distributed and (cid:68) U (cid:62) X (1) i W , U (cid:62) A W (cid:69) + ε (1) i ∼ N (0 , (cid:101) σ ) .
19y Lemma 6, σ (cid:32) n n (cid:88) i =1 (cid:16) (cid:104) U (cid:62) X (1) i W , U (cid:62) A W (cid:105) + ε (1) i (cid:17) U (cid:62) X (1) i W ⊥ (cid:33) ≥ (cid:101) σ · n − C √ n log pn · (cid:16) √ p − − r − √ r − C (cid:112) log p (cid:17) ≥ (cid:101) σ n · (cid:32) − C (cid:114) log pn (cid:33) · (cid:16) p − − C √ p − r − C (cid:112) p − log p (cid:17) ≥ (cid:101) σ n (cid:16) p − − C √ p − r − C (cid:112) p − log p (cid:17) with probability at least 1 − p − c . To sum up, σ (cid:16) U (cid:62) (cid:101) A (cid:17) ≥ (1 − c ) σ r ( A ) + (cid:101) σ n · (cid:16) p − − C √ p − r − C (cid:112) p − log p (cid:17) (68)with probability at least 1 − p − c . • Next, we consider σ r +1 ( (cid:101) A ), note that σ r +1 ( (cid:101) A ) = min rank( M ) ≤ r (cid:13)(cid:13)(cid:13) (cid:101) A − M (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) (cid:101) A − P U (cid:101) A (cid:13)(cid:13)(cid:13) ≤ (cid:107) U (cid:62) ⊥ (cid:101) A (cid:107) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 (cid:16) (cid:104) U (cid:62) X (1) i , U (cid:62) A (cid:105) + ε (1) i (cid:17) U (cid:62) ⊥ X (1) i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) . Since (cid:16) (cid:104) U (cid:62) X (1) i W , U (cid:62) A W (cid:105) + ε (1) i (cid:17) ∼ N (cid:0) , (cid:101) σ (cid:1) , which is also independent of U (cid:62) ⊥ X (1) i . Thus, σ r +1 ( (cid:101) A ) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 (cid:16) (cid:104) U (cid:62) X (1) i , U (cid:62) A (cid:105) + ε (1) i (cid:17) U (cid:62) ⊥ X (1) i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:101) σ · n + C ( √ n log p + log p ) n · (cid:16) √ p − r + √ p − + C (cid:112) log p (cid:17) ≤ (cid:101) σ n (cid:32) C (cid:114) log pn (cid:33) (cid:16) p − + C √ p − p + C (cid:112) p − log p + Cp + C log p (cid:17) ≤ (cid:101) σ n · (cid:16) p − + C √ p − p + C (cid:112) p − log p + Cp + C log p (cid:17) (69)with probability at least 1 − p − c . • Then we consider (cid:13)(cid:13)(cid:13) U (cid:62) ⊥ (cid:101) A P ( U (cid:62) (cid:101) A ) (cid:62) (cid:13)(cid:13)(cid:13) . Note that U (cid:62) ⊥ (cid:101) A P ( U (cid:62) (cid:101) A ) (cid:62) = 1 n n (cid:88) i =1 (cid:16) (cid:104) U (cid:62) X (1) i W , U (cid:62) A W (cid:105) + ε (1) i (cid:17) U (cid:62) ⊥ X (1) i P ( U (cid:62) (cid:101) A ) (cid:62) , (cid:16) (cid:104) U (cid:62) X (1) i W , U (cid:62) A W (cid:105) + ε i (cid:17) ∼ N (0 , (cid:101) σ ); by independence, conditioning onfixed value of U (cid:62) X (1) i , U (cid:62) ⊥ X (1) i is still standard normal, and then U (cid:62) ⊥ X (1) i P ( U (cid:62) (cid:101) A ) (cid:62) (cid:12)(cid:12)(cid:12) U (cid:62) X (1) i is a ( p − r )-by- r i.i.d. standard Gaussian matrix. 
By Lemma 6, we have (cid:13)(cid:13)(cid:13) U (cid:62) ⊥ (cid:101) A P ( U (cid:62) (cid:101) A ) (cid:62) (cid:13)(cid:13)(cid:13) ≤ (cid:101) σ (cid:115) n + C √ n log p + C log pn · (cid:16) √ p − r + √ r + C (cid:112) log p (cid:17) ≤ C (cid:101) σ · (cid:114) p n (70)with probability at least 1 − p − C .Combining (68)-(70) with (67), we have the following inequality holds with probabilityat least 1 − p − C , (cid:13)(cid:13)(cid:13) sin Θ (cid:16) (cid:101) U (0)1 , U (cid:17)(cid:13)(cid:13)(cid:13) ≤ σ r ( U (cid:62) (cid:101) A ) (cid:107) U (cid:62) ⊥ (cid:101) A P ( U (cid:62) (cid:101) A ) (cid:62) (cid:107) σ r ( U (cid:62) (cid:101) A ) − σ r +1 ( (cid:101) A ) ∧ ≤ (cid:16) (1 − c ) σ r ( A ) + (cid:101) σ (cid:112) p − /n (cid:17) · C (cid:101) σ (cid:112) p /n (cid:16) (1 − c ) σ r ( A ) + (cid:101) σ (cid:112) p − /n (cid:17) − (cid:101) σ n · (cid:16) p − + C √ p − p + C (cid:112) p − log p + C p + C log p (cid:17) ∧ n ≥ Cp / (cid:101) σ /λ for large constant C >
0, we have (cid:16) (1 − c ) σ r ( A ) + (cid:101) σ (cid:112) p − /n (cid:17) − (cid:101) σ n · (cid:16) p − + C √ p − p + C (cid:112) p − log p + C p + C log p (cid:17) ≥ (1 − c ) σ r ( A ) + 2(1 − c ) σ r ( A ) (cid:101) σ (cid:112) p − /n − C (cid:101) σ n (cid:16) √ p p p + (cid:112) p − log p + C p + C log p (cid:17) ≥ cσ r ( A )and additionally, (cid:13)(cid:13)(cid:13) sin Θ (cid:16) (cid:101) U (0)1 , U (cid:17)(cid:13)(cid:13)(cid:13) ≤ (cid:32) C (cid:101) σ (cid:112) p /n · σ r ( A ) + (cid:101) σ √ p p p /n σ r ( A ) (cid:33) ∧ . with probability at least 1 − p − C . Similar inequalities also hold for (cid:13)(cid:13)(cid:13) sin Θ (cid:16) (cid:101) U (0)2 , U (cid:17)(cid:13)(cid:13)(cid:13) and (cid:13)(cid:13)(cid:13) sin Θ (cid:16) (cid:101) U (0)3 , U (cid:17)(cid:13)(cid:13)(cid:13) . Based on these arguments, we conclude that (66) holds. (66)21urther implies that e := max k (cid:13)(cid:13)(cid:13) (cid:101) U (0) (cid:62) k ⊥ M k ( A ) (cid:13)(cid:13)(cid:13) = max k (cid:13)(cid:13)(cid:13) (cid:101) U (0) (cid:62) k ⊥ U k U (cid:62) k M k ( A ) (cid:13)(cid:13)(cid:13) ≤ max k (cid:107) (cid:101) U (0) (cid:62) k ⊥ U k (cid:107) · (cid:107) U (cid:62) k M k ( A ) (cid:107) ≤ max k (cid:107) sin Θ( (cid:101) U (0) k , U k ) (cid:107) · (cid:107) U (cid:62) k M k ( A ) (cid:107)≤ max k C (cid:107) A k (cid:107) (cid:32) (cid:101) σ (cid:112) p k /n σ r k ( A k ) + (cid:101) σ √ p p p /n σ r k ( A k ) (cid:33) ≤ C κ (cid:32) (cid:101) σp / n / + (cid:101) σ p / λ n (cid:33) (71)with probability at least 1 − p − C .Step 2 Then we develop the error bound for (cid:101) U k after enough number of iterations in this step.In particular, we aim to apply Theorem 1 in [138] to give an error bound for the output (cid:101) U k from the high-order order orthogonal iteration (HOOI). To this end, we verify theconditions in Theorem 1 in [138] in this step. Defining Z = (cid:101) A − A , T = A + Z × P U × P U × P U , (cid:101) T = (cid:101) A . (72)Then, (cid:101) T − T = Z − Z × P U × P U × P U . (73)In order to apply Theorem 1 in [138], we develop the following upper bounds under theassumptions of Theorem 4. • Since M (cid:16) ( (cid:101) A − A ) × U (cid:62) × U (cid:62) × U (cid:62) (cid:17) is a r -by-( r r ) matrix, Lemma 4 implies (cid:13)(cid:13)(cid:13) M (cid:16) ( (cid:101) A − A ) × U (cid:62) × U (cid:62) × U (cid:62) (cid:17)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) U (cid:62) M (cid:16) (cid:101) A − A (cid:17) ( U ⊗ U ) (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 (cid:16)(cid:68) U (cid:62) X (1) i ( U ⊗ U ) , U (cid:62) A ( U ⊗ U ) (cid:69) + ε (1) i (cid:17) U (cid:62) X (1) i ( U ⊗ U ) − U (cid:62) A ( U ⊗ U ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) Lemma 4 ≤ C (cid:115) log p · ( r + r r ) (cid:101) σ n (74)22ith probability at least 1 − p − C . Similar results also hold for M ( · ) and M ( · ). Then λ k ( T ) := σ r k ( M k ( T )) (73) ≥ σ r k ( M k ( A )) − (cid:13)(cid:13)(cid:13) M k (cid:16) ( (cid:101) A − A ) × P U × P U × P U (cid:17)(cid:13)(cid:13)(cid:13) ≥ λ k − C (cid:115) log p · ( r k + r k +1 r k +2 ) (cid:101) σ n ≥ (1 − c ) λ (75)with probability at least 1 − p − C . • Next, we consider τ k := (cid:13)(cid:13)(cid:13) M k ( (cid:101) T − T ) ( U k +2 ⊗ U k +1 ) (cid:13)(cid:13)(cid:13) , k = 1 , , . 
In particular, (cid:13)(cid:13)(cid:13) M ( (cid:101) T − T ) ( U ⊗ U ) (cid:13)(cid:13)(cid:13) (73) = (cid:107)M ( Z − (cid:74) Z ; P U , P U , P U (cid:75) )( U ⊗ U ) (cid:107) = (cid:13)(cid:13)(cid:13) M (cid:16)(cid:16) (cid:101) A − A − (cid:74) (cid:101) A − A ; P U , P U , P U (cid:75) (cid:17) × U (cid:62) × U (cid:62) (cid:17)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) M (cid:16) ( (cid:101) A − A ) × ( P U + P U ⊥ ) × U (cid:62) × U (cid:62) (cid:17) − M (cid:16) ( (cid:101) A − A ) × P U × U (cid:62) × U (cid:62) (cid:17) (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) M (cid:16) ( (cid:101) A − A ) × P U ⊥ × U (cid:62) × U (cid:62) (cid:17)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) U (cid:62) ⊥ ( (cid:101) A − A ) · ( U ⊗ U ) (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) n n (cid:88) i =1 (cid:16) (cid:104) U (cid:62) X (1) i ( U ⊗ U ) , U (cid:62) A ( U ⊗ U ) (cid:105) + ε (1) i (cid:17) U (cid:62) ⊥ X (1) i ( U ⊗ U ) − U (cid:62) ⊥ A ( U ⊗ U ) (cid:13)(cid:13)(cid:13) Lemma 6 ≤ (cid:101) σ (cid:115) n + C √ n log pn (cid:16) √ p − r + √ r r + C (cid:112) log p (cid:17) ≤ C (cid:101) σ (cid:114) p n , (76)with probability at least 1 − p − C . Thus, P (cid:16) τ k ≤ C (cid:101) σ (cid:112) p k /n , k = 1 , , (cid:17) ≥ − p − C . (77) • Next we consider the upper bound of τ := max k (cid:110) max V ∈ R ( pk +1 − rk +1) × rk +1 (cid:107) V (cid:107)≤ (cid:13)(cid:13)(cid:13) M k ( (cid:101) T − T ) · { ( U k +2 , ⊥ V ) ⊗ U k +1 } (cid:13)(cid:13)(cid:13) , max V ∈ R ( pk +2 − rk +2) × rk +2 (cid:107) V (cid:107)≤ (cid:13)(cid:13)(cid:13) M k ( (cid:101) T − T ) · { U k +2 ⊗ ( U k +1 , ⊥ V ) } (cid:13)(cid:13)(cid:13) (cid:111) . (78)23ote that M (cid:16) (cid:101) T − T (cid:17) ( U ⊥ V ) ⊗ U = ( M ( Z ) − M ( Z × P U × P U × P U )) ( U ⊥ V ) ⊗ U = M ( Z )( U ⊥ V ) ⊗ U = 1 n n (cid:88) i =1 y (1) i X (1) i (( U ⊥ V ) ⊗ U ) ,y (1) i = (cid:104) X (1) i , A (cid:105) + ε (1) i = (cid:104) U (cid:62) X (1) i ( U ⊗ U ) , U (cid:62) A ( U ⊗ U ) (cid:105) + ε (1) i . Since U ⊥ and U are orthogonal, y (1) i and X (1) i ( U ⊥ ⊗ U ) are independently Gaussiandistributed. Thus, conditioning on fixed values of { y (1) i } n i =1 ,1 n n (cid:88) i =1 y (1) i X (1) i ( U ⊥ ⊗ U ) (cid:12)(cid:12)(cid:12)(cid:12) (cid:107) y (1) (cid:107) is a p -by-(( p − r ) r ) random matrix with i.i.d. Gaussian entries with mean zeroand variance (cid:107) y (1) (cid:107) /n . By Lemma 5 in [139], P (cid:32) max V ∈ R ( p − r × r (cid:107)M ( Z ( U ⊥ V ⊗ U )) (cid:107)≥ C (cid:107) y (1) (cid:107) n (cid:0) √ p + √ r r + √ t ( √ p r + √ p r ) (cid:1) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) y (1) (cid:107) (cid:33) ≤ C exp ( − Ct ( p r + p r )) . (79)Note that (cid:107) y (1) (cid:107) ∼ (cid:101) σ χ n , we have P (cid:16) (cid:107) y (1) (cid:107) ≥ (cid:101) σ ( n + 2 √ n t + 2 t ) (cid:17) ≤ exp( − t ) . (80)Combining (79) (with t = pr/ ( p r + p r )), (80) (with t = Cpr ), and the fact that n ≥ Cpr for large constant
C >
0, we have P max V ∈ R ( p − r × r (cid:107) V (cid:107)≤ (cid:13)(cid:13)(cid:13) M (cid:16) (cid:101) T − T (cid:17) ( U ⊥ V ) ⊗ U (cid:13)(cid:13)(cid:13) ≥ C (cid:101) σ (cid:114) prn ≤ C exp ( − cpr ) . By symmetry, we have similar results for other terms in the right hand side of (78)and the following conclusion, P (cid:18) τ ≥ C (cid:101) σ (cid:114) prn (cid:19) ≤ C exp( − cpr ) . (81)24 Based on essentially the same argument as the previous step, we can also show τ := max k max V ∈ R ( pk +1 − rk +1) × rk +1 : (cid:107) V (cid:107)≤ V (cid:48) ∈ R ( pk +2 − rk +2) × rk +2 : (cid:107) V (cid:48) (cid:107)≤ (cid:13)(cid:13) M k ( Z ) (cid:8) ( U k +1 ⊥ V ) ⊗ ( U k +2 ⊥ V (cid:48) ) (cid:9)(cid:13)(cid:13) ≤ C (cid:101) σ (cid:114) prn (82)with probability at least 1 − C exp( − cpr ).Now, when the statements in (77), (81), (82) all hold, given n ≥ (cid:101) σ λ ( κpr ∨ p / ) forlarge enough constant C >
0, we have n ≥ C (cid:101) σ λ p / r / (by H¨older’s inequality) andthe condition τ λ ( T ) + max k τ (4 τ k + e ) λ ( T ) ≤ C (cid:101) σ (cid:112) pr/n λ + C (cid:101) σ (cid:112) pr/n (cid:16)(cid:101) σ (cid:112) p/n + κ (cid:101) σ (cid:112) p/n + κ (cid:101) σ p / / ( λ n ) (cid:17) λ ≤ C (cid:101) σp / r / λ n / + C (cid:101) σ κpr / λ n + C κ (cid:101) σ p r / λ n / ≤ (cid:101) U k and (cid:102) W k . First, Theorem 1in [138] and (77), (81), (82) imply (cid:13)(cid:13)(cid:13) sin Θ (cid:16) (cid:101) U k , U k (cid:17)(cid:13)(cid:13)(cid:13) ≤ Cτ k σ r k ( M k ( T )) ≤ C (cid:101) σ (cid:112) p k /n λ k , k = 1 , , , and (cid:13)(cid:13)(cid:13) (cid:74) (cid:101) T ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) − T (cid:13)(cid:13)(cid:13) HS ≤ C (cid:101) σ (cid:114) p r + p r + p r + r r r n with probability at least 1 − p − C . Moreover, (cid:107) T − A (cid:107) HS (72) = (cid:13)(cid:13)(cid:13)(cid:16) (cid:101) A − A (cid:17) × U (cid:62) × U (cid:62) × U (cid:62) (cid:13)(cid:13)(cid:13) HS = (cid:13)(cid:13)(cid:13) n n (cid:88) i =1 (cid:16)(cid:68) vec( X i × U (cid:62) × U (cid:62) × U (cid:62) ) , vec( A × U (cid:62) × U (cid:62) × U (cid:62) ) (cid:69) + ε i (cid:17) · vec( X i × U (cid:62) × U (cid:62) × U (cid:62) ) − vec( A × U (cid:62) × U (cid:62) × U (cid:62) ) (cid:13)(cid:13)(cid:13) ≤ C (cid:115) (cid:101) σ n (cid:16) √ r r r + (cid:112) log p (cid:17) − p − C . Combing the previous two inequalities, we have (cid:13)(cid:13)(cid:13) (cid:74) (cid:101) A ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) − A (cid:13)(cid:13)(cid:13) HS ≤ (cid:13)(cid:13)(cid:13) (cid:74) (cid:101) T ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) − T (cid:13)(cid:13)(cid:13) HS + (cid:107) A − T (cid:107) HS ≤ C (cid:101) σ (cid:114) p r + p r + p r + r r r n (cid:16) C (cid:101) σ (cid:112) m/n (83)with probability at least 1 − p − C . Then, for k = 1 , , (cid:107) (cid:101) U (cid:62) k ⊥ A k (cid:107) F ≤ (cid:13)(cid:13)(cid:13) (cid:101) U (cid:62) k ⊥ (cid:16) P (cid:101) U k (cid:101) A k ( P (cid:101) U k +2 ⊗ P (cid:101) U k +1 ) − A k (cid:17)(cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13) P (cid:101) U k (cid:101) A k ( P (cid:101) U k +2 ⊗ P (cid:101) U k +1 ) − A k (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) (cid:74) (cid:101) A ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) − A (cid:13)(cid:13)(cid:13) HS ≤ C (cid:101) σ (cid:112) m/n with probability at least 1 − p − C .Next, we are in the position of evaluating the estimation errors of (cid:102) W k . Denote (cid:101) S = (cid:101) A × (cid:101) U (cid:62) × (cid:101) U (cid:62) × (cid:101) U (cid:62) , (cid:101) V k = SVD r k (cid:16) M k ( (cid:101) S ) (cid:62) (cid:17) , we know (cid:102) W k =( (cid:101) U k +2 ⊗ (cid:101) U k +1 ) (cid:101) V k = SVD r k (cid:16) ( (cid:101) U k +2 ⊗ (cid:101) U k +1 ) M k ( (cid:101) S ) (cid:62) (cid:17) =SVD r k (cid:18) M k (cid:16) (cid:101) S × ( k +1) (cid:101) U k +1 × ( k +2) (cid:101) U k +2 (cid:17) (cid:62) (cid:19) =SVD r k (cid:18) M k (cid:16) (cid:101) S × ( k +1) (cid:101) U k +1 × ( k +2) (cid:101) U k +2 (cid:17) (cid:62) (cid:101) U (cid:62) k (cid:19) =SVD r k (cid:18) M k (cid:16) (cid:101) S × k (cid:101) U k × ( k +1) (cid:101) U k +1 × ( k +2) (cid:101) U k +2 (cid:17) (cid:62) (cid:19) =SVD r k (cid:18) M k (cid:16) (cid:74) (cid:101) A ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) (cid:17) (cid:62) (cid:19) . 
On the other hand, W k = SVD r k ( A (cid:62) k ) = SVD r k (cid:0) M k ( A ) (cid:62) (cid:1) . By Lemma 7, (cid:107) A k (cid:102) W k ⊥ (cid:107) F ≤ (cid:13)(cid:13)(cid:13) M k ( (cid:74) (cid:101) A ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) ) − M k ( A ) (cid:13)(cid:13)(cid:13) F =2 (cid:13)(cid:13)(cid:13) (cid:74) (cid:101) A ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) − A (cid:13)(cid:13)(cid:13) HS (83) ≤ C (cid:101) σ (cid:114) mn (84)with probability at least 1 − p − C . Therefore, we also have (cid:13)(cid:13)(cid:13) sin Θ( (cid:102) W k , W k ) (cid:13)(cid:13)(cid:13) F ≤ (cid:107) (cid:102) W (cid:62) k ⊥ W k (cid:107) F ≤ (cid:107) (cid:102) W (cid:62) k ⊥ W k (cid:102) W (cid:62) k ⊥ A (cid:62) k (cid:107) F σ r k ( (cid:102) W (cid:62) k ⊥ A (cid:62) k ) ≤ C (cid:115) (cid:101) σ mλ k n (85)with probability at least 1 − p − C .To summarize the progress in this step, we have established the following probabilisticinequalities for (cid:101) U , (cid:101) U , (cid:101) U and (cid:102) W , (cid:102) W , (cid:102) W , (cid:13)(cid:13)(cid:13) sin Θ (cid:16) (cid:101) U k , U k (cid:17)(cid:13)(cid:13)(cid:13) ≤ C (cid:101) σ (cid:112) p k /n λ k , (cid:13)(cid:13)(cid:13) sin Θ (cid:16) (cid:102) W k , W k (cid:17)(cid:13)(cid:13)(cid:13) F ≤ C (cid:101) σ (cid:112) m/n λ k , k = 1 , , , (86)26 (cid:13)(cid:13) (cid:101) U (cid:62) k A k (cid:13)(cid:13)(cid:13) F ≤ C (cid:101) σ (cid:112) m/n , (cid:13)(cid:13)(cid:13) A k (cid:102) W k ⊥ (cid:13)(cid:13)(cid:13) F ≤ C (cid:101) σ (cid:112) m/n , k = 1 , , , (87)with probability at least 1 − p − C .Step 4 For the rest of the proof, we assume (86) and (87) hold. Next, we move on to evaluatethe estimation error bound for (cid:98) A . The focus now shifts from the first batch of samples( X (1) , y (1) ) to the second one ( X (2) , y (2) ). Denote θ k := (cid:13)(cid:13)(cid:13) sin Θ (cid:16) (cid:101) U k , U k (cid:17)(cid:13)(cid:13)(cid:13) (86) ≤ C (cid:101) σ (cid:112) p k /n λ k , k = 1 , ,
3; (88) ξ k := (cid:107) A k (cid:102) W k ⊥ (cid:107) F (87) ≤ C (cid:101) σ (cid:112) m/n , k = 1 , ,
3; (89) η k := (cid:13)(cid:13)(cid:13) (cid:101) U (cid:62) k A k (cid:13)(cid:13)(cid:13) F (87) ≤ C (cid:101) σ (cid:112) m/n , k = 1 , ,
3; (90) (cid:98) σ := (cid:13)(cid:13)(cid:13) P (cid:101) U ⊥ vec( A ) (cid:13)(cid:13)(cid:13) + σ . (91)By Lemma 9, (cid:107) P (cid:101) U ⊥ vec( A ) (cid:107) ≤ C (cid:101) σ mpn λ + C (cid:101) σ mp λ n . Provided that m = r r r + (cid:80) k ( p k − r k ) r k and n ≥ C (cid:101) σ pλ , we know (cid:107) P (cid:101) U ⊥ vec( A ) (cid:107) ≤ C (cid:101) σ mpn λ , (cid:98) σ ≤ σ + C (cid:101) σ mpn λ . (92)Step 5 In this step, we evaluate two crucial quantities for applying the oracle inequality (The-orem 2). Recall the importance sketching covariates (6) are defined as (cid:101) X = (cid:104) (cid:101) X B (cid:101) X D (cid:101) X D (cid:101) X D (cid:105) ∈ R n × m , (cid:101) X B ∈ R n × ( r r r ) , (cid:16) (cid:101) X B (cid:17) [ i, :] = vec (cid:16) X (2) i × (cid:101) U (cid:62) × (cid:101) U (cid:62) × (cid:101) U (cid:62) (cid:17) , (cid:101) X D k ∈ R n × ( p k − r k ) r k , (cid:16) (cid:101) X D k (cid:17) [ i, :] = vec (cid:16) (cid:101) U (cid:62) k ⊥ M k (cid:16) X (2) i × k +1 (cid:101) U (cid:62) k +1 × k +2 (cid:101) U (cid:62) k +2 (cid:17) (cid:101) V k (cid:17) . When X (2) i are i.i.d. Gaussian matrices and independent of (cid:101) U k , (cid:101) V k , (cid:102) W k , (cid:101) X can be seenas an orthogonal projection of X (2) i and has i.i.d. Gaussian entries. Thus, by Proposition5.35 in [122], P (cid:16) σ min ( (cid:101) X (cid:62) (cid:101) X ) = σ ( (cid:101) X ) ≥ (cid:0) √ n − √ m − t (cid:1) (cid:17) ≥ − exp( − t / . By definition, (cid:101) ε ∈ R n is independent of (cid:101) X , and (cid:101) ε j = (cid:104) X (2) j , P (cid:101) U ⊥ A (cid:105) + ε j ∼ N (cid:18) , (cid:13)(cid:13)(cid:13) P (cid:101) U ⊥ vec( A ) (cid:13)(cid:13)(cid:13) + σ (cid:19) = N (0 , (cid:98) σ ) . (cid:107) (cid:101) ε (cid:107) ∼ (cid:98) σ χ n and (cid:107) (cid:101) X (cid:62) (cid:101) ε (cid:107) (cid:12)(cid:12)(cid:12) (cid:107) ε (cid:107) ∼ (cid:107) ε (cid:107) χ m . Based on χ distribution tail bound[72, Lemma 1] and n ≥ C ( p / + r ) ≥ Cm , (cid:13)(cid:13)(cid:13) ( (cid:101) X (cid:62) (cid:101) X ) − (cid:101) X (cid:62) (cid:101) ε (cid:13)(cid:13)(cid:13) ≤ (cid:98) σ (cid:16) n + 2 (cid:112) n C log( p ) + 2 C log( p ) (cid:17) (cid:16) m + 2 (cid:112) mC log( p ) + 2 C log( p ) (cid:17)(cid:0) √ n − √ m − C log( p ) (cid:1) ≤ (cid:98) σ mn (cid:16) (cid:113) C log pn + 2 log pn (cid:17) (cid:16) (cid:113) tm + 2 tm (cid:17)(cid:16) − (cid:113) mn − C log( p ) √ n (cid:17) = (cid:98) σ mn (cid:32) C (cid:114) mn + C (cid:114) log pm (cid:33) . (93)with probability at least 1 − p − C .We assume (93) holds. It remains to check (cid:13)(cid:13)(cid:13) (cid:98) D k ( (cid:98) B k (cid:101) V k ) − (cid:13)(cid:13)(cid:13) . Similarly as the proof ofTheorem 2, we define (cid:101) B = (cid:114) A ; (cid:101) U (cid:62) , (cid:101) U (cid:62) , (cid:101) U (cid:62) (cid:122) = (cid:114) S × U × U × U ; (cid:101) U (cid:62) , (cid:101) U (cid:62) , (cid:101) U (cid:62) (cid:122) ∈ R r × r × r ; (cid:101) B k = M k ( (cid:101) B ) ∈ R r k × ( r k +1 r k +2 ) , k = 1 , , , (cid:101) D = (cid:101) U (cid:62) ⊥ M ( A × (cid:101) U (cid:62) × (cid:101) U ) (cid:101) V = (cid:101) U (cid:62) ⊥ A (cid:102) W ∈ R ( p − r ) × r , (cid:101) D = (cid:101) U (cid:62) ⊥ M ( A × (cid:101) U (cid:62) × (cid:101) U ) (cid:101) V = (cid:101) U (cid:62) ⊥ A (cid:102) W ∈ R ( p − r ) × r , (cid:101) D = (cid:101) U (cid:62) ⊥ M ( A × (cid:101) U (cid:62) × (cid:101) U ) (cid:101) V = (cid:101) U (cid:62) ⊥ A (cid:102) W ∈ R ( p − r ) × r . 
By the proof of Theorem 2, we have (cid:13)(cid:13)(cid:13) (cid:98) B − (cid:101) B (cid:13)(cid:13)(cid:13) + (cid:88) k =1 (cid:13)(cid:13)(cid:13) (cid:98) D k − (cid:101) D k (cid:13)(cid:13)(cid:13) F (41) ≤ (cid:13)(cid:13)(cid:13) ( (cid:101) X (cid:62) (cid:101) X ) − (cid:101) X (cid:62) (cid:101) ε (cid:13)(cid:13)(cid:13) ≤ (cid:98) σ mn (cid:32) C s (cid:114) log mn + C (cid:114) log pm (cid:33) , (94) (cid:107) (cid:101) D k ( (cid:101) B k (cid:101) V k ) − (cid:107) (49) ≤ C max k (cid:110) (cid:107) sin Θ( (cid:101) U k , U k ) (cid:107) , (cid:107) sin Θ( W k , W k ) (cid:107) (cid:111) ≤ C (cid:101) σ (cid:112) m/n λ k , (95) σ min ( (cid:101) B k (cid:101) V k ) = σ min ( (cid:101) U (cid:62) k A k (cid:102) W k ) (42) ≥ λ k (cid:18) − C (cid:101) σ mλ k n (cid:19) ≥ λ k (1 − c )for some constant 0 < c <
1. This additionally means σ min (cid:16) (cid:98) B k (cid:101) V k (cid:17) ≥ σ min ( (cid:101) B k (cid:101) V k ) − (cid:107) (cid:98) B k − B k (cid:107) (94) ≥ λ k (cid:18) − C (cid:101) σ mλ k n (cid:19) − C (cid:98) σ mn ≥ (1 − c ) λ k . (96)28t is easy to check that the following equality,( (cid:98) B k (cid:101) V k ) − = ( (cid:101) B k (cid:101) V k ) − + ( (cid:101) B k (cid:101) V k ) − (cid:16) (cid:101) B k (cid:101) V k − (cid:98) B k (cid:101) V k (cid:17) ( (cid:98) B k (cid:101) V k ) − . Thus, ρ := (cid:13)(cid:13)(cid:13) (cid:98) D k ( (cid:98) B k (cid:101) V k ) − (cid:13)(cid:13)(cid:13) ≤ (cid:13)(cid:13)(cid:13) ( (cid:98) D k − (cid:101) D k )( (cid:98) B k (cid:101) V k ) − (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) (cid:101) D k ( (cid:98) B k (cid:101) V k ) − (cid:13)(cid:13)(cid:13) ≤ C (cid:13)(cid:13)(cid:13) (cid:98) D k − (cid:101) D k (cid:13)(cid:13)(cid:13) λ k + (cid:13)(cid:13)(cid:13) (cid:101) D k ( (cid:101) B k (cid:101) V k ) − (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) (cid:101) D k ( (cid:101) B k (cid:101) V k ) − (cid:13)(cid:13)(cid:13) · (cid:13)(cid:13)(cid:13) ( (cid:101) B k − (cid:98) B k ) (cid:101) V k (cid:13)(cid:13)(cid:13) · (cid:107) ( (cid:98) B k (cid:101) V k ) − (cid:107) (94)(95)(96) ≤ C (cid:101) σλ k (cid:114) mn + C (cid:98) σλ k (cid:114) mn . (97)Step 6 Finally, we apply the oracle inequality, i.e., Theorem 2, and obtain the final upper boundfor (cid:98) A . We have shown that the conditions of Theorem 2 holds if (86), (87), and (93)hold. Then Theorem 2 implies (cid:13)(cid:13)(cid:13) (cid:98) A − A (cid:13)(cid:13)(cid:13) ≤ (1 + Cθ + Cρ ) (cid:13)(cid:13)(cid:13) ( (cid:101) X (cid:62) (cid:101) X ) − (cid:101) X (cid:62) (cid:101) ε (cid:13)(cid:13)(cid:13) ≤ (cid:98) σmn (cid:32) C (cid:114) mn + C (cid:114) log pm + C (cid:101) σλ (cid:114) mn + C (cid:98) σλ (cid:114) mn (cid:33) (92) ≤ mn (cid:18) σ + C (cid:101) σ mpn λ (cid:19) (cid:32) C (cid:114) mn + C (cid:114) log pm + C (cid:101) σλ (cid:114) mn + C (cid:98) σλ (cid:114) mn (cid:33) ≤ mn (cid:18) σ + C (cid:101) σ mpn λ (cid:19) (cid:32) C (cid:114) log pm + C (cid:115) m (cid:101) σ ( n ∧ n ) λ (cid:33) with probability at least 1 − p − C . Here, the last inequality is due to n ∧ n ≥ C (cid:101) σ ( p / + r ) /λ and (cid:98) σ = (cid:107) A (cid:107) + σ ≥ λ . (cid:3) F.4 Proof of Theorem 5
In this theorem, we provide an estimation error lower bound for low-rank tensor regression. The central idea is to carefully transform the original high-dimensional low-rank tensor regression model into the unconstrained dimension-reduced linear regression model (103), then apply the classic Bayes risk of linear regression (Lemma 10) to obtain the desired lower bound on the estimation error.

Since $r_1$, $r_2$, and $r_3$ satisfy $r_k \leq r_{k+1} r_{k+2}$ for $k = 1, 2, 3$, an $r_1$-by-$r_2$-by-$r_3$ tensor with i.i.d. normal entries has full Tucker rank with probability 1. Thus, we can set $\mathcal{S} \in \mathbb{R}^{r_1 \times r_2 \times r_3}$ as a fixed tensor with full Tucker rank, i.e., $\mathrm{rank}(\mathcal{S}) = (r_1, r_2, r_3)$. Let $T > 0$ and define $\mathcal{A} \in \mathbb{R}^{p_1 \times p_2 \times p_3}$ via
$$(\mathcal{A})_{[1:r_1,\,1:r_2,\,1:r_3]} = T \cdot \mathcal{S}, \qquad (\mathcal{A})_{[1:r_1,\,1:r_2,\,1:r_3]^c} = 0. \quad (98)$$
Suppose $U_k \in \mathbb{O}_{p_k, r_k}$ and $W_k \in \mathbb{O}_{p_{-k}, r_k}$ are the left and right singular subspaces of $\mathcal{M}_k(\mathcal{A})$, respectively; $V_k \in \mathbb{O}_{r_{k+1} r_{k+2}, r_k}$ is the right singular subspace of $\mathcal{M}_k(\mathcal{S})$. Then by definition of $\mathcal{A}$,
$$U_k = \begin{bmatrix} I_{r_k} \\ 0_{(p_k - r_k) \times r_k} \end{bmatrix}, \quad k = 1, 2, 3.$$
Next, for to-be-specified values $\tau, T >$
0, we introduce a prior distribution ¯ P τ,T on the classof A p , r : the p -by- p -by- p random tensor ¯ A ∼ ¯ P τ,T if and only if it can be generated basedon the following process.1. Generate an r -by- r -by- r tensor B iid ∼ N (0 , τ ) and assign ¯ A [1: r , r , r ] = T S + B .2. Suppose M k ( ¯ A [1: r , r , r ] ) = ¯ A k ∈ R r k × r − k and ¯ V k = SVD r k ( ¯ A (cid:62) k ) ∈ O r − k ,r k . Assign M (cid:0) ¯ A [( r +1): p , r , r ] (cid:1) = B · ¯ V (cid:62) , M (cid:0) ¯ A [1: r , ( r +1): p , r ] (cid:1) = B · ¯ V (cid:62) , M (cid:0) ¯ A [1: r , r , ( r +1): p ] (cid:1) = B · ¯ V (cid:62) , where all entries of B ∈ R ( p − r ) × r , B ∈ R ( p − r ) × r , B ∈ R ( p − r ) × r are indepen-dently drawn from N (0 , τ ).3. The other blocks of ¯ A are calculated as follows,¯ A [( r +1): p , ( r +1): p , r ] = ¯ A [1: r , r , r ] × (cid:0) B ( ¯ A ¯ V ) − (cid:1) × (cid:0) B ( ¯ A ¯ V ) − (cid:1) , ¯ A [( r +1): p , r , ( r +1): p ] = ¯ A [1: r , r , r ] × (cid:0) B ( ¯ A ¯ V ) − (cid:1) × (cid:0) B ( ¯ A ¯ V ) − (cid:1) , ¯ A [1: r , ( r +1): p , ( r +1): p ] = ¯ A [1: r , r , r ] × (cid:0) B ( ¯ A ¯ V ) − (cid:1) × (cid:0) B ( ¯ A ¯ V ) − (cid:1) , ¯ A [( r +1): p , ( r +1): p , ( r +1): p ] = ¯ A [1: r , r , r ] × (cid:0) B ( ¯ A ¯ V ) − (cid:1) × (cid:0) B ( ¯ A ¯ V ) − (cid:1) × (cid:0) B ( ¯ A ¯ V ) − (cid:1) . (99)One can check by comparing each block that ¯ A satisfies¯ A = (cid:113) T S + B ; ¯ L , ¯ L , ¯ L (cid:121) , where ¯ L k = (cid:34) I r k B k ( ¯ A k ¯ V k ) − (cid:35) , k = 1 , , . (100)Thus, rank( ¯ A ) ≤ ( r , r , r ) and ¯ A ∈ A p , r . Then we consider another distribution P ∗ τ,T onthe whole tensor space R p × p × p , A ∗ ∼ P ∗ τ,T , such that A ∗ [1: r , r , r ] = T S + B , M (cid:16) A ∗ [( r +1): p , r , r ] (cid:17) = B · V (cid:62) ; M (cid:16) A ∗ [1: r , ( r +1): p , r ] (cid:17) = B · V (cid:62) ; M (cid:16) A ∗ [( r +1): p , r , r ] (cid:17) = B · V (cid:62) ;the other blocks of A ∗ are set to zero . (101)30ere, B , B , B , B iid ∼ N (0 , τ ). Suppose ¯ A ∼ ¯ P τ,T and A ∗ ∼ P ∗ τ,T . Recall that V k =SVD r k ( M k ( S ) (cid:62) )) and ¯ V k = SVD r k ( M k ( S + B /T ) (cid:62) ). As T → ∞ , we must have¯ V k d → V k and ( ¯ A − A ) d → ( A ∗ − A ) . (102)Next, we move on to the regular tensor regression model y i = (cid:104) X i , A (cid:105) + ε i , i = 1 , . . . , n. For convenience, we divide X i and A into eight blocks and denote them separately as X i,s s s = ( X i ) [ I ,s ,I ,s ,I ,s ] , A s s s = A [ I ,s ,I ,s ,I ,s ] , for s , s , s ∈ { , } , where I k, = { , . . . , r k } , I k, = { r k + 1 , . . . , p k } , k = 1 , , . If A ∗ ∼ P ∗ τ,T , A ∗ , A ∗ , A ∗ , A ∗ are all zeros. Then, y i = (cid:104) X i , A ∗ (cid:105) + ε i = (cid:88) s ,s ,s =1 (cid:104) X i,s s s , A ∗ s s s (cid:105) + ε i = (cid:104) ( X i, , T S + B (cid:105) + (cid:104)M ( X i, ) , B V (cid:62) (cid:105) + (cid:104)M ( X i, ) , B V (cid:62) (cid:105) + (cid:104)M ( X i, ) , B V (cid:62) (cid:105) + ε i = (cid:104) X i , A (cid:105) + ε i + (cid:104) vec( X i, ) , vec( B ) (cid:105) + (cid:104)M ( X i, ) V , B (cid:105) + (cid:104)M ( X i, ) V , B (cid:105) + (cid:104)M ( X i, ) V , B (cid:105) := (cid:104) X i , A (cid:105) + (cid:104) ¯ X i , b (cid:105) + ε i , where¯ X i = vec ( X i, )vec ( M ( X i, ) V )vec ( M ( X i, ) V )vec ( M ( X i, ) V ) ∈ R m , ¯ X = ¯ X (cid:62) ...¯ X (cid:62) n ∈ R n × m , b = vec( B )vec( B )vec( B )vec( B ) ∈ R m . Suppose the parameter A ∗ is drawn from the prior distribution P ∗ τ,T . 
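The reduced model derived just below is a Bayesian Gaussian linear regression; its Bayes risk involves $\mathrm{tr}\big((I_m/\tau^2 + \bar{X}^\top\bar{X}/\sigma^2)^{-1}\big)$, and letting $\tau \to \infty$ leaves $\mathrm{tr}\big(\sigma^2(\bar{X}^\top\bar{X})^{-1}\big)$, whose expectation is the inverse-Wishart quantity $m\sigma^2/(n - m - 1)$ used at the end of this proof (the paper's footnote points to https://en.wikipedia.org/wiki/Inverse-Wishart_distribution for this fact). The following short Monte Carlo check of that expectation is our own illustration, not part of the argument; it assumes $n > m + 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, reps = 200, 10, 2000
vals = []
for _ in range(reps):
    X = rng.standard_normal((n, m))            # rows ~ N(0, I_m), playing the role of X_bar
    vals.append(np.trace(np.linalg.inv(X.T @ X)))
print(np.mean(vals), m / (n - m - 1))          # the two numbers should be close
```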
Then, b iid ∼ N (0 , τ ).Note that ¯ X i is an orthogonal projection of X i , so ¯ X i iid ∼ N (0 , y i , ¯ X i , ¯ b can berelated by the following regression model, y i − (cid:104) X i , A (cid:105) = ¯ X (cid:62) i b + ε i , i = 1 , . . . , n ; b iid ∼ N (0 , τ ) , ε iid ∼ N (0 , σ ) . (103)By the construction of A ∗ and the setting that S is fixed, the estimation of A ∗ is equivalentto the estimation b . By Lemma 10, the Bayes risk of estimating b (and the Bayes risk ofestimating A ∗ if A ∗ ∼ P τ,T ) is (cid:13)(cid:13)(cid:13) (cid:98) A ∗ − A ∗ (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12) { ¯ X i } ni =1 = (cid:13)(cid:13)(cid:13)(cid:98) b − b (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12) { ¯ X i } ni =1 = tr (cid:32)(cid:18) I m τ + ¯ X (cid:62) ¯ X σ (cid:19) − (cid:33) . (cid:98) A ∗ and (cid:98) b are the posterior mean of A ∗ and b , respectively.Since ¯ P τ,T → P τ,T and ¯ A − A → A ∗ − A as T → ∞ , we have E (cid:13)(cid:13)(cid:13) (cid:98) A − ¯ A (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12) { ¯ X i } ni =1 → E (cid:13)(cid:13)(cid:13) (cid:98) A ∗ − A ∗ (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12) { ¯ X i } ni =1 = tr (cid:32)(cid:18) I m τ + ¯ X (cid:62) ¯ X σ (cid:19) − (cid:33) , where (cid:98) A is the posterior mean of ¯ A if ¯ A ∼ ¯ P τ,T . Since ¯ A ∼ ¯ P τ,T and ¯ P τ,T is the distributionon A p , r , we have the following estimation lower bound,inf (cid:98) A sup A ∈A p , r (cid:13)(cid:13)(cid:13) (cid:98) A − A (cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12) { ¯ X i } ni =1 ≥ tr (cid:32)(cid:18) I m τ + ¯ X (cid:62) ¯ X σ (cid:19) − (cid:33) . Finally, since ( ¯ X (cid:62) ¯ X ) − is inverse Wishart distributed and tr( E ( ¯ X (cid:62) ¯ X ) − ) = (cid:40) n − m − tr( I m ) = mn − m − n > m + 1; ∞ n ≤ m + 1 . By letting τ → ∞ , we finally obtaininf (cid:98) A sup A ∈A p , r (cid:13)(cid:13)(cid:13) (cid:98) A − A (cid:13)(cid:13)(cid:13) ≥ lim sup τ →∞ E tr (cid:32)(cid:18) I m τ + ¯ X (cid:62) ¯ X σ (cid:19) − (cid:33) =tr (cid:18) σ I m n − m − (cid:19) = (cid:40) mσ n − m − , if n > m + 1;+ ∞ if n ≤ m + 1 . (cid:3) F.5 Proof of Theorem 6
In this theorem, we aim to establish an estimation error upper bound for sparse ISLET in the sparse low-rank tensor regression problem. After introducing some necessary notation, we develop the estimation error bounds for the sketching directions $\widetilde{U}_k$ and $\widetilde{W}_k$ in Steps 1 and 2. In Step 3, we give error bounds for a number of intermediate terms. In Step 4, we prove upper bounds for the key quantities $\rho$, $\big\|(\widetilde{X}_B^\top \widetilde{X}_B)^{-1} \widetilde{X}_B^\top \widetilde{\varepsilon}_B\big\|_2$, $\big\|(\widetilde{X}_{E_k}^\top \widetilde{X}_{E_k})^{-1} \widetilde{X}_{E_k}^\top \widetilde{\varepsilon}_{E_k}\big\|_2$, and $\max_{i=1,\ldots,p_k} \big\|(\widetilde{X}_{E_k,[:,G_{ki}]})^\top \widetilde{\varepsilon}_{E_k}/n\big\|_2$. Finally, we plug these values into Theorem 3 to finalize the proof.

We first introduce a number of notations that will be used in the proof. Similarly to the proof of Theorem 4, denote
$$A_k = \mathcal{M}_k(\mathcal{A}), \quad S_k = \mathcal{M}_k(\mathcal{S}), \quad \widetilde{A}_k = \mathcal{M}_k(\widetilde{\mathcal{A}}), \quad \widetilde{\mathcal{S}} = \llbracket \widetilde{\mathcal{A}}; \widetilde{U}_1^\top, \widetilde{U}_2^\top, \widetilde{U}_3^\top \rrbracket, \quad \widetilde{S}_k = \mathcal{M}_k(\widetilde{\mathcal{S}}), \quad X_{jk} = \mathcal{M}_k(\mathcal{X}_j), \quad k = 1, 2, 3.$$
Recall $\widetilde{\sigma}^2 = \|\mathcal{A}\|_{\rm HS}^2 + \sigma^2$, $\lambda_k = \sigma_{r_k}(\mathcal{M}_k(\mathcal{A}))$,
$$m_s = r_1 r_2 r_3 + \sum_{k \in J_s} s_k\big(r_k + \log(p_k)\big) + \sum_{k \notin J_s} p_k r_k, \quad (104)$$
and $\widetilde{U}_1, \widetilde{U}_2, \widetilde{U}_3$ are the output from Step 1. We also denote
$$I_k = \big\{ i : U_{k,[i,:]} \neq 0 \big\}, \quad k = 1, 2, 3,$$
$$\zeta_j = (U_3 \otimes U_2 \otimes U_1)^\top \mathrm{vec}\big(\mathcal{X}_j^{(1)}\big) = \mathrm{vec}\big(\llbracket \mathcal{X}_j^{(1)}; U_1^\top, U_2^\top, U_3^\top \rrbracket\big) \in \mathbb{R}^{r_1 r_2 r_3}, \quad j = 1, \ldots, n, \quad (105)$$
$$\widetilde{\sigma}_\zeta^2 = \frac{1}{n} \sum_{j=1}^{n} \Big( \varepsilon_j^{(1)} + \zeta_j^\top \mathrm{vec}(\mathcal{S}) \Big)^2. \quad (106)$$

Step 1 In this first step, we develop the perturbation bound for $\widetilde{U}_k$ and $\widetilde{W}_k$. First, $\widetilde{\mathcal{A}}$ can be decomposed as
$$\begin{aligned}
\widetilde{\mathcal{A}} &= \frac{1}{n} \sum_{j=1}^{n} y_j^{(1)} \mathcal{X}_j^{(1)} = \frac{1}{n} \sum_{j=1}^{n} \Big( \varepsilon_j^{(1)} + \big\langle \mathcal{X}_j^{(1)}, \mathcal{A} \big\rangle \Big) \mathcal{X}_j^{(1)} = \frac{1}{n} \sum_{j=1}^{n} \Big( \varepsilon_j^{(1)} + \big\langle \llbracket \mathcal{X}_j^{(1)}; U_1^\top, U_2^\top, U_3^\top \rrbracket, \mathcal{S} \big\rangle \Big) \mathcal{X}_j^{(1)} \\
&= \frac{1}{n} \sum_{j=1}^{n} \Big( \varepsilon_j^{(1)} + \big\langle \llbracket \mathcal{X}_j^{(1)}; U_1^\top, U_2^\top, U_3^\top \rrbracket, \mathcal{S} \big\rangle \Big) \llbracket \mathcal{X}_j^{(1)}; P_{U_1}, P_{U_2}, P_{U_3} \rrbracket \\
&\quad + \frac{1}{n} \sum_{j=1}^{n} \Big( \varepsilon_j^{(1)} + \big\langle \llbracket \mathcal{X}_j^{(1)}; U_1^\top, U_2^\top, U_3^\top \rrbracket, \mathcal{S} \big\rangle \Big) \mathcal{P}_{(U_3 \otimes U_2 \otimes U_1)_\perp}\big[\mathcal{X}_j^{(1)}\big] := \mathcal{H} + \mathcal{R}. \quad (107)
\end{aligned}$$
In particular, $\mathcal{H}$ is fully determined by $\zeta_j$ and $\varepsilon_j^{(1)}$; $\mathcal{H}$ has Tucker rank at most $(r_1, r_2, r_3)$ with loadings $U_1, U_2, U_3$.
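As an aside, the covariance-type statistic $\widetilde{\mathcal{A}}$ in (107) and sketching directions of the above form can be computed as in the following sketch; for illustration we use a plain truncated SVD of each matricization in place of the STAT-SVD step that the proof actually analyzes, so this is only a schematic with hypothetical variable names.

```python
import numpy as np

def matricize(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def sketching_directions(X_list, y, ranks):
    """A_tilde = (1/n) sum_j y_j X_j; take the top-r_k left singular vectors of each
    matricization as U_tilde_k.  The sparse procedure uses STAT-SVD instead of this
    plain truncated SVD, so this is a schematic rather than the analyzed estimator."""
    n = len(y)
    A_tilde = sum(yj * Xj for yj, Xj in zip(y, X_list)) / n
    U_tilde = [np.linalg.svd(matricize(A_tilde, k))[0][:, :ranks[k]] for k in range(3)]
    return A_tilde, U_tilde
```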
By Lemma 4, (cid:107)M ( H ) − A (cid:107) = (cid:13)(cid:13)(cid:13) U (cid:62) M ( H )( U ⊗ U ) − U (cid:62) A ( U ⊗ U ) (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) j =1 (cid:16) ε (1) j + (cid:104) (cid:74) X (1) j ; U (cid:62) , U (cid:62) , U (cid:62) (cid:75) , S (cid:105) (cid:17) U (cid:62) X (1) jk ( U ⊗ U ) − U (cid:62) A ( U ⊗ U ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) j =1 (cid:16) ε (1) j + (cid:68) U (cid:62) X (1) j ( U ⊗ U ) , S (cid:69)(cid:17) U (cid:62) X (1) j ( U ⊗ U ) − S (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:115) ( r + r r ) (cid:101) σ log pn (108)33ith probability at least 1 − p − C . Similar inequalities also hold for (cid:107)M ( H ) − A (cid:107) and (cid:107)M ( H ) − A (cid:107) . Provided that λ = min k =1 , , σ r k ( A k ) satisfies λ ≥ C (cid:101) σ ( r r + r r + r r ) /n , we have σ r k ( M k ( H )) ≥ σ r k ( M k ( A )) − (cid:107)M k ( H ) − A k (cid:107) ≥ (1 − c ) λ k (109)with probability at least 1 − p − C .Recall the definition of ζ j and (cid:101) σ ζ in (105) (106). For any j = 1 , . . . , n , ε (1) j + ζ (cid:62) j vec( S ) ∼ N (0 , σ + (cid:107) S (cid:107) ) ∼ N (0 , σ + (cid:107) A (cid:107) ) ∼ N (0 , (cid:101) σ ), which means (cid:101) σ ζ ∼ (cid:101) σ n χ n . By the tailbound of χ distribution [72, Lemma 1], (cid:12)(cid:12)(cid:101) σ ζ − (cid:101) σ (cid:12)(cid:12) ≤ C (cid:101) σ (cid:32)(cid:114) log pn + log pn (cid:33) ≤ C (cid:101) σ (cid:114) log pn (110)with probability at least 1 − p − C .Since vec( X (1) j ) has i.i.d. Gaussian entries and ( U ⊗ U ⊗ U ) is orthogonal to ( U ⊗ U ⊗ U ) ⊥ , we have that ( U ⊗ U ⊗ U ) (cid:62) vec( X (1) j ) is independent of ( U ⊗ U ⊗ U ) (cid:62)⊥ vec( X (1) j ) and R (defined in (107)) is Gaussian distributed conditioning on fixedvalues of ζ j and ε (1) j :vec( R ) (cid:12)(cid:12)(cid:12)(cid:12) { ε (1) j , ζ j } n j =1 has same distribution as P ( U ⊗ U ⊗ U ) ⊥ vec( R ) , where R ∈ R p × p × p , R iid ∼ N (cid:32) , (cid:101) σ ζ n (cid:33) . (111)Particularly, R [ I ,I ,I ] c (cid:12)(cid:12)(cid:12) { ε (1) j , ζ j } n j =1 iid ∼ N (0 , (cid:101) σ ζ ) , i.e., R is i.i.d. Gaussian outside ofthe support of A .Step 2 The rest of this proof will be conditioning on the fixed value of { ε (1) j , ζ j } n j =1 that satisfies(108), (109), and (110). Provided (109), (110), and n ≥ C (cid:101) σ λ (cid:32) s s s log p + (cid:88) k =1 ( s k r k + r k +1 r k +2 ) (cid:33) , we have the following signal-noise-ratio assumption for denoising problem: (cid:101) A = H + R ,min k σ r k ( M k ( H )) ≥ C (cid:101) σ ζ √ n (cid:32) ( s s s log p ) / + (cid:88) k =1 ( s k r k + r k +1 r k +1 ) (cid:33) . By [136, Theorem 4] (with mild modifications to the proof to accommodate the factthat R [ I ,I ,I ] here is projection of i.i.d. Gaussian but not exactly i.i.d. Gaussian), the34TAT-SVD with the tuning parameter (cid:98) σ = Med( | vec( (cid:101) A ) | / . 
(cid:13)(cid:13)(cid:13) sin Θ( (cid:101) U k , U k ) (cid:13)(cid:13)(cid:13) F ≤ C (cid:101) σ ζ (cid:112) ( s k r k + s k log( p k )) /n σ r k ( M k ( H )) (109)(110) ≤ C (cid:101) σ (cid:112) ( s k r k + s k log( p k )) /n λ k , k ∈ J s , (112) (cid:13)(cid:13)(cid:13) sin Θ( (cid:101) U k , U k ) (cid:13)(cid:13)(cid:13) F ≤ C (cid:101) σ ζ (cid:112) p k r k /n σ r k ( M k ( H )) (109)(110) ≤ C (cid:101) σ (cid:112) p k r k /n λ k , k / ∈ J s , (113)and (cid:13)(cid:13)(cid:13) (cid:74) (cid:101) A ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) − H (cid:13)(cid:13)(cid:13) ≤ C (cid:101) σ η n (cid:16) r r r + (cid:88) k ∈ J s s k ( r k + log p ) + (cid:88) k / ∈ J s p k r k (cid:17) (104) ≤ C (cid:101) σ m s n (114)with probability at least 1 − p − C , where (cid:101) U , (cid:101) U , (cid:101) U are the outcomes of STAT-SVDprocedure. Since the leading right singular vectors of M k (cid:16) (cid:74) (cid:101) A ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) (cid:17) and M k ( A ) are (cid:102) W k and W k , respectively, we have (cid:13)(cid:13)(cid:13) sin Θ( (cid:102) W k , W k ) (cid:13)(cid:13)(cid:13) F = (cid:13)(cid:13)(cid:13) (cid:102) W (cid:62) k ⊥ W k (cid:13)(cid:13)(cid:13) F ≤ (cid:107) (cid:102) W (cid:62) k ⊥ W k W (cid:62) k M k ( H ) (cid:62) (cid:107) F σ r k (cid:0) W (cid:62) k M k ( H ) (cid:62) (cid:1) = (cid:107) (cid:102) W (cid:62) k ⊥ M k ( H ) (cid:62) (cid:107) F σ r k ( M k ( H )) Lemma 7 ≤ (cid:13)(cid:13)(cid:13) M k (cid:16) (cid:74) (cid:101) A ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) (cid:17) − M k ( A ) (cid:13)(cid:13)(cid:13) F σ r k ( M k ( H )) (109)(114) ≤ C (cid:101) σ (cid:112) m s /n λ k , k = 1 , , . (cid:13)(cid:13)(cid:13) A k (cid:102) W k ⊥ (cid:13)(cid:13)(cid:13) F = (cid:13)(cid:13)(cid:13) A k W k W (cid:62) k (cid:102) W k ⊥ (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13) W (cid:62) k (cid:102) W k ⊥ (cid:13)(cid:13)(cid:13) F · (cid:107) A k (cid:107) = (cid:13)(cid:13)(cid:13) sin Θ( (cid:102) W k , W k ) (cid:13)(cid:13)(cid:13) F · (cid:107) A k (cid:107) ≤ Cκ (cid:101) σ (cid:112) m s /n . Since (cid:101) U k and U k are the leading left singular values of M k (cid:16) (cid:74) (cid:101) A ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) (cid:17) and A k , respectively, (cid:13)(cid:13)(cid:13) (cid:101) U (cid:62) k ⊥ A k (cid:13)(cid:13)(cid:13) F = (cid:13)(cid:13)(cid:13) (cid:101) U (cid:62) k ⊥ U k U (cid:62) k A k (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13) (cid:101) U (cid:62) k ⊥ U k (cid:13)(cid:13)(cid:13) F · (cid:13)(cid:13)(cid:13) U (cid:62) k A k (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) sin Θ( (cid:101) U k , U k ) (cid:13)(cid:13)(cid:13) F · (cid:107) A k (cid:107)≤ C (cid:101) σ √ ( s k r k + s k log( p k )) /n λ k · (cid:107) A k (cid:107) ≤ Cκ (cid:101) σ (cid:112) ( s k r k + s k log( p k )) /n , k ∈ J s ; C (cid:101) σ √ p k r k /n λ k · (cid:107) A k (cid:107) ≤ Cκ (cid:101) σ (cid:112) p k r k /n , k / ∈ J s .
35n summary, in the previous two steps, we have shown (cid:13)(cid:13)(cid:13) sin Θ( (cid:101) U k , U k ) (cid:13)(cid:13)(cid:13) F ≤ C (cid:101) σ √ ( s k r k + s k log( p k )) /n λ k , k ∈ J s ; C (cid:101) σ √ p k r k /n λ k , k / ∈ J s , (cid:13)(cid:13)(cid:13) (cid:101) U (cid:62) k ⊥ A k (cid:13)(cid:13)(cid:13) F ≤ Cκ (cid:101) σ (cid:112) m s /n , (cid:13)(cid:13)(cid:13) sin Θ( (cid:102) W k , W k ) (cid:13)(cid:13)(cid:13) F ≤ C (cid:101) σ (cid:112) m s /n λ k , (cid:13)(cid:13)(cid:13) A k (cid:102) W k ⊥ (cid:13)(cid:13)(cid:13) F ≤ Cκ (cid:101) σ (cid:112) m s /n , for k = 1 , , − p − C .Step 3 Next, we move on to analyze the second batch of samples { X (2) j , ε (2) j } n j =1 . We firstintroduce the following notations, (cid:98) σ B = σ + (cid:13)(cid:13)(cid:13) P ( (cid:101) U ⊗ (cid:101) U ⊗ (cid:101) U ) ⊥ vec( A ) (cid:13)(cid:13)(cid:13) , (cid:98) σ E k = σ + (cid:13)(cid:13)(cid:13)(cid:13) P ( R k ( (cid:102) W k ⊗ I pk ) ) ⊥ vec( A ) (cid:13)(cid:13)(cid:13)(cid:13) . In this step, we give an upper bound for (cid:98) σ B and (cid:98) σ E k given (115) holds. Note that (cid:13)(cid:13)(cid:13) P ( (cid:101) U ⊗ (cid:101) U ⊗ (cid:101) U ) ⊥ vec( A ) (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) vec( A ) − P ( (cid:101) U ⊗ (cid:101) U ⊗ (cid:101) U ) vec( A ) (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) A − (cid:74) A ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) (cid:13)(cid:13)(cid:13) HS = (cid:13)(cid:13)(cid:13) (cid:74) A ; P (cid:101) U + P (cid:101) U ⊥ , P (cid:101) U + P (cid:101) U ⊥ , P (cid:101) U + P (cid:101) U ⊥ (cid:75) − (cid:74) A ; P (cid:101) U , P (cid:101) U , P (cid:101) U (cid:75) (cid:13)(cid:13)(cid:13) HS ≤ (cid:13)(cid:13)(cid:13) A ; P (cid:101) U ⊥ , P (cid:101) U , P (cid:101) U (cid:13)(cid:13)(cid:13) HS + (cid:13)(cid:13)(cid:13) A ; I p , P (cid:101) U ⊥ , P (cid:101) U (cid:13)(cid:13)(cid:13) HS + (cid:13)(cid:13)(cid:13) A ; I p , I p , P (cid:101) U ⊥ (cid:13)(cid:13)(cid:13) HS ≤ (cid:13)(cid:13)(cid:13) (cid:101) U (cid:62) ⊥ A (cid:13)(cid:13)(cid:13) F + (cid:13)(cid:13)(cid:13) (cid:101) U (cid:62) ⊥ A (cid:13)(cid:13)(cid:13) F + (cid:13)(cid:13)(cid:13) (cid:101) U (cid:62) ⊥ A (cid:13)(cid:13)(cid:13) F (115) ≤ Cκ (cid:101) σ (cid:112) m s /n , (cid:13)(cid:13)(cid:13)(cid:13) P ( R k ( (cid:102) W k ⊗ I pk ) ) ⊥ vec( A ) (cid:13)(cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) vec( A ) − P R k ( (cid:102) W k ⊗ I pk ) vec( A ) (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) A k P (cid:102) W k ⊥ (cid:13)(cid:13)(cid:13) F = (cid:13)(cid:13)(cid:13) A k (cid:102) W k ⊥ (cid:13)(cid:13)(cid:13) F ≤ Cκ (cid:101) σ (cid:112) m s /n . Therefore, (cid:98) σ B ≤ σ + Cm s κ (cid:101) σ n , (cid:98) σ E k ≤ σ + Cm s κ (cid:101) σ n , k = 1 , , . (116)Step 4 In this step, we analyze the estimation error for (cid:98) B and (cid:98) E k under the assumption that(115) hold (which further means (116) holds). 
Recall the partial linear models on im-portance sketching covariates (see (25) - (28); also see the proof of Theorem 3), y (2) = (cid:101) X B vec( (cid:101) B ) + (cid:101) ε B , (2) = (cid:101) X E k vec( (cid:101) E k ) + (cid:101) ε E k , k = 1 , , , where the covariates, parameters, and noises of these two regressions are (cid:101) X B ∈ R n × ( r r r ) , ( (cid:101) X B ) i · = vec (cid:16) X (2) i × (cid:101) U × (cid:101) U × (cid:101) U (cid:17) ; (cid:101) X E k ∈ R n × ( p k r k ) , ( (cid:101) X E k ) i · =vec (cid:16) X (2) ik (cid:16) (cid:101) U k +2 ⊗ (cid:101) U k +1 (cid:17) (cid:101) V k (cid:17) =vec (cid:16) X (2) ik (cid:102) W k (cid:17) , k = 1 , , (cid:101) ε B ∈ R n , ( (cid:101) ε B ) j = (cid:68) vec (cid:16) X (2) j (cid:17) ; P ( (cid:101) U ⊗ (cid:101) U ⊗ (cid:101) U ) ⊥ vec( A ) (cid:69) + ε (2) j , (cid:101) ε E k ∈ R n , ( (cid:101) ε E k ) j = (cid:28) vec (cid:16) X (2) j (cid:17) , P ( R k ( (cid:102) W k ⊗ I pk ) ) ⊥ vec( A ) (cid:29) + ε (2) j , k = 1 , , (cid:101) B ) = vec( (cid:74) A ; (cid:101) U (cid:62) , (cid:101) U (cid:62) , (cid:101) U (cid:62) (cid:75) ) = ( (cid:101) U ⊗ (cid:101) U ⊗ (cid:101) U )vec( A ) ∈ R r r r ;and (cid:101) E k = M k (cid:16) A × k +1 (cid:101) U (cid:62) k +1 × k +2 (cid:101) U (cid:62) k +2 (cid:17) (cid:101) V k = A k W k ∈ R p k × r k , k = 1 , , . These quantities satisfy the following properties. • Based on the proof of Theorem 3, (cid:101) E k , k ∈ J s are group-wise sparse, (cid:13)(cid:13)(cid:13) vec( (cid:101) E k ) (cid:13)(cid:13)(cid:13) , = p k (cid:88) i =1 (cid:26) (vec( (cid:101) E k )) Gki (cid:54) =0 (cid:27) ≤ s k , where G ki = { i + p k , . . . , i + p k ( r k − } , i = 1 , . . . , p k , k ∈ J s . • Conditioning on fixed values of (cid:101) U k (cid:101) V k , (cid:102) W k , the noise distribution satisfies (cid:101) ε B (cid:12)(cid:12)(cid:12) (cid:101) U k , (cid:101) V k , (cid:102) W k iid ∼ N (cid:16) , σ + (cid:13)(cid:13)(cid:13) P ( (cid:101) U ⊗ (cid:101) U ⊗ U ) ⊥ [ A ] (cid:13)(cid:13)(cid:13) HS (cid:17) ∼ N (0 , (cid:98) σ B ); (cid:101) ε E k (cid:12)(cid:12)(cid:12) (cid:101) U k , (cid:101) V k , (cid:102) W k iid ∼ N (cid:18) , σ + (cid:13)(cid:13)(cid:13)(cid:13) P ( R k ( (cid:102) W k ⊗ I pk ) ) ⊥ [ A ] (cid:13)(cid:13)(cid:13)(cid:13) HS (cid:19) ∼ N (0 , (cid:98) σ E k ) . • Note that (cid:101) X B is an n -by-( r r r ) matrix with i.i.d. Gaussian entries. Similarly tothe argument in Step 5 in the proof of Theorem 4, (cid:13)(cid:13)(cid:13)(cid:13)(cid:16) (cid:101) X (cid:62) B (cid:101) X B (cid:17) − (cid:101) X (cid:62) B (cid:101) ε B (cid:13)(cid:13)(cid:13)(cid:13) ≤ (cid:98) σ B (cid:16) n + 2 (cid:112) n C log( p ) + 2 C log( p ) (cid:17) (cid:16) r r r + 2 (cid:112) Cr r r log( p ) + 2 C log( p ) (cid:17)(cid:0) √ n − √ r r r − C log( p ) (cid:1) ≤ (cid:98) σ B n (cid:16) (cid:113) C log pn + 2 log pn (cid:17) Cm s (cid:18) − (cid:113) r r r n − C (cid:113) log( p ) n (cid:19) ≤ C (cid:98) σ B m s n . − p − C . Here, the second last inequality is due to (cid:112) r r r log( p ) ≤ ( r r r + log( p )) ≤ m s and the last inequality is due to n ≥ Cm s . By the proof ofTheorem 3, (cid:13)(cid:13)(cid:13) (cid:98) B − (cid:101) B (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:16) (cid:101) X (cid:62) B (cid:101) X B (cid:17) − (cid:101) X (cid:62) B (cid:101) ε B (cid:13)(cid:13)(cid:13)(cid:13) ≤ Cm s (cid:98) σ B n . (117)with probability at least 1 − p − C . 
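The displays above are ordinary least squares problems in the sketched covariates; a minimal sketch of the corresponding estimators follows (variable names are hypothetical, and the group-sparse modes $k \in J_s$ instead use the group-lasso estimator whose error is controlled via (119), which we do not reproduce here).

```python
import numpy as np

def ols_vec(X, y):
    """Least-squares coefficient vector, i.e. (X'X)^{-1} X'y, computed via lstsq."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Hypothetical usage: y2 is the second-batch response, XB and XE[k] the sketched designs.
#   B_hat_vec = ols_vec(XB, y2)        # estimate of vec(B_tilde)
#   E_hat_vec = ols_vec(XE[k], y2)     # estimate of vec(E_tilde_k), for k not in J_s
```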
Similarly, we can show for k / ∈ J s , the least squareestimator (cid:98) E k satisfies (cid:13)(cid:13)(cid:13) (cid:98) E k − E k (cid:13)(cid:13)(cid:13) F (64) = (cid:13)(cid:13)(cid:13)(cid:13)(cid:16) (cid:101) X (cid:62) E k (cid:101) X E k (cid:17) − (cid:101) X (cid:62) E k (cid:101) ε E k (cid:13)(cid:13)(cid:13)(cid:13) ≤ Cm s (cid:98) σ E k n . (118) • By Lemma 12 and n ≥ Cm s for large constant C > (cid:101) X D k satisfies group restrictedisometry property with δ = 1 / − exp( − cn ).Next, since (cid:101) ε E k iid ∼ N n (cid:16) , (cid:98) σ E k (cid:17) and ( (cid:101) X i E k ) (cid:62) (cid:101) ε E k (cid:12)(cid:12)(cid:12)(cid:12) (cid:107) (cid:101) ε E k (cid:107) ∼ N r k (cid:0) , (cid:107) (cid:101) ε E j (cid:107) (cid:1) , we know (cid:107) (cid:101) ε E k (cid:107) ∼ (cid:98) σ E k χ n and (cid:107) ( (cid:101) X i E k ) (cid:62) (cid:101) ε E k (cid:107) (cid:12)(cid:12)(cid:12)(cid:12) (cid:107) (cid:101) ε E k (cid:107) ∼ (cid:107) (cid:101) ε E k (cid:107) · χ r k By the tail bound of χ distribution, (cid:13)(cid:13)(cid:13) ( (cid:101) X i E k ) (cid:62) (cid:101) ε E k (cid:13)(cid:13)(cid:13) ≤ (cid:98) σ E k (cid:16) n + 2 (cid:112) n C log( p ) + 2 C log( p ) (cid:17) (cid:16) r k + 2 (cid:112) r k C log( p ) + 2 C log( p ) (cid:17) ≤ Cn (cid:98) σ E k ( r k + log( p ))with probability at least 1 − p − C . Since log( p k ) (cid:16) log( p ), we havemax ≤ i ≤ p k (cid:13)(cid:13)(cid:13) ( (cid:101) X i E k ) (cid:62) (cid:101) ε E k (cid:13)(cid:13)(cid:13) ≤ Cn (cid:98) σ E k ( r k + log( p k )) (119)with probability at least 1 − p − C . • Similarly as the Step 5 in the proof of Theorem 4, one can show (cid:13)(cid:13)(cid:13) (cid:98) E k ( (cid:101) U (cid:62) k (cid:98) E k ) − (cid:13)(cid:13)(cid:13) ≤ C κ (cid:101) σλ k (cid:114) m s n + C κ (cid:101) σλ k (cid:114) m s n ≤ c, k = 1 , , < c < / − p − C under the scenario of Theorem 6. Finally, Theorem 3 implies (cid:13)(cid:13)(cid:13) (cid:98) A − A (cid:13)(cid:13)(cid:13) ≤ (cid:18) C κ (cid:101) σλ (cid:114) m s n ∧ n (cid:19) (cid:32) (cid:13)(cid:13)(cid:13) ( (cid:101) X (cid:62) B (cid:101) X B ) − (cid:101) X (cid:62) B (cid:101) ε B (cid:13)(cid:13)(cid:13) + C (cid:88) k ∈ J s s k max ≤ i ≤ p k (cid:13)(cid:13)(cid:13) ( (cid:101) X i E k ) (cid:62) (cid:101) ε E k /n (cid:13)(cid:13)(cid:13) + (cid:88) k / ∈ J s (cid:13)(cid:13)(cid:13) ( (cid:101) X (cid:62) E k (cid:101) X E k ) − (cid:101) X (cid:62) E k (cid:101) ε E k (cid:13)(cid:13)(cid:13) (cid:33) ( a ) ≤ C (cid:32) m s ( (cid:98) σ B + (cid:98) σ E k ) n + C (cid:88) k =1 s k ( r k + log( p k )) (cid:98) σ E k n (cid:33) ( b ) ≤ C m s n (cid:18) σ + C m s κ (cid:101) σ n (cid:19) with probability at least 1 − p − C . Here, (a) is due to (117), (118), and (119); (b) is dueto (116). (cid:3) F.6 Proof of Theorem 7
This theorem gives a lower bound on the estimation error for sparse low-rank tensor regression. In order to prove the desired lower bound, we only need to prove the forthcoming (120) and (123), respectively. To prove each inequality, we first construct a series of tensor parameters $\mathcal{A}^{(j)}$ that satisfy: (1) the distances between $\mathcal{A}^{(j)}$ and $\mathcal{A}^{(l)}$ are sufficiently large for any $j \neq l$; (2) the distributions of the resulting observations, $\{y_i^{(j)}, \mathcal{X}_i^{(j)}\}_{i=1}^n$ and $\{y_i^{(l)}, \mathcal{X}_i^{(l)}\}_{i=1}^n$, are close in Kullback-Leibler divergence. Finally, the lower bound is proved by an application of the generalized Fano's lemma.

In order to prove this theorem, we only need to show
$$\inf_{\widehat{\mathcal{A}}} \sup_{\mathcal{A} \in \mathcal{A}_{p,s,r}} \mathbb{E} \big\|\widehat{\mathcal{A}} - \mathcal{A}\big\|_{\rm HS}^2 \geq \max\left\{ \frac{c\, r_1 r_2 r_3\, \sigma^2}{n},\ \max_{l=1,2,3} \frac{c\, \sigma^2 \big(s_l r_l + s_l \log(e p_l / s_l)\big)}{n} \right\}.$$
1. If r r r = max (cid:26) r r r , max k =1 , , ( s k r k + s k log( ep k /s k )) (cid:27) , we only need to prove inf (cid:98) A sup A ∈A p , s , r E (cid:13)(cid:13)(cid:13) (cid:98) A − A (cid:13)(cid:13)(cid:13) ≥ cr r r σ n , (120)for r r r ≥ S as an r -by- r -by- r tensor with i.i.d. Gaussian entries. Since r k ≥ r k +1 r k +2 for k = 1 , , S has Tucker rank-( r , r , r ) with probability one. Let U , U , U be arbitrary fixed39rthogonal matrices that satisfy U k ∈ O p k ,r k , (cid:107) U k (cid:107) , = p k (cid:88) i =1 { ( U k ) [ i, :] (cid:54) =0 } ≤ s k , k = 1 , , . By Varshamov-Gilbert bound [87, Lemma 4.7], we can find B (1) , . . . , B ( N ) ⊆ {− , } r × r × r such that ∀ j (cid:54) = l, (cid:107) B ( j ) − B ( l ) (cid:107) = 2 (cid:88) i ,i | B ( j )[ i ,i ] − B ( l )[ i ,i ] | ≥ r r r and N ≥ exp( r r r / . On the other hand, (cid:107) B ( j ) − B ( l ) (cid:107) ≤ (cid:107) B ( j ) (cid:107) + 2 (cid:107) B ( l ) (cid:107) ≤ r r r . (121)Since r r r ≥ N ≥
3. Then we construct A ( j ) = (cid:74) S + τ B j ; U , U , U (cid:75) , j = 1 , . . . , N, where τ > A (1) , . . . , A ( N ) ⊆ A p , s , r . Now, the KullbackLeibler divergence between the samplesgenerated from A ( j ) and the samples generated from A ( l ) satisfy D KL (cid:16) { X i , y ( j ) i } ni =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) { X i , y ( l ) i } ni =1 (cid:17) Lemma 13 = n σ (cid:13)(cid:13)(cid:13) A ( j ) − A ( l ) (cid:13)(cid:13)(cid:13) ≤ n σ (cid:13)(cid:13)(cid:13) τ B ( j ) − τ B ( l ) (cid:13)(cid:13)(cid:13) ≤ n σ (4 τ r r r ) (122)and ∀ j (cid:54) = l, (cid:13)(cid:13)(cid:13) A ( j ) − A ( l ) (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) τ B ( j ) − τ B ( l ) (cid:13)(cid:13)(cid:13) ≥ τ r r r . By generalized Fano’s lemma,inf (cid:98) A sup A ∈A p , s , r (cid:13)(cid:13)(cid:13) (cid:98) A − A (cid:13)(cid:13)(cid:13) ≥ inf (cid:98) A sup A ∈ { A (1) ,..., A ( N ) } (cid:13)(cid:13)(cid:13) (cid:98) A − A (cid:13)(cid:13)(cid:13) ≥ τ r r r (cid:18) − τ r r r n/σ + log(2)log( N ) (cid:19) . By setting τ = σ log( N/ . / (2 r r r n ), we haveinf (cid:98) A sup A ∈A p , s , r (cid:13)(cid:13)(cid:13) (cid:98) A − A (cid:13)(cid:13)(cid:13) ≥ cτ r r r = cσ r r r n , which has shown (120) if r r r ≥
9. 40. If s k r k + s k log( ep k /s k ) = max (cid:26) r r r , max l =1 , , ( s l r l + s k log( ep l /s l )) (cid:27) , we only need to proveinf (cid:98) A sup A ∈A p , r E (cid:13)(cid:13)(cid:13) (cid:98) A − A (cid:13)(cid:13)(cid:13) ≥ cσ ( s k r k + s k log( ep k /s k )) n , (123)provided that s k r k + s k log( ep k /s k ) ≥ C for large constant C >
0. Without loss ofgenerality we assume k = 1.To this end, we randomly generate an orthogonal matrix S ∈ O r r ,r and construct S ∈ R r × r × r such that M ( S ) = S (cid:62) . We also construct U and U as fixed orthogonalmatrices that satisfies (cid:107) U (cid:107) , ≤ s and (cid:107) U (cid:107) , ≤ s . By Lemma 14, there exists { U ( k )1 } Nk =1 ⊆ { , , − } p × r such that (cid:107) U ( j )1 (cid:107) , = p (cid:88) i =1 (cid:110) ( U ( j )1 ) [ i, :] (cid:54) =0 (cid:111) ≤ s , j = 1 , . . . , N, (cid:13)(cid:13)(cid:13) U ( j )1 − U ( l )1 (cid:13)(cid:13)(cid:13) , = (cid:88) i,j (cid:12)(cid:12)(cid:12) ( U ( j )1 ) ij − ( U ( l )1 ) ij (cid:12)(cid:12)(cid:12) > s r / , ≤ j (cid:54) = l ≤ N, (124)and N ≥ exp ( c ( s r + s log( ep /s ))). We further let A ( j ) = (cid:74) τ S ; U ( j )1 , U , U (cid:75) , j = 1 , , . . . , N, where τ is a fixed and to-be-determined value. By such the construction, for any 1 ≤ j (cid:54) = l ≤ N , (cid:13)(cid:13)(cid:13) A ( j ) − A ( l ) (cid:13)(cid:13)(cid:13) = τ (cid:13)(cid:13)(cid:13) U ( j )1 M ( S ) U (cid:62) ⊗ U (cid:62) − U ( l )1 M ( S ) U (cid:62) ⊗ U (cid:62) (cid:13)(cid:13)(cid:13) F = τ (cid:13)(cid:13)(cid:13) U ( j )1 S (cid:62) U (cid:62) ⊗ U (cid:62) − U ( l )1 S (cid:62) U (cid:62) ⊗ U (cid:62) (cid:13)(cid:13)(cid:13) F = τ (cid:13)(cid:13)(cid:13) U ( j )1 − U ( l )1 (cid:13)(cid:13)(cid:13) F (since all entries of U ( j )1 , U ( l )1 ∈ {− , , } ) ≥ τ (cid:13)(cid:13)(cid:13) U ( j )1 − U ( l )1 (cid:13)(cid:13)(cid:13) , > τ s r / , and D KL (cid:16) { X i , y ( j ) i } ni =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) { X i , y ( l ) i } ni =1 (cid:17) = n σ (cid:13)(cid:13)(cid:13) A ( j ) − A ( l ) (cid:13)(cid:13)(cid:13) = n σ τ (cid:13)(cid:13)(cid:13) U ( j )1 − U ( l )1 (cid:13)(cid:13)(cid:13) F ≤ nτ σ (cid:16) (cid:107) U ( j )1 (cid:107) + (cid:107) U ( l )1 (cid:107) (cid:17) ≤ nτ σ · s r . (125)By setting τ = σ log( N/ . / (2 ns r ), we haveinf (cid:98) A sup A ∈A p , r (cid:13)(cid:13)(cid:13) (cid:98) A − A (cid:13)(cid:13)(cid:13) ≥ τ s r (cid:32) − nτ s r σ − log(2)log( N ) (cid:33) ≥ σ log( N/ . ns r · s r · c ≥ cσ ( s r + s log( ep /s )) n , which has shown (123).In summary of the previous two parts, we have finished the proof of this theorem. (cid:3) Technical Lemmas
Lemma 1 (Kronecker Product, Vectorization, and Matricization). Suppose $A_1 \in \mathbb{R}^{p_1 \times p_2}$, $\mathcal{A} \in \mathbb{R}^{p_1 \times p_2 \times \cdots \times p_d}$, $B_k \in \mathbb{R}^{p_k \times r_k}$, and $B_k' \in \mathbb{R}^{r_k \times d_k}$, $k = 1, \ldots, d$. Then,
$$(B_1 \otimes \cdots \otimes B_d) \cdot (B_1' \otimes \cdots \otimes B_d') = (B_1 B_1') \otimes \cdots \otimes (B_d B_d'), \quad (126)$$
$$\mathrm{vec}\big(B_1^\top A_1 B_2\big) = (B_2^\top \otimes B_1^\top)\,\mathrm{vec}(A_1), \quad (127)$$
$$\mathrm{vec}\big(\llbracket \mathcal{A}; B_1^\top, \ldots, B_d^\top \rrbracket\big) = (B_d^\top \otimes \cdots \otimes B_1^\top)\,\mathrm{vec}(\mathcal{A}), \quad (128)$$
$$\mathcal{M}_k\big(\llbracket \mathcal{A}; B_1^\top, \ldots, B_d^\top \rrbracket\big) = B_k^\top \mathcal{M}_k(\mathcal{A}) \big(B_d \otimes \cdots \otimes B_{k+1} \otimes B_{k-1} \otimes \cdots \otimes B_1\big). \quad (129)$$
Finally, for any $V_k \in \mathbb{R}^{r_{-k} \times r_k}$,
$$\mathrm{vec}\Big( B_k^\top \mathcal{M}_k\big(\llbracket \mathcal{A}; B_1^\top, \ldots, B_{k-1}^\top, I_{p_k}, B_{k+1}^\top, \ldots, B_d^\top \rrbracket\big) V_k \Big) = \Big[ V_k^\top \big( B_d^\top \otimes \cdots \otimes B_{k+1}^\top \otimes B_{k-1}^\top \otimes \cdots \otimes B_1^\top \big) \Big] \otimes \big(B_k^\top\big) \cdot \mathrm{vec}\big(\mathcal{M}_k(\mathcal{A})\big). \quad (130)$$
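These identities are standard; as a quick numerical sanity check (our own illustration, not part of the paper), the following verifies (126) and (127) with numpy, using column-major vectorization and arbitrary dimensions.

```python
import numpy as np
rng = np.random.default_rng(1)

vec = lambda M: M.reshape(-1, order="F")        # column-major vectorization

p1, p2, r1, r2, d1, d2 = 4, 5, 2, 3, 2, 2
A1 = rng.standard_normal((p1, p2))
B1, B2 = rng.standard_normal((p1, r1)), rng.standard_normal((p2, r2))
B1p, B2p = rng.standard_normal((r1, d1)), rng.standard_normal((r2, d2))

# (126): (B1 kron B2)(B1' kron B2') = (B1 B1') kron (B2 B2')
print(np.allclose(np.kron(B1, B2) @ np.kron(B1p, B2p), np.kron(B1 @ B1p, B2 @ B2p)))

# (127): vec(B1' A1 B2) = (B2' kron B1') vec(A1)
print(np.allclose(vec(B1.T @ A1 @ B2), np.kron(B2.T, B1.T) @ vec(A1)))
```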
Proof of Lemma 1. See [65, 66] for the proof of (126), (128) and (129). We shall also note that (127) is the order-2 case of (128). Finally,
$$\begin{aligned}
&\mathrm{vec}\Big( B_k^\top \mathcal{M}_k\big(\llbracket \mathcal{A}; B_1^\top, \ldots, B_{k-1}^\top, I_{p_k}, B_{k+1}^\top, \ldots, B_d^\top \rrbracket\big) V_k \Big) \\
&\overset{(127)}{=} \big(V_k^\top \otimes B_k^\top\big)\, \mathrm{vec}\Big( \mathcal{M}_k\big(\llbracket \mathcal{A}; B_1^\top, \ldots, B_{k-1}^\top, I_{p_k}, B_{k+1}^\top, \ldots, B_d^\top \rrbracket\big) \Big) \\
&\overset{(129)}{=} \big(V_k^\top \otimes B_k^\top\big)\, \mathrm{vec}\Big( \mathcal{M}_k(\mathcal{A}) \big(B_d \otimes \cdots \otimes B_{k+1} \otimes B_{k-1} \otimes \cdots \otimes B_1\big) \Big) \\
&\overset{(127)}{=} \big(V_k^\top \otimes B_k^\top\big) \Big( B_d^\top \otimes \cdots \otimes B_{k+1}^\top \otimes B_{k-1}^\top \otimes \cdots \otimes B_1^\top \otimes I_{p_k} \Big) \mathrm{vec}\big(\mathcal{M}_k(\mathcal{A})\big) \\
&= \Big[ V_k^\top \big( B_d^\top \otimes \cdots \otimes B_{k+1}^\top \otimes B_{k-1}^\top \otimes \cdots \otimes B_1^\top \big) \Big] \otimes \big(B_k^\top\big) \cdot \mathrm{vec}\big(\mathcal{M}_k(\mathcal{A})\big). \qquad \square
\end{aligned}$$
Lemma 2.
Suppose $A \in \mathbb{R}^{p \times r}$ and $U \in \mathbb{O}_{p, m}$. Then,
$$\sigma_r^2(A) \geq \sigma_r^2(U^\top A) + \sigma_r^2(U_\perp^\top A), \qquad \|A\|^2 \leq \big\|U^\top A\big\|^2 + \big\|U_\perp^\top A\big\|^2.$$
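A quick numerical check of these two inequalities (our own illustration; dimensions are arbitrary):

```python
import numpy as np
rng = np.random.default_rng(2)

p, r, m = 8, 3, 4
A = rng.standard_normal((p, r))
U = np.linalg.qr(rng.standard_normal((p, m)))[0]            # U in O_{p,m}
U_perp = np.linalg.svd(U, full_matrices=True)[0][:, m:]     # orthonormal complement of U

smallest_sv = lambda M: np.linalg.svd(M, compute_uv=False).min()   # the r-th singular value
spec = lambda M: np.linalg.norm(M, 2)                               # spectral norm

assert smallest_sv(A)**2 >= smallest_sv(U.T @ A)**2 + smallest_sv(U_perp.T @ A)**2 - 1e-10
assert spec(A)**2 <= spec(U.T @ A)**2 + spec(U_perp.T @ A)**2 + 1e-10
print("Lemma 2 inequalities hold on this draw")
```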
Proof of Lemma 2. Let $v$ be the right singular vector associated with the $r$-th singular value of $A$. Then $\|Av\|_2 = \sigma_r(A)\|v\|_2 = \sigma_r(A)$ and
$$\sigma_r^2(A) = \|Av\|_2^2 = \|P_U Av\|_2^2 + \|P_{U_\perp} Av\|_2^2 = \|U^\top Av\|_2^2 + \|U_\perp^\top Av\|_2^2 \geq \sigma_r^2(U^\top A)\|v\|_2^2 + \sigma_r^2(U_\perp^\top A)\|v\|_2^2 = \sigma_r^2(U^\top A) + \sigma_r^2(U_\perp^\top A).$$
On the other hand,
$$\|A\|^2 = \max_{v: \|v\|_2 \leq 1} \|Av\|_2^2 = \max_{v: \|v\|_2 \leq 1} \Big( \|P_U Av\|_2^2 + \|P_{U_\perp} Av\|_2^2 \Big) \leq \max_{v: \|v\|_2 \leq 1} \|P_U Av\|_2^2 + \max_{v: \|v\|_2 \leq 1} \|P_{U_\perp} Av\|_2^2 = \|U^\top A\|^2 + \|U_\perp^\top A\|^2.$$
The following lemma establishes a deterministic upper bound for $\|\widehat{F}\widehat{G}^{-1}\widehat{H} - FG^{-1}H\|_{\rm F}$ in terms of $\|\widehat{F} - F\|_{\rm F}$, $\|\widehat{G} - G\|_{\rm F}$, and $\|\widehat{H} - H\|_{\rm F}$, together with its more general higher-order form. This result serves as a key technical lemma for the theoretical analysis of the oracle inequalities.

Lemma 3.
Suppose $F, \widehat{F} \in \mathbb{R}^{p_1 \times r}$, $G, \widehat{G} \in \mathbb{R}^{r \times r}$, and $H, \widehat{H} \in \mathbb{R}^{r \times p_2}$. If $G$ and $\widehat{G}$ are invertible, $\|FG^{-1}\| \leq \lambda_1$, $\|G^{-1}H\| \leq \lambda_2$, and $\|\widehat{G}^{-1}\widehat{H}\| \leq \lambda_2$, we have
$$\big\|\widehat{F}\widehat{G}^{-1}\widehat{H} - FG^{-1}H\big\|_{\rm F} \leq \lambda_2 \|\widehat{F} - F\|_{\rm F} + \lambda_1 \|\widehat{H} - H\|_{\rm F} + \lambda_1 \lambda_2 \|\widehat{G} - G\|_{\rm F}. \quad (131)$$
More generally, for any $d \geq 2$, suppose $\widehat{\mathcal{F}}, \mathcal{F} \in \mathbb{R}^{r_1 \times \cdots \times r_d}$ are order-$d$ tensors, $G_k, \widehat{G}_k \in \mathbb{R}^{r_k \times r_k}$ are invertible, and $H_k, \widehat{H}_k \in \mathbb{R}^{p_k \times r_k}$. If $\|H_k G_k^{-1}\| \leq \lambda_k$, $\|\widehat{H}_k \widehat{G}_k^{-1}\| \leq \lambda_k$, and $\|G_k^{-1} \mathcal{M}_k(\mathcal{F})\| \leq \pi_k$, we have
$$\begin{aligned}
&\Big\| \llbracket \widehat{\mathcal{F}}; (\widehat{H}_1 \widehat{G}_1^{-1}), \ldots, (\widehat{H}_d \widehat{G}_d^{-1}) \rrbracket - \llbracket \mathcal{F}; (H_1 G_1^{-1}), \ldots, (H_d G_d^{-1}) \rrbracket \Big\|_{\rm HS} \\
&\leq \lambda_1 \cdots \lambda_d \|\widehat{\mathcal{F}} - \mathcal{F}\|_{\rm HS} + \sum_{k=1}^d \pi_k \lambda_1 \cdots \lambda_d \|\widehat{G}_k - G_k\|_{\rm F} + \sum_{k=1}^d \pi_k \lambda_1 \cdots \lambda_d / \lambda_k \|\widehat{H}_k - H_k\|_{\rm F}. \quad (132)
\end{aligned}$$
Proof of Lemma 3.
First, it is easy to check the following identity for any non-singularmatrices G and (cid:98) G , (cid:98) G − = G − − G − ( (cid:98) G − G ) (cid:98) G − . Thus, (cid:13)(cid:13)(cid:13)(cid:98) F (cid:98) G − (cid:98) H − FG − H (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13) ( (cid:98) F − F ) (cid:98) G − (cid:98) H (cid:13)(cid:13)(cid:13) F + (cid:13)(cid:13)(cid:13) F (cid:16) G − − G − ( (cid:98) G − G ) (cid:98) G − (cid:17) (cid:98) H − FG − H (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13)(cid:98) F − F (cid:13)(cid:13)(cid:13) F · (cid:13)(cid:13)(cid:13) (cid:98) G − (cid:98) H (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13)(cid:13) FG − (cid:98) H − FG − H (cid:13)(cid:13)(cid:13) F + (cid:13)(cid:13)(cid:13) FG − ( (cid:98) G − G ) (cid:98) G − (cid:98) H (cid:13)(cid:13)(cid:13) F ≤ (cid:13)(cid:13)(cid:13)(cid:98) F − F (cid:13)(cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) (cid:98) G − (cid:98) H (cid:13)(cid:13)(cid:13) + (cid:13)(cid:13) FG − (cid:13)(cid:13) (cid:13)(cid:13)(cid:13) (cid:98) H − H (cid:13)(cid:13)(cid:13) F + (cid:13)(cid:13) FG − (cid:13)(cid:13) (cid:13)(cid:13)(cid:13) (cid:98) G − G (cid:13)(cid:13)(cid:13) F (cid:13)(cid:13)(cid:13) (cid:98) G − (cid:98) H (cid:13)(cid:13)(cid:13) ≤ λ (cid:107) (cid:98) F − F (cid:107) F + λ (cid:107) (cid:98) H − H (cid:107) F + λ λ (cid:107) (cid:98) G − G (cid:107) F . Then we consider the proof of (132). Define (cid:98)(cid:101) F d = M d ( (cid:98) F ) (cid:16) (cid:98) H d − (cid:98) G − d − ⊗ · · · ⊗ (cid:98) H (cid:98) G − (cid:17) (cid:62) , (cid:101) F d = M d ( F ) (cid:0) H d − G − d − ⊗ · · · ⊗ H G − (cid:1) (cid:62) . We shall note that (cid:13)(cid:13)(cid:13) G − d (cid:101) F d (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13) G − d M d ( F ) (cid:0) H d − G − d − ⊗ · · · ⊗ H G − (cid:1)(cid:13)(cid:13) ≤ (cid:13)(cid:13) G − d M d ( F ) (cid:13)(cid:13) · (cid:107) H d − G − d − (cid:107) · · · (cid:107) H G − (cid:107) ≤ π d λ · · · λ d − , (cid:13) H d G − d (cid:13)(cid:13) ≤ λ d , (cid:107) (cid:98) H d (cid:98) G − d (cid:107) ≤ λ d . By the first part of this lemma and tensor algebra, (cid:13)(cid:13)(cid:13) (cid:74) (cid:98) F ; (cid:98) H (cid:98) G − , . . . , (cid:98) H d (cid:98) G − d (cid:75) − (cid:74) F ; H G − , . . . , H d G − d (cid:75) (cid:13)(cid:13)(cid:13) HS = (cid:13)(cid:13)(cid:13) M d (cid:16) (cid:74) (cid:98) F ; (cid:98) H (cid:98) G − , . . . , (cid:98) H d (cid:98) G − d (cid:75) (cid:17) − M d (cid:0) (cid:74) F ; H G − , . . . , H d G − d (cid:75) (cid:1)(cid:13)(cid:13)(cid:13) F Lemma 1 = (cid:13)(cid:13)(cid:13)(cid:13) (cid:98) H d (cid:98) G − d (cid:98)(cid:101) F d − H d G − d (cid:101) F d (cid:13)(cid:13)(cid:13)(cid:13) F ≤ λ d (cid:107) (cid:98)(cid:101) F d − (cid:101) F d (cid:107) F + λ · · · λ d π d (cid:107) (cid:98) G d − G d (cid:107) F + λ · · · λ d − π d (cid:107) (cid:98) H d − H d (cid:107) F . (133)Next, we analyze (cid:107) (cid:98)(cid:101) F d − (cid:101) F d (cid:107) F . Define (cid:98)(cid:101) F d − = M d − ( (cid:98) F ) (cid:16) I r d ⊗ (cid:98) H d − (cid:98) G − d − ⊗ · · · ⊗ (cid:98) H (cid:98) G − (cid:17) (cid:62) , (cid:101) F d − = M d − ( F ) (cid:0) I r d ⊗ H d − G − d − ⊗ · · · ⊗ H G − (cid:1) (cid:62) . Then by tensor algebra (Lemma 1), (cid:107) (cid:98)(cid:101) F d − (cid:101) F d (cid:107) F = (cid:13)(cid:13)(cid:13) (cid:74) (cid:98) F ; (cid:98) H (cid:98) G − , . . . , (cid:98) H d − (cid:98) G − d − , I r d (cid:75) − (cid:74) F ; H G − , . . . 
, H d − G − d − , I r d (cid:75) (cid:13)(cid:13)(cid:13) HS = (cid:13)(cid:13)(cid:13) M d − (cid:16) (cid:74) (cid:98) F ; (cid:98) H (cid:98) G − , . . . , (cid:98) H d − (cid:98) G − d − , I r d (cid:75) (cid:17) − M d − (cid:0) (cid:74) F ; H G − , . . . , H d − G − d − , I r d (cid:75) (cid:1)(cid:13)(cid:13)(cid:13) F = (cid:13)(cid:13)(cid:13)(cid:13) (cid:98) H d − (cid:98) G − d − (cid:98)(cid:101) F d − − H d − G − d − (cid:101) F d − (cid:13)(cid:13)(cid:13)(cid:13) F . Similarly as the previous argument, one can show by the first part of this lemma that (cid:107) (cid:98)(cid:101) F d − (cid:101) F d (cid:107) F = (cid:13)(cid:13)(cid:13)(cid:13) (cid:98) H d − (cid:98) G − d − (cid:98)(cid:101) F d − − H d − G − d − (cid:101) F d − (cid:13)(cid:13)(cid:13)(cid:13) F ≤ λ d − (cid:107) (cid:98)(cid:101) F d − − (cid:101) F d − (cid:107) F + λ · · · λ d − π d − (cid:107) (cid:98) G d − − G d − (cid:107) F + λ · · · λ d − π d − (cid:107) (cid:98) H d − − H d − (cid:107) F . Therefore, by (133) and the previous inequality, (cid:13)(cid:13)(cid:13) (cid:74) (cid:98) F ; (cid:98) H (cid:98) G − , . . . , (cid:98) H d (cid:98) G − d (cid:75) − (cid:74) F ; H G − , . . . , H d G − d (cid:75) (cid:13)(cid:13)(cid:13) HS ≤ λ d − λ d (cid:13)(cid:13)(cid:13)(cid:13)(cid:98)(cid:101) F d − − (cid:101) F d − (cid:13)(cid:13)(cid:13)(cid:13) F + (cid:88) k = d − ,d λ · · · λ d π k (cid:107) (cid:98) G k − G k (cid:107) F + (cid:88) k = d − ,d λ · · · λ d π k λ k (cid:13)(cid:13)(cid:13) (cid:98) H k − H k (cid:13)(cid:13)(cid:13) F . We further introduce (cid:98)(cid:101) F d − , (cid:101) F d − , . . . , (cid:98)(cid:101) F , (cid:101) F , repeat the previous argument for d time, andcan finally obtain (cid:13)(cid:13)(cid:13) (cid:74) (cid:98) F ; (cid:98) H (cid:98) G − , . . . , (cid:98) H d (cid:98) G − d (cid:75) − (cid:74) F ; H G − , . . . , H d G − d (cid:75) (cid:13)(cid:13)(cid:13) HS ≤ λ · · · λ d (cid:107) (cid:98) F − F (cid:107) HS + d (cid:88) k =1 λ · · · λ d π k (cid:107) (cid:98) G k − G k (cid:107) F + d (cid:88) k =1 λ · · · λ d π k λ k (cid:107) (cid:98) H k − H k (cid:107) F , (cid:3) The following lemma characterizes the concentration of Gaussian ensemble measure-ments, which will be extensively used in the proof of Theorem 4.
Lemma 4 (Gaussian Ensemble Concentration Inequality for Matrices). Suppose $A \in \mathbb{R}^{a \times b}$ is a fixed matrix, $X_1, \ldots, X_n \in \mathbb{R}^{a \times b}$ are random matrices with i.i.d. standard Gaussian entries, and $\varepsilon_1, \ldots, \varepsilon_n \overset{iid}{\sim} N(0, \sigma^2)$. Let $E = \frac{1}{n} \sum_{i=1}^n \big(\langle A, X_i \rangle + \varepsilon_i\big) X_i$. Then there exists a uniform constant $C > 0$ such that
$$\mathbb{P}\left( \|E - A\| \geq C \sqrt{(a + b)\big(\|A\|_{\rm F}^2 + \sigma^2\big)} \left( \sqrt{\frac{\log(a+b) + t}{n}} + \frac{\log(a+b) + t}{n} \right) \right) \leq \exp(-t). \quad (134)$$
Proof of Lemma 4.
Denote Z i = ( (cid:104) A , X i (cid:105) + ε i ) X i . It is easy to check that E Z i = A .Then, E ( Z i − A )( Z i − A ) (cid:62) = E Z i Z (cid:62) i − A ( E Z i ) (cid:62) − ( E Z i ) A (cid:62) + AA (cid:62) = E Z i Z (cid:62) i − AA (cid:62) = E (cid:104) A , X i (cid:105) X i X (cid:62) i + σ E X i X (cid:62) i − AA (cid:62) = E (cid:104) A , X i (cid:105) X i X (cid:62) i + σ · b I a − AA (cid:62) Note that for any entry ( X i ) [ j,k ] , E ( X i ) [ j,k ] = 0 , E ( X i ) j,k ] = 1 , E ( X i ) j,k ] = 0 , E ( X i ) j,k ] = 3.When j (cid:54) = k , (cid:16) E (cid:104) A , X i (cid:105) X i X (cid:62) i (cid:17) jk = E (cid:104) A , X i (cid:105) b (cid:88) l =1 ( X i ) [ j,l ] ( X i ) [ k,l ] = E b (cid:88) l =1 (cid:0) A [ j,l ] A [ k,l ] ( X i ) [ i,l ] ( X i ) [ k,l ] (cid:1) ( X i ) [ i,l ] ( X i ) [ k,l ] =2 b (cid:88) l =1 A [ j,l ] A [ k,l ] = 2( AA (cid:62) ) [ j,k ] ;when j = k , (cid:16) E (cid:104) A , X i (cid:105) X i X (cid:62) i (cid:17) [ j,j ] = E (cid:104) A , X i (cid:105) b (cid:88) l =1 ( X i ) j,l ] = E a (cid:88) j (cid:48) =1 b (cid:88) l (cid:48) =1 (cid:16) A j (cid:48) ,l (cid:48) ] ( X i ) j (cid:48) ,l (cid:48) ] (cid:17) · b (cid:88) l =1 ( X i ) j,l ] = a (cid:88) j (cid:48) =1 b (cid:88) l (cid:48) =1 ( A j (cid:48) ,l (cid:48) ] ) · b + 2 b (cid:88) l =1 A j,l ] = b (cid:107) A (cid:107) F + 2( AA (cid:62) ) [ j,j ] . Therefore, E (cid:104) A , X i (cid:105) X i X (cid:62) i = 2 AA (cid:62) + b (cid:107) A (cid:107) F I a , and (cid:13)(cid:13)(cid:13) E ( Z i − A )( Z i − A ) (cid:62) (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) AA (cid:62) + b (cid:107) A (cid:107) F I a + bσ I a − AA (cid:62) (cid:13)(cid:13)(cid:13) = (cid:107) A (cid:107) + b (cid:107) A (cid:107) F + bσ . (135)45imilarly, we can also show (cid:13)(cid:13)(cid:13) E ( Z i − A ) (cid:62) ( Z i − A ) (cid:13)(cid:13)(cid:13) = (cid:13)(cid:13)(cid:13) A (cid:62) A + a (cid:107) A (cid:107) F I b + σ I a − A (cid:62) A (cid:13)(cid:13)(cid:13) = (cid:107) A (cid:107) + a (cid:107) A (cid:107) F + aσ . (136)Next, we consider the spectral norm of Z i and aim to show that (cid:13)(cid:13) (cid:107) Z i − A (cid:107) (cid:13)(cid:13) ψ = inf u ≥ (cid:26) u : E exp (cid:18) (cid:107) Z i − A (cid:107) u (cid:19) ≤ (cid:27) ≤ C (cid:16) √ a + √ b (cid:17) (cid:113) (cid:107) A (cid:107) F + σ (137)for uniform constant C >
0. Note that (cid:104) A , X i (cid:105) + ε i ∼ N (cid:0) , (cid:107) A (cid:107) F + σ (cid:1) , X i is a randommatrix, by Gaussian tail bound inequality and random matrix theory (Corollary 5.35 in[122]), P (cid:18) |(cid:104) A , X i (cid:105) + ε i | ≥ t (cid:113) (cid:107) A (cid:107) F + σ (cid:19) ≤ − t / , P (cid:16) (cid:107) X i (cid:107) ≥ √ a + √ b + t (cid:17) ≤ exp( − t / . (138)We set u = C (cid:16) √ a + √ b (cid:17) (cid:113) (cid:107) A (cid:107) F + σ for large uniform constant C ≥
80. Thus, for any x ≥ P ( (cid:107) Z i − A (cid:107) ≥ xu ) ≤ P ( (cid:107) ( (cid:104) A , X i (cid:105) + ε i ) X i (cid:107) ≥ xu − (cid:107) A (cid:107) ) ≤ P (cid:32) (cid:107) ( (cid:104) A , X i (cid:105) + ε i ) X i (cid:107) ≥ xC ( √ a + √ b )2 (cid:113) (cid:107) A (cid:107) F + σ (cid:33) ≤ P (cid:32) |(cid:104) A , X i (cid:105) + ε i | ≥ (cid:114) xC · (cid:0) (cid:107) A (cid:107) F + σ (cid:1)(cid:33) + P (cid:32) (cid:107) X i (cid:107) ≥ (cid:114) xC · ( √ a + √ b ) (cid:33) (138) ≤ − C x/ . For any real valued function smooth g and non-negative random variable Y with density f Y , the following identity holds, E g ( Y ) = (cid:90) ∞ g (cid:48) ( y ) P ( Y ≥ y ) dy. Thus, E exp (cid:18) (cid:107) Z i − A (cid:107) u (cid:19) = (cid:90) ∞ exp ( x ) P (cid:18) (cid:107) Z i − A (cid:107) u ≥ x (cid:19) dx ≤ (cid:90) exp( u ) du + (cid:90) ∞ exp( x ) · − C x/ dx ≤ exp(1) − C / − ≤ , which implies (cid:13)(cid:13) (cid:107) Z i − A (cid:107) (cid:13)(cid:13) ψ ≤ C (cid:16) √ a + √ b (cid:17) (cid:113) (cid:107) A (cid:107) F + σ for some uniform constant C >
0. 46inally we apply the Bernstein-type matrix concentration inequality (c.f., Proposition2 in [68] and Theorem 4 in [67]), (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 Z i − A (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C max (cid:40) σ Z (cid:114) t + log( a + b ) n , ( √ a + √ b ) (cid:113) (cid:107) A (cid:107) F + σ log C ( √ a + √ b ) (cid:113) (cid:107) A (cid:107) F + σ σ Z · t + log( a + b ) n (cid:41) (139)with probability at least 1 − exp( − t ). Here, σ Z := max (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 E ( Z i − A )( Z i − A ) (cid:62) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 E ( Z i − A ) (cid:62) ( Z i − A ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) / = (cid:113) (cid:107) A (cid:107) + ( a ∨ b ) (cid:0) (cid:107) A (cid:107) F + σ (cid:1) . Noting that (cid:113) ( a ∨ b )( (cid:107) A (cid:107) F + σ ) ≤ σ Z ≤ (cid:113) ( a ∨ b + 1)( (cid:107) A (cid:107) F + σ ), (139) implies (134). (cid:3) Lemma 5 (Gaussian Ensemble Concentration Inequality for Vector) . Suppose x , . . . , x n iid ∼ N (0 , I m ) are i.i.d. m -dimensional random vectors, ε , . . . , ε n iid ∼ N (0 , σ ) , and a ∈ R m is afixed vector. Then P (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 ( (cid:104) x i , a (cid:105) + ε i ) x i − a (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C (cid:112) (cid:107) a (cid:107) + σ (cid:0) √ n + √ t (cid:1) (cid:0) √ m + √ t (cid:1) n (cid:33) ≥ − − t ) . Proof of Lemma 5.
Denote x i = ( x i , . . . , x im ) (cid:62) , i = 1 , . . . , n. Since the distribution of Gaussian random vectors are invariant after orthogonal transfor-mation, without loss of generality we assume a = ( θ, , . . . , n (cid:32) n (cid:88) i =1 (cid:104) x i , a (cid:105) + ε i (cid:33) x i − a = n (cid:80) ni =1 ( x i − θ n (cid:80) ni =1 x i θx i ... n (cid:80) ni =1 x i θx im + 1 n n (cid:88) i =1 ε i x i := h + 1 n n (cid:88) i =1 ε i x i ;Note that (cid:80) ni =1 x i ∼ χ n , by tail bounds of χ (c.f., [72, Lemma 1]), P (cid:32) n − √ nt ≤ n (cid:88) i =1 x i (cid:33) ≥ − exp( − t ) , P (cid:32) n (cid:88) i =1 x i ≤ n + 2 √ nt + 2 t (cid:33) ≥ − exp( − t ) . ξ := (cid:80) ni =1 x i , we have1 n n (cid:88) i =1 x i θx ik (cid:12)(cid:12)(cid:12) ξ ∼ N (cid:18) , θ ξn (cid:19) , k = 2 , . . . , n, (cid:107) h (cid:107) (cid:12)(cid:12)(cid:12) ξ ∼ (cid:18) ξn − (cid:19) θ + θ ξn χ m − . Thus, P (cid:107) h (cid:107) ≥ θ (cid:32)(cid:114) tn + tn (cid:33) + θ (cid:0) n + 2 √ nt + 2 t (cid:1) (cid:16) m − (cid:112) ( m − t + 2 t (cid:17) n ≤ P (cid:16) ξ ≥ n + 2 √ nt + 2 t (cid:17) + P (cid:16) ξ ≤ n − √ nt (cid:17) + P (cid:32) θ ξn χ m − ≥ θ ξ ( m − (cid:112) ( m − t + 2 t ) n (cid:33) ≤ − t ) . Conditioning on fixed values of (cid:107) ε (cid:107) = (cid:80) i ε i , (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 ε i x i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) ε (cid:107) ∼ σ (cid:107) ε (cid:107) n χ m . Additionally, P (cid:0) (cid:107) ε (cid:107) ≥ σ ( n + 2 √ nt + 2 t ) (cid:1) ≤ exp( − t ), which means P (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 ε i x i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≥ σ (cid:0) n + 2 √ nt + 2 t (cid:1) (cid:0) m + 2 √ mt + 2 t (cid:1) n ≤ P (cid:16) (cid:107) ε (cid:107) ≥ σ ( n + 2 √ nt + 2 t ) (cid:17) + P (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 ε i x i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:107) ε (cid:107) ≥ σ (cid:107) ε (cid:107) n (cid:16) m + 2 √ mt + 2 t (cid:17) ≤ − t ) . Combining the previous two inequalities, we finally obtain P (cid:32)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n n (cid:88) i =1 ( (cid:104) x i , a (cid:105) + ε i ) x i − a (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ≤ C √ θ + σ (cid:0) √ n + √ t (cid:1) (cid:0) √ m + √ t (cid:1) n (cid:33) ≥ − − t ) . for constant C > (cid:3) Lemma 6.
Suppose X , . . . , X n ∈ R a × b ( a ≤ b ) are i.i.d. standard Gaussian matrices, ξ , . . . , ξ n iid ∼ N (0 , τ ) , and E = n (cid:80) ni =1 ξ i X i . Then the largest and smallest singular valuesof E satisfies the following tail probability, P (cid:18) σ ( E ) ≥ τ n + 2 √ nx + 2 xn (cid:16) √ a + √ b + √ x (cid:17) (cid:19) ≤ − x ) , P (cid:18) σ ( E ) ≤ τ n − √ nxn (cid:16) √ b − √ a − √ x (cid:17) (cid:19) ≤ − x ) . roof of Lemma 6. In the given setting, (cid:107) ξ (cid:107) = (cid:80) ni =1 ξ i ∼ τ χ n , and E = 1 n n (cid:88) i =1 ξ i X i (cid:12)(cid:12)(cid:12) (cid:107) ξ (cid:107) iid ∼ N (cid:18) , (cid:107) ξ (cid:107) n (cid:19) . By Corollary 5.35 in [122], P (cid:18) σ ( E ) ≥ (cid:107) ξ (cid:107) n (cid:16) √ a + √ b + √ x (cid:17) (cid:12)(cid:12)(cid:12) (cid:107) ξ (cid:107) (cid:19) ≤ exp( − x ) , P (cid:18) σ ( E ) ≤ (cid:107) ξ (cid:107) n (cid:16) √ b − √ a − √ x (cid:17) (cid:12)(cid:12)(cid:12) (cid:107) ξ (cid:107) (cid:19) ≤ exp( − x ) . (140)By the tail bound of χ distribution (Lemma 1 in [72]), P (cid:0) (cid:107) ξ (cid:107) ≥ τ (cid:0) n + 2 √ nx + 2 x (cid:1)(cid:1) ≤ e − x , P (cid:0) (cid:107) ξ (cid:107) ≤ τ (cid:0) n − √ nx (cid:1)(cid:1) ≤ e − x . (141)By (140) and (141), we have P (cid:18) σ ( E ) ≥ τ n + 2 √ nx + 2 xn (cid:16) √ a + √ b + √ x (cid:17) (cid:19) ≤ P (cid:18) σ ( E ) ≥ (cid:107) ξ (cid:107) n (cid:16) √ a + √ b + √ x (cid:17) or (cid:107) ξ (cid:107) ≥ τ (cid:0) n + 2 √ nx + 2 x (cid:1)(cid:19) ≤ exp( − x ) + exp( − x ) = 2 exp( − x ); P (cid:18) σ ( E ) ≤ τ n − √ nxn (cid:16) √ b − √ a − √ x (cid:17) (cid:19) ≤ P (cid:18) σ ( E ) ≤ (cid:107) ξ (cid:107) n (cid:16) √ b − √ a − √ x (cid:17) or (cid:107) ξ (cid:107) ≤ τ (cid:0) n − √ nx (cid:1)(cid:19) ≤ exp( − x ) + exp( − x ) = 2 exp( − x ) . (cid:3) The next lemma provides an upper bound for the projection error after perturbation,which is useful in the singular subspace perturbation analysis in the proofs of the mainresults.
Lemma 7 (Projection error after perturbation). Suppose $A, Z$ are two matrices of the same dimension and $\widehat{U} = \mathrm{SVD}_r(A + Z)$. Then,
$$\big\|P_{\widehat{U}_\perp} A\big\| \leq \sigma_{r+1}(A) + 2\|Z\|, \qquad \big\|P_{\widehat{U}_\perp} A\big\|_{\rm F} \leq \sqrt{\sum_{k \geq r+1} \sigma_k^2(A)} + 2\|Z\|_{\rm F}.$$
In particular, when $\mathrm{rank}(A) \leq r$,
$$\big\|P_{\widehat{U}_\perp} A\big\| \leq 2\|Z\|, \qquad \big\|P_{\widehat{U}_\perp} A\big\|_{\rm F} \leq 2 \min\big\{ \|Z\|_{\rm F},\ \sqrt{r}\,\|Z\| \big\}.$$
Proof of Lemma 7. Suppose $A = \sum_k \sigma_k(A) u_k v_k^\top$ is the singular value decomposition. Then,
$$\begin{aligned}
\big\|P_{\widehat{U}_\perp} A\big\| &\leq \big\|P_{\widehat{U}_\perp}(A + Z)\big\| + \|Z\| = \sigma_{r+1}(A + Z) + \|Z\| = \min_{\mathrm{rank}(M) \leq r} \|A + Z - M\| + \|Z\| \\
&\leq \Big\| A + Z - \sum_{k=1}^r \sigma_k(A) u_k v_k^\top \Big\| + \|Z\| = \Big\| Z + \sum_{k \geq r+1} \sigma_k(A) u_k v_k^\top \Big\| + \|Z\| \leq \sigma_{r+1}(A) + 2\|Z\|.
\end{aligned}$$
Similarly,
$$\begin{aligned}
\big\|P_{\widehat{U}_\perp} A\big\|_{\rm F} &\leq \big\|P_{\widehat{U}_\perp}(A + Z)\big\|_{\rm F} + \big\|P_{\widehat{U}_\perp} Z\big\|_{\rm F} = \sqrt{\sum_{k \geq r+1} \sigma_k^2(A + Z)} + \|Z\|_{\rm F} = \min_{\mathrm{rank}(M) \leq r} \|A + Z - M\|_{\rm F} + \|Z\|_{\rm F} \\
&\leq \Big\| A + Z - \sum_{k=1}^r \sigma_k(A) u_k v_k^\top \Big\|_{\rm F} + \|Z\|_{\rm F} \leq \sqrt{\sum_{k \geq r+1} \sigma_k^2(A)} + 2\|Z\|_{\rm F}.
\end{aligned}$$
Finally, when $\mathrm{rank}(A) \leq r$, $\mathrm{rank}(P_{\widehat{U}_\perp} A) \leq \mathrm{rank}(A) \leq r$, and then
$$\big\|P_{\widehat{U}_\perp} A\big\|_{\rm F} \leq \min\left\{ \sqrt{\sum_{k \geq r+1} \sigma_k^2(A)} + 2\|Z\|_{\rm F},\ \sqrt{r}\,\big\|P_{\widehat{U}_\perp} A\big\| \right\} \leq 2 \min\big\{ \|Z\|_{\rm F},\ \sqrt{r}\,\|Z\| \big\}. \qquad \square$$
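A small numerical illustration of the $\mathrm{rank}(A) \leq r$ case of Lemma 7 (our own check; the dimensions and noise level below are arbitrary):

```python
import numpy as np
rng = np.random.default_rng(3)

p1, p2, r = 30, 20, 3
A = rng.standard_normal((p1, r)) @ rng.standard_normal((r, p2))      # rank-r matrix
Z = 0.1 * rng.standard_normal((p1, p2))                               # perturbation
U_hat = np.linalg.svd(A + Z)[0][:, :r]                                # SVD_r(A + Z)
P_perp_A = A - U_hat @ (U_hat.T @ A)                                  # P_{U_hat, perp} A

print(np.linalg.norm(P_perp_A, 2), 2 * np.linalg.norm(Z, 2))                  # first <= second
print(np.linalg.norm(P_perp_A, "fro"),
      2 * min(np.linalg.norm(Z, "fro"), np.sqrt(r) * np.linalg.norm(Z, 2)))   # first <= second
```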
Lemma 8 below provides an inequality for tensors after tensor-matrix product projections.

Lemma 8. Suppose $\mathcal{A} \in \mathbb{R}^{p_1\times\cdots\times p_d}$ is an order-$d$ tensor and $\widetilde{U}_k \in \mathbb{O}_{p_k, r_k}$, $k = 1, \ldots, d$, are orthogonal matrices. Let $\|\cdot\|_\bullet$ be a tensor norm that satisfies the sub-multiplicative inequality, i.e., $\|\mathcal{A}\times_k B\|_\bullet \le \|\mathcal{A}\|_\bullet\cdot\|B\|$ for any tensor $\mathcal{A}$ and matrix $B$ (in particular, the tensor Hilbert-Schmidt norm satisfies this condition). Then we have
\[
\left\| \llbracket \mathcal{A}; P_{\widetilde{U}_1}, \ldots, P_{\widetilde{U}_d} \rrbracket - \mathcal{A} \right\|_\bullet \le \sum_{k=1}^d \left\| \mathcal{A} \times_k P_{\widetilde{U}_{k\perp}} \right\|_\bullet.
\]
Specifically,
\[
\left\| \llbracket \mathcal{A}; P_{\widetilde{U}_1}, \ldots, P_{\widetilde{U}_d} \rrbracket - \mathcal{A} \right\|_{\rm HS}
= \left\| P_{(\widetilde{U}_d\otimes\cdots\otimes\widetilde{U}_1)_\perp} {\rm vec}(\mathcal{A}) \right\|_2
\le \sum_{k=1}^d \left\| \widetilde{U}_{k\perp}^\top \mathcal{M}_k(\mathcal{A}) \right\|_F.
\]

Proof of Lemma 8. Note that
\[
\begin{split}
\mathcal{A} & = \left\llbracket \mathcal{A}; \left(P_{\widetilde{U}_1} + P_{\widetilde{U}_{1\perp}}\right), \ldots, \left(P_{\widetilde{U}_d} + P_{\widetilde{U}_{d\perp}}\right) \right\rrbracket \\
& = \left\llbracket \mathcal{A}; P_{\widetilde{U}_1}, \ldots, P_{\widetilde{U}_d} \right\rrbracket
+ \left\llbracket \mathcal{A}; P_{\widetilde{U}_{1\perp}}, P_{\widetilde{U}_2}, \ldots, P_{\widetilde{U}_d} \right\rrbracket
+ \left\llbracket \mathcal{A}; I_{p_1}, P_{\widetilde{U}_{2\perp}}, P_{\widetilde{U}_3}, \ldots, P_{\widetilde{U}_d} \right\rrbracket
+ \cdots
+ \left\llbracket \mathcal{A}; I_{p_1}, I_{p_2}, \ldots, P_{\widetilde{U}_{d\perp}} \right\rrbracket.
\end{split}
\]
Since $\|P_{\widetilde{U}_k}\| \le 1$ and $\|P_{\widetilde{U}_{k\perp}}\| \le 1$, by the sub-multiplicative inequality,
\[
\begin{split}
\left\| \llbracket \mathcal{A}; P_{\widetilde{U}_1}, \ldots, P_{\widetilde{U}_d} \rrbracket - \mathcal{A} \right\|_\bullet
& \le \left\| \llbracket \mathcal{A}; P_{\widetilde{U}_{1\perp}}, P_{\widetilde{U}_2}, \ldots, P_{\widetilde{U}_d} \rrbracket \right\|_\bullet
+ \left\| \llbracket \mathcal{A}; I_{p_1}, P_{\widetilde{U}_{2\perp}}, P_{\widetilde{U}_3}, \ldots, P_{\widetilde{U}_d} \rrbracket \right\|_\bullet
+ \cdots
+ \left\| \llbracket \mathcal{A}; I_{p_1}, I_{p_2}, \ldots, P_{\widetilde{U}_{d\perp}} \rrbracket \right\|_\bullet \\
& \le \sum_{k=1}^d \left\| \mathcal{A} \times_k P_{\widetilde{U}_{k\perp}} \right\|_\bullet.
\end{split}
\]
Specifically for the Hilbert-Schmidt norm,
\[
\left\| P_{(\widetilde{U}_d\otimes\cdots\otimes\widetilde{U}_1)_\perp} {\rm vec}(\mathcal{A}) \right\|_2
= \left\| P_{(\widetilde{U}_d\otimes\cdots\otimes\widetilde{U}_1)} {\rm vec}(\mathcal{A}) - {\rm vec}(\mathcal{A}) \right\|_2
= \left\| \llbracket \mathcal{A}; P_{\widetilde{U}_1}, \ldots, P_{\widetilde{U}_d} \rrbracket - \mathcal{A} \right\|_{\rm HS}
\le \sum_{k=1}^d \left\| \mathcal{A}\times_k P_{\widetilde{U}_{k\perp}} \right\|_{\rm HS}
= \sum_{k=1}^d \left\| \mathcal{M}_k\left( \mathcal{A}\times_k P_{\widetilde{U}_{k\perp}} \right) \right\|_F
= \sum_{k=1}^d \left\| \widetilde{U}_{k\perp}^\top \mathcal{M}_k(\mathcal{A}) \right\|_F.
\]
Therefore, we have finished the proof of Lemma 8. $\square$
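To illustrate the Hilbert-Schmidt case of Lemma 8, the snippet below builds a random order-3 tensor, projects it onto random orthonormal factors, and compares the two sides of the inequality. The mode-product and matricization helpers are minimal implementations we supply for this check, and all dimensions are arbitrary.

# Numerical check of the Hilbert-Schmidt case of Lemma 8 for a random order-3 tensor.
# Illustrative sketch; dimensions, ranks, and the helper functions are our own choices.
import numpy as np

rng = np.random.default_rng(2)
p, r = (8, 9, 10), (2, 3, 4)
A = rng.standard_normal(p)

def mode_mult(T, M, k):
    """Mode-k product T x_k M (M acts on the k-th axis of T)."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, k)), 0, k)

def unfold(T, k):
    """Mode-k matricization M_k(T)."""
    return np.moveaxis(T, k, 0).reshape(T.shape[k], -1)

U = [np.linalg.qr(rng.standard_normal((p[k], r[k])))[0] for k in range(3)]   # orthonormal U_k
proj = A.copy()
for k in range(3):
    proj = mode_mult(proj, U[k] @ U[k].T, k)            # [[A; P_U1, P_U2, P_U3]]

lhs = np.linalg.norm(proj - A)                           # Hilbert-Schmidt norm of the difference
rhs = sum(np.linalg.norm(unfold(A, k) - U[k] @ (U[k].T @ unfold(A, k))) for k in range(3))
print(lhs, rhs)                                          # lhs <= rhs should hold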
The next Lemma 9 introduces a useful inequality for the tensor projected orthogonally to a Cross structure (i.e., $\widetilde{U}$ in the statement below).

Lemma 9. Suppose $\mathcal{A} = \llbracket \mathcal{S}; U_1, U_2, U_3 \rrbracket$ is a rank-$(r_1, r_2, r_3)$ tensor. Let $U_k \in \mathbb{O}_{p_k, r_k}$ and $W_k \in \mathbb{O}_{p_{k+1}p_{k+2}, r_k}$ be the left and right singular subspaces of $\mathcal{M}_k(\mathcal{A}) := A_k$, respectively. Suppose $\widetilde{U}_k \in \mathbb{O}_{p_k, r_k}$ and
\[
\widetilde{W}_1 = (\widetilde{U}_3\otimes\widetilde{U}_2)\widetilde{V}_1 \in \mathbb{O}_{p_2p_3, r_1}, \quad
\widetilde{W}_2 = (\widetilde{U}_3\otimes\widetilde{U}_1)\widetilde{V}_2 \in \mathbb{O}_{p_1p_3, r_2}, \quad
\widetilde{W}_3 = (\widetilde{U}_2\otimes\widetilde{U}_1)\widetilde{V}_3 \in \mathbb{O}_{p_1p_2, r_3}
\]
are sample estimates of $U_k$ and $W_k$, respectively. Assume $\widetilde{U}_k$ and $\widetilde{W}_k$ satisfy
\[
\|\sin\Theta(\widetilde{U}_k, U_k)\| \le \theta_k, \qquad \|\widetilde{U}_{k\perp}^\top A_k\|_F \le \eta_k, \qquad \|A_k\widetilde{W}_{k\perp}\|_F \le \xi_k, \qquad k = 1, 2, 3.
\]
Let
\[
\widetilde{U} = \left[ \widetilde{U}_3\otimes\widetilde{U}_2\otimes\widetilde{U}_1,\; \mathcal{R}_1(\widetilde{W}_1\otimes\widetilde{U}_{1\perp}),\; \mathcal{R}_2(\widetilde{W}_2\otimes\widetilde{U}_{2\perp}),\; \mathcal{R}_3(\widetilde{W}_3\otimes\widetilde{U}_{3\perp}) \right],
\]
where $\mathcal{R}_k(\cdot)$ is the row-permutation operator that matches the row indices of $\widetilde{W}_k\otimes\widetilde{U}_{k\perp}$ to ${\rm vec}(\mathcal{A})$; the exact definitions of $\mathcal{R}_k$ are provided in Section A in the supplementary materials. Recall $\widetilde{U}_\perp$ is the orthogonal complement of $\widetilde{U}$. Then, with the subscripts interpreted modulo 3,
\[
\|P_{\widetilde{U}_\perp}{\rm vec}(\mathcal{A})\|_2^2 \le \sum_{k=1,2,3}\left( \theta_k^2\xi_k^2 + \min\{\theta_{k+1}^2\eta_{k+2}^2, \theta_{k+2}^2\eta_{k+1}^2\} \right) + \min\{\eta_1^2\theta_2^2\theta_3^2, \theta_1^2\eta_2^2\theta_3^2, \theta_1^2\theta_2^2\eta_3^2\}.
\]

Proof of Lemma 9. By construction,
\[
\widetilde{U} = \left[ \widetilde{U}_3\otimes\widetilde{U}_2\otimes\widetilde{U}_1,\; \mathcal{R}_1(\widetilde{W}_1\otimes\widetilde{U}_{1\perp}),\; \mathcal{R}_2(\widetilde{W}_2\otimes\widetilde{U}_{2\perp}),\; \mathcal{R}_3(\widetilde{W}_3\otimes\widetilde{U}_{3\perp}) \right] \in \mathbb{O}_{p_1p_2p_3, m},
\]
where $m = r_1r_2r_3 + (p_1-r_1)r_1 + (p_2-r_2)r_2 + (p_3-r_3)r_3$. Denote
\[
\begin{split}
\widetilde{U}_{B_1} & = \mathcal{R}_1\left( \left((\widetilde{U}_3\otimes\widetilde{U}_2)\widetilde{V}_{1\perp}\right)\otimes\widetilde{U}_{1\perp} \right) \in \mathbb{O}_{p_1p_2p_3, (p_1-r_1)(r_2r_3-r_1)}, \\
\widetilde{U}_{B_2} & = \mathcal{R}_2\left( \left((\widetilde{U}_3\otimes\widetilde{U}_1)\widetilde{V}_{2\perp}\right)\otimes\widetilde{U}_{2\perp} \right) \in \mathbb{O}_{p_1p_2p_3, (p_2-r_2)(r_1r_3-r_2)}, \\
\widetilde{U}_{B_3} & = \mathcal{R}_3\left( \left((\widetilde{U}_2\otimes\widetilde{U}_1)\widetilde{V}_{3\perp}\right)\otimes\widetilde{U}_{3\perp} \right) \in \mathbb{O}_{p_1p_2p_3, (p_3-r_3)(r_1r_2-r_3)},
\end{split} \tag{142}
\]
\[
\widetilde{U}_{C_1} = \widetilde{U}_{3\perp}\otimes\widetilde{U}_{2\perp}\otimes\widetilde{U}_1 \in \mathbb{O}_{p_1p_2p_3, r_1(p_2-r_2)(p_3-r_3)}; \quad
\widetilde{U}_{C_2} = \widetilde{U}_{3\perp}\otimes\widetilde{U}_2\otimes\widetilde{U}_{1\perp} \in \mathbb{O}_{p_1p_2p_3, r_2(p_1-r_1)(p_3-r_3)}; \quad
\widetilde{U}_{C_3} = \widetilde{U}_3\otimes\widetilde{U}_{2\perp}\otimes\widetilde{U}_{1\perp} \in \mathbb{O}_{p_1p_2p_3, r_3(p_1-r_1)(p_2-r_2)}; \tag{143}
\]
\[
\widetilde{U}_* = \widetilde{U}_{3\perp}\otimes\widetilde{U}_{2\perp}\otimes\widetilde{U}_{1\perp} \in \mathbb{O}_{p_1p_2p_3, (p_1-r_1)(p_2-r_2)(p_3-r_3)}. \tag{144}
\]
Then it is not hard to verify that $[\widetilde{U}_{B_1}, \widetilde{U}_{B_2}, \widetilde{U}_{B_3}, \widetilde{U}_{C_1}, \widetilde{U}_{C_2}, \widetilde{U}_{C_3}, \widetilde{U}_*]$ forms an orthogonal complement of $\widetilde{U}$. Thus, we have the following decomposition:
\[
\|P_{\widetilde{U}_\perp}{\rm vec}(\mathcal{A})\|_2^2 = \sum_{k=1,2,3}\|P_{\widetilde{U}_{B_k}}{\rm vec}(\mathcal{A})\|_2^2 + \sum_{k=1,2,3}\|P_{\widetilde{U}_{C_k}}{\rm vec}(\mathcal{A})\|_2^2 + \|P_{\widetilde{U}_*}{\rm vec}(\mathcal{A})\|_2^2.
\]
We analyze each term separately as follows.

• Since $\left[ (\widetilde{U}_3\otimes\widetilde{U}_2)\widetilde{V}_1, (\widetilde{U}_3\otimes\widetilde{U}_2)\widetilde{V}_{1\perp}, (\widetilde{U}_3\otimes\widetilde{U}_2)_\perp \right]$ is a square orthogonal matrix, we know $\left[ (\widetilde{U}_3\otimes\widetilde{U}_2)\widetilde{V}_{1\perp}, (\widetilde{U}_3\otimes\widetilde{U}_2)_\perp \right]$ is an orthogonal complement of $\widetilde{W}_1$. Given that the left and right singular subspaces of $A_1$ are $U_1$ and $W_1$, we have
\[
\begin{split}
\|P_{\widetilde{U}_{B_1}}{\rm vec}(\mathcal{A})\|_2
& \overset{(142)}{=} \left\| \widetilde{U}_{1\perp}^\top A_1\left( (\widetilde{U}_3\otimes\widetilde{U}_2)\widetilde{V}_{1\perp} \right) \right\|_F
\le \left\| \widetilde{U}_{1\perp}^\top A_1\widetilde{W}_{1\perp} \right\|_F
= \left\| \widetilde{U}_{1\perp}^\top U_1U_1^\top A_1\widetilde{W}_{1\perp} \right\|_F \\
& \le \|\widetilde{U}_{1\perp}^\top U_1\|\cdot\|A_1\widetilde{W}_{1\perp}\|_F
\le \|\sin\Theta(\widetilde{U}_1, U_1)\|\cdot\|A_1\widetilde{W}_{1\perp}\|_F \le \theta_1\xi_1.
\end{split}
\]
Similar inequalities also hold for $\|P_{\widetilde{U}_{B_2}}{\rm vec}(\mathcal{A})\|_2$ and $\|P_{\widetilde{U}_{B_3}}{\rm vec}(\mathcal{A})\|_2$, so $\|P_{\widetilde{U}_{B_k}}{\rm vec}(\mathcal{A})\|_2 \le \theta_k\xi_k$ for $k = 1, 2, 3$.

• For the blocks in (143),
\[
\begin{split}
\|P_{\widetilde{U}_{C_1}}{\rm vec}(\mathcal{A})\|_2
& = \|\widetilde{U}_{C_1}^\top{\rm vec}(\mathcal{A})\|_2
= \left\| \mathcal{A}\times_1\widetilde{U}_1^\top\times_2\widetilde{U}_{2\perp}^\top\times_3\widetilde{U}_{3\perp}^\top \right\|_{\rm HS}
= \left\| \widetilde{U}_{2\perp}^\top A_2(\widetilde{U}_{3\perp}\otimes\widetilde{U}_1) \right\|_F
= \left\| \widetilde{U}_{2\perp}^\top U_2U_2^\top A_2(\widetilde{U}_{3\perp}\otimes\widetilde{U}_1) \right\|_F \\
& \le \|\widetilde{U}_{2\perp}^\top U_2\|\cdot\|U_2^\top A_2(\widetilde{U}_{3\perp}\otimes\widetilde{U}_1)\|_F
\le \|\sin\Theta(\widetilde{U}_2, U_2)\|\cdot\|A_2(\widetilde{U}_{3\perp}\otimes\widetilde{U}_1)\|_F \\
& = \theta_2\cdot\left\| \mathcal{A}\times_1\widetilde{U}_1^\top\times_3\widetilde{U}_{3\perp}^\top \right\|_{\rm HS}
\le \theta_2\cdot\|\widetilde{U}_{3\perp}^\top A_3\|_F \le \theta_2\eta_3.
\end{split}
\]
By symmetry, $\|P_{\widetilde{U}_{C_1}}{\rm vec}(\mathcal{A})\|_2 \le \theta_3\eta_2$ as well, and similar inequalities also hold for $\|P_{\widetilde{U}_{C_2}}{\rm vec}(\mathcal{A})\|_2$ and $\|P_{\widetilde{U}_{C_3}}{\rm vec}(\mathcal{A})\|_2$. Therefore,
\[
\|P_{\widetilde{U}_{C_k}}{\rm vec}(\mathcal{A})\|_2 \le \min\{\theta_{k+1}\eta_{k+2}, \theta_{k+2}\eta_{k+1}\}, \quad k = 1, 2, 3. \tag{145}
\]

• Similarly to the previous part,
\[
\begin{split}
\|P_{\widetilde{U}_*}{\rm vec}(\mathcal{A})\|_2
& \le \left\| \mathcal{A}\times_1\widetilde{U}_{1\perp}^\top\times_2\widetilde{U}_{2\perp}^\top\times_3\widetilde{U}_{3\perp}^\top \right\|_{\rm HS}
= \left\| \widetilde{U}_{1\perp}^\top A_1(\widetilde{U}_{3\perp}\otimes\widetilde{U}_{2\perp}) \right\|_F
= \left\| \widetilde{U}_{1\perp}^\top A_1(U_3\otimes U_2)(U_3\otimes U_2)^\top(\widetilde{U}_{3\perp}\otimes\widetilde{U}_{2\perp}) \right\|_F \\
& \le \left\| \widetilde{U}_{1\perp}^\top A_1(U_3\otimes U_2) \right\|_F\cdot\left\| (U_3^\top\widetilde{U}_{3\perp})\otimes(U_2^\top\widetilde{U}_{2\perp}) \right\|
\le \|\widetilde{U}_{1\perp}^\top A_1\|_F\cdot\|U_2^\top\widetilde{U}_{2\perp}\|\cdot\|U_3^\top\widetilde{U}_{3\perp}\| \le \eta_1\theta_2\theta_3.
\end{split}
\]
Similar upper bounds of $\theta_1\eta_2\theta_3$ and $\theta_1\theta_2\eta_3$ also hold. Thus,
\[
\|P_{\widetilde{U}_*}{\rm vec}(\mathcal{A})\|_2 \le \min\{\eta_1\theta_2\theta_3, \theta_1\eta_2\theta_3, \theta_1\theta_2\eta_3\}.
\]
In summary,
\[
\|P_{\widetilde{U}_\perp}{\rm vec}(\mathcal{A})\|_2^2 \le \sum_{k=1,2,3}\left( \theta_k^2\xi_k^2 + \min\{\theta_{k+1}^2\eta_{k+2}^2, \theta_{k+2}^2\eta_{k+1}^2\} \right) + \min\{\eta_1^2\theta_2^2\theta_3^2, \theta_1^2\eta_2^2\theta_3^2, \theta_1^2\theta_2^2\eta_3^2\}. \qquad\square
\]
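The full statement of Lemma 9 involves the row-permutation operators R_k defined in the supplementary materials, so we do not verify it end-to-end here. Instead, the following sketch (an illustration we add, with arbitrarily chosen dimensions) numerically checks the key sin-Theta step used in the first bullet of the proof, which also drives the other bullets.

# Numerical check of the inequality || U1_hat_perp^T A_1 W_perp ||_F <=
# ||sin Theta(U1_hat, U1)|| * || A_1 W_perp ||_F when the column space of A_1 is span(U1).
# Illustrative sketch only; p1, q (playing the role of p2*p3), r1, and the perturbation
# scale are arbitrary assumptions, and W_perp is an arbitrary orthonormal matrix.
import numpy as np

rng = np.random.default_rng(3)
p1, q, r1 = 30, 50, 4
U1 = np.linalg.qr(rng.standard_normal((p1, r1)))[0]
A1 = U1 @ rng.standard_normal((r1, q))                               # left singular subspace of A_1 is span(U1)

U1_hat = np.linalg.qr(U1 + 0.1 * rng.standard_normal((p1, r1)))[0]   # perturbed estimate of U1
U1_hat_perp = np.linalg.svd(U1_hat, full_matrices=True)[0][:, r1:]   # orthonormal basis of its complement
W_perp = np.linalg.qr(rng.standard_normal((q, q - r1)))[0]           # an arbitrary orthonormal "W_perp"

sin_theta = np.linalg.norm(U1_hat_perp.T @ U1, 2)                    # ||sin Theta(U1_hat, U1)||
lhs = np.linalg.norm(U1_hat_perp.T @ A1 @ W_perp, 'fro')
rhs = sin_theta * np.linalg.norm(A1 @ W_perp, 'fro')
print(lhs, rhs)                                                      # lhs <= rhs should hold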
The following lemma discusses the Bayes risk of regular linear regression. Though it is a standard result in statistical decision theory (c.f., Exercise 5.8, p. 403 in [74]), we present the proof here for completeness.

Lemma 10. Consider the linear regression model $y = X\beta + \varepsilon$. Here, $\varepsilon \overset{iid}{\sim} N(0, \sigma^2)$; the parameter $\beta$ is generated from a prior distribution $\beta \overset{iid}{\sim} N(0, \tau^2)$. We aim to estimate $\beta$ based on $(y, X)$ with the minimal $\ell_2$ risk. Then the Bayes estimator for $\beta$ and the corresponding Bayes risk are
\[
\widehat{\beta} = \left( \frac{\sigma^2}{\tau^2}I + X^\top X \right)^{-1}X^\top y
\quad\text{and}\quad
\mathbb{E}\left( \|\widehat{\beta} - \beta\|_2^2 \,\big|\, X \right) = {\rm tr}\left( \left( \frac{I}{\tau^2} + \frac{X^\top X}{\sigma^2} \right)^{-1} \right).
\]

Proof of Lemma 10. When $\beta \overset{iid}{\sim} N(0, \tau^2)$ and $\varepsilon \overset{iid}{\sim} N(0, \sigma^2)$,
\[
\begin{split}
p\left( \beta \mid X, y \right) & \propto p(y\mid X, \beta)\cdot p(\beta) \propto \exp\left( -\|y - X\beta\|_2^2/(2\sigma^2) \right)\cdot\exp\left( -\beta^\top\beta/(2\tau^2) \right) \\
& \propto \exp\left( -\frac{\beta^\top\beta}{2\tau^2} - \frac{\beta^\top X^\top X\beta}{2\sigma^2} + \frac{y^\top X\beta}{\sigma^2} \right) \\
& \propto \exp\left( -\frac{1}{2}\left\| \left( \frac{I}{\tau^2} + \frac{X^\top X}{\sigma^2} \right)^{-1/2}\frac{X^\top y}{\sigma^2} - \left( \frac{I}{\tau^2} + \frac{X^\top X}{\sigma^2} \right)^{1/2}\beta \right\|_2^2 \right). \tag{146}
\end{split}
\]
Thus, the posterior distribution of $\beta$ is
\[
\beta \mid X, y \sim N\left( \left( \frac{\sigma^2}{\tau^2}I + X^\top X \right)^{-1}X^\top y,\; \left( \frac{I}{\tau^2} + \frac{X^\top X}{\sigma^2} \right)^{-1} \right).
\]
Then the Bayes estimator, i.e., the posterior mean, and the corresponding Bayes risk are
\[
\widehat{\beta} = \mathbb{E}(\beta\mid X, y) = \left( \frac{\sigma^2}{\tau^2}I + X^\top X \right)^{-1}X^\top y, \qquad
\mathbb{E}\left( \|\widehat{\beta} - \beta\|_2^2 \mid X, y \right) = {\rm tr}\left( \left( \frac{I}{\tau^2} + \frac{X^\top X}{\sigma^2} \right)^{-1} \right),
\]
respectively. Thus, we have finished the proof of this lemma. $\square$
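As an illustration of Lemma 10 (added here for concreteness, not from the paper), the snippet below compares the closed-form Bayes risk with a Monte Carlo estimate obtained by repeatedly drawing beta from the prior for one fixed design; n, p, sigma, and tau are arbitrary.

# Monte Carlo check of the Bayes estimator and Bayes risk formula in Lemma 10.
# Illustrative sketch; the simulation settings are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma, tau = 50, 10, 0.5, 2.0
X = rng.standard_normal((n, p))                       # condition on one fixed design X

risk_formula = np.trace(np.linalg.inv(np.eye(p) / tau**2 + X.T @ X / sigma**2))

A = np.linalg.inv(sigma**2 / tau**2 * np.eye(p) + X.T @ X)    # (sigma^2/tau^2 I + X^T X)^{-1}
reps, err = 5000, 0.0
for _ in range(reps):
    beta = tau * rng.standard_normal(p)               # beta ~ N(0, tau^2 I)
    y = X @ beta + sigma * rng.standard_normal(n)     # y = X beta + eps
    beta_hat = A @ (X.T @ y)                          # Bayes estimator (posterior mean)
    err += np.sum((beta_hat - beta) ** 2)

print(err / reps, risk_formula)                       # the two numbers should be close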
The following lemma provides a deterministic bound for the group Lasso estimator under the group restricted isometry property.

Lemma 11. Suppose $X \in \mathbb{R}^{n\times pr}$, $\{G_1, \ldots, G_p\}$ is a partition of $\{1, \ldots, pr\}$, and $|G_1| = \cdots = |G_p| = r$. Assume $X$ satisfies the following group restricted isometry condition at group sparsity level $2s$:
\[
(1-\delta)n\|\beta\|_2^2 \le \|X\beta\|_2^2 \le (1+\delta)n\|\beta\|_2^2, \quad \forall\beta \text{ such that } \sum_{i=1}^p 1_{\{\beta_{G_i}\ne 0\}} \le 2s.
\]
Suppose $y = X\beta + \varepsilon$ and $\sum_{i=1}^p 1_{\{\beta_{G_i}\ne 0\}} \le s$. Consider the following group Lasso estimator:
\[
\widehat{\beta} = \mathop{\arg\min}_{\gamma\in\mathbb{R}^{pr}}\left\{ \frac{1}{2}\|y - X\gamma\|_2^2 + \eta\sum_{i=1}^p\|\gamma_{G_i}\|_2 \right\}. \tag{147}
\]
For $\eta \ge 3\max_{1\le j\le p}\|(X_{[:,G_j]})^\top\varepsilon\|_2$ and $\delta < 2/7$, the optimal solution of (147) yields
\[
\|\widehat{\beta} - \beta\|_2 \le \frac{4\eta\sqrt{s/3}}{n(1 - 7\delta/2)}. \tag{148}
\]

Proof of Lemma 11. For convenience, define the $(2,\infty)$- and $(2,1)$-norms of $v\in\mathbb{R}^{pr}$ as
\[
\|v\|_{2,\infty} = \max_{j=1,\ldots,p}\|v_{G_j}\|_2 \quad\text{and}\quad \|v\|_{2,1} = \sum_{j=1}^p\|v_{G_j}\|_2.
\]
Then $\|\cdot\|_{2,\infty}$ and $\|\cdot\|_{2,1}$ satisfy $\|v\|_{2,\infty}\cdot\|w\|_{2,1} \ge \langle v, w\rangle$. We also define $J = \{j: \beta_{G_j}\ne 0\}$ as the group support of $\beta$; then $|J|\le s$ by assumption. Let $h = \widehat{\beta} - \beta \in \mathbb{R}^{pr}$. By definition,
\[
\frac{1}{2}\|y - X\widehat{\beta}\|_2^2 + \eta\|\widehat{\beta}\|_{2,1} \le \frac{1}{2}\|y - X\beta\|_2^2 + \eta\|\beta\|_{2,1}.
\]
Noting that
\[
\frac{1}{2}\left( \|y - X\widehat{\beta}\|_2^2 - \|y - X\beta\|_2^2 \right) = \frac{1}{2}\left( \|\varepsilon - Xh\|_2^2 - \|\varepsilon\|_2^2 \right) = -\frac{1}{2}(2\varepsilon - Xh)^\top(Xh) \ge -\varepsilon^\top Xh \ge -\|X^\top\varepsilon\|_{2,\infty}\cdot\|h\|_{2,1} = -\|X^\top\varepsilon\|_{2,\infty}\left( \|h_J\|_{2,1} + \|h_{J^c}\|_{2,1} \right),
\]
\[
\eta\left( \|\beta\|_{2,1} - \|\widehat{\beta}\|_{2,1} \right) = \eta\left( \|\beta_J\|_{2,1} - \|\widehat{\beta}_J\|_{2,1} - \|\widehat{\beta}_{J^c}\|_{2,1} \right) \le \eta\left( \|h_J\|_{2,1} - \|h_{J^c}\|_{2,1} \right),
\]
we have
\[
-\|X^\top\varepsilon\|_{2,\infty}\left( \|h_J\|_{2,1} + \|h_{J^c}\|_{2,1} \right) \le \eta\left( \|h_J\|_{2,1} - \|h_{J^c}\|_{2,1} \right)
\;\Rightarrow\;
\|h_{J^c}\|_{2,1} \le \frac{\eta + \|X^\top\varepsilon\|_{2,\infty}}{\eta - \|X^\top\varepsilon\|_{2,\infty}}\|h_J\|_{2,1}.
\]
Given $\eta \ge 3\|X^\top\varepsilon\|_{2,\infty}$, we have
\[
\|h_{J^c}\|_{2,1} \le 2\|h_J\|_{2,1}. \tag{149}
\]
Now we sort all groups of $h$ by their $\ell_2$ norm and suppose $\|h_{G_{i_1}}\|_2 \ge \cdots \ge \|h_{G_{i_p}}\|_2$, where $\{i_1, \ldots, i_p\}$ is a permutation of $\{1, \ldots, p\}$. Let $h_{\max(s)}\in\mathbb{R}^{pr}$,
\[
(h_{\max(s)})_j = \begin{cases} h_j, & j\in G_{i_1}\cup\cdots\cup G_{i_s}; \\ 0, & \text{otherwise}. \end{cases}
\]
Then $h_{\max(s)}$ is the vector $h$ with all but the $s$ largest groups in $\ell_2$ norm set to zero. We also denote $h_{-\max(s)} = h - h_{\max(s)}$. Then (149) implies
\[
\|h_{-\max(s)}\|_{2,1} \le \|h_{J^c}\|_{2,1} \le 2\|h_J\|_{2,1} \le 2\|h_{\max(s)}\|_{2,1}. \tag{150}
\]
Let $v\in\mathbb{R}^p$ with $v_i = \|h_{G_i}\|_2$, $1\le i\le p$, collect the $\ell_2$ norms of the groups of $h$. We can similarly define $v_{\max(s)}$ as the vector $v$ with all but the $s$ largest entries set to zero, and $v_{-\max(s)} = v - v_{\max(s)}$. Then $(v_{\max(s)})_i = \|(h_{\max(s)})_{G_i}\|_2$ and $(v_{-\max(s)})_i = \|(h_{-\max(s)})_{G_i}\|_2$. Let
\[
\alpha = \max\left\{ \|h_{-\max(s)}\|_{2,\infty},\; \|h_{-\max(s)}\|_{2,1}/s \right\} = \max\left\{ \|v_{-\max(s)}\|_\infty,\; \|v_{-\max(s)}\|_1/s \right\}.
\]
By the polytope representation lemma (Lemma 1 in [17]) with $\alpha$, one can find a finite series of vectors $v^{(1)}, \ldots, v^{(N)}\in\mathbb{R}^p$ and weights $\pi_1, \ldots, \pi_N$ such that
\[
{\rm supp}(v^{(j)}) \subseteq {\rm supp}(v_{-\max(s)}), \quad \|v^{(j)}\|_0 \le s, \quad \|v^{(j)}\|_\infty \le \alpha, \quad \|v^{(j)}\|_1 = \|v_{-\max(s)}\|_1,
\quad
v_{-\max(s)} = \sum_{j=1}^N\pi_jv^{(j)}, \quad 0\le\pi_j\le 1, \quad \sum_{j=1}^N\pi_j = 1.
\]
Now we construct $h^{(j)}\in\mathbb{R}^{pr}$, where
\[
(h^{(j)})_{G_i} = \frac{(h_{-\max(s)})_{G_i}}{\|(h_{-\max(s)})_{G_i}\|_2}\cdot v^{(j)}_i, \quad i = 1, \ldots, p;\; j = 1, \ldots, N. \tag{151}
\]
Then $\{h^{(j)}\}_{j=1}^N$ satisfy
\[
{\rm supp}(h^{(j)}) \subseteq {\rm supp}(h_{-\max(s)}), \quad \sum_{i=1}^p 1_{\{(h^{(j)})_{G_i}\ne 0\}} \le s, \quad \|h^{(j)}\|_{2,\infty} \le \alpha, \quad \|h^{(j)}\|_{2,1} = \|h_{-\max(s)}\|_{2,1},
\quad
h_{-\max(s)} = \sum_{j=1}^N\pi_jh^{(j)}, \quad 0\le\pi_j\le 1, \quad \sum_{j=1}^N\pi_j = 1. \tag{152}
\]
Therefore, $h_{\max(s)}$ and $h^{(j)}$ have disjoint supports, $\sum_{i=1}^p 1_{\{(h_{\max(s)}+h^{(j)})_{G_i}\ne 0\}} \le 2s$, $\|h_{\max(s)}\pm h^{(j)}\|_2^2 = \|h_{\max(s)}\|_2^2 + \|h^{(j)}\|_2^2$, and
\[
\|h^{(j)}\|_2^2 \le \|h^{(j)}\|_{2,1}\cdot\|h^{(j)}\|_{2,\infty} \overset{(152)}{\le} \|h_{-\max(s)}\|_{2,1}\cdot\alpha \overset{(150)}{\le} 2\|h_{\max(s)}\|_{2,1}\cdot\max\left\{ \min_{j:(h_{\max(s)})_{G_j}\ne 0}\|(h_{\max(s)})_{G_j}\|_2,\; \frac{2\|h_{\max(s)}\|_{2,1}}{s} \right\} \le \frac{4\|h_{\max(s)}\|_{2,1}^2}{s} \le 4\|h_{\max(s)}\|_2^2.
\]
Thus,
\[
\begin{split}
\left| \langle Xh_{\max(s)}, Xh_{-\max(s)}\rangle \right|
& \le \sum_{j=1}^N\pi_j\left| \langle Xh_{\max(s)}, Xh^{(j)}\rangle \right|
= \sum_{j=1}^N\pi_j\frac{1}{4}\left| \|Xh_{\max(s)} + Xh^{(j)}\|_2^2 - \|Xh_{\max(s)} - Xh^{(j)}\|_2^2 \right| \\
& \le \sum_{j=1}^N\pi_j\frac{1}{4}\left( n(1+\delta)\left( \|h_{\max(s)}\|_2^2 + \|h^{(j)}\|_2^2 \right) - n(1-\delta)\left( \|h_{\max(s)}\|_2^2 + \|h^{(j)}\|_2^2 \right) \right) \\
& \le \frac{\delta n}{2}\left( \|h_{\max(s)}\|_2^2 + 4\|h_{\max(s)}\|_2^2 \right) = \frac{5\delta n}{2}\|h_{\max(s)}\|_2^2,
\end{split}
\]
and
\[
\langle Xh_{\max(s)}, Xh\rangle = \|Xh_{\max(s)}\|_2^2 + \langle Xh_{\max(s)}, Xh_{-\max(s)}\rangle \ge n(1-\delta)\|h_{\max(s)}\|_2^2 - \frac{5\delta n}{2}\|h_{\max(s)}\|_2^2 = n\left(1 - \frac{7\delta}{2}\right)\|h_{\max(s)}\|_2^2. \tag{153}
\]
Next, by the KKT condition of $\widehat{\beta}$ being the optimizer of (147), $\|X^\top(y - X\widehat{\beta})\|_{2,\infty} \le \eta$. In addition, $\|X^\top(y - X\beta)\|_{2,\infty} = \|X^\top\varepsilon\|_{2,\infty} \le \eta/3$, which means
\[
\langle Xh_{\max(s)}, Xh\rangle = h_{\max(s)}^\top X^\top Xh \le \|h_{\max(s)}\|_{2,1}\cdot\|X^\top Xh\|_{2,\infty}
\le \|h_{\max(s)}\|_{2,1}\cdot\left( \|X^\top(y - X\widehat{\beta})\|_{2,\infty} + \|X^\top(y - X\beta)\|_{2,\infty} \right)
\le \frac{4\eta}{3}\|h_{\max(s)}\|_{2,1} \le \frac{4\eta}{3}\sqrt{s}\,\|h_{\max(s)}\|_2. \tag{154}
\]
Combining the above inequality with (153), one has
\[
\frac{4\eta}{3}\sqrt{s}\,\|h_{\max(s)}\|_2 \ge n\left(1 - \frac{7\delta}{2}\right)\|h_{\max(s)}\|_2^2, \quad\text{namely}\quad \|h_{\max(s)}\|_2 \le \frac{4\eta\sqrt{s}}{3n(1 - 7\delta/2)}.
\]
Finally,
\[
\|h_{-\max(s)}\|_2^2 \le \|h_{-\max(s)}\|_{2,1}\cdot\|h_{-\max(s)}\|_{2,\infty} \le 2\|h_{\max(s)}\|_{2,1}\cdot\min_{j:(h_{\max(s)})_{G_j}\ne 0}\|(h_{\max(s)})_{G_j}\|_2 \le \frac{2\|h_{\max(s)}\|_{2,1}^2}{s} \le 2\|h_{\max(s)}\|_2^2.
\]
Therefore,
\[
\|h\|_2 = \sqrt{\|h_{-\max(s)}\|_2^2 + \|h_{\max(s)}\|_2^2} \le \sqrt{3}\,\|h_{\max(s)}\|_2 \le \frac{4\eta\sqrt{s/3}}{n(1 - 7\delta/2)},
\]
which finishes the proof of this lemma. $\square$
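The following sketch illustrates the estimator analyzed in Lemma 11 with a minimal proximal-gradient (block soft-thresholding) solver for the group Lasso. It is an illustration we add here, not the paper's implementation, and the simulated dimensions are arbitrary; the tuning parameter follows the condition eta >= 3 max_j ||X_{G_j}^T eps||_2, and the last line prints the quantity 4 eta sqrt(s/3)/n appearing in (148), ignoring the (1 - 7 delta/2) factor since delta is not observed in simulation.

# Minimal group-Lasso solver (proximal gradient with block soft-thresholding).
# Illustrative sketch; n, p, r, s, sigma and the number of iterations are arbitrary.
import numpy as np

rng = np.random.default_rng(5)
n, p, r, s, sigma = 400, 50, 4, 5, 0.5
X = rng.standard_normal((n, p * r))
groups = [np.arange(j * r, (j + 1) * r) for j in range(p)]       # G_1, ..., G_p

beta = np.zeros(p * r)
for j in rng.choice(p, s, replace=False):                        # s active groups
    beta[groups[j]] = rng.standard_normal(r)
eps = sigma * rng.standard_normal(n)
y = X @ beta + eps

eta = 3 * max(np.linalg.norm(X[:, g].T @ eps) for g in groups)   # eta >= 3 max_j ||X_{G_j}^T eps||_2

def group_lasso(X, y, groups, eta, iters=2000):
    """Proximal gradient on 0.5*||y - X g||^2 + eta * sum_j ||g_{G_j}||_2."""
    L = np.linalg.norm(X, 2) ** 2                                # Lipschitz constant of the gradient
    g = np.zeros(X.shape[1])
    for _ in range(iters):
        z = g - X.T @ (X @ g - y) / L                            # gradient step
        for idx in groups:                                       # block soft-thresholding
            nrm = np.linalg.norm(z[idx])
            z[idx] = 0.0 if nrm <= eta / L else (1 - eta / (L * nrm)) * z[idx]
        g = z
    return g

beta_hat = group_lasso(X, y, groups, eta)
print(np.linalg.norm(beta_hat - beta))                 # estimation error
print(4 * eta * np.sqrt(s / 3) / n)                    # bound (148) without the (1 - 7*delta/2) factor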
The next Lemma 12 shows that the Gaussian ensemble satisfies the group restricted isometry property with high probability.

Lemma 12. Suppose $X \in \mathbb{R}^{n\times(pr)}$, $G_1, \ldots, G_p$ is a partition of $\{1, \ldots, pr\}$, and $|G_1| = \cdots = |G_p| = r$. If $X \overset{iid}{\sim} N(0, 1)$ and $n \ge C(sr/\delta^2 + s\log(ep/s))$ for a sufficiently large constant $C > 0$, then $X$ satisfies the following group restricted isometry property (GRIP):
\[
n(1-\delta)\|\beta\|_2^2 \le \|X\beta\|_2^2 \le n(1+\delta)\|\beta\|_2^2, \quad \forall\beta \text{ such that } \sum_{i=1}^p 1_{\{\beta_{G_i}\ne 0\}} \le s, \tag{155}
\]
with probability at least $1 - \exp(-cn)$.

Proof of Lemma 12. First, statement (155) is equivalent to
\[
\forall \text{ distinct } i_1, \ldots, i_s \in \{1, \ldots, p\}: \quad
n(1-\delta) \le \sigma_{\min}^2\left( X_{[:, G_{i_1}\cup\cdots\cup G_{i_s}]} \right) \le \sigma_{\max}^2\left( X_{[:, G_{i_1}\cup\cdots\cup G_{i_s}]} \right) \le n(1+\delta). \tag{156}
\]
Since $X_{[:, G_{i_1}\cup\cdots\cup G_{i_s}]}$ is an $n$-by-$sr$ matrix with i.i.d. standard Gaussian entries, by random matrix theory (c.f., [122, Corollary 5.35]),
\[
\mathbb{P}\left( \sqrt{n} - \sqrt{sr} - x \le \sigma_{\min}\left( X_{[:, G_{i_1}\cup\cdots\cup G_{i_s}]} \right) \le \sigma_{\max}\left( X_{[:, G_{i_1}\cup\cdots\cup G_{i_s}]} \right) \le \sqrt{n} + \sqrt{sr} + x \right) \ge 1 - 2e^{-x^2/2},
\]
which means
\[
\begin{split}
\mathbb{P}\left( \text{(156) does not hold} \right)
& \le \sum_{\text{distinct } i_1, \ldots, i_s \in \{1,\ldots,p\}} \mathbb{P}\left( \left\{ n(1-\delta) \le \sigma_{\min}^2\left( X_{[:, G_{i_1}\cup\cdots\cup G_{i_s}]} \right) \le \sigma_{\max}^2\left( X_{[:, G_{i_1}\cup\cdots\cup G_{i_s}]} \right) \le n(1+\delta) \right\}^c \right) \\
& \le \binom{p}{s}\cdot 2\exp\left( -\frac{1}{2}\left( \sqrt{n} - \sqrt{n(1-\delta)} - \sqrt{sr} \right)^2 \wedge \left( \sqrt{n(1+\delta)} - \sqrt{n} - \sqrt{sr} \right)^2 \right).
\end{split}
\]
Provided that $n \ge C(sr/\delta^2 + s\log(ep/s))$ for a sufficiently large constant $C > 0$, we have $\sqrt{sr} \le \delta\sqrt{n/C}$, and since $\sqrt{n} - \sqrt{n(1-\delta)} \ge \delta\sqrt{n}/2$ and $\sqrt{n(1+\delta)} - \sqrt{n} \ge \delta\sqrt{n}/3$,
\[
\left( \sqrt{n} - \sqrt{n(1-\delta)} - \sqrt{sr} \right)^2 \wedge \left( \sqrt{n(1+\delta)} - \sqrt{n} - \sqrt{sr} \right)^2 \ge \frac{(1-c_0)\delta^2 n}{9}
\]
for some constant $0 < c_0 < 1$. Moreover, since $n \ge Cs\log(ep/s)$ and $\log\binom{p}{s} \le s\log(ep/s)$, for $C$ large enough (depending on $\delta$) we also have $(1-c_0)\delta^2 n/36 \ge \log\left(2\binom{p}{s}\right)$. Therefore,
\[
\mathbb{P}\left( \text{(156) does not hold} \right) \le \exp\left( \log\left(2\binom{p}{s}\right) - \frac{(1-c_0)\delta^2 n}{18} \right) \le \exp(-cn)
\]
with $c = (1-c_0)\delta^2/36$, and we have finished the proof of this lemma. $\square$
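A quick empirical look at Lemma 12 (illustrative only): for a Gaussian design, we sample a moderate number of random group supports of size s and record the extreme squared singular values of the corresponding n-by-sr submatrices, which should concentrate around n. Exhaustively checking all (p choose s) supports, as the lemma requires, is combinatorial, so this is only a spot check under arbitrary sizes.

# Spot check of the group restricted isometry property for a Gaussian design.
# Illustrative sketch; n, p, r, s and the number of sampled supports are arbitrary.
import numpy as np
from itertools import chain

rng = np.random.default_rng(6)
n, p, r, s = 500, 40, 3, 4
X = rng.standard_normal((n, p * r))
groups = [np.arange(j * r, (j + 1) * r) for j in range(p)]

ratios = []
for _ in range(200):                                    # sample supports instead of enumerating all of them
    support = rng.choice(p, s, replace=False)
    cols = list(chain.from_iterable(groups[j] for j in support))
    sv = np.linalg.svd(X[:, cols], compute_uv=False)
    ratios.extend([sv[0] ** 2 / n, sv[-1] ** 2 / n])

# Both extremes of sigma^2 / n should be close to 1, i.e., the empirical delta is
# small when n is much larger than s*r + s*log(p/s).
print(min(ratios), max(ratios))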
The next lemma gives the Kullback-Leibler divergence between two regression models with random designs, which will be used in the lower bound argument in this paper.

Lemma 13. Consider two linear regression models $y^{(1)} = X\beta^{(1)} + \varepsilon$ and $y^{(2)} = X\beta^{(2)} + \varepsilon$. Here, $y^{(1)}, y^{(2)}\in\mathbb{R}^n$, $X\in\mathbb{R}^{n\times p}$, $\beta^{(1)}, \beta^{(2)}\in\mathbb{R}^p$, and $\varepsilon\in\mathbb{R}^n$. Assume $X \overset{iid}{\sim} N(0, 1)$, $\varepsilon \overset{iid}{\sim} N(0, \sigma^2)$, and $\beta^{(1)}, \beta^{(2)}$ are fixed. Then
\[
D_{\rm KL}\left( \{X, y^{(1)}\} \,\big\|\, \{X, y^{(2)}\} \right) = \frac{n}{2\sigma^2}\left\| \beta^{(1)} - \beta^{(2)} \right\|_2^2. \tag{157}
\]

Proof of Lemma 13. Denote the $j$-th row vector of $X$ as $x_j$, i.e., $X = [x_1\;\cdots\;x_n]^\top$. Then $(x_1^\top, y_1^{(1)}), \ldots, (x_n^\top, y_n^{(1)})$ are i.i.d. vectors, $y_j^{(1)} = x_j^\top\beta^{(1)} + \varepsilon_j$, and
\[
\left( x_j^\top, y_j^{(1)} \right) \sim N(0, \Sigma_1), \quad \Sigma_1 = \begin{bmatrix} I_p & \beta^{(1)} \\ \beta^{(1)\top} & \|\beta^{(1)}\|_2^2 + \sigma^2 \end{bmatrix};
\qquad
\left( x_j^\top, y_j^{(2)} \right) \sim N(0, \Sigma_2), \quad \Sigma_2 = \begin{bmatrix} I_p & \beta^{(2)} \\ \beta^{(2)\top} & \|\beta^{(2)}\|_2^2 + \sigma^2 \end{bmatrix}.
\]
Additionally,
\[
\det(\Sigma_i) = \det\left( \begin{bmatrix} I_p & 0 \\ -\beta^{(i)\top} & 1 \end{bmatrix}\cdot\begin{bmatrix} I_p & \beta^{(i)} \\ \beta^{(i)\top} & \|\beta^{(i)}\|_2^2 + \sigma^2 \end{bmatrix} \right) = \det\left( \begin{bmatrix} I_p & \beta^{(i)} \\ 0 & \sigma^2 \end{bmatrix} \right) = \sigma^2, \quad i = 1, 2,
\]
\[
\Sigma_i^{-1} = \begin{bmatrix} I_p + \beta^{(i)}\beta^{(i)\top}\sigma^{-2} & -\beta^{(i)}\sigma^{-2} \\ -\beta^{(i)\top}\sigma^{-2} & \sigma^{-2} \end{bmatrix}, \quad i = 1, 2.
\]
By the formula for the Kullback-Leibler divergence between multivariate normal distributions,
\[
\begin{split}
D_{\rm KL}\left( \left\{x_j^\top, y_j^{(1)}\right\} \,\big\|\, \left\{x_j^\top, y_j^{(2)}\right\} \right)
& = \frac{1}{2}\left( {\rm tr}\left(\Sigma_2^{-1}\Sigma_1\right) - (p+1) + \log\frac{\det(\Sigma_2)}{\det(\Sigma_1)} \right) \\
& = \frac{\sigma^{-2}}{2}\left( {\rm tr}\left( \beta^{(2)}\beta^{(2)\top} - \beta^{(2)}\beta^{(1)\top} - \beta^{(1)}\beta^{(2)\top} \right) + \|\beta^{(1)}\|_2^2 \right)
= \frac{1}{2\sigma^2}\left\| \beta^{(1)} - \beta^{(2)} \right\|_2^2.
\end{split}
\]
Therefore,
\[
D_{\rm KL}\left( \left\{x_j^\top, y_j^{(1)}\right\}_{j=1}^n \,\big\|\, \left\{x_j^\top, y_j^{(2)}\right\}_{j=1}^n \right)
= nD_{\rm KL}\left( \left\{x_j^\top, y_j^{(1)}\right\} \,\big\|\, \left\{x_j^\top, y_j^{(2)}\right\} \right)
= \frac{n}{2\sigma^2}\left\| \beta^{(1)} - \beta^{(2)} \right\|_2^2. \qquad\square
\]
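The per-observation case (n = 1) of formula (157) can be checked numerically by plugging the two joint covariance matrices into the Gaussian KL formula used in the proof; the snippet below (an illustration we add) does so for arbitrary p, sigma, and coefficient vectors.

# Numerical check of the n = 1 case of (157) via the multivariate Gaussian KL formula.
# Illustrative sketch; p, sigma, and the coefficient vectors are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(7)
p, sigma = 6, 0.8
b1, b2 = rng.standard_normal(p), rng.standard_normal(p)

def joint_cov(b):
    """Covariance of (x_j, y_j) with y_j = x_j^T b + eps_j, x_j ~ N(0, I), eps_j ~ N(0, sigma^2)."""
    S = np.eye(p + 1)
    S[:p, p] = b
    S[p, :p] = b
    S[p, p] = b @ b + sigma**2
    return S

S1, S2 = joint_cov(b1), joint_cov(b2)
# KL(N(0,S1) || N(0,S2)) = 0.5 * ( tr(S2^{-1} S1) - (p+1) + log det(S2)/det(S1) )
kl = 0.5 * (np.trace(np.linalg.solve(S2, S1)) - (p + 1)
            + np.log(np.linalg.det(S2) / np.linalg.det(S1)))
print(kl, np.sum((b1 - b2) ** 2) / (2 * sigma**2))      # the two values should agree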
The next lemma can be seen as a sparse version of the Varshamov-Gilbert bound [87, Lemma 4.7]. This result is crucial in the proof of the lower bound argument for sparse tensor regression (Theorem 7).

Lemma 14. There exists a series of matrices $A^{(1)}, \ldots, A^{(N)} \in \{0, 1, -1\}^{p\times r}$ such that
\[
\|A^{(k)}\|_{2,0} := \sum_{i=1}^p 1_{\left\{ A^{(k)}_{[i,:]}\ne 0 \right\}} \le s, \qquad
\|A^{(k)} - A^{(l)}\|_{1,1} := \sum_{i=1}^p\sum_{j=1}^r\left| A^{(k)}_{[i,j]} - A^{(l)}_{[i,j]} \right| > sr/2 \quad\text{for all } k\ne l, \tag{158}
\]
and $N \ge \exp\left( c(sr + s\log(ep/s)) \right)$ for some uniform constant $c > 0$.

Proof of Lemma 14. First, if $p/s \le C_0$ for some constant $C_0 > 0$, the lemma directly follows from the Varshamov-Gilbert bound by restricting to the top $s\times r$ submatrices of $A^{(1)}, \ldots, A^{(N)}$. Thus, without loss of generality, we assume $p \ge 9s$ throughout the rest of the proof.

Next, for $k = 1, \ldots, N$, we randomly draw $s$ elements from $\{1, \ldots, p\}$ without replacement to form $\Omega^{(k)}$, a random subset of $\{1, \ldots, p\}$ of size $s$, and generate $A^{(k)}\in\mathbb{R}^{p\times r}$ with
\[
\left( A^{(k)} \right)_{ij} \begin{cases} \sim \text{Rademacher}, & i\in\Omega^{(k)}; \\ = 0, & i\notin\Omega^{(k)}, \end{cases}
\]
for $k = 1, 2, \ldots, N$, where all nonzero entries are drawn independently. Here, $A \sim$ Rademacher if $A$ is equally distributed on $-1$ and $1$. By such a construction,
\[
\|A^{(k)}\|_{2,0} = \sum_{i=1}^p 1_{\left\{ A^{(k)}_{[i,:]}\ne 0 \right\}} \le s.
\]
For any $k\ne l$,
\[
\left\| A^{(k)} - A^{(l)} \right\|_{1,1}
\sim r\left| \Omega^{(k)}\setminus\Omega^{(l)} \right| + r\left| \Omega^{(l)}\setminus\Omega^{(k)} \right| + 2\cdot\text{Bin}\left( r\left| \Omega^{(k)}\cap\Omega^{(l)} \right|, 1/2 \right)
= 2sr - 2r\left| \Omega^{(l)}\cap\Omega^{(k)} \right| + 2\cdot\text{Bin}\left( r\left| \Omega^{(l)}\cap\Omega^{(k)} \right|, 1/2 \right)
\sim 2sr - 2\cdot\text{Bin}\left( r\left| \Omega^{(l)}\cap\Omega^{(k)} \right|, 1/2 \right). \tag{159}
\]
Here, we used the fact that $|\Omega^{(k)}\setminus\Omega^{(l)}| = |\Omega^{(k)}| - |\Omega^{(k)}\cap\Omega^{(l)}| = s - |\Omega^{(k)}\cap\Omega^{(l)}|$. Moreover, $|\Omega^{(l)}\cap\Omega^{(k)}|$ follows the hypergeometric distribution:
\[
\mathbb{P}\left( \left| \Omega^{(l)}\cap\Omega^{(k)} \right| = t \right) = \frac{\binom{s}{t}\binom{p-s}{s-t}}{\binom{p}{s}}, \quad t = 0, \ldots, s.
\]
Let $Z_{kl} = |\Omega^{(l)}\cap\Omega^{(k)}|$ (written as $Z$ for short). Then for any $s/2\le t\le s$,
\[
\mathbb{P}(Z = t) = \frac{\frac{s\cdots(s-t+1)}{t!}\cdot\frac{(p-s)\cdots(p-2s+t+1)}{(s-t)!}}{\frac{p\cdots(p-s+1)}{s!}}
\le \binom{s}{t}\left( \frac{s}{p-s+1} \right)^t \le 2^s\left( \frac{s}{p-s+1} \right)^t \le \left( \frac{4s}{p-s+1} \right)^t. \tag{160}
\]
Next, by Bernstein's inequality,
\[
\mathbb{P}\left( \left\| A^{(k)} - A^{(l)} \right\|_{1,1} \le sr/2 \,\Big|\, Z \right)
\overset{(159)}{=} \mathbb{P}\left( \text{Bin}(rZ, 1/2) \ge 3sr/4 \,\Big|\, Z \right)
= \mathbb{P}\left( \text{Bin}(rZ, 1/2) - \frac{rZ}{2} \ge \frac{3sr}{4} - \frac{rZ}{2} \right)
\le \begin{cases} \exp\left( -\dfrac{(3sr/4 - rZ/2)^2}{2\left( rZ/4 + (3sr/4 - rZ/2)/3 \right)} \right), & s/2\le Z\le s; \\ 0, & Z < s/2. \end{cases}
\]
Then,
\[
\begin{split}
\mathbb{P}\left( \left\| A^{(k)} - A^{(l)} \right\|_{1,1} \le sr/2 \right)
& \le \sum_{s/2\le t\le s}\mathbb{P}\left( \left\| A^{(k)} - A^{(l)} \right\|_{1,1} \le sr/2 \,\Big|\, Z = t \right)\cdot\mathbb{P}(Z = t) \\
& \le \sum_{s/2\le t\le s}\exp\left( -\frac{(3sr/4 - tr/2)^2}{2\left( rt/4 + (3sr/4 - tr/2)/3 \right)} \right)\left( \frac{4s}{p-s+1} \right)^t
\le \sum_{s/2\le t\le s}\exp\left( -\frac{(3sr/4 - sr/2)^2}{2\left( sr/4 + (3sr/4 - sr/2)/3 \right)} \right)\left( \frac{4s}{p-s+1} \right)^t \\
& \le \sum_{t\ge s/2}\exp\left( -\frac{3sr}{32} \right)\left( \frac{4s}{p-s+1} \right)^t
\le 2\exp\left( -\frac{3sr}{32} \right)\left( \frac{4s}{p-s+1} \right)^{s/2}
\le \exp\left( -c_1\left( sr + s\log(ep/s) \right) \right)
\end{split}
\]
for some uniform constant $c_1 > 0$, where we used $p\ge 9s$ so that $4s/(p-s+1)\le 1/2$. Finally,
\[
\mathbb{P}\left( \forall\, 1\le k\ne l\le N,\; \left\| A^{(k)} - A^{(l)} \right\|_{1,1} > sr/2 \right)
\ge 1 - \binom{N}{2}\mathbb{P}\left( \left\| A^{(k)} - A^{(l)} \right\|_{1,1} \le sr/2 \right)
\ge 1 - N^2\exp\left( -c_1\left( sr + s\log(ep/s) \right) \right).
\]
We can see that if $N \le \exp\left( c(sr + s\log(ep/s)) \right)$ for a sufficiently small uniform constant $c > 0$, the previous event happens with positive probability, which means there exist fixed $A^{(1)}, \ldots, A^{(N)}$ satisfying the target condition (158) for some $N \ge \exp\left( c(sr + s\log(ep/s)) \right)$. $\square$
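The random construction in the proof of Lemma 14 is easy to simulate. The following sketch draws a small number of row-sparse sign matrices and reports the minimum pairwise entry-wise l1 separation against the target sr/2; the parameters are far smaller than the exponential regime covered by the lemma, so this is only an illustration.

# Simulation of the random packing construction in the proof of Lemma 14.
# Illustrative sketch; p, r, s, N are arbitrary small values.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(8)
p, r, s, N = 60, 5, 6, 40

A = np.zeros((N, p, r))
for k in range(N):
    omega = rng.choice(p, s, replace=False)                # s active rows
    A[k, omega, :] = rng.choice([-1.0, 1.0], size=(s, r))  # Rademacher entries

min_sep = min(np.abs(A[k] - A[l]).sum() for k, l in combinations(range(N), 2))
print(min_sep, s * r / 2)                                  # minimum separation vs. the target s*r/2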