Tensor-on-tensor regression
Eric F. Lock∗
Division of Biostatistics, University of Minnesota
June 28, 2017
Abstract
We propose a framework for the linear prediction of a multi-way array (i.e., a tensor) from another multi-way array of arbitrary dimension, using the contracted tensor product. This framework generalizes several existing approaches, including methods to predict a scalar outcome from a tensor, a matrix from a matrix, or a tensor from a scalar. We describe an approach that exploits the multiway structure of both the predictors and the outcomes by restricting the coefficients to have reduced CP-rank. We propose a general and efficient algorithm for penalized least-squares estimation, which allows for a ridge ($L_2$) penalty on the coefficients. The objective is shown to give the mode of a Bayesian posterior, which motivates a Gibbs sampling algorithm for inference. We illustrate the approach with an application to facial image data. An R package is available at https://github.com/lockEF/MultiwayRegression .
Keywords: Multiway data, PARAFAC/CANDECOMP, ridge regression, reduced rank regression

∗The author gratefully acknowledges the support of NIH grant ULI RR033183/KL2 RR0333182.

1 Introduction
For many applications data are best represented in the form of a tensor, also called a multi-way or multi-dimensional array, which extends the familiar two-way data matrix (Samples × Variables) to higher dimensions. Tensors are increasingly encountered in fields that require the automated collection of high-throughput data with complex structure. For example, in molecular "omics" profiling it is now common to collect high-dimensional data over multiple subjects, tissues, fluids or time points within a single study. For neuroimaging modalities such as fMRI and EEG, data are commonly represented as multi-way arrays with dimensions that can represent subjects, time points, brain regions, or frequencies. In this article we consider an application to a collection of facial images from the Faces in the Wild database (Learned-Miller et al., 2016), which when properly aligned to a 90 × 90 pixel grid can be represented as a 4-way array with dimension Faces × X × Y × Colors, where X and Y give the horizontal and vertical location of each pixel.

This article concerns the prediction of an array of arbitrary dimension $Q_1 \times \cdots \times Q_M$ from another array of arbitrary dimension $P_1 \times \cdots \times P_L$. For $N$ training observations, this involves an outcome array $\mathbb{Y}: N \times Q_1 \times \cdots \times Q_M$ and a predictor array $\mathbb{X}: N \times P_1 \times \cdots \times P_L$. For example, we consider the simultaneous prediction of several describable attributes for faces from their images (Kumar et al., 2009), which requires predicting the array $\mathbb{Y}$: Faces × Attributes from $\mathbb{X}$: Faces × X × Y × Colors. Other potential applications include the prediction of fMRI from EEG data (see De Martino et al. (2011)) and the prediction of gene expression across multiple tissues from other genomic variables (see Ramasamy et al. (2014)).

The task of tensor-on-tensor regression extends a growing literature on the predictive modeling of tensors under different scenarios. Such methods commonly rely on tensor factorization techniques (Kolda and Bader, 2009), which reconstruct a tensor using a small number of underlying patterns in each dimension. Tensor factorizations extend well known techniques for a matrix, such as the singular value decomposition and principal component analysis, to higher-order arrays. A classical and straightforward technique is the PARAFAC/CANDECOMP (CP) decomposition (Harshman, 1970), in which the data are approximated as a linear combination of rank-1 tensors. An alternative is the Tucker decomposition (Tucker, 1966), in which a tensor is factorized into basis vectors for each dimension that are combined using a smaller core tensor. The CP factorization is a special case of the Tucker factorization wherein the core tensor is diagonal. Such factorization techniques are useful to account for and exploit multi-way dependence and reduce dimensionality.

Several methods have been developed for the prediction of a scalar outcome from a tensor of arbitrary dimension ($\mathbb{Y}: N \times 1$, $\mathbb{X}: N \times P_1 \times \cdots \times P_L$). Zhou et al. (2013) and Guo et al. (2012) propose tensor regression models for a single outcome in which the coefficient array is assumed to have a low-rank CP factorization. The proposed framework in Zhou et al. (2013) extends to generalized linear models and allows for the incorporation of sparsity-inducing regularization terms. An analogous approach in which the coefficients are assumed to have a Tucker structure is described by Li et al. (2013). Several methods have also been developed for the classification of multiway data (categorical $\mathbb{Y}: N \times 1$; see, e.g., Tao et al., 2007; Wimalawarne et al., 2016; Lyu et al., 2017). For the prediction of a vector of outcomes from a vector of predictors ($\mathbb{Y}: N \times Q$ and $\mathbb{X}: N \times P$), a classical approach is reduced rank regression, in which the $P \times Q$ coefficient matrix is restricted to have low rank (Izenman, 1975; Mukherjee and Zhu, 2011). Miranda et al. (2015) describe a Bayesian formulation for regression models with multiple outcome variables and multiway predictors ($\mathbb{Y}: N \times Q$ and $\mathbb{X}: N \times P_1 \times \cdots \times P_L$), which is applied to a neuroimaging study. Conversely, tensor response regression models have been developed to predict a multiway outcome from vector predictors ($\mathbb{Y}: N \times Q_1 \times \cdots \times Q_M$, $\mathbb{X}: N \times P$). Sun and Li (2016) propose a tensor response regression wherein a multiway outcome is assumed to have a CP factorization, and Li and Zhang (2016) propose a tensor response regression wherein a multiway outcome is assumed to have a Tucker factorization with weights determined by vector-valued predictors.
For a similar context Lock and Li (2016) describe a supervised CP factorization, wherein the components of a CP factorization are informed by vector-valued covariates. Hoff (2015) extends a bilinear regression model for matrices to the prediction of an outcome tensor from a predictor tensor with the same number of modes (e.g., $\mathbb{Y}: N \times Q_1 \times \cdots \times Q_K$ and $\mathbb{X}: N \times P_1 \times \cdots \times P_K$) via a Tucker product, and describes a Gibbs sampling approach to inference.

The above methods address several important tasks, including scalar-on-tensor regression, vector-on-vector regression, vector-on-tensor regression and tensor-on-vector regression. However, there is a lack of methodology to address the important and increasingly relevant task of tensor-on-tensor regression, i.e., predicting an array of arbitrary dimension from another array of arbitrary dimension. This scenario is considered within a comprehensive theoretical study of convex tensor regularizers by Raskutti and Yuan (2015), including the tensor nuclear norm. However, they do not discuss estimation algorithms for this context, and computing the tensor nuclear norm is NP-hard (Sun and Li, 2016; Friedland and Lim, 2014). In this article we propose a contracted tensor product for the linear prediction of a tensor $\mathbb{Y}$ from a tensor $\mathbb{X}$, where both $\mathbb{X}$ and $\mathbb{Y}$ have arbitrary dimension, through a coefficient array $\mathbb{B}$ of dimension $P_1 \times \cdots \times P_L \times Q_1 \times \cdots \times Q_M$. This framework is shown to accommodate all valid linear relations between the variates of $\mathbb{X}$ and the variates of $\mathbb{Y}$. In our implementation $\mathbb{B}$ is assumed to have reduced CP-rank, a simple restriction which simultaneously exploits the multi-way structure of both $\mathbb{X}$ and $\mathbb{Y}$ by borrowing information across the different modes and reducing dimensionality. We propose a general and efficient algorithm for penalized least-squares estimation, which allows for a ridge ($L_2$) penalty on the coefficients. The objective is shown to give the mode of a Bayesian posterior, which motivates a Gibbs sampling algorithm for inference.

The primary novel contribution of this article is a framework and methodology that allows for tensor-on-tensor regression with arbitrary dimensions. Other novel contributions include optimization under a ridge penalty on the coefficients and Gibbs sampling for inference, and these contributions are also relevant to the more familiar special cases of tensor regression (scalar-on-tensor), reduced rank regression (vector-on-vector), and tensor response regression (tensor-on-vector).

2 Notation

Throughout this article bold lowercase characters ($\mathbf{a}$) denote vectors, bold uppercase characters ($\mathbf{A}$) denote matrices, and uppercase blackboard bold characters ($\mathbb{A}$) denote multi-way arrays of arbitrary dimension.

Define a $K$-way array (i.e., a $K$th-order tensor) by $\mathbb{A}: I_1 \times \cdots \times I_K$, where $I_k$ is the dimension of the $k$th mode. The entries of the array are given by indices enclosed in square brackets, $\mathbb{A}[i_1, \ldots, i_K]$, where $i_k \in \{1, \ldots, I_k\}$ for $k \in 1, \ldots, K$.

For vectors $\mathbf{a}_1, \ldots, \mathbf{a}_K$ of length $I_1, \ldots, I_K$, respectively, define the outer product $\mathbb{A} = \mathbf{a}_1 \circ \mathbf{a}_2 \circ \cdots \circ \mathbf{a}_K$ as the $K$-way array of dimensions $I_1 \times \cdots \times I_K$, with entries
$$\mathbb{A}[i_1, \ldots, i_K] = \prod_{k=1}^{K} \mathbf{a}_k[i_k].$$
An array given by the outer product of vectors has rank 1.
For matrices $\mathbf{A}_1, \ldots, \mathbf{A}_K$ of the same column dimension $R$, we introduce the notation
$$[[\mathbf{A}_1, \ldots, \mathbf{A}_K]] = \sum_{r=1}^{R} \mathbf{a}_{1r} \circ \cdots \circ \mathbf{a}_{Kr}, \qquad (1)$$
where $\mathbf{a}_{kr}$ is the $r$th column of $\mathbf{A}_k$. This gives a CP factorization, and an array that can be expressed in the form (1) is defined to have rank $R$.

The vectorization operator vec(·) transforms a multiway array to a vector containing the array entries. Specifically, vec($\mathbb{A}$) is a vector of length $\prod_{k=1}^{K} I_k$ where
$$\mathrm{vec}(\mathbb{A})\left[ i_1 + \sum_{k=2}^{K} \left( \prod_{l=1}^{k-1} I_l \right) (i_k - 1) \right] = \mathbb{A}[i_1, \ldots, i_K].$$

It is often useful to represent an array in matrix form via unfolding it along a given mode. For this purpose we let the rows of $\mathbf{A}_{(k)}: I_k \times \left( \prod_{j \neq k} I_j \right)$ give the vectorized versions of each subarray in the $k$th mode.

For two multiway arrays $\mathbb{A}: I_1 \times \cdots \times I_K \times P_1 \times \cdots \times P_L$ and $\mathbb{B}: P_1 \times \cdots \times P_L \times Q_1 \times \cdots \times Q_M$ we define the contracted tensor product $\langle \mathbb{A}, \mathbb{B} \rangle_L : I_1 \times \cdots \times I_K \times Q_1 \times \cdots \times Q_M$ by
$$\langle \mathbb{A}, \mathbb{B} \rangle_L [i_1, \ldots, i_K, q_1, \ldots, q_M] = \sum_{p_1=1}^{P_1} \cdots \sum_{p_L=1}^{P_L} \mathbb{A}[i_1, \ldots, i_K, p_1, \ldots, p_L] \, \mathbb{B}[p_1, \ldots, p_L, q_1, \ldots, q_M].$$
An analogous definition of the contracted tensor product, with slight differences in notation, is given in Bader and Kolda (2006). Note that for matrices $\mathbf{A}: I \times P$ and $\mathbf{B}: P \times Q$, $\langle \mathbf{A}, \mathbf{B} \rangle_1 = \mathbf{A}\mathbf{B}$, and thus the contracted tensor product extends the usual matrix product to higher-order operands.
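Because the contracted tensor product reduces to a single matrix multiplication between unfoldings, it is straightforward to compute. The following R sketch (R being the language of the accompanying package) implements $\langle \mathbb{A}, \mathbb{B} \rangle_L$ for arrays of arbitrary dimension; the function name ctprod is ours for illustration and is not part of the MultiwayRegression package.

    # Contracted tensor product <A, B>_L: sums over the last L modes of A
    # and the first L modes of B. R stores arrays in column-major order, so
    # matrix() applied to an array is exactly the unfolding that groups
    # leading modes into rows and trailing modes into columns.
    ctprod <- function(A, B, L) {
      dA <- dim(A); dB <- dim(B)
      K <- length(dA) - L                      # retained modes of A
      M <- length(dB) - L                      # retained modes of B
      Amat <- matrix(A, nrow = prod(dA[1:K]))  # (I1...IK) x (P1...PL)
      Bmat <- matrix(B, nrow = prod(dB[1:L]))  # (P1...PL) x (Q1...QM)
      array(Amat %*% Bmat, dim = c(dA[1:K], dB[L + seq_len(M)]))
    }

For matrices A: I × P and B: P × Q, ctprod(A, B, 1) returns AB (as an I × Q array), matching the identity above.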
3 General framework

Consider predicting a multiway array $\mathbb{Y}: N \times Q_1 \times \cdots \times Q_M$ from a multiway array $\mathbb{X}: N \times P_1 \times \cdots \times P_L$ with the model
$$\mathbb{Y} = \langle \mathbb{X}, \mathbb{B} \rangle_L + \mathbb{E}, \qquad (2)$$
where $\mathbb{B}: P_1 \times \cdots \times P_L \times Q_1 \times \cdots \times Q_M$ is a coefficient array and $\mathbb{E}: N \times Q_1 \times \cdots \times Q_M$ is an error array. The first $L$ modes of $\mathbb{B}$ contract the dimensions of $\mathbb{X}$ that are not in $\mathbb{Y}$, and the last $M$ modes of $\mathbb{B}$ expand along the modes in $\mathbb{Y}$ that are not in $\mathbb{X}$. The predicted outcome indexed by $(q_1, \ldots, q_M)$ is
$$\mathbb{Y}[n, q_1, \ldots, q_M] \approx \sum_{p_1=1}^{P_1} \cdots \sum_{p_L=1}^{P_L} \mathbb{X}[n, p_1, \ldots, p_L] \, \mathbb{B}[p_1, \ldots, p_L, q_1, \ldots, q_M] \qquad (3)$$
for observations $n = 1, \ldots, N$. In (2) we forgo the use of an intercept term for simplicity, and assume that $\mathbb{X}$ and $\mathbb{Y}$ are each centered to have mean 0 over all their values.

Let $P$ be the total number of predictors for each observation, $P = \prod_{l=1}^{L} P_l$, and $Q$ be the total number of outcomes for each observation, $Q = \prod_{m=1}^{M} Q_m$. Equation (2) can be reformulated by rearranging the entries of $\mathbb{X}$, $\mathbb{Y}$, $\mathbb{B}$ and $\mathbb{E}$ into matrix form
$$\mathbf{Y}_{(1)} = \mathbf{X}_{(1)} \mathbf{B} + \mathbf{E}_{(1)}, \qquad (4)$$
where $\mathbf{Y}_{(1)}: N \times Q$, $\mathbf{X}_{(1)}: N \times P$, and $\mathbf{E}_{(1)}: N \times Q$ are the arrays $\mathbb{Y}$, $\mathbb{X}$ and $\mathbb{E}$ unfolded along the first mode. The columns of $\mathbf{B}: P \times Q$ vectorize the first $L$ modes of $\mathbb{B}$ (collapsing $\mathbb{X}$), and the rows of $\mathbf{B}$ vectorize the last $M$ modes of $\mathbb{B}$ (expanding to $\mathbb{Y}$):
$$\mathbf{B}\left[ p_1 + \sum_{l=2}^{L} \left( \prod_{i=1}^{l-1} P_i \right)(p_l - 1), \; q_1 + \sum_{m=2}^{M} \left( \prod_{i=1}^{m-1} Q_i \right)(q_m - 1) \right] = \mathbb{B}[p_1, \ldots, p_L, q_1, \ldots, q_M].$$
From its matrix form (4) it is clear that the general framework (2) supports all valid linear relations between the $P$ variates of $\mathbb{X}$ and the $Q$ variates of $\mathbb{Y}$.

4 Least squares estimation

Consider choosing $\mathbb{B}$ to minimize the sum of squared residuals $||\mathbb{Y} - \langle \mathbb{X}, \mathbb{B} \rangle_L||_F^2$. Without further restrictions the solution for $\mathbb{B}$ is given by separate OLS regressions for each of the $Q$ outcomes in $\mathbb{Y}$, each with design matrix $\mathbf{X}_{(1)}$; this is clear from (4), where each column of $\mathbf{B}$ is given by an OLS regression of the corresponding column of $\mathbf{Y}_{(1)}$ on $\mathbf{X}_{(1)}$. Therefore, the unrestricted solution is not well-defined if $P > N$, or more generally if $\mathbf{X}_{(1)}$ is not of full column rank. The unrestricted least squares solution may be undesirable even if it is well-defined, as it does not exploit the multi-way structure of $\mathbb{X}$ or $\mathbb{Y}$, and requires fitting
$$\prod_{l=1}^{L} P_l \prod_{m=1}^{M} Q_m \qquad (5)$$
unknown parameters. Alternatively, the multi-way nature of $\mathbb{X}$ and $\mathbb{Y}$ suggests a low-rank solution for $\mathbb{B}$. The rank $R$ solution can be represented as
$$\mathbb{B} = [[\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M]], \qquad (6)$$
where $\mathbf{U}_l: P_l \times R$ for $l = 1, \ldots, L$ and $\mathbf{V}_m: Q_m \times R$ for $m = 1, \ldots, M$. The dimension of this model is
$$R(P_1 + \cdots + P_L + Q_1 + \cdots + Q_M), \qquad (7)$$
which can be several orders of magnitude smaller than the unconstrained dimensionality (5). Moreover, the reduced rank solution allows for borrowing of information across the different dimensions of both $\mathbb{X}$ and $\mathbb{Y}$. However, the resulting least-squares solution
$$\underset{\mathrm{rank}(\mathbb{B}) \le R}{\arg\min} \; ||\mathbb{Y} - \langle \mathbb{X}, \mathbb{B} \rangle_L||_F^2 \qquad (8)$$
is still prone to over-fitting and instability if the model dimension (7) is high relative to the number of observed outcomes, or if the predictors $\mathbb{X}$ have multicollinearity that is not addressed by the reduced rank assumption (e.g., multicollinearity within a mode). High-dimensionality and multicollinearity are both commonly encountered in application areas that involve multi-way data, such as imaging and genomics.
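To make the factorized form (6) concrete, the sketch below composes $\mathbb{B} = [[\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M]]$ from a list of factor matrices with a common column dimension R; cp_compose is again our own illustrative name, not part of the package.

    # Compose B = [[U_1, ..., U_L, V_1, ..., V_M]] from a list of factor
    # matrices, each with R columns (eq. 6): the sum of R rank-1 arrays.
    cp_compose <- function(mats) {
      R <- ncol(mats[[1]])
      B <- array(0, dim = sapply(mats, nrow))
      for (r in 1:R) {
        comp <- mats[[1]][, r]
        for (k in seq_along(mats)[-1]) comp <- outer(comp, mats[[k]][, r])
        B <- B + comp                          # add the r-th rank-1 term
      }
      B
    }

For example, with L = 2, M = 1 and factor matrices of sizes P1 × R, P2 × R and Q1 × R, the result is a P1 × P2 × Q1 coefficient array.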
To address these issues we incorporate an $L_2$ penalty on the coefficient array $\mathbb{B}$:
$$\underset{\mathrm{rank}(\mathbb{B}) \le R}{\arg\min} \; ||\mathbb{Y} - \langle \mathbb{X}, \mathbb{B} \rangle_L||_F^2 + \lambda ||\mathbb{B}||_F^2, \qquad (9)$$
where $\lambda$ controls the degree of penalization. This objective is equivalent to that of ridge regression when predicting a vector outcome $\mathbb{Y}: N \times 1$ from a matrix $\mathbb{X}: N \times P$, where necessarily $R = 1$.
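With the two helpers above, the objective (9) can be evaluated directly; a minimal sketch (the function name is ours):

    # Penalized least-squares objective (9) for a coefficient array given
    # in factorized form; lambda = 0 recovers the unpenalized objective (8).
    obj_rrr <- function(Y, X, mats, L, lambda = 0) {
      B <- cp_compose(mats)
      resid <- Y - ctprod(X, B, L)
      sum(resid^2) + lambda * sum(B^2)         # squared Frobenius norms
    }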
5 Identifiability

The general predictive model (2) is identifiable for $\mathbb{B}$, in that $\mathbb{B} \neq \mathbb{B}^*$ implies $\langle \tilde{\mathbb{X}}, \mathbb{B} \rangle_L \neq \langle \tilde{\mathbb{X}}, \mathbb{B}^* \rangle_L$ for some $\tilde{\mathbb{X}} \in \mathbb{R}^{P_1 \times \cdots \times P_L}$. To show this, note that if $\tilde{\mathbb{X}}$ is an array with 1 in position $[p_1, \ldots, p_L]$ and zeros elsewhere, then
$$\langle \tilde{\mathbb{X}}, \mathbb{B} \rangle_L [q_1, \ldots, q_M] = \mathbb{B}[p_1, \ldots, p_L, q_1, \ldots, q_M].$$
However, the resulting components $\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M$ in the factorized representation of $\mathbb{B}$ (6) are not readily identified. Conditions for their identifiability are equivalent to conditions for the identifiability of the CP factorization, for which there is an extensive literature (see, e.g., Sidiropoulos and Bro, 2000). To account for arbitrary scaling and ordering of the components, we impose the restrictions

1. $||\mathbf{u}_{1r}|| = \cdots = ||\mathbf{u}_{Lr}|| = ||\mathbf{v}_{1r}|| = \cdots = ||\mathbf{v}_{Mr}||$ for $r = 1, \ldots, R$, and
2. $||\mathbf{u}_{11}|| \ge ||\mathbf{u}_{12}|| \ge \cdots \ge ||\mathbf{u}_{1R}||$.

The above restrictions are generally enough to ensure identifiability when $L + M \ge 3$. If $L + M = 2$ (i.e., when predicting a matrix from a matrix, a 3-way array from a vector, or a vector from a 3-way array), then $\mathbb{B}$ is a matrix and we require additional orthogonality restrictions:

3. $\mathbf{u}_{1r}^T \mathbf{u}_{1r^*} = 0$ for all $r \neq r^*$, or $\mathbf{v}_{1r}^T \mathbf{v}_{1r^*} = 0$ for all $r \neq r^*$.

In practice these restrictions can be imposed post-hoc, after the estimation procedure detailed in Section 7. For $L + M \ge 3$, restrictions 1 and 2 can be imposed via a re-ordering and re-scaling of the components. For $L + M = 2$, components that satisfy restrictions 1, 2 and 3 can be identified via a singular value decomposition of $\mathbb{B}$.

6 Special cases

Here we describe other methods that fall within the family given by the reduced rank ridge objective (9). When predicting a vector from a matrix ($M = 0$, $L = 1$), this framework is equivalent to standard ridge regression (Hoerl and Kennard, 1970), which is equivalent to OLS when $\lambda = 0$. Moreover, a connection between standard ridge regression and continuum regression (Sundberg, 1993) implies that the coefficients obtained through ridge regression are proportional to partial least squares regression for some $\lambda = \lambda^*$, and the coefficients are proportional to the principal components of $\mathbb{X}$ when $\lambda \to \infty$.

When predicting a matrix from another matrix ($M = 1$, $L = 1$), the objective given by (9) is equivalent to reduced rank regression (Izenman, 1975) when $\lambda = 0$. For arbitrary $\lambda$ the objective is equivalent to a recently proposed reduced rank ridge regression (Mukherjee and Zhu, 2011).

When predicting a scalar from a tensor of arbitrary dimension ($M = 0$, arbitrary $L$), (9) is equivalent to tensor ridge regression (Guo et al., 2012). Guo et al. (2012) use an alternating approach to estimation but claim that the subproblem for estimation of each $\mathbf{U}_l$ cannot be computed in closed form, and resort to gradient style methods instead. On the contrary, our optimization approach detailed in Section 7 does give a closed form solution to this subproblem (11). Alternatively, Guo et al. (2012) suggest the separable form of the objective,
$$\underset{\mathrm{rank}(\mathbb{B}) \le R}{\arg\min} \; ||\mathbf{y} - \langle \mathbb{X}, \mathbb{B} \rangle_L||_F^2 + \lambda \sum_{l=1}^{L} ||\mathbf{U}_l||_F^2. \qquad (10)$$
This separable objective is also used by Zhou et al. (2013), who consider a power family of penalty functions for predicting a vector from a tensor using a generalized linear model; their objective for a Gaussian response under $L_2$ penalization is equivalent to (10). The solution of the separable $L_2$ penalty depends on arbitrary scaling and orthogonality restrictions for identifiability of the $\mathbf{U}_l$'s. For example, the separable penalty (10) is equivalent to the non-separable $L_2$ penalty (9) if the columns of $\mathbf{U}_1, \ldots, \mathbf{U}_L$ are restricted to be orthonormal.

Without scale restrictions on the columns of $\mathbf{U}_l$, the solution to the separable $L_2$ penalty is equal to the solution for the non-separable penalty $||\mathbb{B}||_*$ for $L = 2$, where $||\mathbb{B}||_*$ defines the nuclear norm (i.e., the sum of the singular values of $\mathbb{B}$). This interesting result is given explicitly in Proposition 1, and its proof is given in Appendix C.
Proposition 1. For $\mathbb{B} = [[\mathbf{U}_1, \mathbf{U}_2]] = \mathbf{U}_1 \mathbf{U}_2^T$, where the columns of $\mathbf{U}_1$ and $\mathbf{U}_2$ are orthogonal,
$$\underset{\mathrm{rank}(\mathbb{B}) \le R}{\arg\min} \; ||\mathbf{y} - \langle \mathbb{X}, \mathbb{B} \rangle_2||_F^2 + \lambda \sum_{l=1}^{2} ||\mathbf{U}_l||_F^2 = \underset{\mathrm{rank}(\mathbb{B}) \le R}{\arg\min} \; ||\mathbf{y} - \langle \mathbb{X}, \mathbb{B} \rangle_2||_F^2 + 2\lambda ||\mathbb{B}||_*.$$

7 Optimization
We describe an iterative procedure to estimate $\mathbb{B}$ that alternatingly solves the objective (9) for the component vectors in each mode, $\{\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M\}$, with the others fixed.

7.1 Unregularized estimation

Here we consider the case without ridge regularization, $\lambda = 0$, wherein the component vectors in each mode are updated via separate OLS regressions. To simplify notation we describe the procedure to update $\mathbf{U}_1$ with $\{\mathbf{U}_2, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M\}$ fixed. The procedure to update each of $\mathbf{U}_2, \ldots, \mathbf{U}_L$ is analogous, because the loss function is invariant under permutation of the $L$ modes of $\mathbb{X}$.

Define $\mathbb{C}_r : N \times P_1 \times Q_1 \times \cdots \times Q_M$ to be the contracted tensor product of $\mathbb{X}$ and the $r$th component of the CP factorization without $\mathbf{u}_{1r}$:
$$\mathbb{C}_r = \langle \mathbb{X}, \mathbf{u}_{2r} \circ \cdots \circ \mathbf{u}_{Lr} \circ \mathbf{v}_{1r} \circ \cdots \circ \mathbf{v}_{Mr} \rangle_{L-1}.$$
Unfolding $\mathbb{C}_r$ along the dimension corresponding to $P_1$ gives the design matrix to predict vec($\mathbb{Y}$) for the $r$th column of $\mathbf{U}_1$, $\mathbf{C}_r : NQ \times P_1$. Thus, concatenating these matrices to define $\mathbf{C} : NQ \times RP_1$ by $\mathbf{C} = [\mathbf{C}_1 \cdots \mathbf{C}_R]$ gives the design matrix for all of the entries of $\mathbf{U}_1$, which are updated via OLS:
$$\mathrm{vec}(\mathbf{U}_1) = \left( \mathbf{C}^T \mathbf{C} \right)^{-1} \mathbf{C}^T \mathrm{vec}(\mathbb{Y}). \qquad (11)$$

For the outcome modes we describe the procedure to update $\mathbf{V}_M$ with $\{\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_{M-1}\}$ fixed. The procedure to update each of $\mathbf{V}_1, \ldots, \mathbf{V}_{M-1}$ is analogous, because the loss function is invariant under permutation of the $M$ modes of $\mathbb{Y}$.

Let $\mathbf{Y}_M : Q_M \times N \prod_{m=1}^{M-1} Q_m$ be $\mathbb{Y}$ unfolded along the mode corresponding to $Q_M$. Define $\mathbf{D} : N \prod_{m=1}^{M-1} Q_m \times R$ so that the $r$th column of $\mathbf{D}$, $\mathbf{d}_r$, gives the entries of the contracted tensor product of $\mathbb{X}$ and the $r$th component of the CP factorization without $\mathbf{v}_{Mr}$:
$$\mathbf{d}_r = \mathrm{vec}\left( \langle \mathbb{X}, \mathbf{u}_{1r} \circ \cdots \circ \mathbf{u}_{Lr} \circ \mathbf{v}_{1r} \circ \cdots \circ \mathbf{v}_{(M-1)r} \rangle_L \right).$$
The entries of $\mathbf{V}_M$ are then updated via $Q_M$ separate OLS regressions:
$$\mathbf{V}_M^T = (\mathbf{D}^T \mathbf{D})^{-1} \mathbf{D}^T \mathbf{Y}_M^T. \qquad (12)$$
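For concreteness, the following sketch implements the update (11) for $\mathbf{U}_1$, and with $\lambda > 0$ the ridge-modified update (13) of the next subsection; it assembles the design matrix $\mathbf{C}$ block by block using the column-major unfolding identities of Section 2. The function name and input conventions are ours.

    # One block update of U1 with all other factor matrices fixed
    # (eq. 11; eq. 13 when lambda > 0).
    # X: N x P1 x ... x PL array; Y: N x Q1 x ... x QM array;
    # U: list(U_1, ..., U_L); V: list(V_1, ..., V_M), with M >= 1.
    update_U1 <- function(X, Y, U, V, lambda = 0) {
      N <- dim(X)[1]; P1 <- dim(X)[2]; R <- ncol(U[[1]])
      C <- NULL
      for (r in 1:R) {
        # vec(u_2r o ... o u_Lr): weights that contract the modes P2...PL
        w <- 1
        for (l in seq_along(U)[-1]) w <- as.vector(outer(w, U[[l]][, r]))
        Tr <- matrix(matrix(X, nrow = N * P1) %*% w, N, P1)
        # vec(v_1r o ... o v_Mr): expansion over the outcome modes
        v <- 1
        for (m in seq_along(V)) v <- as.vector(outer(v, V[[m]][, r]))
        C <- cbind(C, kronecker(v, Tr))  # (N*Q) x P1 block for component r
      }
      # Element-wise (Hadamard) product of Gram matrices, as in eq. (13)
      G <- Reduce(`*`, lapply(c(U[-1], V), crossprod))
      vecU1 <- solve(crossprod(C) + lambda * kronecker(G, diag(P1)),
                     crossprod(C, as.vector(Y)))
      matrix(vecU1, nrow = P1, ncol = R)
    }

Cycling such updates over $\mathbf{U}_1, \ldots, \mathbf{U}_L$ (permuting the modes of $\mathbb{X}$) and over $\mathbf{V}_1, \ldots, \mathbf{V}_M$ gives the full alternating scheme.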
7.2 Ridge-regularization

For $\lambda > 0$, note that the objective (9) can be equivalently represented as an unregularized least squares problem with modified predictor and outcome arrays $\tilde{\mathbb{X}}$ and $\tilde{\mathbb{Y}}$:
$$\underset{\mathrm{rank}(\mathbb{B}) \le R}{\arg\min} \; ||\tilde{\mathbb{Y}} - \langle \tilde{\mathbb{X}}, \mathbb{B} \rangle_L||_F^2.$$
Here $\tilde{\mathbb{X}} : (N + P) \times P_1 \times \cdots \times P_L$ is the concatenation of $\mathbb{X}$ and a tensor wherein each $P_1 \times \cdots \times P_L$ dimensional slice has $\sqrt{\lambda}$ for a single entry and zeros elsewhere; $\tilde{\mathbb{Y}} : (N + P) \times Q_1 \times \cdots \times Q_M$ is the concatenation of $\mathbb{Y}$ and a $P \times Q_1 \times \cdots \times Q_M$ tensor of zeros. Unfolding $\tilde{\mathbb{X}}$ and $\tilde{\mathbb{Y}}$ along the first dimension yields the matrices
$$\tilde{\mathbf{X}}_{(1)} = \begin{bmatrix} \mathbf{X}_{(1)} \\ \sqrt{\lambda} \, \mathbf{I}_{P \times P} \end{bmatrix} \quad \text{and} \quad \tilde{\mathbf{Y}}_{(1)} = \begin{bmatrix} \mathbf{Y}_{(1)} \\ \mathbf{0}_{P \times Q} \end{bmatrix},$$
where $\mathbf{I}$ is the identity matrix and $\mathbf{0}$ is a matrix of zeros.

Thus, one can optimize the objective (9) via alternating least squares by substituting $\tilde{\mathbb{X}}$ for $\mathbb{X}$ and $\tilde{\mathbb{Y}}$ for $\mathbb{Y}$ in the least-squares algorithm of Section 7.1. However, $\tilde{\mathbb{X}}$ and $\tilde{\mathbb{Y}}$ can be very large for high-dimensional $\mathbb{X}$. Thankfully, straightforward tensor algebra shows that this is equivalent to a direct application of the algorithm in Section 7.1 to the original data $\mathbb{X}$ and $\mathbb{Y}$, with computationally efficient modifications to the OLS updating steps (11) and (12). The updating step for $\mathbf{U}_1$ (11) becomes
$$\mathrm{vec}(\mathbf{U}_1) = \left( \mathbf{C}^T \mathbf{C} + \lambda (\mathbf{U}_2^T \mathbf{U}_2 \cdot \ldots \cdot \mathbf{U}_L^T \mathbf{U}_L \cdot \mathbf{V}_1^T \mathbf{V}_1 \cdot \ldots \cdot \mathbf{V}_M^T \mathbf{V}_M) \otimes \mathbf{I}_{P_1 \times P_1} \right)^{-1} \mathbf{C}^T \mathrm{vec}(\mathbb{Y}), \qquad (13)$$
where $\cdot$ denotes the element-wise (Hadamard) product and $\otimes$ denotes the Kronecker product. The updating step for $\mathbf{V}_M$ (12) becomes
$$\mathbf{V}_M^T = \left( \mathbf{D}^T \mathbf{D} + \lambda (\mathbf{U}_1^T \mathbf{U}_1 \cdot \ldots \cdot \mathbf{U}_L^T \mathbf{U}_L \cdot \mathbf{V}_1^T \mathbf{V}_1 \cdot \ldots \cdot \mathbf{V}_{M-1}^T \mathbf{V}_{M-1}) \right)^{-1} \mathbf{D}^T \mathbf{Y}_M^T. \qquad (14)$$
This iterative procedure is guaranteed to improve the regularized least squares objective (9) at each sub-step, until convergence to a (potentially local) optimum. Higher levels of regularization ($\lambda \to \infty$) tend to convexify the objective and facilitate convergence to a global optimum; a similar phenomenon is observed in Zhou et al. (2013). In practice we find that robustness to initial values and local optima is improved by a tempered regularization, starting with larger values of $\lambda$ that gradually decrease to the desired level of regularization.

7.3 Tuning parameter selection

Selection of $\lambda$ and $R$ (if unknown) can be accomplished by assessing predictive accuracy with a training and test set, as illustrated in Section 10. More generally, these parameters can be selected via K-fold cross-validation. This approach has the advantage of being free of model assumptions. Alternatively, it is straightforward to compute the deviance information criterion (DIC) (Spiegelhalter et al., 2014) for posterior draws under the Bayesian inference framework of Section 8 and use this as a model-based heuristic to select both $\lambda$ and $R$.
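As a sketch of the training/test strategy, the grid search below takes the fitting routine as an argument (here called fit_ttr, a hypothetical function returning the estimated coefficient array, e.g., an implementation of the alternating algorithm above) and scores each (R, λ) pair by relative prediction error on held-out data:

    # Hypothetical grid search over (R, lambda) by held-out prediction error.
    select_tuning <- function(X, Y, X_test, Y_test, ranks, lambdas, L, fit_ttr) {
      best <- list(rpe = Inf)
      for (R in ranks) for (lam in lambdas) {
        Bhat <- fit_ttr(X, Y, R = R, lambda = lam)
        rpe <- sum((Y_test - ctprod(X_test, Bhat, L))^2) / sum(Y_test^2)
        if (rpe < best$rpe) best <- list(R = R, lambda = lam, rpe = rpe)
      }
      best
    }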
8 Bayesian inference

In the previous sections we have considered optimizing a given criterion for point estimation, without specifying a distributional form for the data or even a philosophical framework for inference. Indeed, the estimator given by the objective (9) is consistent under a wide variety of distributional assumptions, including those that allow for correlated responses or predictors. See Appendix A for more details on its consistency.

For inference and uncertainty quantification for this point estimate, we propose a Markov chain Monte Carlo (MCMC) simulation approach. This approach is theoretically motivated by the observation that (9) gives the mode of a Bayesian probability distribution. There are several other reasons to use MCMC simulation for inference in this context, rather than (e.g.) asymptotic normality of the global optimizer under an assumed likelihood model (see Zhou et al. (2013) and Zhang et al. (2014) for related results). The algorithm in Section 7 may converge to a local minimum, which can still be used as a starting value for MCMC. Moreover, our approach gives a framework for full posterior inference on $\hat{\mathbb{B}}$ over its rank $R$ support, the conditional mean for observed responses, and the predictive distribution for the response array given new realizations of the predictor array, without requiring the identifiability of $\theta = \{\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M\}$. Inference for $\theta$ is also possible under the conditions of Section 5.

If the errors $\mathbb{E}$ have independent N(0, $\sigma^2$) entries, the log-likelihood of $\mathbb{Y}$ implied by the general model (2) is
$$\log \mathrm{pr}(\mathbb{Y} \mid \sigma^2, \mathbb{B}, \mathbb{X}) = \text{constant} - \frac{1}{2\sigma^2} ||\mathbb{Y} - \langle \mathbb{X}, \mathbb{B} \rangle_L||_F^2,$$
and thus the unregularized objective (8) gives the maximum likelihood estimate under the restriction rank($\mathbb{B}$) = $R$.
For $\lambda > 0$, consider a prior distribution for $\mathbb{B}$ that is proportional to the spherical Gaussian distribution with variance $\sigma^2 / \lambda$ over the support of rank $R$ tensors:
$$\mathrm{pr}(\mathbb{B}) \propto \begin{cases} \exp\left( -\frac{\lambda}{2\sigma^2} ||\mathbb{B}||_F^2 \right) & \text{if } \mathrm{rank}(\mathbb{B}) \le R, \\ 0 & \text{otherwise.} \end{cases} \qquad (15)$$
The resulting log posterior density for $\mathbb{B}$ is
$$\log \mathrm{pr}(\mathbb{B} \mid \mathbb{Y}, \mathbb{X}, \sigma^2) = \text{constant} - \frac{1}{2\sigma^2} \left( ||\mathbb{Y} - \langle \mathbb{X}, \mathbb{B} \rangle_L||_F^2 + \lambda ||\mathbb{B}||_F^2 \right), \qquad (16)$$
where rank($\mathbb{B}$) = $R$, which is maximized by (9).

Under the factorized form (6) the full conditional distributions implied by (16) for each of $\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M$ are multivariate normal. For example, the full conditional for $\mathbf{U}_1$ is
$$\mathrm{pr}(\mathrm{vec}(\mathbf{U}_1) \mid \mathbf{U}_2, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M, \mathbb{Y}, \mathbb{X}, \sigma^2) = \mathrm{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1),$$
where $\boldsymbol{\mu}_1$ is the right hand side of (13) and
$$\boldsymbol{\Sigma}_1 = \sigma^2 \left( \mathbf{C}^T \mathbf{C} + \lambda (\mathbf{U}_2^T \mathbf{U}_2 \cdot \ldots \cdot \mathbf{U}_L^T \mathbf{U}_L \cdot \mathbf{V}_1^T \mathbf{V}_1 \cdot \ldots \cdot \mathbf{V}_M^T \mathbf{V}_M) \otimes \mathbf{I}_{P_1 \times P_1} \right)^{-1},$$
where $\mathbf{C}$ is defined as in Section 7.1. The full conditional for $\mathbf{V}_M$ is
$$\mathrm{pr}(\mathrm{vec}(\mathbf{V}_M) \mid \mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_{M-1}, \mathbb{Y}, \mathbb{X}, \sigma^2) = \mathrm{N}(\boldsymbol{\mu}_{L+M}, \boldsymbol{\Sigma}_{L+M}),$$
where $\boldsymbol{\mu}_{L+M}$ is given by the right hand side of (14) and
$$\boldsymbol{\Sigma}_{L+M} = \sigma^2 \left( \mathbf{D}^T \mathbf{D} + \lambda (\mathbf{U}_1^T \mathbf{U}_1 \cdot \ldots \cdot \mathbf{U}_L^T \mathbf{U}_L \cdot \mathbf{V}_1^T \mathbf{V}_1 \cdot \ldots \cdot \mathbf{V}_{M-1}^T \mathbf{V}_{M-1}) \right)^{-1} \otimes \mathbf{I}_{Q_M \times Q_M}.$$
The derivations of the conditional means and variances $\{\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i\}_{i=1}^{L+M}$ are given in Appendix B. When $\lambda = 0$ the full conditionals correspond to a flat prior on $\mathbb{B}$, $\mathrm{pr}(\mathbb{B}) \propto 1\{\mathrm{rank}(\mathbb{B}) \le R\}$, and the posterior mode is given by the unregularized objective (8).

In practice we use a flexible Jeffreys prior for $\sigma^2$, $\mathrm{pr}(\sigma^2) \propto 1/\sigma^2$, which leads to an inverse-gamma (IG) full conditional distribution,
$$\mathrm{pr}(\sigma^2 \mid \mathbb{B}, \mathbb{Y}, \mathbb{X}) = \mathrm{IG}\left( \frac{NQ}{2}, \; \frac{||\mathbb{Y} - \langle \mathbb{X}, \mathbb{B} \rangle_L||_F^2}{2} \right). \qquad (17)$$
We simulate dependent samples from the marginal posterior distribution of $\mathbb{B}$ by Gibbs sampling from the full conditionals of $\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M$, and $\sigma^2$:

1. Initialize $\mathbb{B}^{(0)}$ by the posterior mode (9) using the procedure in Section 7. For samples $t = 1, \ldots, T$, repeat steps 2 and 3:
2. Draw $\sigma^2_{(t)}$ from $\mathrm{pr}\left( \sigma^2 \mid \mathbb{B}^{(t-1)}, \mathbb{Y}, \mathbb{X} \right)$ as in (17).
3. Draw $\mathbb{B}^{(t)} = [[\mathbf{U}_1^{(t)}, \ldots, \mathbf{U}_L^{(t)}, \mathbf{V}_1^{(t)}, \ldots, \mathbf{V}_M^{(t)}]]$, as follows:
$$\mathbf{U}_1^{(t)} \sim \mathrm{pr}\left( \mathbf{U}_1 \mid \mathbf{U}_2^{(t-1)}, \ldots, \mathbf{U}_L^{(t-1)}, \mathbf{V}_1^{(t-1)}, \ldots, \mathbf{V}_M^{(t-1)}, \mathbb{Y}, \mathbb{X}, \sigma^2_{(t)} \right)$$
$$\vdots$$
$$\mathbf{U}_L^{(t)} \sim \mathrm{pr}\left( \mathbf{U}_L \mid \mathbf{U}_1^{(t)}, \ldots, \mathbf{U}_{L-1}^{(t)}, \mathbf{V}_1^{(t-1)}, \ldots, \mathbf{V}_M^{(t-1)}, \mathbb{Y}, \mathbb{X}, \sigma^2_{(t)} \right)$$
$$\mathbf{V}_1^{(t)} \sim \mathrm{pr}\left( \mathbf{V}_1 \mid \mathbf{U}_1^{(t)}, \ldots, \mathbf{U}_L^{(t)}, \mathbf{V}_2^{(t-1)}, \ldots, \mathbf{V}_M^{(t-1)}, \mathbb{Y}, \mathbb{X}, \sigma^2_{(t)} \right)$$
$$\vdots$$
$$\mathbf{V}_M^{(t)} \sim \mathrm{pr}\left( \mathbf{V}_M \mid \mathbf{U}_1^{(t)}, \ldots, \mathbf{U}_L^{(t)}, \mathbf{V}_1^{(t)}, \ldots, \mathbf{V}_{M-1}^{(t)}, \mathbb{Y}, \mathbb{X}, \sigma^2_{(t)} \right).$$

For the above algorithm $\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M$ serve as a parameter augmentation to facilitate sampling for $\mathbb{B}$. Interpreting the marginal distribution of each of the $\mathbf{U}_l$'s or $\mathbf{V}_m$'s separately requires careful consideration of their identifiability (see Section 5). One approach is to perform a post-hoc transformation of the components at each sampling iteration,
$$\mathbb{B}^{(t)} = [[\mathbf{U}_1^{*(t)}, \ldots, \mathbf{U}_L^{*(t)}, \mathbf{V}_1^{*(t)}, \ldots, \mathbf{V}_M^{*(t)}]],$$
where $\{\mathbf{U}_1^{*(t)}, \ldots, \mathbf{U}_L^{*(t)}, \mathbf{V}_1^{*(t)}, \ldots, \mathbf{V}_M^{*(t)}\}$ satisfy given restrictions for identifiability.

For $\tilde{N}$ out-of-sample observations with predictor array $\mathbb{X}_{\mathrm{new}} : \tilde{N} \times P_1 \times \cdots \times P_L$, the point prediction for the responses is
$$\hat{\mathbb{Y}}_{\mathrm{new}} = \langle \mathbb{X}_{\mathrm{new}}, \hat{\mathbb{B}} \rangle_L, \qquad (18)$$
where $\hat{\mathbb{B}}$ is given by (9).
Uncertainty in this prediction can be assessed using samples from the posterior predictive distribution of $\mathbb{Y}_{\mathrm{new}}$:
$$\mathbb{Y}_{\mathrm{new}}^{(t)} = \langle \mathbb{X}_{\mathrm{new}}, \mathbb{B}^{(t)} \rangle_L + \mathbb{E}_{\mathrm{new}}^{(t)}, \qquad (19)$$
where $\mathbb{E}_{\mathrm{new}}^{(t)}$ is generated with independent N(0, $\sigma^2_{(t)}$) entries.
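A minimal sketch of the predictive draws (19), assuming posterior samples of $\mathbb{B}$ and $\sigma^2$ from the Gibbs sampler are stored in a list B_draws and a vector sigma2_draws (all names are ours):

    # The inverse-gamma draw of sigma^2 in (17) can be taken as
    #   sigma2 <- 1 / rgamma(1, shape = N * Q / 2, rate = sse / 2),
    # where sse = sum((Y - ctprod(X, B, L))^2).

    # Posterior predictive draws for new observations (eq. 19).
    predict_draws <- function(Xnew, B_draws, sigma2_draws, L) {
      lapply(seq_along(B_draws), function(t) {
        mu <- ctprod(Xnew, B_draws[[t]], L)
        mu + array(rnorm(length(mu), sd = sqrt(sigma2_draws[t])),
                   dim = dim(mu))
      })
    }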
9 Simulation study

We conduct a simulation study to predict a three-way array $\mathbb{Y}$ from another three-way array $\mathbb{X}$ under various conditions. We implement a fully crossed factorial simulation design with the following manipulated conditions:

• Rank R = 0, 1, 2, 3, 4, 5 (6 levels)
• Sample size N = 30 or 120 (2 levels)
• Signal-to-noise ratio SNR = 1 or 5 (2 levels).

For each of the 24 scenarios, we simulate data as follows:

1. Generate $\mathbb{X} : N \times P_1 \times P_2$ with independent N(0, 1) entries.
2. Generate $\mathbf{U}_l : P_l \times R$ for $l = 1, \ldots, L$ and $\mathbf{V}_m : Q_m \times R$ for $m = 1, \ldots, M$, each with independent N(0, 1) entries.
3. Generate error $\mathbb{E} : N \times Q_1 \times Q_2$ with independent N(0, 1) entries.
4. Set $\mathbb{Y} = \langle \mathbb{X}, \mathbb{B} \rangle_L + \mathbb{E}$, where $\mathbb{B} = c \, [[\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M]]$ and $c$ is the scalar giving $||\langle \mathbb{X}, \mathbb{B} \rangle_L||_F \, / \, ||\mathbb{E}||_F = \mathrm{SNR}$.

We fix the dimensions $P_1 = 15$, $P_2 = 20$, $Q_1 = 5$, $Q_2 = 10$, and generate 10 replicated datasets as above for each of the 24 scenarios, yielding 240 simulated datasets.
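A sketch of this generator, reusing the ctprod and cp_compose helpers from earlier sections (the function name and argument defaults are ours):

    # Simulate one dataset per steps 1-4: Y = <X, B>_2 + E, with B of rank R
    # scaled so that ||<X, B>||_F / ||E||_F = SNR.
    sim_dataset <- function(N, P = c(15, 20), Q = c(5, 10), R, SNR) {
      X <- array(rnorm(N * prod(P)), dim = c(N, P))
      E <- array(rnorm(N * prod(Q)), dim = c(N, Q))
      if (R == 0) return(list(X = X, Y = E))      # null case: no signal
      mats <- lapply(c(P, Q), function(d) matrix(rnorm(d * R), d, R))
      B <- cp_compose(mats)
      signal <- ctprod(X, B, length(P))
      cc <- SNR * sqrt(sum(E^2) / sum(signal^2))  # rescale to target SNR
      list(X = X, Y = cc * signal + E, B = cc * B)
    }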
For each simulated dataset, we estimate $\mathbb{B}$ as in Section 7 under each combination of the following parameters:

• Assumed rank $\hat{R}$ = 1, 2, 3, 4, 5 (5 levels)
• Regularization term λ = 0, 0.5, 1, 5, 50 (5 levels).

For each of the 240 simulated datasets and 5 · 5 = 25 estimation parameter combinations, we assess the predictive performance of the estimated coefficients $\hat{\mathbb{B}}$. This is done empirically by generating a new dataset with $\tilde{N} = 500$ observations:
$$\mathbb{Y}_{\mathrm{new}} = \langle \mathbb{X}_{\mathrm{new}}, \mathbb{B} \rangle_L + \mathbb{E}_{\mathrm{new}},$$
where $\mathbb{X}_{\mathrm{new}}$ and $\mathbb{E}_{\mathrm{new}}$ have independent N(0, 1) entries. The relative prediction error (RPE) for these test observations is
$$\mathrm{RPE} = \frac{||\mathbb{Y}_{\mathrm{new}} - \langle \mathbb{X}_{\mathrm{new}}, \hat{\mathbb{B}} \rangle_L||_F^2}{||\mathbb{Y}_{\mathrm{new}}||_F^2}. \qquad (20)$$
Symmetric 95% credible intervals are created for each value of $\mathbb{Y}_{\mathrm{new}}$ using $T = 1000$ outcome arrays simulated from the posterior (19).

First we consider the results for those cases with no signal, R = 0, where the oracle RPE is 1. The marginal mean RPE across the levels of N, λ, and $\hat{R}$ are shown in Table 1. Overall, simulations with a higher training sample size N resulted in lower RPE, estimation with higher regularization parameter λ resulted in lower RPE, and estimation with higher assumed rank resulted in higher RPE. These results are not surprising, as a lower sample size, higher assumed rank and less regularization all encourage over-fitting.

Table 1: Marginal mean RPE for the null-signal cases (R = 0), by training sample size, regularization, and assumed rank.

  Training samples   N = 30: 1.48    N = 120: 1.08
  Regularization     λ = 0: 2.00   λ = 0.5: 1.14   λ = 1: 1.13   λ = 5: 1.08   λ = 50: 1.02
  Assumed rank       R̂ = 1: 1.04   R̂ = 2: 1.12   R̂ = 3: 1.20   R̂ = 4: 1.32   R̂ = 5: 1.70

Table 2 shows the effect of the regularization parameter λ on the accuracy of the estimated model, in terms of RPE and coverage rates, for different scenarios. As expected, prediction error is generally improved in scenarios with a higher training sample size and higher signal-to-noise ratio. Higher values of λ generally improve predictive performance, but very high values of λ can lead to over-shrinkage of the estimated coefficients and introduce unnecessary bias, especially in scenarios that are less prone to over-fitting. Coverage rates of the 95% credible intervals are generally appropriate, especially with a higher training sample size. However, for the scenario with low sample size and high signal (N = 30, SNR = 5) coverage rates for moderate values of λ are poor, as inference is biased toward smaller values of $\mathbb{B}$.

Table 2: The top panel shows mean RPE by regularization (λ = 0, 0.5, 1, 5, 50) for the scenarios defined by N ∈ {30, 120} and SNR ∈ {1, 5}, using correct assumed ranks. The bottom panel shows the coverage rate for 95% credible intervals, and their mean length relative to the standard deviation of $\mathbb{Y}$.
Table 3 illustrates the effects of rank misspecification on performance, under the scenario with N = 120, SNR = 1 and no regularization (λ = 0). For each possible value of the true rank R = 1, . . . , 5, the RPE is minimized when the assumed rank is equal to the true rank. Predictive performance is generally more robust to assuming a rank higher than the true rank than it is to assuming a rank lower than the true rank.

Table 3: Mean RPE by assumed rank $\hat{R}$ for different true ranks R, for N = 120, SNR = 1, and λ = 0.

See Appendix D for additional simulation results when the predictors $\mathbb{X}$ or response $\mathbb{Y}$ are correlated.
10 Application
We use the tensor-on-tensor regression model to predict attributes from facial images, using the Labeled Faces in the Wild database (Learned-Miller et al., 2016). The database includes over 13,000 publicly available images taken from the internet, where each image includes the face of an individual. Each image is labeled only with the name of the individual depicted, often a celebrity, and there are multiple images for each individual. The images are unposed and exhibit wide variation in lighting, image quality, angle, etc. (hence "in the wild").

Low-rank matrix factorization approaches are commonly used to analyze facial image data, particularly in the context of facial recognition (Sirovich and Kirby, 1987; Turk and Pentland, 1991; Vasilescu and Terzopoulos, 2002; Kim and Choi, 2007). Although facial images are not obviously multi-linear, the use of multi-way factorization techniques has been shown to convey advantages over simply vectorizing images (e.g., from a $P_1 \times P_2$ array of pixels to a vector of length $P_1 P_2$) (Vasilescu and Terzopoulos, 2002). Kim and Choi (2007) show that treating color as another mode within a tensor factorization framework can improve facial recognition tasks with different lighting. Moreover, the CP factorization has been shown to be much more efficient as a dimension reduction tool for facial images than PCA, and marginally more efficient than the Tucker and other multiway factorization techniques (Lock et al., 2011).

Kumar et al. (2009) developed an attribute classifier, which gives describable attributes for a given facial image. These attributes include characteristics that describe the individual (e.g., gender, race, age), that describe their expression (e.g., smiling, frowning, eyes open), and that describe their accessories (e.g., glasses, make-up, jewelry). These attributes were determined on the Faces in the Wild dataset, as well as other facial image databases. In total 72 attributes are measured for each image. The attributes are measured on a continuous scale; for example, for the smiling attribute, higher values correspond to a more obvious smile and lower values correspond to no smile.

Our goal is to create an algorithm to predict the 72 describable and correlated attributes from a given image that contains a face. First, the images are frontalized as described in Hassner et al. (2015). In this process the unconstrained images are rotated, scaled, and cropped so that all faces appear forward-facing and the image shows only the face. After this step images are aligned over the coordinates, in that we expect the nose, mouth and other facial features to be in approximately the same location. Each frontalized image is 90 × 90 pixels, and each pixel gives the intensity for colors red, green and blue, resulting in a multiway array of dimensions 90 × 90 × 3. We center the array by subtracting the "mean face" from each image, i.e., we center each pixel triplet (x × y × color) to have mean 0 over the collection of frontalized images. We standardize the facial attribute data by converting the measurements to z-scores, wherein each attribute has mean zero and standard deviation 1 over the collection of faces.

To train the predictive model we use a random sample of 1000 images from unique individuals. Thus the predictor array of images $\mathbb{X}$ is of dimension 1000 × 90 × 90 × 3, and the outcome array of attributes $\mathbb{Y}$ is of dimension 1000 × 72. Another set of 1000 images from unique individuals is used as a validation set, $\mathbb{X}_{\mathrm{new}}$: 1000 × 90 × 90 × 3 and $\mathbb{Y}_{\mathrm{new}}$: 1000 × 72.
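A sketch of these preprocessing steps, assuming the frontalized images are stacked in an array images of dimension 1000 × 90 × 90 × 3 and the attribute scores in a 1000 × 72 matrix attrs (both names are ours):

    # Subtract the "mean face": center each (x, y, color) cell over images.
    mean_face <- apply(images, c(2, 3, 4), mean)    # 90 x 90 x 3
    X <- sweep(images, c(2, 3, 4), mean_face)       # 1000 x 90 x 90 x 3
    # Convert each attribute to z-scores (mean 0, sd 1 across faces).
    Y <- scale(attrs)                               # 1000 x 72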
We run the optimization algorithm in Section 7 to estimate the coefficient array $\mathbb{B}$: 90 × 90 × 3 × 72 under various values for the rank R and regularization parameter λ. We consider all combinations of eight values of λ (including λ = 0) and a range of ranks R, as well as the full rank model that ignores multi-way structure, wherein the coefficients are given by separate ridge regressions for each of the 72 outcomes on the 90 · 90 · 3 = 24300 predictor variables. The relative prediction error (20) on the validation set for each choice is shown in Figure 1. Predictive performance was optimized with R = 15 and λ = 10. The performance of models with no regularization (λ = 0), or without rank restriction (rank = FULL), was much worse in comparison. This illustrates the benefits of simultaneous rank restriction and ridge regularization for high-dimensional multi-way prediction problems.

Figure 1: Relative prediction error for characteristics of out-of-sample images for different parameter choices. The top row (full rank) gives the results under separate ridge regression models for each outcome without rank restriction.

In what follows we use R = 15 and λ = 10. Figure 2 shows the predicted values vs. the given values, for the test data, over all 72 characteristics. The plot shows substantial residual variation but a clear positive trend. Empirical coverage of the credible intervals over all test predictions was 0.934 for 95% credible intervals and 0.887 for 90% credible intervals. The full posterior distributions for a small number of select characteristics, for a single test image, are shown in Figure 3 as an illustration of the results.

Figure 3: Example test image (left), and its posterior samples for 5 select characteristics (right).
11 Discussion
In this article we have proposed a general framework for predicting one multiway array from another using the contracted tensor product. The simulation studies and facial image application illustrate the advantages of CP-rank regularization and ridge regularization in this framework. These two parameters define a broad class of models that are appropriate for a wide variety of scenarios. The CP assumption accounts for multi-way dependence in both the predictor and outcome array, and the ridge penalty accounts for auxiliary high-dimensionality and multi-collinearity of the predictors. However, several alternative regularization strategies are possible. The coefficient array can be restricted to have a more general Tucker structure (as in Li et al. (2013)), rather than a CP structure. A broad family of separable penalty functions, such as the separable $L_2$ penalty in (10), are straightforward to impose within the general framework using an alternating estimation scheme similar to that described in Zhou et al. (2013). In particular, a separable $L_1$ penalty has advantages when a solution that includes sparse subregions of each mode is desired. The alternating estimation scheme described herein for the non-separable $L_2$ penalty is not easily extended to alternative non-separable penalty functions.

We have described a simple Gaussian likelihood and a prior distribution for $\mathbb{B}$ that are motivated by the least-squares objective with non-separable $L_2$ penalty. The resulting probability model involves many simplifying assumptions, which may be over-simplified for some situations. In particular, the assumption of independent and homoscedastic error in the outcome array can be inappropriate for applications with auxiliary structure in the outcome. The array normal distribution (Akdemir and Gupta, 2011; Hoff, 2011) allows for multiway dependence and can be used as a more flexible model for the error covariance. Alternatively, envelope methods (Cook and Zhang, 2015) rely on a general technique to account for and ignore immaterial structure in the response and/or predictors of a predictive model. A tensor envelope is defined in Li and Zhang (2016), and its use in the tensor-on-tensor regression framework is an interesting direction for future work.

The approach to inference used herein is "semi-Bayesian", in that the prior is limited to facilitate inference under the given penalized least-squares objective and is not intended to be subjective. Fully Bayesian approaches, such as using a prior for both the rank of the coefficient array and the shrinkage parameter, are another interesting direction of future work.

SUPPLEMENTARY MATERIAL
MultiwayRegression:
R package MultiwayRegression, containing documented code for all methods described in this article. (GNU zipped tar file)
A Consistency
Here we establish the consistency of the minimizer of the objective (9), for fixed dimensions as $N \to \infty$, under general conditions.

Theorem 1.
Assume model (2) holds for $\mathbb{B} = \mathbb{B}_0$, where

1. For each response index $(q_1, \ldots, q_M)$, the errors $\mathbb{E}[n, q_1, \ldots, q_M]$ are independent and identically distributed (iid) for $n = 1, \ldots, N$, with mean 0 and finite second moment.
2. For each predictor index $(p_1, \ldots, p_L)$, $\mathbb{X}[n, p_1, \ldots, p_L]$ are iid for $n = 1, \ldots, N$ from a bounded distribution.
3. $\mathbb{B}_0$ has a rank $R_0$ factorization (6), where $\theta_0 = \{\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M\}$ is in the interior of a compact space $\Theta$ and is identifiable under the restrictions of Section 5.

For $R = R_0$ and fixed ridge penalty $\lambda \ge 0$, the minimizer of the objective (9), $\hat{\mathbb{B}}_N$, converges to $\mathbb{B}_0$ in probability as $N \to \infty$. Moreover, under the restrictions of Section 5 the factorization parameters $\hat{\theta}_N$ converge to $\theta_0$ in probability as $N \to \infty$.

For Theorem 1 we require that the observations are iid, but within an observation the elements of the error array $\mathbb{E}$ or predictor array $\mathbb{X}$ may be correlated or from different distributions. Also, note that the correct rank is assumed for the estimator, but the result holds for any fixed penalty $\lambda \ge 0$.
The requirements that the predictors $\mathbb{X}$ are bounded and that $\Theta$ is compact are similar to those used to show the consistency of tensor regression under a normal likelihood model in Zhou et al. (2013). These requirements facilitate the use of Glivenko–Cantelli theory with a classical result on the asymptotic consistency of M-estimators (Van der Vaart, 2000); the proof is given below.

Proof.
Let $\mathbb{B}(\theta)$ be the coefficient array resulting from $\theta = \{\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M\}$:
$$\mathbb{B}(\theta) = [[\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M]].$$
Let $M(\theta)$ be the expected squared error for a single observation:
$$M(\theta) = E\left( ||\mathbb{Y}_n - \langle \mathbb{X}_n, \mathbb{B}(\theta) \rangle_L||_F^2 \right),$$
which exists for all $\theta \in \Theta$ because the entries of $\mathbb{E}$ are assumed to have finite second moment. Let $M_N^\lambda(\theta)$ be the penalized empirical squared error loss (9) divided by $N$:
$$M_N^\lambda(\theta) = \frac{1}{N} \left( \sum_{n=1}^{N} ||\mathbb{Y}_n - \langle \mathbb{X}_n, \mathbb{B}(\theta) \rangle_L||_F^2 \right) + \frac{\lambda}{N} ||\mathbb{B}(\theta)||_F^2.$$
From Theorem 5.7 of Van der Vaart (2000), the following three properties imply that $\hat{\theta}$ is a consistent estimator of $\theta_0$:

1. $\inf_{\theta : d(\theta, \theta_0) \ge \epsilon} M(\theta) > M(\theta_0)$ for any $\epsilon > 0$,
2. $M_N^\lambda(\hat{\theta}_N) \le M_N^\lambda(\theta_0) + o_P(1)$, where $o_P(1)$ denotes a sequence of random variables converging to 0 in probability, and
3. $\sup_{\theta \in \Theta} |M_N^\lambda(\theta) - M(\theta)| \overset{P}{\to} 0$.

Because $E(\mathbb{Y}) = \langle \mathbb{X}, \mathbb{B}(\theta_0) \rangle_L$, the coefficient array $\mathbb{B}(\theta_0)$ minimizes the expected squared error. Property 1 then follows from the identifiability of $\theta_0$. For any $\theta \in \Theta$, $M_N^\lambda(\theta) \to M(\theta)$ almost surely by the strong law of large numbers and the fact that
$$\lim_{N \to \infty} \sup_{\theta \in \Theta} \frac{\lambda}{N} ||\mathbb{B}(\theta)||_F^2 = 0. \qquad (21)$$
Also, $M(\theta)$ is necessarily bounded over the compact space $\Theta$. Thus, both $M_N^\lambda(\hat{\theta}_N)$ and $M_N^\lambda(\theta_0)$ are stochastically bounded, and property 2 follows. For property 3 it suffices to show uniform convergence of the unpenalized squared error $M_N(\theta)$, by
$$\sup_{\theta \in \Theta} |M_N^\lambda(\theta) - M(\theta)| \le \sup_{\theta \in \Theta} |M_N(\theta) - M(\theta)| + \sup_{\theta \in \Theta} \frac{\lambda}{N} ||\mathbb{B}(\theta)||_F^2$$
and (21). The uniform convergence of $M_N(\theta)$ can be verified by Glivenko–Cantelli theory. Define
$$m_\theta(\mathbb{X}_n, \mathbb{Y}_n) = ||\mathbb{Y}_n - \langle \mathbb{X}_n, \mathbb{B}(\theta) \rangle_L||_F^2.$$
The class $\{m_\theta : \theta \in \Theta\}$ is Glivenko–Cantelli, because $\Theta$ is compact and $m_\theta(\mathbb{X}_n, \mathbb{Y}_n)$ is continuous as a function of $\theta$ and bounded on $\Theta$ for any $(\mathbb{X}_n, \mathbb{Y}_n)$. Thus, property 3 holds and $\hat{\theta}$ is a consistent estimator of $\theta_0$. By the continuous mapping theorem, $\mathbb{B}(\hat{\theta})$ is also a consistent estimator of the true coefficient array $\mathbb{B}(\theta_0)$.
Here we derive the full conditional distributions of the factorization components for $\mathbb{B} = [[\mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M]]$, used in Section 8.

First, we consider the a priori conditional distributions that are implied by the spherical Gaussian prior for $\mathbb{B}$ (15). Here we derive the prior conditional for $\mathbf{U}_1$, pr($\mathbf{U}_1 \mid \mathbf{U}_2, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M$); the prior conditionals for $\mathbf{U}_2, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M$ are analogous, because the prior for $\mathbb{B}$ is permutation invariant over its $L + M$ modes. Let $\mathbf{b}_r^{(1)}$ give the vectorized form of the CP factorization without $\mathbf{U}_1$,
$$\mathbf{b}_r^{(1)} = \mathrm{vec}\left( \mathbf{u}_{2r} \circ \cdots \circ \mathbf{u}_{Lr} \circ \mathbf{v}_{1r} \circ \cdots \circ \mathbf{v}_{Mr} \right),$$
and define the matrix $\mathbf{B}^{(1)} : Q \prod_{l=2}^{L} P_l \times R$ by $\mathbf{B}^{(1)} = \left[ \mathbf{b}_1^{(1)} \cdots \mathbf{b}_R^{(1)} \right]$. Then
$$\mathrm{vec}\left( \mathbf{U}_1 \mathbf{B}^{(1)T} \right) = \mathrm{vec}(\mathbb{B}) \sim \mathrm{N}\left( \mathbf{0}, \frac{\sigma^2}{\lambda} \mathbf{I}_{PQ \times PQ} \right),$$
and it follows that
$$\mathrm{pr}(\mathrm{vec}(\mathbf{U}_1) \mid \mathbf{U}_2, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M) = \mathrm{N}\left( \mathbf{0}, \left( \mathbf{B}^{(1)T} \mathbf{B}^{(1)} \right)^{-1} \otimes \frac{\sigma^2}{\lambda} \mathbf{I}_{P_1 \times P_1} \right).$$
The general model (2) implies
$$\mathbf{C} \, \mathrm{vec}(\mathbf{U}_1) + \mathrm{vec}(\mathbb{E}) = \mathrm{vec}(\mathbb{Y}),$$
where $\mathbf{C}$ is defined as in (11). If $\mathbb{E}$ has independent N(0, $\sigma^2$) entries, a direct application of the Bayesian linear model (Lindley and Smith, 1972) gives
$$\mathrm{pr}(\mathrm{vec}(\mathbf{U}_1) \mid \mathbf{U}_2, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_M, \mathbb{Y}, \mathbb{X}, \sigma^2) = \mathrm{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1),$$
where
$$\boldsymbol{\mu}_1 = \left( \mathbf{C}^T \mathbf{C} + \lambda \, \mathbf{B}^{(1)T} \mathbf{B}^{(1)} \otimes \mathbf{I}_{P_1 \times P_1} \right)^{-1} \mathbf{C}^T \mathrm{vec}(\mathbb{Y})$$
and
$$\boldsymbol{\Sigma}_1 = \sigma^2 \left( \mathbf{C}^T \mathbf{C} + \lambda \, \mathbf{B}^{(1)T} \mathbf{B}^{(1)} \otimes \mathbf{I}_{P_1 \times P_1} \right)^{-1}.$$
Basic tensor algebra shows
$$\mathbf{B}^{(1)T} \mathbf{B}^{(1)} = \mathbf{U}_2^T \mathbf{U}_2 \cdot \ldots \cdot \mathbf{U}_L^T \mathbf{U}_L \cdot \mathbf{V}_1^T \mathbf{V}_1 \cdot \ldots \cdot \mathbf{V}_M^T \mathbf{V}_M.$$
The posterior mean and variance for $\mathbf{U}_2, \ldots, \mathbf{U}_L$ are derived in an analogous way.

For the $\mathbf{V}_m$'s it suffices to consider $\mathbf{V}_M$, as the posterior derivations for $\mathbf{V}_1, \ldots, \mathbf{V}_{M-1}$ are analogous. The prior conditional for $\mathbf{V}_M$ is
$$\mathrm{pr}(\mathrm{vec}(\mathbf{V}_M) \mid \mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_{M-1}) = \mathrm{N}\left( \mathbf{0}, \left( \mathbf{B}^{(L+M)T} \mathbf{B}^{(L+M)} \right)^{-1} \otimes \frac{\sigma^2}{\lambda} \mathbf{I}_{Q_M \times Q_M} \right),$$
where $\mathbf{B}^{(L+M)}$ is defined analogously to $\mathbf{B}^{(1)}$. The general model (2) implies
$$\mathbf{D} \mathbf{V}_M^T + \mathbf{E} = \mathbf{Y}_M^T,$$
where $\mathbf{D}$ and $\mathbf{Y}_M$ are defined as in (12), and $\mathbf{E}$ has independent N(0, $\sigma^2$) entries. Separate applications of the Bayesian linear model for each row of $\mathbf{V}_M$ give
$$\mathrm{pr}(\mathrm{vec}(\mathbf{V}_M) \mid \mathbf{U}_1, \ldots, \mathbf{U}_L, \mathbf{V}_1, \ldots, \mathbf{V}_{M-1}, \mathbb{Y}, \mathbb{X}, \sigma^2) = \mathrm{N}(\boldsymbol{\mu}_{L+M}, \boldsymbol{\Sigma}_{L+M}),$$
where
$$\boldsymbol{\mu}_{L+M} = \mathrm{vec}\left( \left( \mathbf{D}^T \mathbf{D} + \lambda \, \mathbf{B}^{(L+M)T} \mathbf{B}^{(L+M)} \right)^{-1} \mathbf{D}^T \mathbf{Y}_M^T \right)$$
and
$$\boldsymbol{\Sigma}_{L+M} = \sigma^2 \left( \mathbf{D}^T \mathbf{D} + \lambda \, \mathbf{B}^{(L+M)T} \mathbf{B}^{(L+M)} \right)^{-1} \otimes \mathbf{I}_{Q_M \times Q_M}.$$
Basic tensor algebra shows
$$\mathbf{B}^{(L+M)T} \mathbf{B}^{(L+M)} = \mathbf{U}_1^T \mathbf{U}_1 \cdot \ldots \cdot \mathbf{U}_L^T \mathbf{U}_L \cdot \mathbf{V}_1^T \mathbf{V}_1 \cdot \ldots \cdot \mathbf{V}_{M-1}^T \mathbf{V}_{M-1}.$$
Here we prove the equivalence of separable $L_2$ penalization and nuclear norm penalization stated in Proposition 1. The result is shown for predicting a vector from a three-way array, in which $\mathbb{B} = \mathbf{U}_1 \mathbf{U}_2^T$. Analogous results exist for predicting a matrix from a matrix ($\mathbb{B} = \mathbf{U}_1 \mathbf{V}_1^T$) and predicting a three-way array from a vector ($\mathbb{B} = \mathbf{V}_1 \mathbf{V}_2^T$).

In the solution to
$$\underset{\mathrm{rank}(\mathbb{B}) \le R}{\arg\min} \; ||\mathbf{y} - \langle \mathbb{X}, \mathbb{B} \rangle_2||_F^2 + \lambda \sum_{l=1}^{2} ||\mathbf{U}_l||_F^2, \qquad (22)$$
the columns of $\mathbf{U}_1$ and $\mathbf{U}_2$, $\{\mathbf{u}_{11}, \ldots, \mathbf{u}_{1R}\}$ and $\{\mathbf{u}_{21}, \ldots, \mathbf{u}_{2R}\}$, must satisfy
$$||\mathbf{u}_{1r}||^2 = ||\mathbf{u}_{2r}||^2 = ||\mathbf{u}_{1r} \mathbf{u}_{2r}^T||_F \quad \text{for } r = 1, \ldots, R. \qquad (23)$$
Here (23) follows from the general result that for $c > 0$,
$$\underset{(a, b) \, : \, ab = c}{\arg\min} \; a^2 + b^2 = (\sqrt{c}, \sqrt{c}),$$
with $a = ||\mathbf{u}_{1r}||$, $b = ||\mathbf{u}_{2r}||$, and $c = ||\mathbf{u}_{1r} \mathbf{u}_{2r}^T||_F$. Thus,
$$\sum_{l=1}^{2} ||\mathbf{U}_l||_F^2 = \sum_{l=1}^{2} \sum_{r=1}^{R} ||\mathbf{u}_{lr}||^2 = 2 \sum_{r=1}^{R} ||\mathbf{u}_{1r} \mathbf{u}_{2r}^T||_F. \qquad (24)$$
Under orthogonality of the columns of $\mathbf{U}_1$ and $\mathbf{U}_2$, the non-zero singular values of $\mathbb{B}$ are $\{||\mathbf{u}_{1r} \mathbf{u}_{2r}^T||_F\}_{r=1}^{R}$, and therefore (24) is equal to $2||\mathbb{B}||_*$. It follows that (22) is equivalent to
$$\underset{\mathrm{rank}(\mathbb{B}) \le R}{\arg\min} \; ||\mathbf{y} - \langle \mathbb{X}, \mathbb{B} \rangle_2||_F^2 + 2\lambda ||\mathbb{B}||_*.$$
Here we describe the results of a simulation study analogous to that in Section 9, but with correlation in the predictors $\mathbb{X}$ or in the response $\mathbb{Y}$. We simulate Gaussian data with an exponential spatial correlation structure using the R package fields (Douglas Nychka et al., 2015). The entries of $\mathbb{E}$ are assumed to be on a $Q_1 \times Q_2$ grid ($Q_1 = 5$, $Q_2 = 10$) with adjacent entries having distance 1. The entries of $\mathbb{X}$ are assumed to be on a $P_1 \times P_2$ grid ($P_1 = 15$, $P_2 = 20$) with adjacent entries having distance 1. This spatial correlation structure is used in place of independent entries when generating $\mathbb{X}$ (step 1) or $\mathbb{E}$ (step 3).

The resulting RPE and credible interval coverage rates are shown in Table 4, which is analogous to Table 2 for the uncorrelated case. Interestingly, for penalized estimation and N = 30 the scenario with correlated $\mathbb{X}$ gives significantly better performance in terms of RPE than the scenario without correlation. This was unexpected, but may be because correlation in $\mathbb{X}$ discourages the algorithm from converging to a local minimum. For correlated $\mathbb{E}$ the results are often similar to the uncorrelated scenario but tend toward lower accuracy. In particular, the credible intervals tend to undercover more than for the uncorrelated scenario, or for the scenario with correlated $\mathbb{X}$. This is probably because correlation in $\mathbb{E}$ violates the assumed likelihood model for inference, while correlation in $\mathbb{X}$ does not.

Table 4: Mean RPE by regularization and coverage rate for correlated $\mathbb{X}$ or correlated $\mathbb{E}$, using correct assumed ranks. The coverage rates are for 95% credible intervals, and their mean length relative to the standard deviation of $\mathbb{Y}$ is shown.

References

Akdemir, D. and Gupta, A. K. (2011). Array variate random variables with multiway Kronecker delta covariance matrix structure.
J. Algebr. Stat., 2(1):98–113.

Bader, B. W. and Kolda, T. G. (2006). Algorithm 862: MATLAB tensor classes for fast algorithm prototyping. ACM Transactions on Mathematical Software (TOMS), 32(4):635–653.

Cook, R. D. and Zhang, X. (2015). Foundations for envelope models and methods. Journal of the American Statistical Association, 110(510):599–611.

De Martino, F., De Borst, A. W., Valente, G., Goebel, R., and Formisano, E. (2011). Predicting EEG single trial responses with simultaneous fMRI and relevance vector machine regression. Neuroimage, 56(2):826–836.

Douglas Nychka, Reinhard Furrer, John Paige, and Stephan Sain (2015). fields: Tools for spatial data. R package version 9.0.

Friedland, S. and Lim, L.-H. (2014). Nuclear norm of higher-order tensors. arXiv preprint arXiv:1410.6072.

Guo, W., Kotsia, I., and Patras, I. (2012). Tensor learning for regression. IEEE Transactions on Image Processing, 21(2):816–827.

Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. UCLA Working Papers in Phonetics, 16:1–84.

Hassner, T., Harel, S., Paz, E., and Enbar, R. (2015). Effective face frontalization in unconstrained images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4295–4304.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

Hoff, P. D. (2011). Separable covariance arrays via the Tucker product, with applications to multivariate relational data. Bayesian Analysis, 6(2):179–196.

Hoff, P. D. (2015). Multilinear tensor regression for longitudinal relational data. The Annals of Applied Statistics, 9(3):1169–1193.

Izenman, A. J. (1975). Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248–264.

Kim, Y.-D. and Choi, S. (2007). Color face tensor factorization and slicing for illumination-robust recognition. In International Conference on Biometrics, pages 19–28. Springer.

Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Kumar, N., Berg, A. C., Belhumeur, P. N., and Nayar, S. K. (2009). Attribute and simile classifiers for face verification. In 2009 IEEE 12th International Conference on Computer Vision, pages 365–372. IEEE.

Learned-Miller, E., Huang, G. B., RoyChowdhury, A., Li, H., and Hua, G. (2016). Labeled faces in the wild: A survey. In Advances in Face Detection and Facial Image Analysis, pages 189–248. Springer.

Li, L. and Zhang, X. (2016). Parsimonious tensor response regression. Journal of the American Statistical Association, (just-accepted).

Li, X., Zhou, H., and Li, L. (2013). Tucker tensor regression and neuroimaging analysis. arXiv preprint arXiv:1304.5637.

Lindley, D. V. and Smith, A. F. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society. Series B (Methodological), 34(1):1–41.

Lock, E. F. and Li, G. (2016). Supervised multiway factorization. arXiv preprint arXiv:1609.03228.

Lock, E. F., Nobel, A. B., and Marron, J. S. (2011). Comment on Population Value Decomposition, a framework for the analysis of image populations. Journal of the American Statistical Association, 106(495):798–802.

Lyu, T., Lock, E. F., and Eberly, L. E. (2017). Discriminating sample groups with multi-way data. Biostatistics, kxw057.

Miranda, M., Zhu, H., and Ibrahim, J. G. (2015). TPRM: Tensor partition regression models with applications in imaging biomarker detection. arXiv preprint arXiv:1505.05482.

Mukherjee, A. and Zhu, J. (2011). Reduced rank ridge regression and its kernel extensions. Statistical Analysis and Data Mining, 4(6):612–622.

Ramasamy, A., Trabzuni, D., Guelfi, S., Varghese, V., Smith, C., Walker, R., De, T., Coin, L., de Silva, R., Cookson, M. R., et al. (2014). Genetic variability in the regulation of gene expression in ten regions of the human brain. Nature Neuroscience, 17(10):1418–1428.

Raskutti, G. and Yuan, M. (2015). Convex regularization for high-dimensional tensor regression. arXiv preprint arXiv:1512.01215.

Sidiropoulos, N. D. and Bro, R. (2000). On the uniqueness of multilinear decomposition of N-way arrays. Journal of Chemometrics, 14(3):229–239.

Sirovich, L. and Kirby, M. (1987). Low-dimensional procedure for the characterization of human faces. Journal of the Optical Society of America A, 4(3):519–524.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Linde, A. (2014). The deviance information criterion: 12 years on. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(3):485–493.

Sun, W. W. and Li, L. (2016). Sparse low-rank tensor response regression. arXiv preprint arXiv:1609.04523.

Sundberg, R. (1993). Continuum regression and ridge regression. Journal of the Royal Statistical Society. Series B (Methodological), pages 653–659.

Tao, D., Li, X., Wu, X., Hu, W., and Maybank, S. J. (2007). Supervised tensor learning. Knowledge and Information Systems, 13(1):1–42.

Tucker, L. R. (1966). Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311.

Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86.

Van der Vaart, A. W. (2000). Asymptotic Statistics, volume 3. Cambridge University Press.

Vasilescu, M. A. O. and Terzopoulos, D. (2002). Multilinear analysis of image ensembles: Tensorfaces. In European Conference on Computer Vision, pages 447–460. Springer.

Wimalawarne, K., Tomioka, R., and Sugiyama, M. (2016). Theoretical and experimental analyses of tensor-based regression and classification. Neural Computation, 28(4):686–715.

Zhang, X., Li, L., Zhou, H., Shen, D., et al. (2014). Tensor generalized estimating equations for longitudinal imaging analysis. arXiv preprint arXiv:1412.6592.

Zhou, H., Li, L., and Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502):540–552.