Low-Rank Tensor Networks for Dimensionality Reduction and Large-Scale Optimization Problems: Perspectives and Challenges PART 1
A. Cichocki, N. Lee, I.V. Oseledets, A.-H. Phan, Q. Zhao, D. Mandic
Andrzej CICHOCKI, RIKEN Brain Science Institute (BSI), Japan, and SKOLTECH ([email protected])
N. LEE, RIKEN BSI ([email protected])
I.V. OSELEDETS, Skolkovo Institute of Science and Technology (SKOLTECH), and Institute of Numerical Mathematics of the Russian Academy of Sciences ([email protected])
A.-H. PHAN, RIKEN BSI ([email protected])
Q. ZHAO, RIKEN BSI ([email protected])
D.P. MANDIC, Imperial College ([email protected])

Copyright A. Cichocki et al.
Please make reference to: A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao and D.P. Mandic (2016), "Tensor Networks for Dimensionality Reduction and Large-scale Optimization: Part 1 Low-Rank Tensor Decompositions", Foundations and Trends in Machine Learning: Vol. 9: No. 4-5, pp. 249-429.

Abstract
Machine learning and data mining algorithms are becoming increasingly important in the analysis of large-volume, multi-relational and multi-modal datasets, which are often conveniently represented as multiway arrays or tensors. It is therefore timely and valuable for the multidisciplinary research community to review tensor decompositions and tensor networks as emerging tools for large-scale data analysis and data mining. We provide the mathematical and graphical representations and interpretation of tensor networks, with the main focus on the Tucker and Tensor Train (TT) decompositions and their extensions or generalizations.

To make the material self-contained, we also address the concept of tensorization, which allows for the creation of very high-order tensors from lower-order structured datasets represented by vectors or matrices. Then, in order to combat the curse of dimensionality and possibly obtain linear or even sub-linear complexity of storage and computation, we address super-compression of tensor data through low-rank tensor networks. Finally, we demonstrate how such approximations can be used to solve a wide class of huge-scale linear/multilinear dimensionality reduction and related optimization problems that are far from being tractable when using classical numerical methods.

The challenge for huge-scale optimization problems is therefore to develop methods which scale linearly or sub-linearly (i.e., with logarithmic complexity) with the size of datasets, in order to benefit from the well-understood optimization frameworks for smaller-size problems. However, most efficient optimization algorithms are convex and do not scale well with data volume, while linearly scalable algorithms typically apply only to very specific scenarios. In this review, we address this problem through the concepts of low-rank tensor network approximations, distributed tensor networks, and the associated learning algorithms.
We then elucidate how these concepts can be used to convert otherwise intractable huge-scale optimization problems into a set of much smaller linked and/or distributed sub-problems of affordable size and complexity. In doing so, we highlight the ability of tensor networks to account for the couplings between the multiple variables, and for multimodal, incomplete and noisy data.

The methods and approaches discussed in this work can be considered both as an alternative and a complement to emerging methods for huge-scale optimization, such as the random coordinate descent (RCD) scheme, subgradient methods, the alternating direction method of multipliers (ADMM), and proximal gradient descent methods. This is PART 1, which consists of Sections 1-4.

Keywords: Tensor networks, function-related tensors, CP decomposition, Tucker models, tensor train (TT) decompositions, matrix product states (MPS), matrix product operators (MPO), basic tensor operations, multiway component analysis, multilinear blind source separation, tensor completion, linear/multilinear dimensionality reduction, large-scale optimization problems, symmetric eigenvalue decomposition (EVD), PCA/SVD, huge systems of linear equations, pseudo-inverse of very large matrices, Lasso and Canonical Correlation Analysis (CCA).

Chapter 1
Introduction and Motivation
This monograph aims to present a coherent account of ideas and methodologies related to tensor decompositions (TDs) and tensor network (TN) models. Tensor decompositions principally decompose data tensors into factor matrices, while tensor networks decompose higher-order tensors into sparsely interconnected small-scale low-order core tensors. These low-order core tensors are called "components", "blocks", "factors" or simply "cores". In this way, large-scale data can be approximately represented in highly compressed and distributed formats.

In this monograph, the TDs and TNs are treated in a unified way, by considering TDs as simple tensor networks or sub-networks; the terms "tensor decompositions" and "tensor networks" will therefore be used interchangeably. Tensor networks can be thought of as special graph structures which break down high-order tensors into a set of sparsely interconnected low-order core tensors, thus allowing for both enhanced interpretation and computational advantages. Such an approach is valuable in many application contexts which require the computation of eigenvalues and the corresponding eigenvectors of extremely high-dimensional linear or nonlinear operators. These operators typically describe the coupling between many degrees of freedom within real-world physical systems; such degrees of freedom are often only weakly coupled. Indeed, quantum physics provides evidence that couplings between multiple data channels usually do not exist among all the degrees of freedom but mostly locally, whereby "relevant" information, of relatively low dimensionality, is embedded into very large-dimensional measurements [148, 156, 183, 214].

Tensor networks offer a theoretical and computational framework for the analysis of computationally prohibitive large volumes of data, by "dissecting" such data into the "relevant" and "irrelevant" information, both of lower dimensionality.
In this way, tensor network representations often allow for super-compression of datasets as large as $10^{50}$ entries, down to the affordable levels of $10^{7}$ or even fewer entries [22, 68, 69, 110, 112, 120, 133, 161, 215].

With the emergence of the big data paradigm, it is therefore both timely and important to provide the multidisciplinary machine learning and data analytic communities with a comprehensive overview of tensor networks, together with example-rich guidance on their application to several generic optimization problems for huge-scale structured data. Our aim is also to unify the terminology, notation, and algorithms for tensor decompositions and tensor networks, which are being developed not only in machine learning, signal processing, numerical analysis and scientific computing, but also in quantum physics/chemistry for the representation of, e.g., quantum many-body systems.

The volume and structural complexity of modern datasets are becoming exceedingly high, to the extent which renders standard analysis methods and algorithms inadequate. Apart from the huge Volume, the other features which characterize big data include Veracity, Variety and Velocity (see Figures 1.1(a) and (b)). Each of the "V features" represents a research challenge in its own right. For example, high Volume implies the need for algorithms that are scalable; high Velocity requires the processing of big data streams in near real-time; high Veracity calls for robust and predictive algorithms for noisy, incomplete and/or inconsistent data; high Variety demands the fusion of different data types, e.g., continuous, discrete, binary, time series, images, video, text, probabilistic or multi-view. Some applications give rise to additional "V challenges", such as Visualization, Variability and Value.
The Value feature is particularly interesting and refers to the extraction of high-quality and consistent information, from which meaningful and interpretable results can be obtained.

Owing to increasingly affordable recording devices, extreme-scale volumes and variety of data are becoming ubiquitous across the science and engineering disciplines. In the case of multimedia (speech, video), remote sensing and medical/biological data, the analysis also requires a paradigm shift in order to efficiently process massive datasets.
[Figure 1.1 appears here. Panel (a) depicts the 4V big data challenges: Volume (megabytes to petabytes), Velocity (batch, micro-batch, near real-time, streams), Veracity (missing data, anomalies, outliers, noise, inconsistency) and Variety (time series, images, binary, multiview and probabilistic data). Panel (b) maps these challenges onto tensor models (CPD/PARAFAC, NTF, Tucker, NTD, Hierarchical Tucker, Tensor Train, MPS/MPO, PEPS, MERA), optimization criteria and constraints (sparseness, smoothness, non-negativity, statistical independence/correlation), and application tasks (matrix/tensor completion, inpainting, imputation, anomaly detection, feature extraction, classification, clustering, correlation, regression, prediction, forecasting).]
Figure 1.1: A framework for extremely large-scale data analysis. (a) The 4V challenges for big data. (b) A unified framework for the 4V challenges and the potential applications based on tensor decomposition approaches.
Tensors are multi-dimensional generalizations of matrices. A matrix (2nd-order tensor) has two modes, rows and columns, while an $N$th-order tensor has $N$ modes (see Figures 1.2-1.7); for example, a 3rd-order tensor (with three modes) looks like a cube (see Figure 1.2). Subtensors are formed when a subset of tensor indices is fixed. Of particular interest are fibers, which are vectors obtained by fixing every tensor index but one, and matrix slices, which are two-dimensional sections (matrices) of a tensor, obtained by fixing all the tensor indices but two. It should be noted that block matrices can also be represented by tensors, as illustrated in Figure 1.3 for 4th-order tensors.

Figure 1.2: A 3rd-order tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I \times J \times K}$, with entries $x_{i,j,k} = \underline{\mathbf{X}}(i,j,k)$, and its subtensors: slices (middle) and fibers (bottom). All fibers are treated as column vectors.

We adopt the notation whereby tensors (for $N \ge 3$) are denoted by bold underlined capital letters, e.g., $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. For simplicity, we assume that all tensors are real-valued, but it is, of course, possible to define tensors as complex-valued or over arbitrary fields. Matrices are denoted by boldface capital letters, e.g., $\mathbf{X} \in \mathbb{R}^{I \times J}$, and vectors (1st-order tensors) by boldface lowercase letters, e.g., $\mathbf{x} \in \mathbb{R}^{J}$. For example, the columns of the matrix $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_R] \in \mathbb{R}^{I \times R}$ are the vectors denoted by $\mathbf{a}_r \in \mathbb{R}^{I}$, while the elements of a matrix (scalars) are denoted by lowercase letters, e.g., $a_{ir} = \mathbf{A}(i, r)$ (see Table 1.1).

A specific entry of an $N$th-order tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is denoted by $x_{i_1, i_2, \ldots, i_N} = \underline{\mathbf{X}}(i_1, i_2, \ldots, i_N) \in \mathbb{R}$. The order of a tensor is the number of its "modes", "ways" or "dimensions", which can include space, time, frequency, trials, classes, and dictionaries. The term "size" stands for the number of values that an index can take in a particular mode. For example, the tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is of order $N$ and size $I_n$ in mode $n$ ($n = 1, 2, \ldots, N$). Lower-case letters, e.g., $i$, $j$, are used for the running indices, and capital letters, e.g., $I$, $J$, denote the upper bound of an index, i.e., $i = 1, 2, \ldots, I$ and $j = 1, 2, \ldots, J$. For a positive integer $n$, the shorthand notation $\langle n \rangle$ denotes the set of indices $\{1, 2, \ldots, n\}$.

Notations and terminology used for tensors and tensor networks differ across the scientific communities (see Table 1.2); to this end we employ a unifying notation particularly suitable for machine learning and signal processing research, which is summarized in Table 1.1.

Table 1.1: Basic matrix/tensor notation and symbols.

$\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ : $N$th-order tensor of size $I_1 \times I_2 \times \cdots \times I_N$
$x_{i_1, i_2, \ldots, i_N} = \underline{\mathbf{X}}(i_1, i_2, \ldots, i_N)$ : $(i_1, i_2, \ldots, i_N)$th entry of $\underline{\mathbf{X}}$
$x$, $\mathbf{x}$, $\mathbf{X}$ : scalar, vector and matrix
$\underline{\mathbf{G}}$, $\underline{\mathbf{S}}$, $\underline{\mathbf{G}}^{(n)}$, $\underline{\mathbf{X}}^{(n)}$ : core tensors
$\underline{\boldsymbol{\Lambda}} \in \mathbb{R}^{R \times R \times \cdots \times R}$ : $N$th-order diagonal core tensor with nonzero entries $\lambda_r$ on the main diagonal
$\mathbf{A}^{\mathrm{T}}$, $\mathbf{A}^{-1}$, $\mathbf{A}^{\dagger}$ : transpose, inverse and Moore-Penrose pseudo-inverse of a matrix $\mathbf{A}$
$\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_R] \in \mathbb{R}^{I \times R}$ : matrix with $R$ column vectors $\mathbf{a}_r \in \mathbb{R}^{I}$, with entries $a_{ir}$
$\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$, $\mathbf{A}^{(n)}$, $\mathbf{B}^{(n)}$, $\mathbf{U}^{(n)}$ : component (factor) matrices
$\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$ : mode-$n$ matricization of $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$
$\mathbf{X}_{\langle n \rangle} \in \mathbb{R}^{I_1 I_2 \cdots I_n \times I_{n+1} \cdots I_N}$ : mode-$(1, \ldots, n)$ matricization of $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$
$\underline{\mathbf{X}}(:, i_2, i_3, \ldots, i_N) \in \mathbb{R}^{I_1}$ : mode-1 fiber of a tensor $\underline{\mathbf{X}}$ obtained by fixing all indices but one (a vector)
$\underline{\mathbf{X}}(:, :, i_3, \ldots, i_N) \in \mathbb{R}^{I_1 \times I_2}$ : slice (matrix) of a tensor $\underline{\mathbf{X}}$ obtained by fixing all indices but two
$\underline{\mathbf{X}}(:, :, :, i_4, \ldots, i_N)$ : subtensor of $\underline{\mathbf{X}}$, obtained by fixing several indices
$R$, $(R_1, \ldots, R_N)$ : tensor rank $R$ and multilinear rank
$\circ$, $\odot$, $\otimes$, $\otimes_L$, $|\otimes|$ : outer, Khatri-Rao, Kronecker, Left Kronecker, and strong Kronecker products
$\mathbf{x} = \mathrm{vec}(\mathbf{X})$ : vectorization of $\mathbf{X}$
$\mathrm{tr}(\bullet)$ : trace of a square matrix
$\mathrm{diag}(\bullet)$ : diagonal matrix

Table 1.2: Terminology used for tensor networks across the machine learning / scientific computing and quantum physics / chemistry communities.

Machine Learning : Quantum Physics
$N$th-order tensor : rank-$N$ tensor
high/low-order tensor : tensor of high/low dimension
ranks of TNs : bond dimensions of TNs
unfolding, matricization : grouping of indices
tensorization : splitting of indices
core : site
variables : open (physical) indices
ALS algorithm : one-site DMRG or DMRG1
MALS algorithm : two-site DMRG or DMRG2
column vector $\mathbf{x} \in \mathbb{R}^{I \times 1}$ : ket $|\Psi\rangle$
row vector $\mathbf{x}^{\mathrm{T}} \in \mathbb{R}^{1 \times I}$ : bra $\langle\Psi|$
inner product $\langle \mathbf{x}, \mathbf{x} \rangle = \mathbf{x}^{\mathrm{T}}\mathbf{x}$ : $\langle\Psi|\Psi\rangle$
Tensor Train (TT) : Matrix Product State (MPS) with Open Boundary Conditions (OBC)
Tensor Chain (TC) : MPS with Periodic Boundary Conditions (PBC)
Matrix TT : Matrix Product Operator (MPO) (with OBC)
Hierarchical Tucker (HT) : Tree Tensor Network State (TTNS) with rank-3 tensors

Figure 1.3: A block matrix and its representation as a 4th-order tensor, created by reshaping (or a projection) of blocks in the rows into lateral slices of 3rd-order tensors.
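The indexing and matricization conventions of Table 1.1 can be illustrated with a short NumPy sketch. This is an illustrative aside, not part of the original text; in particular, the `unfold` helper and its row-major reshape ordering are our own choices, since matricization conventions differ across software tools.

```python
import numpy as np

# A 3rd-order tensor X of size I x J x K = 2 x 3 x 4
X = np.arange(24).reshape(2, 3, 4)

# Mode-1 fiber X(:, j, k): fix all indices but the first
fiber = X[:, 1, 2]           # vector of length I = 2

# Frontal slice X(:, :, k): fix all indices but two
slice_k = X[:, :, 0]         # matrix of size I x J = 2 x 3

# Mode-n matricization X_(n): mode n becomes the rows,
# all remaining modes are flattened into the columns.
def unfold(T, n):
    """Move mode n to the front, then flatten the remaining modes."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

X1 = unfold(X, 0)            # size I x (J*K) = 2 x 12
X2 = unfold(X, 1)            # size J x (I*K) = 3 x 8
print(fiber.shape, slice_k.shape, X1.shape, X2.shape)
```

Note that `unfold(X, 0)` coincides with a plain row-major reshape, while other modes first require a permutation of the axes.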
Figure 1.4: Graphical representation of multiway array (tensor) data of increasing structural complexity and "Volume" (see [155] for more detail).

Even with the above notation conventions, a precise description of tensors and tensor operations is often tedious and cumbersome, given the multitude of indices involved. To this end, in this monograph, we grossly simplify the description of tensors and their mathematical operations through diagrammatic representations borrowed from physics and quantum chemistry (see [156] and references therein). In this way, tensors are represented graphically by nodes of any geometrical shape (e.g., circles, squares, dots), while each outgoing line ("edge", "leg", "arm") from a node represents the indices of a specific mode (see Figure 1.5(a)).

Figure 1.5: Graphical representation of tensor manipulations. (a) Basic building blocks for tensor network diagrams. (b) Tensor network diagrams for matrix-vector multiplication, $\mathbf{b} = \mathbf{A}\mathbf{x}$ (top), matrix-by-matrix multiplication, $\mathbf{C} = \mathbf{A}\mathbf{B}$ (middle), and contraction of two tensors, $\sum_{k=1}^{K} a_{i,j,k}\, b_{k,l,m,p} = c_{i,j,l,m,p}$ (bottom). The order of reading of indices is anti-clockwise, from the left position.

In our adopted notation, each scalar (zero-order tensor), vector (first-order tensor), matrix (2nd-order tensor), 3rd-order tensor or higher-order tensor is represented by a circle (or rectangle), while the order of a tensor is determined by the number of lines (edges) connected to it. According to this notation, an $N$th-order tensor $\underline{\mathbf{X}} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ is represented by a circle (or any shape) with $N$ branches, each of size $I_n$, $n = 1, 2, \ldots, N$ (see Section 2).
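The three contractions depicted in Figure 1.5(b) correspond directly to index summations, which can be sketched with NumPy's `einsum` (an illustrative aside, not part of the original text; the mode sizes are arbitrary choices):

```python
import numpy as np

I, J, K, L, M, P = 2, 3, 4, 2, 3, 2
A = np.random.rand(I, J)
x = np.random.rand(J)
B = np.random.rand(J, K)
A3 = np.random.rand(I, J, K)      # 3rd-order tensor a_{i,j,k}
B4 = np.random.rand(K, L, M, P)   # 4th-order tensor b_{k,l,m,p}

# Matrix-vector multiplication: b_i = sum_j a_{i,j} x_j
b = np.einsum('ij,j->i', A, x)

# Matrix-by-matrix multiplication: c_{i,k} = sum_j a_{i,j} b_{j,k}
C = np.einsum('ij,jk->ik', A, B)

# Contraction of two tensors over the common index k:
# c_{i,j,l,m,p} = sum_k a_{i,j,k} b_{k,l,m,p}
C5 = np.einsum('ijk,klmp->ijlmp', A3, B4)
print(b.shape, C.shape, C5.shape)   # (2,) (2, 4) (2, 3, 2, 3, 2)
```

Each summed-over subscript corresponds to an edge connecting two nodes in the diagram, while the free subscripts correspond to the dangling edges of the result.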
1, 2, . . . , N (seeSection 2). An interconnection between two circles designates a contraction11th-order tensor . . . = ...... ... ... ... = = = Figure 1.6: Graphical representations and symbols for higher-order blocktensors. Each block represents either a 3rd-order tensor or a 2nd-ordertensor. The outer circle indicates a global structure of the block tensor (e.g.a vector, a matrix, a 3rd-order block tensor), while the inner circle reflectsthe structure of each element within the block tensor. For example, in thetop diagram a vector of 3rd order tensors is represented by an outer circlewith one edge (a vector) which surrounds an inner circle with three edges (a3rd order tensor), so that the whole structure designates a 4th-order tensor.of tensors, which is a summation of products over a common index (seeFigure 1.5(b) and Section 2).Block tensors, where each entry (e.g., of a matrix or a vector) is anindividual subtensor, can be represented in a similar graphical form, asillustrated in Figure 1.6. Hierarchical (multilevel block) matrices are alsonaturally represented by tensors and vice versa, as illustrated in Figure 1.7for 4th-, 5th- and 6th-order tensors. All mathematical operations on tensorscan be therefore equally performed on block matrices.In this monograph, we make extensive use of tensor networkdiagrams as an intuitive and visual way to efficiently represent tensordecompositions. Such graphical notations are of great help in studying andimplementing sophisticated tensor operations. We highlight the significant12a)
X X R I R
I I I R R ( ) (cid:2) I I (cid:3) I I R R = (b) ... Vector (each entry is a block matrix)Block matrix Matrix = (c) Matrix (cid:2) = Figure 1.7: Hierarchical matrix structures and their symbolic representationas tensors. (a) A 4th-order tensor representation for a block matrix X P R R I ˆ R I (a matrix of matrices), which comprises block matrices X r , r P R I ˆ I . (b) A 5th-order tensor. (c) A 6th-order tensor.advantages of such diagrammatic notations in the description of tensormanipulations, and show that most tensor operations can be visualizedthrough changes in the architecture of a tensor network diagram.13 .3 Curse of Dimensionality and GeneralizedSeparation of Variables for Multivariate Functions The term curse of dimensionality was coined by [18] to indicate that thenumber of samples needed to estimate an arbitrary function with a givenlevel of accuracy grows exponentially with the number of variables, thatis, with the dimensionality of the function. In a general context ofmachine learning and the underlying optimization problems, the “curseof dimensionality” may also refer to an exponentially increasing numberof parameters required to describe the data/system or an extremely largenumber of degrees of freedom. The term “curse of dimensionality”, inthe context of tensors, refers to the phenomenon whereby the numberof elements, I N , of an N th-order tensor of size ( I ˆ I ˆ ¨ ¨ ¨ ˆ I ) growsexponentially with the tensor order, N . 
Tensor volume can therefore easily become prohibitively big for multiway arrays for which the number of dimensions ("ways" or "modes") is very high, thus requiring enormous computational and memory resources to process such data. The understanding and handling of the inherent dependencies among the excessive degrees of freedom create both difficult-to-solve problems and fascinating new opportunities, but this comes at the price of reduced accuracy, owing to the necessity to involve various approximations.

We show that the curse of dimensionality can be alleviated or even fully dealt with through tensor network representations; these naturally cater for the excessive volume, veracity and variety of data (see Figure 1.1) and are supported by efficient tensor decomposition algorithms which involve relatively simple mathematical operations. Another desirable aspect of tensor networks is their relatively small-scale and low-order core tensors, which act as the "building blocks" of tensor networks. These core tensors are relatively easy to handle and visualize, and enable super-compression of raw, incomplete, and noisy huge-scale datasets. This also suggests a solution to the more general quest for new technologies for processing exceedingly large datasets within affordable computation times.

To address the curse of dimensionality, this work mostly focuses on approximative low-rank representations of tensors, the so-called low-rank tensor approximations (LRTA) or low-rank tensor network decompositions.

1.4 Separation of Variables and Tensor Formats

A tensor is said to be in a full format when it is represented as an original (raw) multidimensional array [118]; however, distributed storage and processing of high-order tensors in their full format is infeasible due to the curse of dimensionality.
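To make the storage blow-up concrete, the following sketch (an illustrative aside; the mode size $I$ and rank $R$ are our arbitrary choices) compares the $I^N$ entries of a full tensor with the $NIR$ parameters of a rank-$R$ CP representation discussed later in this section:

```python
# Storage counts behind the "curse of dimensionality": a full Nth-order
# tensor with I points per mode has I**N entries, while a rank-R CP
# representation needs only N*I*R parameters.
I, R = 10, 5
for N in (3, 10, 20, 50):
    full = I**N          # entries of the raw tensor (exponential in N)
    cp = N * I * R       # parameters of a rank-R CP approximation (linear in N)
    print(f"N={N:2d}  full=10^{N}  CP={cp}")
```

Already for $N = 20$ the full format exceeds any conceivable memory, whereas the CP parameter count remains in the thousands.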
The sparse format is a variant of the full tensor format which stores only the nonzero entries of a tensor, and is used extensively in software tools such as the Tensor Toolbox [8] and in the sparse grid approach [25, 80, 91].

As already mentioned, the problem of huge dimensionality can be alleviated through various distributed and compressed tensor network formats, achieved by low-rank tensor network approximations. The underpinning idea is that by employing tensor network formats, both computational costs and storage requirements may be dramatically reduced through distributed storage and computing resources. It is important to note that, except for very special data structures, a tensor cannot be compressed without incurring some compression error, since a low-rank tensor representation is only an approximation of the original tensor.

The concept of compression of multidimensional large-scale data by tensor network decompositions can be intuitively explained as follows. Consider the approximation of an $N$-variate function $f(\mathbf{x}) = f(x_1, x_2, \ldots, x_N)$ by a finite sum of products of individual functions, each depending on only one or a very few variables [16, 34, 67, 206]. In the simplest scenario, the function $f(\mathbf{x})$ can be (approximately) represented in the following separable form
$$f(x_1, x_2, \ldots, x_N) \cong f^{(1)}(x_1)\, f^{(2)}(x_2) \cdots f^{(N)}(x_N). \qquad (1.1)$$
In practice, when an $N$-variate function $f(\mathbf{x})$ is discretized into an $N$th-order array, or a tensor, the approximation in (1.1) then corresponds to the representation by rank-1 tensors, also called elementary tensors (see Section 2). Observe that with $I_n$, $n = 1, 2, \ldots, N$, denoting the size of each mode and $I = \max_n \{I_n\}$, the memory requirement to store such a full tensor is $\prod_{n=1}^{N} I_n \le I^N$, which grows exponentially with $N$. On the other hand, the separable representation in (1.1) is completely defined by its factors, $f^{(n)}(x_n)$ ($n = 1, 2, \ldots, N$), and requires only $\sum_{n=1}^{N} I_n \ll I^N$ storage units. If $x_1, x_2, \ldots, x_N$ are statistically independent random variables, their joint probability density function is equal to the product of marginal probabilities, $f(\mathbf{x}) = f^{(1)}(x_1)\, f^{(2)}(x_2) \cdots f^{(N)}(x_N)$, in an exact analogy to outer products of elementary tensors. Unfortunately, the form of separability in (1.1) is rather rare in practice.

The concept of tensor networks rests upon generalized (full or partial) separability of the variables of a high-dimensional function. This can be achieved in different tensor formats, including:

• The Canonical Polyadic (CP) format (see Section 3.2), where
$$f(x_1, x_2, \ldots, x_N) \cong \sum_{r=1}^{R} f_r^{(1)}(x_1)\, f_r^{(2)}(x_2) \cdots f_r^{(N)}(x_N), \qquad (1.2)$$
in an exact analogy to (1.1). In a discretized form, the above CP format can be written as an $N$th-order tensor
$$\underline{\mathbf{F}} \cong \sum_{r=1}^{R} \mathbf{f}_r^{(1)} \circ \mathbf{f}_r^{(2)} \circ \cdots \circ \mathbf{f}_r^{(N)} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}, \qquad (1.3)$$
where $\mathbf{f}_r^{(n)} \in \mathbb{R}^{I_n}$ denotes a discretized version of the univariate function $f_r^{(n)}(x_n)$, the symbol $\circ$ denotes the outer product, and $R$ is the tensor rank.

• The Tucker format, given by
$$f(x_1, \ldots, x_N) \cong \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} g_{r_1, \ldots, r_N}\, f_{r_1}^{(1)}(x_1) \cdots f_{r_N}^{(N)}(x_N), \qquad (1.4)$$
and its distributed tensor network variants (see Section 3.3).

• The Tensor Train (TT) format (see Section 4.1), in the form
$$f(x_1, x_2, \ldots, x_N) \cong \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \cdots \sum_{r_{N-1}=1}^{R_{N-1}} f_{r_1}^{(1)}(x_1)\, f_{r_1 r_2}^{(2)}(x_2) \cdots f_{r_{N-2}\, r_{N-1}}^{(N-1)}(x_{N-1})\, f_{r_{N-1}}^{(N)}(x_N), \qquad (1.5)$$
with the equivalent compact matrix representation
$$f(x_1, x_2, \ldots, x_N) \cong \mathbf{F}^{(1)}(x_1)\, \mathbf{F}^{(2)}(x_2) \cdots \mathbf{F}^{(N)}(x_N), \qquad (1.6)$$
where $\mathbf{F}^{(n)}(x_n) \in \mathbb{R}^{R_{n-1} \times R_n}$, with $R_0 = R_N = 1$.

• The Hierarchical Tucker (HT) format (also known as the Hierarchical Tensor format), which can be expressed via a hierarchy of nested separations in the following way. Consider nested nonempty disjoint subsets $u$, $v$, and $t = u \cup v \subset \{1, 2, \ldots, N\}$. Then, for some $1 \le N_1 < N$, with $u = \{1, \ldots, N_1\}$ and $v = \{N_1 + 1, \ldots, N\}$, the HT format can be expressed as
$$f(x_1, \ldots, x_N) \cong \sum_{r_u=1}^{R_u} \sum_{r_v=1}^{R_v} g_{r_u, r_v}^{(12 \cdots N)}\, f_{r_u}^{(u)}(\mathbf{x}_u)\, f_{r_v}^{(v)}(\mathbf{x}_v),$$
$$f_{r_t}^{(t)}(\mathbf{x}_t) \cong \sum_{r_u=1}^{R_u} \sum_{r_v=1}^{R_v} g_{r_u, r_v, r_t}^{(t)}\, f_{r_u}^{(u)}(\mathbf{x}_u)\, f_{r_v}^{(v)}(\mathbf{x}_v),$$
where $\mathbf{x}_t = \{x_i : i \in t\}$. See Section 2.3 for more detail.
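The fully separable case (1.1) and its rank-1 discretization (1.3) can be checked numerically. The following sketch (an illustrative aside; the function and grid sizes are our own choices) discretizes a separable function and verifies that it equals an outer product of its one-dimensional factors:

```python
import numpy as np

# f(x1,x2,x3) = exp(x1+x2+x3) = exp(x1)*exp(x2)*exp(x3) is fully separable,
# so its discretization is an exact rank-1 (elementary) tensor, as in (1.1)/(1.3).
grids = [np.linspace(0.0, 1.0, n) for n in (4, 5, 6)]
factors = [np.exp(g) for g in grids]           # f^(n)(x_n) sampled on each mode

# Full tensor: F[i1,i2,i3] = f(x1[i1], x2[i2], x3[i3])
F = np.exp(grids[0][:, None, None] + grids[1][None, :, None]
           + grids[2][None, None, :])

# Outer product of the three factor vectors reproduces F exactly
F_rank1 = np.einsum('i,j,k->ijk', *factors)
print(np.max(np.abs(F - F_rank1)))             # ~ 0 (floating-point round-off)

# Storage: full prod(I_n) entries vs. separable sum(I_n)
print(F.size, sum(g.size for g in grids))      # 120 vs 15
```

For a non-separable function, the same construction yields only an approximation, and its quality is governed by the tensor (network) ranks discussed below.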
As an example, in the particular case of $N = 4$, the HT format can be expressed by
$$f(x_1, x_2, x_3, x_4) \cong \sum_{r_{12}=1}^{R_{12}} \sum_{r_{34}=1}^{R_{34}} g_{r_{12}, r_{34}}^{(1234)}\, f_{r_{12}}^{(12)}(x_1, x_2)\, f_{r_{34}}^{(34)}(x_3, x_4),$$
$$f_{r_{12}}^{(12)}(x_1, x_2) \cong \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} g_{r_1, r_2, r_{12}}^{(12)}\, f_{r_1}^{(1)}(x_1)\, f_{r_2}^{(2)}(x_2),$$
$$f_{r_{34}}^{(34)}(x_3, x_4) \cong \sum_{r_3=1}^{R_3} \sum_{r_4=1}^{R_4} g_{r_3, r_4, r_{34}}^{(34)}\, f_{r_3}^{(3)}(x_3)\, f_{r_4}^{(4)}(x_4).$$

The Tree Tensor Network States (TTNS) format, which is an extension of the HT format, can be obtained by generalizing the two subsets, $u$, $v$, into a larger number of disjoint subsets $u_1, \ldots, u_m$, $m \ge 2$. In other words, the TTNS can be obtained through more flexible separations of variables, via products of larger numbers of functions at each hierarchical level (see Section 2.3 for graphical illustrations and more detail).

All the above approximations adopt the form of "sum-of-products" of single-dimensional functions, a procedure which plays a key role in all tensor factorizations and decompositions. Indeed, in many applications based on multivariate functions, very good approximations are obtained with a surprisingly small number of factors; this number corresponds to the tensor rank, $R$, or the tensor network ranks, $\{R_1, R_2, \ldots, R_N\}$ (if the representations are exact and minimal). However, for some specific cases this approach may fail to obtain sufficiently good low-rank TN approximations. The concept of generalized separability has already been explored in numerical methods for high-dimensional density function equations [34, 133, 206] and within a variety of huge-scale optimization problems (see Part 2 of this monograph).

To illustrate how tensor decompositions address excessive volumes of data, if all computations are performed on the CP tensor format in (1.3), and not on the raw $N$th-order data tensor itself, then instead of the original, exponentially growing, data dimensionality of $I^N$, the number of parameters in the CP representation reduces to $NIR$, which scales linearly in the tensor order $N$ and size $I$ (see Table 4.4). For example, the discretization of a 5-variate function over 100 sample points on each axis would require the storage of $100^5 = 10{,}000{,}000{,}000$ sample points, while a rank-2 CP representation would require only $5 \times 2 \times 100 = 1000$ sample points. Note that the Tucker format in (1.4) alleviates this problem only partially, since the number of entries of its core tensor, $g_{r_1, r_2, \ldots, r_N}$, still scales exponentially in the tensor order $N$ (curse of dimensionality).

In contrast to CP decomposition algorithms, TT tensor network formats in (1.5) exhibit both very good numerical properties and the ability to control the error of approximation, so that a desired accuracy of approximation is obtained relatively easily. The main advantage of the TT format over the CP decomposition is the ability to provide stable quasi-optimal rank reduction, achieved through, for example, truncated singular value decompositions (tSVD) or adaptive cross-approximation [16, 116, 162]. This makes the TT format one of the most stable and simple approaches to separating latent variables in a sophisticated way, while the associated TT decomposition algorithms provide full control over low-rank TN approximations. In this monograph, we therefore make extensive use of the TT format for low-rank TN approximations and employ the TT-toolbox software for efficient implementations [160]. The TT format will also serve as a basic prototype for high-order tensor representations, while we also consider the Hierarchical Tucker (HT) and the Tree Tensor Network States (TTNS) formats (which have more general tree-like structures) whenever advantageous in applications.

Furthermore, we address in depth the concept of tensorization of structured vectors and matrices, in order to convert a wide class of huge-scale optimization problems into much smaller-scale interconnected optimization sub-problems which can be solved by existing optimization methods (see Part 2 of this monograph).

The tensor network optimization framework is therefore performed through the following two main steps:

• Tensorization of data vectors and matrices into a high-order tensor, followed by a distributed approximate representation of a cost function in a specific low-rank tensor network format.
• Execution of all computations and analysis in tensor network formats (i.e., using only core tensors) that scale linearly, or even sub-linearly (for quantized tensor networks), in the tensor order $N$. This yields both reduced computational complexity and distributed memory requirements.

(Note that although similar approaches have been known in quantum physics for a long time, their rigorous mathematical analysis is still a work in progress; see [156, 158] and references therein.)

In this monograph, we focus on two main challenges in huge-scale data analysis which are addressed by tensor networks: (i) an approximate representation of a specific cost (objective) function by a tensor network while maintaining the desired accuracy of approximation, and (ii) the extraction of physically meaningful latent variables from data in a sufficiently accurate and computationally affordable way. The benefits of multiway (tensor) analysis methods for large-scale datasets then include:

• Ability to perform all mathematical operations in tractable tensor network formats;

• Simultaneous and flexible distributed representations of both the structurally rich data and complex optimization tasks;

• Efficient compressed formats of large multidimensional data, achieved via tensorization and low-rank tensor decompositions into low-order factor matrices and/or core tensors;

• Ability to operate with noisy and missing data, by virtue of the numerical stability and robustness to noise of low-rank tensor/matrix approximation algorithms;

• A flexible framework which naturally incorporates various diversities and constraints, thus seamlessly extending the standard, flat-view Component Analysis (2-way CA) methods to multiway component analysis;

• Possibility to analyze linked (coupled) blocks of large-scale matrices and tensors in order to separate common/correlated from independent/uncorrelated components in the observed raw data;

• Graphical representations of tensor networks, which allow us to express mathematical operations on tensors
(e.g., tensor contractions and reshaping) in a simple and intuitive way, without the explicit use of complex mathematical expressions.

In that sense, this monograph both reviews current research in this area and complements optimisation methods, such as the Alternating Direction Method of Multipliers (ADMM) [23].

Tensor decompositions (TDs) have already been adopted in widely diverse disciplines, including psychometrics, chemometrics, biometrics, quantum physics/information, quantum chemistry, signal and image processing, machine learning, and brain science [42, 43, 79, 91, 119, 124, 190, 202]. This is largely due to their advantages in the analysis of data that exhibit not only large volume but also very high variety (see Figure 1.1), as is the case in bio- and neuroinformatics and in computational neuroscience, where various forms of data collection include sparse tabular structures and graphs or hyper-graphs.

Moreover, tensor networks have the ability to efficiently parameterize, through structured compact representations, very general high-dimensional spaces which arise in modern applications [19, 39, 50, 116, 121, 136, 229]. Tensor networks also naturally account for intrinsic multidimensional and distributed patterns present in data, and thus provide the opportunity to develop very sophisticated models for capturing multiple interactions and couplings in data; these are more physically insightful and interpretable than standard pair-wise interactions.
Review and tutorial papers [7, 42, 54, 87, 119, 137, 163, 189] and books [43, 91, 124, 190] dealing with TDs and TNs already exist; however, they typically focus on standard models, with no explicit links to very large-scale data processing topics or connections to a wide class of optimization problems. The aim of this monograph is therefore to extend beyond the standard Tucker and CP tensor decompositions, and to demonstrate the perspective of TNs in extremely large-scale data analytics, together with their role as a mathematical backbone in the discovery of hidden structures in prohibitively large-scale data. Indeed, we show that TN models provide a framework for the analysis of linked (coupled) blocks of tensors with millions and even billions of non-zero entries.

We also demonstrate that TNs provide natural extensions of 2-way (matrix) Component Analysis (2-way CA) methods to multi-way component analysis (MWCA), which deals with the extraction of desired components from multidimensional and multimodal data. This paradigm shift requires new models and associated algorithms capable of identifying core relations among the different tensor modes, while guaranteeing linear/sub-linear scaling with the size of datasets.

Furthermore, we review tensor decompositions and the associated algorithms for very large-scale linear/multilinear dimensionality reduction problems. The related optimization problems often involve structured matrices and vectors with over a billion entries (see [67, 81, 87] and references therein). In particular, we focus on the Symmetric Eigenvalue Decomposition (EVD/PCA) and Generalized Eigenvalue Decomposition (GEVD) [70, 120, 123], the SVD [127], solutions of overdetermined and underdetermined systems of linear algebraic equations [71, 159], the Moore–Penrose pseudo-inverse of structured matrices [129], and Lasso problems. (Usually, we assume that huge-scale problems operate on at least 10^9 parameters.)

Chapter 2
Tensor Operations and Tensor Network Diagrams
Tensor operations benefit from the power of multilinear algebra, which is structurally much richer than linear algebra; even some basic properties, such as the rank, have a more complex meaning. We next introduce the background on fundamental mathematical operations in multilinear algebra, a prerequisite for the understanding of higher-order tensor decompositions. A unified account of both the definitions and properties of tensor network operations is provided, including the outer, multilinear, Kronecker, and Khatri–Rao products. For clarity, graphical illustrations are provided, together with example-rich guidance for tensor network operations and their properties. To avoid any confusion that may arise given the numerous options for tensor reshaping, both mathematical operations and their properties are expressed directly in their native multilinear contexts, supported by graphical visualizations.
The following symbols are used for the most common tensor multiplications: ⊗ for the Kronecker product, ⊙ for the Khatri–Rao product, ⊛ for the Hadamard (componentwise) product, ∘ for the outer product, and ×_n for the mode-n product. Basic tensor operations are summarized in Table 2.1 and illustrated in Figures 2.1–2.13. We refer to [43, 119, 128] for more detail regarding the basic notations and tensor operations. For convenience, general operations, such as vec(·) or diag(·), are defined similarly to the MATLAB syntax.

Table 2.1: Basic tensor/matrix operations.

C = A ×_n B — Mode-n product of a tensor A ∈ R^{I_1×I_2×···×I_N} and a matrix B ∈ R^{J×I_n} yields a tensor C ∈ R^{I_1×···×I_{n-1}×J×I_{n+1}×···×I_N}, with entries c_{i_1,...,i_{n-1},j,i_{n+1},...,i_N} = Σ_{i_n=1}^{I_n} a_{i_1,...,i_n,...,i_N} b_{j,i_n}.

C = [[G; B^(1), ..., B^(N)]] — Multilinear (Tucker) product of a core tensor G and factor matrices B^(n), which gives C = G ×_1 B^(1) ×_2 B^(2) ··· ×_N B^(N).

C = A ×̄_n b — Mode-n product of a tensor A ∈ R^{I_1×···×I_N} and a vector b ∈ R^{I_n} yields a tensor C ∈ R^{I_1×···×I_{n-1}×I_{n+1}×···×I_N}, with entries c_{i_1,...,i_{n-1},i_{n+1},...,i_N} = Σ_{i_n=1}^{I_n} a_{i_1,...,i_{n-1},i_n,i_{n+1},...,i_N} b_{i_n}.

C = A ×_N^1 B = A ×^1 B — Mode-(N, 1) contracted product of tensors A ∈ R^{I_1×I_2×···×I_N} and B ∈ R^{J_1×J_2×···×J_M}, with I_N = J_1, yields a tensor C ∈ R^{I_1×···×I_{N-1}×J_2×···×J_M}, with entries c_{i_1,...,i_{N-1},j_2,...,j_M} = Σ_{i_N=1}^{I_N} a_{i_1,...,i_N} b_{i_N,j_2,...,j_M}.

C = A ∘ B — Outer product of tensors A ∈ R^{I_1×I_2×···×I_N} and B ∈ R^{J_1×J_2×···×J_M} yields an (N+M)th-order tensor C, with entries c_{i_1,...,i_N,j_1,...,j_M} = a_{i_1,...,i_N} b_{j_1,...,j_M}.

X = a ∘ b ∘ c ∈ R^{I×J×K} — Outer product of vectors a, b and c forms a rank-1 tensor X, with entries x_{ijk} = a_i b_j c_k.

C = A ⊗_L B — (Left) Kronecker product of tensors A ∈ R^{I_1×I_2×···×I_N} and B ∈ R^{J_1×J_2×···×J_N} yields a tensor C ∈ R^{I_1 J_1×···×I_N J_N}, with entries c_{i_1 j_1, ..., i_N j_N} = a_{i_1,...,i_N} b_{j_1,...,j_N}, where each i_n j_n is a multi-index.

C = A ⊙_L B — (Left) Khatri–Rao product of matrices A = [a_1, ..., a_J] ∈ R^{I×J} and B = [b_1, ..., b_J] ∈ R^{K×J} yields a matrix C ∈ R^{IK×J}, with columns c_j = a_j ⊗_L b_j ∈ R^{IK}.

Figure 2.1: Tensor reshaping operations: matricization, vectorization and tensorization. Matricization refers to converting a tensor into a matrix, vectorization to converting a tensor or a matrix into a vector, while tensorization refers to converting a vector, a matrix or a low-order tensor into a higher-order tensor.
Multi-indices: By a multi-index i_1 i_2 ··· i_N we refer to an index which takes all possible combinations of values of the indices i_1, i_2, ..., i_N, for i_n = 1, 2, ..., I_n, n = 1, 2, ..., N, in a specific order. Multi-indices can be defined using two different conventions [71]:

1. Little-endian convention (reverse lexicographic ordering):
i_1 i_2 ··· i_N = i_1 + (i_2 − 1) I_1 + (i_3 − 1) I_1 I_2 + ··· + (i_N − 1) I_1 ··· I_{N−1}.

2. Big-endian convention (colexicographic ordering):
i_1 i_2 ··· i_N = i_N + (i_{N−1} − 1) I_N + (i_{N−2} − 1) I_N I_{N−1} + ··· + (i_1 − 1) I_2 ··· I_N.

The little-endian convention is used, for example, in Fortran and MATLAB, while the big-endian convention is used in the C language. Given the complex and non-commutative nature of tensors, the basic definitions, such as matricization, vectorization and the Kronecker product, should be consistent with the chosen convention. In this monograph, unless otherwise stated, we will use the little-endian notation.
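To make the two orderings concrete, here is a small NumPy sketch (an illustration added in this revision, not part of the original text) that linearizes a 1-based multi-index under both conventions; the little-endian result coincides with Fortran-order (column-major) linear indexing, and the big-endian result with C-order (row-major) indexing.

```python
import numpy as np

def lin_little(idx, dims):
    # little-endian: i1 + (i2-1)I1 + (i3-1)I1I2 + ...  (all indices 1-based)
    lin, stride = 1, 1
    for i, I in zip(idx, dims):
        lin += (i - 1) * stride
        stride *= I
    return lin

def lin_big(idx, dims):
    # big-endian: iN + (i_{N-1}-1)I_N + (i_{N-2}-1)I_N I_{N-1} + ...
    lin, stride = 1, 1
    for i, I in zip(reversed(idx), reversed(dims)):
        lin += (i - 1) * stride
        stride *= I
    return lin

dims = (2, 3, 4)
# little-endian agrees with Fortran/MATLAB (column-major) linear indexing
assert lin_little((2, 3, 4), dims) == np.ravel_multi_index((1, 2, 3), dims, order='F') + 1
# big-endian agrees with C (row-major) linear indexing
assert lin_big((2, 3, 4), dims) == np.ravel_multi_index((1, 2, 3), dims, order='C') + 1
```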
Matricization. The matricization operator, also known as unfolding or flattening, reorders the elements of a tensor into a matrix (see Figure 2.2). Such a matrix is re-indexed according to the choice of multi-index described above, and the following two fundamental matricizations are used extensively.

The mode-n matricization. For a fixed index n ∈ {1, 2, ..., N}, the mode-n matricization of an Nth-order tensor X ∈ R^{I_1×···×I_N} is defined as the ("short" and "wide") matrix

X_(n) ∈ R^{I_n × I_1 I_2 ··· I_{n−1} I_{n+1} ··· I_N},   (2.1)

with I_n rows and I_1 I_2 ··· I_{n−1} I_{n+1} ··· I_N columns, the entries of which are

(X_(n))_{i_n, i_1···i_{n−1} i_{n+1}···i_N} = x_{i_1, i_2, ..., i_N}.

Note that the columns of the mode-n matricization X_(n) of a tensor X are the mode-n fibers of X.

The mode-{n} canonical matricization. For a fixed index n ∈ {1, 2, ..., N}, the mode-(1, 2, ..., n) matricization, or simply mode-n canonical matricization, of a tensor X ∈ R^{I_1×···×I_N} is defined as the matrix

X_<n> ∈ R^{I_1 I_2 ··· I_n × I_{n+1} ··· I_N},   (2.2)

with I_1 I_2 ··· I_n rows and I_{n+1} ··· I_N columns, and entries

(X_<n>)_{i_1 i_2···i_n, i_{n+1}···i_N} = x_{i_1, i_2, ..., i_N}.

The matricization operator in MATLAB notation (reverse lexicographic ordering) is given by

X_<n> = reshape(X, I_1 I_2 ··· I_n, I_{n+1} ··· I_N).   (2.3)

As special cases we immediately have (see Figure 2.2)

X_<1> = X_(1),   X_<N−1> = X_(N)^T,   X_<N> = vec(X).   (2.4)

Note that, using the colexicographic ordering, the vectorization of an outer product of two vectors, a and b, yields their Kronecker product, that is, vec(a ∘ b) = a ⊗ b, while using the reverse lexicographic ordering, for the same operation, we need to use the Left Kronecker product, vec(a ∘ b) = b ⊗ a = a ⊗_L b.
Figure 2.2: Matricization (flattening, unfolding) used in tensor reshaping. (a) Mode-1, mode-2, and mode-3 matricizations of a 3rd-order tensor, from the top to the bottom panel. (b) Tensor network diagram for the mode-n matricization of an Nth-order tensor, A ∈ R^{I_1×I_2×···×I_N}, into a short and wide matrix, A_(n) ∈ R^{I_n × I_1 ··· I_{n−1} I_{n+1} ··· I_N}. (c) Mode-{1, 2, ..., n} (canonical) matricization of an Nth-order tensor, A, into a matrix A_<n> = A_{(i_1···i_n; i_{n+1}···i_N)} ∈ R^{I_1 I_2 ··· I_n × I_{n+1} ··· I_N}.
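Both unfoldings reduce to Fortran-order reshapes under the little-endian convention. The following NumPy sketch (ours, for illustration; mode n is 0-based in code) mirrors (2.1)–(2.4).

```python
import numpy as np

def unfold(X, n):
    # mode-n matricization X_(n): move mode n to the front, then
    # flatten the remaining modes in little-endian (Fortran) order
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order='F')

def canonical_unfold(X, n):
    # mode-{1,...,n} canonical matricization X_<n> (here n counts modes, 1-based)
    rows = int(np.prod(X.shape[:n]))
    return np.reshape(X, (rows, -1), order='F')

X = np.random.rand(2, 3, 4)                                    # N = 3
assert np.array_equal(canonical_unfold(X, 1), unfold(X, 0))    # X_<1> = X_(1)
assert np.array_equal(canonical_unfold(X, 2), unfold(X, 2).T)  # X_<N-1> = X_(N)^T
assert canonical_unfold(X, 3).shape == (24, 1)                 # X_<N> = vec(X)
# columns of X_(n) are the mode-n fibers
assert np.array_equal(unfold(X, 1)[:, 0], X[0, :, 0])
```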
Figure 2.3: Tensorization of a vector into a matrix, a 3rd-order tensor and a 4th-order tensor.

The tensorization of a vector or a matrix can be considered as a process reverse to vectorization or matricization (see Figures 2.1 and 2.3).
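For instance (our illustration, not from the original text), a vector of length 16 can be tensorized into a 4×4 matrix or a 2×2×2×2 tensor by a Fortran-order reshape, and vectorization inverts the process:

```python
import numpy as np

x = np.arange(1.0, 17.0)                       # vector of length 16
X2 = np.reshape(x, (4, 4), order='F')          # matrix
X4 = np.reshape(x, (2, 2, 2, 2), order='F')    # 4th-order ("quantized") tensor
# vectorization recovers the original vector
assert np.array_equal(X4.reshape(-1, order='F'), x)
assert np.array_equal(X2.reshape(-1, order='F'), x)
```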
Kronecker, strong Kronecker, and Khatri–Rao products of matrices and tensors. For an I × J matrix A and a K × L matrix B, the standard (Right) Kronecker product, A ⊗ B, and the Left Kronecker product, A ⊗_L B, are the following IK × JL matrices:

A ⊗ B = [ a_{1,1} B ··· a_{1,J} B ; ⋮ ⋱ ⋮ ; a_{I,1} B ··· a_{I,J} B ],
A ⊗_L B = [ A b_{1,1} ··· A b_{1,L} ; ⋮ ⋱ ⋮ ; A b_{K,1} ··· A b_{K,L} ].

Observe that A ⊗_L B = B ⊗ A, so the Left Kronecker product will be used in most cases in this monograph, as it is consistent with the little-endian notation.

Using the Left Kronecker product, the strong Kronecker product of two block matrices, A ∈ R^{R_1 I × R_2 J} and B ∈ R^{R_2 K × R_3 L}, given by

A = [ A_{1,1} ··· A_{1,R_2} ; ⋮ ⋱ ⋮ ; A_{R_1,1} ··· A_{R_1,R_2} ],   B = [ B_{1,1} ··· B_{1,R_3} ; ⋮ ⋱ ⋮ ; B_{R_2,1} ··· B_{R_2,R_3} ],

can be defined as a block matrix (see Figure 2.4 for a graphical illustration)

C = A |⊗| B ∈ R^{R_1 IK × R_3 JL},   (2.5)

with blocks C_{r_1,r_3} = Σ_{r_2=1}^{R_2} A_{r_1,r_2} ⊗_L B_{r_2,r_3} ∈ R^{IK×JL}, where A_{r_1,r_2} ∈ R^{I×J} and B_{r_2,r_3} ∈ R^{K×L} are the blocks of the matrices A and B, respectively [62, 112, 113]. Note that the strong Kronecker product is similar to standard block matrix multiplication, but performed using Kronecker products of the blocks instead of standard matrix–matrix products. The above definitions of Kronecker products can be naturally extended to tensors [174] (see Table 2.1), as shown below.

Figure 2.4: Illustration of the strong Kronecker product of two block matrices, A = [A_{r_1,r_2}] ∈ R^{R_1 I_1 × R_2 J_1} and B = [B_{r_2,r_3}] ∈ R^{R_2 I_2 × R_3 J_2}, which is defined as a block matrix C = A |⊗| B ∈ R^{R_1 I_1 I_2 × R_3 J_1 J_2}, with blocks C_{r_1,r_3} = Σ_{r_2=1}^{R_2} A_{r_1,r_2} ⊗_L B_{r_2,r_3} ∈ R^{I_1 I_2 × J_1 J_2}, for r_1 = 1, ..., R_1, r_2 = 1, ..., R_2 and r_3 = 1, ..., R_3.
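In NumPy, the Left Kronecker product and the strong Kronecker product of block matrices can be sketched as follows (our illustration, not the authors' code; note that `np.kron` implements the Right Kronecker product, so A ⊗_L B is simply `np.kron(B, A)`):

```python
import numpy as np

def kron_left(A, B):
    # Left Kronecker product: A (x)_L B = B (x) A
    return np.kron(B, A)

def strong_kron(Ablocks, Bblocks):
    # strong Kronecker product of block matrices (lists of lists of blocks):
    # C[r1][r3] = sum_{r2} Ablocks[r1][r2] (x)_L Bblocks[r2][r3]
    R2 = len(Bblocks)
    C = [[sum(kron_left(Ablocks[r1][r2], Bblocks[r2][r3]) for r2 in range(R2))
          for r3 in range(len(Bblocks[0]))] for r1 in range(len(Ablocks))]
    return np.block(C)

# little-endian consistency: vec(a o b) = a (x)_L b  (Fortran-order vec)
a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0])
assert np.array_equal(kron_left(a, b), np.outer(a, b).reshape(-1, order='F'))
```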
The Kronecker product of tensors. The (Left) Kronecker product of two Nth-order tensors, A ∈ R^{I_1×I_2×···×I_N} and B ∈ R^{J_1×J_2×···×J_N}, yields a tensor C = A ⊗_L B ∈ R^{I_1 J_1×···×I_N J_N} of the same order but enlarged in size, with entries c_{i_1 j_1, ..., i_N j_N} = a_{i_1,...,i_N} b_{j_1,...,j_N}, as illustrated in Figure 2.5.

The mode-n Khatri–Rao product of tensors.
The mode-n Khatri–Rao product of two Nth-order tensors, A ∈ R^{I_1×I_2×···×I_n×···×I_N} and B ∈ R^{J_1×J_2×···×J_n×···×J_N}, for which I_n = J_n, yields a tensor C = A ⊙_n B ∈ R^{I_1 J_1×···×I_{n−1} J_{n−1}×I_n×I_{n+1} J_{n+1}×···×I_N J_N}, with subtensors C(:, ..., :, i_n, :, ..., :) = A(:, ..., :, i_n, :, ..., :) ⊗ B(:, ..., :, i_n, :, ..., :).

Figure 2.5: The Left Kronecker product of two 4th-order tensors, A and B, yields a 4th-order tensor, C = A ⊗_L B ∈ R^{I_1 J_1×···×I_4 J_4}, with entries c_{k_1,k_2,k_3,k_4} = a_{i_1,...,i_4} b_{j_1,...,j_4}, where k_n = i_n j_n (n = 1, 2, 3, 4). Note that the order of the tensor C is the same as the order of A and B, but the size of every mode of C is the product of the respective sizes of A and B.

The mode-2 and mode-1 Khatri–Rao products of matrices. The above definition simplifies to the standard Khatri–Rao (mode-2) product of two matrices, A = [a_1, a_2, ..., a_R] ∈ R^{I×R} and B = [b_1, b_2, ..., b_R] ∈ R^{J×R}, in other words a "column-wise Kronecker product". Therefore, the standard Right and Left Khatri–Rao products for matrices are respectively given by

A ⊙ B = [a_1 ⊗ b_1, a_2 ⊗ b_2, ..., a_R ⊗ b_R] ∈ R^{IJ×R},   (2.6)
A ⊙_L B = [a_1 ⊗_L b_1, a_2 ⊗_L b_2, ..., a_R ⊗_L b_R] ∈ R^{IJ×R}.   (2.7)

(For simplicity, the mode-2 subindex is usually omitted, i.e., A ⊙_2 B = A ⊙ B.)

Analogously, the mode-1 Khatri–Rao product of two matrices, A ∈ R^{I×R} and B ∈ R^{I×Q}, is defined as

A ⊙_1 B = [ A(1,:) ⊗ B(1,:) ; ⋮ ; A(I,:) ⊗ B(I,:) ] ∈ R^{I×RQ}.   (2.8)
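A NumPy sketch of the column-wise (Right) Khatri–Rao product of (2.6) and the row-wise (mode-1) product of (2.8) (our illustration, not the authors' code):

```python
import numpy as np

def khatri_rao(A, B):
    # Right (mode-2) Khatri-Rao: column j is kron(A[:, j], B[:, j])
    I, R = A.shape
    J, _ = B.shape
    return (A[:, None, :] * B[None, :, :]).reshape(I * J, R)

def khatri_rao_mode1(A, B):
    # mode-1 (row-wise) Khatri-Rao: row i is kron(A[i, :], B[i, :])
    I, R = A.shape
    _, Q = B.shape
    return (A[:, :, None] * B[:, None, :]).reshape(I, R * Q)

A, B = np.random.rand(2, 3), np.random.rand(4, 3)
C = khatri_rao(A, B)
assert C.shape == (8, 3)
assert np.allclose(C[:, 1], np.kron(A[:, 1], B[:, 1]))
```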
Direct sum of tensors. A direct sum of Nth-order tensors A ∈ R^{I_1×···×I_N} and B ∈ R^{J_1×···×J_N} yields a tensor C = A ⊕ B ∈ R^{(I_1+J_1)×···×(I_N+J_N)}, with entries C(k_1, ..., k_N) = A(k_1, ..., k_N) if 1 ≤ k_n ≤ I_n, ∀n; C(k_1, ..., k_N) = B(k_1 − I_1, ..., k_N − I_N) if I_n < k_n ≤ I_n + J_n, ∀n; and C(k_1, ..., k_N) = 0 otherwise (see Figure 2.6(a)).
Partial (mode-n) direct sum of tensors. A partial direct sum of tensors A ∈ R^{I_1×···×I_N} and B ∈ R^{J_1×···×J_N}, with I_n = J_n, yields a tensor C = A ⊕_{n̄} B ∈ R^{(I_1+J_1)×···×(I_{n−1}+J_{n−1})×I_n×(I_{n+1}+J_{n+1})×···×(I_N+J_N)}, where C(:, ..., :, i_n, :, ..., :) = A(:, ..., :, i_n, :, ..., :) ⊕ B(:, ..., :, i_n, :, ..., :), as illustrated in Figure 2.6(b).

Concatenation of Nth-order tensors. A concatenation along mode n of tensors A ∈ R^{I_1×···×I_N} and B ∈ R^{J_1×···×J_N}, for which I_m = J_m, ∀m ≠ n, yields a tensor C ∈ R^{I_1×···×I_{n−1}×(I_n+J_n)×I_{n+1}×···×I_N}, with subtensors C(i_1, ..., i_{n−1}, :, i_{n+1}, ..., i_N) = A(i_1, ..., i_{n−1}, :, i_{n+1}, ..., i_N) ⊕ B(i_1, ..., i_{n−1}, :, i_{n+1}, ..., i_N), as illustrated in Figure 2.6(c). For a concatenation of two tensors of suitable dimensions along mode n, we will use the equivalent notations C = A ⊕_n B = A ⊞_n B.
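The direct sum generalizes the block-diagonal construction of matrices, while concatenation along mode n is just `np.concatenate`. A small sketch (ours, for illustration):

```python
import numpy as np

def direct_sum(A, B):
    # C = A (+) B: A in the "top-left" corner, B in the "bottom-right",
    # zeros elsewhere (block-diagonal structure in every mode)
    C = np.zeros(tuple(i + j for i, j in zip(A.shape, B.shape)))
    C[tuple(slice(0, i) for i in A.shape)] = A
    C[tuple(slice(i, None) for i in A.shape)] = B
    return C

A, B = np.ones((2, 2, 2)), 2 * np.ones((1, 3, 2))
C = direct_sum(A, B)
assert C.shape == (3, 5, 4)
assert C[0, 0, 0] == 1 and C[2, 4, 3] == 2 and C[0, 0, 3] == 0

# concatenation along mode n (sizes in all other modes must agree):
D = np.concatenate((np.ones((2, 2, 2)), np.zeros((2, 3, 2))), axis=1)
assert D.shape == (2, 5, 2)
```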
3D convolution. For simplicity, consider two 3rd-order tensors, A ∈ R^{I_1×I_2×I_3} and B ∈ R^{J_1×J_2×J_3}. Their 3D convolution yields a tensor C = A ∗ B ∈ R^{(I_1+J_1−1)×(I_2+J_2−1)×(I_3+J_3−1)}, with entries

C(k_1, k_2, k_3) = Σ_{j_1} Σ_{j_2} Σ_{j_3} B(j_1, j_2, j_3) A(k_1 − j_1 + 1, k_2 − j_2 + 1, k_3 − j_3 + 1),

as illustrated in Figures 2.7 and 2.8.

Partial (mode-n) convolution. For simplicity, consider two 3rd-order tensors A ∈ R^{I_1×I_2×I_3} and B ∈ R^{J_1×J_2×J_3}. Their mode-2 (partial) convolution yields a tensor C = A ∗_2 B ∈ R^{I_1 J_1×(I_2+J_2−1)×I_3 J_3}, the subtensors (vectors) of which are C(k_1, :, k_3) = A(i_1, :, i_3) ∗ B(j_1, :, j_3) ∈ R^{I_2+J_2−1}, where k_1 = i_1 j_1 and k_3 = i_3 j_3 are multi-indices.
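A direct (loop-based) NumPy sketch of the full 3D convolution above; this is our illustration for small tensors, not an efficient implementation:

```python
import numpy as np

def conv3d_full(A, B):
    # C(k) = sum_j B(j) * A(k - j + 1); C has size (I_n + J_n - 1) in each mode
    I, J = A.shape, B.shape
    C = np.zeros(tuple(i + j - 1 for i, j in zip(I, J)))
    for j1 in range(J[0]):
        for j2 in range(J[1]):
            for j3 in range(J[2]):
                # shift a full copy of A by (j1, j2, j3) and accumulate
                C[j1:j1 + I[0], j2:j2 + I[1], j3:j3 + I[2]] += B[j1, j2, j3] * A
    return C

# 1D sanity check embedded in 3D: matches np.convolve in "full" mode
A = np.array([[[1.0, 2.0, 3.0]]])    # 1 x 1 x 3
B = np.array([[[1.0, 1.0]]])         # 1 x 1 x 2
assert np.allclose(conv3d_full(A, B)[0, 0], np.convolve([1.0, 2.0, 3.0], [1.0, 1.0]))
```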
Outer product. The central operator in tensor analysis is the outer or tensor product, which, for the tensors A ∈ R^{I_1×···×I_N} and B ∈ R^{J_1×···×J_M}, gives the tensor C = A ∘ B ∈ R^{I_1×···×I_N×J_1×···×J_M} with entries c_{i_1,...,i_N,j_1,...,j_M} = a_{i_1,...,i_N} b_{j_1,...,j_M}.

Note that for 1st-order tensors (vectors), the tensor product reduces to the standard outer product of two nonzero vectors, a ∈ R^I and b ∈ R^J, which yields a rank-1 matrix, X = a ∘ b = a b^T ∈ R^{I×J}. The outer product of three nonzero vectors, a ∈ R^I, b ∈ R^J and c ∈ R^K, gives a 3rd-order rank-1 tensor (called a pure or elementary tensor), X = a ∘ b ∘ c ∈ R^{I×J×K}, with entries x_{ijk} = a_i b_j c_k.

Rank-1 tensor.
A tensor, X ∈ R^{I_1×I_2×···×I_N}, is said to be of rank 1 if it can be expressed exactly as the outer product, X = b^(1) ∘ b^(2) ∘ ··· ∘ b^(N), of nonzero vectors, b^(n) ∈ R^{I_n}, with the tensor entries given by x_{i_1,i_2,...,i_N} = b^(1)_{i_1} b^(2)_{i_2} ··· b^(N)_{i_N}.
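Both the outer product and rank-1 tensors map directly onto `np.tensordot` / `np.einsum` (our illustration):

```python
import numpy as np

# outer product of a matrix and a vector: a (2+1)th-order tensor
A, B = np.random.rand(2, 3), np.random.rand(4)
C = np.tensordot(A, B, axes=0)               # C = A o B
assert C.shape == (2, 3, 4)
assert np.isclose(C[1, 2, 3], A[1, 2] * B[3])

# rank-1 tensor X = a o b o c with entries x_ijk = a_i b_j c_k
a, b, c = np.random.rand(2), np.random.rand(3), np.random.rand(4)
X = np.einsum('i,j,k->ijk', a, b, c)
assert np.isclose(X[1, 2, 3], a[1] * b[2] * c[3])
```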
Kruskal tensor, CP decomposition. For further discussion, it is important to highlight that any tensor can be expressed as a finite sum of rank-1 tensors, in the form

X = Σ_{r=1}^{R} b^(1)_r ∘ b^(2)_r ∘ ··· ∘ b^(N)_r = Σ_{r=1}^{R} ( ∘_{n=1}^{N} b^(n)_r ),   b^(n)_r ∈ R^{I_n},   (2.9)

which is exactly the form of the Kruskal tensor, illustrated in Figure 2.9, also known under the names CANDECOMP/PARAFAC, Canonical Polyadic Decomposition (CPD), or simply the CP decomposition in (1.2). We will use the acronyms CP and CPD.

Figure 2.6: Illustration of the direct sum, partial direct sum and concatenation operators for two 3rd-order tensors. (a) Direct sum. (b) Partial (mode-1, mode-2, and mode-3) direct sum. (c) Concatenation along mode-1, 2, 3.

Figure 2.7: Illustration of the 2D convolution operator, performed through a sliding window operation along both the horizontal and vertical index.

Figure 2.8: Illustration of the 3D convolution operator, performed through a sliding window operation along all three indices (a Hadamard product within the window, followed by a reduction by summation).

Tensor rank.
The tensor rank, also called the CP rank, is a natural extension of the matrix rank and is defined as the minimum number, R, of rank-1 terms in an exact CP decomposition of the form (2.9).

Although the CP decomposition has already found many practical applications, its limiting theoretical property is that the best rank-R approximation of a given data tensor may not exist (see [63] for more detail). However, a rank-R tensor can be approximated arbitrarily well by a sequence of tensors for which the CP ranks are strictly less than R. For these reasons, the concept of border rank was proposed [21], which is defined as the minimum number of rank-1 tensors which provides an approximation of a given tensor with arbitrary accuracy.

Figure 2.9: The CP decomposition of a 4th-order tensor X of rank R. Observe that the rank-1 subtensors are formed through the outer products of the vectors b^(1)_r, ..., b^(4)_r, r = 1, ..., R.
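Reconstructing a tensor from its Kruskal (CP) factors of (2.9) is a one-line `einsum` (our sketch, not the authors' code; each factor matrix has shape I_n × R):

```python
import numpy as np

def cp_reconstruct(factors):
    # X = sum_r b1_r o b2_r o ... o bN_r ; factors[n] has shape (I_n, R)
    modes = [chr(ord('a') + n) for n in range(len(factors))]
    eq = ','.join(m + 'z' for m in modes) + '->' + ''.join(modes)  # 'z' = rank index
    return np.einsum(eq, *factors)

# for two factors this is just a sum of rank-1 matrices, i.e. A @ B.T
A, B = np.random.rand(4, 3), np.random.rand(5, 3)
assert np.allclose(cp_reconstruct([A, B]), A @ B.T)

X = cp_reconstruct([np.random.rand(2, 2), np.random.rand(3, 2), np.random.rand(4, 2)])
assert X.shape == (2, 3, 4)
```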
Symmetric tensor decomposition. A symmetric tensor (sometimes called a super-symmetric tensor) is invariant to the permutations of its indices. A symmetric tensor of Nth order has equal sizes, I_n = I, ∀n, in all its modes, and the same value of its entries for every permutation of its indices. For example, for vectors b^(n) = b ∈ R^I, ∀n, the rank-1 tensor constructed by N outer products, ∘_{n=1}^N b^(n) = b ∘ b ∘ ··· ∘ b ∈ R^{I×I×···×I}, is symmetric. Moreover, every symmetric tensor can be expressed as a linear combination of such symmetric rank-1 tensors through the so-called symmetric CP decomposition, given by

X = Σ_{r=1}^{R} λ_r b_r ∘ b_r ∘ ··· ∘ b_r,   b_r ∈ R^I,   (2.10)

where λ_r ∈ R are the scaling parameters for the unit-length vectors b_r, while the symmetric tensor rank is the minimal number R of rank-1 tensors necessary for its exact representation.

Multilinear products.
The mode-n (multilinear) product, also called the tensor-times-matrix (TTM) product, of a tensor, A ∈ R^{I_1×···×I_N}, and a matrix, B ∈ R^{J×I_n}, gives the tensor

C = A ×_n B ∈ R^{I_1×···×I_{n−1}×J×I_{n+1}×···×I_N},   (2.11)

with entries

c_{i_1,i_2,...,i_{n−1},j,i_{n+1},...,i_N} = Σ_{i_n=1}^{I_n} a_{i_1,i_2,...,i_N} b_{j,i_n}.   (2.12)

From (2.12) and Figure 2.10, the equivalent matrix form is C_(n) = B A_(n), which allows us to employ established fast matrix-by-vector and matrix-by-matrix multiplications when dealing with very large-scale tensors. Efficient and optimized algorithms for TTM are, however, still emerging [11, 12, 131].

Figure 2.10: Illustration of the multilinear mode-n product, also known as the TTM (tensor-times-matrix) product, performed in the tensor format (left) and the matrix format (right). (a) Mode-1 product of a 3rd-order tensor, A ∈ R^{I_1×I_2×I_3}, and a factor (component) matrix, B ∈ R^{J×I_1}, yields a tensor C = A ×_1 B ∈ R^{J×I_2×I_3}. This is equivalent to the simple matrix multiplication formula C_(1) = B A_(1). (b) Graphical representation of the mode-n product of an Nth-order tensor, A ∈ R^{I_1×I_2×···×I_N}, and a factor matrix, B ∈ R^{J×I_n}.

Full multilinear (Tucker) product. A full multilinear product, also called the Tucker product, of an Nth-order tensor, G ∈ R^{R_1×R_2×···×R_N}, and a set of N factor matrices, B^(n) ∈ R^{I_n×R_n} for n = 1, 2, ..., N, performs the multiplications in all the modes and can be compactly written as (see Figure 2.11(b))

C = G ×_1 B^(1) ×_2 B^(2) ··· ×_N B^(N) = [[G; B^(1), B^(2), ..., B^(N)]] ∈ R^{I_1×I_2×···×I_N}.   (2.13)

Observe that this format corresponds to the Tucker decomposition [119, 209, 210] (see Section 3.3).
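The TTM product (2.11) and the full Tucker product (2.13) can be sketched as follows (our illustration, not the authors' code; `ttm` realizes C_(n) = B A_(n) implicitly via `np.tensordot`):

```python
import numpy as np

def ttm(A, B, n):
    # mode-n product C = A x_n B (n is 0-based): contract B's columns
    # with mode n of A, then move the new mode of size J back to position n
    return np.moveaxis(np.tensordot(B, A, axes=(1, n)), 0, n)

def tucker_product(G, factors):
    # full multilinear product [[G; B^(1), ..., B^(N)]]
    for n, B in enumerate(factors):
        G = ttm(G, B, n)
    return G

G = np.random.rand(2, 3, 4)
Bs = [np.random.rand(5, 2), np.random.rand(6, 3), np.random.rand(7, 4)]
C = tucker_product(G, Bs)
assert C.shape == (5, 6, 7)
# cross-check against the entry-wise definition
assert np.allclose(C, np.einsum('abc,ia,jb,kc->ijk', G, *Bs))
```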
Multilinear product of a tensor and a vector (TTV). In a similar way, the mode-n multiplication of a tensor, A ∈ R^{I_1×···×I_N}, and a vector, b ∈ R^{I_n} (tensor-times-vector, TTV), yields a tensor

C = A ×̄_n b ∈ R^{I_1×···×I_{n−1}×I_{n+1}×···×I_N},   (2.14)

with entries

c_{i_1,...,i_{n−1},i_{n+1},...,i_N} = Σ_{i_n=1}^{I_n} a_{i_1,...,i_{n−1},i_n,i_{n+1},...,i_N} b_{i_n}.   (2.15)

Note that the mode-n multiplication of a tensor by a matrix does not change the tensor order, while the multiplication of a tensor by vectors reduces its order, with the mode n removed (see Figure 2.11).

Multilinear products of tensors by matrices or vectors play a key role in deterministic methods for the reshaping of tensors and dimensionality reduction, as well as in probabilistic methods for randomization/sketching procedures and in random projections of tensors into matrices or vectors. In other words, we can also perform the reshaping of a tensor through random projections that change its entries, dimensionality or size of modes, and/or the tensor order. This is achieved by multiplying a tensor by random matrices or vectors, transformations which preserve its basic properties [72, 126, 132, 137, 168, 192, 199, 223] (see Section 3.5 for more detail).
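A minimal TTV example (ours): contracting one mode of a 3rd-order tensor with a vector removes that mode, as in (2.14).

```python
import numpy as np

A, b = np.random.rand(3, 4, 5), np.random.rand(4)
C = np.tensordot(A, b, axes=(1, 0))     # C = A x-bar_2 b (mode 2, 0-based axis 1)
assert C.shape == (3, 5)                # the order drops from 3 to 2
assert np.isclose(C[2, 4], A[2, :, 4] @ b)
```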
Tensor contractions. Tensor contraction is a fundamental and the most important operation in tensor networks, and can be considered a higher-dimensional analogue of matrix multiplication, the inner product, and the outer product.

Figure 2.11: Multilinear tensor products in a compact tensor network notation. (a) Transforming and/or compressing a 4th-order tensor, G ∈ R^{R_1×R_2×R_3×R_4}, into a scalar, vector, matrix and 3rd-order tensor, by multilinear products of the tensor and vectors. Note that a mode-n multiplication of a tensor by a matrix does not change the order of the tensor, while a multiplication of a tensor by a vector reduces its order by one. For example, a multilinear product of a 4th-order tensor and four vectors (top diagram) yields a scalar. (b) Multilinear product of a tensor, G ∈ R^{R_1×R_2×···×R_5}, and five factor (component) matrices, B^(n) ∈ R^{I_n×R_n} (n = 1, 2, ..., 5), yields the tensor C = G ×_1 B^(1) ×_2 B^(2) ×_3 B^(3) ×_4 B^(4) ×_5 B^(5) ∈ R^{I_1×I_2×···×I_5}. This corresponds to the Tucker format. (c) Multilinear product of a 4th-order tensor, G ∈ R^{R_1×R_2×R_3×R_4}, and three vectors, b_n ∈ R^{R_n} (n = 1, 2, 3), yields the vector c = G ×̄_1 b_1 ×̄_2 b_2 ×̄_3 b_3 ∈ R^{R_4}.

In a way similar to the mode-n multilinear product, the mode-(m, n) product (tensor contraction) of two tensors, A ∈ R^{I_1×I_2×···×I_N} and B ∈ R^{J_1×J_2×···×J_M}, with common modes, I_n = J_m, yields an (N + M − 2)th-order tensor, C ∈ R^{I_1×···×I_{n−1}×I_{n+1}×···×I_N×J_1×···×J_{m−1}×J_{m+1}×···×J_M}, in the form (see Figure 2.12(a))

C = A ×_n^m B,   (2.16)

for which the entries are computed as

c_{i_1,...,i_{n−1},i_{n+1},...,i_N,j_1,...,j_{m−1},j_{m+1},...,j_M} = Σ_{i_n=1}^{I_n} a_{i_1,...,i_{n−1},i_n,i_{n+1},...,i_N} b_{j_1,...,j_{m−1},i_n,j_{m+1},...,j_M}.   (2.17)

This operation is referred to as a contraction of two tensors in a single common mode. (In the literature, the symbol ×_n^m is sometimes replaced by •_n^m.)

Tensors can be contracted in several modes or even in all modes, as illustrated in Figure 2.12. For convenience of presentation, the super- or sub-index, e.g., m or n, will be omitted in a few special cases. For example, the multilinear product of the tensors A ∈ R^{I_1×I_2×···×I_N} and B ∈ R^{J_1×J_2×···×J_M}, with a common mode I_N = J_1, can be written as

C = A ×_N^1 B = A ×^1 B = A • B ∈ R^{I_1×I_2×···×I_{N−1}×J_2×···×J_M},   (2.18)

with entries c_{i_1,i_2,...,i_{N−1},j_2,...,j_M} = Σ_{i_N=1}^{I_N} a_{i_1,i_2,...,i_N} b_{i_N,j_2,...,j_M}.

In this notation, the multiplications of matrices and vectors can be written as A ×_2^1 B = A ×^1 B = AB, A ×_2^2 B = A B^T, A ×_{1,2}^{1,2} B = A ×̄ B = ⟨A, B⟩, and A ×_2^1 x = A ×^1 x = Ax.

Note that tensor contractions are, in general, not associative or commutative, since when contracting more than two tensors, the order has to be precisely specified (defined), for example, A ×_b^a (B ×_d^c C) for b < c.

It is also important to note that a matrix-by-vector product, y = Ax ∈ R^{I_1···I_N}, with A ∈ R^{I_1···I_N × J_1···J_N} and x ∈ R^{J_1···J_N}, can be expressed in a tensorized form via the contraction operator as Y = A ×̄ X, where
the symbol ×̄ denotes the contraction of all modes of the tensor X (see Section 4.5).

Unlike matrix-by-matrix multiplications, for which several efficient parallel schemes have been developed (e.g., the BLAS procedures), the number of efficient algorithms for tensor contractions is rather limited. In practice, due to the high computational complexity of tensor contractions, especially for tensor networks with loops, this operation is often performed approximately [66, 107, 138, 167].

Figure 2.12: Examples of contractions of two tensors. (a) The multilinear product of two tensors is denoted by A ×_n^m B. (b) The inner product of two 3rd-order tensors yields a scalar, c = ⟨A, B⟩ = A ×_{1,2,3}^{1,2,3} B = A ×̄ B = Σ_{i_1,i_2,i_3} a_{i_1,i_2,i_3} b_{i_1,i_2,i_3}. (c) Tensor contraction of two 4th-order tensors, along mode 3 in A and mode 2 in B, yields a 6th-order tensor, C = A ×_3^2 B ∈ R^{I_1×I_2×I_4×J_1×J_3×J_4}, with entries c_{i_1,i_2,i_4,j_1,j_3,j_4} = Σ_{i_3} a_{i_1,i_2,i_3,i_4} b_{j_1,i_3,j_3,j_4}. (d) Tensor contraction of two 5th-order tensors along modes 3, 4, 5 in A and 1, 2, 3 in B yields a 4th-order tensor, C = A ×_{3,4,5}^{1,2,3} B ∈ R^{I_1×I_2×J_4×J_5}.
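Mode-(m, n) contractions map directly onto `np.tensordot` (our illustration, mirroring Figure 2.12(b) and (c)):

```python
import numpy as np

A = np.random.rand(2, 3, 4, 5)
B = np.random.rand(6, 4, 7, 8)
# contract mode 3 of A with mode 2 of B (0-based axes 2 and 1): 6th-order result
C = np.tensordot(A, B, axes=(2, 1))
assert C.shape == (2, 3, 5, 6, 7, 8)
assert np.isclose(C[0, 0, 0, 0, 0, 0], A[0, 0, :, 0] @ B[0, :, 0, 0])

# inner product: contract all modes of two equal-size tensors -> a scalar
X = np.random.rand(2, 3, 4)
assert np.isclose(np.tensordot(X, X, axes=3), np.sum(X * X))
```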
Tensor trace. Consider a tensor with partial self-contraction modes, where the outer (or open) indices represent the physical modes of the tensor, while the inner indices indicate its contraction modes. The tensor trace operator performs the summation over all inner indices of the tensor [89]. For example, a tensor A of size R × I × R has two inner indices, modes 1 and 3 of size R, and one open mode of size I. Its tensor trace yields a vector of length I, given by

a = Tr(A) = Σ_r A(r, :, r),

the elements of which are the traces of its lateral slices A_i ∈ R^{R×R} (i = 1, 2, ..., I), that is (see the bottom of Figure 2.13),

a = [tr(A_1), ..., tr(A_i), ..., tr(A_I)]^T.   (2.19)

A tensor can have more than one pair of inner indices; e.g., the tensor A of size R × I × S × S × I × R has two pairs of inner indices (modes 1 and 6, and modes 3 and 4) and two open modes (2 and 5). The tensor trace of A therefore returns a matrix of size I × I, defined as

Tr(A) = Σ_r Σ_s A(r, :, s, s, :, r).

A variant of the tensor trace [128] for the case of partial tensor self-contraction considers a tensor A ∈ R^{R×I_1×I_2×···×I_N×R} and yields a reduced-order tensor Ã = Tr(A) ∈ R^{I_1×I_2×···×I_N}, with entries

Ã(i_1, i_2, ..., i_N) = Σ_{r=1}^{R} A(r, i_1, i_2, ..., i_N, r).   (2.20)

Conversions of tensors to scalars, vectors, matrices or tensors with reshaped modes and/or reduced orders are illustrated in Figures 2.11–2.13.

Figure 2.13: Tensor network notation for the traces of matrices (panels 1–4 from the top), and a (partial) tensor trace (tensor self-contraction) of a 3rd-order tensor (bottom panel). Note that the graphical representations of the trace of matrices intuitively explain the permutation property of the trace operator, e.g., tr(A_1 A_2 A_3 A_4) = tr(A_2 A_3 A_4 A_1).

Tensor networks (TNs) represent a higher-order tensor as a set of sparsely interconnected lower-order tensors (see Figure 2.14), and in this way provide computational and storage benefits. The lines (branches, edges) connecting core tensors correspond to the contracted modes, while their weights (or numbers of branches) represent the rank of the tensor network; the lines which do not connect core tensors correspond to the "external" physical variables (modes, indices) within the data tensor. In other words, the number of free (dangling) edges (with weights larger than one) determines the order of a data tensor under consideration, while the set of weights of internal branches represents the TN rank. (Strictly speaking, the minimum set of internal indices {R_1, R_2, R_3, ...} is called the rank (bond dimensions) of a specific tensor network.)

Figure 2.14: Illustration of the decomposition of a 9th-order tensor, X ∈ R^{I_1×I_2×···×I_9}, into different forms of tensor networks (TNs). In general, the objective is to decompose a very high-order tensor into sparsely (weakly) connected low-order and small-size tensors, typically 3rd-order and 4th-order tensors, called cores. Top: the Tensor Chain (TC) model, which is equivalent to the Matrix Product State (MPS) with periodic boundary conditions (PBC). Middle: the Projected Entangled-Pair States (PEPS), also with PBC. Bottom: the Tree Tensor Network State (TTNS).

Hierarchical Tucker (HT) decompositions (also called the hierarchical tensor representation) were introduced in [92] and also, independently, in [86]; see also [7, 91, 122, 139, 211] and references therein. (The HT model was developed independently, from a different perspective, in the chemistry community under the name Multilayer Multi-Configurational Time-Dependent Hartree method (ML-MCTDH) [220]. Furthermore, the PARATREE model, developed independently for signal processing applications [181], is quite similar to the HT model [86].) Generally, the HT decomposition requires splitting the set of modes of a tensor in a hierarchical way, which results in a binary tree containing a subset of modes at each branch (called a dimension tree); examples of binary trees are given in Figures 2.15, 2.16 and 2.17. In tensor networks based on binary trees, all the cores are of order three or less.
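The (partial) tensor trace defined earlier in this section is a self-contraction over paired inner indices, which `np.einsum` expresses by repeating an index label (our illustration):

```python
import numpy as np

A = np.random.rand(3, 5, 3)   # inner modes 1 and 3 (size R = 3), open mode of size I = 5
a = np.einsum('rir->i', A)    # a = Tr(A), a vector of length I
assert a.shape == (5,)
# element i is the matrix trace of the lateral slice A(:, i, :), as in (2.19)
assert np.allclose(a, [np.trace(A[:, i, :]) for i in range(5)])
```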
Observe that the HT model does not contain any cycles (loops), i.e., there are no closed paths of edges in the graph. The splitting of the set of modes of the original data tensor by the binary tree edges is performed through a suitable matricization.

Choice of dimension tree.
The dimension tree within the HT format is chosen a priori and defines the topology of the HT decomposition. Intuitively, the dimension tree specifies which groups of modes are "separated" from other groups of modes, so that a sequential HT decomposition can be performed via a (truncated) SVD applied to a suitably matricized tensor. One of the simplest and most straightforward choices of a dimension tree is the linear and unbalanced tree, which gives rise to the tensor train (TT) decomposition, discussed in detail in Section 2.4 and Section 4 [158, 161].

Figure 2.15: The standard Tucker decomposition of an 8th-order tensor into a core tensor (red circle) and eight factor matrices (green circles), and its transformation into an equivalent Hierarchical Tucker (HT) model using interconnected smaller size 3rd-order core tensors and the same factor matrices.

Using mathematical formalism, a dimension tree is a binary tree T_N, N > 1, which satisfies the following conditions:

(i) all nodes t ∈ T_N are non-empty subsets of {1, 2, . . . , N};

(ii) the set t_root = {1, 2, . . . , N} is the root node of T_N; and

(iii) each non-leaf node has two children u, v ∈ T_N such that t is a disjoint union t = u ∪ v.

The HT model is illustrated through the following example.

Example.
Suppose that the dimension tree T is given, which gives the HT decomposition illustrated in Figure 2.17. The HT decomposition of a tensor X ∈ R^{I_1 × ··· × I_7} with a given set of integers {R_t}_{t ∈ T} can be expressed in the tensor and vector/matrix forms as follows. Let the intermediate tensors X^(t), with t = {n_1, . . . , n_k} ⊂ {1, . . . , 7}, have size I_{n_1} × I_{n_2} × ··· × I_{n_k} × R_t. Let X^(t)_{r_t} ≡ X^(t)(:, . . . , :, r_t) denote the subtensors of X^(t), and X^(t) ≡ X^(t)_{<k>} ∈ R^{I_{n_1} I_{n_2} ··· I_{n_k} × R_t} denote the corresponding unfolded matrices. Let G^(t) ∈ R^{R_u × R_v × R_t} be the core tensors, where u and v denote, respectively, the left and right children of t.

Figure 2.16: Examples of HT/TT models (formats) for distributed Tucker decompositions with 3rd-order cores, for data tensors of orders 3 through 8. Green circles denote factor matrices (which can be absorbed by core tensors), while red circles indicate cores. Observe that the representations are not unique.

Figure 2.17: Example illustrating the HT decomposition for a 7th-order data tensor.

The HT model shown in Figure 2.17 can then be described mathematically in the vector form as

vec(X) ≅ (X^(123) ⊗_L X^(4567)) vec(G^(12···7)),
X^(123) ≅ (B^(1) ⊗_L X^(23)) G^(123)_<2>,
X^(4567) ≅ (X^(45) ⊗_L X^(67)) G^(4567)_<2>,
X^(23) ≅ (B^(2) ⊗_L B^(3)) G^(23)_<2>,
X^(45) ≅ (B^(4) ⊗_L B^(5)) G^(45)_<2>,
X^(67) ≅ (B^(6) ⊗_L B^(7)) G^(67)_<2>.

An equivalent, more explicit form, using tensor notation, becomes

X ≅ Σ_{r_{123}=1}^{R_{123}} Σ_{r_{4567}=1}^{R_{4567}} g^(12···7)_{r_{123}, r_{4567}} X^(123)_{r_{123}} ∘ X^(4567)_{r_{4567}},
X^(123)_{r_{123}} ≅ Σ_{r_1=1}^{R_1} Σ_{r_{23}=1}^{R_{23}} g^(123)_{r_1, r_{23}, r_{123}} b^(1)_{r_1} ∘ X^(23)_{r_{23}},
X^(4567)_{r_{4567}} ≅ Σ_{r_{45}=1}^{R_{45}} Σ_{r_{67}=1}^{R_{67}} g^(4567)_{r_{45}, r_{67}, r_{4567}} X^(45)_{r_{45}} ∘ X^(67)_{r_{67}},
X^(23)_{r_{23}} ≅ Σ_{r_2=1}^{R_2} Σ_{r_3=1}^{R_3} g^(23)_{r_2, r_3, r_{23}} b^(2)_{r_2} ∘ b^(3)_{r_3},
X^(45)_{r_{45}} ≅ Σ_{r_4=1}^{R_4} Σ_{r_5=1}^{R_5} g^(45)_{r_4, r_5, r_{45}} b^(4)_{r_4} ∘ b^(5)_{r_5},
X^(67)_{r_{67}} ≅ Σ_{r_6=1}^{R_6} Σ_{r_7=1}^{R_7} g^(67)_{r_6, r_7, r_{67}} b^(6)_{r_6} ∘ b^(7)_{r_7}.

The TT/HT decompositions lead naturally to a distributed Tucker decomposition, where a single core tensor is replaced by interconnected cores of lower order, resulting in a distributed network in which only some cores are connected directly with factor matrices, as illustrated in Figure 2.15.
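The nested HT construction can be checked numerically. Below is a minimal sketch for a 4th-order tensor with the balanced dimension tree {{1, 2}, {3, 4}}; all sizes, ranks and variable names are illustrative assumptions. The left Kronecker product A ⊗_L B = B ⊗ A is paired with column-major (Fortran-order) reshaping so that the index orderings match:

```python
import numpy as np

rng = np.random.default_rng(0)
I1, I2, I3, I4 = 2, 3, 2, 3      # physical mode sizes (illustrative)
R1, R2, R3, R4 = 2, 2, 2, 2      # leaf ranks
R12, R34 = 3, 3                  # ranks of the internal nodes {1,2} and {3,4}

# Leaf factor matrices, 3rd-order internal cores and the root core (a matrix)
B1, B2 = rng.standard_normal((I1, R1)), rng.standard_normal((I2, R2))
B3, B4 = rng.standard_normal((I3, R3)), rng.standard_normal((I4, R4))
G12 = rng.standard_normal((R1, R2, R12))
G34 = rng.standard_normal((R3, R4, R34))
Groot = rng.standard_normal((R12, R34))

kron_L = lambda A, B: np.kron(B, A)   # left Kronecker product A (x)_L B = B (x) A

# Frame matrices of the internal nodes: X12 in R^{I1 I2 x R12}, X34 analogous
X12 = kron_L(B1, B2) @ G12.reshape(R1 * R2, R12, order='F')
X34 = kron_L(B3, B4) @ G34.reshape(R3 * R4, R34, order='F')

# Whole tensor: vec(X) = (X12 (x)_L X34) vec(Groot), column-major throughout
vecX = kron_L(X12, X34) @ Groot.reshape(-1, order='F')
X = vecX.reshape(I1, I2, I3, I4, order='F')

# Direct evaluation of the same HT model by summing over all rank indices
X_direct = np.einsum('ia,jb,kc,ld,abp,cdq,pq->ijkl',
                     B1, B2, B3, B4, G12, G34, Groot)
assert np.allclose(X, X_direct)
```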
Figure 2.16 illustrates exemplary HT/TT structures for data tensors of various orders [122, 205]. Note that for a 3rd-order tensor there is only one HT tensor network representation, while for a 5th-order tensor we have 5, and for a 10th-order tensor there are 11 possible HT architectures.

A simple approach to reducing the size of a large-scale core tensor in the standard Tucker decomposition (typically, for N > 5) is to apply the concept of distributed tensor networks (DTNs). The DTNs assume two kinds of cores (blocks): (i) internal cores (nodes), which are connected only to other cores and have no free edges, and (ii) external cores, which do have free edges representing physical modes (indices) of a given data tensor (see also Section 2.6). Such distributed representations of tensors are not unique.

The tree tensor network state (TTNS) model, whereby all nodes are of 3rd-order or higher, can be considered as a generalization of the TT/HT decompositions, as illustrated by two examples in Figure 2.18 [149]. A more detailed mathematical description of the TTNS is given in Section 3.3.

Figure 2.18: The Tree Tensor Network State (TTNS) with 3rd-order and 4th-order cores for the representation of 24th-order data tensors. The TTNS can be considered both as a generalization of the HT/TT format and as a distributed model for the Tucker-N decomposition (see Section 3.3).

The Tensor Train (TT) format can be interpreted as a special case of the HT format, where all nodes (TT-cores) of the underlying tensor network are connected in a cascade (or train), i.e., they are aligned, while the factor matrices corresponding to the leaf modes are assumed to be identities and thus need not be stored. The TT format was first proposed in numerical analysis and scientific computing in [158, 161]. Figure 2.19 presents the concept of the TT decomposition for an Nth-order tensor, the entries of which can be computed as a cascaded (multilayer) multiplication of appropriate matrices (slices of TT-cores). The weights of the internal edges (denoted by {R_1, R_2, . . . , R_{N-1}}) represent the TT-rank. In this way, the so aligned sequence of core tensors represents a "tensor train", where the role of "buffers" is played by the TT-core connections.
It is important to highlight that TT networks can be applied not only for the approximation of tensorized vectors but also for scalar multivariate functions, matrices, and even large-scale low-order tensors, as illustrated in Figure 2.20 (for more detail see Section 4).

In the quantum physics community, the TT format is known as the Matrix Product State (MPS) representation with Open Boundary Conditions (OBC) and was introduced in 1987 as the ground state of the 1D AKLT model [2]. It was subsequently extended by many researchers (see [102, 156, 166, 183, 214, 216, 224] and references therein).

Figure 2.19: Concepts of the tensor train (TT) and tensor chain (TC) decompositions (MPS with OBC and PBC, respectively) for an Nth-order data tensor, X ∈ R^{I_1 × I_2 × ··· × I_N}. (a) The Tensor Train (TT) can be mathematically described as x_{i_1, i_2, ..., i_N} = G^(1)_{i_1} G^(2)_{i_2} ··· G^(N)_{i_N}, where (bottom panel) the slice matrices of the TT-cores G^(n) ∈ R^{R_{n-1} × I_n × R_n} are defined as G^(n)_{i_n} = G^(n)(:, i_n, :) ∈ R^{R_{n-1} × R_n}, with R_0 = R_N = 1. (b) For the Tensor Chain (TC), the entries of a tensor are expressed as x_{i_1, i_2, ..., i_N} = tr(G^(1)_{i_1} G^(2)_{i_2} ··· G^(N)_{i_N}) = Σ_{r_1=1}^{R_1} Σ_{r_2=1}^{R_2} ··· Σ_{r_N=1}^{R_N} g^(1)_{r_N, i_1, r_1} g^(2)_{r_1, i_2, r_2} ··· g^(N)_{r_{N-1}, i_N, r_N}, where (bottom panel) the lateral slices of the TC-cores are defined as G^(n)_{i_n} = G^(n)(:, i_n, :) ∈ R^{R_{n-1} × R_n} and g^(n)_{r_{n-1}, i_n, r_n} = G^(n)(r_{n-1}, i_n, r_n) for n = 1, 2, . . . , N, with R_0 = R_N > 1. Notice that the TC/MPS is effectively a TT with a single loop connecting the first and the last core, so that all TC-cores are of 3rd-order.

Figure 2.20: Forms of tensor train decompositions for a vector, a ∈ R^I, a matrix, A ∈ R^{I × J}, and a 3rd-order tensor, A ∈ R^{I × J × K} (obtained by applying a suitable tensorization).

Advantages of TT formats.
An important advantage of the TT/MPS format over the HT format is its simpler practical implementation, as no binary tree needs to be determined (see Section 4). Another attractive property of the TT decomposition is its simplicity when performing basic mathematical operations on tensors directly in the TT format (that is, employing only core tensors). These include matrix-by-matrix and matrix-by-vector multiplications, tensor addition, and the entry-wise (Hadamard) product of tensors. These operations produce tensors, also in the TT format, which generally exhibit increased TT-ranks. A detailed description of the basic operations supported by the TT format is given in Section 4.5. Moreover, only the TT-cores need to be stored and processed, which makes the number of parameters scale linearly in the tensor order, N, of a data tensor; all mathematical operations are then performed only on the low-order and relatively small-size core tensors.

In fact, the TT format was rediscovered several times under different names: MPS, valence bond states, and density matrix renormalization group (DMRG) [224]. The DMRG usually refers not only to a tensor network format but also to the efficient computational algorithms (see also [101, 182] and references therein). Also, in quantum physics the ALS algorithm is called the one-site DMRG, while the Modified ALS (MALS) is known as the two-site DMRG (for more detail, see Part 2).
Figure 2.21: Class of 1D and 2D tensor train networks with open boundary conditions (OBC): the Matrix Product State (MPS) or (vector) Tensor Train (TT), the Matrix Product Operator (MPO) or Matrix TT, the Projected Entangled-Pair States (PEPS) or Tensor Product State (TPS), and the Projected Entangled-Pair Operators (PEPO).

The TT-rank is defined as an (N-1)-tuple of the form

rank_TT(X) = r_TT = {R_1, . . . , R_{N-1}}, R_n = rank(X_{<n>}), (2.21)

where X_{<n>} ∈ R^{I_1 ··· I_n × I_{n+1} ··· I_N} is an nth canonical matricization of the tensor X. Since the TT-rank determines the memory requirements of a tensor train, it has a strong impact on the complexity, i.e., on the suitability of a tensor train representation for a given raw data tensor.

The number of data samples to be stored scales linearly in the tensor order, N, and the size, I, and quadratically in the maximum TT-rank bound, R, that is,

Σ_{n=1}^{N} R_{n-1} R_n I_n ∼ O(N R² I), R := max_n {R_n}, I := max_n {I_n}. (2.22)

This is why it is crucially important to have low-rank TT approximations. A drawback of the TT format is that the ranks of a tensor train decomposition depend on the ordering (permutation) of the modes; in the worst case scenario, the TT-ranks can grow up to I^(N/2) for an Nth-order tensor.

An important issue in tensor networks is the rank-complexity trade-off in the design. Namely, the main idea behind TNs is to dramatically reduce computational cost and provide distributed storage and computation through a low-rank TN approximation. However, the TT/HT ranks, R_n, of the 3rd-order core tensors sometimes increase rapidly with the order of a data tensor and/or with an increase in the desired approximation accuracy, for any choice of the tree of the tensor network.
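To make the entry evaluation and the O(NR²I) storage count concrete, the following numpy sketch builds random TT-cores (with R_0 = R_N = 1), evaluates a single entry as a product of slice matrices, and counts parameters; all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# TT-cores G(n) of size R_{n-1} x I_n x R_n, with R_0 = R_N = 1 (illustrative)
shapes = [(1, 4, 3), (3, 5, 2), (2, 6, 4), (4, 3, 1)]
cores = [rng.standard_normal(s) for s in shapes]

# Entry evaluation: x(i1,...,iN) = G1[:,i1,:] @ G2[:,i2,:] @ ... @ GN[:,iN,:]
def tt_entry(cores, idx):
    out = np.eye(1)
    for G, i in zip(cores, idx):
        out = out @ G[:, i, :]
    return out.item()          # the final product is a 1 x 1 matrix

# Full tensor by contracting the train, for verification only
X = cores[0]
for G in cores[1:]:
    X = np.tensordot(X, G, axes=([-1], [0]))
X = X.squeeze(axis=(0, -1))    # drop the dummy R_0 and R_N modes

assert np.isclose(tt_entry(cores, (2, 4, 1, 0)), X[2, 4, 1, 0])

# Storage: sum_n R_{n-1} I_n R_n parameters, i.e. linear in the order N
n_params = sum(G.size for G in cores)
print(n_params, 'TT parameters vs', X.size, 'entries of the full tensor')
```

Here 102 core parameters represent a tensor with 360 entries; for high orders N the gap between linear and exponential growth becomes dramatic.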
The ranks can often be kept under control through hierarchical two-dimensional TT models called the PEPS (Projected Entangled-Pair States) and PEPO (Projected Entangled-Pair Operators) tensor networks, which contain cycles, as shown in Figure 2.21. In the PEPS and PEPO, the ranks are kept considerably smaller, at the cost of employing 5th- or even 6th-order core tensors and the associated higher computational complexity with respect to the order [76, 184, 214].

Even with the PEPS/PEPO architectures, for very high-order tensors the ranks (internal sizes of cores) may increase rapidly with an increase in the desired accuracy of approximation. For further control of the ranks, alternative tensor networks can be employed, such as: (1) the Honey-Comb Lattice (HCL), which uses 3rd-order cores, and (2) the Multi-scale Entanglement Renormalization Ansatz (MERA), which consists of both 3rd- and 4th-order core tensors (see Figure 2.22) [83, 143, 156]. The ranks are often kept considerably small through special architectures of such TNs, at the expense of higher computational complexity with respect to tensor contractions.

An "entangled-pair state" is a tensor that cannot be represented as an elementary rank-1 tensor. The state is called "projected" because it is not a real physical state but a projection onto some subspace. The term "pair" refers to the entanglement being considered only for maximally entangled state pairs [94, 156].
The complexity of algorithms for computation (contraction) on tensor networks typically scales polynomially with the rank, R_n, or size, I_n, of the core tensors, so that the computations quickly become intractable with an increase in R_n. A step towards reducing storage and computational requirements would therefore be to reduce the size (volume) of core tensors by increasing their number through distributed tensor networks (DTNs), as illustrated in Figure 2.22. The underpinning idea is that each core tensor in an original TN is replaced by another TN (see Figure 2.23 for TT networks), resulting in a distributed TN in which only some core tensors are associated with the physical (natural) modes of the original data tensor [100]. A DTN consists of two kinds of relatively small-size cores (nodes): internal nodes, which have no free edges, and external nodes, which have free edges representing the natural (physical) indices of a data tensor.

Figure 2.23: Graphical representation of a large-scale data tensor via its TT model (top panel), the PEPS model of the TT (third panel), and its transformation to distributed 2D (second from bottom panel) and 3D (bottom panel) tensor train networks.

The obvious advantage of DTNs is that the size of each core tensor in the internal tensor network structure is usually much smaller than the size of the initial core tensor; this allows for a better management of distributed storage, and often for a reduction of the total number of network parameters through distributed computing. However, compared to the initial tree structures, the contraction of the resulting distributed tensor network becomes much more difficult because of the loops in the architecture.

Table 2.2: Links between tensor networks (TNs) and graphical models used in Machine Learning (ML) and Statistics. The corresponding categories are not exactly the same, but have general analogies.

  Tensor Networks             |  Neural Networks and Graphical Models in ML/Statistics
  TT/MPS                      |  Hidden Markov Models (HMM)
  HT/TTNS                     |  Deep Learning Neural Networks, Gaussian Mixture Model (GMM)
  PEPS                        |  Markov Random Field (MRF), Conditional Random Field (CRF)
  MERA                        |  Wavelets, Deep Belief Networks (DBN)
  ALS, DMRG/MALS Algorithms   |  Forward-Backward Algorithms, Block Nonlinear Gauss-Seidel Methods
Table 2.2 summarizes the conceptual connections of tensor networks with graphical and neural network models in machine learning and statistics [44, 45, 52, 53, 77, 110, 146, 154, 226]. More research is needed to establish deeper and more precise relationships.

2.8 Changing the Structure of Tensor Networks
An advantage of the graphical (graph) representation of tensor networks is that graphs allow us to perform complex mathematical operations on core tensors in an intuitive and easy-to-understand way, without the need to resort to complicated mathematical expressions. Another important advantage is the ability to modify (optimize) the topology of a TN, while keeping the original physical modes intact. Such optimized topologies yield simplified or more convenient graphical representations of a higher-order data tensor and facilitate practical applications [94, 100, 230]. In particular:

• A change in topology to an HT/TT tree structure provides reduced computational complexity, through sequential contractions of core tensors, and enhanced stability of the corresponding algorithms;

• The topology of TNs with cycles can be modified so as to completely eliminate the cycles or to reduce their number;

• Even for vastly diverse original data tensors, topology modifications may produce identical or similar TN structures, which makes it easier to compare and jointly analyze blocks of interconnected data tensors. This provides the opportunity to perform joint group (linked) analysis of tensors by decomposing them into TNs.

It is important to note that, due to the iterative way in which tensor contractions are performed, the computational requirements associated with tensor contractions are usually much smaller for tree-structured networks than for tensor networks containing many cycles. Therefore, for stable computations, it is advantageous to transform a tensor network with cycles into a tree structure.
Tensor Network transformations.
In order to modify tensor network structures, we may perform sequential core contractions, followed by the unfolding of these contracted tensors into matrices, matrix factorizations (typically truncated SVD) and, finally, reshaping of such matrices back into new core tensors, as illustrated in Figure 2.24.

The example in Figure 2.24(a) shows that, in the first step, a contraction of two core tensors, G^(1) ∈ R^{I_1 × I_2 × R} and G^(2) ∈ R^{R × I_3 × I_4}, is performed to give the tensor

G^(1,2) = G^(1) ×¹ G^(2) ∈ R^{I_1 × I_2 × I_3 × I_4}, (2.23)
with entries g^(1,2)_{i_1, i_2, i_3, i_4} = Σ_{r=1}^{R} g^(1)_{i_1, i_2, r} g^(2)_{r, i_3, i_4}.

Figure 2.24: Illustration of basic transformations on a tensor network. (a) Contraction, matricization, matrix factorization (SVD) and reshaping of matrices back into tensors. (b) Transformation of a Honey-Comb lattice into a Tensor Chain (TC) via tensor contractions and the SVD.

In the next step, the tensor G^(1,2) is transformed into a matrix via matricization, followed by a low-rank matrix factorization using the SVD, to give

G^(1,2)_{i_1 i_2, i_3 i_4} ≅ U S V^T ∈ R^{I_1 I_2 × I_3 I_4}. (2.24)

In the final step, the factor matrices, US ∈ R^{I_1 I_2 × R} and V^T ∈ R^{R × I_3 I_4}, are reshaped into new core tensors, G̃^(1) ∈ R^{I_1 × I_2 × R} and G̃^(2) ∈ R^{R × I_3 × I_4}.

The above tensor transformation procedure is quite general, and is applied in Figure 2.24(b) to transform a Honey-Comb lattice into a Tensor Chain (TC), while Figure 2.25 illustrates the conversion of a Tensor Chain into a TT/MPS with OBC.

Figure 2.25: Transformation of the closed-loop Tensor Chain (TC) into the open-loop Tensor Train (TT). This is achieved by suitable contractions, reshapings and decompositions of core tensors.

To convert a TC into a TT/MPS, in the first step we perform a contraction of the two tensors, G^(1) ∈ R^{I_1 × R_1 × R_4} and G^(2) ∈ R^{R_1 × R_2 × I_2}, over the index r_1, as

G^(1,2) ∈ R^{I_1 × R_4 × R_2 × I_2},

for which the entries are g^(1,2)_{i_1, r_4, r_2, i_2} = Σ_{r_1=1}^{R_1} g^(1)_{i_1, r_1, r_4} g^(2)_{r_1, r_2, i_2}. In the next step, the tensor G^(1,2) is transformed into a matrix, followed by a truncated SVD,

G^(1,2)_(1) ≅ U S V^T ∈ R^{I_1 × R_4 R_2 I_2}.

Finally, the matrices, U ∈ R^{I_1 × R̃_1} and S V^T ∈ R^{R̃_1 × R_4 R_2 I_2}, are reshaped back into the core tensors, G̃^(1) = U ∈ R^{1 × I_1 × R̃_1} and G̃^(2) ∈ R^{R̃_1 × R_4 × R_2 × I_2}. The procedure is then repeated all over again for different pairs of cores, as illustrated in Figure 2.25.
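The contraction–matricization–SVD–reshaping pipeline of Figure 2.24(a) can be sketched in a few lines of numpy (C-order reshapes are used consistently on both sides of the factorization; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
I1, I2, I3, I4, R = 3, 4, 4, 3, 5
G1 = rng.standard_normal((I1, I2, R))     # core G(1) of size I1 x I2 x R
G2 = rng.standard_normal((R, I3, I4))     # core G(2) of size R x I3 x I4

# Step 1: contraction over the common rank index R (Eq. (2.23))
G12 = np.einsum('abr,rcd->abcd', G1, G2)  # size I1 x I2 x I3 x I4

# Step 2: matricization, grouping modes (1,2) against modes (3,4)
M = G12.reshape(I1 * I2, I3 * I4)

# Step 3: SVD with (optional) rank truncation at the numerical rank
U, s, Vt = np.linalg.svd(M, full_matrices=False)
Rnew = int(np.sum(s > 1e-10 * s[0]))
U, s, Vt = U[:, :Rnew], s[:Rnew], Vt[:Rnew, :]

# Step 4: reshape the factors back into two new cores (S absorbed on the left)
G1_new = (U * s).reshape(I1, I2, Rnew)
G2_new = Vt.reshape(Rnew, I3, I4)

# The pair of new cores represents the same contracted tensor
recon = np.einsum('abr,rcd->abcd', G1_new, G2_new)
assert np.allclose(recon, G12)
```

Since the contracted tensor here has rank at most R = 5 across the (1,2)|(3,4) split, the SVD recovers a connecting rank no larger than 5.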
Figure 2.26: Block term decomposition (BTD) of a 6th-order block tensor, to yield X = Σ_{r=1}^{R} A_r ∘ (b^(1)_r ∘ b^(2)_r ∘ b^(3)_r) (top panel); for more detail see [57, 193]. BTD in the tensor network notation (bottom panel). Therefore, the 6th-order tensor X is approximately represented as a sum of R terms, each of which is an outer product of a 3rd-order tensor, A_r, and another 3rd-order, rank-1 tensor, b^(1)_r ∘ b^(2)_r ∘ b^(3)_r (in the dashed circle), which itself is an outer product of three vectors.

The fundamental TNs considered so far assume that the links between the cores are expressed by tensor contractions. In general, links between the core tensors (or tensor sub-networks) can also be expressed via other mathematical linear/multilinear or nonlinear operators, such as the outer (tensor) product, the Kronecker product, the Hadamard product and the convolution operator. For example, the use of the outer product leads to the Block Term Decomposition (BTD) [57, 58, 61, 193], while the use of Kronecker products yields the Kronecker Tensor Decomposition (KTD) [174, 175, 178]. Block term decompositions (BTD) are closely related to constrained Tucker formats (with a sparse block Tucker core) and to the Hierarchical Outer Product Tensor Approximation (HOPTA), which can be employed for very high-order data tensors [39].

Figure 2.26 illustrates such a BTD model for a 6th-order tensor, where the links between the components are expressed via outer products, while Figure 2.27 shows a more flexible Hierarchical Outer Product Tensor Approximation (HOPTA) model suitable for very high-order tensors.

Figure 2.27: Conceptual model of the HOPTA generalized tensor network, illustrated for data tensors of different orders.
For simplicity, we use the standard outer (tensor) products, but conceptually nonlinear outer products (see Eq. (2.25)) and other tensor product operators (Kronecker, Hadamard) can also be employed. Each component (core tensor), A_r, B_r and/or C_r, can be further hierarchically decomposed using suitable outer products, so that the HOPTA models can be applied to very high-order tensors.

Observe that the fundamental operator in the HOPTA generalized tensor networks is the outer (tensor) product, which for two tensors A ∈ R^{I_1 × ··· × I_N} and B ∈ R^{J_1 × ··· × J_M}, of arbitrary orders N and M, is defined as an (N + M)th-order tensor C = A ∘ B ∈ R^{I_1 × ··· × I_N × J_1 × ··· × J_M}, with entries c_{i_1, ..., i_N, j_1, ..., j_M} = a_{i_1, ..., i_N} b_{j_1, ..., j_M}. This standard outer product of two tensors can be generalized to a nonlinear outer product as follows:

(A ∘_f B)_{i_1, ..., i_N, j_1, ..., j_M} = f(a_{i_1, ..., i_N}, b_{j_1, ..., j_M}), (2.25)

where f(·, ·) is a suitably designed nonlinear function with associative and commutative properties. In a similar way, we can define other nonlinear tensor products, for example Hadamard, Kronecker or Khatri–Rao products, and employ them in generalized nonlinear tensor networks. The advantage of the HOPTA model over other TN models is its flexibility and the ability to model more complex data structures by approximating very high-order tensors through a relatively small number of low-order cores.

The BTD and KTD models can be expressed mathematically, for example, in the simple nested (hierarchical) forms

BTD: X ≅ Σ_{r=1}^{R} (A_r ∘ B_r), (2.26)

KTD: X̃ ≅ Σ_{r=1}^{R} (A_r ⊗ B_r), (2.27)

where, e.g., for the BTD, each factor tensor can be represented recursively as A_r ≅ Σ_{r_1=1}^{R_1} (A^(1)_{r_1} ∘ B^(1)_{r_1}) or B_r ≅ Σ_{r_2=1}^{R_2} A^(2)_{r_2} ∘ B^(2)_{r_2}.

Note that the 2Nth-order subtensors, A_r ∘ B_r and A_r ⊗ B_r, have the same elements, just arranged differently.
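The claim that A ∘ B and A ⊗ B contain the same elements in different arrangements can be verified directly, e.g., for N = 2 (matrices); sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
J1, J2, K1, K2 = 2, 3, 4, 5
A = rng.standard_normal((J1, J2))
B = rng.standard_normal((K1, K2))

X1 = np.einsum('ij,kl->ijkl', A, B)   # outer product A o B (4th-order tensor)
X2 = np.kron(A, B)                    # Kronecker product (J1K1 x J2K2 matrix)

# Same elements, arranged differently:
# X1(j1,j2,k1,k2) = X2(k1 + K1*(j1-1), k2 + K2*(j2-1))  (1-based indexing).
# Interleaving the j- and k-modes and flattening recovers the Kronecker layout:
assert np.allclose(X1.transpose(0, 2, 1, 3).reshape(J1 * K1, J2 * K2), X2)
```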
For example, if X_1 = A ∘ B and X_2 = A ⊗ B, where A ∈ R^{J_1 × J_2 × ··· × J_N} and B ∈ R^{K_1 × K_2 × ··· × K_N}, then

x_1[j_1, j_2, ..., j_N, k_1, k_2, ..., k_N] = x_2[k_1 + K_1(j_1 - 1), ..., k_N + K_N(j_N - 1)].

The definition of the tensor Kronecker product in the KTD model assumes that both core tensors, A_r and B_r, have the same order. This is not a limitation, given that vectors and matrices can also be treated as tensors; e.g., a matrix of dimension I × J is also a 3rd-order tensor of dimension I × J × 1. In fact, from the BTD/KTD models, many existing and new TDs/TNs can be derived by changing the structure and orders of the factor tensors, A_r and B_r. For example:

• If A_r are rank-1 tensors of size I_1 × I_2 × ··· × I_N, and B_r are scalars, ∀r, then (2.27) represents the rank-R CP decomposition;

• If A_r are rank-L_r tensors of size I_1 × I_2 × ··· × I_R × 1 × ··· × 1, in the Kruskal (CP) format, and B_r are rank-1 tensors of size 1 × ··· × 1 × I_{R+1} × ··· × I_N, ∀r, then (2.27) expresses the rank-(L_r ∘ 1) BTD;

• If A_r and B_r are expressed by KTDs, we arrive at the Nested Kronecker Tensor Decomposition (NKTD), a special case of which is the Tensor Train (TT) decomposition. Therefore, the KTD model in (2.27) can also be used for recursive TT-decompositions.

The generalized tensor network approach caters for a large variety of tensor decomposition models, which may find applications in scientific computing, signal processing or deep learning (see, e.g., [37, 39, 45, 58, 177]). In this monograph, we will mostly focus on the more established Tucker and TT decompositions (and some of their extensions), due to their conceptual simplicity, the availability of stable and efficient algorithms for their computation, and the possibility to naturally extend these models to more complex tensor networks. In other words, the Tucker and TT models are considered here as the simplest prototypes, which can then serve as building blocks for more sophisticated tensor networks.

Chapter 3

Constrained Tensor Decompositions: From Two-way to Multiway Component Analysis
The component analysis (CA) framework usually refers to the application of constrained matrix factorization techniques to observed mixed signals in order to extract components with specific properties and/or estimate the mixing matrix [40, 43, 47, 55, 103]. In machine learning practice, to aid the well-posedness and uniqueness of the problem, component analysis methods exploit prior knowledge about the statistics and diversities of the latent variables (hidden sources) within the data. Here, by diversities, we refer to different characteristics, features or morphologies of the latent variables, which allow us to extract the desired components or features, for example, sparse or statistically independent components.
Two-way Component Analysis (2-way CA), in its simplest form, can be formulated as a constrained matrix factorization, typically of low rank, in the form

X = A Λ B^T + E = Σ_{r=1}^{R} λ_r a_r ∘ b_r + E = Σ_{r=1}^{R} λ_r a_r b_r^T + E, (3.1)

where Λ = diag(λ_1, . . . , λ_R) is an optional diagonal scaling matrix. The potential constraints imposed on the factor matrices, A and/or B, include orthogonality, sparsity, statistical independence, nonnegativity or smoothness. In the bilinear 2-way CA in (3.1), X ∈ R^{I × J} is a known matrix of observed data, E ∈ R^{I × J} represents residuals or noise, A = [a_1, a_2, . . . , a_R] ∈ R^{I × R} is the unknown mixing matrix with R basis vectors a_r ∈ R^I, and, depending on the application, B = [b_1, b_2, . . . , b_R] ∈ R^{J × R} is the matrix of unknown components, factors, latent variables, or hidden sources, represented by vectors b_r ∈ R^J (see Figure 3.2).

It should be noted that 2-way CA has an inherent symmetry. Indeed, Eq. (3.1) could also be written as X^T ≈ B Λ A^T, thus interchanging the roles of sources and mixing process.

Algorithmic approaches to 2-way (matrix) component analysis are well established, and include Principal Component Analysis (PCA), Robust PCA (RPCA), Independent Component Analysis (ICA), Nonnegative Matrix Factorization (NMF), Sparse Component Analysis (SCA) and Smooth Component Analysis (SmCA) [6, 24, 43, 47, 109, 228]. These techniques have become standard tools in blind source separation (BSS), feature extraction, and classification paradigms. The columns of the matrix B, which represent the different latent components, are then determined by the specific chosen constraints and should be, for example: (i) as statistically mutually independent as possible for ICA; (ii) as sparse as possible for SCA; (iii) as smooth as possible for SmCA; (iv) nonnegative for NMF.

The singular value decomposition (SVD) of the data matrix X ∈ R^{I × J} is a special, very important, case of the factorization in Eq.
(3.1), and is given by

X = U S V^T = Σ_{r=1}^{R} σ_r u_r ∘ v_r = Σ_{r=1}^{R} σ_r u_r v_r^T, (3.2)

where U ∈ R^{I × R} and V ∈ R^{J × R} are column-wise orthogonal matrices and S ∈ R^{R × R} is a diagonal matrix containing only nonnegative singular values σ_r, in a monotonically non-increasing order.

According to the well-known Eckart–Young theorem, the truncated SVD provides the optimal, in the least-squares (LS) sense, low-rank matrix approximation; [145] generalized this optimality to arbitrary unitarily invariant norms. The SVD, therefore, forms the backbone of low-rank matrix approximations (and, consequently, of low-rank tensor approximations).

A natural extension of 2-way CA to the joint analysis of linked datasets is to perform a set of (approximate) simultaneous matrix factorizations

X_k ≈ A_k B_k^T, (k = 1, 2, . . . , K), (3.3)

on several data matrices, X_k, which represent linked datasets, subject to various constraints imposed on the linked (interrelated) component (factor) matrices. In the case of orthogonality or statistical independence constraints, the problem in (3.3) can be related to models of group PCA/ICA through suitable pre-processing, dimensionality reduction and post-processing procedures [38, 75, 88, 191, 239]. The terms "group component analysis" and "joint multi-block data analysis" are used interchangeably to refer to methods which aim to identify links (correlations, similarities) between hidden components in data. In other words, the objective of group component analysis is to analyze the correlation, variability, and consistency of the latent components across multi-block datasets. The field of 2-way CA is maturing and has generated efficient algorithms for 2-way component analysis, especially for sparse/functional PCA/SVD, ICA, NMF and SCA [6, 40, 47, 103, 236].

The rapidly emerging field of tensor decompositions is the next important step, which naturally generalizes 2-way CA/BSS models and algorithms. Tensors, by virtue of multilinear algebra, offer enhanced flexibility in CA, in the sense that not all components need to be statistically independent; they can instead be smooth, sparse, and/or nonnegative (e.g., spectral components). Furthermore, additional constraints can be used to reflect physical properties and/or diversities of spatial distributions, spectral and temporal patterns. We proceed to show how constrained matrix factorizations or 2-way CA models can be extended to multilinear models using tensor decompositions, such as the Canonical Polyadic (CP) and the Tucker decompositions, as illustrated in Figures 3.1, 3.2 and 3.3.
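The Eckart–Young optimality discussed above can be checked numerically: the Frobenius error of the rank-R truncation equals the energy of the discarded singular values, and no other rank-R factorization does better (a small sketch with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-R truncation X_R = U_R S_R V_R^T (Eq. (3.2) with the sum cut at R)
Rtrunc = 3
X_R = (U[:, :Rtrunc] * s[:Rtrunc]) @ Vt[:Rtrunc, :]

# The LS-optimal error equals the energy of the discarded singular values
err = np.linalg.norm(X - X_R, 'fro')
assert np.isclose(err, np.sqrt(np.sum(s[Rtrunc:] ** 2)))

# No rank-R competitor can beat the truncated SVD (spot check, random A B^T)
for _ in range(100):
    A = rng.standard_normal((8, Rtrunc))
    B = rng.standard_normal((Rtrunc, 6))
    assert np.linalg.norm(X - A @ B, 'fro') >= err
```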
The CP decomposition (also called CANDECOMP, PARAFAC, or the Canonical Polyadic decomposition) decomposes an Nth-order tensor, X ∈ R^{I_1 × I_2 × ··· × I_N}, into a linear combination of terms, b^(1)_r ∘ b^(2)_r ∘ ··· ∘ b^(N)_r, which are rank-1 tensors, and is given by [29, 95, 96]

X ≅ Σ_{r=1}^{R} λ_r b^(1)_r ∘ b^(2)_r ∘ ··· ∘ b^(N)_r
  = Λ ×_1 B^(1) ×_2 B^(2) ··· ×_N B^(N)
  = ⟦Λ; B^(1), B^(2), . . . , B^(N)⟧, (3.4)

where λ_r are the non-zero entries of the diagonal core tensor Λ ∈ R^{R × R × ··· × R} and B^(n) = [b^(n)_1, b^(n)_2, . . . , b^(n)_R] ∈ R^{I_n × R} are the factor matrices (see Figure 3.1 and Figure 3.2).

Via the Khatri–Rao products (see Table 2.1), the CP decomposition can be equivalently expressed in a matrix/vector form as

X_(n) ≅ B^(n) Λ (B^(N) ⊙ ··· ⊙ B^(n+1) ⊙ B^(n-1) ⊙ ··· ⊙ B^(1))^T (3.5)
      = B^(n) Λ (B^(1) ⊙_L ··· ⊙_L B^(n-1) ⊙_L B^(n+1) ⊙_L ··· ⊙_L B^(N))^T

and

vec(X) ≅ [B^(N) ⊙ B^(N-1) ⊙ ··· ⊙ B^(1)] λ (3.6)
       ≅ [B^(1) ⊙_L B^(2) ⊙_L ··· ⊙_L B^(N)] λ,

where λ = [λ_1, λ_2, . . . , λ_R]^T and Λ = diag(λ_1, . . . , λ_R) is a diagonal matrix. The rank of a tensor X is defined as the smallest R for which the CP decomposition in (3.4) holds exactly.

Algorithms to compute the CP decomposition.
In real-world applications, the signals of interest are corrupted by noise, so that the CP decomposition is rarely exact and has to be estimated by minimizing a suitable cost function. Such cost functions are typically of the Least-Squares (LS) type, in the form of the Frobenius norm
$$J(\mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)}) = \| \mathbf{X} - \llbracket \boldsymbol{\Lambda};\ \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)} \rrbracket \|_F^2, \quad (3.7)$$
or Least Absolute Error (LAE) criteria [217]
$$J(\mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)}) = \| \mathbf{X} - \llbracket \boldsymbol{\Lambda};\ \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)} \rrbracket \|_1. \quad (3.8)$$
The Alternating Least Squares (ALS) based algorithms minimize the cost function iteratively by optimizing each component (factor matrix, $\mathbf{B}^{(n)}$) individually, while keeping the other factor matrices fixed [95, 119].

(a) Standard block diagram for the CP decomposition of a 3rd-order tensor
(b) The CP decomposition for a 4th-order tensor in the tensor network notation

Figure 3.1: Representations of the CP decomposition. The objective of the CP decomposition is to estimate the factor matrices $\mathbf{B}^{(n)} \in \mathbb{R}^{I_n \times R}$ and the scaling coefficients $\{\lambda_1, \lambda_2, \ldots, \lambda_R\}$. (a) The CP decomposition of a 3rd-order tensor in the form $\mathbf{X} \cong \boldsymbol{\Lambda} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C} = \sum_{r=1}^{R} \lambda_r\, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r = \mathbf{G}_c \times_1 \mathbf{A} \times_2 \mathbf{B}$, with $\mathbf{G}_c = \boldsymbol{\Lambda} \times_3 \mathbf{C}$. (b) The CP decomposition for a 4th-order tensor in the form $\mathbf{X} \cong \boldsymbol{\Lambda} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \times_3 \mathbf{B}^{(3)} \times_4 \mathbf{B}^{(4)} = \sum_{r=1}^{R} \lambda_r\, \mathbf{b}_r^{(1)} \circ \mathbf{b}_r^{(2)} \circ \mathbf{b}_r^{(3)} \circ \mathbf{b}_r^{(4)}$.

Figure 3.2: Analogy between a low-rank matrix factorization, $\mathbf{X} \cong \mathbf{A} \boldsymbol{\Lambda} \mathbf{B}^{\mathsf{T}} = \sum_{r=1}^{R} \lambda_r\, \mathbf{a}_r \circ \mathbf{b}_r$ (top), and a simple low-rank tensor factorization (CP decomposition), $\mathbf{X} \cong \boldsymbol{\Lambda} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C} = \sum_{r=1}^{R} \lambda_r\, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$ (bottom).

To illustrate the ALS principle, assume that the diagonal matrix $\boldsymbol{\Lambda}$ has been absorbed into one of the factor matrices; then, by taking advantage of the Khatri–Rao structure in Eq. (3.5), the factor matrices, $\mathbf{B}^{(n)}$, can be updated sequentially as
$$\mathbf{B}^{(n)} \leftarrow \mathbf{X}_{(n)} \left( \bigodot_{k \neq n} \mathbf{B}^{(k)} \right) \left( \circledast_{k \neq n} \left( \mathbf{B}^{(k)\,\mathsf{T}} \mathbf{B}^{(k)} \right) \right)^{\dagger}. \quad (3.9)$$
The main challenge (or bottleneck) in implementing ALS and Gradient Descent (GD) techniques for the CP decomposition therefore lies in multiplying a matricized tensor by a Khatri–Rao product of factor matrices [35, 171], and in the computation of the pseudo-inverse of $(R \times R)$ matrices (for the basic ALS, see Algorithm 1).

The ALS approach is attractive for its simplicity, and often provides satisfactory performance for well-defined problems with high SNRs and well-separated, non-collinear components. For ill-conditioned problems, advanced algorithms are required, which typically exploit the rank-1 structure of the terms within the CP decomposition to perform efficient computation and storage of the Jacobian and Hessian of the cost function [172, 176, 193].
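To make the update (3.9) and Algorithm 1 concrete, the following minimal NumPy sketch implements basic CP-ALS for a 3rd-order tensor. This is our own illustrative code, not an implementation from the monograph: the function names, iteration count and convergence handling are simplifying assumptions.

```python
import numpy as np

def unfold(X, n):
    # Mode-n matricization X_(n): rows index mode n, remaining modes are flattened.
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def khatri_rao(A, B):
    # Column-wise Kronecker product of A (I x R) and B (J x R), giving (IJ x R).
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(X, R, n_iter=300, seed=0):
    """Basic ALS for a 3rd-order CP decomposition, cf. Eq. (3.9)."""
    rng = np.random.default_rng(seed)
    B = [rng.standard_normal((size, R)) for size in X.shape]
    for _ in range(n_iter):
        for n in range(3):
            a, b = [B[m] for m in range(3) if m != n]
            # B(n) <- X_(n) (A ⊙ B) (A^T A ⊛ B^T B)^†  (⊛ = Hadamard product)
            B[n] = unfold(X, n) @ khatri_rao(a, b) @ np.linalg.pinv((a.T @ a) * (b.T @ b))
    # Normalize the columns and collect the norms in the scaling vector lambda.
    lam = np.ones(R)
    for n in range(3):
        norms = np.linalg.norm(B[n], axis=0)
        B[n] = B[n] / norms
        lam *= norms
    return B, lam

# Example: recover an exact rank-2 tensor.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 2))
Bf = rng.standard_normal((5, 2))
C = rng.standard_normal((6, 2))
X = np.einsum('ir,jr,kr->ijk', A, Bf, C)
B, lam = cp_als(X, R=2)
Xhat = np.einsum('r,ir,jr,kr->ijk', lam, B[0], B[1], B[2])
rel_err = np.linalg.norm(X - Xhat) / np.linalg.norm(X)
```

Note that the Khatri–Rao identity $(\mathbf{A} \odot \mathbf{B})^{\mathsf{T}}(\mathbf{A} \odot \mathbf{B}) = \mathbf{A}^{\mathsf{T}}\mathbf{A} \circledast \mathbf{B}^{\mathsf{T}}\mathbf{B}$ is what reduces the pseudo-inverse in (3.9) to a small $R \times R$ matrix.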
The implementation of a parallel ALS algorithm over distributed memory for very large-scale tensors was proposed in [35, 108].

Multiple random projections, tensor sketching and GigaTensor.
Most of the existing algorithms for the computation of the CP decomposition are based on the ALS or GD approaches; however, these can be too computationally expensive for huge tensors. Indeed, algorithms for tensor decompositions have generally not yet reached the level of maturity and efficiency of low-rank matrix factorization (LRMF) methods. In order to employ efficient LRMF algorithms on tensors, we need to either: (i) reshape the tensor at hand into a set of matrices using traditional matricizations, (ii) employ reduced randomized unfolding matrices, or (iii) perform suitable random multiple projections of a data tensor onto lower-dimensional subspaces. The principles of approaches (i) and (ii) are self-evident, while approach (iii) employs a multilinear product of an $N$th-order tensor and $(N-2)$ random vectors, which are either chosen uniformly from a unit sphere or assumed to be i.i.d. Gaussian vectors [126].

Algorithm 1: Basic ALS for the CP decomposition of a 3rd-order tensor
  Input: Data tensor $\mathbf{X} \in \mathbb{R}^{I \times J \times K}$ and rank $R$
  Output: Factor matrices $\mathbf{A} \in \mathbb{R}^{I \times R}$, $\mathbf{B} \in \mathbb{R}^{J \times R}$, $\mathbf{C} \in \mathbb{R}^{K \times R}$, and scaling vector $\boldsymbol{\lambda} \in \mathbb{R}^{R}$
  Initialize $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$
  while not converged or iteration limit is not reached do
    $\mathbf{A} \leftarrow \mathbf{X}_{(1)} (\mathbf{C} \odot \mathbf{B}) (\mathbf{C}^{\mathsf{T}}\mathbf{C} \circledast \mathbf{B}^{\mathsf{T}}\mathbf{B})^{\dagger}$
    Normalize the column vectors of $\mathbf{A}$ to unit length (divide each column by its norm)
    $\mathbf{B} \leftarrow \mathbf{X}_{(2)} (\mathbf{C} \odot \mathbf{A}) (\mathbf{C}^{\mathsf{T}}\mathbf{C} \circledast \mathbf{A}^{\mathsf{T}}\mathbf{A})^{\dagger}$
    Normalize the column vectors of $\mathbf{B}$ to unit length
    $\mathbf{C} \leftarrow \mathbf{X}_{(3)} (\mathbf{B} \odot \mathbf{A}) (\mathbf{B}^{\mathsf{T}}\mathbf{B} \circledast \mathbf{A}^{\mathsf{T}}\mathbf{A})^{\dagger}$
    Normalize the column vectors of $\mathbf{C}$ to unit length, and store the norms in the vector $\boldsymbol{\lambda}$
  end while
  return $\mathbf{A}$, $\mathbf{B}$, $\mathbf{C}$ and $\boldsymbol{\lambda}$

For example, for a 3rd-order tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, we can use the set of random projections, $\bar{\mathbf{X}}_3 = \mathbf{X} \bar{\times}_3 \boldsymbol{\omega}_3 \in \mathbb{R}^{I_1 \times I_2}$, $\bar{\mathbf{X}}_2 = \mathbf{X} \bar{\times}_2 \boldsymbol{\omega}_2 \in \mathbb{R}^{I_1 \times I_3}$ and $\bar{\mathbf{X}}_1 = \mathbf{X} \bar{\times}_1 \boldsymbol{\omega}_1 \in \mathbb{R}^{I_2 \times I_3}$, where the vectors $\boldsymbol{\omega}_n \in \mathbb{R}^{I_n}$, $n =$
$1, 2, 3$, are suitably chosen random vectors. Note that random projections in such a case are non-typical: instead of using projections for dimensionality reduction, they are used to reduce a tensor of any order to matrices, and consequently to transform the CP decomposition problem into a constrained matrix factorization problem, which can be solved via simultaneous (joint) matrix diagonalization [31, 56]. It was shown that even a small number of random projections, such as $\mathcal{O}(\log R)$, is sufficient to preserve the spectral information in a tensor. This mitigates the problem of the dependence on the eigen-gap that plagued earlier tensor-to-matrix reductions. Although uniform random sampling may experience problems for tensors with spiky elements, it often outperforms the standard CP-ALS decomposition algorithms.

Alternative algorithms for the CP decomposition of huge-scale tensors include tensor sketching, a random mapping technique which exploits kernel methods and regression [168, 223], and the class of distributed algorithms such as DFacTo [35] and GigaTensor, which is based on the Hadoop/MapReduce paradigm [106].

Constraints.
Under rather mild conditions, the CP decomposition is generally unique by itself [125, 188]. It does not require additional constraints on the factor matrices to achieve uniqueness, which makes it a powerful and useful tool for tensor factorization. Of course, if the components in one or more modes are known to possess some properties, e.g., they are known to be nonnegative, orthogonal, statistically independent or sparse, such prior knowledge may be incorporated into the algorithms which compute the CPD, while at the same time relaxing the uniqueness conditions. More importantly, such constraints may enhance the accuracy and stability of CP decomposition algorithms and also facilitate better physical interpretability of the extracted components [65, 117, 134, 187, 195, 234].
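As a simple illustration of how such prior knowledge can be injected, one crude (but common) way to impose nonnegativity is to project each unconstrained ALS update onto the nonnegative orthant. The sketch below is our own illustrative code, not an algorithm from this monograph; a proper treatment would solve a nonnegative least-squares subproblem instead of clipping.

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product of A (J x R) and B (K x R), giving (JK x R).
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def nonneg_als_step(X1, B, C):
    """One nonnegativity-constrained update of the mode-1 factor A.

    X1 is the mode-1 unfolding of a 3rd-order tensor, with columns ordered
    so that X1 = A (B ⊙ C)^T.
    """
    # Unconstrained ALS update, cf. Eq. (3.9).
    A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
    # Project onto the nonnegative orthant (a crude surrogate for a
    # proper nonnegative least-squares solve, e.g., HALS).
    return np.maximum(A, 0.0)

# Example: with exactly nonnegative data, the unconstrained LS solution is
# already nonnegative, so the projected update recovers A.
rng = np.random.default_rng(0)
A_true = rng.random((4, 3))
B_true = rng.random((5, 3))
C_true = rng.random((6, 3))
X1 = A_true @ khatri_rao(B_true, C_true).T
A_rec = nonneg_als_step(X1, B_true, C_true)
```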
Applications.
The CP decomposition has already been established as an advanced tool for blind signal separation in vastly diverse branches of signal processing and machine learning [1, 3, 119, 147, 189, 207, 223]. It is also routinely used in exploratory data analysis, where the rank-1 terms capture essential properties of dynamically complex datasets. In wireless communication systems, signals transmitted by different users correspond to rank-1 terms in the case of line-of-sight propagation, and therefore admit analysis in the CP format. Another potential application is in harmonic retrieval and direction-of-arrival problems, where real or complex exponentials have rank-1 structures, for which the use of the CP decomposition is quite natural [185, 186, 194].

(Footnote: In linear algebra, the eigen-gap of a linear operator is the difference between two successive eigenvalues, where the eigenvalues are sorted in ascending order.)

3.3 The Tucker Tensor Format

Compared to the CP decomposition, the Tucker decomposition provides a more general factorization of an $N$th-order tensor into a relatively small core tensor and factor matrices, and can be expressed as follows:
$$\mathbf{X} \cong \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} g_{r_1 r_2 \cdots r_N} \left( \mathbf{b}_{r_1}^{(1)} \circ \mathbf{b}_{r_2}^{(2)} \circ \cdots \circ \mathbf{b}_{r_N}^{(N)} \right) = \mathbf{G} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)} = \llbracket \mathbf{G};\ \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)} \rrbracket, \quad (3.10)$$
where $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is the given data tensor, $\mathbf{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ is the core tensor, and $\mathbf{B}^{(n)} = [\mathbf{b}_1^{(n)}, \mathbf{b}_2^{(n)}, \ldots, \mathbf{b}_{R_n}^{(n)}] \in \mathbb{R}^{I_n \times R_n}$ are the mode-$n$ factor (component) matrices, $n =$
$1, 2, \ldots, N$ (see Figure 3.3). The core tensor (typically with $R_n \ll I_n$) models a potentially complex pattern of mutual interaction between the vectors in different modes. The model in (3.10) is often referred to as the Tucker-$N$ model.

The CP and Tucker decompositions have a long history. For recent surveys and more detailed information, we refer to [42, 46, 87, 119, 189].

Using the properties of the Kronecker tensor product, the Tucker-$N$ decomposition in (3.10) can be expressed in an equivalent matrix and vector form as
$$\mathbf{X}_{(n)} \cong \mathbf{B}^{(n)} \mathbf{G}_{(n)} \left( \mathbf{B}^{(1)} \otimes_L \cdots \otimes_L \mathbf{B}^{(n-1)} \otimes_L \mathbf{B}^{(n+1)} \otimes_L \cdots \otimes_L \mathbf{B}^{(N)} \right)^{\mathsf{T}} = \mathbf{B}^{(n)} \mathbf{G}_{(n)} \left( \mathbf{B}^{(N)} \otimes \cdots \otimes \mathbf{B}^{(n+1)} \otimes \mathbf{B}^{(n-1)} \otimes \cdots \otimes \mathbf{B}^{(1)} \right)^{\mathsf{T}}, \quad (3.11)$$
$$\mathbf{X}_{\langle n \rangle} \cong \left( \mathbf{B}^{(1)} \otimes_L \cdots \otimes_L \mathbf{B}^{(n)} \right) \mathbf{G}_{\langle n \rangle} \left( \mathbf{B}^{(n+1)} \otimes_L \cdots \otimes_L \mathbf{B}^{(N)} \right)^{\mathsf{T}} = \left( \mathbf{B}^{(n)} \otimes \cdots \otimes \mathbf{B}^{(1)} \right) \mathbf{G}_{\langle n \rangle} \left( \mathbf{B}^{(N)} \otimes \mathbf{B}^{(N-1)} \otimes \cdots \otimes \mathbf{B}^{(n+1)} \right)^{\mathsf{T}}, \quad (3.12)$$
$$\mathrm{vec}(\mathbf{X}) \cong \left[ \mathbf{B}^{(1)} \otimes_L \mathbf{B}^{(2)} \otimes_L \cdots \otimes_L \mathbf{B}^{(N)} \right] \mathrm{vec}(\mathbf{G}) = \left[ \mathbf{B}^{(N)} \otimes \mathbf{B}^{(N-1)} \otimes \cdots \otimes \mathbf{B}^{(1)} \right] \mathrm{vec}(\mathbf{G}), \quad (3.13)$$
where the multi-indices are ordered in a reverse lexicographic order (little-endian).

Table 3.1 and Table 3.2 summarize the fundamental mathematical representations of the CP and Tucker decompositions for 3rd-order and $N$th-order tensors, respectively.

The Tucker decomposition is said to be in an independent Tucker format if all the factor matrices, $\mathbf{B}^{(n)}$, are of full column rank, while a Tucker format

(a) Standard block diagrams of the Tucker (top) and Tucker-CP (bottom) decompositions for a 3rd-order tensor
(b) The TN diagram for the Tucker and Tucker/CP decompositions of a 4th-order tensor

Figure 3.3:
Illustration of the Tucker and Tucker-CP decompositions, where the objective is to compute the factor matrices, $\mathbf{B}^{(n)}$, and the core tensor, $\mathbf{G}$. (a) Tucker decomposition of a 3rd-order tensor, $\mathbf{X} \cong \mathbf{G} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \times_3 \mathbf{B}^{(3)}$. In some applications, the core tensor can be further approximately factorized using the CP decomposition as $\mathbf{G} \cong \sum_{r=1}^{R} \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$ (bottom diagram), or alternatively using TT/HT decompositions. (b) Graphical representation of the Tucker-CP decomposition for a 4th-order tensor, $\mathbf{X} \cong \mathbf{G} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \times_3 \mathbf{B}^{(3)} \times_4 \mathbf{B}^{(4)} = \llbracket \mathbf{G};\ \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \mathbf{B}^{(3)}, \mathbf{B}^{(4)} \rrbracket \cong (\boldsymbol{\Lambda} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \times_3 \mathbf{A}^{(3)} \times_4 \mathbf{A}^{(4)}) \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \times_3 \mathbf{B}^{(3)} \times_4 \mathbf{B}^{(4)} = \llbracket \boldsymbol{\Lambda};\ \mathbf{B}^{(1)}\mathbf{A}^{(1)}, \mathbf{B}^{(2)}\mathbf{A}^{(2)}, \mathbf{B}^{(3)}\mathbf{A}^{(3)}, \mathbf{B}^{(4)}\mathbf{A}^{(4)} \rrbracket$.
is termed an orthonormal format if, in addition, all the factor matrices, $\mathbf{B}^{(n)} = \mathbf{U}^{(n)}$, are orthogonal. The standard Tucker model often has orthogonal factor matrices.

Multilinear rank.
The multilinear rank of an $N$th-order tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ corresponds to the $N$-tuple $(R_1, R_2, \ldots, R_N)$ consisting of the dimensions of the different subspaces. If the Tucker decomposition (3.10) holds exactly, it is mathematically defined as
$$\mathrm{rank}_{\mathrm{ML}}(\mathbf{X}) = \{ \mathrm{rank}(\mathbf{X}_{(1)}), \mathrm{rank}(\mathbf{X}_{(2)}), \ldots, \mathrm{rank}(\mathbf{X}_{(N)}) \}, \quad (3.14)$$
with $\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$ for $n = 1, 2, \ldots, N$. The rank of a Tucker decomposition can be determined using information criteria [227], or through the number of dominant eigenvalues when an approximation accuracy of the decomposition or a noise level is given (see Algorithm 8).

The independent Tucker format has the following important properties if the equality in (3.10) holds exactly (see, e.g., [105] and references therein):

1. The tensor (CP) rank of any tensor, $\mathbf{X} = \llbracket \mathbf{G};\ \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)} \rrbracket \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, and the rank of its core tensor, $\mathbf{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$, are exactly the same, i.e.,
$$\mathrm{rank}_{\mathrm{CP}}(\mathbf{X}) = \mathrm{rank}_{\mathrm{CP}}(\mathbf{G}). \quad (3.15)$$

2. If a tensor, $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N} = \llbracket \mathbf{G};\ \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)} \rrbracket$, admits an independent Tucker format with multilinear rank $\{R_1, R_2, \ldots, R_N\}$, then
$$R_n \leq \prod_{p \neq n} R_p \quad \forall n. \quad (3.16)$$
Moreover, without loss of generality, under the assumption $R_1 \leq R_2 \leq \cdots \leq R_N$, we have
$$R_N \leq \mathrm{rank}_{\mathrm{CP}}(\mathbf{X}) \leq R_1 R_2 \cdots R_{N-1}. \quad (3.17)$$

3. If a data tensor is symmetric and admits an independent Tucker format, $\mathbf{X} = \llbracket \mathbf{G};\ \mathbf{B}, \mathbf{B}, \ldots, \mathbf{B} \rrbracket \in \mathbb{R}^{I \times I \times \cdots \times I}$, then its core tensor, $\mathbf{G} \in \mathbb{R}^{R \times R \times \cdots \times R}$, is also symmetric, with $\mathrm{rank}_{\mathrm{CP}}(\mathbf{X}) = \mathrm{rank}_{\mathrm{CP}}(\mathbf{G})$.

4. For the orthonormal Tucker format, that is, $\mathbf{X} = \llbracket \mathbf{G};\ \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)} \rrbracket \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, with $\mathbf{U}^{(n)\,\mathsf{T}} \mathbf{U}^{(n)} = \mathbf{I}$, $\forall n$, the Frobenius norms and the Schatten $p$-norms of the data tensor, $\mathbf{X}$, and its core tensor, $\mathbf{G}$, are equal, i.e.,
$$\| \mathbf{X} \|_F = \| \mathbf{G} \|_F, \qquad \| \mathbf{X} \|_{S_p} = \| \mathbf{G} \|_{S_p}, \quad 1 \leq p < \infty.$$
Thus, the computation of the Frobenius norms can be performed with an $\mathcal{O}(R^N)$ complexity ($R = \max\{R_1, \ldots, R_N\}$), instead of the usual $\mathcal{O}(I^N)$ complexity (typically $R \ll I$).

Note that the CP decomposition can be considered as a special case of the Tucker decomposition, whereby the cube core tensor has nonzero elements only on the main diagonal (see Figure 3.1).
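These properties are easy to check numerically. The short NumPy sketch below (our own illustrative code, not from the monograph) builds a tensor in the orthonormal Tucker format via mode-$n$ products, as in (3.10), and verifies the norm equality $\|\mathbf{X}\|_F = \|\mathbf{G}\|_F$ from property 4:

```python
import numpy as np

def mode_n_product(X, U, n):
    # Mode-n product X ×_n U: contracts mode n of X with the columns of U.
    return np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)

rng = np.random.default_rng(0)
dims, ranks = (6, 7, 8), (2, 3, 4)

# Random core tensor and orthonormal factor matrices (via reduced QR).
G = rng.standard_normal(ranks)
U = [np.linalg.qr(rng.standard_normal((dims[n], ranks[n])))[0] for n in range(3)]

# X = G ×_1 U(1) ×_2 U(2) ×_3 U(3), cf. Eq. (3.10).
X = G
for n in range(3):
    X = mode_n_product(X, U[n], n)

# Property 4: for orthonormal factor matrices, ||X||_F equals ||G||_F,
# so the norm can be computed from the small core at O(R^N) cost.
norm_X = np.linalg.norm(X)
norm_G = np.linalg.norm(G)
```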
In contrast to the CP decomposition, the unconstrained Tucker decomposition is not unique. However, constraints imposed on all factor matrices and/or the core tensor can reduce the indeterminacies inherent in CA to only column-wise permutation and scaling, thus yielding a unique core tensor and factor matrices [235].

The Tucker-$N$ model in which $(N-K)$ factor matrices are identity matrices is called the Tucker-$(K, N)$ model. In the simplest scenario, for a 3rd-order tensor $\mathbf{X} \in \mathbb{R}^{I \times J \times K}$, the Tucker-(2,3) or simply Tucker-2 model can be described as
$$\mathbf{X} \cong \mathbf{G} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{I} = \mathbf{G} \times_1 \mathbf{A} \times_2 \mathbf{B}, \quad (3.18)$$
or in an equivalent matrix form, $\mathbf{X}_k = \mathbf{A} \mathbf{G}_k \mathbf{B}^{\mathsf{T}}$, ($k =$
$1, 2, \ldots, K$), (3.19)

where $\mathbf{X}_k = \mathbf{X}(:,:,k) \in \mathbb{R}^{I \times J}$ and $\mathbf{G}_k = \mathbf{G}(:,:,k) \in \mathbb{R}^{R_1 \times R_2}$ are, respectively, the frontal slices of the data tensor $\mathbf{X}$ and of the core tensor $\mathbf{G} \in \mathbb{R}^{R_1 \times R_2 \times K}$, and $\mathbf{A} \in \mathbb{R}^{I \times R_1}$, $\mathbf{B} \in \mathbb{R}^{J \times R_2}$.

(Footnote: The Schatten $p$-norm of an $N$th-order tensor $\mathbf{X}$ is defined as the average of the Schatten norms of its mode-$n$ unfoldings, i.e., $\|\mathbf{X}\|_{S_p} = \frac{1}{N} \sum_{n=1}^{N} \|\mathbf{X}_{(n)}\|_{S_p}$, with $\|\mathbf{X}\|_{S_p} = (\sum_r \sigma_r^p)^{1/p}$, where $\sigma_r$ is the $r$th singular value of the matrix $\mathbf{X}$. For $p = 1$, the Schatten norm of a matrix $\mathbf{X}$ is called the nuclear norm or the trace norm, while for $p = 0$ it yields the rank of $\mathbf{X}$, which can be replaced by the surrogate function $\log \det(\mathbf{X}\mathbf{X}^{\mathsf{T}} + \varepsilon \mathbf{I})$, $\varepsilon > 0$.)

(Footnote: For a 3rd-order tensor, the Tucker-2 model is equivalent to the TT model. The case where the factor matrices and the core tensor are nonnegative is referred to as the NTD-2, Nonnegative Tucker-2 decomposition.)

Table 3.1: CP and Tucker representations of a 3rd-order tensor $\mathbf{X} \in \mathbb{R}^{I \times J \times K}$, where $\boldsymbol{\lambda} = [\lambda_1, \lambda_2, \ldots, \lambda_R]^{\mathsf{T}}$ and $\boldsymbol{\Lambda} = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_R\}$.

- Scalar representation: CP $x_{ijk} = \sum_{r=1}^{R} \lambda_r a_{ir} b_{jr} c_{kr}$; Tucker $x_{ijk} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} g_{r_1 r_2 r_3} a_{i r_1} b_{j r_2} c_{k r_3}$.
- Tensor representation, outer products: CP $\mathbf{X} = \sum_{r=1}^{R} \lambda_r\, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r$; Tucker $\mathbf{X} = \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} g_{r_1 r_2 r_3}\, \mathbf{a}_{r_1} \circ \mathbf{b}_{r_2} \circ \mathbf{c}_{r_3}$.
- Tensor representation, multilinear products: CP $\mathbf{X} = \boldsymbol{\Lambda} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C} = \llbracket \boldsymbol{\Lambda};\ \mathbf{A}, \mathbf{B}, \mathbf{C} \rrbracket$; Tucker $\mathbf{X} = \mathbf{G} \times_1 \mathbf{A} \times_2 \mathbf{B} \times_3 \mathbf{C} = \llbracket \mathbf{G};\ \mathbf{A}, \mathbf{B}, \mathbf{C} \rrbracket$.
- Matrix representations: CP $\mathbf{X}_{(1)} = \mathbf{A}\boldsymbol{\Lambda}(\mathbf{B} \odot_L \mathbf{C})^{\mathsf{T}}$, $\mathbf{X}_{(2)} = \mathbf{B}\boldsymbol{\Lambda}(\mathbf{A} \odot_L \mathbf{C})^{\mathsf{T}}$, $\mathbf{X}_{(3)} = \mathbf{C}\boldsymbol{\Lambda}(\mathbf{A} \odot_L \mathbf{B})^{\mathsf{T}}$; Tucker $\mathbf{X}_{(1)} = \mathbf{A}\mathbf{G}_{(1)}(\mathbf{B} \otimes_L \mathbf{C})^{\mathsf{T}}$, $\mathbf{X}_{(2)} = \mathbf{B}\mathbf{G}_{(2)}(\mathbf{A} \otimes_L \mathbf{C})^{\mathsf{T}}$, $\mathbf{X}_{(3)} = \mathbf{C}\mathbf{G}_{(3)}(\mathbf{A} \otimes_L \mathbf{B})^{\mathsf{T}}$.
- Vector representation: CP $\mathrm{vec}(\mathbf{X}) = (\mathbf{A} \odot_L \mathbf{B} \odot_L \mathbf{C})\,\boldsymbol{\lambda}$; Tucker $\mathrm{vec}(\mathbf{X}) = (\mathbf{A} \otimes_L \mathbf{B} \otimes_L \mathbf{C})\,\mathrm{vec}(\mathbf{G})$.
- Matrix slices $\mathbf{X}_k = \mathbf{X}(:,:,k)$: CP $\mathbf{X}_k = \mathbf{A}\,\mathrm{diag}(\lambda_1 c_{k,1}, \ldots, \lambda_R c_{k,R})\,\mathbf{B}^{\mathsf{T}}$; Tucker $\mathbf{X}_k = \mathbf{A} \left( \sum_{r_3=1}^{R_3} c_{k r_3}\, \mathbf{G}(:,:,r_3) \right) \mathbf{B}^{\mathsf{T}}$.

Table 3.2: CP and Tucker representations of an $N$th-order tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$.

- Scalar product: CP $x_{i_1,\ldots,i_N} = \sum_{r=1}^{R} \lambda_r\, b^{(1)}_{i_1,r} \cdots b^{(N)}_{i_N,r}$; Tucker $x_{i_1,\ldots,i_N} = \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} g_{r_1,\ldots,r_N}\, b^{(1)}_{i_1,r_1} \cdots b^{(N)}_{i_N,r_N}$.
- Outer product: CP $\mathbf{X} = \sum_{r=1}^{R} \lambda_r\, \mathbf{b}^{(1)}_r \circ \cdots \circ \mathbf{b}^{(N)}_r$; Tucker $\mathbf{X} = \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} g_{r_1,\ldots,r_N}\, \mathbf{b}^{(1)}_{r_1} \circ \cdots \circ \mathbf{b}^{(N)}_{r_N}$.
- Multilinear product: CP $\mathbf{X} = \boldsymbol{\Lambda} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)} = \llbracket \boldsymbol{\Lambda};\ \mathbf{B}^{(1)}, \ldots, \mathbf{B}^{(N)} \rrbracket$; Tucker $\mathbf{X} = \mathbf{G} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)} = \llbracket \mathbf{G};\ \mathbf{B}^{(1)}, \ldots, \mathbf{B}^{(N)} \rrbracket$.
- Vectorization: CP $\mathrm{vec}(\mathbf{X}) = \left( \bigodot_{n=N}^{1} \mathbf{B}^{(n)} \right) \boldsymbol{\lambda}$; Tucker $\mathrm{vec}(\mathbf{X}) = \left( \bigotimes_{n=N}^{1} \mathbf{B}^{(n)} \right) \mathrm{vec}(\mathbf{G})$.
- Matricization: CP $\mathbf{X}_{(n)} = \mathbf{B}^{(n)} \boldsymbol{\Lambda} \left( \bigodot_{m=N,\, m \neq n}^{1} \mathbf{B}^{(m)} \right)^{\mathsf{T}}$ and $\mathbf{X}_{\langle n \rangle} = \left( \bigodot_{m=n}^{1} \mathbf{B}^{(m)} \right) \boldsymbol{\Lambda} \left( \bigodot_{m=N}^{n+1} \mathbf{B}^{(m)} \right)^{\mathsf{T}}$; Tucker $\mathbf{X}_{(n)} = \mathbf{B}^{(n)} \mathbf{G}_{(n)} \left( \bigotimes_{m=N,\, m \neq n}^{1} \mathbf{B}^{(m)} \right)^{\mathsf{T}}$ and $\mathbf{X}_{\langle n \rangle} = \left( \bigotimes_{m=n}^{1} \mathbf{B}^{(m)} \right) \mathbf{G}_{\langle n \rangle} \left( \bigotimes_{m=N}^{n+1} \mathbf{B}^{(m)} \right)^{\mathsf{T}}$.
- Slice representation, with $k = \overline{i_3 i_4 \cdots i_N}$: CP $\mathbf{X}(:,:,k) = \mathbf{B}^{(1)} \widetilde{\mathbf{D}}_k \mathbf{B}^{(2)\,\mathsf{T}}$, where $\widetilde{\mathbf{D}}_k = \mathrm{diag}(\tilde{d}_{11}, \ldots, \tilde{d}_{RR}) \in \mathbb{R}^{R \times R}$ with entries $\tilde{d}_{rr} = \lambda_r\, b^{(3)}_{i_3,r} \cdots b^{(N)}_{i_N,r}$; Tucker $\mathbf{X}(:,:,k) = \mathbf{B}^{(1)} \widetilde{\mathbf{G}}_k \mathbf{B}^{(2)\,\mathsf{T}}$, where $\widetilde{\mathbf{G}}_k = \sum_{r_3} \cdots \sum_{r_N} b^{(3)}_{i_3,r_3} \cdots b^{(N)}_{i_N,r_N}\, \mathbf{G}_{:,:,r_3,\ldots,r_N}$ is a sum of frontal slices.

Generalized Tucker format and its links to the TTNS model. For high-order tensors, the Tucker-$N$ format can be naturally generalized by replacing the factor matrices, $\mathbf{B}^{(n)} \in \mathbb{R}^{I_n \times R_n}$, by higher-order tensors $\mathbf{B}^{(n)} \in \mathbb{R}^{I_{n,1} \times I_{n,2} \times \cdots \times I_{n,K_n} \times R_n}$, to give
$$\mathbf{X} \cong \llbracket \mathbf{G};\ \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N)} \rrbracket, \quad (3.20)$$
where the entries of the data tensor are computed as
$$\mathbf{X}(i_1, \ldots, i_N) = \sum_{r_1=1}^{R_1} \cdots \sum_{r_N=1}^{R_N} \mathbf{G}(r_1, \ldots, r_N)\, \mathbf{B}^{(1)}(i_1, r_1) \cdots \mathbf{B}^{(N)}(i_N, r_N),$$
with the multi-indices $i_n = \overline{i_{n,1} i_{n,2} \ldots i_{n,K_n}}$ [128].

Furthermore, the nested (hierarchical) form of such a generalized Tucker decomposition leads to the Tree Tensor Networks State (TTNS) model [149] (see Figure 2.15 and Figure 2.18), with a possibly varying order of cores, which can be formulated as
$$\mathbf{X} = \llbracket \mathbf{G}_1;\ \mathbf{B}^{(1)}, \mathbf{B}^{(2)}, \ldots, \mathbf{B}^{(N_1)} \rrbracket, \quad \mathbf{G}_1 = \llbracket \mathbf{G}_2;\ \mathbf{A}^{(1,2)}, \mathbf{A}^{(2,2)}, \ldots, \mathbf{A}^{(N_2,2)} \rrbracket, \quad \ldots, \quad \mathbf{G}_P = \llbracket \mathbf{G}_{P+1};\ \mathbf{A}^{(1,P+1)}, \mathbf{A}^{(2,P+1)}, \ldots, \mathbf{A}^{(N_{P+1},P+1)} \rrbracket, \quad (3.21)$$
where $\mathbf{G}_p \in \mathbb{R}^{R^{(p)}_1 \times R^{(p)}_2 \times \cdots \times R^{(p)}_{N_p}}$ and $\mathbf{A}^{(n_p,p)} \in \mathbb{R}^{R^{(p-1)}_{l_{n_p}} \times \cdots \times R^{(p-1)}_{m_{n_p}} \times R^{(p)}_{n_p}}$, with $p =$
$2, \ldots, P+1$. Some of the tensors $\mathbf{A}^{(n,1)}$ and/or $\mathbf{A}^{(n_p,p)}$ can be identity tensors, which yields an irregular structure, possibly with a varying order of tensors. This follows from the simple observation that a mode-$n$ product may have, e.g., the following form
$$\mathbf{X} \times_n \mathbf{B}^{(n)} = \llbracket \mathbf{X};\ \mathbf{I}_{I_1}, \ldots, \mathbf{I}_{I_{n-1}}, \mathbf{B}^{(n)}, \mathbf{I}_{I_{n+1}}, \ldots, \mathbf{I}_{I_N} \rrbracket.$$
The efficiency of this representation relies strongly on an appropriate choice of the tree structure. It is usually assumed that the tree structure of TTNS is given or assumed a priori, and recent efforts aim to find an optimal tree structure from a subset of tensor entries, without any a priori knowledge of the tree structure. This is achieved using so-called rank-adaptive cross-approximation techniques, which approximate a tensor by hierarchical tensor formats [9, 10].

Operations in the Tucker format. If large-scale data tensors admit an exact or approximate representation in their Tucker formats, then most mathematical operations can be performed more efficiently using the much smaller core tensors and factor matrices so obtained. Consider the $N$th-order tensors $\mathbf{X}$ and $\mathbf{Y}$ in the Tucker format, given by
$$\mathbf{X} = \llbracket \mathbf{G}_X;\ \mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(N)} \rrbracket \quad \text{and} \quad \mathbf{Y} = \llbracket \mathbf{G}_Y;\ \mathbf{Y}^{(1)}, \ldots, \mathbf{Y}^{(N)} \rrbracket, \quad (3.22)$$
with the respective multilinear ranks $\{R_1, R_2, \ldots, R_N\}$ and $\{Q_1, Q_2, \ldots, Q_N\}$. Then the following mathematical operations can be performed directly in the Tucker format, which admits a significant reduction in computational costs [128, 175, 177]:

• The addition of two Tucker tensors of the same order and sizes
$$\mathbf{X} + \mathbf{Y} = \llbracket \mathbf{G}_X \oplus \mathbf{G}_Y;\ [\mathbf{X}^{(1)}, \mathbf{Y}^{(1)}], \ldots, [\mathbf{X}^{(N)}, \mathbf{Y}^{(N)}] \rrbracket, \quad (3.23)$$
where $\oplus$ denotes the direct sum of two tensors, and $[\mathbf{X}^{(n)}, \mathbf{Y}^{(n)}] \in \mathbb{R}^{I_n \times (R_n + Q_n)}$, with $\mathbf{X}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ and $\mathbf{Y}^{(n)} \in \mathbb{R}^{I_n \times Q_n}$, $\forall n$.

• The Kronecker product of two Tucker tensors of arbitrary orders and sizes
$$\mathbf{X} \otimes \mathbf{Y} = \llbracket \mathbf{G}_X \otimes \mathbf{G}_Y;\ \mathbf{X}^{(1)} \otimes \mathbf{Y}^{(1)}, \ldots$$
$, \mathbf{X}^{(N)} \otimes \mathbf{Y}^{(N)} \rrbracket. \quad (3.24)$

• The Hadamard or element-wise product of two Tucker tensors of the same order and the same sizes
$$\mathbf{X} \circledast \mathbf{Y} = \llbracket \mathbf{G}_X \otimes \mathbf{G}_Y;\ \mathbf{X}^{(1)} \odot_1 \mathbf{Y}^{(1)}, \ldots, \mathbf{X}^{(N)} \odot_1 \mathbf{Y}^{(N)} \rrbracket, \quad (3.25)$$
where $\odot_1$ denotes the mode-1 Khatri–Rao product, also called the transposed Khatri–Rao product or row-wise Kronecker product.

• The inner product of two Tucker tensors of the same order and sizes can be reduced to the inner product of two smaller tensors by exploiting the Kronecker product structure in the vectorized form, as
$$\langle \mathbf{X}, \mathbf{Y} \rangle = \mathrm{vec}(\mathbf{X})^{\mathsf{T}}\, \mathrm{vec}(\mathbf{Y}) = \mathrm{vec}(\mathbf{G}_X)^{\mathsf{T}} \left( \bigotimes_{n=N}^{1} \mathbf{X}^{(n)\,\mathsf{T}} \right) \left( \bigotimes_{n=N}^{1} \mathbf{Y}^{(n)} \right) \mathrm{vec}(\mathbf{G}_Y) = \mathrm{vec}(\mathbf{G}_X)^{\mathsf{T}} \left( \bigotimes_{n=N}^{1} \mathbf{X}^{(n)\,\mathsf{T}} \mathbf{Y}^{(n)} \right) \mathrm{vec}(\mathbf{G}_Y) = \left\langle \llbracket \mathbf{G}_X;\ \mathbf{X}^{(1)\,\mathsf{T}}\mathbf{Y}^{(1)}, \ldots, \mathbf{X}^{(N)\,\mathsf{T}}\mathbf{Y}^{(N)} \rrbracket,\ \mathbf{G}_Y \right\rangle. \quad (3.26)$$

• The Frobenius norm can be computed in a particularly simple way if the factor matrices are orthogonal, since then all the products $\mathbf{X}^{(n)\,\mathsf{T}} \mathbf{X}^{(n)}$, $\forall n$, become identity matrices, so that
$$\| \mathbf{X} \|_F^2 = \langle \mathbf{X}, \mathbf{X} \rangle = \mathrm{vec}\left( \llbracket \mathbf{G}_X;\ \mathbf{X}^{(1)\,\mathsf{T}}\mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(N)\,\mathsf{T}}\mathbf{X}^{(N)} \rrbracket \right)^{\mathsf{T}} \mathrm{vec}(\mathbf{G}_X) = \mathrm{vec}(\mathbf{G}_X)^{\mathsf{T}}\, \mathrm{vec}(\mathbf{G}_X) = \| \mathbf{G}_X \|_F^2. \quad (3.27)$$

• The $N$-D discrete convolution of tensors $\mathbf{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and $\mathbf{Y} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$ in their Tucker formats can be expressed as
$$\mathbf{Z} = \mathbf{X} * \mathbf{Y} = \llbracket \mathbf{G}_Z;\ \mathbf{Z}^{(1)}, \ldots, \mathbf{Z}^{(N)} \rrbracket \in \mathbb{R}^{(I_1 + J_1 - 1) \times \cdots \times (I_N + J_N - 1)}. \quad (3.28)$$
If $\{R_1, R_2, \ldots, R_N\}$ is the multilinear rank of $\mathbf{X}$ and $\{Q_1, Q_2, \ldots, Q_N\}$ the multilinear rank of $\mathbf{Y}$, then the core tensor is $\mathbf{G}_Z = \mathbf{G}_X \otimes \mathbf{G}_Y \in \mathbb{R}^{R_1 Q_1 \times \cdots \times R_N Q_N}$, and the factor matrices
$$\mathbf{Z}^{(n)} \in \mathbb{R}^{(I_n + J_n - 1) \times R_n Q_n} \quad (3.29)$$
have as columns the convolutions $\mathbf{Z}^{(n)}(:, s_n) = \mathbf{X}^{(n)}(:, r_n) * \mathbf{Y}^{(n)}(:, q_n) \in \mathbb{R}^{(I_n + J_n - 1)}$ for $s_n = \overline{r_n q_n} = 1, 2, \ldots, R_n Q_n$.

• The fast discrete Fourier transform of a tensor in the Tucker format (cf. the MATLAB functions fftn(X) and fft(X(n),[],1))
$$\mathcal{F}(\mathbf{X}) = \llbracket \mathbf{G}_X;\ \mathcal{F}(\mathbf{X}^{(1)}), \ldots, \mathcal{F}(\mathbf{X}^{(N)}) \rrbracket. \quad (3.30)$$
Note that if the data tensor admits a low multilinear rank approximation, then performing the FFT on the factor matrices of relatively small size, $\mathbf{X}^{(n)} \in \mathbb{R}^{I_n \times R_n}$, instead of on a large-scale data tensor, decreases the computational complexity considerably. This approach is referred to as the super fast Fourier transform in the Tucker format.

(Footnote: Similar operations can be performed in the CP format, assuming that the core tensors are diagonal.)

The MultiLinear Singular Value Decomposition (MLSVD), also called the Higher-Order SVD (HOSVD), can be considered as a special form of the constrained Tucker decomposition [59, 60], in which all the factor matrices, $\mathbf{B}^{(n)} = \mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times I_n}$, are orthogonal and the core tensor, $\mathbf{G} = \mathbf{S} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, is all-orthogonal (see Figure 3.4). The orthogonality properties of the core tensor are defined through the following conditions:

1. All-orthogonality.
The slices in each mode are mutually orthogonal, e.g., for a 3rd-order tensor and its lateral slices,
$$\langle \mathbf{S}_{:,k,:},\ \mathbf{S}_{:,l,:} \rangle = 0, \quad \text{for} \quad k \neq l. \quad (3.31)$$

2. Pseudo-diagonality. The Frobenius norms of the slices in each mode are decreasing with the increase in the running index, e.g., for a 3rd-order tensor and its lateral slices,
$$\| \mathbf{S}_{:,k,:} \|_F \geq \| \mathbf{S}_{:,l,:} \|_F, \quad k \leq l. \quad (3.32)$$
These norms play a role similar to that of the singular values in the standard matrix SVD.

In practice, the orthogonal matrices $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times R_n}$, with $R_n \leq I_n$, can be computed by applying either the randomized or the standard truncated SVD to the unfolded mode-$n$ matrices, $\mathbf{X}_{(n)} \cong \mathbf{U}^{(n)} \mathbf{S}_n \mathbf{V}^{(n)\,\mathsf{T}} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$. After obtaining the orthogonal matrices $\mathbf{U}^{(n)}$ of left singular vectors of each $\mathbf{X}_{(n)}$, the core tensor $\mathbf{G} = \mathbf{S}$ can be computed as
$$\mathbf{S} = \mathbf{X} \times_1 \mathbf{U}^{(1)\,\mathsf{T}} \times_2 \mathbf{U}^{(2)\,\mathsf{T}} \cdots \times_N \mathbf{U}^{(N)\,\mathsf{T}}, \quad (3.33)$$
so that
$$\mathbf{X} = \mathbf{S} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \cdots \times_N \mathbf{U}^{(N)}. \quad (3.34)$$
Figure 3.4:
Graphical illustration of the truncated SVD and HOSVD. (a) The exact and truncated standard matrix SVD, $\mathbf{X} \cong \mathbf{U}\mathbf{S}\mathbf{V}^{\mathsf{T}}$. (b) The truncated (approximative) HOSVD for a 3rd-order tensor, calculated as $\mathbf{X} \cong \mathbf{S}_t \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \times_3 \mathbf{U}^{(3)}$. (c) Tensor network notation for the HOSVD of a 4th-order tensor, $\mathbf{X} \cong \mathbf{S}_t \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \times_3 \mathbf{U}^{(3)} \times_4 \mathbf{U}^{(4)}$. All the factor matrices, $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times R_n}$, are orthogonal, and the core tensor, $\mathbf{S}_t = \mathbf{G} \in \mathbb{R}^{R_1 \times \cdots \times R_N}$, is all-orthogonal, i.e., its slices are mutually orthogonal.

Analogous to the standard truncated SVD, a large-scale data tensor, $\mathbf{X}$, can be approximated by discarding the multilinear singular vectors and the slices of the core tensor which correspond to small multilinear singular values. Figure 3.4 and Algorithm 2 outline the truncated HOSVD, for which any optimized matrix SVD procedure can be applied.

For large-scale tensors, the unfolding matrices, $\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times I_{\bar{n}}}$ ($I_{\bar{n}} = I_1 \cdots I_{n-1} I_{n+1} \cdots I_N$), may become prohibitively large (with $I_{\bar{n}} \gg I_n$), easily exceeding the memory of standard computers. Using a direct and simple divide-and-conquer approach, the truncated SVD of an unfolding matrix, $\mathbf{X}_{(n)} = \mathbf{U}^{(n)} \mathbf{S}_n \mathbf{V}^{(n)\,\mathsf{T}}$, can be partitioned into $Q$ slices, as $\mathbf{X}_{(n)} = [\mathbf{X}_{1,n}, \mathbf{X}_{2,n}, \ldots, \mathbf{X}_{Q,n}] = \mathbf{U}^{(n)} \mathbf{S}_n [\mathbf{V}^{\mathsf{T}}_{1,n}, \mathbf{V}^{\mathsf{T}}_{2,n}, \ldots, \mathbf{V}^{\mathsf{T}}_{Q,n}]$. Next, the orthogonal matrices $\mathbf{U}^{(n)}$ and the diagonal matrices $\mathbf{S}_n$ can be obtained from the eigenvalue decompositions $\mathbf{X}_{(n)} \mathbf{X}_{(n)}^{\mathsf{T}} = \mathbf{U}^{(n)} \mathbf{S}_n^2 \mathbf{U}^{(n)\,\mathsf{T}} = \sum_q \mathbf{X}_{q,n} \mathbf{X}_{q,n}^{\mathsf{T}} \in \mathbb{R}^{I_n \times I_n}$, allowing the terms $\mathbf{V}_{q,n} = \mathbf{X}_{q,n}^{\mathsf{T}} \mathbf{U}^{(n)} \mathbf{S}_n^{-1}$ to be computed separately. This enables us to optimize the size of the $q$th slice, $\mathbf{X}_{q,n} \in \mathbb{R}^{I_n \times (I_{\bar{n}}/Q)}$, so as to match the available computer memory. Such a simple approach to computing the matrices $\mathbf{U}^{(n)}$ and/or $\mathbf{V}^{(n)}$ does not require loading entire unfolding matrices into computer memory at once; instead, the access to the datasets is sequential.
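The truncated HOSVD of (3.33) and (3.34) can be sketched in a few lines of NumPy. This is our own illustrative code, not the monograph's implementation: the truncation ranks are fixed here, rather than chosen from an accuracy threshold $\varepsilon$ as in Algorithm 2.

```python
import numpy as np

def unfold(X, n):
    # Mode-n matricization X_(n).
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def mode_n_product(X, U, n):
    # Mode-n product X ×_n U.
    return np.moveaxis(np.tensordot(U, X, axes=(1, n)), 0, n)

def truncated_hosvd(X, ranks):
    """Truncated HOSVD: X ≈ S ×_1 U(1) ... ×_N U(N), cf. Eqs. (3.33)-(3.34)."""
    U = []
    for n, Rn in enumerate(ranks):
        # Leading R_n left singular vectors of the mode-n unfolding.
        Un, _, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
        U.append(Un[:, :Rn])
    # Core tensor S = X ×_1 U(1)^T ... ×_N U(N)^T  (Eq. (3.33)).
    S = X
    for n in range(len(ranks)):
        S = mode_n_product(S, U[n].T, n)
    return S, U

# A tensor of exact multilinear rank (2, 3, 4) is recovered exactly.
rng = np.random.default_rng(0)
G = rng.standard_normal((2, 3, 4))
B = [rng.standard_normal((8, 2)), rng.standard_normal((9, 3)), rng.standard_normal((10, 4))]
X = G
for n in range(3):
    X = mode_n_product(X, B[n], n)

S, U = truncated_hosvd(X, (2, 3, 4))
Xhat = S
for n in range(3):
    Xhat = mode_n_product(Xhat, U[n], n)
rel_err = np.linalg.norm(X - Xhat) / np.linalg.norm(X)
```

For noisy data the same routine yields the quasi-best approximation discussed around Eq. (3.35), with the error controlled by the discarded singular values of the unfoldings.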
(Footnote: For current standard sizes of computer memory, the dimension $I_n$ is typically less than 10,000, while there is no limit on the dimension $I_{\bar{n}} = \prod_{k \neq n} I_k$.)

For very large-scale and low-rank matrices, instead of the standard truncated SVD approach, we can alternatively apply the randomized SVD algorithm, which reduces the original data matrix $\mathbf{X}$ to a relatively small matrix by random sketching, i.e., through multiplication with a random sampling matrix $\boldsymbol{\Omega}$ (see Algorithm 3). Note that we explicitly allow the rank of the data matrix $\mathbf{X}$ to be overestimated (that is, $\tilde{R} = R + P$, where $R$ is the true but unknown rank and $P$ is an over-sampling parameter), because it is easier to obtain a more accurate approximation of this form. The performance of the randomized SVD can be further improved by integrating multiple random sketches, that is, by multiplying the data matrix $\mathbf{X}$ by a set of random matrices $\boldsymbol{\Omega}_p$ for $p =$
$1, 2, \ldots, P$, and by integrating the leading low-dimensional subspaces through a Monte Carlo integration method [33]. Using special random sampling matrices, for instance a sub-sampled random Fourier transform, a substantial gain in execution time can be achieved, together with an asymptotic complexity of $\mathcal{O}(IJ \log(R))$. Unfortunately, this approach is not accurate enough for matrices whose singular values decay slowly [93].

Algorithm 2: Sequentially Truncated HOSVD [212]
  Input: $N$th-order tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and approximation accuracy $\varepsilon$
  Output: HOSVD in the Tucker format, $\hat{\mathbf{X}} = \llbracket \mathbf{S};\ \mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(N)} \rrbracket$, such that $\| \mathbf{X} - \hat{\mathbf{X}} \|_F \leq \varepsilon$
  $\mathbf{S} \leftarrow \mathbf{X}$
  for $n = 1$ to $N$ do
    $[\mathbf{U}^{(n)}, \mathbf{S}, \mathbf{V}] = \mathrm{truncated\_svd}\left(\mathbf{S}_{(n)}, \frac{\varepsilon}{\sqrt{N}}\right)$
    $\mathbf{S} \leftarrow \mathbf{V}\mathbf{S}$
  end for
  $\mathbf{S} \leftarrow \mathrm{reshape}(\mathbf{S}, [R_1, \ldots, R_N])$
  return Core tensor $\mathbf{S}$ and orthogonal factor matrices $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times R_n}$

Algorithm 3: Randomized SVD (rSVD) for large-scale and low-rank matrices with a single sketch [93]
  Input: A matrix $\mathbf{X} \in \mathbb{R}^{I \times J}$, desired or estimated rank $R$, oversampling parameter $P$ or overestimated rank $\tilde{R} = R + P$, and the exponent $q$ of the power method
  Output: An approximate rank-$\tilde{R}$ SVD, $\mathbf{X} \cong \mathbf{U}\mathbf{S}\mathbf{V}^{\mathsf{T}}$, i.e., orthogonal matrices $\mathbf{U} \in \mathbb{R}^{I \times \tilde{R}}$, $\mathbf{V} \in \mathbb{R}^{J \times \tilde{R}}$ and a diagonal matrix of singular values $\mathbf{S} \in \mathbb{R}^{\tilde{R} \times \tilde{R}}$
  Draw a random Gaussian matrix $\boldsymbol{\Omega} \in \mathbb{R}^{J \times \tilde{R}}$
  Form the sample matrix $\mathbf{Y} = (\mathbf{X}\mathbf{X}^{\mathsf{T}})^q\, \mathbf{X} \boldsymbol{\Omega} \in \mathbb{R}^{I \times \tilde{R}}$
  Compute the QR decomposition $\mathbf{Y} = \mathbf{Q}\mathbf{R}$
  Form the matrix $\mathbf{A} = \mathbf{Q}^{\mathsf{T}} \mathbf{X} \in \mathbb{R}^{\tilde{R} \times J}$
  Compute the SVD of the small matrix $\mathbf{A}$ as $\mathbf{A} = \hat{\mathbf{U}} \mathbf{S} \mathbf{V}^{\mathsf{T}}$
  Form the matrix $\mathbf{U} = \mathbf{Q} \hat{\mathbf{U}}$

The truncated HOSVD can be optimized and implemented in several alternative ways. For example, if $R_n \ll I_n$, the truncated tensor $\mathbf{Z} \leftarrow \mathbf{X} \times_1 \mathbf{U}^{(1)\,\mathsf{T}}$ yields a smaller unfolding matrix $\mathbf{Z}_{(2)} \in \mathbb{R}^{I_2 \times R_1 I_3 \cdots I_N}$, so that the multiplication $\mathbf{Z}_{(2)} \mathbf{Z}_{(2)}^{\mathsf{T}}$ can be computed faster in the subsequent iterations [5, 212]. Furthermore, since the unfolding matrices $\mathbf{X}_{(n)}^{\mathsf{T}}$ are typically very "tall and skinny", a huge-scale truncated SVD and other constrained low-rank matrix factorizations can be computed efficiently based on the Hadoop/MapReduce paradigm [20, 48, 49].

Algorithm 4: Higher-Order Orthogonal Iteration (HOOI) [5, 60]
  Input: $N$th-order tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ (usually in Tucker/HOSVD format)
  Output:
Improved Tucker approximation computed via the ALS approach, with orthogonal factor matrices $\mathbf{U}^{(n)}$
  Initialize via the standard HOSVD (see Algorithm 2)
  repeat
    for $n = 1$ to $N$ do
      $\mathbf{Z} \leftarrow \mathbf{X} \times_{p \neq n} \{\mathbf{U}^{(p)\,\mathsf{T}}\}$
      $\mathbf{C} \leftarrow \mathbf{Z}_{(n)} \mathbf{Z}_{(n)}^{\mathsf{T}} \in \mathbb{R}^{I_n \times I_n}$
      $\mathbf{U}^{(n)} \leftarrow$ leading $R_n$ eigenvectors of $\mathbf{C}$
    end for
    $\mathbf{G} \leftarrow \mathbf{Z} \times_N \mathbf{U}^{(N)\,\mathsf{T}}$
  until the cost function $\left( \| \mathbf{X} \|_F^2 - \| \mathbf{G} \|_F^2 \right)$ ceases to decrease
  return $\llbracket \mathbf{G};\ \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)} \rrbracket$

Low multilinear rank approximation is always well-posed; however, in contrast to the standard truncated SVD for matrices, the truncated HOSVD does not yield the best multilinear rank approximation, but satisfies the quasi-best approximation property [59]
$$\| \mathbf{X} - \llbracket \mathbf{S};\ \mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(N)} \rrbracket \| \leq \sqrt{N}\, \| \mathbf{X} - \mathbf{X}_{\mathrm{Best}} \|, \quad (3.35)$$
where $\mathbf{X}_{\mathrm{Best}}$ is the best multilinear rank approximation of $\mathbf{X}$, for a specific tensor norm $\| \cdot \|$.

When it comes to the problem of finding the best approximation, the ALS-type algorithm called the Higher-Order Orthogonal Iteration (HOOI) exhibits both the advantages and the drawbacks of ALS algorithms for the CP decomposition. For the HOOI algorithms, see Algorithm 4 and Algorithm 5. For more sophisticated algorithms for Tucker decompositions with orthogonality and nonnegativity constraints, suitable for large-scale data tensors, see [49, 104, 169, 236].

When a data tensor $\mathbf{X}$ is very large and cannot be stored in computer memory, another challenge is to compute the core tensor $\mathbf{G} = \mathbf{S}$ directly, using the formula (3.33). Such a computation can be performed sequentially by fast matrix-by-matrix multiplications, as illustrated in Figure 3.5(a) and (b). Efficient and parallel (state-of-the-art) algorithms for the multiplication of such very large-scale matrices are proposed in [11, 131].

Table 3.3: Constrained Tucker and CP decompositions as optimization problems, where $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ denotes a noisy data tensor, while $\mathbf{Y} = \mathbf{G} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)}$ is the general constrained Tucker model, with the latent factor matrices $\mathbf{B}^{(n)} \in \mathbb{R}^{I_n \times R_n}$ and the core tensor $\mathbf{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$.
In the special case of a CP decomposition, the core tensor is diagonal, $\mathbf{G} = \boldsymbol{\Lambda} \in \mathbb{R}^{R \times \cdots \times R}$, so that $\mathbf{Y} = \sum_{r=1}^{R} \lambda_r \left( \mathbf{b}_r^{(1)} \circ \mathbf{b}_r^{(2)} \circ \cdots \circ \mathbf{b}_r^{(N)} \right)$.

- Multilinear (sparse) PCA (MPCA). Cost function: $\max_{\mathbf{u}_r^{(n)}} \left( \mathbf{X} \bar{\times}_1 \mathbf{u}_r^{(1)} \bar{\times}_2 \mathbf{u}_r^{(2)} \cdots \bar{\times}_N \mathbf{u}_r^{(N)} + \gamma \sum_{n=1}^{N} \| \mathbf{u}_r^{(n)} \|_1 \right)$. Constraints: $\mathbf{u}_r^{(n)\,\mathsf{T}} \mathbf{u}_r^{(n)} = 1$, $\forall (n, r)$; $\mathbf{u}_r^{(n)\,\mathsf{T}} \mathbf{u}_q^{(n)} = 0$ for $r \neq q$.
- HOSVD/HOOI. Cost function: $\min_{\mathbf{U}^{(n)}} \| \mathbf{X} - \mathbf{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \cdots \times_N \mathbf{U}^{(N)} \|_F^2$. Constraints: $\mathbf{U}^{(n)\,\mathsf{T}} \mathbf{U}^{(n)} = \mathbf{I}_{R_n}$, $\forall n$.
- Multilinear ICA. Cost function: $\min_{\mathbf{B}^{(n)}} \| \mathbf{X} - \mathbf{G} \times_1 \mathbf{B}^{(1)} \times_2 \mathbf{B}^{(2)} \cdots \times_N \mathbf{B}^{(N)} \|_F^2$. Constraints: the vectors of $\mathbf{B}^{(n)}$ are statistically as independent as possible.
- Nonnegative CP/Tucker decomposition (NTF/NTD) [43]. Cost function: $\min_{\mathbf{B}^{(n)}} \| \mathbf{X} - \mathbf{G} \times_1 \mathbf{B}^{(1)} \cdots \times_N \mathbf{B}^{(N)} \|_F^2 + \gamma \sum_{n=1}^{N} \sum_{r_n=1}^{R_n} \| \mathbf{b}_{r_n}^{(n)} \|_1$. Constraints: the entries of $\mathbf{G}$ and of $\mathbf{B}^{(n)}$, $\forall n$, are nonnegative.
- Sparse CP/Tucker decomposition. Cost function: $\min_{\mathbf{B}^{(n)}} \| \mathbf{X} - \mathbf{G} \times_1 \mathbf{B}^{(1)} \cdots \times_N \mathbf{B}^{(N)} \|_F^2 + \gamma \sum_{n=1}^{N} \sum_{r_n=1}^{R_n} \| \mathbf{b}_{r_n}^{(n)} \|_1$. Constraints: sparsity constraints imposed on $\mathbf{B}^{(n)}$.
- Smooth CP/Tucker decomposition (SmCP/SmTD) [228]. Cost function: $\min_{\mathbf{B}^{(n)}} \| \mathbf{X} - \boldsymbol{\Lambda} \times_1 \mathbf{B}^{(1)} \cdots \times_N \mathbf{B}^{(N)} \|_F^2 + \gamma \sum_{n=1}^{N} \sum_{r=1}^{R} \| \mathbf{L} \mathbf{b}_r^{(n)} \|_2$. Constraints: smoothness imposed on the vectors $\mathbf{b}_r^{(n)}$ of $\mathbf{B}^{(n)} \in \mathbb{R}^{I_n \times R}$, $\forall n$, via a difference operator $\mathbf{L}$.

Algorithm 5: HOOI using randomization for large-scale data [238]
  Input: $N$th-order tensor $\mathbf{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and multilinear rank $\{R_1, R_2, \ldots, R_N\}$
  Output:
Approximate representation of a tensor in the Tucker format, with orthogonal factor matrices U^(n) ∈ R^{I_n×R_n}
1. Initialize the factor matrices U^(n) as random Gaussian matrices
2. Repeat steps 3–6 only two times:
3. for n = 1 to N do
4.   Z = X ×_{p≠n} {U^(p)T}
5.   Compute Z̃_(n) = Z_(n) Ω^(n) ∈ R^{I_n×R_n}, where Ω^(n) ∈ R^{∏_{p≠n} R_p × R_n} is a random matrix drawn from a Gaussian distribution
6.   Compute U^(n) as an orthonormal basis of Z̃_(n), e.g., by using the QR decomposition
7. end for
8. Construct the core tensor as G = X ×_1 U^(1)T ×_2 U^(2)T ⋯ ×_N U^(N)T
9. return X ≈ ⟦G; U^(1), U^(2), …, U^(N)⟧

Algorithm 6: Tucker decomposition with constrained factor matrices via 2-way CA/LRMF
Input: Nth-order tensor X ∈ R^{I_1×I_2×⋯×I_N}, multilinear rank {R_1, …, R_N} and desired constraints imposed on the factor matrices B^(n) ∈ R^{I_n×R_n}
Output:
Tucker decomposition with constrained factor matrices B^(n), using LRMF and a simple unfolding approach
1. Initialize randomly or via the standard HOSVD (see Algorithm 2)
2. for n = 1 to N do
3.   Compute a specific LRMF or 2-way CA (e.g., RPCA, ICA, NMF) of the unfolding: X_(n)^T ≈ A^(n) B^(n)T or X_(n) ≈ B^(n) A^(n)T
4. end for
5. Compute the core tensor G = X ×_1 [B^(1)]^† ×_2 [B^(2)]^† ⋯ ×_N [B^(N)]^†
6. return Constrained Tucker decomposition X ≈ ⟦G; B^(1), …, B^(N)⟧

We have shown that for very large-scale problems, it is useful to divide a data tensor X into small blocks X_[k_1,k_2,…,k_N]. In a similar way, we can partition the orthogonal factor matrices U^(n)T into the corresponding blocks of matrices U^(n)T_[k_n,p_n], as illustrated in Figure 3.5(c) for 3rd-order tensors [200, 221]. For example, the blocks within the resulting tensor G^(n) can be computed sequentially or in parallel, as follows:

G^(n)_[k_1,k_2,…,q_n,…,k_N] = ∑_{k_n=1}^{K_n} X_[k_1,k_2,…,k_n,…,k_N] ×_n U^(n)T_[k_n,q_n].   (3.36)

Applications.
We have shown that the Tucker/HOSVD decomposition may be considered as a multilinear extension of PCA [124]; it therefore generalizes signal subspace techniques and finds application in areas including multilinear blind source separation, classification, feature extraction, and subspace-based harmonic retrieval [90, 137, 173, 213]. In this way, a low multilinear rank approximation achieved through the Tucker decomposition may yield a higher Signal-to-Noise Ratio (SNR) than that of the original raw data tensor, which also makes the Tucker decomposition a natural tool for signal compression and enhancement.

It was recently shown that the HOSVD can also perform simultaneous subspace selection (data compression) and K-means clustering, both unsupervised learning tasks [99, 164]. This is important, as a combination of these methods can both identify and classify “relevant” data, and in this way not only reveal desired information but also simplify feature extraction.
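As a toy illustration of the truncated HOSVD that underlies these applications, the following numpy sketch (our own illustration, not the authors' code; the helper names `unfold` and `mode_mult` are assumptions) takes the leading left singular vectors of each mode-n unfolding as the orthogonal factors and verifies that a tensor of exact multilinear rank is reconstructed perfectly:

```python
import numpy as np

def unfold(X, n):
    # Mode-n unfolding: mode n becomes the rows, remaining modes are flattened.
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def mode_mult(X, M, n):
    # Mode-n product X x_n M, for a matrix M of shape (J, I_n).
    Y = np.tensordot(M, np.moveaxis(X, n, 0), axes=([1], [0]))
    return np.moveaxis(Y, 0, n)

def truncated_hosvd(X, ranks):
    # U^(n): leading R_n left singular vectors of the mode-n unfolding.
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
         for n, r in enumerate(ranks)]
    # Core tensor G = X x_1 U1^T x_2 U2^T ... x_N UN^T.
    G = X
    for n, Un in enumerate(U):
        G = mode_mult(G, Un.T, n)
    return G, U

# A random tensor with exact multilinear rank (2, 3, 2) is recovered exactly.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 2))          # core of the synthetic tensor
for n, In in enumerate((6, 7, 8)):
    X = mode_mult(X, rng.standard_normal((In, X.shape[n])), n)
G, U = truncated_hosvd(X, (2, 3, 2))
X_hat = G
for n, Un in enumerate(U):
    X_hat = mode_mult(X_hat, Un, n)
print(np.allclose(X, X_hat))  # exact low multilinear rank => perfect reconstruction
```

For noisy data the same code gives the quasi-best approximation of Eq. (3.35); HOOI (Algorithm 4) would then refine these factors iteratively.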
Anomaly detection using HOSVD.
Anomaly detection refers to the discrimination of some specific patterns, signals, outliers or features that do not conform to certain expected behaviors, trends or properties [32, 78]. While such analysis can be performed in different domains, it is most frequently based on spectral methods such as PCA, whereby high-dimensional data are projected onto a lower-dimensional subspace in which the anomalies may be identified more easily. The main assumption within such approaches is that the normal and abnormal patterns, which may be difficult to distinguish in the original space, appear significantly different in the projected subspace. When considering very large datasets, since the basic Tucker decomposition model generalizes PCA and SVD, it offers a natural framework for anomaly detection via the HOSVD, as illustrated in Figure 3.6. To handle the exceedingly large dimensionality, we may first compute tensor decompositions for sampled (pre-selected) small blocks of the original large-scale 3rd-order tensor, followed by the analysis of changes in specific factor matrices U^(n). A simpler form is straightforwardly obtained by fixing the core tensor and some factor matrices while monitoring the changes along one or more specific modes, as the block tensor moves from left to right, as shown in Figure 3.6.

Figure 3.5: Computation of a multilinear (Tucker) product for a large-scale HOSVD. (a) Standard sequential computing of multilinear products (TTM), G = S = (((X ×_1 U^(1)T) ×_2 U^(2)T) ×_3 U^(3)T). (b) Distributed implementation through fast matrix-by-matrix multiplications. (c) An alternative method for large-scale problems using the “divide and conquer” approach, whereby a data tensor, X, and factor matrices, U^(n)T, are partitioned into suitable small blocks: subtensors X_[k_1,k_2,k_3] and block matrices U^(1)T_[k_1,p_1]. The blocks of a tensor, Z = G^(1) = X ×_1 U^(1)T, are computed as Z_[q_1,k_2,k_3] = ∑_{k_1=1}^{K_1} X_[k_1,k_2,k_3] ×_1 U^(1)T_[k_1,q_1] (see Eq. (3.36) for the general case).

Figure 3.6: Conceptual model for performing the HOSVD for a very large-scale 3rd-order data tensor. This is achieved by dividing the tensor into blocks X_k ≈ G ×_1 U^(1) ×_2 U^(2)_k ×_3 U^(3), (k = 1, 2, …, K). It is assumed that the data tensor X ∈ R^{I_1×I_2×I_3} is sampled by sliding the block X_k from left to right (with an overlapping sliding window). The model can be used for anomaly detection by fixing the core tensor and some factor matrices while monitoring the changes along one or more specific modes (in our case mode two). Tensor decomposition is then first performed for a sampled (pre-selected) small block, followed by the analysis of changes in specific smaller-dimensional factor matrices U^(n).

The notion of sketches refers to replacing the original huge matrix or tensor by a new matrix or tensor of a significantly smaller size or higher compactness, which nevertheless approximates well the original matrix/tensor. Finding such sketches in an efficient way is important for the analysis of big data, as a computer processor (and memory) is often incapable of handling the whole dataset in a feasible amount of time. For these reasons, the computation is often spread among a set of processors, a setting in which standard “all-in-one” SVD algorithms are unfeasible.

Given a very large-scale tensor X, a useful approach is therefore to compute a sketch tensor Z, or a set of sketch tensors Z_n, of significantly smaller sizes than the original tensor. There exist several matrix and tensor sketching approaches: sparsification, random projections, fiber subset selections, iterative sketching techniques and distributed sketching. We next review the main sketching approaches which are promising for tensors.
1. Sparsification generates a sparser version of the tensor which, in general, can be stored more efficiently and admits faster multiplications by factor matrices. This is achieved by decreasing the number of non-zero entries and by quantizing or rounding up the remaining entries. A simple technique is element-wise sparsification, which zeroes out all sufficiently small elements (below some threshold) of a data tensor, keeps all sufficiently large elements, and randomly samples the remaining elements of the tensor with sample probabilities proportional to the squares of their magnitudes [152].
2. Random Projection based sketching randomly combines fibers of a data tensor in all or selected modes, and is related to the concept of a randomized subspace embedding, which is used to solve a variety of numerical linear algebra problems (see [208] and references therein).
3. Fiber subset selection, also called tensor cross-approximation (TCA), finds a small subset of fibers which approximates the entire data tensor. For the matrix case, this problem is known as the Column/Row Subset Selection or CUR problem, which has been thoroughly investigated and for which there exist several algorithms with almost matching lower bounds [64, 82, 140].
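A minimal numpy sketch of element-wise sparsification in the spirit of item 1 (our own illustration; the threshold `delta`, the sampling rule and the function name `sparsify` are illustrative choices, not a reference implementation of [152]):

```python
import numpy as np

def sparsify(X, delta, sample_frac=0.1, rng=None):
    """Element-wise sparsification sketch: keep entries with |x| >= delta,
    randomly keep small entries with probability proportional to x**2
    (rescaled by the inverse probability to keep the sketch unbiased)."""
    rng = rng or np.random.default_rng(0)
    Z = np.where(np.abs(X) >= delta, X, 0.0)      # keep all large entries
    small = np.abs(X) < delta
    # Sampling probabilities for the small entries, proportional to x^2.
    p = np.zeros_like(X)
    p[small] = X[small] ** 2 / (X[small] ** 2).sum()
    p = np.minimum(1.0, sample_frac * p * small.sum())
    keep = rng.random(X.shape) < p
    Z[keep & small] = X[keep & small] / p[keep & small]
    return Z

X = np.random.default_rng(1).standard_normal((20, 20, 20))
Z = sparsify(X, delta=1.0)
print((Z != 0).mean())  # fraction of retained entries, well below 1
```

The resulting sparse tensor can then be stored in a compressed format and multiplied by factor matrices at a cost proportional to its number of nonzeros.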
The random projection framework has been developed for computing structured low-rank approximations of a data tensor from (random) linear projections of much lower dimensions than the data tensor itself [28, 208]. Such techniques have many potential applications in large-scale numerical multilinear algebra and optimization problems.

Notice that for an Nth-order tensor X ∈ R^{I_1×I_2×⋯×I_N}, we can compute the following sketches

Z = X ×_1 Ω_1 ×_2 Ω_2 ⋯ ×_N Ω_N   (3.37)

and

Z_n = X ×_1 Ω_1 ⋯ ×_{n−1} Ω_{n−1} ×_{n+1} Ω_{n+1} ⋯ ×_N Ω_N,   (3.38)

for n = 1, 2, …, N, where Ω_n ∈ R^{R_n×I_n}, with R_n ≪ I_n, are statistically independent random matrices, usually called test (or sensing) matrices. A sketch can be implemented using test matrices drawn from various distributions. The choice of a distribution leads to some tradeoffs [208], especially regarding: (i) the costs of randomization, computation, and communication to generate the test matrices; (ii) the storage costs for the test matrices and the sketch; (iii) the arithmetic costs for sketching and updates; (iv) the numerical stability of reconstruction algorithms; and (v) the quality of a priori error bounds.

Figure 3.7: Illustration of tensor sketching using random projections of a data tensor. (a) Sketches of a 3rd-order tensor X ∈ R^{I_1×I_2×I_3} given by Z_1 = X ×_2 Ω_2 ×_3 Ω_3 ∈ R^{I_1×R_2×R_3}, Z_2 = X ×_1 Ω_1 ×_3 Ω_3 ∈ R^{R_1×I_2×R_3}, Z_3 = X ×_1 Ω_1 ×_2 Ω_2 ∈ R^{R_1×R_2×I_3}, and Z = X ×_1 Ω_1 ×_2 Ω_2 ×_3 Ω_3 ∈ R^{R_1×R_2×R_3}. (b) Sketches for an Nth-order tensor X ∈ R^{I_1×⋯×I_N}.

The most important distributions of random test matrices include:

• Gaussian random projections, which generate random matrices with the standard normal distribution. Such matrices usually provide excellent performance in practical scenarios and accurate a priori error bounds.

• Random matrices with orthonormal columns that span uniformly distributed random subspaces of dimensions R_n. Such matrices behave similarly to the Gaussian case, but usually exhibit even better numerical stability, especially when the R_n are large.

• Rademacher and super-sparse Rademacher random projections, which have independent Rademacher entries taking the values ±s, while the remaining entries are set to zero. In an extreme case of maximum sparsity, s = 1, and each column of a test matrix has exactly one nonzero entry.

• Subsampled randomized Fourier transforms, based on test matrices of the form

Ω_n = P_n F_n D_n,   (3.39)

where D_n are diagonal square matrices with independent Rademacher entries, F_n are discrete cosine transform (DCT) or discrete Fourier transform (DFT) matrices, and the entries of the matrix P_n are drawn at random from a uniform distribution.

Example.
The concept of tensor sketching via random projections is illustrated in Figure 3.7 for a 3rd-order tensor and for the general case of Nth-order tensors. For a 3rd-order tensor with volume (number of entries) I_1 I_2 I_3, we have four possible sketches which are subtensors of much smaller sizes, e.g., I_1 R_2 R_3, with R_n ≪ I_n, if the sketching is performed along mode-2 and mode-3, or R_1 R_2 R_3, if the sketching is performed along all three modes (Figure 3.7(a), bottom right). From these subtensors we can reconstruct any huge tensor if it has a low multilinear rank (lower than {R_1, R_2, …, R_N}). In a more general scenario, it can be shown [28] that an Nth-order data tensor X with a sufficiently low multilinear rank can be reconstructed perfectly from the sketch tensors Z_n, for n = 1, 2, …, N, as follows

X̂ = Z ×_1 B^(1) ×_2 B^(2) ⋯ ×_N B^(N),   (3.40)

where B^(n) = [Z_n]_(n) Z_(n)^† for n = 1, 2, …, N (for more detail see the next section).

Huge-scale matrices can be factorized using the Matrix Cross-Approximation (MCA) method, which is also known under the names of Pseudo-Skeleton or CUR matrix decomposition [16, 17, 84, 85, 116, 141, 142, 162]. The main idea behind the MCA is to provide a reduced dimensionality of data through a linear combination of only a few “meaningful” components, which are exact replicas of columns and rows of the original data matrix. Such an approach is based on the fundamental assumption that large datasets are highly redundant and can therefore be approximated by low-rank matrices, which significantly reduces computational complexity at the cost of a marginal loss of information.

The MCA method factorizes a data matrix X ∈ R^{I×J} as [84, 85] (see Figure 3.8)

X = CUR + E,   (3.41)

where C ∈ R^{I×C} is a matrix constructed from C suitably selected columns of the data matrix X, the matrix R ∈ R^{R×J} consists of R appropriately selected rows of X, and the matrix U ∈ R^{C×R} is calculated so as to minimize the norm of the error E ∈ R^{I×J}.

A simple modification of this formula, whereby the matrix U is absorbed into either C or R, yields the so-called CR matrix factorization or Column/Row Subset Selection:

X ≈ C R̃ = C̃ R,   (3.42)
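As a quick numerical check of the sketch-and-reconstruct scheme of Eqs. (3.37)–(3.40), the following numpy sketch (our own illustration; sizes and helper names are assumptions) draws Gaussian test matrices and recovers a tensor of exact multilinear rank from its sketches:

```python
import numpy as np

def tmult(X, M, n):   # mode-n tensor-matrix product X x_n M
    return np.moveaxis(np.tensordot(M, np.moveaxis(X, n, 0), axes=([1], [0])), 0, n)

def unfold(X, n):     # mode-n unfolding
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

rng = np.random.default_rng(0)
I, R = (8, 9, 10), (2, 3, 2)
# Build a tensor with exact multilinear rank R.
X = rng.standard_normal(R)
for n, (In, Rn) in enumerate(zip(I, R)):
    X = tmult(X, rng.standard_normal((In, Rn)), n)

# Sketches (3.37)-(3.38) with Gaussian test matrices Omega_n in R^{R_n x I_n}.
Om = [rng.standard_normal((Rn, In)) for In, Rn in zip(I, R)]
Z = X
for n, O in enumerate(Om):
    Z = tmult(Z, O, n)                      # Z = X x_1 Om_1 ... x_N Om_N
Zn = [X for _ in I]
for n in range(len(I)):
    for p, O in enumerate(Om):
        if p != n:
            Zn[n] = tmult(Zn[n], O, p)      # all modes projected except mode n

# Reconstruction (3.40): B^(n) = [Z_n]_(n) Z_(n)^+, then X_hat = Z x_n B^(n).
X_hat = Z
for n in range(len(I)):
    B = unfold(Zn[n], n) @ np.linalg.pinv(unfold(Z, n))
    X_hat = tmult(X_hat, B, n)
print(np.allclose(X, X_hat))  # exact recovery for exact multilinear rank
```

With noisy or only approximately low-rank data, the same construction yields an approximation rather than exact recovery, with quality governed by the tradeoffs (i)–(v) above.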
Figure 3.8: The principle of the matrix cross-approximation, which decomposes a huge matrix X into a product of three matrices, whereby only a small-size core matrix U needs to be computed.

Here, the bases can be either the columns, C, or the rows, R, while R̃ = UR and C̃ = CU. For dimensionality reduction, C ≪ J and R ≪ I, and the columns and rows of X should be chosen optimally, in the sense of providing a high “statistical leverage” and the best low-rank fit to the data matrix, while at the same time minimizing the cost function ‖E‖²_F. For a given set of columns, C, and rows, R, the optimal choice for the core matrix is U = C† X (R†)^T. This requires access to all the entries of X and is not practical or feasible for large-scale data. In such cases, a pragmatic choice for the core matrix would be U = W†, where the matrix W ∈ R^{R×C} is composed of the intersections of the selected rows and columns. It should be noted that for rank(X) ≤ min{C, R} the cross-approximation is exact. For the general case, it has been proven that when the intersection submatrix W is of maximum volume (the volume of a square submatrix W is defined as |det(W)|), the matrix cross-approximation is close to the optimal SVD solution. The problem of finding a submatrix with maximum volume has exponential complexity; however, suboptimal matrices can be found using fast greedy algorithms [4, 144, 179, 222].

The concept of MCA can be generalized to the tensor cross-approximation (TCA) (see Figure 3.9) through several approaches, including:

• Applying the MCA decomposition to a matricized version of the tensor data [142];

• Operating directly on fibers of a data tensor which admits a low-rank Tucker approximation, an approach termed the Fiber Sampling Tucker Decomposition (FSTD) [26–28].

Figure 3.9: The principle of the tensor cross-approximation (TCA) algorithm, illustrated for a large-scale 3rd-order tensor X ≈ U ×_1 C ×_2 R ×_3 T = ⟦U; C, R, T⟧, where U = W ×_1 W†_(1) ×_2 W†_(2) ×_3 W†_(3) = ⟦W; W†_(1), W†_(2), W†_(3)⟧ ∈ R^{P_2P_3×P_1P_3×P_1P_2} and W ∈ R^{P_1×P_2×P_3}. For simplicity of illustration, we assume that the selected fibers are permuted, so as to become clustered as subtensors, C ∈ R^{I_1×P_2×P_3}, R ∈ R^{P_1×I_2×P_3} and T ∈ R^{P_1×P_2×I_3}.

Real-life structured data often admit good low-multilinear-rank approximations, and the FSTD provides such a low-rank Tucker decomposition, which is practical as it is directly expressed in terms of a relatively small number of fibers of the data tensor. For example, for a 3rd-order tensor X ∈ R^{I_1×I_2×I_3}, for which an exact rank-(R_1, R_2, R_3) Tucker representation exists, the FSTD selects P_n ≥ R_n, n =
1, 2, 3, indices in each mode; this determines an intersection subtensor, W ∈ R^{P_1×P_2×P_3}, so that the following exact Tucker representation can be obtained (see Figure 3.10):

X = ⟦U; C, R, T⟧,   (3.43)

where the core tensor is computed as U = G = ⟦W; W†_(1), W†_(2), W†_(3)⟧, while the factor matrices, C ∈ R^{I_1×P_2P_3}, R ∈ R^{I_2×P_1P_3}, T ∈ R^{I_3×P_1P_2}, contain the fibers which are the respective subsets of the columns C, rows R and tubes T. An equivalent Tucker representation is then given by

X = ⟦W; C W†_(1), R W†_(2), T W†_(3)⟧.   (3.44)

Observe that for N =
2, the TCA model simplifies into the MCA for the matrix case, X = CUR, for which the core matrix is U = ⟦W; W†_(1), W†_(2)⟧ = W† W W† = W†.

Figure 3.10: The Tucker decomposition of a low multilinear rank 3rd-order tensor using the cross-approximation approach. (a) Standard block diagram. (b) Transformation from the TCA in the Tucker format, X ≈ U ×_1 C ×_2 R ×_3 T, into a standard Tucker representation, X ≈ W ×_1 B^(1) ×_2 B^(2) ×_3 B^(3) = ⟦W; C W†_(1), R W†_(2), T W†_(3)⟧, with a prescribed core tensor W.

For the general case of an Nth-order tensor, we can show [26] that a tensor X ∈ R^{I_1×I_2×⋯×I_N} with a low multilinear rank {R_1, R_2, …, R_N}, where R_n ≤ I_n, ∀n, can be fully reconstructed via the TCA FSTD, X = ⟦U; C^(1), C^(2), …, C^(N)⟧, using only N factor matrices C^(n) ∈ R^{I_n×P_n} (n =
1, 2, . . . , N ) , built up from the fibers of the data and core tensors, U = G = (cid:74) W ; W : ( ) , W : ( ) , . . . , W : ( N ) (cid:75) , under the condition that the subtensor W P R P ˆ P ˆ¨¨¨ˆ P N with P n ě R n , @ n , has the multilinear rank t R , R , . . . , R N u .The selection of a minimum number of suitable fibers depends upona chosen optimization criterion. A strategy which requires access toonly a small subset of entries of a data tensor, achieved by selectingthe entries with maximum modulus within each single fiber, is given in[26]. These entries are selected sequentially using a deflation approach,thus making the tensor cross-approximation FSTD algorithm suitable forthe approximation of very large-scale but relatively low-order tensors(including tensors with missing fibers or entries).It should be noted that an alternative efficient way to estimatesubtensors W , C , R and T is to apply random projections as follows W = Z = X ˆ Ω ˆ Ω ˆ Ω P R P ˆ P ˆ P , C = Z = X ˆ Ω ˆ Ω P R I ˆ P ˆ P , R = Z = X ˆ Ω ˆ Ω P R P ˆ I ˆ P , T = Z = X ˆ Ω ˆ Ω P R P ˆ P ˆ I , (3.45)where Ω n P R P n ˆ I n with P n ě R n for n =
1, 2, 3, are independent random matrices. We explicitly assume the multilinear rank {P_1, P_2, …, P_N} of the approximating tensor to be somewhat larger than the true multilinear rank {R_1, R_2, …, R_N} of the target tensor, because it is easier to obtain an accurate approximation in this form.

The great success of 2-way component analyses (PCA, ICA, NMF, SCA) is largely due to the existence of very efficient algorithms for their computation and the possibility to extract components with a desired physical meaning, provided by the various flexible constraints exploited in these methods. Without these constraints, matrix factorizations would be less useful in practice, as the components would have only a mathematical but not a physical meaning.

Similarly, to exploit the full potential of tensor factorizations/decompositions, it is a prerequisite to impose suitable constraints on the desired components. In fact, there is much more flexibility for tensors, since different constraints can be imposed on the matrix factorizations in every mode n of a matricized tensor X_(n) (see Algorithm 6 and Figure 3.11). Such a physically meaningful representation through flexible mode-wise constraints underpins the concept of multiway component analysis (MWCA). The Tucker representation of MWCA naturally accommodates such diversities in different modes. Besides orthogonality, alternative constraints in the Tucker format include statistical independence, sparsity, smoothness and nonnegativity [42, 43, 213, 235] (see Table 3.3).

The multiway component analysis (MWCA) based on the Tucker-N model can be computed directly in two or three steps:

1. For each mode n (n =
1, 2, …, N), perform model reduction and matricization of the data tensor sequentially, then apply a suitable set of 2-way CA/BSS algorithms to the reduced unfolding matrices, X̃_(n). In each mode, we can apply different constraints and different 2-way CA algorithms.

2. Compute the core tensor using, e.g., the inversion formula, Ĝ = X ×_1 B^(1)† ×_2 B^(2)† ⋯ ×_N B^(N)†. This step is quite important because core tensors often model the complex links among the multiple components in different modes.

3. Optionally, perform fine tuning of the factor matrices and the core tensor by the ALS minimization of a suitable cost function, e.g., ‖X − ⟦G; B^(1), …, B^(N)⟧‖²_F, subject to specific imposed constraints.

We have shown that TDs provide natural extensions of blind source separation (BSS) and 2-way (matrix) Component Analysis to multi-way component analysis (MWCA) methods.

Figure 3.11: Multiway Component Analysis (MWCA) for a third-order tensor via constrained matrix factorizations, assuming that the components are: orthogonal in the first mode, statistically independent in the second mode and sparse in the third mode.

In addition, TDs are suitable for the coupled multiway analysis of multi-block datasets, possibly with missing values and corrupted by noise. To illustrate the simplest scenario for multi-block analysis, consider the block matrices, X^(k) ∈ R^{I×J}, which need to be approximately jointly factorized as

X^(k) ≈ A G^(k) B^T,   (k =
1, 2, …, K),   (3.46)

where A ∈ R^{I×R_1} and B ∈ R^{J×R_2} are common factor matrices and G^(k) ∈ R^{R_1×R_2} are reduced-size matrices, while the number of data matrices K can be huge (hundreds of millions or more matrices). Such a simple model is referred to as the Population Value Decomposition (PVD) [51]. Note that the PVD is equivalent to the unconstrained or constrained Tucker-2 model, as illustrated in Figure 3.12. In the special case with square diagonal matrices, G^(k), the model is equivalent to the CP decomposition and is related to joint matrix diagonalization [31, 56, 203]. Furthermore, if A = B then the PVD model is equivalent to the RESCAL model [153].

Observe that the PVD/Tucker-2 model is quite general and flexible, since any high-order tensor X ∈ R^{I_1×I_2×⋯×I_N} (with N > 3) can be reshaped into a 3rd-order tensor, e.g., X ∈ R^{J_1×J_2×K} with J_1 = I_1, J_2 = I_2 and K = I_3 I_4 ⋯ I_N, to which the PVD/Tucker-2 Algorithm 8 can be applied.

Figure 3.12: Concept of the Population Value Decomposition (PVD). (a) Principle of simultaneous multi-block matrix factorizations. (b) Equivalent representation of the PVD as the constrained or unconstrained Tucker-2 decomposition, X ≈ G ×_1 A ×_2 B. The objective is to find the common factor matrices, A, B, and the core tensor, G ∈ R^{R_1×R_2×K}.

As previously mentioned, various constraints, including sparsity, nonnegativity or smoothness, can be imposed on the factor matrices, A and B, to obtain physically meaningful and unique components. A simple SVD/QR based algorithm for the PVD with orthogonality constraints is presented in Algorithm 7 [49, 51, 219]. However, it should be noted that this algorithm does not provide an optimal solution in the sense

Algorithm 7: Population Value Decomposition (PVD) with orthogonality constraints
Input:
A set of matrices X_k ∈ R^{I×J}, for k = 1, …, K (typically, K ≫ max{I, J})
Output:
Factor matrices A ∈ R^{I×R}, B ∈ R^{J×R} and G_k ∈ R^{R×R}, with orthogonality constraints A^T A = I_R and B^T B = I_R
1. for k = 1 to K do
2.   Perform the truncated SVD, X_k = U_k S_k V_k^T, using the R largest singular values
3. end for
4. Construct the short-and-wide matrices U = [U_1 S_1, …, U_K S_K] ∈ R^{I×KR} and V = [V_1 S_1, …, V_K S_K] ∈ R^{J×KR}
5. Perform the SVD (or QR) for the matrices U and V
6. Obtain the common orthogonal matrices A and B as the left singular matrices of U and V, respectively
7. for k = 1 to K do
8.   Compute G_k = A^T X_k B
9. end for

Algorithm 8: Orthogonal Tucker-2 decomposition with a prescribed approximation accuracy [170]
Input:
A 3rd-order tensor X ∈ R^{I×J×K} (typically, K ≫ max{I, J}) and estimation accuracy ε
Output:
A set of orthogonal matrices A ∈ R^{I×R_1}, B ∈ R^{J×R_2} and a core tensor G ∈ R^{R_1×R_2×K} which satisfy the constraint ‖X − G ×_1 A ×_2 B‖_F ≤ ε, s.t. A^T A = I_{R_1} and B^T B = I_{R_2}
1. Initialize A = I_I ∈ R^{I×I}, R_1 = I
2. while not converged or iteration limit is not reached do
3.   Compute the tensor Z^(1) = X ×_1 A^T ∈ R^{R_1×J×K}
4.   Compute the EVD of the small matrix Q_1 = Z^(1)_(2) Z^(1)T_(2) ∈ R^{J×J} as Q_1 = B diag(λ_1, …, λ_{R_2}) B^T, such that ∑_{r=1}^{R_2} λ_r ≥ ‖X‖²_F − ε² ≥ ∑_{r=1}^{R_2−1} λ_r
5.   Compute the tensor Z^(2) = X ×_2 B^T ∈ R^{I×R_2×K}
6.   Compute the EVD of the small matrix Q_2 = Z^(2)_(1) Z^(2)T_(1) ∈ R^{I×I} as Q_2 = A diag(λ_1, …, λ_{R_1}) A^T, such that ∑_{r=1}^{R_1} λ_r ≥ ‖X‖²_F − ε² ≥ ∑_{r=1}^{R_1−1} λ_r
7. end while
8. Compute the core tensor G = X ×_1 A^T ×_2 B^T
9. return A, B and G

Figure 3.13: Linked Multiway Component Analysis (LMWCA) for coupled 3rd-order data tensors X^(1), …, X^(K); these can have different dimensions in every mode, except for mode-1, for which the size is I_1 for all X^(k). Linked Tucker-1 decompositions are then performed in the form X^(k) ≈ G^(k) ×_1 B^(k), where the partially correlated factor matrices are B^(k) = [B_C^(1), B_I^(k)] ∈ R^{I_1×R_k}, (k =
1, 2, …, K). The objective is to find the common components, B_C^(1) ∈ R^{I_1×C}, and the individual components, B_I^(k) ∈ R^{I_1×(R_k−C)}, where C ≤ min{R_1, …, R_K} is the number of common components in mode-1.

of the absolute minimum of the cost function, ∑_{k=1}^K ‖X_k − A G_k B^T‖²_F, and for data corrupted by Gaussian noise, better performance can be achieved using the HOOI-2 given in Algorithm 4, for N =
3. An improved PVD algorithm, referred to as the Tucker-2 algorithm, is given in Algorithm 8 [170].
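A compact numpy sketch of the SVD-based PVD in Algorithm 7 (our own paraphrase with illustrative sizes, not the code of [51]): per-matrix truncated SVDs are followed by a joint SVD that extracts the common orthogonal factors A and B, and small core matrices G_k.

```python
import numpy as np

def pvd(Xs, R):
    """PVD with orthogonality constraints, in the style of Algorithm 7."""
    Us, Vs = [], []
    for Xk in Xs:
        U, s, Vt = np.linalg.svd(Xk, full_matrices=False)
        Us.append(U[:, :R] * s[:R])          # U_k S_k
        Vs.append(Vt[:R].T * s[:R])          # V_k S_k
    # Common orthogonal factors: left singular vectors of [U_1 S_1, ..., U_K S_K].
    A = np.linalg.svd(np.hstack(Us), full_matrices=False)[0][:, :R]
    B = np.linalg.svd(np.hstack(Vs), full_matrices=False)[0][:, :R]
    Gs = [A.T @ Xk @ B for Xk in Xs]         # reduced-size core matrices
    return A, B, Gs

# Synthetic population sharing common row/column subspaces of dimension R.
rng = np.random.default_rng(0)
I, J, K, R = 30, 25, 50, 4
A0 = np.linalg.qr(rng.standard_normal((I, R)))[0]
B0 = np.linalg.qr(rng.standard_normal((J, R)))[0]
Xs = [A0 @ rng.standard_normal((R, R)) @ B0.T for _ in range(K)]
A, B, Gs = pvd(Xs, R)
err = sum(np.linalg.norm(Xk - A @ Gk @ B.T) for Xk, Gk in zip(Xs, Gs))
print(err < 1e-8)  # exactly shared subspaces are recovered
```

For noisy data this SVD/QR solution is only suboptimal, as noted above; the HOOI-2 refinement would further reduce the residual.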
Linked MWCA.
Consider the analysis of multi-modal high-dimensional data collected under the same or very similar conditions, for example, a set of EEG and MEG or EEG and fMRI signals recorded for different subjects over many trials and under the same experimental configurations and mental tasks. Such data share some common latent (hidden) components but can also have their own independent features. As a result, it is advantageous and natural to analyze such data in a linked way instead of treating them independently. In such a scenario, the PVD model can be generalized to multi-block matrix/tensor datasets [38, 237, 239].

The linked multiway component analysis (LMWCA) for multi-block tensor data can therefore be formulated as a set of approximate simultaneous (joint) Tucker-1 decompositions of a set of data tensors, X^(k) ∈ R^{I_1^(k)×I_2^(k)×⋯×I_N^(k)}, with I_1^(k) = I_1 for k =
1, 2, …, K, in the form (see Figure 3.13)

X^(k) = G^(k) ×_1 B^(k),   (k =
1, 2, …, K),   (3.47)

where each factor (component) matrix, B^(k) = [B_C^(1), B_I^(k)] ∈ R^{I_1×R_k}, comprises two sets of components: (1) components B_C^(1) ∈ R^{I_1×C} (with 0 ≤ C ≤ R_k), ∀k, which are common to all the available blocks and correspond to identical or maximally correlated components, and (2) components B_I^(k) ∈ R^{I_1×(R_k−C)}, which are different independent processes for each block k; these can be, for example, latent variables independent of excitations or stimuli/tasks. The objective is therefore to estimate the common (strongly correlated) components, B_C^(1), and the statistically independent (individual) components, B_I^(k) [38].

Figure 3.14: Conceptual models of generalized Linked Multiway Component Analysis (LMWCA) applied to the cores of high-order TNs. The objective is to find a suitable tensor decomposition which yields the maximum number of cores that are as correlated as possible. (a) Linked Tensor Train (TT) networks. (b) Linked Hierarchical Tucker (HT) networks, with the correlated cores indicated by ellipses in broken lines.

If B^(n,k) = B_C^(n) ∈ R^{I_n×R_n} for a specific mode n (in our case n = 1), and if C_n < R_n, we can unfold each data tensor X^(k) in the common mode and perform a set of simultaneous matrix factorizations, e.g., X^(k)_(1) ≈ B_C^(1) A_C^(k) + B_I^(k) A_I^(k), by solving the constrained optimization problems

min ∑_{k=1}^K ‖X^(k)_(1) − B_C^(1) A_C^(k) − B_I^(k) A_I^(k)‖²_F + P(B_C^(1)),   s.t.   B_C^(1)T B_I^(k) = 0, ∀k,   (3.48)

where the symbol P denotes the penalty terms which impose additional constraints on the common components, B_C^(1), in order to extract as many common components as possible.
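To illustrate the common/individual split behind Eq. (3.48) under orthogonality constraints, a small numpy sketch (our own construction; the sizes and the eigenvalue threshold are illustrative) flags the common components as eigenvectors of the summed per-block subspace projectors whose eigenvalues are close to K:

```python
import numpy as np

rng = np.random.default_rng(0)
I, K, C, Rk = 40, 5, 2, 4
Bc = np.linalg.qr(rng.standard_normal((I, C)))[0]      # common components
blocks = []
for _ in range(K):
    Bi = rng.standard_normal((I, Rk - C))              # individual components
    A = rng.standard_normal((Rk, 60))                  # mixing (loadings)
    blocks.append(np.hstack([Bc, Bi]) @ A)             # X^(k)_(1), mode-1 unfoldings

# Sum of projectors onto the per-block leading subspaces: eigenvalues near K
# indicate directions shared by all blocks (the common components).
P = np.zeros((I, I))
for X in blocks:
    U = np.linalg.svd(X, full_matrices=False)[0][:, :Rk]
    P += U @ U.T
w, V = np.linalg.eigh(P)
n_common = int(np.sum(w > K - 0.5))
print(n_common)  # 2 -> estimated number of common components
```

A direction lying in every block subspace is an eigenvector of P with eigenvalue exactly K, which is why eigenvalue gaps of P can also be used to estimate the unknown C in practice.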
In the special case of orthogonality constraints, the problem can be transformed into a generalized eigenvalue problem. The key point is to assume that the common factor submatrices, B_C^(1), are present in all data blocks and hence reflect structurally complex latent (hidden) and intrinsic links between the data blocks. In practice, the number of common components, C, is unknown and should be estimated [237].

The linked multiway component analysis (LMWCA) model complements currently available techniques for group component analysis and feature extraction from multi-block datasets, and is a natural extension of group ICA, PVD, and CCA/PLS methods (see [38, 231, 237, 239] and references therein). Moreover, the concept of LMWCA can be generalized to tensor networks, as illustrated in Figure 3.14.
The Infinite Tucker model and its modification, the Distributed Infinite Tucker (DinTucker), generalize the standard Tucker decomposition to infinite-dimensional feature spaces using kernel and Bayesian approaches [201, 225, 233].

Consider the classic Tucker-N model of an Nth-order tensor X ∈ R^{I_1×⋯×I_N}, given by

X = G ×_1 B^(1) ×_2 B^(2) ⋯ ×_N B^(N) = ⟦G; B^(1), B^(2), …, B^(N)⟧,   (3.49)

or, in its vectorized version, vec(X) = (B^(1) ⊗_L ⋯ ⊗_L B^(N)) vec(G). Furthermore, assume that the noisy data tensor is modeled as

Y = X + E,   (3.50)

where E represents the tensor of additive Gaussian noise. Using the Bayesian framework and tensor-variate Gaussian processes (TGP) for the Tucker decomposition, a standard normal prior can be assigned over each entry, g_{r_1,r_2,…,r_N}, of the Nth-order core tensor, G ∈ R^{R_1×⋯×R_N}, in order to marginalize out G and express the probability density function of the tensor X [36, 225, 233] in the form

p(X | B^(1), …, B^(N)) = N(vec(X); 0, C^(1) ⊗_L ⋯ ⊗_L C^(N)) = (2π)^{−I/2} (∏_{n=1}^N |C^(n)|^{−I/(2I_n)}) exp(−(1/2) ‖⟦X; (C^(1))^{−1/2}, …, (C^(N))^{−1/2}⟧‖²_F),   (3.51)

where I = ∏_n I_n and C^(n) = B^(n) B^(n)T ∈ R^{I_n×I_n} for n =
$1, 2, \ldots, N$.
In order to model unknown, complex, and potentially nonlinear interactions between the latent factors, each row, $\bar{\mathbf b}^{(n)}_{i_n} \in \mathbb R^{1 \times R_n}$, within $\mathbf B^{(n)}$, is replaced by a nonlinear feature transformation $\Phi(\bar{\mathbf b}^{(n)}_{i_n})$ using the kernel trick [232], whereby the nonlinear covariance matrix $\mathbf C^{(n)} = k(\mathbf B^{(n)}, \mathbf B^{(n)})$ replaces the standard covariance matrix, $\mathbf B^{(n)} \mathbf B^{(n)\,\mathsf T}$. Using such a nonlinear feature mapping, the original Tucker factorization is performed in an infinite feature space, while Eq. (3.51) defines a Gaussian process (GP) on a tensor, called the Tensor-variate GP (TGP), where the inputs come from a set of factor matrices $\{\mathbf B^{(1)}, \ldots, \mathbf B^{(N)}\} = \{\mathbf B^{(n)}\}$.
For a noisy data tensor $\underline{\mathbf Y}$, the joint probability density function is given by
$$p\big(\underline{\mathbf Y}, \underline{\mathbf X}, \{\mathbf B^{(n)}\}\big) = p\big(\{\mathbf B^{(n)}\}\big)\, p\big(\underline{\mathbf X} \mid \{\mathbf B^{(n)}\}\big)\, p\big(\underline{\mathbf Y} \mid \underline{\mathbf X}\big). \quad (3.52)$$
To improve scalability, the observed noisy tensor $\underline{\mathbf Y}$ can be split into $K$ subtensors $\{\underline{\mathbf Y}_1, \ldots, \underline{\mathbf Y}_K\}$, whereby each subtensor $\underline{\mathbf Y}_k$ is sampled from its own GP-based model with factor matrices $\{\tilde{\mathbf B}^{(n)}_k\} = \{\tilde{\mathbf B}^{(1)}_k, \ldots, \tilde{\mathbf B}^{(N)}_k\}$. The factor matrices can then be merged via a prior distribution
$$p\big(\{\tilde{\mathbf B}^{(n)}_k\} \mid \{\mathbf B^{(n)}\}\big) = \prod_{n=1}^N p\big(\tilde{\mathbf B}^{(n)}_k \mid \mathbf B^{(n)}\big) = \prod_{n=1}^N \mathcal N\big(\mathrm{vec}(\tilde{\mathbf B}^{(n)}_k) \mid \mathrm{vec}(\mathbf B^{(n)}),\, \lambda \mathbf I\big), \quad (3.53)$$
where $\lambda > 0$ is a variance hyperparameter.
The covariance matrix in (3.51), $\mathbf C^{(1)} \otimes_L \cdots \otimes_L \mathbf C^{(N)} \in \mathbb R^{\prod_n I_n \times \prod_n I_n}$, may have a prohibitively large size, while the observed data tensor can be extremely sparse. For such cases, an alternative nonlinear tensor decomposition model has recently been developed, which does not, either explicitly or implicitly, exploit the Kronecker structure of covariance matrices [41]. Within this model, for each tensor entry, $x_{i_1, \ldots, i_N} = x_{\mathbf i}$, with $\mathbf i = (i_1, i_2, \ldots, i_N)$, an input vector $\mathbf b_{\mathbf i}$ is constructed by concatenating the corresponding row vectors of the factor (latent) matrices, $\mathbf B^{(n)}$, for all $N$ modes, as
$$\mathbf b_{\mathbf i} = [\bar{\mathbf b}^{(1)}_{i_1}, \ldots, \bar{\mathbf b}^{(N)}_{i_N}] \in \mathbb R^{1 \times \sum_{n=1}^N R_n}.$$
(3.54)
We can formalize an (unknown) nonlinear transformation as
$$x_{\mathbf i} = f(\mathbf b_{\mathbf i}) = f\big([\bar{\mathbf b}^{(1)}_{i_1}, \ldots, \bar{\mathbf b}^{(N)}_{i_N}]\big), \quad (3.55)$$
for which a zero-mean multivariate Gaussian distribution is determined by $\mathbf B_S = \{\mathbf b_{\mathbf i_1}, \ldots, \mathbf b_{\mathbf i_M}\}$ and $\mathbf f_S = \{f(\mathbf b_{\mathbf i_1}), \ldots, f(\mathbf b_{\mathbf i_M})\}$. This allows us to construct the following probability function
$$p\big(\mathbf f_S \mid \{\mathbf B^{(n)}\}\big) = \mathcal N\big(\mathbf f_S \mid \mathbf 0,\, k(\mathbf B_S, \mathbf B_S)\big), \quad (3.56)$$
where $k(\cdot,\cdot)$ is a nonlinear covariance function, which can be expressed as $k(\mathbf b_{\mathbf i}, \mathbf b_{\mathbf j}) = k\big([\bar{\mathbf b}^{(1)}_{i_1}, \ldots, \bar{\mathbf b}^{(N)}_{i_N}], [\bar{\mathbf b}^{(1)}_{j_1}, \ldots, \bar{\mathbf b}^{(N)}_{j_N}]\big)$, and $S = [\mathbf i_1, \ldots, \mathbf i_M]$.
In order to assign a standard normal prior over the factor matrices, $\{\mathbf B^{(n)}\}$, we assume that for selected entries, $\mathbf x = [x_{\mathbf i_1}, \ldots, x_{\mathbf i_M}]$, of a tensor $\underline{\mathbf X}$, the noisy entries, $\mathbf y = [y_{\mathbf i_1}, \ldots, y_{\mathbf i_M}]$, of the observed tensor $\underline{\mathbf Y}$ are sampled from the following joint probability model
$$p\big(\mathbf y, \mathbf x, \{\mathbf B^{(n)}\}\big) = \prod_{n=1}^N \mathcal N\big(\mathrm{vec}(\mathbf B^{(n)}) \mid \mathbf 0, \mathbf I\big)\; \mathcal N\big(\mathbf x \mid \mathbf 0,\, k(\mathbf B_S, \mathbf B_S)\big)\; \mathcal N\big(\mathbf y \mid \mathbf x,\, \beta^{-1} \mathbf I\big), \quad (3.57)$$
where $\beta^{-1}$ represents the noise variance.
These nonlinear and probabilistic models can potentially be applied to data tensors or function-related tensors comprising a large number of entries, typically with millions of non-zero entries and billions of zero entries. Even if only the nonzero entries are used, exact inference of the above nonlinear tensor decomposition models may still be intractable. To alleviate this problem, a distributed variational inference algorithm has been developed, which is based on sparse GP, together with an efficient MapReduce framework which uses a small set of inducing points to break up the dependencies between random function values [204, 233].

Chapter 4. Tensor Train Decompositions: Graphical Interpretations and Algorithms
Efficient implementation of the various operations in the tensor train (TT) format requires compact and easy-to-understand mathematical and graphical representations [37, 39]. To this end, we next present mathematical formulations of the TT decompositions and demonstrate their advantages in both theoretical and practical scenarios.
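Before the formal representations, it may help to see how a TT is actually computed. The following is a hedged NumPy sketch of the well-known sequential-SVD (TT-SVD) construction from the literature; the function names and the truncation rule are our own choices, not from the text.

```python
import numpy as np

def tt_svd(x, eps=1e-10):
    """Sequential-SVD construction of a TT (a standard textbook sketch).

    Returns 3rd-order cores G[n] of shape (R_{n-1}, I_n, R_n), with
    R_0 = R_N = 1; singular values below a threshold derived from eps
    are truncated.
    """
    dims = x.shape
    delta = eps * np.linalg.norm(x) / max(len(dims) - 1, 1)
    cores, r_prev = [], 1
    mat = np.asarray(x, dtype=float)
    for n in range(len(dims) - 1):
        mat = mat.reshape(r_prev * dims[n], -1)
        U, s, Vt = np.linalg.svd(mat, full_matrices=False)
        r = max(1, int(np.sum(s > delta)))          # truncated TT rank R_n
        cores.append(U[:, :r].reshape(r_prev, dims[n], r))
        mat = s[:r, None] * Vt[:r]                  # carry remainder forward
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores

def tt_full(cores):
    """Contract the TT cores back into the full (dense) tensor."""
    out = cores[0]
    for G in cores[1:]:
        out = np.tensordot(out, G, axes=(-1, 0))
    return out.reshape([G.shape[1] for G in cores])

# Sanity check on a small random 4th-order tensor (exact for tiny eps).
x = np.random.default_rng(1).standard_normal((3, 4, 5, 6))
cores = tt_svd(x)
err = np.linalg.norm(tt_full(cores) - x) / np.linalg.norm(x)
```

For structured data, larger values of `eps` trade accuracy for smaller TT ranks; here a tiny threshold makes the decomposition exact up to floating-point error.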
The tensor train (TT/MPS) representation of an $N$th-order data tensor, $\underline{\mathbf X} \in \mathbb R^{I_1 \times I_2 \times \cdots \times I_N}$, can be described in several equivalent forms (see Figures 4.1, 4.2 and Table 4.1), listed below:
1. The entry-wise scalar form, given by
$$x_{i_1, i_2, \ldots, i_N} \cong \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} g^{(1)}_{i_1, r_1}\, g^{(2)}_{r_1, i_2, r_2} \cdots g^{(N)}_{r_{N-1}, i_N, 1}. \quad (4.1)$$
2. The slice representation (see Figure 2.19), in the form
$$x_{i_1, i_2, \ldots, i_N} \cong \mathbf G^{(1)}_{i_1}\, \mathbf G^{(2)}_{i_2} \cdots \mathbf G^{(N)}_{i_N}, \quad (4.2)$$
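The scalar form (4.1) and the slice form (4.2) are easy to verify numerically: an entry of the TT tensor is the product of the corresponding lateral slices of the cores. A minimal NumPy sketch (helper names are our own):

```python
import numpy as np

# A small random TT: cores G[n] of shape (R_{n-1}, I_n, R_n), R_0 = R_4 = 1.
rng = np.random.default_rng(2)
dims, ranks = (4, 3, 5, 2), (1, 2, 3, 2, 1)
cores = [rng.standard_normal((ranks[n], dims[n], ranks[n + 1]))
         for n in range(4)]

def tt_entry(cores, index):
    """Eq. (4.2): multiply the lateral slices G^(n)(:, i_n, :) left to right."""
    mat = np.eye(1)
    for G, i in zip(cores, index):
        mat = mat @ G[:, i, :]
    return mat.item()                  # the final product is a 1-by-1 matrix

# Compare against the fully contracted tensor.
full = cores[0]
for G in cores[1:]:
    full = np.tensordot(full, G, axes=(-1, 0))
full = full.reshape(dims)
diff = abs(tt_entry(cores, (1, 2, 4, 0)) - full[1, 2, 4, 0])
```

Evaluating a single entry touches only $N$ small slice matrices, which is the point of the format: no full tensor is ever needed.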
Figure 4.1: TT decomposition of a 4th-order tensor, $\underline{\mathbf X}$, for which the TT rank is $R_1 = R_2 = R_3 = 5$. (a) (Upper panel) Representation of the TT via a multilinear product of the cores, $\underline{\mathbf X} \cong \underline{\mathbf G}^{(1)} \times^1 \underline{\mathbf G}^{(2)} \times^1 \underline{\mathbf G}^{(3)} \times^1 \underline{\mathbf G}^{(4)} = \langle\langle \underline{\mathbf G}^{(1)}, \underline{\mathbf G}^{(2)}, \underline{\mathbf G}^{(3)}, \underline{\mathbf G}^{(4)} \rangle\rangle$, and (lower panel) an equivalent representation via the outer product of mode-2 fibers (a sum of rank-1 tensors), in the form $\underline{\mathbf X} \cong \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \sum_{r_3=1}^{R_3} \big(\mathbf g^{(1)}_{1, r_1} \circ \mathbf g^{(2)}_{r_1, r_2} \circ \mathbf g^{(3)}_{r_2, r_3} \circ \mathbf g^{(4)}_{r_3, 1}\big)$. (b) TT decomposition in a vectorized form, represented via strong Kronecker products of block matrices, $\mathbf x \cong \tilde{\mathbf G}^{(1)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(2)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(3)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(4)} \in \mathbb R^{I_1 I_2 I_3 I_4}$, where the block matrices are defined as $\tilde{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} I_n \times R_n}$, with block vectors $\mathbf g^{(n)}_{r_{n-1}, r_n} \in \mathbb R^{I_n \times 1}$, $n =$
$1, \ldots, 4$, and $R_0 = R_4 = 1$.

Table 4.1: Equivalent forms of the Tensor Train decomposition (MPS with open boundary conditions) for an $N$th-order tensor $\underline{\mathbf X} \in \mathbb R^{I_1 \times I_2 \times \cdots \times I_N}$. It is assumed that the TT rank is $\mathbf r_{TT} = \{R_1, R_2, \ldots, R_{N-1}\}$, with $R_0 = R_N = 1$.

Tensor representation: Multilinear products
$$\underline{\mathbf X} = \underline{\mathbf G}^{(1)} \times^1 \underline{\mathbf G}^{(2)} \times^1 \cdots \times^1 \underline{\mathbf G}^{(N)} \in \mathbb R^{I_1 \times I_2 \times \cdots \times I_N}$$
with the 3rd-order cores $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times R_n}$, $n = 1, 2, \ldots, N$.

Tensor representation: Outer products
$$\underline{\mathbf X} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} \mathbf g^{(1)}_{1, r_1} \circ \mathbf g^{(2)}_{r_1, r_2} \circ \cdots \circ \mathbf g^{(N-1)}_{r_{N-2}, r_{N-1}} \circ \mathbf g^{(N)}_{r_{N-1}, 1},$$
where $\mathbf g^{(n)}_{r_{n-1}, r_n} = \underline{\mathbf G}^{(n)}(r_{n-1}, :, r_n) \in \mathbb R^{I_n}$ are fiber vectors.

Vector representation: Strong Kronecker products
$$\mathbf x = \tilde{\mathbf G}^{(1)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(2)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\mathbf G}^{(N)} \in \mathbb R^{I_1 I_2 \cdots I_N},$$
where $\tilde{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} I_n \times R_n}$ are block matrices with blocks $\mathbf g^{(n)}_{r_{n-1}, r_n} \in \mathbb R^{I_n}$.

Scalar representation
$$x_{i_1, i_2, \ldots, i_N} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} g^{(1)}_{i_1, r_1}\, g^{(2)}_{r_1, i_2, r_2} \cdots g^{(N-1)}_{r_{N-2}, i_{N-1}, r_{N-1}}\, g^{(N)}_{r_{N-1}, i_N, 1},$$
where $g^{(n)}_{r_{n-1}, i_n, r_n}$ are entries of a 3rd-order core $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times R_n}$.

Slice (MPS) representation
$$x_{i_1, i_2, \ldots, i_N} = \mathbf G^{(1)}_{i_1}\, \mathbf G^{(2)}_{i_2} \cdots \mathbf G^{(N)}_{i_N},$$
where $\mathbf G^{(n)}_{i_n} = \underline{\mathbf G}^{(n)}(:, i_n, :) \in \mathbb R^{R_{n-1} \times R_n}$ are lateral slices of $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times R_n}$.

Table 4.2: Equivalent forms of the Tensor Chain (TC) decomposition (MPS with periodic boundary conditions) for an $N$th-order tensor $\underline{\mathbf X} \in \mathbb R^{I_1 \times I_2 \times \cdots \times I_N}$. It is assumed that the TC rank is $\mathbf r_{TC} = \{R_1, R_2, \ldots, R_{N-1}, R_N\}$.

Tensor representation: Trace of multilinear products of cores
$$\underline{\mathbf X} = \mathrm{Tr}\big(\underline{\mathbf G}^{(1)} \times^1 \underline{\mathbf G}^{(2)} \times^1 \cdots \times^1 \underline{\mathbf G}^{(N)}\big) \in \mathbb R^{I_1 \times I_2 \times \cdots \times I_N}$$
with the 3rd-order cores $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times R_n}$, $R_0 = R_N$, $n = 1, 2, \ldots, N$.

Tensor/Vector representation: Outer/Kronecker products
$$\underline{\mathbf X} = \sum_{r_1, r_2, \ldots, r_N = 1}^{R_1, R_2, \ldots, R_N} \mathbf g^{(1)}_{r_N, r_1} \circ \mathbf g^{(2)}_{r_1, r_2} \circ \cdots \circ \mathbf g^{(N)}_{r_{N-1}, r_N} \in \mathbb R^{I_1 \times I_2 \times \cdots \times I_N},$$
$$\mathbf x = \sum_{r_1, r_2, \ldots, r_N = 1}^{R_1, R_2, \ldots, R_N} \mathbf g^{(1)}_{r_N, r_1} \otimes_L \mathbf g^{(2)}_{r_1, r_2} \otimes_L \cdots \otimes_L \mathbf g^{(N)}_{r_{N-1}, r_N} \in \mathbb R^{I_1 I_2 \cdots I_N},$$
where $\mathbf g^{(n)}_{r_{n-1}, r_n} = \underline{\mathbf G}^{(n)}(r_{n-1}, :, r_n) \in \mathbb R^{I_n}$ are fiber vectors.

Vector representation: Strong Kronecker products
$$\mathbf x = \sum_{r_N=1}^{R_N} \big(\tilde{\mathbf G}^{(1)}_{r_N} \mathbin{|\otimes|} \tilde{\mathbf G}^{(2)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\mathbf G}^{(N-1)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(N)}_{r_N}\big) \in \mathbb R^{I_1 I_2 \cdots I_N},$$
where $\tilde{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} I_n \times R_n}$ are block matrices with blocks $\mathbf g^{(n)}_{r_{n-1}, r_n} \in \mathbb R^{I_n}$, $\tilde{\mathbf G}^{(1)}_{r_N} \in \mathbb R^{I_1 \times R_1}$ is a matrix with blocks (columns) $\mathbf g^{(1)}_{r_N, r_1} \in \mathbb R^{I_1}$, and $\tilde{\mathbf G}^{(N)}_{r_N} \in \mathbb R^{R_{N-1} I_N \times 1}$ is a block vector with blocks $\mathbf g^{(N)}_{r_{N-1}, r_N} \in \mathbb R^{I_N}$.

Scalar representations
$$x_{i_1, i_2, \ldots, i_N} = \mathrm{tr}\big(\mathbf G^{(1)}_{i_1}\, \mathbf G^{(2)}_{i_2} \cdots \mathbf G^{(N)}_{i_N}\big) = \sum_{r_N=1}^{R_N} \big(\mathbf g^{(1)\,\mathsf T}_{r_N, i_1, :}\, \mathbf G^{(2)}_{i_2} \cdots \mathbf G^{(N-1)}_{i_{N-1}}\, \mathbf g^{(N)}_{:, i_N, r_N}\big),$$
where $\mathbf g^{(1)}_{r_N, i_1, :} = \underline{\mathbf G}^{(1)}(r_N, i_1, :) \in \mathbb R^{R_1}$ and $\mathbf g^{(N)}_{:, i_N, r_N} = \underline{\mathbf G}^{(N)}(:, i_N, r_N) \in \mathbb R^{R_{N-1}}$.
Figure 4.2: TT/MPS decomposition of an $N$th-order data tensor, $\underline{\mathbf X}$, for which the TT rank is $\{R_1, R_2, \ldots, R_{N-1}\}$. (a) Tensorization of a huge-scale vector, $\mathbf x \in \mathbb R^I$, into an $N$th-order tensor, $\underline{\mathbf X} \in \mathbb R^{I_1 \times I_2 \times \cdots \times I_N}$. (b) The data tensor can be represented exactly or approximately via a tensor train (TT/MPS) consisting of 3rd-order cores, in the form $\underline{\mathbf X} \cong \underline{\mathbf G}^{(1)} \times^1 \underline{\mathbf G}^{(2)} \times^1 \cdots \times^1 \underline{\mathbf G}^{(N)} = \langle\langle \underline{\mathbf G}^{(1)}, \underline{\mathbf G}^{(2)}, \ldots, \underline{\mathbf G}^{(N)} \rangle\rangle$, where $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times R_n}$ for $n = 1, 2, \ldots, N$, with $R_0 = R_N = 1$. (c) Equivalently, using strong Kronecker products, the TT tensor can be expressed in a vectorized form, $\mathbf x \cong \tilde{\mathbf G}^{(1)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(2)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\mathbf G}^{(N)} \in \mathbb R^{I_1 I_2 \cdots I_N}$, where the block matrices are defined as $\tilde{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} I_n \times R_n}$, with blocks $\mathbf g^{(n)}_{r_{n-1}, r_n} \in \mathbb R^{I_n \times 1}$.

In (4.2), the slice matrices are defined as $\mathbf G^{(n)}_{i_n} = \underline{\mathbf G}^{(n)}(:, i_n, :) \in \mathbb R^{R_{n-1} \times R_n}$, $i_n =$
$1, 2, \ldots, I_n$, with $\mathbf G^{(n)}_{i_n}$ being the $i_n$th lateral slice of the core $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times R_n}$, $n = 1, 2, \ldots, N$, and $R_0 = R_N = 1$.
3. The (global) tensor form, based on multilinear products (contractions) of the cores, given by
$$\underline{\mathbf X} \cong \underline{\mathbf G}^{(1)} \times^1 \underline{\mathbf G}^{(2)} \times^1 \cdots \times^1 \underline{\mathbf G}^{(N-1)} \times^1 \underline{\mathbf G}^{(N)} = \langle\langle \underline{\mathbf G}^{(1)}, \underline{\mathbf G}^{(2)}, \ldots, \underline{\mathbf G}^{(N-1)}, \underline{\mathbf G}^{(N)} \rangle\rangle, \quad (4.3)$$
where the 3rd-order cores are $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times R_n}$, $n = 1, 2, \ldots, N$, with $R_0 = R_N = 1$.
4. The tensor form, expressed as a sum of rank-1 tensors
$$\underline{\mathbf X} \cong \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} \mathbf g^{(1)}_{1, r_1} \circ \mathbf g^{(2)}_{r_1, r_2} \circ \cdots \circ \mathbf g^{(N-1)}_{r_{N-2}, r_{N-1}} \circ \mathbf g^{(N)}_{r_{N-1}, 1}, \quad (4.4)$$
where $\mathbf g^{(n)}_{r_{n-1}, r_n} = \underline{\mathbf G}^{(n)}(r_{n-1}, :, r_n) \in \mathbb R^{I_n}$ are mode-2 fibers, $n = 1, 2, \ldots, N$, and $R_0 = R_N = 1$.
5. The vectorized form, expressed via Kronecker products of the fibers
$$\mathbf x \cong \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} \mathbf g^{(1)}_{1, r_1} \otimes_L \mathbf g^{(2)}_{r_1, r_2} \otimes_L \cdots \otimes_L \mathbf g^{(N-1)}_{r_{N-2}, r_{N-1}} \otimes_L \mathbf g^{(N)}_{r_{N-1}, 1}, \quad (4.5)$$
where $\mathbf x = \mathrm{vec}(\underline{\mathbf X}) \in \mathbb R^{I_1 I_2 \cdots I_N}$.
6. An alternative vector form, produced by strong Kronecker products of block matrices (see Figure 4.1(b) and Figure 4.2(c)), given by
$$\mathbf x \cong \tilde{\mathbf G}^{(1)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(2)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\mathbf G}^{(N)}, \quad (4.6)$$
where the block matrices $\tilde{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} I_n \times R_n}$ consist of blocks $\mathbf g^{(n)}_{r_{n-1}, r_n} \in \mathbb R^{I_n \times 1}$, $n = 1, 2, \ldots, N$, with $R_0 = R_N = 1$, and the operator $\mathbin{|\otimes|}$ denotes the strong Kronecker product. (Note that the cores $\underline{\mathbf G}^{(1)}$ and $\underline{\mathbf G}^{(N)}$ are in fact two-dimensional arrays (matrices), but for a uniform representation we treat them as 3rd-order cores of sizes $1 \times I_1 \times R_1$ and $R_{N-1} \times I_N \times 1$, respectively.)
Analogous relationships can be established for the Tensor Chain (i.e., MPS with PBC, see Figure 2.19(b)), and are summarized in Table 4.2.

The matrix tensor train, also called the Matrix Product Operator with open boundary conditions (TT/MPO), is an important TN model which first represents huge-scale structured matrices, $\mathbf X \in \mathbb R^{I \times J}$, as $2N$th-order tensors, $\underline{\mathbf X} \in \mathbb R^{I_1 \times J_1 \times I_2 \times J_2 \times \cdots \times I_N \times J_N}$, where $I = I_1 I_2 \cdots I_N$ and $J = J_1 J_2 \cdots J_N$ (see Figures 4.3, 4.4 and Table 4.3). The matrix TT/MPO then converts such a $2N$th-order tensor into a chain (train) of 4th-order cores. It should be noted that the matrix TT decomposition is equivalent to the vector TT, created by merging all index pairs $(i_n, j_n)$ into single indices ranging from 1 to $I_n J_n$, in a reverse lexicographic order.
Similarly to the vector TT decomposition, a large-scale $2N$th-order tensor, $\underline{\mathbf X} \in \mathbb R^{I_1 \times J_1 \times I_2 \times J_2 \times \cdots \times I_N \times J_N}$, can be represented in the TT/MPO format via the following mathematical representations:
1. The scalar (entry-wise) form
$$x_{i_1, j_1, \ldots, i_N, j_N} \cong \sum_{r_1=1}^{R_1} \sum_{r_2=1}^{R_2} \cdots \sum_{r_{N-1}=1}^{R_{N-1}} g^{(1)}_{i_1, j_1, r_1}\, g^{(2)}_{r_1, i_2, j_2, r_2} \cdots g^{(N-1)}_{r_{N-2}, i_{N-1}, j_{N-1}, r_{N-1}}\, g^{(N)}_{r_{N-1}, i_N, j_N, 1}. \quad (4.7)$$
2. The slice representation
$$x_{i_1, j_1, \ldots, i_N, j_N} \cong \mathbf G^{(1)}_{i_1, j_1}\, \mathbf G^{(2)}_{i_2, j_2} \cdots \mathbf G^{(N)}_{i_N, j_N}, \quad (4.8)$$
where $\mathbf G^{(n)}_{i_n, j_n} = \underline{\mathbf G}^{(n)}(:, i_n, j_n, :) \in \mathbb R^{R_{n-1} \times R_n}$ are slices of the cores $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times J_n \times R_n}$, $n =$
$1, 2, \ldots, N$, with $R_0 = R_N = 1$. (The cores $\underline{\mathbf G}^{(1)}$ and $\underline{\mathbf G}^{(N)}$ are in fact three-dimensional arrays; however, for a uniform representation, we treat them as 4th-order cores of sizes $1 \times I_1 \times J_1 \times R_1$ and $R_{N-1} \times I_N \times J_N \times 1$, respectively.)
3. The (global) tensor form, based on multilinear products of the cores
$$\underline{\mathbf X} \cong \underline{\mathbf G}^{(1)} \times^1 \underline{\mathbf G}^{(2)} \times^1 \cdots \times^1 \underline{\mathbf G}^{(N)} = \langle\langle \underline{\mathbf G}^{(1)}, \underline{\mathbf G}^{(2)}, \ldots, \underline{\mathbf G}^{(N)} \rangle\rangle, \quad (4.9)$$
where the TT-cores are defined as $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times J_n \times R_n}$, $n = 1, 2, \ldots, N$, with $R_0 = R_N = 1$.
4. The matrix form, based on strong Kronecker products of block matrices
$$\mathbf X \cong \tilde{\mathbf G}^{(1)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(2)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\mathbf G}^{(N)} \in \mathbb R^{I_1 \cdots I_N \times J_1 \cdots J_N}, \quad (4.10)$$
where $\tilde{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} I_n \times R_n J_n}$ are block matrices with blocks $\mathbf G^{(n)}_{r_{n-1}, r_n} \in \mathbb R^{I_n \times J_n}$, the number of blocks being $R_{n-1} \times R_n$. In the special case when all the TT ranks $R_n = 1$, the strong Kronecker products simplify into standard (left) Kronecker products.
The strong Kronecker product representation of a TT is probably the most comprehensive and useful form for displaying tensor trains in their vector/matrix form, since it allows us to perform many operations using relatively small block matrices.

Example.
For two matrices (in the TT format) expressed via strong Kronecker products, $\mathbf A = \tilde{\mathbf A}^{(1)} \mathbin{|\otimes|} \tilde{\mathbf A}^{(2)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\mathbf A}^{(N)}$ and $\mathbf B = \tilde{\mathbf B}^{(1)} \mathbin{|\otimes|} \tilde{\mathbf B}^{(2)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\mathbf B}^{(N)}$, their Kronecker product can be efficiently computed as $\mathbf A \otimes_L \mathbf B = \tilde{\mathbf A}^{(1)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\mathbf A}^{(N)} \mathbin{|\otimes|} \tilde{\mathbf B}^{(1)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\mathbf B}^{(N)}$. Furthermore, if the matrices $\mathbf A$ and $\mathbf B$ have the same mode sizes, then their linear combination, $\mathbf C = \alpha \mathbf A + \beta \mathbf B$, can be compactly expressed as [112, 113, 158]
$$\mathbf C = \big[\tilde{\mathbf A}^{(1)}\;\; \tilde{\mathbf B}^{(1)}\big] \mathbin{|\otimes|} \begin{bmatrix} \tilde{\mathbf A}^{(2)} & \mathbf 0 \\ \mathbf 0 & \tilde{\mathbf B}^{(2)} \end{bmatrix} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \begin{bmatrix} \tilde{\mathbf A}^{(N-1)} & \mathbf 0 \\ \mathbf 0 & \tilde{\mathbf B}^{(N-1)} \end{bmatrix} \mathbin{|\otimes|} \begin{bmatrix} \alpha \tilde{\mathbf A}^{(N)} \\ \beta \tilde{\mathbf B}^{(N)} \end{bmatrix}.$$
Consider its reshaped tensor $\underline{\mathbf C} = \langle\langle \underline{\mathbf C}^{(1)}, \underline{\mathbf C}^{(2)}, \ldots, \underline{\mathbf C}^{(N)} \rangle\rangle$ in the TT format; its cores $\underline{\mathbf C}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times J_n \times R_n}$, $n = 1, 2, \ldots, N$, can then be expressed through their unfolding matrices, $\mathbf C^{(n)}_{<n>} \in \mathbb R^{R_{n-1} I_n \times R_n J_n}$, or equivalently by their lateral slices, as given in (4.11) and (4.12) below. (Note that, while the original matrices $\mathbf A \in \mathbb R^{I_1 \cdots I_N \times J_1 \cdots J_N}$ and $\mathbf B \in \mathbb R^{I_1 \cdots I_N \times J_1 \cdots J_N}$ must have the same mode sizes, the corresponding core tensors, $\underline{\mathbf A}^{(n)} \in \mathbb R^{R^A_{n-1} \times I_n \times J_n \times R^A_n}$ and $\underline{\mathbf B}^{(n)} \in \mathbb R^{R^B_{n-1} \times I_n \times J_n \times R^B_n}$, may have different TT ranks, $R^A_n$ and $R^B_n$.)

Figure 4.3: TT/MPO decomposition of a matrix, $\mathbf X \in \mathbb R^{I \times J}$, reshaped as an 8th-order tensor, $\underline{\mathbf X} \in \mathbb R^{I_1 \times J_1 \times \cdots \times I_4 \times J_4}$, where $I = I_1 I_2 I_3 I_4$ and $J = J_1 J_2 J_3 J_4$. (a) Basic TT representation via multilinear products (tensor contractions) of the cores, $\underline{\mathbf X} = \underline{\mathbf G}^{(1)} \times^1 \underline{\mathbf G}^{(2)} \times^1 \underline{\mathbf G}^{(3)} \times^1 \underline{\mathbf G}^{(4)}$, with $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times J_n \times R_n}$ and $R_0 = R_4 = 1$. (b) Representation of a matrix or a matricized tensor via strong Kronecker products of block matrices, in the form $\mathbf X = \tilde{\mathbf G}^{(1)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(2)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(3)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(4)} \in \mathbb R^{I_1 I_2 I_3 I_4 \times J_1 J_2 J_3 J_4}$.
Figure 4.4: Representations of huge matrices by "linked" block matrices. (a) Tensorization of a huge-scale matrix, $\mathbf X \in \mathbb R^{I \times J}$, into a $2N$th-order tensor, $\underline{\mathbf X} \in \mathbb R^{I_1 \times J_1 \times \cdots \times I_N \times J_N}$. (b) The TT/MPO decomposition of a huge matrix, $\mathbf X$, expressed by 4th-order cores, $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times J_n \times R_n}$. (c) An alternative graphical representation of a matrix, $\mathbf X \in \mathbb R^{I_1 I_2 \cdots I_N \times J_1 J_2 \cdots J_N}$, via strong Kronecker products of block matrices, $\tilde{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} I_n \times R_n J_n}$, for $n = 1, 2, \ldots, N$, with $R_0 = R_N = 1$.

Table 4.3: Equivalent forms of the matrix Tensor Train decomposition (MPO with open boundary conditions) for a $2N$th-order tensor $\underline{\mathbf X} \in \mathbb R^{I_1 \times J_1 \times I_2 \times J_2 \times \cdots \times I_N \times J_N}$. It is assumed that the TT rank is $\{R_1, R_2, \ldots, R_{N-1}\}$, with $R_0 = R_N = 1$.

Tensor representation: Multilinear products
$$\underline{\mathbf X} = \underline{\mathbf G}^{(1)} \times^1 \underline{\mathbf G}^{(2)} \times^1 \cdots \times^1 \underline{\mathbf G}^{(N-1)} \times^1 \underline{\mathbf G}^{(N)}$$
with 4th-order cores $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times J_n \times R_n}$, $(n =$
$1, 2, \ldots, N)$.

Tensor representation: Outer products
$$\underline{\mathbf X} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} \mathbf G^{(1)}_{1, r_1} \circ \mathbf G^{(2)}_{r_1, r_2} \circ \cdots \circ \mathbf G^{(N-1)}_{r_{N-2}, r_{N-1}} \circ \mathbf G^{(N)}_{r_{N-1}, 1},$$
where $\mathbf G^{(n)}_{r_{n-1}, r_n} \in \mathbb R^{I_n \times J_n}$ are blocks of $\tilde{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} I_n \times R_n J_n}$.

Matrix representation: Strong Kronecker products
$$\mathbf X = \tilde{\mathbf G}^{(1)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(2)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\mathbf G}^{(N)} \in \mathbb R^{I_1 \cdots I_N \times J_1 \cdots J_N},$$
where $\tilde{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} I_n \times R_n J_n}$ are block matrices with blocks $\underline{\mathbf G}^{(n)}(r_{n-1}, :, :, r_n)$.

Scalar representation
$$x_{i_1, j_1, i_2, j_2, \ldots, i_N, j_N} = \sum_{r_1, r_2, \ldots, r_{N-1}=1}^{R_1, R_2, \ldots, R_{N-1}} g^{(1)}_{i_1, j_1, r_1}\, g^{(2)}_{r_1, i_2, j_2, r_2} \cdots g^{(N)}_{r_{N-1}, i_N, j_N, 1},$$
where $g^{(n)}_{r_{n-1}, i_n, j_n, r_n}$ are entries of a 4th-order core $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times J_n \times R_n}$.

Slice (MPS) representation
$$x_{i_1, j_1, i_2, j_2, \ldots, i_N, j_N} = \mathbf G^{(1)}_{i_1, j_1}\, \mathbf G^{(2)}_{i_2, j_2} \cdots \mathbf G^{(N)}_{i_N, j_N},$$
where $\mathbf G^{(n)}_{i_n, j_n} = \underline{\mathbf G}^{(n)}(:, i_n, j_n, :) \in \mathbb R^{R_{n-1} \times R_n}$ are slices of $\underline{\mathbf G}^{(n)} \in \mathbb R^{R_{n-1} \times I_n \times J_n \times R_n}$.

In particular, the slices, $\mathbf C^{(n)}_{i_n, j_n} \in \mathbb R^{R_{n-1} \times R_n}$, of the cores of the linear combination $\mathbf C = \alpha \mathbf A + \beta \mathbf B$ are given by
$$\mathbf C^{(n)}_{i_n, j_n} = \begin{bmatrix} \mathbf A^{(n)}_{i_n, j_n} & \mathbf 0 \\ \mathbf 0 & \mathbf B^{(n)}_{i_n, j_n} \end{bmatrix}, \quad n = 2, 3, \ldots, N-1, \quad (4.11)$$
while for the border cores
$$\mathbf C^{(1)}_{i_1, j_1} = \big[\mathbf A^{(1)}_{i_1, j_1}\;\; \mathbf B^{(1)}_{i_1, j_1}\big], \qquad \mathbf C^{(N)}_{i_N, j_N} = \begin{bmatrix} \alpha \mathbf A^{(N)}_{i_N, j_N} \\ \beta \mathbf B^{(N)}_{i_N, j_N} \end{bmatrix} \quad (4.12)$$
for $i_n = 1, 2, \ldots, I_n$, $j_n = 1, 2, \ldots, J_n$, $n = 1, 2, \ldots, N$.
Note that the various mathematical and graphical representations of TT/MPS and TT/MPO can be used interchangeably for different purposes or applications. With these representations, all basic mathematical operations in the TT format can be performed on the constitutive block matrices, even without the need to explicitly construct the core tensors [67, 158].

Remark.
In the TT/MPO paradigm, the compression of large matrices is achieved not by global (standard) low-rank matrix approximations, but by low-rank approximations of block matrices (submatrices) arranged in a hierarchical (linked) fashion. However, to achieve a low-rank TT, and consequently a good compression ratio, the ranks of all the corresponding unfolding matrices of a specific structured data tensor must be low, i.e., their singular values must rapidly decay to zero. While this is true for many structured matrices, in general this assumption unfortunately does not hold.
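This remark can be made concrete: a low TT rank can exist only if every unfolding $\mathbf X_{<n>}$, of size $(I_1 \cdots I_n) \times (I_{n+1} \cdots I_N)$, is numerically low-rank. The sketch below (our own construction, not from the text) contrasts a structured tensor, obtained by quantizing samples of the exponential function, with an unstructured random tensor of the same shape.

```python
import numpy as np

def unfolding_ranks(x, tol=1e-8):
    """Numerical ranks of all unfoldings X_<n>, n = 1, ..., N-1."""
    ranks, dims = [], x.shape
    for n in range(1, len(dims)):
        mat = x.reshape(int(np.prod(dims[:n])), -1)
        s = np.linalg.svd(mat, compute_uv=False)
        ranks.append(int(np.sum(s > tol * s[0])))
    return ranks

# Structured data: samples of exp(t) on [0, 1), quantized to order 10.
# exp is separable across binary digits, so every unfolding has rank 1.
t = np.linspace(0.0, 1.0, 2**10, endpoint=False)
structured = np.exp(t).reshape([2] * 10)

# Unstructured data of the same shape: unfolding ranks are maximal.
random_t = np.random.default_rng(3).standard_normal([2] * 10)

print(unfolding_ranks(structured))   # all unfolding ranks equal 1
print(unfolding_ranks(random_t))     # generically min(2^n, 2^(10-n))
```

The structured vector thus admits an exact rank-1 QTT, while the random one offers no compression at all, which is precisely the caveat in the remark above.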
It is important to note that any specific TN format can be converted into the TT format. This very useful property is next illustrated for two simple but important cases, which establish links between the CP and TT formats and between the BTD and TT formats.
1. A tensor in the CP format, given by
$$\underline{\mathbf X} = \sum_{r=1}^R \mathbf a^{(1)}_r \circ \mathbf a^{(2)}_r \circ \cdots \circ \mathbf a^{(N)}_r, \quad (4.13)$$
can be straightforwardly converted into the TT/MPS format as follows. Since each of the $R$ rank-1 tensors can be represented in the TT format with TT rank $(1, 1, \ldots, 1)$, using formulas (4.11) and (4.12) we have
$$\underline{\mathbf X} = \sum_{r=1}^R \langle\langle \mathbf a^{(1)\,\mathsf T}_r, \mathbf a^{(2)\,\mathsf T}_r, \ldots, \mathbf a^{(N)\,\mathsf T}_r \rangle\rangle = \langle\langle \underline{\mathbf G}^{(1)}, \underline{\mathbf G}^{(2)}, \ldots, \underline{\mathbf G}^{(N-1)}, \underline{\mathbf G}^{(N)} \rangle\rangle, \quad (4.14)$$
where the TT-cores $\underline{\mathbf G}^{(n)} \in \mathbb R^{R \times I_n \times R}$ have diagonal lateral slices, $\underline{\mathbf G}^{(n)}(:, i_n, :) = \mathbf G^{(n)}_{i_n} = \mathrm{diag}(a_{i_n, 1}, a_{i_n, 2}, \ldots, a_{i_n, R}) \in \mathbb R^{R \times R}$, for $n = 2, 3, \ldots, N-1$, with $\mathbf G^{(1)} = \mathbf A^{(1)} \in \mathbb R^{I_1 \times R}$ and $\mathbf G^{(N)} = \mathbf A^{(N)\,\mathsf T} \in \mathbb R^{R \times I_N}$ (see Figure 4.5(a)).
2. A more general Block Term Decomposition (BTD) for a $2N$th-order data tensor
$$\underline{\mathbf X} = \sum_{r=1}^R \big(\mathbf A^{(1)}_r \circ \mathbf A^{(2)}_r \circ \cdots \circ \mathbf A^{(N)}_r\big) \in \mathbb R^{I_1 \times J_1 \times \cdots \times I_N \times J_N}, \quad (4.15)$$
with full-rank matrices $\mathbf A^{(n)}_r \in \mathbb R^{I_n \times J_n}$, $\forall r$, can be converted into a matrix TT/MPO format, as illustrated in Figure 4.5(b). Note that (4.15) can be expressed in a matricized (unfolding) form via strong Kronecker products of block-diagonal matrices (see formulas (4.11)), given by
$$\mathbf X = \sum_{r=1}^R \big(\mathbf A^{(1)}_r \otimes_L \mathbf A^{(2)}_r \otimes_L \cdots \otimes_L \mathbf A^{(N)}_r\big) = \tilde{\mathbf G}^{(1)} \mathbin{|\otimes|} \tilde{\mathbf G}^{(2)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\mathbf G}^{(N)} \in \mathbb R^{I_1 \cdots I_N \times J_1 \cdots J_N}, \quad (4.16)$$
with the TT rank $R_n = R$ for $n = 1, 2, \ldots, N-1$, and the block-diagonal matrices $\tilde{\mathbf G}^{(n)} = \mathrm{diag}(\mathbf A^{(n)}_1, \mathbf A^{(n)}_2, \ldots, \mathbf A^{(n)}_R) \in \mathbb R^{R I_n \times R J_n}$ for $n = 2, 3, \ldots, N-1$, while $\tilde{\mathbf G}^{(1)} = [\mathbf A^{(1)}_1, \mathbf A^{(1)}_2, \ldots, \mathbf A^{(1)}_R] \in \mathbb R^{I_1 \times R J_1}$ is a row block matrix and $\tilde{\mathbf G}^{(N)} = [\mathbf A^{(N)\,\mathsf T}_1, \ldots, \mathbf A^{(N)\,\mathsf T}_R]^{\mathsf T} \in \mathbb R^{R I_N \times J_N}$ is a column block matrix (see Figure 4.5(b)).
Several algorithms exist for decompositions in the forms (4.15) and (4.16) [14, 15, 181]. In this way, TT/MPO decompositions for huge-scale structured matrices can be constructed indirectly.

Figure 4.5: Links between the TT format and other tensor network formats. (a) Representation of the CP decomposition of an $N$th-order tensor, $\underline{\mathbf X} = \underline{\mathbf I} \times_1 \mathbf A^{(1)} \times_2 \mathbf A^{(2)} \cdots \times_N \mathbf A^{(N)}$, in the TT format. (b) Representation of the BTD model given by Eqs. (4.15) and (4.16) in the TT/MPO format. Observe that the TT-cores are very sparse and the TT ranks are $\{R, R, \ldots, R\}$. Similar relationships can be established straightforwardly for the TC format.

Figure 4.6: The concept of tensorization/quantization of a large-scale vector into a higher-order quantized tensor, e.g., a vector of $64 = 2^6$ entries reshaped into a $(2 \times 2 \times 2 \times 2 \times 2 \times 2)$ tensor. In order to achieve a good compression ratio, we need to apply a suitable tensor decomposition, such as the quantized TT (QTT) using 3rd-order cores, $\underline{\mathbf X} = \underline{\mathbf G}^{(1)} \times^1 \underline{\mathbf G}^{(2)} \times^1 \cdots \times^1 \underline{\mathbf G}^{(6)}$.

The procedure of creating a higher-order tensor from lower-order original data is referred to as tensorization, while in the special case where each mode has a very small size (typically 2, 3 or 4), it is referred to as quantization. In addition to vectors and matrices, lower-order tensors can also be reshaped into higher-order tensors. By virtue of quantization, low-rank TN approximations with high compression ratios can be obtained, which is not possible to achieve with the original raw data formats [114, 157]. The concept of quantized tensor networks (QTN) was first proposed in [157] and [114], whereby small-size 3rd-order cores are sparsely interconnected via tensor contractions. The so-obtained model often provides an efficient, highly compressed, and low-rank representation of a data tensor and helps to mitigate the curse of dimensionality, as illustrated below.
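The quantization step itself is a lossless reshaping; a minimal NumPy sketch (binary case; the big-endian index convention here is our own choice for illustration):

```python
import numpy as np

# Quantization as pure reshaping: a vector of length 2^K becomes a
# K-th-order (2 x 2 x ... x 2) tensor, the linear index being split into
# K binary "virtual" indices.
K = 6
x = np.arange(2**K, dtype=float)          # stand-in for a huge vector
X = x.reshape([2] * K)                    # quantized K-th-order tensor

i = 45                                    # any linear index
bits = [(i >> (K - 1 - n)) & 1 for n in range(K)]   # binary digits of i
assert X[tuple(bits)] == x[i]             # same data, merely re-indexed
print(bits)
```

Compression only enters afterwards, when a low-rank TT decomposition is applied to the reshaped tensor, as the examples below discuss.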
Example.
The quantization of a huge vector, $\mathbf x \in \mathbb R^I$ with $I = 2^K$, can be achieved through reshaping to give a $(2 \times 2 \times \cdots \times 2)$ tensor $\underline{\mathbf X}$ of order $K$, as illustrated in Figure 4.6. For structured data, such a quantized tensor, $\underline{\mathbf X}$, often admits a low-rank TN approximation, so that a good compression of the huge vector $\mathbf x$ can be achieved by enforcing the maximum possible low-rank structure on the tensor network. Even more generally, an $N$th-order tensor, $\underline{\mathbf X} \in \mathbb R^{I_1 \times \cdots \times I_N}$, with $I_n = q^{K_n}$, can be quantized in all modes simultaneously to yield a $(q \times q \times \cdots \times q)$ quantized tensor of higher order and with a small value of $q$.

Example.
Since large-scale tensors (even of low order) cannot be loaded directly into the computer memory, our approach to this problem is to represent the huge-scale data by tensor networks in a distributed and compressed TT format, so as to avoid the explicit requirement for unfeasibly large computer memory. In the example shown in Figure 4.7, the tensor train of a huge 3rd-order tensor is expressed through strong Kronecker products of block tensors with relatively small 3rd-order tensor blocks. The QTT is mathematically represented in a distributed form via strong Kronecker products of block 5th-order tensors. Recall that the strong Kronecker product of two block core tensors, $\tilde{\underline{\mathbf G}}^{(n)} \in \mathbb R^{R_{n-1} I_n \times R_n J_n \times K_n}$ and $\tilde{\underline{\mathbf G}}^{(n+1)} \in \mathbb R^{R_n I_{n+1} \times R_{n+1} J_{n+1} \times K_{n+1}}$, is defined as the block tensor $\underline{\mathbf C} = \tilde{\underline{\mathbf G}}^{(n)} \mathbin{|\otimes|} \tilde{\underline{\mathbf G}}^{(n+1)} \in \mathbb R^{R_{n-1} I_n I_{n+1} \times R_{n+1} J_n J_{n+1} \times K_n K_{n+1}}$, with 3rd-order tensor blocks $\underline{\mathbf C}_{r_{n-1}, r_{n+1}} = \sum_{r_n=1}^{R_n} \underline{\mathbf G}^{(n)}_{r_{n-1}, r_n} \otimes_L \underline{\mathbf G}^{(n+1)}_{r_n, r_{n+1}} \in \mathbb R^{I_n I_{n+1} \times J_n J_{n+1} \times K_n K_{n+1}}$, where $\underline{\mathbf G}^{(n)}_{r_{n-1}, r_n} \in \mathbb R^{I_n \times J_n \times K_n}$ and $\underline{\mathbf G}^{(n+1)}_{r_n, r_{n+1}} \in \mathbb R^{I_{n+1} \times J_{n+1} \times K_{n+1}}$ are the blocks of $\tilde{\underline{\mathbf G}}^{(n)}$ and $\tilde{\underline{\mathbf G}}^{(n+1)}$, respectively.
In practice, a fine ($q =$
$2, 3, 4$) quantization is desirable to create as many virtual (additional) modes as possible, thus allowing for the implementation of efficient low-rank tensor approximations. For example, the binary encoding ($q = 2$) reshapes an $N$th-order tensor with $(2^{K_1} \times 2^{K_2} \times \cdots \times 2^{K_N})$ elements into a tensor of order $(K_1 + K_2 + \cdots + K_N)$ with the same number of elements. In other words, the idea is to quantize each of the $N$ "physical" modes (dimensions) by replacing them with $K_n$ "virtual" modes, provided that the corresponding mode sizes, $I_n$, can be factorized as $I_n = I_{n,1} I_{n,2} \cdots I_{n,K_n}$. This, in turn, corresponds to reshaping the $n$th mode of size $I_n$ into $K_n$ modes of sizes $I_{n,1}, I_{n,2}, \ldots, I_{n,K_n}$.

Table 4.4: Storage complexities of tensor decomposition models for an $N$th-order tensor, $\underline{\mathbf X} \in \mathbb R^{I_1 \times I_2 \times \cdots \times I_N}$, for which the original storage complexity is $O(I^N)$, where $I = \max\{I_1, I_2, \ldots, I_N\}$, while $R$ is the upper bound on the ranks of the tensor decompositions considered, that is, $R = \max\{R_1, R_2, \ldots, R_{N-1}\}$ or $R = \max\{R_1, R_2, \ldots, R_N\}$.
1. Full (raw) tensor format: $O(I^N)$
2. CP: $O(NIR)$
3. Tucker: $O(NIR + R^N)$
4. TT/MPS: $O(NIR^2)$
5. TT/MPO: $O(NI^2R^2)$
6. Quantized TT/MPS (QTT): $O(NR^2 \log_q(I))$
7. QTT+Tucker: $O(NR^2 \log_q(I) + NR^3)$
8. Hierarchical Tucker (HT): $O(NIR + NR^3)$

The TT decomposition applied to quantized tensors is referred to as the QTT, Quantics-TT or Quantized-TT, and was first introduced as a compression scheme for large-scale matrices [157], and also independently for more general settings. The attractive properties of QTT are:
1. Not only are the QTT ranks typically small (usually below 20), but they are also almost independent of the data size (at least uniformly bounded, even for $I = 2^{50}$), thus providing a logarithmic (sub-linear) reduction of storage requirements, from $O(I^N)$ to $O(NR^2 \log_q(I))$, which is referred to as super-compression [68, 70, 111, 112, 114]. Comparisons of the storage complexity of various tensor formats are given in Table 4.4.
2. Compared to the TT decomposition (without quantization), the QTT format often represents deep structures in the data by introducing "virtual" dimensions or modes. For data which exhibit high degrees of structure,
Figure 4.7: Tensorization/quantization of a huge-scale 3rd-order tensor into a higher-order tensor and its TT representation. (a) Example of the tensorization/quantization of a 3rd-order tensor, $\underline{\mathbf X} \in \mathbb R^{I \times J \times K}$, into a $3N$th-order tensor, assuming that the mode sizes can be factorized as $I = I_1 I_2 \cdots I_N$, $J = J_1 J_2 \cdots J_N$ and $K = K_1 K_2 \cdots K_N$. (b) Decomposition of the high-order tensor via a generalized tensor train and its representation by the strong Kronecker product of block tensors, $\underline{\mathbf X} \cong \tilde{\underline{\mathbf G}}^{(1)} \mathbin{|\otimes|} \tilde{\underline{\mathbf G}}^{(2)} \mathbin{|\otimes|} \cdots \mathbin{|\otimes|} \tilde{\underline{\mathbf G}}^{(N)} \in \mathbb R^{I_1 \cdots I_N \times J_1 \cdots J_N \times K_1 \cdots K_N}$, where each block $\tilde{\underline{\mathbf G}}^{(n)} \in \mathbb R^{R_{n-1} I_n \times R_n J_n \times K_n}$ consists of 3rd-order tensor blocks of size $(I_n \times J_n \times K_n)$, for $n = 1, 2, \ldots, N$, with $R_0 = R_N = 1$. In the special case when $J = K = 1$, the model simplifies into the standard TT/MPS model.
Figure 4.8: The QTT-Tucker or, alternatively, QTC-Tucker (Quantized Tensor-Chain-Tucker) format. (a) Distributed representation of a matrix $\mathbf A_n \in \mathbb R^{I_n \times R_n}$ with a very large value of $I_n$ via QTT, by tensorization to a high-order quantized tensor, followed by the QTT decomposition. (b) Distributed representation of a large-scale Tucker-$N$ model, $\underline{\mathbf X} \cong \underline{\mathbf G} \times_1 \mathbf A_1 \times_2 \mathbf A_2 \cdots \times_N \mathbf A_N$, via a quantized TC model, in which the core tensor $\underline{\mathbf G} \in \mathbb R^{R_1 \times R_2 \times \cdots \times R_N}$ and, optionally, all large-scale factor matrices $\mathbf A_n$ $(n =$
$1, 2, \ldots, N)$ are represented by MPS models (for more detail, see [68]).

the high compressibility of the QTT approximation is a consequence of the better separability properties of the quantized tensor.
3. The QTT ranks are often moderate or even low, which offers unique advantages in the context of big data analytics (see [112, 114, 115] and references therein), together with the high efficiency of multilinear algebra within the TT/QTT algorithms, which rests upon the well-posedness of the low-rank TT approximations.
The ranks of the QTT format can, however, grow dramatically with the data size or with an increase in the required approximation accuracy. To overcome this problem, Dolgov and Khoromskij proposed the QTT-Tucker format [68] (see Figure 4.8), which exploits the TT approximation not only for the Tucker core tensor, but also for the factor matrices. This model naturally admits distributed computation, and often yields bounded ranks, thus avoiding the curse of dimensionality.
The TT/QTT tensor networks have already found application in very large-scale problems in scientific computing, such as in eigenanalysis, super-fast Fourier transforms, and in solving huge systems of linear equations (see [68, 70, 102, 120, 123, 218] and references therein).

For big tensors in their TT formats, basic mathematical operations, such as addition, the inner product, the computation of tensor norms, Hadamard and Kronecker products, and matrix-by-vector and matrix-by-matrix multiplications, can be performed very efficiently using the block (slice) matrices of the individual (relatively small) core tensors. Consider two $N$th-order tensors in the TT format,
$$\underline{\mathbf X} = \langle\langle \underline{\mathbf X}^{(1)}, \underline{\mathbf X}^{(2)}, \ldots, \underline{\mathbf X}^{(N)} \rangle\rangle \in \mathbb R^{I_1 \times I_2 \times \cdots \times I_N},$$
$$\underline{\mathbf Y} = \langle\langle \underline{\mathbf Y}^{(1)}, \underline{\mathbf Y}^{(2)}, \ldots, \underline{\mathbf Y}^{(N)} \rangle\rangle \in \mathbb R^{I_1 \times I_2 \times \cdots \times I_N},$$
for which the TT ranks are $\mathbf r_X = \{R_1, R_2, \ldots, R_{N-1}\}$ and $\mathbf r_Y = \{\tilde R_1, \tilde R_2, \ldots, \tilde R_{N-1}\}$. The following operations can then be performed directly in the TT formats.
(The TT/QTT ranks are constant or grow linearly with respect to the tensor order $N$, and are constant or grow logarithmically with respect to the mode size $I$.)

Tensor addition. The sum of two tensors,
$$\underline{\mathbf Z} = \underline{\mathbf X} + \underline{\mathbf Y} = \langle\langle \underline{\mathbf Z}^{(1)}, \underline{\mathbf Z}^{(2)}, \ldots, \underline{\mathbf Z}^{(N)} \rangle\rangle \in \mathbb R^{I_1 \times I_2 \times \cdots \times I_N}, \quad (4.17)$$
has the TT rank $\mathbf r_Z = \mathbf r_X + \mathbf r_Y$ and can be expressed via the lateral slices of the cores $\underline{\mathbf Z}^{(n)} \in \mathbb R^{(R_{n-1}+\tilde R_{n-1}) \times I_n \times (R_n+\tilde R_n)}$ as
$$\mathbf Z^{(n)}_{i_n} = \begin{bmatrix} \mathbf X^{(n)}_{i_n} & \mathbf 0 \\ \mathbf 0 & \mathbf Y^{(n)}_{i_n} \end{bmatrix}, \quad n = 2, 3, \ldots, N-1. \quad (4.18)$$
For the border cores, we have
$$\mathbf Z^{(1)}_{i_1} = \big[\mathbf X^{(1)}_{i_1}\;\; \mathbf Y^{(1)}_{i_1}\big], \qquad \mathbf Z^{(N)}_{i_N} = \begin{bmatrix} \mathbf X^{(N)}_{i_N} \\ \mathbf Y^{(N)}_{i_N} \end{bmatrix} \quad (4.19)$$
for $i_n =$
1, 2, . . . , I n , n =
1, 2, . . . , N . Hadamard product.
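The block structure in (4.18)–(4.19) maps directly onto code. Below is a minimal NumPy sketch (not the authors' implementation; we assume each core is stored as an array of shape $R_{n-1} \times I_n \times R_n$, and the helper `tt_full` is ours, used only to verify the identity on small examples):

```python
import numpy as np

def tt_full(cores):
    # Contract a list of TT-cores (each of shape R_{n-1} x I_n x R_n)
    # into the full tensor; feasible only for small examples.
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))

def tt_add(X, Y):
    # Sum of two TT tensors, Eqs. (4.18)-(4.19): TT ranks add, the middle
    # core slices become block-diagonal, and the border cores are
    # concatenated horizontally (first) and vertically (last).
    N = len(X)
    Z = []
    for n, (Xc, Yc) in enumerate(zip(X, Y)):
        if n == 0:                      # first core: slices [X  Y]
            Zc = np.concatenate([Xc, Yc], axis=2)
        elif n == N - 1:                # last core: slices [X; Y]
            Zc = np.concatenate([Xc, Yc], axis=0)
        else:                           # middle cores: block-diagonal slices
            Rx0, I, Rx1 = Xc.shape
            Ry0, _, Ry1 = Yc.shape
            Zc = np.zeros((Rx0 + Ry0, I, Rx1 + Ry1))
            Zc[:Rx0, :, :Rx1] = Xc
            Zc[Rx0:, :, Rx1:] = Yc
        Z.append(Zc)
    return Z
```

Note that no full tensor is ever formed when adding in the TT format; only the small cores are touched, so the cost is linear in $N$ and $I_n$.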
Hadamard product. The computation of the Hadamard (element-wise) product, $\mathbf{Z} = \mathbf{X} \circledast \mathbf{Y}$, of two tensors of the same order and the same size can be performed very efficiently in the TT format by expressing the slices of the cores $\mathbf{Z}^{(n)} \in \mathbb{R}^{R_{n-1}\tilde{R}_{n-1} \times I_n \times R_n\tilde{R}_n}$ as
$$\mathbf{Z}^{(n)}_{i_n} = \mathbf{X}^{(n)}_{i_n} \otimes \mathbf{Y}^{(n)}_{i_n}, \qquad n = 1, \ldots, N, \quad i_n = 1, \ldots, I_n. \qquad (4.20)$$
This increases the TT ranks of the tensor $\mathbf{Z}$ to at most $R_n\tilde{R}_n$, $n = 1, 2, \ldots, N$, but the associated computational complexity is reduced from being exponential in $N$, $O(I^N)$, to being linear in both $I$ and $N$, $O(NI(R\tilde{R})^2)$.
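The slice-wise Kronecker construction in (4.20) is only a few lines of NumPy. The sketch below is illustrative (cores stored as arrays of shape $R_{n-1} \times I_n \times R_n$; the `tt_full` helper is ours, for verification only):

```python
import numpy as np

def tt_full(cores):
    # Contract TT-cores into the full tensor (for small verification only).
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out.squeeze(axis=(0, -1))

def tt_hadamard(X, Y):
    # Element-wise product in the TT format, Eq. (4.20): every lateral slice
    # of the result core is the Kronecker product of the corresponding
    # slices of X and Y, so the TT ranks multiply (R_n * R~_n), but the
    # full tensors are never formed.
    Z = []
    for Xc, Yc in zip(X, Y):
        I = Xc.shape[1]
        slices = [np.kron(Xc[:, i, :], Yc[:, i, :]) for i in range(I)]
        Z.append(np.stack(slices, axis=1))
    return Z
```

Since the product of Kronecker products satisfies $(\mathbf{A}_1 \otimes \mathbf{B}_1)(\mathbf{A}_2 \otimes \mathbf{B}_2) = \mathbf{A}_1\mathbf{A}_2 \otimes \mathbf{B}_1\mathbf{B}_2$, chaining these slices reproduces the products of the slices of $\mathbf{X}$ and $\mathbf{Y}$ separately, which is exactly the Hadamard product entry-by-entry.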
Super fast Fourier transform. The FFT of a tensor in the TT format (MATLAB functions: fftn(X) and fft(X(n),[],2)) can be computed core-wise as
$$\mathcal{F}(\mathbf{X}) = \langle\langle \mathcal{F}(\mathbf{X}^{(1)}), \mathcal{F}(\mathbf{X}^{(2)}), \ldots, \mathcal{F}(\mathbf{X}^{(N)}) \rangle\rangle = \mathcal{F}(\mathbf{X}^{(1)}) \times^1 \mathcal{F}(\mathbf{X}^{(2)}) \times^1 \cdots \times^1 \mathcal{F}(\mathbf{X}^{(N)}). \qquad (4.21)$$
It should be emphasized that performing the FFT on the relatively small core tensors $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1}\times I_n\times R_n}$ dramatically reduces the computational complexity, under the condition that the data tensor admits a low-rank TT approximation. This approach is referred to as the super fast Fourier transform (SFFT) in TT format. Wavelets, the DCT, and other linear integral transformations admit a form similar to the SFFT in (4.21); for example, for the wavelet transform in the TT format we have
$$\mathcal{W}(\mathbf{X}) = \langle\langle \mathcal{W}(\mathbf{X}^{(1)}), \mathcal{W}(\mathbf{X}^{(2)}), \ldots, \mathcal{W}(\mathbf{X}^{(N)}) \rangle\rangle = \mathcal{W}(\mathbf{X}^{(1)}) \times^1 \mathcal{W}(\mathbf{X}^{(2)}) \times^1 \cdots \times^1 \mathcal{W}(\mathbf{X}^{(N)}). \qquad (4.22)$$

N-D discrete convolution. The convolution, in a TT format, of tensors $\mathbf{X} \in \mathbb{R}^{I_1\times\cdots\times I_N}$ with TT rank $\{R_1, R_2, \ldots, R_{N-1}\}$ and $\mathbf{Y} \in \mathbb{R}^{J_1\times\cdots\times J_N}$ with TT rank $\{Q_1, Q_2, \ldots, Q_{N-1}\}$ can be computed as
$$\mathbf{Z} = \mathbf{X} * \mathbf{Y} = \langle\langle \mathbf{Z}^{(1)}, \mathbf{Z}^{(2)}, \ldots, \mathbf{Z}^{(N)} \rangle\rangle \in \mathbb{R}^{(I_1+J_1-1)\times(I_2+J_2-1)\times\cdots\times(I_N+J_N-1)}, \qquad (4.23)$$
with the TT-cores given by
$$\mathbf{Z}^{(n)} = \mathbf{X}^{(n)} \odot \mathbf{Y}^{(n)} \in \mathbb{R}^{(R_{n-1}Q_{n-1})\times(I_n+J_n-1)\times(R_nQ_n)}, \qquad (4.24)$$
or, equivalently, using the standard convolution of the core fibers,
$$\mathbf{Z}^{(n)}(s_{n-1},:,s_n) = \mathbf{X}^{(n)}(r_{n-1},:,r_n) * \mathbf{Y}^{(n)}(q_{n-1},:,q_n) \in \mathbb{R}^{I_n+J_n-1},$$
for $s_n = \overline{r_n q_n} = 1, 2, \ldots, R_nQ_n$ and $n = 1, 2, \ldots, N$, with $R_0 = R_N = Q_0 = Q_N = 1$.
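The fiber-wise rule (4.24) can be sketched directly, looping over the Kronecker-indexed rank pairs. This is an illustrative NumPy sketch (not optimized, names are ours; cores stored as arrays of shape $R_{n-1}\times I_n\times R_n$):

```python
import numpy as np

def tt_conv(X, Y):
    # N-D (full) discrete convolution in TT format, Eqs. (4.23)-(4.24):
    # each mode-2 fiber of the result core is the 1-D convolution of a
    # fiber of X with a fiber of Y; ranks multiply as s_n = (r_n, q_n).
    Z = []
    for Xc, Yc in zip(X, Y):
        R0, I, R1 = Xc.shape
        Q0, J, Q1 = Yc.shape
        Zc = np.zeros((R0 * Q0, I + J - 1, R1 * Q1))
        for r0 in range(R0):
            for q0 in range(Q0):
                for r1 in range(R1):
                    for q1 in range(Q1):
                        Zc[r0 * Q0 + q0, :, r1 * Q1 + q1] = np.convolve(
                            Xc[r0, :, r1], Yc[q0, :, q1])
        Z.append(Zc)
    return Z
```

The result can be checked against an FFT-based full convolution on a small example, which is how the test below proceeds.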
Inner product. The computation of the inner (scalar, dot) product of two $N$th-order tensors, $\mathbf{X} = \langle\langle \mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(N)}\rangle\rangle \in \mathbb{R}^{I_1\times\cdots\times I_N}$ and $\mathbf{Y} = \langle\langle \mathbf{Y}^{(1)}, \ldots, \mathbf{Y}^{(N)}\rangle\rangle \in \mathbb{R}^{I_1\times\cdots\times I_N}$, is given by
$$\langle \mathbf{X}, \mathbf{Y} \rangle = \langle \operatorname{vec}(\mathbf{X}), \operatorname{vec}(\mathbf{Y}) \rangle = \sum_{i_1=1}^{I_1} \cdots \sum_{i_N=1}^{I_N} x_{i_1 \cdots i_N}\, y_{i_1 \cdots i_N} \qquad (4.25)$$
and has a complexity of $O(I^N)$ in the raw tensor format. In the TT format, the inner product can be computed with a reduced complexity of only $O(NI(R^2\tilde{R} + R\tilde{R}^2))$, by moving along the TT-cores from left to right and performing calculations on the relatively small matrices
$$\mathbf{S}_n = \sum_{i_n} \mathbf{X}^{(n)\,T}_{i_n} \mathbf{S}_{n-1} \mathbf{Y}^{(n)}_{i_n} \in \mathbb{R}^{R_n \times \tilde{R}_n}, \qquad n = 1, 2, \ldots, N,$$
the intermediate results being sequentially multiplied by the next core (see Algorithm 9).

Computation of the Frobenius norm. In a similar way, we can efficiently compute the Frobenius norm of a tensor, $\|\mathbf{X}\|_F = \sqrt{\langle \mathbf{X}, \mathbf{X} \rangle}$, in the TT format. For the so-called $n$-orthogonal TT format, it is easy to show that
$$\|\mathbf{X}\|_F = \|\mathbf{X}^{(n)}\|_F. \qquad (4.26)$$

Algorithm 9: Inner product of two large-scale tensors in the TT format [67, 158]
Input: $N$th-order tensors $\mathbf{X} = \langle\langle \mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(N)}\rangle\rangle$ and $\mathbf{Y} = \langle\langle \mathbf{Y}^{(1)}, \ldots, \mathbf{Y}^{(N)}\rangle\rangle \in \mathbb{R}^{I_1\times\cdots\times I_N}$ in TT formats, with TT-cores $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1}\times I_n\times R_n}$ and $\mathbf{Y}^{(n)} \in \mathbb{R}^{\tilde{R}_{n-1}\times I_n\times \tilde{R}_n}$, where $R_0 = \tilde{R}_0 = 1$
Output: Inner product $\langle \mathbf{X}, \mathbf{Y} \rangle = \operatorname{vec}(\mathbf{X})^T \operatorname{vec}(\mathbf{Y})$
  Initialize $\mathbf{S}_0 = 1$
  for $n = 1$ to $N$ do
    $\mathbf{Z}^{(n)}_{(1)} = \mathbf{S}_{n-1}\mathbf{Y}^{(n)}_{(1)} \in \mathbb{R}^{R_{n-1} \times I_n\tilde{R}_n}$
    $\mathbf{S}_n = \mathbf{X}^{(n)\,T}_{<2>} \mathbf{Z}^{(n)}_{<2>} \in \mathbb{R}^{R_n \times \tilde{R}_n}$
  end for
  return the scalar $\langle \mathbf{X}, \mathbf{Y} \rangle = \mathbf{S}_N \in \mathbb{R}^{R_N \times \tilde{R}_N}$, with $R_N = \tilde{R}_N = 1$
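The left-to-right sweep of Algorithm 9 is very compact in NumPy. An illustrative sketch (our naming; cores stored as arrays of shape $R_{n-1}\times I_n\times R_n$):

```python
import numpy as np

def tt_inner(X, Y):
    # Inner product of two TT tensors (Algorithm 9): sweep left to right,
    # carrying only a small R_n x R~_n matrix S_n instead of ever forming
    # the full tensors, so the cost is linear in N and in the mode sizes.
    S = np.ones((1, 1))                                 # S_0, since R_0 = R~_0 = 1
    for Xc, Yc in zip(X, Y):
        Z = np.tensordot(S, Yc, axes=([1], [0]))        # R_{n-1} x I_n x R~_n
        S = np.tensordot(Xc, Z, axes=([0, 1], [0, 1]))  # R_n x R~_n
    return S.item()
```

Consistently with (4.26), the Frobenius norm then follows as the square root of `tt_inner(X, X)`.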
(An $N$th-order tensor $\mathbf{X} = \langle\langle \mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(N)}\rangle\rangle$ in the TT format is called $n$-orthogonal if all the cores to the left of the core $\mathbf{X}^{(n)}$ are left-orthogonalized and all the cores to the right of $\mathbf{X}^{(n)}$ are right-orthogonalized; see Part 2 for more detail.)

Matrix-by-vector multiplication. Consider a huge-scale matrix equation (see Figure 4.9 and Figure 4.10)
$$\mathbf{A}\mathbf{x} = \mathbf{y}, \qquad (4.27)$$
where $\mathbf{A} \in \mathbb{R}^{I\times J}$, $\mathbf{x} \in \mathbb{R}^{J}$ and $\mathbf{y} \in \mathbb{R}^{I}$ are represented approximately in the TT format, with $I = I_1I_2\cdots I_N$ and $J = J_1J_2\cdots J_N$. As shown in Figure 4.9(a), the cores are defined as $\mathbf{A}^{(n)} \in \mathbb{R}^{P_{n-1}\times I_n\times J_n\times P_n}$, $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1}\times J_n\times R_n}$ and $\mathbf{Y}^{(n)} \in \mathbb{R}^{Q_{n-1}\times I_n\times Q_n}$.

Upon representing the entries of the matrix $\mathbf{A}$ and of the vectors $\mathbf{x}$ and $\mathbf{y}$ in the tensorized forms
$$\mathbf{A} = \sum_{p_1,\ldots,p_{N-1}=1}^{P_1,\ldots,P_{N-1}} \mathbf{A}^{(1)}_{p_1} \circ \mathbf{A}^{(2)}_{p_1,p_2} \circ \cdots \circ \mathbf{A}^{(N)}_{p_{N-1}},$$
$$\mathbf{X} = \sum_{r_1,\ldots,r_{N-1}=1}^{R_1,\ldots,R_{N-1}} \mathbf{x}^{(1)}_{r_1} \circ \mathbf{x}^{(2)}_{r_1,r_2} \circ \cdots \circ \mathbf{x}^{(N)}_{r_{N-1}}, \qquad (4.28)$$
$$\mathbf{Y} = \sum_{q_1,\ldots,q_{N-1}=1}^{Q_1,\ldots,Q_{N-1}} \mathbf{y}^{(1)}_{q_1} \circ \mathbf{y}^{(2)}_{q_1,q_2} \circ \cdots \circ \mathbf{y}^{(N)}_{q_{N-1}},$$
we arrive at a simple formula for the tubes of the tensor $\mathbf{Y}$, in the form
$$\mathbf{y}^{(n)}_{q_{n-1},q_n} = \mathbf{y}^{(n)}_{\overline{r_{n-1}p_{n-1}},\,\overline{r_np_n}} = \mathbf{A}^{(n)}_{p_{n-1},p_n}\,\mathbf{x}^{(n)}_{r_{n-1},r_n} \in \mathbb{R}^{I_n},$$
with $Q_n = P_nR_n$ for $n = 1, 2, \ldots, N$.

Furthermore, by representing the matrix $\mathbf{A}$ and the vectors $\mathbf{x}$, $\mathbf{y}$ via the strong Kronecker products
$$\mathbf{A} = \tilde{\mathbf{A}}^{(1)} \,|\otimes|\, \tilde{\mathbf{A}}^{(2)} \,|\otimes|\, \cdots \,|\otimes|\, \tilde{\mathbf{A}}^{(N)}, \quad \mathbf{x} = \tilde{\mathbf{X}}^{(1)} \,|\otimes|\, \cdots \,|\otimes|\, \tilde{\mathbf{X}}^{(N)}, \quad \mathbf{y} = \tilde{\mathbf{Y}}^{(1)} \,|\otimes|\, \cdots \,|\otimes|\, \tilde{\mathbf{Y}}^{(N)}, \qquad (4.29)$$
with $\tilde{\mathbf{A}}^{(n)} \in \mathbb{R}^{P_{n-1}I_n\times J_nP_n}$, $\tilde{\mathbf{X}}^{(n)} \in \mathbb{R}^{R_{n-1}J_n\times R_n}$ and $\tilde{\mathbf{Y}}^{(n)} \in \mathbb{R}^{Q_{n-1}I_n\times Q_n}$, we can establish the simple relationship
$$\tilde{\mathbf{Y}}^{(n)} = \tilde{\mathbf{A}}^{(n)} \,|\bullet|\, \tilde{\mathbf{X}}^{(n)} \in \mathbb{R}^{R_{n-1}P_{n-1}I_n \times R_nP_n}, \qquad n = 1, \ldots, N, \qquad (4.30)$$
where the operator $|\bullet|$ denotes the C (Core) product of two block matrices. The C product of a block matrix $\mathbf{A}^{(n)} \in \mathbb{R}^{P_{n-1}I_n\times P_nJ_n}$, with blocks $\mathbf{A}^{(n)}_{p_{n-1},p_n} \in \mathbb{R}^{I_n\times J_n}$, and a block matrix $\mathbf{B}^{(n)} \in \mathbb{R}^{R_{n-1}J_n\times R_nK_n}$, with blocks $\mathbf{B}^{(n)}_{r_{n-1},r_n} \in \mathbb{R}^{J_n\times K_n}$, is defined as $\mathbf{C}^{(n)} = \mathbf{A}^{(n)} \,|\bullet|\, \mathbf{B}^{(n)} \in \mathbb{R}^{Q_{n-1}I_n\times Q_nK_n}$, the blocks of which are given by $\mathbf{C}^{(n)}_{q_{n-1},q_n} = \mathbf{A}^{(n)}_{p_{n-1},p_n}\mathbf{B}^{(n)}_{r_{n-1},r_n} \in \mathbb{R}^{I_n\times K_n}$, where $q_n = \overline{p_nr_n}$, as illustrated in Figure 4.11.

Note that, equivalently to Eq. (4.30), for $\mathbf{A}\mathbf{x} = \mathbf{y}$ we can use a slice representation, given by
$$\mathbf{Y}^{(n)}_{i_n} = \sum_{j_n=1}^{J_n} \left( \mathbf{A}^{(n)}_{i_n,j_n} \otimes_L \mathbf{X}^{(n)}_{j_n} \right), \qquad (4.31)$$
which can be implemented by fast matrix-by-matrix multiplication algorithms (see Algorithm 10). In practice, for very large-scale data, we usually perform the TT core contractions (MPO-MPS product) approximately, with reduced TT ranks, e.g., via the "zip-up" method proposed in [198].

[Figure 4.9: Linear systems represented by arbitrary tensor networks (left) and TT networks (right) for (a) Ax ≈ y and (b) AX ≈ Y.]

Table 4.5: Basic operations on tensors in TT formats, where $\mathbf{X} = \mathbf{X}^{(1)} \times^1 \cdots \times^1 \mathbf{X}^{(N)} \in \mathbb{R}^{I_1\times\cdots\times I_N}$, $\mathbf{Y} = \mathbf{Y}^{(1)} \times^1 \cdots \times^1 \mathbf{Y}^{(N)} \in \mathbb{R}^{J_1\times\cdots\times J_N}$, and $\mathbf{Z} = \mathbf{Z}^{(1)} \times^1 \cdots \times^1 \mathbf{Z}^{(N)} \in \mathbb{R}^{K_1\times\cdots\times K_N}$:
- Addition, $\mathbf{Z} = \mathbf{X} + \mathbf{Y}$: TT-core slices $\mathbf{Z}^{(n)}_{i_n} = \mathbf{X}^{(n)}_{i_n} \oplus \mathbf{Y}^{(n)}_{i_n}$ ($I_n = J_n = K_n$, for all $n$; border cores as in (4.19)).
- Direct sum, $\mathbf{Z} = \mathbf{X} \oplus \mathbf{Y}$: TT-cores $\mathbf{Z}^{(n)} = \mathbf{X}^{(n)} \oplus \mathbf{Y}^{(n)}$.
- Hadamard product, $\mathbf{Z} = \mathbf{X} \circledast \mathbf{Y}$: TT-core slices $\mathbf{Z}^{(n)}_{i_n} = \mathbf{X}^{(n)}_{i_n} \otimes \mathbf{Y}^{(n)}_{i_n}$ ($I_n = J_n = K_n$, for all $n$).
- Kronecker product, $\mathbf{Z} = \mathbf{X} \otimes \mathbf{Y}$: TT-core slices $\mathbf{Z}^{(n)}_{k_n} = \mathbf{X}^{(n)}_{i_n} \otimes \mathbf{Y}^{(n)}_{j_n}$, with $k_n = \overline{i_nj_n}$.
- Convolution, $\mathbf{Z} = \mathbf{X} * \mathbf{Y}$: TT-cores $\mathbf{Z}^{(n)} = \mathbf{X}^{(n)} \odot \mathbf{Y}^{(n)} \in \mathbb{R}^{(R_{n-1}Q_{n-1})\times(I_n+J_n-1)\times(R_nQ_n)}$, with fibers $\mathbf{Z}^{(n)}(s_{n-1},:,s_n) = \mathbf{X}^{(n)}(r_{n-1},:,r_n) * \mathbf{Y}^{(n)}(q_{n-1},:,q_n)$, $s_n = \overline{r_nq_n}$.
- Mode-$n$ matrix product, $\mathbf{Z} = \mathbf{X} \times_n \mathbf{A}$: $\mathbf{Z} = \mathbf{X}^{(1)} \times^1 \cdots \times^1 (\mathbf{X}^{(n)} \times_2 \mathbf{A}) \times^1 \cdots \times^1 \mathbf{X}^{(N)}$.
- Inner product, $z = \langle \mathbf{X}, \mathbf{Y} \rangle = \mathbf{Z}^{(1)}\mathbf{Z}^{(2)}\cdots\mathbf{Z}^{(N)}$: matrices $\mathbf{Z}^{(n)} = \sum_{i_n} \mathbf{X}^{(n)}_{i_n} \otimes \mathbf{Y}^{(n)}_{i_n}$ ($I_n = J_n$, for all $n$).

Table 4.6: Basic operations expressed via the strong Kronecker and C products of block matrices of TT-cores, where $\mathbf{A} = \tilde{\mathbf{A}}^{(1)}|\otimes|\cdots|\otimes|\tilde{\mathbf{A}}^{(N)}$, $\mathbf{B} = \tilde{\mathbf{B}}^{(1)}|\otimes|\cdots|\otimes|\tilde{\mathbf{B}}^{(N)}$, $\mathbf{x} = \tilde{\mathbf{X}}^{(1)}|\otimes|\cdots|\otimes|\tilde{\mathbf{X}}^{(N)}$ and $\mathbf{y} = \tilde{\mathbf{Y}}^{(1)}|\otimes|\cdots|\otimes|\tilde{\mathbf{Y}}^{(N)}$:
- Matrix addition, $\mathbf{Z} = \mathbf{A} + \mathbf{B} = [\tilde{\mathbf{A}}^{(1)}\ \ \tilde{\mathbf{B}}^{(1)}] \,|\otimes|\, (\tilde{\mathbf{A}}^{(2)} \oplus \tilde{\mathbf{B}}^{(2)}) \,|\otimes|\, \cdots \,|\otimes|\, (\tilde{\mathbf{A}}^{(N-1)} \oplus \tilde{\mathbf{B}}^{(N-1)}) \,|\otimes|\, [\tilde{\mathbf{A}}^{(N)};\ \tilde{\mathbf{B}}^{(N)}]$.
- Kronecker product, $\mathbf{Z} = \mathbf{A} \otimes \mathbf{B} = \tilde{\mathbf{A}}^{(1)}|\otimes|\cdots|\otimes|\tilde{\mathbf{A}}^{(N)}|\otimes|\tilde{\mathbf{B}}^{(1)}|\otimes|\cdots|\otimes|\tilde{\mathbf{B}}^{(N)}$.
- Inner product, $z = \mathbf{x}^T\mathbf{y} = (\tilde{\mathbf{X}}^{(1)}|\bullet|\tilde{\mathbf{Y}}^{(1)}) \,|\otimes|\, \cdots \,|\otimes|\, (\tilde{\mathbf{X}}^{(N)}|\bullet|\tilde{\mathbf{Y}}^{(N)})$.
- Matrix-by-vector product, $\mathbf{z} = \mathbf{A}\mathbf{x} = (\tilde{\mathbf{A}}^{(1)}|\bullet|\tilde{\mathbf{X}}^{(1)}) \,|\otimes|\, \cdots \,|\otimes|\, (\tilde{\mathbf{A}}^{(N)}|\bullet|\tilde{\mathbf{X}}^{(N)})$, with blocks $\mathbf{z}^{(n)}_{s_{n-1},s_n} = \mathbf{A}^{(n)}_{p_{n-1},p_n}\mathbf{x}^{(n)}_{r_{n-1},r_n}$, $s_n = \overline{p_nr_n}$.
- Matrix-by-matrix product, $\mathbf{Z} = \mathbf{A}\mathbf{B} = (\tilde{\mathbf{A}}^{(1)}|\bullet|\tilde{\mathbf{B}}^{(1)}) \,|\otimes|\, \cdots \,|\otimes|\, (\tilde{\mathbf{A}}^{(N)}|\bullet|\tilde{\mathbf{B}}^{(N)})$.
- Quadratic form, $z = \mathbf{x}^T\mathbf{A}\mathbf{x} = (\tilde{\mathbf{X}}^{(1)}|\bullet|\tilde{\mathbf{A}}^{(1)}|\bullet|\tilde{\mathbf{X}}^{(1)}) \,|\otimes|\, \cdots \,|\otimes|\, (\tilde{\mathbf{X}}^{(N)}|\bullet|\tilde{\mathbf{A}}^{(N)}|\bullet|\tilde{\mathbf{X}}^{(N)})$.

[Figure 4.10: Representation of typical cost functions by arbitrary TNs and by TT networks: (a) J(x) = y^T A x and (b) J(x) = x^T A^T A x. Note that the tensors A, X and Y can, in general, be approximated by any TNs that provide good low-rank representations.]

In a similar way, the matrix equation
$$\mathbf{Y} \cong \mathbf{A}\mathbf{X}, \qquad (4.32)$$
where $\mathbf{A} \in \mathbb{R}^{I\times J}$, $\mathbf{X} \in \mathbb{R}^{J\times K}$ and $\mathbf{Y} \in \mathbb{R}^{I\times K}$, with $I = I_1\cdots I_N$, $J = J_1\cdots J_N$ and $K = K_1\cdots K_N$, can be represented in TT formats. This is illustrated in Figure 4.9(b).
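The slice formula (4.31) for $\mathbf{y} = \mathbf{A}\mathbf{x}$ admits a direct sketch. The NumPy code below is illustrative only (our naming; matrix cores stored as arrays of shape $P_{n-1}\times I_n\times J_n\times P_n$ and vector cores as $R_{n-1}\times J_n\times R_n$); we use the standard `np.kron` ordering rather than the left Kronecker product $\otimes_L$ of the text, which simply fixes one consistent rank ordering and still yields a valid TT representation of $\mathbf{y}$:

```python
import numpy as np

def tt_matvec(A, X):
    # y = A x in TT format, Eq. (4.31): each slice of a result core is a sum
    # over j_n of Kronecker products of A-core and x-core slices, so the TT
    # ranks of y multiply to Q_n = P_n * R_n.
    Y = []
    for Ac, Xc in zip(A, X):
        P0, I, J, P1 = Ac.shape
        R0, _, R1 = Xc.shape
        Yc = np.zeros((P0 * R0, I, P1 * R1))
        for i in range(I):
            Yc[:, i, :] = sum(np.kron(Ac[:, i, j, :], Xc[:, j, :])
                              for j in range(J))
        Y.append(Yc)
    return Y
```

In practice the ranks $P_nR_n$ would immediately be recompressed by TT-rounding, as discussed later in this section.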
[Figure 4.11: Graphical illustration of the C product of two block matrices.]

The TT-cores corresponding to (4.32) are defined as $\mathbf{A}^{(n)} \in \mathbb{R}^{P_{n-1}\times I_n\times J_n\times P_n}$, $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1}\times J_n\times K_n\times R_n}$ and $\mathbf{Y}^{(n)} \in \mathbb{R}^{Q_{n-1}\times I_n\times K_n\times Q_n}$. It is straightforward to show that when the matrices $\mathbf{A} \in \mathbb{R}^{I\times J}$ and $\mathbf{X} \in \mathbb{R}^{J\times K}$ are represented in their TT formats, they can be expressed via strong Kronecker products of block matrices as $\mathbf{A} = \tilde{\mathbf{A}}^{(1)}|\otimes|\cdots|\otimes|\tilde{\mathbf{A}}^{(N)}$ and $\mathbf{X} = \tilde{\mathbf{X}}^{(1)}|\otimes|\cdots|\otimes|\tilde{\mathbf{X}}^{(N)}$, where the factor matrices are $\tilde{\mathbf{A}}^{(n)} \in \mathbb{R}^{P_{n-1}I_n\times J_nP_n}$ and $\tilde{\mathbf{X}}^{(n)} \in \mathbb{R}^{R_{n-1}J_n\times K_nR_n}$. Then the matrix $\mathbf{Y} = \mathbf{A}\mathbf{X}$ can also be expressed via the strong Kronecker products, $\mathbf{Y} = \tilde{\mathbf{Y}}^{(1)}|\otimes|\cdots|\otimes|\tilde{\mathbf{Y}}^{(N)}$, where $\tilde{\mathbf{Y}}^{(n)} = \tilde{\mathbf{A}}^{(n)}\,|\bullet|\,\tilde{\mathbf{X}}^{(n)} \in \mathbb{R}^{Q_{n-1}I_n\times K_nQ_n}$ for $n = 1, 2, \ldots, N$, with blocks $\tilde{\mathbf{Y}}^{(n)}_{q_{n-1},q_n} = \tilde{\mathbf{A}}^{(n)}_{p_{n-1},p_n}\tilde{\mathbf{X}}^{(n)}_{r_{n-1},r_n}$, where $Q_n = P_nR_n$ and $q_n = \overline{p_nr_n}$, for all $n$.
Similarly, a quadratic form, $z = \mathbf{x}^T\mathbf{A}\mathbf{x}$, for a huge symmetric matrix $\mathbf{A}$, can be computed by first computing, in TT formats, the vector $\mathbf{y} = \mathbf{A}\mathbf{x}$, followed by the inner product $\mathbf{x}^T\mathbf{y}$.

Basic operations in the TT format are summarized in Table 4.5, while Table 4.6 presents these operations expressed via the strong Kronecker and C products of block matrices of TT-cores. For more advanced and sophisticated operations in TT/QTT formats, see [112, 113, 128].

Algorithm 10: Computation of a matrix-by-vector product in the TT format
Input: Matrix $\mathbf{A} \in \mathbb{R}^{I\times J}$ and vector $\mathbf{x} \in \mathbb{R}^{J}$ in their respective TT formats, $\mathbf{A} = \langle\langle \mathbf{A}^{(1)}, \ldots, \mathbf{A}^{(N)}\rangle\rangle \in \mathbb{R}^{I_1\times J_1\times\cdots\times I_N\times J_N}$ and $\mathbf{X} = \langle\langle \mathbf{X}^{(1)}, \ldots, \mathbf{X}^{(N)}\rangle\rangle \in \mathbb{R}^{J_1\times\cdots\times J_N}$, with TT-cores $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1}\times J_n\times R_n}$ and $\mathbf{A}^{(n)} \in \mathbb{R}^{R^A_{n-1}\times I_n\times J_n\times R^A_n}$
Output: Matrix-by-vector product $\mathbf{y} = \mathbf{A}\mathbf{x}$ in the TT format, $\mathbf{Y} = \langle\langle \mathbf{Y}^{(1)}, \ldots, \mathbf{Y}^{(N)}\rangle\rangle \in \mathbb{R}^{I_1\times\cdots\times I_N}$, with cores $\mathbf{Y}^{(n)} \in \mathbb{R}^{R^Y_{n-1}\times I_n\times R^Y_n}$
  for $n = 1$ to $N$ do
    for $i_n = 1$ to $I_n$ do
      $\mathbf{Y}^{(n)}_{i_n} = \sum_{j_n=1}^{J_n} \left( \mathbf{A}^{(n)}_{i_n,j_n} \otimes_L \mathbf{X}^{(n)}_{j_n} \right)$
    end for
  end for
  return $\mathbf{y} \in \mathbb{R}^{I_1I_2\cdots I_N}$ in the TT format $\mathbf{Y} = \langle\langle \mathbf{Y}^{(1)}, \ldots, \mathbf{Y}^{(N)}\rangle\rangle$

We have shown that a major advantage of the TT decomposition is the existence of efficient algorithms for an exact representation of higher-order tensors and/or for their low-rank approximate representation with a prescribed accuracy. Similarly to the quasi-best approximation property of the HOSVD, the TT approximation $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ (with core tensors denoted by $\hat{\mathbf{X}}^{(n)} = \mathbf{G}^{(n)}$), obtained by the TT-SVD algorithm, satisfies the inequality
$$\|\mathbf{X} - \hat{\mathbf{X}}\|_2^2 \;\leq\; \sum_{n=1}^{N-1} \sum_{j > R_n} \sigma_j^2(\mathbf{X}_{<n>}), \qquad (4.33)$$
where the $\ell_2$-norm of a tensor is defined via its vectorization and $\sigma_j(\mathbf{X}_{<n>})$ denotes the $j$th largest singular value of the unfolding matrix $\mathbf{X}_{<n>}$ [158].

The two basic approaches to efficiently perform TT decompositions are based on: (1) low-rank matrix factorizations (LRMF), and (2) constrained Tucker-2 decompositions.

The most important algorithm for the TT decomposition is the TT-SVD algorithm (see Algorithm 11) [161, 216], which applies the truncated SVD sequentially to the unfolding matrices, as illustrated in Figure 4.12. Instead of the SVD, alternative and efficient LRMF algorithms can be used [50] (see also Algorithm 12).

[Figure 4.12: The TT-SVD algorithm for a 5th-order data tensor using the truncated SVD (tSVD). Instead of the SVD, any alternative LRMF algorithm can be employed, such as randomized SVD, RPCA, CUR/CA, NMF, SCA or ICA. Top panel: the tensor X of size $I_1\times I_2\times\cdots\times I_5$ is first reshaped into a long matrix $M_1$ of size $I_1\times I_2\cdots I_5$. Second panel: the tSVD is performed to produce the low-rank factorization $M_1 \cong U_1S_1V_1^T$, with the $I_1\times R_1$ factor matrix $U_1$ and the $R_1\times I_2\cdots I_5$ matrix $S_1V_1^T$. Third panel: the matrix $U_1$ becomes the first core $X^{(1)} \in \mathbb{R}^{1\times I_1\times R_1}$, while $S_1V_1^T$ is reshaped into the $R_1I_2\times I_3I_4I_5$ matrix $M_2$. Remaining panels: perform the tSVD to yield $M_2 \cong U_2S_2V_2^T$, reshape $U_2$ into an $R_1\times I_2\times R_2$ core $X^{(2)}$, and repeat the procedure until all five cores are extracted. The same procedure applies to tensors of any order.]

For example, in [162] a new approximate formula for the TT decomposition is proposed, in which an $N$th-order data tensor $\mathbf{X}$ is interpolated using a special form of cross-approximation. In fact, the TT-Cross-Approximation is analogous to the TT-SVD algorithm, but uses adaptive cross-approximation instead of the computationally more expensive SVD.
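The sequential truncated-SVD sweep just described can be sketched compactly in NumPy. This is an illustrative sketch of the procedure (our naming; the per-step threshold $\delta = \varepsilon\|\mathbf{X}\|_F/\sqrt{N-1}$ follows the error-balancing used by TT-SVD):

```python
import numpy as np

def tt_svd(T, eps=1e-12):
    # TT-SVD: sequentially apply the truncated SVD to the unfolding
    # matrices, so that ||T - T_hat||_F <= eps * ||T||_F.
    N = T.ndim
    dims = T.shape
    delta = eps / np.sqrt(max(N - 1, 1)) * np.linalg.norm(T)
    cores, R = [], 1
    M = T.reshape(R * dims[0], -1)
    for n in range(N - 1):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]  # error if truncated at k
        Rn = max(1, int(np.sum(tail > delta)))         # smallest rank within delta
        cores.append(U[:, :Rn].reshape(R, dims[n], Rn))
        M = (s[:Rn, None] * Vt[:Rn]).reshape(Rn * dims[n + 1], -1)
        R = Rn
    cores.append(M.reshape(R, dims[-1], 1))
    return cores
```

Applied to a tensor that exactly admits low TT ranks, the sweep recovers a representation with ranks no larger than those of the generating cores and reconstructs the tensor to the prescribed accuracy.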
Algorithm 11: TT-SVD decomposition using the truncated SVD (tSVD) or randomized SVD (rSVD) [158, 216]
Input: $N$th-order tensor $\mathbf{X} \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ and approximation accuracy $\varepsilon$
Output: Approximate representation of the tensor in the TT format, $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle$, such that $\|\mathbf{X} - \hat{\mathbf{X}}\|_F \leq \varepsilon$
  Unfold the tensor $\mathbf{X}$ in mode-1 as $\mathbf{M}_1 = \mathbf{X}_{(1)}$; initialize $R_0 = 1$
  for $n = 1$ to $N-1$ do
    Perform the tSVD, $[\mathbf{U}_n, \mathbf{S}_n, \mathbf{V}_n] = \operatorname{tSVD}(\mathbf{M}_n, \varepsilon/\sqrt{N-1})$
    Estimate the $n$th TT rank, $R_n = \operatorname{size}(\mathbf{U}_n, 2)$
    Reshape the orthogonal matrix $\mathbf{U}_n$ into a 3rd-order core, $\hat{\mathbf{X}}^{(n)} = \operatorname{reshape}(\mathbf{U}_n, [R_{n-1}, I_n, R_n])$
    Reshape the matrix $\mathbf{V}_n$ into $\mathbf{M}_{n+1} = \operatorname{reshape}\left(\mathbf{S}_n\mathbf{V}_n^T, [R_nI_{n+1}, \prod_{p=n+2}^{N} I_p]\right)$
  end for
  Construct the last core as $\hat{\mathbf{X}}^{(N)} = \operatorname{reshape}(\mathbf{M}_N, [R_{N-1}, I_N, 1])$
  return $\langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle$

The complexity of the cross-approximation algorithms scales linearly with the order $N$ of the data tensor.

The key idea in the Tucker-2 based approach is to reshape any $N$th-order data tensor, $\mathbf{X} \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ with $N > 3$, into a suitable 3rd-order tensor, e.g., $\tilde{\mathbf{X}} \in \mathbb{R}^{I_1\times I_N\times I_2I_3\cdots I_{N-1}}$, in order to apply the Tucker-2 decomposition as follows (see Algorithm 8 and Figure 4.13(a))
$$\tilde{\mathbf{X}} = \mathbf{G}^{(2,N-1)} \times_1 \mathbf{X}^{(1)} \times_2 \mathbf{X}^{(N)} = \mathbf{X}^{(1)} \times^1 \mathbf{G}^{(2,N-1)} \times^1 \mathbf{X}^{(N)}, \qquad (4.34)$$
which, by using the frontal slices of the involved tensors, can also be expressed in the matrix form
$$\mathbf{X}_k = \mathbf{X}^{(1)}\,\mathbf{G}_k\,\mathbf{X}^{(N)\,T}, \qquad k = 1, 2, \ldots, I_2I_3\cdots I_{N-1}. \qquad (4.35)$$
Such representations allow us to compute the tensor $\mathbf{G}^{(2,N-1)}$, the first TT-core $\mathbf{X}^{(1)}$, and the last TT-core $\mathbf{X}^{(N)}$.
[Figure 4.13: TT decomposition based on the Tucker-2/PVD model. (a) Extraction of the first and the last core. (b) The procedure is repeated sequentially for reshaped 3rd-order tensors $\tilde{G}_n$ (for $n = 2, 3, \ldots$ and $p = N-1, N-2, \ldots$). (c) Illustration of a TT decomposition for a 5th-order data tensor, using an algorithm based on sequential Tucker-2/PVD decompositions.]

Algorithm 12: TT decomposition using any efficient LRMF
Input: Tensor $\mathbf{X} \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ and approximation accuracy $\varepsilon$
Output: Approximate tensor representation in the TT format, $\hat{\mathbf{X}} \cong \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle$
  Initialize $R_0 = 1$; unfold the tensor $\mathbf{X}$ in mode-1 as $\mathbf{M}_1 = \mathbf{X}_{(1)}$
  for $n = 1$ to $N-1$ do
    Perform an LRMF, e.g., CUR or RPCA: $[\mathbf{A}_n, \mathbf{B}_n] = \operatorname{LRMF}(\mathbf{M}_n, \varepsilon)$, i.e., $\mathbf{M}_n \cong \mathbf{A}_n\mathbf{B}_n^T$
    Estimate the $n$th TT rank, $R_n = \operatorname{size}(\mathbf{A}_n, 2)$
    Reshape the matrix $\mathbf{A}_n$ into a 3rd-order core, $\hat{\mathbf{X}}^{(n)} = \operatorname{reshape}(\mathbf{A}_n, [R_{n-1}, I_n, R_n])$
    Reshape the matrix $\mathbf{B}_n$ into the $(n+1)$th unfolding matrix, $\mathbf{M}_{n+1} = \operatorname{reshape}\left(\mathbf{B}_n^T, [R_nI_{n+1}, \prod_{p=n+2}^{N} I_p]\right)$
  end for
  Construct the last core as $\hat{\mathbf{X}}^{(N)} = \operatorname{reshape}(\mathbf{M}_N, [R_{N-1}, I_N, 1])$
  return TT-cores $\langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle$

The procedure can be repeated sequentially for the reshaped tensors $\tilde{\mathbf{G}}_n = \mathbf{G}^{(n+1,N-n)}$, for $n = 1, 2, \ldots$, in order to extract the subsequent TT-cores in their matricized forms, as illustrated in Figure 4.13(b); see also the detailed step-by-step procedure in Figure 4.13(c). Such a simple recursive procedure for the TT decomposition can be used in conjunction with any efficient algorithm for Tucker-2/PVD decompositions or for the nonnegative Tucker-2 decomposition (NTD-2) (see also Section 3).
Mathematical operations in the TT format produce core tensors with ranks which are not guaranteed to be optimal with respect to the desired approximation accuracy. For example, matrix-by-vector or matrix-by-matrix products considerably increase the TT ranks, which quickly become computationally prohibitive, so that truncation or low-rank TT approximation is necessary for mathematical tractability. To this end, TT-rounding (also called truncation or recompression) may be used as a post-processing procedure to reduce the TT ranks. The TT-rounding algorithms are typically implemented via QR/SVD, with the aim of approximating, with a desired prescribed accuracy, the TT core tensors $\mathbf{G}^{(n)} = \mathbf{X}^{(n)}$ by other core tensors with the minimum possible TT ranks (see Algorithm 13).

Algorithm 13: TT rounding (recompression) [158]
Input: $N$th-order tensor $\mathbf{X} = \langle\langle \mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(N)}\rangle\rangle \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$, in a TT format with overestimated TT ranks $r_{TT} = \{R_1, R_2, \ldots, R_{N-1}\}$ and TT-cores $\mathbf{X}^{(n)} \in \mathbb{R}^{R_{n-1}\times I_n\times R_n}$; absolute tolerance $\varepsilon$ and maximum rank $R_{\max}$
Output: $N$th-order tensor $\hat{\mathbf{X}}$ with reduced TT ranks; the cores are rounded (reduced) according to the input tolerance $\varepsilon$ and/or have ranks bounded by $R_{\max}$, such that $\|\mathbf{X} - \hat{\mathbf{X}}\|_F \leq \varepsilon\|\mathbf{X}\|_F$
  Initialize $\hat{\mathbf{X}} = \mathbf{X}$ and $\delta = \varepsilon/\sqrt{N-1}$
  for $n = 1$ to $N-1$ do
    Perform the QR decomposition, $\hat{\mathbf{X}}^{(n)}_{<2>} = \mathbf{Q}_n\mathbf{R}$, with $\hat{\mathbf{X}}^{(n)}_{<2>} \in \mathbb{R}^{R_{n-1}I_n\times R_n}$
    Replace the cores, $\hat{\mathbf{X}}^{(n)}_{<2>} = \mathbf{Q}_n$ and $\hat{\mathbf{X}}^{(n+1)}_{<1>} \leftarrow \mathbf{R}\,\hat{\mathbf{X}}^{(n+1)}_{<1>}$, with $\hat{\mathbf{X}}^{(n+1)}_{<1>} \in \mathbb{R}^{R_n\times I_{n+1}R_{n+1}}$
  end for
  for $n = N$ down to $2$ do
    Perform the $\delta$-truncated SVD, $\hat{\mathbf{X}}^{(n)}_{<1>} = \mathbf{U}\operatorname{diag}\{\boldsymbol{\sigma}\}\mathbf{V}^T$
    Determine the minimum rank $\hat{R}_{n-1}$ such that $\sqrt{\sum_{r > \hat{R}_{n-1}} \sigma_r^2} \leq \delta\|\boldsymbol{\sigma}\|_2$
    Replace the cores, $\hat{\mathbf{X}}^{(n-1)}_{<2>} \leftarrow \hat{\mathbf{X}}^{(n-1)}_{<2>}\hat{\mathbf{U}}\operatorname{diag}\{\hat{\boldsymbol{\sigma}}\}$ and $\hat{\mathbf{X}}^{(n)}_{<1>} = \hat{\mathbf{V}}^T$
  end for
  return $N$th-order tensor $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$, with reduced cores $\hat{\mathbf{X}}^{(n)} \in \mathbb{R}^{\hat{R}_{n-1}\times I_n\times \hat{R}_n}$
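The two sweeps of Algorithm 13 (left-to-right QR orthogonalization, then right-to-left truncated SVDs) can be sketched as follows. This is an illustrative NumPy sketch, not the reference implementation; for brevity it omits the $R_{\max}$ cap and keeps only the tolerance-based truncation:

```python
import numpy as np

def tt_round(cores, eps=1e-12):
    # TT rounding (recompression): first orthogonalize left-to-right via QR,
    # so that the tensor norm is carried entirely by the last core; then
    # sweep right-to-left with delta-truncated SVDs to reduce the ranks.
    cores = [c.copy() for c in cores]
    N = len(cores)
    for n in range(N - 1):                       # left-to-right QR sweep
        R0, I, R1 = cores[n].shape
        Q, Rm = np.linalg.qr(cores[n].reshape(R0 * I, R1))
        cores[n] = Q.reshape(R0, I, Q.shape[1])
        cores[n + 1] = np.tensordot(Rm, cores[n + 1], axes=([1], [0]))
    delta = eps / np.sqrt(max(N - 1, 1)) * np.linalg.norm(cores[-1])
    for n in range(N - 1, 0, -1):                # right-to-left truncation sweep
        R0, I, R1 = cores[n].shape
        U, s, Vt = np.linalg.svd(cores[n].reshape(R0, I * R1),
                                 full_matrices=False)
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
        Rn = max(1, int(np.sum(tail > delta)))   # minimum rank within delta
        cores[n] = Vt[:Rn].reshape(Rn, I, R1)
        cores[n - 1] = np.tensordot(cores[n - 1], U[:, :Rn] * s[:Rn],
                                    axes=([2], [0]))
    return cores
```

A typical use case is recompressing the doubled ranks produced by a TT addition (or the multiplied ranks of a TT matrix-by-vector product) back to their minimal values.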
Note that TT-rounding is mathematically the same as the TT-SVD, but is more efficient owing to the use of the TT format. The complexity of the TT-rounding procedure is only $O(NIR^3)$, since all operations are performed in the TT format, which requires the SVD to be computed only for a relatively small matricized core tensor at each iteration. A similar approach has been developed for the HT format [74, 86, 87, 122].

4.10 Orthogonalization of the Tensor Train Network

The orthogonalization of core tensors is an essential procedure in many algorithms for the TT format [67, 70, 97, 120, 158, 196, 197].

For convenience, we divide a TT network, which represents a tensor $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$, into sub-trains. In this way, a large-scale task is replaced by easier-to-handle sub-tasks, whereby the aim is to extract a specific TT-core or its slices from the whole TT network. For this purpose, the TT sub-trains can be defined as
$$\hat{\mathbf{X}}^{<n} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(n-1)}\rangle\rangle \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_{n-1}\times R_{n-1}}, \qquad (4.36)$$
$$\hat{\mathbf{X}}^{>n} = \langle\langle \hat{\mathbf{X}}^{(n+1)}, \hat{\mathbf{X}}^{(n+2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle \in \mathbb{R}^{R_n\times I_{n+1}\times\cdots\times I_N}, \qquad (4.37)$$
while the corresponding unfolding matrices, also called interface matrices, are defined by
$$\hat{\mathbf{X}}^{\leq n} \in \mathbb{R}^{I_1I_2\cdots I_n\times R_n}, \qquad \hat{\mathbf{X}}^{>n} \in \mathbb{R}^{R_n\times I_{n+1}\cdots I_N}. \qquad (4.38)$$
The left and right unfoldings of the cores are defined as $\hat{\mathbf{X}}^{(n)}_L = \hat{\mathbf{X}}^{(n)}_{<2>} \in \mathbb{R}^{R_{n-1}I_n\times R_n}$ and $\hat{\mathbf{X}}^{(n)}_R = \hat{\mathbf{X}}^{(n)}_{<1>} \in \mathbb{R}^{R_{n-1}\times I_nR_n}$.

The $n$-orthogonality of tensors. An $N$th-order tensor in a TT format, $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle$, is called $n$-orthogonal, with $1 \leq n \leq N$, if
$$(\hat{\mathbf{X}}^{(m)}_L)^T\hat{\mathbf{X}}^{(m)}_L = \mathbf{I}_{R_m}, \qquad m = 1, \ldots, n-1,$$
$$\hat{\mathbf{X}}^{(m)}_R(\hat{\mathbf{X}}^{(m)}_R)^T = \mathbf{I}_{R_{m-1}}, \qquad m = n+1, \ldots, N. \qquad (4.40)$$
The tensor is called left-orthogonal if $n = N$ and right-orthogonal if $n = 1$. When optimizing with respect to the $n$th TT-core, it is usually assumed that all cores to the left are left-orthogonalized and all cores to the right are right-orthogonalized. Notice that if a TT-tensor $\hat{\mathbf{X}}$ is $n$-orthogonal, then the "left" and "right" interface matrices have orthonormal columns and rows, respectively, that is,
$$(\hat{\mathbf{X}}^{<n})^T\hat{\mathbf{X}}^{<n} = \mathbf{I}_{R_{n-1}}, \qquad \hat{\mathbf{X}}^{>n}(\hat{\mathbf{X}}^{>n})^T = \mathbf{I}_{R_n}. \qquad (4.41)$$
A tensor in the TT format can be orthogonalized efficiently using recursive QR and LQ decompositions (see Algorithm 14). From the above definition, for $n = N$ the algorithm performs left-orthogonalization, while for $n = 1$ it performs right-orthogonalization.

(By a TT-tensor we refer to a tensor represented in the TT format.)

Algorithm 14: Left-orthogonalization, right-orthogonalization and $n$-orthogonalization of a tensor in the TT format
Input: $N$th-order tensor $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$, with TT-cores $\hat{\mathbf{X}}^{(n)} \in \mathbb{R}^{R_{n-1}\times I_n\times R_n}$ and $R_0 = R_N = 1$
Output: The cores $\hat{\mathbf{X}}^{(1)}, \ldots, \hat{\mathbf{X}}^{(n-1)}$ become left-orthogonal, while the remaining cores are right-orthogonal, except for the core $\hat{\mathbf{X}}^{(n)}$
  for $m = 1$ to $n-1$ do
    Perform the QR decomposition, $[\mathbf{Q}, \mathbf{R}] \leftarrow \operatorname{qr}(\hat{\mathbf{X}}^{(m)}_L)$, for the unfolded cores $\hat{\mathbf{X}}^{(m)}_L \in \mathbb{R}^{R_{m-1}I_m\times R_m}$
    Replace the cores, $\hat{\mathbf{X}}^{(m)}_L \leftarrow \mathbf{Q}$ and $\hat{\mathbf{X}}^{(m+1)} \leftarrow \hat{\mathbf{X}}^{(m+1)} \times_1 \mathbf{R}$
  end for
  for $m = N$ down to $n+1$ do
    Perform the QR decomposition, $[\mathbf{Q}, \mathbf{R}] \leftarrow \operatorname{qr}((\hat{\mathbf{X}}^{(m)}_R)^T)$, for the unfolded cores $\hat{\mathbf{X}}^{(m)}_R \in \mathbb{R}^{R_{m-1}\times I_mR_m}$
    Replace the cores, $\hat{\mathbf{X}}^{(m)}_R \leftarrow \mathbf{Q}^T$ and $\hat{\mathbf{X}}^{(m-1)} \leftarrow \hat{\mathbf{X}}^{(m-1)} \times_3 \mathbf{R}^T$
  end for
  return TT-cores with $(\hat{\mathbf{X}}^{(m)}_L)^T\hat{\mathbf{X}}^{(m)}_L = \mathbf{I}_{R_m}$ for $m = 1, 2, \ldots, n-1$, and $\hat{\mathbf{X}}^{(m)}_R(\hat{\mathbf{X}}^{(m)}_R)^T = \mathbf{I}_{R_{m-1}}$ for $m = N, N-1, \ldots, n+1$

Finally, we present an efficient algorithm for the TT decomposition, referred to as the Alternating Single-Core Update (ASCU), which sequentially optimizes a single TT-core tensor while keeping the other TT-cores fixed, in a manner similar to the modified ALS [170].

Assume that the TT-tensor $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle$ is left- and right-orthogonalized up to $\hat{\mathbf{X}}^{(n)}$, i.e., the unfolding matrices $\hat{\mathbf{X}}^{(k)}_{<2>}$ for $k = 1, \ldots, n-1$ have orthonormal columns and the unfolding matrices $\hat{\mathbf{X}}^{(m)}_{(1)}$ for $m = n+1, \ldots, N$ have orthonormal rows.
Then the Frobenius norm of the TT-tensor $\hat{\mathbf{X}}$ is equivalent to the Frobenius norm of $\hat{\mathbf{X}}^{(n)}$, that is, $\|\hat{\mathbf{X}}\|_F = \|\hat{\mathbf{X}}^{(n)}\|_F$, so that the (squared) Frobenius norm of the approximation error between a data tensor $\mathbf{X}$ and $\hat{\mathbf{X}}$ can be written as
$$J(\hat{\mathbf{X}}^{(n)}) = \|\mathbf{X} - \hat{\mathbf{X}}\|_F^2 = \|\mathbf{X}\|_F^2 + \|\hat{\mathbf{X}}\|_F^2 - 2\langle \mathbf{X}, \hat{\mathbf{X}}\rangle = \|\mathbf{X}\|_F^2 + \|\hat{\mathbf{X}}^{(n)}\|_F^2 - 2\langle \mathbf{C}^{(n)}, \hat{\mathbf{X}}^{(n)}\rangle = \|\mathbf{X}\|_F^2 - \|\mathbf{C}^{(n)}\|_F^2 + \|\mathbf{C}^{(n)} - \hat{\mathbf{X}}^{(n)}\|_F^2, \qquad (4.42)$$
for $n = 1, \ldots, N$, where $\mathbf{C}^{(n)} \in \mathbb{R}^{R_{n-1}\times I_n\times R_n}$ represents the tensor contraction of $\mathbf{X}$ and $\hat{\mathbf{X}}$ along all modes but mode $n$, as illustrated in Figure 4.14. The tensor $\mathbf{C}^{(n)}$ can be efficiently computed through left contractions along the first $(n-1)$ modes and right contractions along the last $(N-n)$ modes, expressed as
$$\mathbf{L}^{<n} = \hat{\mathbf{X}}^{<n} \ltimes_{n-1} \mathbf{X}, \qquad \mathbf{C}^{(n)} = \mathbf{L}^{<n} \rtimes_{N-n} \hat{\mathbf{X}}^{>n}, \qquad (4.43)$$
where the symbols $\ltimes_n$ and $\rtimes_m$ stand for the tensor contractions between two $N$th-order tensors along their first $n$ modes and their last $m = N-n$ modes, respectively.

The optimization problem in (4.42) is usually performed subject to the constraint
$$\|\mathbf{X} - \hat{\mathbf{X}}\|_F \leq \varepsilon, \qquad (4.44)$$
such that the TT rank of $\hat{\mathbf{X}}$ is minimal. Observe that the constraint in (4.44), for left- and right-orthogonalized TT-cores, is equivalent to the set of sub-constraints
$$\|\mathbf{C}^{(n)} - \hat{\mathbf{X}}^{(n)}\|_F \leq \varepsilon_n, \qquad n = 1, \ldots, N, \qquad (4.45)$$
whereby the $n$th core $\hat{\mathbf{X}}^{(n)} \in \mathbb{R}^{R_{n-1}\times I_n\times R_n}$ should have minimum ranks $R_{n-1}$ and $R_n$. Furthermore, $\varepsilon_n^2 = \varepsilon^2 - \|\mathbf{X}\|_F^2 + \|\mathbf{C}^{(n)}\|_F^2$ is assumed to be non-negative. Finally, we can formulate the following sequential optimization problem
$$\min\ (R_{n-1} \cdot R_n), \qquad \text{s.t. } \|\mathbf{C}^{(n)} - \hat{\mathbf{X}}^{(n)}\|_F \leq \varepsilon_n, \qquad n = 1, 2, \ldots, N. \qquad (4.46)$$
This can be achieved by expressing the TT-core tensor $\hat{\mathbf{X}}^{(n)}$ as a TT-tensor of three factors, i.e., in a Tucker-2 format, given by $\hat{\mathbf{X}}^{(n)} = \mathbf{A}_n \times^1 \tilde{\mathbf{X}}^{(n)} \times^1 \mathbf{B}_n$.

[Figure 4.14: Illustration of the contraction of the data tensor X with the sub-trains of the TT-tensor, yielding the left-side contracted tensor $L^{<n}$ and the tensor $C^{(n)}$.]
Algorithm 15: The Alternating Single-Core Update (ASCU) algorithm [170]
Input: Data tensor $\mathbf{X} \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ and approximation accuracy $\varepsilon$
Output: TT-tensor $\hat{\mathbf{X}} = \hat{\mathbf{X}}^{(1)} \times^1 \hat{\mathbf{X}}^{(2)} \times^1 \cdots \times^1 \hat{\mathbf{X}}^{(N)}$ of minimum TT rank, such that $\|\mathbf{X} - \hat{\mathbf{X}}\|_F \leq \varepsilon$
  Initialize $\hat{\mathbf{X}} = \langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle$
  repeat
    for $n = 1, 2, \ldots, N-1$ do
      Compute the contracted tensor $\mathbf{C}^{(n)} = \mathbf{L}^{<n} \rtimes_{N-n} \hat{\mathbf{X}}^{>n}$
      Solve a Tucker-2 decomposition, $\|\mathbf{C}^{(n)} - \mathbf{A}_n \times^1 \hat{\mathbf{X}}^{(n)} \times^1 \mathbf{B}_n\|_F \leq \varepsilon_n$
      Adjust the adjacent cores, $\hat{\mathbf{X}}^{(n-1)} \leftarrow \hat{\mathbf{X}}^{(n-1)} \times^1 \mathbf{A}_n$ and $\hat{\mathbf{X}}^{(n+1)} \leftarrow \mathbf{B}_n \times^1 \hat{\mathbf{X}}^{(n+1)}$
      Perform left-orthogonalization of $\hat{\mathbf{X}}^{(n)}$
      Update the left-side contracted tensors, $\mathbf{L}^{<n} \leftarrow \mathbf{A}_n^T \times_1 \mathbf{L}^{<n}$ and $\mathbf{L}^{<(n+1)} \leftarrow \hat{\mathbf{X}}^{(n)} \ltimes_2 \mathbf{L}^{<n}$
    end for
    for $n = N, N-1, \ldots, 2$ do
      Compute the contracted tensor $\mathbf{C}^{(n)} = \mathbf{L}^{<n} \rtimes_{N-n} \hat{\mathbf{X}}^{>n}$
      Solve a constrained Tucker-2 decomposition, $\|\mathbf{C}^{(n)} - \mathbf{A}_n \times^1 \hat{\mathbf{X}}^{(n)} \times^1 \mathbf{B}_n\|_F \leq \varepsilon_n$
      Adjust the adjacent cores, $\hat{\mathbf{X}}^{(n-1)} \leftarrow \hat{\mathbf{X}}^{(n-1)} \times^1 \mathbf{A}_n$ and $\hat{\mathbf{X}}^{(n+1)} \leftarrow \mathbf{B}_n \times^1 \hat{\mathbf{X}}^{(n+1)}$
      Perform right-orthogonalization of $\hat{\mathbf{X}}^{(n)}$
    end for
  until a stopping criterion is met
  return $\langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle$

The left-side contracted tensors can be updated through a progressive contraction of the form [101, 182]
$$\mathbf{L}^{<n} = \hat{\mathbf{X}}^{(n-1)} \ltimes_2 \mathbf{L}^{<(n-1)}, \qquad (4.47)$$
where $\mathbf{L}^{<1} = \mathbf{X}$.

Alternatively, instead of adjusting the two TT ranks, $R_{n-1}$ and $R_n$, of $\hat{\mathbf{X}}^{(n)}$, we can update only one rank, either $R_{n-1}$ or $R_n$, corresponding to the right-to-left or the left-to-right update order, respectively. Assuming that the core tensors are updated in the left-to-right order, we need to find the core $\hat{\mathbf{X}}^{(n)}$ which has a minimum rank $R_n$ and satisfies the constraints $\|\mathbf{C}^{(n)} - \hat{\mathbf{X}}^{(n)} \times^1 \mathbf{B}_n\|_F \leq \varepsilon_n$, $n = 1, \ldots, N$.
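The rank-selection rule that drives these truncations amounts to keeping the smallest number of leading singular values whose cumulative energy reaches $\|\mathbf{X}\|_F^2 - \varepsilon^2$, cf. (4.48). A tiny illustrative helper (our naming, not from [170]):

```python
import numpy as np

def truncation_rank(s, norm_X, eps):
    # Smallest rank R* such that sum_{r <= R*} sigma_r^2 >= ||X||_F^2 - eps^2,
    # i.e., the residual energy left out is at most eps^2 (cf. Eq. (4.48)).
    # s: singular values in non-increasing order.
    energy = np.cumsum(np.asarray(s, dtype=float) ** 2)
    target = norm_X ** 2 - eps ** 2
    return int(np.searchsorted(energy, target) + 1)
```

For instance, for singular values $(3, 2, 1)$, where the total energy is $14$, a tolerance $\varepsilon = 1$ requires the retained energy to reach $13$, which is achieved with the first two singular values.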
Algorithm 16: The Alternating Single-Core Update algorithm (one-side rank adjustment) [170]
Input: Data tensor $\mathbf{X} \in \mathbb{R}^{I_1\times I_2\times\cdots\times I_N}$ and approximation accuracy $\varepsilon$
Output: TT-tensor $\hat{\mathbf{X}} = \hat{\mathbf{X}}^{(1)} \times^1 \hat{\mathbf{X}}^{(2)} \times^1 \cdots \times^1 \hat{\mathbf{X}}^{(N)}$ of minimum TT rank, such that $\|\mathbf{X} - \hat{\mathbf{X}}\|_F \leq \varepsilon$
  Initialize the TT-cores $\hat{\mathbf{X}}^{(n)}$, for all $n$
  repeat
    for $n = 1, 2, \ldots, N-1$ do
      Compute the contracted tensor $\mathbf{C}^{(n)} = \mathbf{L}^{<n} \rtimes_{N-n} \hat{\mathbf{X}}^{>n}$
      Perform the truncated SVD, $\|[\mathbf{C}^{(n)}]_{<2>} - \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\|_F \leq \varepsilon_n$
      Update $\hat{\mathbf{X}}^{(n)} = \operatorname{reshape}(\mathbf{U}, [R_{n-1}, I_n, R_n])$
      Adjust the adjacent core, $\hat{\mathbf{X}}^{(n+1)} \leftarrow (\boldsymbol{\Sigma}\mathbf{V}^T) \times^1 \hat{\mathbf{X}}^{(n+1)}$
      Update the left-side contracted tensor, $\mathbf{L}^{<(n+1)} \leftarrow \hat{\mathbf{X}}^{(n)} \ltimes_2 \mathbf{L}^{<n}$
    end for
    for $n = N, N-1, \ldots, 2$ do
      Compute the contracted tensor $\mathbf{C}^{(n)} = \mathbf{L}^{<n} \rtimes_{N-n} \hat{\mathbf{X}}^{>n}$
      Perform the truncated SVD, $\|[\mathbf{C}^{(n)}]_{(1)} - \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\|_F \leq \varepsilon_n$
      Update $\hat{\mathbf{X}}^{(n)} = \operatorname{reshape}(\mathbf{V}^T, [R_{n-1}, I_n, R_n])$
      Adjust the adjacent core, $\hat{\mathbf{X}}^{(n-1)} \leftarrow \hat{\mathbf{X}}^{(n-1)} \times^1 (\mathbf{U}\boldsymbol{\Sigma})$
    end for
  until a stopping criterion is met
  return $\langle\langle \hat{\mathbf{X}}^{(1)}, \hat{\mathbf{X}}^{(2)}, \ldots, \hat{\mathbf{X}}^{(N)}\rangle\rangle$

This problem reduces to the truncated SVD of the mode-$\{1,2\}$ matricization of $\mathbf{C}^{(n)}$ with an accuracy $\varepsilon_n$, that is,
$$[\mathbf{C}^{(n)}]_{<2>} \approx \mathbf{U}_n\boldsymbol{\Sigma}_n\mathbf{V}_n^T,$$
where $\boldsymbol{\Sigma}_n = \operatorname{diag}(\sigma_{n,1}, \ldots, \sigma_{n,R_n^\star})$. Here, for the new optimized rank $R_n^\star$, the following holds
$$\sum_{r=1}^{R_n^\star} \sigma_{n,r}^2 \;\geq\; \|\mathbf{X}\|_F^2 - \varepsilon^2 \;>\; \sum_{r=1}^{R_n^\star-1} \sigma_{n,r}^2. \qquad (4.48)$$
The core tensor $\hat{\mathbf{X}}^{(n)}$ is then updated by reshaping $\mathbf{U}_n$ into an order-3 tensor of size $R_{n-1}\times I_n\times R_n^\star$, while the core $\hat{\mathbf{X}}^{(n+1)}$ needs to be adjusted accordingly, as
$$\hat{\mathbf{X}}^{(n+1)\star} = (\boldsymbol{\Sigma}_n\mathbf{V}_n^T) \times^1 \hat{\mathbf{X}}^{(n+1)}. \qquad (4.49)$$
When the algorithm updates the core tensors in the right-to-left order, we update $\hat{\mathbf{X}}^{(n)}$ by using the $R_{n-1}^\star$ leading right singular vectors of the mode-1 matricization of $\mathbf{C}^{(n)}$, and adjust the core $\hat{\mathbf{X}}^{(n-1)}$ accordingly, that is,
$$[\mathbf{C}^{(n)}]_{(1)} \cong \mathbf{U}_n\boldsymbol{\Sigma}_n\mathbf{V}_n^T,$$
$$\hat{\mathbf{X}}^{(n)\star} = \operatorname{reshape}(\mathbf{V}_n^T, [R_{n-1}^\star, I_n, R_n]),$$
$$\hat{\mathbf{X}}^{(n-1)\star} = \hat{\mathbf{X}}^{(n-1)} \times^1 (\mathbf{U}_n\boldsymbol{\Sigma}_n). \qquad (4.50)$$

To summarise, the ASCU method performs a sequential update of one core while adjusting (or rotating) another core; hence, it updates two cores at a time (for details see Algorithm 16).

The ASCU algorithm can be implemented in an even more efficient way if the data tensor $\mathbf{X}$ is already given in a TT format (with non-optimal TT ranks for the prescribed accuracy). Detailed MATLAB implementations and other variants of the TT decomposition algorithm are provided in [170].

Chapter 5. Discussion and Conclusions
Chapter 5
Discussion and Conclusions

In Part 1 of this monograph, we have provided a systematic and example-rich guide to the basic properties and applications of tensor network methodologies, and have demonstrated their promise as a tool for the analysis of extreme-scale multidimensional data. Our main aim has been to illustrate that, owing to the intrinsic compression ability that stems from the distributed way in which they represent data and process information, TNs can be naturally employed for linear/multilinear dimensionality reduction. Indeed, current applications of TNs include generalized multivariate regression, compressed sensing, multi-way blind source separation, sparse representation and coding, feature extraction, classification, clustering and data fusion.

With multilinear algebra as their mathematical backbone, TNs have been shown to have intrinsic advantages over the flat two-dimensional view provided by matrices, including the ability to model both strong and weak couplings among multiple variables, and to cater for multimodal, incomplete and noisy data.

In Part 2 of this monograph we introduce a scalable framework for distributed implementation of optimization algorithms, in order to transform huge-scale optimization problems into linked small-scale optimization sub-problems of the same type. In that sense, TNs can be seen as a natural bridge between small-scale and very large-scale optimization paradigms, which allows for any efficient standard numerical algorithm to be applied to such local optimization sub-problems.

Although research on tensor networks for dimensionality reduction and optimization problems is only emerging, given that in many modern applications multiway arrays (tensors) arise, either explicitly or indirectly, through the tensorization of vectors and matrices, we foresee this material serving as a useful foundation for further studies on a variety of machine learning problems for data of otherwise prohibitively large volume, variety, or veracity.
We also hope that the readers will find the approaches presented in this monograph helpful in advancing seamlessly from numerical linear algebra to numerical multilinear algebra.

Bibliography

[1] E. Acar and B. Yener. Unsupervised multiway data analysis: A literature survey.
IEEE Transactions on Knowledge and Data Engineering, 21:6–20, 2009.
[2] I. Affleck, T. Kennedy, E.H. Lieb, and H. Tasaki. Rigorous results on valence-bond ground states in antiferromagnets. Physical Review Letters, 59(7):799, 1987.
[3] A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.
[4] D. Anderson, S. Du, M. Mahoney, C. Melgaard, K. Wu, and M. Gu. Spectral gap error bounds for improving CUR matrix decomposition and the Nyström method. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 19–27, 2015.
[5] W. Austin, G. Ballard, and T.G. Kolda. Parallel tensor compression for large-scale scientific data. arXiv preprint arXiv:1510.06689, 2015.
[6] F.R. Bach and M.I. Jordan. Kernel independent component analysis. The Journal of Machine Learning Research, 3:1–48, 2003.
[7] M. Bachmayr, R. Schneider, and A. Uschmajew. Tensor networks and hierarchical tensors for the solution of high-dimensional partial differential equations. Foundations of Computational Mathematics, 16(6):1423–1472, 2016.
[8] B.W. Bader and T.G. Kolda. MATLAB tensor toolbox version 2.6, February 2015.
[9] J. Ballani and L. Grasedyck. Tree adaptive approximation in the hierarchical tensor format. SIAM Journal on Scientific Computing, 36(4):A1415–A1431, 2014.
[10] J. Ballani, L. Grasedyck, and M. Kluge. A review on adaptive low-rank approximation techniques in the hierarchical tensor format. In Extraction of Quantifiable Information from Complex Systems, pages 195–210. Springer, 2014.
[11] G. Ballard, A.R. Benson, A. Druinsky, B. Lipshitz, and O. Schwartz. Improving the numerical stability of fast matrix multiplication algorithms. arXiv preprint arXiv:1507.00687, 2015.
[12] G. Ballard, A. Druinsky, N. Knight, and O. Schwartz. Brief announcement: Hypergraph partitioning for parallel sparse matrix-matrix multiplication. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, pages 86–88. ACM, 2015.
[13] G. Barcza, Ö. Legeza, K.H. Marti, and M. Reiher. Quantum-information analysis of electronic states of different molecular structures. Physical Review A, 83(1):012508, 2011.
[14] K. Batselier, H. Liu, and N. Wong. A constructive algorithm for decomposing a tensor into a finite sum of orthonormal rank-1 terms.
SIAM Journal on Matrix Analysis and Applications, 36(3):1315–1337, 2015.
[15] K. Batselier and N. Wong. A constructive arbitrary-degree Kronecker product decomposition of tensors. arXiv preprint arXiv:1507.08805, 2015.
[16] M. Bebendorf. Adaptive cross-approximation of multivariate functions. Constructive Approximation, 34(2):149–179, 2011.
[17] M. Bebendorf, C. Kuske, and R. Venn. Wideband nested cross approximation for Helmholtz problems. Numerische Mathematik, 130(1):1–34, 2015.
[18] R.E. Bellman. Adaptive Control Processes. Princeton University Press, Princeton, NJ, 1961.
[19] P. Benner, V. Khoromskaia, and B.N. Khoromskij. A reduced basis approach for calculation of the Bethe–Salpeter excitation energies by using low-rank tensor factorisations. Molecular Physics, 114(7-8):1148–1161, 2016.
[20] A.R. Benson, J.D. Lee, B. Rajwa, and D.F. Gleich. Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices. In Proceedings of Neural Information Processing Systems (NIPS), pages 945–953, 2014.
[21] D. Bini. Tensor and border rank of certain classes of matrices and the fast evaluation of determinant, inverse matrix and eigenvalues. Calcolo, 22(1):209–228, 1985.
[22] M. Bolten, K. Kahl, and S. Sokolović. Multigrid methods for tensor structured Markov chains with low rank approximation. SIAM Journal on Scientific Computing, 38(2):A649–A667, 2016.
[23] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[24] A. Bruckstein, D. Donoho, and M. Elad. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review, 51(1):34–81, 2009.
[25] H.-J. Bungartz and M. Griebel. Sparse grids.
Acta Numerica, 13:147–269, 2004.
[26] C. Caiafa and A. Cichocki. Generalizing the column-row matrix decomposition to multi-way arrays. Linear Algebra and its Applications, 433(3):557–573, 2010.
[27] C. Caiafa and A. Cichocki. Computing sparse representations of multidimensional signals using Kronecker bases. Neural Computation, 25(1):186–220, 2013.
[28] C. Caiafa and A. Cichocki. Stable, robust, and super-fast reconstruction of tensors using multi-way projections. IEEE Transactions on Signal Processing, 63(3):780–793, 2015.
[29] J.D. Carroll and J.-J. Chang. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika, 35(3):283–319, 1970.
[30] V. Cevher, S. Becker, and M. Schmidt. Convex optimization for big data: Scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Processing Magazine, 31(5):32–43, 2014.
[31] G. Chabriel, M. Kleinsteuber, E. Moreau, H. Shen, P. Tichavsky, and A. Yeredor. Joint matrix decompositions and blind source separation: A survey of methods, identification, and applications. IEEE Signal Processing Magazine, 31(3):34–43, 2014.
[32] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
[33] T.-L. Chen, D.D. Chang, S.-Y. Huang, H. Chen, C. Lin, and W. Wang. Integrating multiple random sketches for singular value decomposition. arXiv e-prints, 2016.
[34] H. Cho, D. Venturi, and G.E. Karniadakis. Numerical methods for high-dimensional probability density function equations. Journal of Computational Physics, 305:817–837, 2016.
[35] J.H. Choi and S. Vishwanathan. DFacTo: Distributed factorization of tensors. In Advances in Neural Information Processing Systems, pages 1296–1304, 2014.
[36] W. Chu and Z. Ghahramani. Probabilistic models for incomplete multi-dimensional arrays. In
JMLR Workshop and Conference Proceedings Volume 5: AISTATS 2009, volume 5, pages 89–96. Microtome Publishing / Journal of Machine Learning Research, 2009.
[37] A. Cichocki. Era of big data processing: A new approach via tensor networks and tensor decompositions (invited). In Proceedings of the International Workshop on Smart Info-Media Systems in Asia (SISA 2013), September 2013.
[38] A. Cichocki. Tensor decompositions: A new concept in brain data analysis? arXiv preprint arXiv:1305.0395, 2013.
[39] A. Cichocki. Tensor networks for big data analytics and large-scale optimization problems. arXiv preprint arXiv:1407.3124, 2014.
[40] A. Cichocki and S. Amari. Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. John Wiley & Sons, Ltd, 2003.
[41] A. Cichocki, S. Cruces, and S. Amari. Log-determinant divergences revisited: Alpha-beta and gamma log-det divergences. Entropy, 17(5):2988–3034, 2015.
[42] A. Cichocki, D. Mandic, C. Caiafa, A.H. Phan, G. Zhou, Q. Zhao, and L. De Lathauwer. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2):145–163, 2015.
[43] A. Cichocki, R. Zdunek, A.-H. Phan, and S. Amari. Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley, Chichester, 2009.
[44] N. Cohen, O. Sharir, and A. Shashua. On the expressive power of deep learning: A tensor analysis. pages 698–728, 2016.
[45] N. Cohen and A. Shashua. Convolutional rectifier networks as generalized tensor decompositions. In Proceedings of the 33rd International Conference on Machine Learning, pages 955–963, 2016.
[46] P. Comon. Tensors: a brief introduction. IEEE Signal Processing Magazine, 31(3):44–53, 2014.
[47] P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, 2010.
[48] P.G. Constantine and D.F. Gleich. Tall and skinny QR factorizations in MapReduce architectures. In Proceedings of the Second International Workshop on MapReduce and its Applications, pages 43–50. ACM, 2011.
[49] P.G. Constantine, D.F. Gleich, Y. Hou, and J. Templeton. Model reduction with MapReduce-enabled tall and skinny singular value decomposition.
SIAM Journal on Scientific Computing, 36(5):S166–S191, 2014.
[50] E. Corona, A. Rahimian, and D. Zorin. A Tensor-Train accelerated solver for integral equations in complex geometries. arXiv preprint arXiv:1511.06029, November 2015.
[51] C. Crainiceanu, B. Caffo, S. Luo, V. Zipunnikov, and N. Punjabi. Population value decomposition, a framework for the analysis of image populations. Journal of the American Statistical Association, 106(495):775–790, 2011.
[52] A. Critch and J. Morton. Algebraic geometry of matrix product states. Symmetry, Integrability and Geometry: Methods and Applications (SIGMA), 10:095, 2014.
[53] A.J. Critch. Algebraic Geometry of Hidden Markov and Related Models. PhD thesis, University of California, Berkeley, 2013.
[54] A.L.F. de Almeida, G. Favier, J.C.M. Mota, and J.P.C.L. da Costa. Overview of tensor decompositions with applications to communications. In R.F. Coelho, V.H. Nascimento, R.L. de Queiroz, J.M.T. Romano, and C.C. Cavalcante, editors, Signals and Images: Advances and Results in Speech, Estimation, Compression, Recognition, Filtering, and Processing, chapter 12, pages 325–355. CRC Press, 2015.
[55] F. De la Torre. A least-squares framework for component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(6):1041–1055, 2012.
[56] L. De Lathauwer. A link between the canonical decomposition in multilinear algebra and simultaneous matrix diagonalization. SIAM Journal on Matrix Analysis and Applications, 28:642–666, 2006.
[57] L. De Lathauwer. Decompositions of a higher-order tensor in block terms — Part I and II. SIAM Journal on Matrix Analysis and Applications, 30(3):1022–1066, 2008. Special Issue on Tensor Decompositions and Applications.
[58] L. De Lathauwer. Blind separation of exponential polynomials and the decomposition of a tensor in rank-$(L_r, L_r, 1)$ terms. SIAM Journal on Matrix Analysis and Applications, 32(4):1451–1474, 2011.
[59] L. De Lathauwer, B. De Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21:1253–1278, 2000.
[60] L. De Lathauwer, B. De Moor, and J. Vandewalle. On the best rank-1 and rank-$(R_1, R_2, \ldots, R_N)$ approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000.
[61] L. De Lathauwer and D. Nion. Decompositions of a higher-order tensor in block terms – Part III: Alternating least squares algorithms. SIAM Journal on Matrix Analysis and Applications, 30(3):1067–1083, 2008.
[62] W. de Launey and J. Seberry. The strong Kronecker product.
Journal of Combinatorial Theory, Series A, 66(2):192–213, 1994.
[63] V. de Silva and L.-H. Lim. Tensor rank and the ill-posedness of the best low-rank approximation problem. SIAM Journal on Matrix Analysis and Applications, 30:1084–1127, 2008.
[64] A. Desai, M. Ghashami, and J.M. Phillips. Improved practical matrix sketching with guarantees. IEEE Transactions on Knowledge and Data Engineering, 28(7):1678–1690, 2016.
[65] I.S. Dhillon. Fast Newton-type methods for nonnegative matrix and tensor approximation. The NSF Workshop, Future Directions in Tensor-Based Computation and Modeling, 2009.
[66] E. Di Napoli, D. Fabregat-Traver, G. Quintana-Ortí, and P. Bientinesi. Towards an efficient use of the BLAS library for multilinear tensor contractions. Applied Mathematics and Computation, 235:454–468, 2014.
[67] S.V. Dolgov. Tensor Product Methods in Numerical Simulation of High-dimensional Dynamical Problems. PhD thesis, Faculty of Mathematics and Informatics, University of Leipzig, Germany, 2014.
[68] S.V. Dolgov and B.N. Khoromskij. Two-level QTT-Tucker format for optimized tensor calculus. SIAM Journal on Matrix Analysis and Applications, 34(2):593–623, 2013.
[69] S.V. Dolgov and B.N. Khoromskij. Simultaneous state-time approximation of the chemical master equation using tensor product formats. Numerical Linear Algebra with Applications, 22(2):197–219, 2015.
[70] S.V. Dolgov, B.N. Khoromskij, I.V. Oseledets, and D.V. Savostyanov. Computation of extreme eigenvalues in higher dimensions using block tensor train format. Computer Physics Communications, 185(4):1207–1216, 2014.
[71] S.V. Dolgov and D.V. Savostyanov. Alternating minimal energy methods for linear systems in higher dimensions. SIAM Journal on Scientific Computing, 36(5):A2248–A2271, 2014.
[72] P. Drineas and M.W. Mahoney. A randomized algorithm for a tensor-based generalization of the singular value decomposition. Linear Algebra and its Applications, 420(2):553–571, 2007.
[73] G. Ehlers, J. Sólyom, Ö. Legeza, and R.M. Noack. Entanglement structure of the Hubbard model in momentum space.
Physical Review B, 92(23):235116, 2015.
[74] M. Espig, M. Schuster, A. Killaitis, N. Waldren, P. Wähnert, S. Handschuh, and H. Auer. TensorCalculus library, 2012.
[75] F. Esposito, T. Scarabino, A. Hyvärinen, J. Himberg, E. Formisano, S. Comani, G. Tedeschi, R. Goebel, E. Seifritz, and F. Di Salle. Independent component analysis of fMRI group studies by self-organizing clustering. NeuroImage, 25(1):193–205, 2005.
[76] G. Evenbly and G. Vidal. Algorithms for entanglement renormalization. Physical Review B, 79(14):144108, 2009.
[77] G. Evenbly and S.R. White. Entanglement renormalization and wavelets. Physical Review Letters, 116(14):140403, 2016.
[78] H. Fanaee-T and J. Gama. Tensor-based anomaly detection: An interdisciplinary survey. Knowledge-Based Systems, 2016.
[79] G. Favier and A. de Almeida. Overview of constrained PARAFAC models. EURASIP Journal on Advances in Signal Processing, 2014(1):1–25, 2014.
[80] J. Garcke, M. Griebel, and M. Thess. Data mining with sparse grids. Computing, 67(3):225–253, 2001.
[81] S. Garreis and M. Ulbrich. Constrained optimization with low-rank tensors and applications to parametric problems with PDEs. SIAM Journal on Scientific Computing, (accepted), 2016.
[82] M. Ghashami, E. Liberty, and J.M. Phillips. Efficient frequent directions algorithm for sparse matrices. arXiv preprint arXiv:1602.00412, 2016.
[83] V. Giovannetti, S. Montangero, and R. Fazio. Quantum multiscale entanglement renormalization ansatz channels. Physical Review Letters, 101(18):180503, 2008.
[84] S.A. Goreinov, E.E. Tyrtyshnikov, and N.L. Zamarashkin. A theory of pseudo-skeleton approximations. Linear Algebra and its Applications, 261:1–21, 1997.
[85] S.A. Goreinov, N.L. Zamarashkin, and E.E. Tyrtyshnikov. Pseudo-skeleton approximations by matrices of maximum volume.
Mathematical Notes, 62(4):515–519, 1997.
[86] L. Grasedyck. Hierarchical singular value decomposition of tensors. SIAM Journal on Matrix Analysis and Applications, 31(4):2029–2054, 2010.
[87] L. Grasedyck, D. Kessner, and C. Tobler. A literature survey of low-rank tensor approximation techniques. GAMM-Mitteilungen, 36:53–78, 2013.
[88] A.R. Groves, C.F. Beckmann, S.M. Smith, and M.W. Woolrich. Linked independent component analysis for multimodal data fusion. NeuroImage, 54(1):2198–2217, 2011.
[89] Z.-C. Gu, M. Levin, B. Swingle, and X.-G. Wen. Tensor-product representations for string-net condensed states. Physical Review B, 79(8):085118, 2009.
[90] M. Haardt, F. Roemer, and G. Del Galdo. Higher-order SVD based subspace estimation to improve the parameter estimation accuracy in multi-dimensional harmonic retrieval problems. IEEE Transactions on Signal Processing, 56:3198–3213, July 2008.
[91] W. Hackbusch. Tensor Spaces and Numerical Tensor Calculus, volume 42 of Springer Series in Computational Mathematics. Springer, Heidelberg, 2012.
[92] W. Hackbusch and S. Kühn. A new scheme for the tensor representation. Journal of Fourier Analysis and Applications, 15(5):706–722, 2009.
[93] N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[94] S. Handschuh.
Numerical Methods in Tensor Networks. PhD thesis, Faculty of Mathematics and Informatics, University of Leipzig, Germany, 2015.
[95] R.A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an explanatory multimodal factor analysis. UCLA Working Papers in Phonetics, 16:1–84, 1970.
[96] F.L. Hitchcock. Multiple invariants and generalized rank of a $p$-way matrix or tensor. Journal of Mathematics and Physics, 7:39–79, 1927.
[97] S. Holtz, T. Rohwedder, and R. Schneider. The alternating linear scheme for tensor optimization in the tensor train format. SIAM Journal on Scientific Computing, 34(2), 2012.
[98] M. Hong, M. Razaviyayn, Z.Q. Luo, and J.S. Pang. A unified algorithmic framework for block-structured optimization involving big data with applications in machine learning and signal processing. IEEE Signal Processing Magazine, 33(1):57–77, 2016.
[99] H. Huang, C. Ding, D. Luo, and T. Li. Simultaneous tensor subspace selection and clustering: The equivalence of high order SVD and K-means clustering. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 327–335. ACM, 2008.
[100] R. Hübener, V. Nebendahl, and W. Dür. Concatenated tensor network states. New Journal of Physics, 12(2):025004, 2010.
[101] C. Hubig, I.P. McCulloch, U. Schollwöck, and F.A. Wolf. Strictly single-site DMRG algorithm with subspace expansion. Physical Review B, 91(15):155115, 2015.
[102] T. Huckle, K. Waldherr, and T. Schulte-Herbrüggen. Computations in quantum tensor networks. Linear Algebra and its Applications, 438(2):750–781, 2013.
[103] A. Hyvärinen. Independent component analysis: Recent advances. Philosophical Transactions of the Royal Society A, 371(1984):20110534, 2013.
[104] I. Jeon, E.E. Papalexakis, C. Faloutsos, L. Sael, and U. Kang. Mining billion-scale tensors: Algorithms and discoveries. The VLDB Journal, pages 1–26, 2016.
[105] B. Jiang, F. Yang, and S. Zhang. Tensor and its Tucker core: The invariance relationships. arXiv e-prints arXiv:1601.01469, January 2016.
[106] U. Kang, E.E. Papalexakis, A. Harpale, and C. Faloutsos. GigaTensor: Scaling tensor analysis up by 100 times - algorithms and discoveries. In
Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '12), pages 316–324, August 2012.
[107] Y.-J. Kao, Y.-D. Hsieh, and P. Chen. Uni10: An open-source library for tensor network algorithms. In Journal of Physics: Conference Series, volume 640, page 012040. IOP Publishing, 2015.
[108] L. Karlsson, D. Kressner, and A. Uschmajew. Parallel algorithms for tensor completion in the CP format. Parallel Computing, 57:222–234, 2016.
[109] J.-P. Kauppi, J. Hahne, K.R. Müller, and A. Hyvärinen. Three-way analysis of spectrospatial electromyography data: Classification and interpretation. PLoS One, 10(6):e0127231, 2015.
[110] V.A. Kazeev, M. Khammash, M. Nip, and C. Schwab. Direct solution of the chemical master equation using quantized tensor trains. PLoS Computational Biology, 10(3):e1003359, 2014.
[111] V.A. Kazeev and B.N. Khoromskij. Low-rank explicit QTT representation of the Laplace operator and its inverse. SIAM Journal on Matrix Analysis and Applications, 33(3):742–758, 2012.
[112] V.A. Kazeev, B.N. Khoromskij, and E.E. Tyrtyshnikov. Multilevel Toeplitz matrices generated by tensor-structured vectors and convolution with logarithmic complexity. SIAM Journal on Scientific Computing, 35(3):A1511–A1536, 2013.
[113] V.A. Kazeev, O. Reichmann, and C. Schwab. Low-rank tensor structure of linear diffusion operators in the TT and QTT formats. Linear Algebra and its Applications, 438(11):4204–4221, 2013.
[114] B.N. Khoromskij. $O(d \log N)$-quantics approximation of $N$-$d$ tensors in high-dimensional numerical modeling. Constructive Approximation, 34(2):257–280, 2011.
[115] B.N. Khoromskij. Tensor-structured numerical methods in scientific computing: Survey on recent advances. Chemometrics and Intelligent Laboratory Systems, 110(1):1–19, 2011.
[116] B.N. Khoromskij and A. Veit. Efficient computation of highly oscillatory integrals by using QTT tensor approximation. Computational Methods in Applied Mathematics, 16(1):145–159, 2016.
[117] H.-J. Kim, E. Ollila, V. Koivunen, and H.V. Poor. Robust iteratively reweighted Lasso for sparse tensor factorizations. In
IEEE Workshop on Statistical Signal Processing (SSP), pages 420–423, 2014.
[118] S. Klus and C. Schütte. Towards tensor-based methods for the numerical approximation of the Perron-Frobenius and Koopman operator. arXiv e-prints arXiv:1512.06527, December 2015.
[119] T.G. Kolda and B.W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.
[120] D. Kressner, M. Steinlechner, and A. Uschmajew. Low-rank tensor methods with subspace correction for symmetric eigenvalue problems. SIAM Journal on Scientific Computing, 36(5):A2346–A2368, 2014.
[121] D. Kressner, M. Steinlechner, and B. Vandereycken. Low-rank tensor completion by Riemannian optimization. BIT Numerical Mathematics, 54(2):447–468, 2014.
[122] D. Kressner and C. Tobler. Algorithm 941: HTucker – A MATLAB toolbox for tensors in hierarchical Tucker format. ACM Transactions on Mathematical Software, 40(3):22, 2014.
[123] D. Kressner and A. Uschmajew. On low-rank approximability of solutions to high-dimensional operator equations and eigenvalue problems. Linear Algebra and its Applications, 493:556–572, 2016.
[124] P.M. Kroonenberg. Applied Multiway Data Analysis. John Wiley & Sons Ltd, New York, 2008.
[125] J.B. Kruskal. Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138, 1977.
[126] V. Kuleshov, A.T. Chaganty, and P. Liang. Tensor factorization via matrix factorization. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 507–516, 2015.
[127] N. Lee and A. Cichocki. Estimating a few extreme singular values and vectors for large-scale matrices in Tensor Train format. SIAM Journal on Matrix Analysis and Applications, 36(3):994–1014, 2015.
[128] N. Lee and A. Cichocki. Fundamental tensor operations for large-scale data analysis using tensor network formats.
Multidimensional Systems and Signal Processing, (accepted), 2016.
[129] N. Lee and A. Cichocki. Regularized computation of approximate pseudoinverse of large matrices using low-rank tensor train decompositions. SIAM Journal on Matrix Analysis and Applications, 37(2):598–623, 2016.
[130] N. Lee and A. Cichocki. Tensor train decompositions for higher order regression with LASSO penalties. In Workshop on Tensor Decompositions and Applications (TDA2016), 2016.
[131] J. Li, C. Battaglino, I. Perros, J. Sun, and R. Vuduc. An input-adaptive and in-place approach to dense tensor-times-matrix multiply. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 76. ACM, 2015.
[132] M. Li and V. Monga. Robust video hashing via multilinear subspace projections. IEEE Transactions on Image Processing, 21(10):4397–4409, 2012.
[133] S. Liao, T. Vejchodský, and R. Erban. Tensor methods for parameter estimation and bifurcation analysis of stochastic reaction networks. Journal of the Royal Society Interface, 12(108):20150233, 2015.
[134] A.P. Liavas and N.D. Sidiropoulos. Parallel algorithms for constrained tensor factorization via alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(20):5450–5463, 2015.
[135] L.H. Lim and P. Comon. Multiarray signal processing: Tensor decomposition meets compressed sensing. Comptes Rendus Mécanique, 338(6):311–320, 2010.
[136] M.S. Litsarev and I.V. Oseledets. A low-rank approach to the computation of path integrals. Journal of Computational Physics, 305:557–574, 2016.
[137] H. Lu, K.N. Plataniotis, and A.N. Venetsanopoulos. A survey of multilinear subspace learning for tensor data. Pattern Recognition, 44(7):1540–1551, 2011.
[138] M. Lubasch, J.I. Cirac, and M.-C. Bañuls. Unifying projected entangled pair state contractions.
New Journal of Physics, 16(3):033014, 2014.
[139] C. Lubich, T. Rohwedder, R. Schneider, and B. Vandereycken. Dynamical approximation of hierarchical Tucker and tensor-train tensors. SIAM Journal on Matrix Analysis and Applications, 34(2):470–494, 2013.
[140] M.W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
[141] M.W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106:697–702, 2009.
[142] M.W. Mahoney, M. Maggioni, and P. Drineas. Tensor-CUR decompositions for tensor-based data. SIAM Journal on Matrix Analysis and Applications, 30(3):957–987, 2008.
[143] H. Matsueda. Analytic optimization of a MERA network and its relevance to quantum integrability and wavelet. arXiv preprint arXiv:1608.02205, 2016.
[144] A.Y. Mikhalev and I.V. Oseledets. Iterative representing set selection for nested cross-approximation. Numerical Linear Algebra with Applications, 2015.
[145] L. Mirsky. Symmetric gauge functions and unitarily invariant norms. The Quarterly Journal of Mathematics, 11:50–59, 1960.
[146] J. Morton. Tensor networks in algebraic geometry and statistics. Lecture at Networking Tensor Networks, Centro de Ciencias de Benasque Pedro Pascual, Benasque, Spain, 2012.
[147] M. Mørup. Applications of tensor (multiway array) factorizations and decompositions in data mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1):24–40, 2011.
[148] V. Murg, F. Verstraete, R. Schneider, P.R. Nagy, and Ö. Legeza. Tree tensor network state with variable tensor order: An efficient multireference method for strongly correlated systems. Journal of Chemical Theory and Computation, 11(3):1027–1036, 2015.
[149] N. Nakatani and G.K.L. Chan. Efficient tree tensor network states (TTNS) for quantum chemistry: Generalizations of the density matrix renormalization group algorithm.
The Journal of Chemical Physics, 2013.
[150] Y. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.
[151] Y. Nesterov. Subgradient methods for huge-scale optimization problems. Mathematical Programming, 146(1-2):275–297, 2014.
[152] N.H. Nguyen, P. Drineas, and T.D. Tran. Tensor sparsification via a bound on the spectral norm of random tensors. Information and Inference, page iav004, 2015.
[153] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
[154] A. Novikov and R.A. Rodomanov. Putting MRFs on a tensor train. In Proceedings of the International Conference on Machine Learning (ICML '14), 2014.
[155] A.C. Olivieri. Analytical advantages of multivariate data processing. One, two, three, infinity? Analytical Chemistry, 80(15):5713–5720, 2008.
[156] R. Orús. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349:117–158, 2014.
[157] I.V. Oseledets. Approximation of $2^d \times 2^d$ matrices using tensor decomposition. SIAM Journal on Matrix Analysis and Applications, 31(4):2130–2145, 2010.
[158] I.V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
[159] I.V. Oseledets and S.V. Dolgov. Solution of linear systems and matrix inversion in the TT-format. SIAM Journal on Scientific Computing, 34(5):A2718–A2739, 2012.
[160] I.V. Oseledets, S.V. Dolgov, V.A. Kazeev, D. Savostyanov, O. Lebedeva, P. Zhlobich, T. Mach, and L. Song. TT-Toolbox, 2012.
[161] I.V. Oseledets and E.E. Tyrtyshnikov. Breaking the curse of dimensionality, or how to use SVD in many dimensions.
SIAM Journal on Scientific Computing, 31(5):3744–3759, 2009.
[162] I.V. Oseledets and E.E. Tyrtyshnikov. TT cross-approximation for multidimensional arrays. Linear Algebra and its Applications, 432(1):70–88, 2010.
[163] E.E. Papalexakis, C. Faloutsos, and N.D. Sidiropoulos. Tensors for data mining and data fusion: Models, applications, and scalable algorithms. ACM Transactions on Intelligent Systems and Technology (TIST), 8(2):16, 2016.
[164] E.E. Papalexakis, N. Sidiropoulos, and R. Bro. From K-means to higher-way co-clustering: Multilinear decomposition with sparse latent factors. IEEE Transactions on Signal Processing, 61(2):493–506, 2013.
[165] N. Parikh and S.P. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
[166] D. Perez-Garcia, F. Verstraete, M.M. Wolf, and J.I. Cirac. Matrix product state representations. Quantum Information & Computation, 7(5):401–430, July 2007.
[167] R. Pfeifer, G. Evenbly, S. Singh, and G. Vidal. NCON: A tensor network contractor for MATLAB. arXiv preprint arXiv:1402.0939, 2014.
[168] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 239–247. ACM, 2013.
[169] A.-H. Phan and A. Cichocki. Extended HALS algorithm for nonnegative Tucker decomposition and its applications for multiway analysis and classification. Neurocomputing, 74(11):1956–1969, 2011.
[170] A.-H. Phan, A. Cichocki, A. Uschmajew, P. Tichavsky, G. Luta, and D. Mandic. Tensor networks for latent variable analysis. Part I: Algorithms for tensor train decomposition. ArXiv e-prints, 2016.
[171] A.-H. Phan, P. Tichavský, and A. Cichocki. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations.
IEEE Transactions on Signal Processing , 63(22):5924–5938, 2015.[173] A.H. Phan and A. Cichocki. Tensor decompositions for featureextraction and classification of high dimensional datasets.
NonlinearTheory and its Applications, IEICE , 1(1):37–68, 2010.[174] A.H. Phan, A. Cichocki, P. Tichavsky, D. Mandic, and K. Matsuoka.On revealing replicating structures in multiway data: A novel tensordecomposition approach. In
Proceedings of the 10th InternationalConference LVA/ICA, Tel Aviv, March 12-15 , pages 297–305. Springer,2012.[175] A.H. Phan, A. Cichocki, P. Tichavsk ´y, R. Zdunek, and S.R. Lehky.From basis components to complex structural patterns. In
Proceedingsof the IEEE International Conference on Acoustics, Speech and SignalProcessing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013 ,pages 3228–3232, 2013.[176] A.H. Phan, P. Tichavsk ´y, and A. Cichocki. Low complexity dampedGauss-Newton algorithms for CANDECOMP/PARAFAC.
SIAMJournal on Matrix Analysis and Applications (SIMAX) , 34(1):126–147,2013.[177] A.H. Phan, P. Tichavsk ´y, and A. Cichocki. Low rank tensordeconvolution. In
Proceedings of the IEEE International Conferenceon Acoustics Speech and Signal Processing, ICASSP , pages 2169–2173,April 2015. 168178] S. Ragnarsson.
Structured Tensor Computations: Blocking Symmetries and Kronecker Factorization. PhD dissertation, Cornell University, Department of Applied Mathematics, 2012.
[179] M.V. Rakhuba and I.V. Oseledets. Fast multidimensional convolution in low-rank tensor formats via cross-approximation. SIAM Journal on Scientific Computing, 37(2):A565–A582, 2015.
[180] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156:433–484, 2016.
[181] J. Salmi, A. Richter, and V. Koivunen. Sequential unfolding SVD for tensors with applications in array signal processing. IEEE Transactions on Signal Processing, 57:4719–4733, 2009.
[182] U. Schollwöck. The density-matrix renormalization group in the age of matrix product states. Annals of Physics, 326(1):96–192, 2011.
[183] U. Schollwöck. Matrix product state algorithms: DMRG, TEBD and relatives. In Strongly Correlated Systems, pages 67–98. Springer, 2013.
[184] N. Schuch, I. Cirac, and D. Pérez-García. PEPS as ground states: Degeneracy and topology. Annals of Physics, 325(10):2153–2192, 2010.
[185] N. Sidiropoulos, R. Bro, and G. Giannakis. Parallel factor analysis in sensor array processing. IEEE Transactions on Signal Processing, 48(8):2377–2388, 2000.
[186] N.D. Sidiropoulos. Generalizing Caratheodory’s uniqueness of harmonic parameterization to N dimensions. IEEE Transactions on Information Theory, 47(4):1687–1690, 2001.
[187] N.D. Sidiropoulos. Low-rank decomposition of multi-way arrays: A signal processing perspective. In Proceedings of the IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM 2004), July 2004.
[188] N.D. Sidiropoulos and R. Bro. On the uniqueness of multilinear decomposition of N-way arrays. Journal of Chemometrics, 14(3):229–239, 2000.
[189] N.D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E.E. Papalexakis, and C. Faloutsos. Tensor decomposition for signal processing and machine learning. arXiv e-prints arXiv:1607.01668, 2016.
[190] A. Smilde, R. Bro, and P. Geladi.
Multi-way Analysis: Applications in the Chemical Sciences. John Wiley & Sons Ltd, New York, 2004.
[191] S.M. Smith, A. Hyvärinen, G. Varoquaux, K.L. Miller, and C.F. Beckmann. Group-PCA for very large fMRI datasets. NeuroImage, 101:738–749, 2014.
[192] L. Sorber, I. Domanov, M. Van Barel, and L. De Lathauwer. Exact line and plane search for tensor optimization. Computational Optimization and Applications, 63(1):121–142, 2016.
[193] L. Sorber, M. Van Barel, and L. De Lathauwer. Optimization-based algorithms for tensor decompositions: Canonical Polyadic Decomposition, decomposition in rank-(Lr, Lr, 1) terms and a new generalization. SIAM Journal on Optimization, 23(2), 2013.
[194] M. Sørensen and L. De Lathauwer. Blind signal separation via tensor decomposition with Vandermonde factor. Part I: Canonical polyadic decomposition. IEEE Transactions on Signal Processing, 61(22):5507–5519, 2013.
[195] M. Sørensen, L. De Lathauwer, P. Comon, S. Icart, and L. Deneire. Canonical Polyadic Decomposition with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 33(4):1190–1213, 2012.
[196] M. Steinlechner. Riemannian optimization for high-dimensional tensor completion. Technical report MATHICSE 5.2015, EPF Lausanne, Switzerland, 2015.
[197] M.M. Steinlechner. Riemannian Optimization for Solving High-Dimensional Problems with Low-Rank Tensor Structure. PhD thesis, École Polytechnique Fédérale de Lausanne, 2016.
[198] E.M. Stoudenmire and S.R. White. Minimally entangled typical thermal state algorithms. New Journal of Physics, 12(5):055026, 2010.
[199] J. Sun, D. Tao, and C. Faloutsos. Beyond streams and graphs: Dynamic tensor analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 374–383. ACM, 2006.
[200] S.K. Suter, M. Makhynia, and R. Pajarola. TAMRESH - tensor approximation multiresolution hierarchy for interactive volume visualization.
Computer Graphics Forum, 32(3):151–160, 2013.
[201] Y. Tang, R. Salakhutdinov, and G. Hinton. Tensor analyzers. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, USA, 2013.
[202] D. Tao, X. Li, X. Wu, and S. Maybank. General tensor discriminant analysis and Gabor features for gait recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(10):1700–1715, 2007.
[203] P. Tichavský and A. Yeredor. Fast approximate joint diagonalization incorporating weight matrices. IEEE Transactions on Signal Processing, 47(3):878–891, 2009.
[204] M.K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
[205] C. Tobler. Low-rank Tensor Methods for Linear Systems and Eigenvalue Problems. PhD thesis, ETH Zürich, 2012.
[206] L.N. Trefethen. Cubature, approximation, and isotropy in the hypercube. SIAM Review (to appear), 2017.
[207] V. Tresp, C. Esteban, Y. Yang, S. Baier, and D. Krompaß. Learning with memory embeddings. arXiv preprint arXiv:1511.07972, 2015.
[208] J.A. Tropp, A. Yurtsever, M. Udell, and V. Cevher. Randomized single-view algorithms for low-rank matrix approximation. arXiv e-prints, 2016.
[209] L. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
[210] L.R. Tucker. The extension of factor analysis to three-dimensional matrices. In H. Gulliksen and N. Frederiksen, editors, Contributions to Mathematical Psychology, pages 110–127. Holt, Rinehart and Winston, New York, 1964.
[211] A. Uschmajew and B. Vandereycken. The geometry of algorithms using hierarchical tensors. Linear Algebra and its Applications, 439:133–166, 2013.
[212] N. Vannieuwenhoven, R. Vandebril, and K. Meerbergen. A new truncation strategy for the higher-order singular value decomposition.
SIAM Journal on Scientific Computing, 34(2):A1027–A1052, 2012.
[213] M.A.O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In Proceedings of the European Conference on Computer Vision (ECCV), volume 2350, pages 447–460, Copenhagen, Denmark, May 2002.
[214] F. Verstraete, V. Murg, and I. Cirac. Matrix product states, projected entangled pair states, and variational renormalization group methods for quantum spin systems. Advances in Physics, 57(2):143–224, 2008.
[215] N. Vervliet, O. Debals, L. Sorber, and L. De Lathauwer. Breaking the curse of dimensionality using decompositions of incomplete tensors: Tensor-based scientific computing in big data analysis. IEEE Signal Processing Magazine, 31(5):71–79, 2014.
[216] G. Vidal. Efficient classical simulation of slightly entangled quantum computations. Physical Review Letters, 91(14):147902, 2003.
[217] S.A. Vorobyov, Y. Rong, N.D. Sidiropoulos, and A.B. Gershman. Robust iterative fitting of multilinear models. IEEE Transactions on Signal Processing, 53(8):2678–2689, 2005.
[218] S. Wahls, V. Koivunen, H.V. Poor, and M. Verhaegen. Learning multidimensional Fourier series with tensor trains. In IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 394–398. IEEE, 2014.
[219] D. Wang, H. Shen, and Y. Truong. Efficient dimension reduction for high-dimensional matrix-valued data. Neurocomputing, 190:25–34, 2016.
[220] H. Wang and M. Thoss. Multilayer formulation of the multiconfiguration time-dependent Hartree theory. Journal of Chemical Physics, 119(3):1289–1299, 2003.
[221] H. Wang, Q. Wu, L. Shi, Y. Yu, and N. Ahuja. Out-of-core tensor approximation of multi-dimensional matrices of visual data. ACM Transactions on Graphics, 24(3):527–535, 2005.
[222] S. Wang and Z. Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling.
The Journal of Machine Learning Research, 14(1):2729–2769, 2013.
[223] Y. Wang, H.-Y. Tung, A. Smola, and A. Anandkumar. Fast and guaranteed tensor decomposition via sketching. In Advances in Neural Information Processing Systems, pages 991–999, 2015.
[224] S.R. White. Density-matrix algorithms for quantum renormalization groups. Physical Review B, 48(14):10345, 1993.
[225] Z. Xu, F. Yan, and Y. Qi. Infinite Tucker decomposition: Nonparametric Bayesian models for multiway data analysis. In Proceedings of the 29th International Conference on Machine Learning (ICML), ICML ’12, pages 1023–1030. Omnipress, July 2012.
[226] Y. Yang and T. Hospedales. Deep multi-task representation learning: A tensor factorisation approach. arXiv preprint arXiv:1605.06391, 2016.
[227] T. Yokota, N. Lee, and A. Cichocki. Robust multilinear tensor rank estimation using higher order singular value decomposition and information criteria. IEEE Transactions on Signal Processing, accepted, 2017.
[228] T. Yokota, Q. Zhao, and A. Cichocki. Smooth PARAFAC decomposition for tensor completion. IEEE Transactions on Signal Processing, 64(20):5423–5436, 2016.
[229] Z. Zhang, X. Yang, I.V. Oseledets, G.E. Karniadakis, and L. Daniel. Enabling high-dimensional hierarchical uncertainty quantification by ANOVA and tensor-train decomposition. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(1):63–76, 2015.
[230] H.H. Zhao, Z.Y. Xie, Q.N. Chen, Z.C. Wei, J.W. Cai, and T. Xiang. Renormalization of tensor-network states. Physical Review B, 81(17):174411, 2010.
[231] Q. Zhao, C. Caiafa, D.P. Mandic, Z.C. Chao, Y. Nagasaka, N. Fujii, L. Zhang, and A. Cichocki. Higher order partial least squares (HOPLS): A generalized multilinear regression method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1660–1673, 2013.
[232] Q. Zhao, G. Zhou, T. Adali, L. Zhang, and A. Cichocki. Kernelization of tensor-based models for multiway data analysis: Processing of multidimensional structured data.
IEEE Signal Processing Magazine, 30(4):137–148, 2013.
[233] S. Zhe, Y. Qi, Y. Park, Z. Xu, I. Molloy, and S. Chari. DinTucker: Scaling up Gaussian process models on large multidimensional arrays. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[234] G. Zhou and A. Cichocki. Canonical Polyadic Decomposition based on a single mode blind source separation. IEEE Signal Processing Letters, 19(8):523–526, 2012.
[235] G. Zhou and A. Cichocki. Fast and unique Tucker decompositions via multiway blind source separation. Bulletin of Polish Academy of Science, 60(3):389–407, 2012.
[236] G. Zhou, A. Cichocki, and S. Xie. Fast nonnegative matrix/tensor factorization based on low-rank approximation. IEEE Transactions on Signal Processing, 60(6):2928–2940, June 2012.
[237] G. Zhou, A. Cichocki, Y. Zhang, and D.P. Mandic. Group component analysis for multiblock data: Common and individual feature extraction. IEEE Transactions on Neural Networks and Learning Systems, (in print), 2016.
[238] G. Zhou, A. Cichocki, Q. Zhao, and S. Xie. Efficient nonnegative Tucker decompositions: Algorithms and uniqueness. IEEE Transactions on Image Processing, 24(12):4990–5003, 2015.
[239] G. Zhou, Q. Zhao, Y. Zhang, T. Adali, S. Xie, and A. Cichocki. Linked component analysis from matrices to high-order tensors: Applications to biomedical data.