Tensor-Train Parameterization for Ultra Dimensionality Reduction
Mingyuan Bai
Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney
Camperdown, NSW, Australia
[email protected]

S.T. Boris Choy
Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney
Camperdown, NSW, Australia
[email protected]

Xin Song
Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney
Camperdown, NSW, Australia
School of Computer Science, China University of Geosciences
Wuhan 430074, P. R. China
[email protected]

Junbin Gao
Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney
Camperdown, NSW, Australia
[email protected]
Abstract—Locality preserving projections (LPP) is a classical dimensionality reduction method based on data graph information. However, LPP is still responsive to extreme outliers. LPP aims at vectorial data and may undermine data structural information when it is applied to multi-dimensional data. Besides, it assumes the dimension of the data to be smaller than the number of instances, which is not suitable for high-dimensional data. For high-dimensional data analysis, the tensor-train decomposition is proved to be able to efficiently and effectively capture the spatial relations. Thus, we propose a tensor-train parameterization for ultra dimensionality reduction (TTPUDR) in which the traditional LPP mapping is tensorized in terms of tensor-trains and the LPP objective is replaced with the Frobenius norm to increase the robustness of the model. A manifold optimization technique is utilized to solve the new model. The performance of TTPUDR is assessed on classification problems, and TTPUDR significantly outperforms the past methods and several state-of-the-art methods.
Index Terms—tensor, high-dimensional data, dimensionality reduction, locality preserving projections, robustness
I. INTRODUCTION
Ultra high-dimensional data, attracting great attention from both academia and industry, have become common in computer vision [1], recommender systems [2], signal processing [3] and neuroscience [4]. In many cases, high-dimensional data are converted from so-called multi-dimensional data, commonly referred to as tensors or multi-way arrays. A great amount of research scrutinizes the information in tensors. To avoid the curse-of-dimensionality issue in data-driven learning, research on dimensionality reduction that takes the tensorial structure into account has attracted great interest in the literature [5]-[7].

Many methods explore the information in tensors through tensor decomposition. A group of them presume and maintain spatial structures in tensors. Three classical methods are the CANDECOMP/PARAFAC (CP) decomposition [8], the Tucker decomposition [9] and the tensor-train (TT) decomposition [10]. The TT decomposition offers the most compact capacity by decomposing an n-order tensor into the multiplication of n 3-order tensor cores.

The l1-norm is robust to outliers, including extreme ones, and has been applied in both PCA and LPP. However, the l1-norm is not differentiable at every point. Furthermore, minimizing an l1-norm objective function with respect to a matrix optimization variable in substance minimizes each element or column vector of this matrix variable separately, where these elements or column vectors are the components of the matrix variable; thus the spatial relation is not considered sufficiently. An approximation to the l1-norm is the Frobenius norm. When minimizing a Frobenius norm objective function, all the components are treated as a whole group, and the spatial relations between the components are thus adequately preserved and analyzed. Most of the existing dimensionality reduction methods are applied to high/multi-dimensional data by vectorizing them. This vectorization enlarges the parameter space of the algorithms and neglects the spatial relational information existing in multi-way data. Therefore, tensor subspace embedded dimensionality reduction methods come on stage.

There are already a small number of existing attempts to embed the tensor subspace into low-dimensional spaces. For example, Tucker LPP (TLPP) [6] embeds the tensor subspace based on the Tucker decomposition into the low-dimensional space under the LPP criterion. The local relation is sufficiently captured, but the accuracy deteriorates due to the sensitivity to exceptional outliers, and the computational cost also increases exponentially.

In this paper, we propose a dimensionality reduction method with the TT subspace embedded, based on the Frobenius norm to measure the distance. We name our method tensor-train parameterization for ultra dimensionality reduction (TTPUDR). It enables the spatial relational information in the tensor to be efficiently and effectively processed and scrutinized, especially when the tensor has a large number of modes or dimensions. Even for extreme outliers, the results in terms of accuracy and storage efficiency still appear to be satisfactory. In particular, the storage efficiency is higher than that of existing dimensionality reduction methods such as PCA, LPP and TLPP. The main contributions of this paper are as follows:
1) The proposed TTPUDR is the first method intending to fill the research gap mentioned above.
The embedded TT subspace can preserve spatial relations in multi/high-dimensional data and achieve lower storage complexity than the Tucker-based subspace in [6].
2) We propose to use the Frobenius norm (F-norm) in the tensor-train LPP (TTLPP) objective function to greatly reduce the sensitivity to outliers, especially the extreme ones, and to consider the spatial relations sufficiently.
3) An efficient algorithm is proposed so that TTPUDR is sustainable and executable for ultra-dimensional data. This is a significant improvement over the approximated pseudo PCA implemented in the state-of-the-art dimensionality reduction method, tensor train neighborhood preserving embedding (TTNPE) [7].
4) A number of numerical experiments have been conducted on several real-world datasets. The performance on these datasets is precisely consistent with the stated contributions and advantages.

II. RELATED WORK
As aforementioned, there are a great number of tensor decomposition methods investigating spatial relations in multi-dimensional data, i.e., tensors [8]-[10], [13], [14]. The tensor-train (TT) decomposition is relatively the most efficient and effective among the three classical methods above.

To preserve spatial information within tensors in dimensionality reduction, [6] introduces the Tucker LPP (TLPP), which is LPP based on the Tucker decomposition to analyze high-dimensional data; its storage complexity increases exponentially as the number of modes increases. The other existing dimensionality reduction method which embeds the TT subspace is the tensor train neighbourhood preserving embedding (TTNPE) [7]. TTNPE solves the exponential explosion of the complexity as the number of modes increases. However, its robustness to extreme outliers remains a concern. Therefore, a dimensionality reduction method is demanded that is built on the TT subspace, handles tensors with a large number of modes or dimensions, and is capable of reducing the sensitivity to extreme outliers. Our method TTPUDR is thus developed with all these aspects.
A. Preliminaries
Before introducing the TT decomposition and LPP, the ground definitions, notations and tensor operations are specified. In this paper, we do not distinguish the dimensions of a tensor and its modes. A classic vector is a tensor of mode 1, or a 1-order tensor. Similarly, a matrix is a tensor of mode 2, i.e., a 2-order tensor; and a 3-order tensor can be viewed as a data cube with three modes.

As is the tradition, we denote scalars by lower-case letters, such as a; vectors by bold lower-case letters, for instance, x; and matrices by bold capital letters, for example, S. They are all examples of tensors. In general, we use calligraphic capital letters as the notation for tensors, e.g., \mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n} being an n-order tensor of dimension I_i at mode i.

Tensor contraction is defined as the multiplication of tensors along their compatible modes. Let \mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n} and \mathcal{Y} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_m}. The tensor contraction is defined as

\mathcal{Z} = \mathcal{X} \times_{\tilde{p}}^{\tilde{q}} \mathcal{Y},   (1)

where \tilde{p} \subseteq p = \{1, \dots, n\} and \tilde{q} \subseteq q = \{1, \dots, m\} are subsets satisfying \tilde{p} = \{k \mid I_k = J_k\} and \tilde{q} = \{k \mid J_k = I_k\}, respectively. The tensor contraction merges two tensors along the modes with equal sizes, and \mathcal{Z} \in \mathbb{R}^{\times_{k \in \tilde{p}^c} I_k \times \times_{k \in \tilde{q}^c} J_k}, where \tilde{p}^c and \tilde{q}^c denote the complements of \tilde{p} and \tilde{q}.

We denote the left unfolding operation [7] of \mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n \times R_n} as the matrix L(\mathcal{X}) \in \mathbb{R}^{I_1 I_2 \cdots I_n \times R_n}, where the last mode of the tensor becomes the column indices of the left unfolding matrix and the rest of the modes are the row indices. Similarly, the right unfolding operation is denoted by R(\mathcal{X}) \in \mathbb{R}^{I_1 \times I_2 \cdots I_n R_n}. Also, the vectorization of a tensor is denoted by V(\mathcal{X}) \in \mathbb{R}^{I_1 I_2 \cdots I_n R_n}. The F-norm of a tensor can be defined as the l2-norm of its vectorization, i.e.,

\|\mathcal{X}\|_F = \|V(\mathcal{X})\|_2 = \sqrt{\sum_{i_1=1}^{I_1}\sum_{i_2=1}^{I_2}\cdots\sum_{i_n=1}^{I_n}\sum_{r_n=1}^{R_n} x_{i_1, i_2, \dots, i_n, r_n}^2},

which considers all the elements x_{i_1, i_2, \dots, i_n, r_n}, i_1 = 1, \dots, I_1, \dots, i_n = 1, \dots, I_n, r_n = 1, \dots, R_n as an entire group and preserves the general spatial relations between elements. Besides, the l1-norm of a tensor is computed as

\|\mathcal{X}\|_1 = \|V(\mathcal{X})\|_1 = \sum_{i_1=1}^{I_1}\sum_{i_2=1}^{I_2}\cdots\sum_{i_n=1}^{I_n}\sum_{r_n=1}^{R_n} |x_{i_1, i_2, \dots, i_n, r_n}|,

which treats each element separately and can probably cause spatial information loss.
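To make these conventions concrete, the following is a minimal NumPy sketch (illustrative only, not from the paper); the ordering of the row indices inside the unfoldings is a layout assumption (row-major here):

import numpy as np

X = np.random.randn(4, 5, 3)          # a 3-order tensor with I1 = 4, I2 = 5, R = 3

L_X = X.reshape(4 * 5, 3)             # left unfolding L(X): last mode indexes the columns
R_X = X.reshape(4, 5 * 3)             # right unfolding R(X): first mode indexes the rows
v_X = X.reshape(-1)                   # vectorization V(X)

fro = np.linalg.norm(v_X, 2)          # ||X||_F = ||V(X)||_2, treats all entries as one group
l1 = np.linalg.norm(v_X, 1)           # ||X||_1 = ||V(X)||_1, treats each entry separately
assert np.isclose(fro, np.sqrt((X ** 2).sum()))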
B. Tensor-Train Decomposition

The tensor-train (TT) decomposition is designed for large-scale data analysis [10]. It can achieve a simpler implementation than the tree-type decomposition algorithms [15], which are developed to reduce the storage complexity and avoid local minima.

The TT decomposition assumes a special structure of a tensor subspace where an n-order tensor is expressed as the contraction of a series of n 3-order tensors. An n-order tensor \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n} is formed as follows:

\mathcal{Y}(i_1, i_2, \dots, i_n) = \mathcal{U}_1(:, i_1, :)\, \mathcal{U}_2(:, i_2, :) \cdots \mathcal{U}_k(:, i_k, :) \cdots \mathcal{U}_{n-1}(:, i_{n-1}, :)\, \mathcal{U}_n(:, i_n, :),   (2)

where \mathcal{U}_1 \in \mathbb{R}^{1 \times I_1 \times R_1}, \mathcal{U}_k \in \mathbb{R}^{R_{k-1} \times I_k \times R_k} (1 < k < n), and \mathcal{U}_n \in \mathbb{R}^{R_{n-1} \times I_n \times 1}. R_k (k = 1, 2, \dots, n-1) are the tensor ranks. Let R = \max\{R_1, R_2, \dots, R_{n-1}\} and I = \max\{I_1, I_2, \dots, I_n\}. Thus, the storage complexity of the TT decomposition is O(nIR^2).

For most applications, in order to achieve computational efficiency and be less information redundant, researchers often restrict the tensor ranks to be smaller than the size of the corresponding tensor mode, i.e., R_k < I_k for k = 1, 2, \dots, n-1 [7].
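As an illustration of Equation (2), the sketch below (illustrative sizes only, not the authors' code) rebuilds a small 3-order tensor entry by entry from its TT cores and compares the number of stored parameters with the full tensor:

import numpy as np

I, R = [4, 5, 6], [1, 3, 3, 1]                      # mode sizes I_k and ranks R_0, ..., R_3
cores = [np.random.randn(R[k], I[k], R[k + 1]) for k in range(3)]

Y = np.empty(I)
for i1 in range(I[0]):
    for i2 in range(I[1]):
        for i3 in range(I[2]):
            # product of lateral slices U_1(:, i1, :) U_2(:, i2, :) U_3(:, i3, :) is 1 x 1
            Y[i1, i2, i3] = (cores[0][:, i1, :] @ cores[1][:, i2, :] @ cores[2][:, i3, :]).item()

tt_storage = sum(c.size for c in cores)             # grows as O(n I R^2)
full_storage = Y.size                               # grows exponentially with n
print(tt_storage, full_storage)                     # 75 versus 120 in this tiny example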
C. Locality Preserving Projections

Locality preserving projections (LPP) [12] explore and preserve local information of the data in the projected lower-dimensional space, while the conventional principal component analysis (PCA) [11] favours maintaining global information in the data.

Given a set of vectorial training data \{x_i\}_{i=1}^{N} \subset \mathbb{R}^{P} and an affinity matrix of locality similarity S = [s_{ij}], LPP seeks a linear projection A from \mathbb{R}^{P} to \mathbb{R}^{p} such that the following optimization problem is solved to minimize the locality preserving criterion set as the objective function:

\min_{A} \sum_{i,j} \|A^\top x_i - A^\top x_j\|_2^2\, s_{ij}, \quad \text{s.t. } A^\top X D X^\top A = I.   (3)

The widely used affinity S = [s_{ij}] is based on the graph of the neighborhood information in the data as follows [12]:

s_{ij} = \exp(-\|x_i - x_j\|^2 / t), if x_i \in N_k(x_j) or x_j \in N_k(x_i), and s_{ij} = 0 otherwise,

where t \in \mathbb{R}_+ is a positive parameter and N_k(x) denotes the k-nearest neighborhood of x. Denote X = [x_1, x_2, \dots, x_{N-1}, x_N]. The LPP problem (3) can indeed be converted to the following generalized eigenvalue problem to solve for the eigenvalues \lambda and eigenvectors a:

X L X^\top a = \lambda\, X D X^\top a,   (4)

where L = D - S and D is a diagonal matrix consisting of the row sums of S. The columns of the final mapping A consist of the generalized eigenvectors a in Equation (4) corresponding to the smallest p eigenvalues.

LPP is a classical dimensionality reduction method and has been applied in many real cases, for example, computer vision [16]. It captures the local information among the data points and is less sensitive to outliers than PCA. However, we do observe the following shortcomings of LPP:
1) LPP is designed for vectorial data. When it is applied to multi-dimensional data, i.e., tensors, there exists a potential loss of spatial information. The existing tensor locality preserving projections, i.e., the Tucker LPP (TLPP) [6], embeds the tensor space with a high storage complexity of O(nIR + R^n).
2) Theoretically, LPP cannot work for cases where the data dimension is greater than the number of samples. Although this can be avoided by a trick in which one first projects the data onto its PCA subspace and then implements LPP in this subspace, this would not work well for ultra-dimensional data with a fairly large dataset, as the singular value decomposition (SVD) becomes a bottleneck.

The TT decomposition, with a smaller storage complexity of O(nIR^2), has recently been applied in the tensor train neighborhood preserving embedding (TTNPE) [7], [17]. Nevertheless, the actual algorithm in TTNPE is only implemented as a TT approximation to the pseudo PCA. To the best of our knowledge, there is no existing dimensionality reduction method which can directly process tensor data with less storage complexity, i.e., using the TT decomposition in the algorithm.
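For concreteness, a minimal LPP sketch following Equations (3) and (4) is given below (illustrative code, not the drtoolbox implementation used in the experiments; the ridge term and the default parameter values are assumptions):

import numpy as np
from scipy.linalg import eigh

def lpp(X, p, n_neighbors=5, t=1.0):
    """X: P x N matrix whose columns are the samples; returns the P x p mapping A."""
    P, N = X.shape
    d2 = np.square(X[:, :, None] - X[:, None, :]).sum(axis=0)        # pairwise ||x_i - x_j||^2
    S = np.zeros((N, N))
    for i in range(N):
        for j in np.argsort(d2[i])[1:n_neighbors + 1]:                # k nearest neighbours of x_i
            S[i, j] = np.exp(-d2[i, j] / t)
    S = np.maximum(S, S.T)                                            # x_i in N_k(x_j) or x_j in N_k(x_i)
    D = np.diag(S.sum(axis=1))
    L = D - S
    # generalized eigenproblem X L X^T a = lambda X D X^T a; keep the p smallest eigenvectors
    _, V = eigh(X @ L @ X.T, X @ D @ X.T + 1e-9 * np.eye(P))
    return V[:, :p]

A = lpp(np.random.randn(20, 100), p=3)    # project 20-dimensional data to 3 dimensions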
III. METHODOLOGY

In this section, we propose the tensor-train parameterization for ultra dimensionality reduction (TTPUDR) to fill the research gap aforementioned in Section I. The learning procedure is presented in detail with a summary in the form of pseudo code.

Consider a tensor-train (TT) \tilde{\mathcal{U}} = \mathcal{U}_1 \times \mathcal{U}_2 \times \cdots \times \mathcal{U}_n where \mathcal{U}_k \in \mathbb{R}^{R_{k-1} \times I_k \times R_k}, R_0 = 1 and 1 \le R_n \le I_1 I_2 \cdots I_n. For a given set of tensor data \{\mathcal{X}_i\}_{i=1}^{N} \subset \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}, we project \mathcal{X}_i to the vector t_i \in \mathbb{R}^{R_n} by the TT-parameterized mapping defined as

t_i = L^\top(\tilde{\mathcal{U}})\, V(\mathcal{X}_i),

where R_n is now the number of components, or the dimension, of t_i. Denote by S = [s_{ij}] the similarity based on the graph of the neighborhood of the tensor data, which may be defined as in LPP [12] introduced in Section II. To increase the model robustness towards extreme data outliers and preserve the spatial relations, we design TTPUDR by modifying the LPP formulation into the following optimization problem, using the Frobenius norm objective function instead of the squared Frobenius norm or the l1-norm:

\min_{\mathcal{U}_1, \mathcal{U}_2, \dots, \mathcal{U}_n} \sum_{i,j=1}^{N} \|L^\top(\tilde{\mathcal{U}}) V(\mathcal{X}_i) - L^\top(\tilde{\mathcal{U}}) V(\mathcal{X}_j)\|_F\, s_{ij}, \quad \text{s.t. } L^\top(\mathcal{U}_k) L(\mathcal{U}_k) = I_{R_k}, \ \forall k = 1, \dots, n.   (5)

The TT decomposition based parameterization of the mapping tensor can preserve or learn the spatial relation in the tensor data \mathcal{X}_i. However, using the F-norm in Problem (5) makes TTPUDR more difficult to solve.

We propose a splitting and iterative way to solve the problem. For this purpose, we define

\tilde{s}_{ij} = \frac{s_{ij}}{\|L^\top(\tilde{\mathcal{U}}) V(\mathcal{X}_i) - L^\top(\tilde{\mathcal{U}}) V(\mathcal{X}_j)\|_F},   (6)

which is a function of the tensor cores \mathcal{U}_1, \mathcal{U}_2, \dots, \mathcal{U}_n. Then we rewrite Problem (5) in terms of the squared F-norm as follows:

\min_{\mathcal{U}_1, \mathcal{U}_2, \dots, \mathcal{U}_n} \sum_{i,j=1}^{N} \|L^\top(\tilde{\mathcal{U}}) V(\mathcal{X}_i) - L^\top(\tilde{\mathcal{U}}) V(\mathcal{X}_j)\|_F^2\, \tilde{s}_{ij}, \quad \text{s.t. } L^\top(\mathcal{U}_k) L(\mathcal{U}_k) = I_{R_k}, \ \forall k = 1, \dots, n.   (7)

Problem (7) seems to be an LPP problem. However, it is not, because the modified affinity \tilde{s}_{ij} is a function of the parameters \{\mathcal{U}_1, \mathcal{U}_2, \dots, \mathcal{U}_n\}. We solve it in the following way. Suppose Problem (7) is being solved by an iterative optimization algorithm. We use the current parameter values to calculate \tilde{s}_{ij} according to Equation (6) and then fix all \tilde{s}_{ij} to solve Problem (7). This alternating procedure continues until convergence.

To efficiently solve Problem (7) while \tilde{s}_{ij} is fixed, we follow an alternating procedure solving for each tensor core \mathcal{U}_k while the rest are fixed. Overall, we solve the TT parameters, i.e., the tensor cores, and update the neighborhood graph \tilde{S} alternately. This learning procedure terminates when the solution converges.
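A rough sketch of this outer re-weighting step is given below (assumed shapes and variable names; the small epsilon that avoids division by zero is our addition and is not stated in the paper):

import numpy as np

def tt_to_matrix(cores):
    """Contract TT cores U_1, ..., U_n and left-unfold the result into L(U~) of size (I_1...I_n) x R_n."""
    M = cores[0]                                      # shape (1, I_1, R_1)
    for U in cores[1:]:
        M = np.tensordot(M, U, axes=(M.ndim - 1, 0))  # contract the shared rank mode
    return M.reshape(-1, cores[-1].shape[-1])

def reweighted_affinity(S, cores, X_list):
    """s~_ij = s_ij / ||L(U~)^T (V(X_i) - V(X_j))||_F, cf. Equation (6)."""
    E = tt_to_matrix(cores)
    T = np.stack([E.T @ X.reshape(-1) for X in X_list])    # projected data t_i, one row per sample
    S_tilde = np.zeros_like(S)
    for i in range(len(X_list)):
        for j in range(len(X_list)):
            if S[i, j] != 0:
                S_tilde[i, j] = S[i, j] / (np.linalg.norm(T[i] - T[j]) + 1e-12)
    return S_tilde

Within each outer iteration, the matrix of \tilde{s}_{ij} plays the role of a fixed affinity when forming the Laplacian \tilde{L} = \tilde{D} - \tilde{S} for the core-wise subproblems below.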
In optimizing each tensor core \mathcal{U}_k, we find that the strategy in [17] involves manipulating a matrix Z \in \mathbb{R}^{I_1 I_2 \cdots I_n \times I_1 I_2 \cdots I_n}, which is prohibitive when the data are ultra-dimensional or high-order tensors. By taking advantage of the commutative property of the tensor contraction operation, we propose a new strategy which largely speeds up the calculation.

To describe the new algorithm, we define

\mathcal{T}_1(k) = \mathcal{U}_1 \times \cdots \times \mathcal{U}_{k-1} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_{k-1} \times R_{k-1}},   (8)
\mathcal{T}_n(k) = \mathcal{U}_{k+1} \times \cdots \times \mathcal{U}_n \in \mathbb{R}^{R_k \times I_{k+1} \times \cdots \times I_n \times R_n},   (9)

where 1 \le k \le n, but \mathcal{T}_1(1) and \mathcal{T}_n(n) are not defined.

Let \mathcal{X} be the (n+1)-order data tensor whose mode (n+1) stacks the data samples, i.e., \mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n \times N}. Then define the partially transformed tensor, for 1 < k < n, of size R_{k-1} \times R_k \times R_n \times I_k \times N,

\mathcal{Y}_k = (\mathcal{X} \times_{1,2,\dots,k-1}^{1,2,\dots,k-1} \mathcal{T}_1(k)) \times_{2,\dots,n+1-k}^{2,\dots,n+1-k} \mathcal{T}_n(k),

and, for k = 1,

\mathcal{Y}_1 = \mathcal{X} \times_{2,\dots,n}^{2,\dots,n} \mathcal{T}_n(1) \in \mathbb{R}^{R_1 \times R_n \times I_1 \times N},

and, for k = n,

\mathcal{Y}_n = \mathcal{X} \times_{1,\dots,n-1}^{1,\dots,n-1} \mathcal{T}_1(n) \in \mathbb{R}^{R_{n-1} \times I_n \times N}.

Finally, the optimization problem (7) for TTPUDR is transformed into the following subproblems:
Solving for \mathcal{U}_1: For each 1 \le r_n \le R_n, take the slice \mathcal{Y}_1(:, r_n, :, :) and reshape it as a matrix Y_1(r_n) of size (R_1 I_1) \times N, and form the matrix H_1 = \sum_{r_n=1}^{R_n} Y_1(r_n)\, \tilde{L}\, Y_1(r_n)^\top of size (R_1 I_1) \times (R_1 I_1). Then \mathcal{U}_1 is solved by

\min_{\mathcal{U}_1} V(\mathcal{U}_1)^\top H_1 V(\mathcal{U}_1), \quad \text{s.t. } L^\top(\mathcal{U}_1) L(\mathcal{U}_1) = I_{R_1}.   (10)

Solving for \mathcal{U}_k (1 < k < n): For each 1 \le r_n \le R_n, take the slice \mathcal{Y}_k(:, :, r_n, :, :) and reshape it as a matrix Y_k(r_n) of size (R_{k-1} I_k R_k) \times N, and form the matrix H_k = \sum_{r_n=1}^{R_n} Y_k(r_n)\, \tilde{L}\, Y_k(r_n)^\top of size (R_{k-1} I_k R_k) \times (R_{k-1} I_k R_k). Then \mathcal{U}_k is solved by

\min_{\mathcal{U}_k} V(\mathcal{U}_k)^\top H_k V(\mathcal{U}_k), \quad \text{s.t. } L^\top(\mathcal{U}_k) L(\mathcal{U}_k) = I_{R_k}.   (11)

Solving for \mathcal{U}_n: Reshape \mathcal{Y}_n to the matrix Y_n of size (R_{n-1} I_n) \times N, and form the matrix H_n = Y_n\, \tilde{L}\, Y_n^\top. Then solve for \mathcal{U}_n satisfying L^\top(\mathcal{U}_n) L(\mathcal{U}_n) = I_{R_n} by

\min_{\mathcal{U}_n} \operatorname{trace}\left(L^\top(\mathcal{U}_n) H_n L(\mathcal{U}_n)\right).   (12)
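As a concrete illustration of the last subproblem (and of Remark 2 below), the following sketch (assumed variable names and a row-major folding convention) forms H_n and extracts L(\mathcal{U}_n) from the eigenvectors with the R_n smallest eigenvalues:

import numpy as np

def solve_last_core(Y_n_mat, L_tilde, R_nm1, I_n, R_n):
    """Y_n_mat: (R_{n-1} I_n) x N reshaping of Y_n; L_tilde: the N x N graph Laplacian."""
    H_n = Y_n_mat @ L_tilde @ Y_n_mat.T              # (R_{n-1} I_n) x (R_{n-1} I_n), symmetric
    _, V = np.linalg.eigh(H_n)                       # eigenvalues in ascending order
    L_Un = V[:, :R_n]                                # orthonormal columns minimizing the trace in (12)
    return L_Un.reshape(R_nm1, I_n, R_n)             # fold L(U_n) back into the core U_n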
Remark 1: We have added the orthogonal constraints L^\top(\mathcal{U}_k) L(\mathcal{U}_k) = I_{R_k} in Problems (10)-(12). These constrained conditions make sure that the dimensionality reduction mapping E = L(\mathcal{U}_1 \times \mathcal{U}_2 \times \cdots \times \mathcal{U}_n) consists of orthogonal columns, by referring to Lemma 2 in [7].
Algorithm 1 Optimization Algorithm for TTPUDR
Input: \mathcal{X} = \{\mathcal{X}_i\}_{i=1}^{N} \subset \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_n}, the original neighbourhood graph S, and the maximum number of iterations Iter.
Output: Optimal tensor cores \mathcal{U}_1, \mathcal{U}_2, \dots, \mathcal{U}_n for the tensor train.
Initialize the tensor cores \mathcal{U}_k \in \mathbb{R}^{R_{k-1} \times I_k \times R_k} for k = 1, \dots, n, where R_0 = 1 for \mathcal{U}_1 and R_n \in \{1, 2, \dots, I_1 I_2 \cdots I_n\} for \mathcal{U}_n.
for m = 1 : Iter do
    Calculate \tilde{S} = [\tilde{s}_{ij}] according to Equation (6) and prepare \tilde{L} = \tilde{D} - \tilde{S};
    for k = 1 : n do
        if k = 1 then
            Form problem (10) by calculating H_1 and obtain \mathcal{U}_1 by solving the problem;
        else if k = 2, \dots, n-1 then
            Form problem (11) by calculating H_k and obtain \mathcal{U}_k by solving the problem;
        else
            Form problem (12) by calculating H_n and obtain \mathcal{U}_n by solving the problem;
    if converged then
        break
return \mathcal{U}_1, \dots, \mathcal{U}_n

To ease the optimization on the Stiefel manifold in Problems (10) and (11), we can replace the orthogonal condition by V^\top(\mathcal{U}_k) V(\mathcal{U}_k) = 1 (1 \le k < n), resulting in an eigenvalue problem. However, the overall orthogonality will then be lost.
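A sketch of this relaxed core update is given below (illustrative only; it returns the eigenvector of the smallest eigenvalue of H_k, folded into a core under a row-major convention):

import numpy as np

def solve_core_relaxed(H_k, R_km1, I_k, R_k):
    """Relaxation of (10)/(11): min V(U_k)^T H_k V(U_k) subject to ||V(U_k)||_2 = 1."""
    _, V = np.linalg.eigh(H_k)                  # H_k is (R_{k-1} I_k R_k) x (R_{k-1} I_k R_k)
    return V[:, 0].reshape(R_km1, I_k, R_k)     # smallest-eigenvalue eigenvector folded into U_k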
Remark 2: Problem (12) is quite different from Problems (10) and (11). Problem (12) is equivalent to the eigenvalue problem of H_n.
Remark 3: The algorithm can also be used for dimensionality reduction of ultra-dimensional vectorial data. For example, suppose that the dimension of the vector data is D = I_1 \times I_2 \times \cdots \times I_n; then we can seek the dimensionality reduction mapping in terms of the TT parameterization. This makes dimensionality reduction possible for ultra-dimensional data.
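For instance, assuming a hypothetical factorization 1024 = 4 x 4 x 8 x 8, a vectorial datum can be folded into a 4-order tensor before the TT mapping is applied:

import numpy as np

x = np.random.randn(1024)        # an ultra-dimensional vectorial datum, D = 1024
X = x.reshape(4, 4, 8, 8)        # one possible choice of I_1 x I_2 x I_3 x I_4 with D = I_1 I_2 I_3 I_4
# X can now be projected by the TT-parameterized mapping t = L(U~)^T V(X) of Section III.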
IV. EXPERIMENTS

To validate the proposed TTPUDR method, experiments on facial recognition and remote sensing are demonstrated in this section. The results are compared with classical and related methods, i.e., PCA [11] and LPP [12]. All the experiments are conducted on a Windows 10 system with 128GB of memory and an Intel Core i7-6950X processor (25M cache, up to 3.50 GHz), using Matlab 2018a.
A. Data Description
The performance of the TTPUDR method is studied through numerical experiments on two high-dimensional datasets from two publicly available databases: the Extended Yale B [18] and the Northwest Indiana's Indian Pines collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in 1992 [19]. The first two experiments are conducted on the original datasets from these two databases, whereas the third experiment aims to investigate the robustness of TTPUDR to extreme outliers. Therefore, we add block noises to the Extended Yale B dataset.

The Extended Yale B dataset contains faces of 38 individuals. Each individual has 9 poses and 64 near frontal-face images. Each image has been resized to 32 x 32 pixels. The final number of images is obtained after conducting the rearrangements and removing the missing values.

The Northwest Indiana's Indian Pines (Indiana) dataset is collected on the Indian Pines test site in north-western Indiana and contains 145 x 145 pixels and 224 spectral reflectance bands in the wavelength range 0.4-2.5 x 10^{-6} meters. Similar to the Extended Yale B dataset, we choose the spectral reflectance bands and pixel locations by eliminating the missing values and the water absorption bands.

For the noised Extended Yale B dataset, we add block noise to 10% and 20% of the images, respectively. The noises are generated as either the minimum value or the maximum value of the Extended Yale B dataset, i.e., 0 or 255, whereas the general pixel values range from 0 to 255. They are added to the images as small blocks, i.e., salt and pepper noises, at both predefined and random locations. This dataset is designed to examine the robustness of TTPUDR to extreme outliers.

To investigate the capability of capturing the spatial structure information, we select the first two of the three datasets above. In these two datasets (no noises added), part of the data is considered as the training set and the rest as the test set. Then, to test the robustness and further scrutinize the ability of TTPUDR to process ultra high-dimensional data, we utilize only the third, noised dataset, with two different training/test splits; the larger split uses 60% of the data for training. In the case of the noised dataset, the extreme outlier noises are added to 10% and 20% of the training data accordingly.
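The block-noise construction can be sketched as follows (the block size, the corrupted fraction and the random seed are assumptions; the paper only specifies extreme minimum/maximum values placed at predefined or random locations):

import numpy as np

def add_block_noise(images, fraction=0.1, block=4, rng=np.random.default_rng(0)):
    """images: array of shape (num_images, 32, 32) with pixel values in [0, 255]."""
    noisy = images.copy()
    idx = rng.choice(len(images), size=int(fraction * len(images)), replace=False)
    for i in idx:
        r, c = rng.integers(0, 32 - block, size=2)
        noisy[i, r:r + block, c:c + block] = rng.choice([0, 255])   # minimum or maximum value
    return noisy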
B. Benchmark and Comparison Criteria

The experiments are designed to evaluate the capability of TTPUDR to analyze structured high-dimensional data and its robustness to extreme outliers. We compare its performance with existing methods such as PCA and LPP for compatible cases. Note that we are unable to compare with TTNPE since its publicly available program itself is not executable due to its extreme computational complexity. The same issue also exists for TLPP. For both PCA and LPP, we use the implementation in https://lvdmaaten.github.io/drtoolbox/.

For the classification performance, we use the data after dimensionality reduction as the new features for each object and fit a classifier. The 1-nearest neighborhood (1NN) classifier is used in our experiments. The evaluation criteria are the overall accuracy (OA), the average accuracy (AA), and the Kappa coefficient (KC), for the number of reduced dimensions from 2 to 30, i.e., R_n = 2, \dots, 30. Specifically, these criteria are computed as

OA = \frac{1}{T} \sum_{c=1}^{C} TP_c, \quad AA = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c}, \quad KC = \frac{OA - \frac{1}{T^2} \sum_{c=1}^{C} (TP_c + FP_c)(TP_c + FN_c)}{1 - \frac{1}{T^2} \sum_{c=1}^{C} (TP_c + FP_c)(TP_c + FN_c)},

where C is the total number of classes and T is the number of test data points.

For robustness to outliers, the evaluation criteria are the accuracy itself and the convergence speed of the accuracy, for the different proportions of outliers at 10% and 20%. Furthermore, the convergence analysis is conducted based on the four cases mentioned above, but only the case with the fastest convergence speed for TTPUDR is disclosed and compared with the other methods across all the iterations on the corresponding feature number for TTPUDR.
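A small sketch of these criteria (illustrative code, not the authors' evaluation script; the chance-agreement term follows the Kappa definition above):

import numpy as np

def evaluate(y_true, y_pred, num_classes):
    T = len(y_true)
    tp = np.array([np.sum((y_true == c) & (y_pred == c)) for c in range(num_classes)])
    fp = np.array([np.sum((y_true != c) & (y_pred == c)) for c in range(num_classes)])
    fn = np.array([np.sum((y_true == c) & (y_pred != c)) for c in range(num_classes)])
    oa = tp.sum() / T
    aa = np.mean(tp / np.maximum(tp + fp, 1))                 # guard against empty predicted classes
    pe = np.sum((tp + fp) * (tp + fn)) / T ** 2               # chance agreement
    kc = (oa - pe) / (1 - pe)
    return oa, aa, kc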
C. Results and Findings

As aforementioned, the experiments on the Indiana dataset and the Extended Yale B dataset examine how well TTPUDR can capture the spatial information in high-dimensional data. We also use the noised Extended Yale B dataset to examine the robustness of TTPUDR. In the first set of experiments, the dimension of the training data is smaller than the number of samples. Another set of experiments on the noised Extended Yale B dataset is intended to further evaluate this ability of TTPUDR on ultra high-dimensional data and its robustness to extreme outliers.
1) Parameter Compression Capability:
In the cases examining spatial information capturing, the dimension of the data is smaller than the number of samples in the training set. In other words, the assumption of LPP on the dimension size and the number of samples is not violated. For each method on each dataset, we have executed 150 runs, i.e., 10 shuffles of random samples with 15 iterations for each sample. Firstly, the results for the Indiana dataset are presented in Table I.
[TABLE I: Comparison of evaluation criteria (OA, AA, KC) under TTPUDR, LPP and PCA on the Indiana dataset.]

For a fair comparison, the number of neighbors and the parameter t are set to the same values for both LPP and TTPUDR to construct the affinity matrix of locality similarity S. The Indiana data are reshaped into 3-order tensors, so TTPUDR uses three tensor cores whose last rank R_n, the number of features, ranges from 2 to 30. The total number of model parameters ranges from 152 to 1272, versus 200 to 6000 for PCA and LPP. Here we only present the case with R_n = 24, which is randomly selected from 2 to 30. The values in each cell of the table are the means over the 10 random shuffles. As this dataset has a larger number of samples and a smaller number of dimensions, the performance of the proposed TTPUDR is less competitive than that of PCA and LPP. On average, the OA, AA and KC values under TTPUDR are 10% smaller than those under PCA and LPP.

A similar experiment for the Extended Yale B dataset is summarized in Table II.

[TABLE II: Comparison of evaluation criteria (OA, AA, KC) under TTPUDR, LPP and PCA on the Extended Yale B dataset.]

To compare TTPUDR with LPP fairly, the number of neighbors and the heat kernel width parameter t are again set to the same values for both LPP and TTPUDR to construct the affinity matrix of locality similarity S. The data are reshaped into 4-order tensors, so TTPUDR uses four tensor cores whose last rank R_n, the number of features, ranges from 2 to 30. The total number of model parameters ranges from 416 to 1312, versus 2048 to 30720 for PCA and LPP. In this case, we randomly choose R_n = 28 to demonstrate, i.e., the results are based on R_n = 28 features or dimensions. The numbers in Table II are the best result of each method for each criterion, and the presented OA, AA and KC are the means of those across iterations. This case shows that TTPUDR performs better than both PCA and LPP; on average, these values are at least 66% larger under TTPUDR than under LPP and PCA. This result is not surprising, as this dataset has a smaller sample size and a larger dimension than the Indiana dataset, which aligns with the ultra-dimensional characteristics targeted by TTPUDR.

This set of experiments demonstrates that TTPUDR uses much fewer model parameters to achieve comparable performance for the classification tasks.
2) Robustness:
Following the parameter compression capability, we examine the robustness of TTPUDR with the noised Extended Yale B dataset. The results are reported in Figure 1. For simplicity, we present OA for TTPUDR, LPP and PCA across the reduced dimensions, i.e., from 2 to 30 features, since all three methods perform better on this evaluation criterion than on the other criteria.

[Fig. 1: Comparison of overall accuracy (OA) against the number of features for TTPUDR, LPP and PCA on the noised Extended Yale B dataset under (a) 60% training data and 10% noise; (b) 60% training data and 20% noise; (c) the smaller training proportion and 10% noise; and (d) the smaller training proportion and 20% noise. LPP does not appear in panels (c) and (d).]

Figures 1a and 1b demonstrate the performance of TTPUDR, LPP and PCA with 60% of the data used for training and with 10% and 20% of extreme outlier noise, respectively. From these figures, it is evident that TTPUDR significantly outperforms LPP and PCA on the overall accuracy. In the case with 10% of the noise, TTPUDR generally achieves better performance at a lower reduced dimensionality, although this pace slightly slows down in the case of 20% of extreme outlier noise. Therefore, we can conclude that TTPUDR is capable of capturing sufficient information in the ultra high-dimensional data effectively and efficiently at a lower dimensionality. In both noise cases, TTPUDR performs better than both LPP and PCA. This shows that TTPUDR has significantly higher robustness to extreme outliers due to its adoption of the F-norm LPP objective.

In Figures 1c and 1d, we show the results for the case of using the smaller proportion of training data, resulting in 482 samples of 1024 dimensions. Since the number of dimensions is larger than the number of samples, the assumption of LPP is violated. Thus, LPP is not able to execute and no result is available for LPP. However, TTPUDR can still operate and produces a more satisfactory OA compared with the other benchmark method, PCA. To sum up, TTPUDR has an excellent capability of processing and analyzing the spatial structural information in ultra high-dimensional data effectively, even with a really small number of training data. In terms of robustness, TTPUDR also performs more favourably than the other executable method.

V. CONCLUSIONS
This paper proposes a tensor-train parameterization for ultra-dimensionality reduction. The dimensionality reduction mapping is tensorized to learn and preserve spatial information among multi-dimensional data and to increase the model robustness towards extreme data outliers. The method has been successfully illustrated on two real datasets. Its performance is comparable with that of the existing methods while using fewer parameters. It also outperforms other competitive models in the case of high dimensions with small sample sizes and a large proportion of data with extreme noise. In future research, we intend to expand it into a structure which can also capture and analyze the sequential relations in time-series tensor data.
REFERENCES
[1] M. Vasilescu and D. Terzopoulos, "Multilinear analysis of image ensembles: TensorFaces," in ECCV, A. Heyden, G. Sparr, M. Nielsen, and P. Johansen, Eds., 2002, pp. 447-460.
[2] P. Symeonidis, "Matrix and tensor decomposition in recommender systems," in ACM RecSys, 2016, pp. 429-430.
[3] A. Cichocki, D. Mandic, A. Phan, C. Caiafa, G. Zhou, Q. Zhao, and L. Lathauwer, "Tensor decompositions for signal processing applications: from two-way to multiway component analysis," arXiv:1403.4462, 2014.
[4] C. F. Beckmann and S. M. Smith, "Tensorial extensions of independent component analysis for multisubject FMRI analysis," NeuroImage, vol. 25, no. 1, pp. 294-311, 2005.
[5] W. Wang, V. Aggarwal, and S. Aeron, "Tensor completion by alternating minimization under the tensor train (TT) model," arXiv:1609.05587, 2016.
[6] G. Dai and D. Yeung, "Tensor embedding methods," in AAAI, 2006, pp. 330-335.
[7] W. Wang, V. Aggarwal, and S. Aeron, "Principal component analysis with tensor train subspace," arXiv:1803.05026, 2018.
[8] F. L. Hitchcock, "Multiple invariants and generalized rank of a p-way matrix or tensor," Journal of Mathematics and Physics, vol. 7, no. 1-4, pp. 39-79, 1928.
[9] L. R. Tucker, "Implications of factor analysis of three-way matrices for measurement of change," in Problems in Measuring Change, C. W. Harris, Ed. Madison, WI: University of Wisconsin Press, 1963, pp. 122-137.
[10] I. Oseledets, "Tensor-train decomposition," SIAM Journal on Scientific Computing, vol. 33, no. 5, pp. 2295-2317, 2011.
[11] K. Pearson, "LIII. On lines and planes of closest fit to systems of points in space," The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559-572, 1901.
[12] X. He and P. Niyogi, "Locality preserving projections," in Advances in Neural Information Processing Systems, S. Thrun, L. Saul, and B. Schölkopf, Eds., vol. 16. Cambridge, MA: MIT Press, 2004.
[13] A. Cichocki, N. Lee, I. Oseledets, A.-H. Phan, Q. Zhao, and D. P. Mandic, "Tensor networks for dimensionality reduction and large-scale optimization: Part 1 low-rank tensor decompositions," Foundations and Trends in Machine Learning, vol. 9, no. 4-5, pp. 249-429, 2016.
[14] A. Cichocki, N. Lee, A. Phan, I. Oseledets, Q. Zhao, and D. Mandic, Tensor Networks for Dimensionality Reduction and Large-Scale Optimization: Part 2 Applications and Future Perspectives, ser. Foundations and Trends in Machine Learning. Now Publishers, 2017. [Online]. Available: https://books.google.com.au/books?id=xlpkswEACAAJ
[15] I. Oseledets and E. Tyrtyshnikov, "Breaking the curse of dimensionality, or how to use SVD in many dimensions," SIAM Journal on Scientific Computing, vol. 31, no. 5, pp. 3744-3759, 2009.
[16] Y. Xu, A. Zhong, J. Yang, and D. Zhang, "LPP solution schemes for use with face recognition," Pattern Recognition, vol. 43, no. 12, pp. 4165-4176, Dec. 2010. [Online]. Available: http://dx.doi.org/10.1016/j.patcog.2010.06.016
[17] W. Wang, V. Aggarwal, and S. Aeron, "Tensor train neighborhood preserving embedding," IEEE Transactions on Signal Processing, vol. 66, no. 10, pp. 2724-2732, May 2018.
[18] A. Georghiades, P. Belhumeur, and D. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643-660, 2001.