Image classification using local tensor singular value decompositions
Elizabeth Newman
Department of Mathematics, Tufts University, Medford, Massachusetts 02155. Email: [email protected]
Misha Kilmer
Department of Mathematics, Tufts University, Medford, Massachusetts 02155. Email: [email protected]
Lior Horesh
IBM TJ Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY. Email: [email protected]
Abstract—From linear classifiers to neural networks, image classification has been a widely explored topic in mathematics, and many algorithms have proven to be effective classifiers. However, the most accurate classifiers typically have significantly high storage costs, or require complicated procedures that may be computationally expensive. We present a novel (nonlinear) classification approach using truncation of local tensor singular value decompositions (tSVD) that robustly offers accurate results, while maintaining manageable storage costs. Our approach takes advantage of the optimality of the representation under the tensor algebra described to determine to which class an image belongs. We extend our approach to a method that can determine specific pairwise match scores, which could be useful in, for example, object recognition problems where pose/position are different. We demonstrate the promise of our new techniques on the MNIST data set.
I. INTRODUCTION
Image classification is a well-explored problem in which an image is identified as belonging to one of a known number of classes. Researchers seek to extract particular features from which to determine patterns comprising an image. Algorithms to determine these essential features include statistical methods such as centroid-based clustering, connectivity/graph-based clustering, distribution-based clustering, and density-based clustering [13], [14], [15], as well as learning algorithms (linear discriminant analysis, support vector machines, neural networks) [5].

Our approach differs significantly from techniques in the literature in that it uses local tensor singular value decompositions (tSVD) to form the feature space of an image. Tensor approaches are gaining increasing popularity for tasks such as image recognition and dictionary learning and reconstruction [3], [9], [7], [10]. These are favored over matrix-vector-based approaches as it has been demonstrated that a tensor-based approach enables retention of the original image structural correlations that are lost by image vectorization. Tensor approaches for image classification appear to be in their infancy, although some approaches based on the tensor HOSVD [11] have been explored in the literature [6].

Here, we are motivated by the work in [3], which employs optimal low tubal-rank tensor factorizations through use of the t-product [1], and by the work in [2] describing tensor orthogonal projections. We present a new approach for classification based on the tensor SVD from [1], called the tSVD, which is elegant for its straightforward mathematical interpretation and implementation, and which has the advantage that it can be easily parallelized for great computational advantage. State-of-the-art matrix decompositions are asymptotically challenged in dealing with the demand to process ever-growing datasets of larger and more complex objects [16], so the importance of this dimension of this study cannot be overstated. Our method is in direct contrast to deep neural network based approaches, which require many layers of complexity and for which theoretical interpretation is not readily available [17]. Our approach is also different from the tensor approach in [6] because truncating the tSVD has optimality properties that truncating the HOSVD does not enjoy. We conclude this study with a demonstration on the MNIST [4] dataset.
A. Notation and Preliminaries
In this paper, a tensor is a third-order tensor, or three-dimensional array of data, denoted by a capital script letter. As depicted in Figure 1, A is an ℓ × m × n tensor. Frontal slices A^(k) for k = 1, ..., n are ℓ × m matrices. Lateral slices A⃗_j for j = 1, ..., m are ℓ × n matrices oriented along the third dimension. Tubes a_ij for i = 1, ..., ℓ and j = 1, ..., m are n × 1 column vectors oriented along the third dimension [2].

Fig. 1. Representations of third-order tensors: (a) tensor A; (b) frontal slices A^(k); (c) lateral slices A⃗_j; (d) tubes a_ij.

To paraphrase the definition by Kilmer et al. [2], the range of a tensor A is the t-linear span of the lateral slices of A:

    R(A) = { A⃗_1 ∗ c_1 + · · · + A⃗_m ∗ c_m | c_i ∈ R^(1×1×n) }.   (1)

Because the lateral slices of A form the range, we store our images as lateral slices. Furthermore, A is real-valued because images are real-valued.

To multiply a pair of tensors, we need to understand the t-product, which requires the following tensor reshaping machinery. Given A ∈ R^(ℓ×m×n), the unfold function reshapes A into an ℓn × m block-column vector (i.e., the first block-column of (2)), while fold folds it back up again. The bcirc function forms an ℓn × mn block-circulant matrix from the frontal slices of A:

    bcirc(A) = ⎡ A^(1)   A^(n)    · · ·  A^(2) ⎤
               ⎢ A^(2)   A^(1)    · · ·  A^(3) ⎥
               ⎢   ⋮       ⋮       ⋱      ⋮   ⎥
               ⎣ A^(n)   A^(n−1)  · · ·  A^(1) ⎦ .   (2)

Now the t-product is defined as follows ([1]):

Definition 1 (t-product): Given A ∈ R^(ℓ×p×n) and B ∈ R^(p×m×n), the t-product is the ℓ × m × n product

    A ∗ B = fold(bcirc(A) · unfold(B)).   (3)

Under the t-product (Definition 1), we need the following from [1].

Definition 2: The tensor transpose A^T ∈ R^(p×ℓ×n) takes the transpose of the frontal slices of A and reverses the order of slices 2 through n.

Definition 3: The identity tensor J is an m × m × n tensor where J^(1) is an m × m identity matrix and all other frontal slices are zero.

Definition 4: An orthogonal tensor Q is an m × m × n tensor such that Q^T ∗ Q = Q ∗ Q^T = J. Analogous to the columns of an orthogonal matrix, the lateral slices of Q are orthonormal [2].

Definition 5: A tensor is f-diagonal if each frontal slice is a diagonal matrix.

II. TENSOR SINGULAR VALUE DECOMPOSITION
Let A be an ℓ × m × n tensor. As defined in [1], the tensor singular value decomposition (tSVD) of A is the following:

    A = U ∗ S ∗ V^T,   (4)

where for p = min(ℓ, m), U is an ℓ × p × n tensor with orthonormal lateral slices, V is an m × p × n tensor with orthonormal lateral slices, and S is a p × p × n f-diagonal tensor. The algorithm for computing the tSVD is given in [1]. Importantly, as noted in that paper, the bulk of the computations are performed on matrices, which are independent and can thus be done in parallel. Furthermore, analogously to matrix computation strategies, randomized variants of the tSVD algorithm have recently been proposed [12], which can be favored when the tensor is particularly large.

A. Range and Tubal-Rank of Tensors
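Before turning to range and tubal rank, it may help to make the machinery above concrete. The following NumPy sketch is our illustration, not code from [1]; the names t_product, t_transpose, and tsvd are ours. It exploits the fact that the DFT along the third mode block-diagonalizes bcirc(A), reducing the t-product and the tSVD to independent matrix operations on Fourier-domain frontal slices.

```python
import numpy as np

def t_product(A, B):
    """t-product A * B = fold(bcirc(A) @ unfold(B)), computed in the
    Fourier domain as n independent slice-wise matrix products."""
    Ah, Bh = np.fft.fft(A, axis=2), np.fft.fft(B, axis=2)
    Ch = np.einsum('ipk,pjk->ijk', Ah, Bh)  # slice-wise matrix multiply
    return np.fft.ifft(Ch, axis=2).real

def t_transpose(A):
    """Tensor transpose (Definition 2): transpose each frontal slice and
    reverse the order of slices 2 through n."""
    At = np.transpose(A, (1, 0, 2))
    return np.concatenate([At[:, :, :1], At[:, :, :0:-1]], axis=2)

def tsvd(A):
    """tSVD A = U * S * V^T as in (4): independent matrix SVDs of the
    Fourier-domain frontal slices; conjugate symmetry is enforced so
    that the factors come back real-valued."""
    l, m, n = A.shape
    p = min(l, m)
    Uh = np.empty((l, p, n), dtype=complex)
    Sh = np.zeros((p, p, n), dtype=complex)
    Vh = np.empty((m, p, n), dtype=complex)
    Ah = np.fft.fft(A, axis=2)
    for k in range(n // 2 + 1):  # remaining slices follow by conjugation
        u, s, vh = np.linalg.svd(Ah[:, :, k], full_matrices=False)
        Uh[:, :, k], Sh[:, :, k], Vh[:, :, k] = u, np.diag(s), vh.conj().T
    for k in range(n // 2 + 1, n):
        Uh[:, :, k] = Uh[:, :, n - k].conj()
        Sh[:, :, k] = Sh[:, :, n - k].conj()
        Vh[:, :, k] = Vh[:, :, n - k].conj()
    return (np.fft.ifft(Uh, axis=2).real,
            np.fft.ifft(Sh, axis=2).real,
            np.fft.ifft(Vh, axis=2).real)
```

Reconstructing U ∗ S ∗ V^T recovers A to machine precision, and since each Fourier-domain SVD is independent, the loop parallelizes trivially, which is the computational advantage noted above.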
As proven in Kilmer et al. [2], the range of A is determined via t-linear combinations of the lateral slices of U, for appropriate tensor coefficients c_i:

    R(A) = { U⃗_1 ∗ c_1 + · · · + U⃗_p ∗ c_p | c_i ∈ R^(1×1×n) }.   (5)

The lateral slices of U form an orthonormal basis for the range of A. More details related to the definition and the rest of the linear-algebraic framework can be found in [2].

The definition of the range of a tensor leads to the notion of projection. Given a lateral slice B⃗ ∈ R^(ℓ×1×n), the orthogonal projection into the range of A is defined as U ∗ U^T ∗ B⃗. We require the following theorem to understand the tubal-rank of tensors:

Theorem 1 ([1]):
For k ≤ min(ℓ, m), define

    A_k = Σ_{i=1}^{k} U⃗_i ∗ s_ii ∗ V⃗_i^T,

where U⃗_i and V⃗_i are the ith lateral slices of U and V, respectively, and s_ii is the (i, i)-tube of S. Then

    A_k = arg min_{Ã ∈ M} ||A − Ã||_F,

where M = { C = X ∗ Y | X ∈ R^(ℓ×k×n), Y ∈ R^(k×m×n) }.

From Theorem 1, we say A_k is a tensor of tubal-rank k; the definition of tubal rank is from [2]. It follows from the above that A_k is the best tubal-rank-k approximation to A.

B. The Algorithm
Suppose we have a set of training images and each image in the set belongs to one of N different classes. First, we form a third-order tensor for each class, A_1, A_2, ..., A_N, where A_i contains all the training images belonging to class i, stored as lateral slices. We assume all the training images are ℓ × n and that there are m_i images in class i; i.e., A_i is an ℓ × m_i × n tensor. Note that the m_i need not be the same. We then form a tubal-rank-k local tSVD (Theorem 1) for each tensor:

    A_i ≈ U_i ∗ S_i ∗ V_i^T  for i = 1, ..., N,   (6)

where U_i is an ℓ × k × n tensor and k ≪ m_i. Now, instead of storing all the training images, we need only store an ℓ × k × n tensor for each class. The training basis is thus an optimal basis in the sense of Theorem 1. The tensor operator U_i ∗ U_i^T is an orthogonal projection tensor [2] onto the space which is the t-linear combination of the lateral slices of U_i. Likewise, (I − U_i ∗ U_i^T) projects orthogonally to this space.

Next, suppose a test image belongs to one of the N classes and we want to determine the class to which it belongs. We re-orient this image as a lateral slice B⃗ and use our local tSVD bases to compute the norms of the residuals of the image after projection onto each training space:

    arg min_{i=1,...,N} ||B⃗ − U_i ∗ U_i^T ∗ B⃗||_F.   (7)

If B⃗ is a member of class i, we expect the ith term in (7) to be small. We determine the class to which B⃗ belongs by which projection is the closest to the original image in the Frobenius norm. (We note that extensions of the t-product and corresponding decompositions are possible for higher-order tensor representations, e.g., for color image training data, as well [18], [19].)

III. EXPERIMENTS AND RESULTS
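Before describing the experiments, the classification procedure of the previous section can be sketched end-to-end in NumPy on synthetic data. This is our illustration, not the code used for the experiments below; the helper names (local_basis, classify) are ours.

```python
import numpy as np

def _tprod(A, B):
    # t-product via the FFT along the third mode (slice-wise products)
    Ah, Bh = np.fft.fft(A, axis=2), np.fft.fft(B, axis=2)
    return np.fft.ifft(np.einsum('ipk,pjk->ijk', Ah, Bh), axis=2).real

def _ttrans(A):
    # tensor transpose: transpose frontal slices, reverse slices 2..n
    At = np.transpose(A, (1, 0, 2))
    return np.concatenate([At[:, :, :1], At[:, :, :0:-1]], axis=2)

def local_basis(A, k):
    """Truncated tSVD basis U_i of a class tensor A (l x m_i x n), as in
    (6): keep the k leading lateral slices of U."""
    l, _, n = A.shape
    Uh = np.empty((l, k, n), dtype=complex)
    Ah = np.fft.fft(A, axis=2)
    for j in range(n // 2 + 1):
        u, _, _ = np.linalg.svd(Ah[:, :, j], full_matrices=False)
        Uh[:, :, j] = u[:, :k]
    for j in range(n // 2 + 1, n):  # conjugate symmetry keeps U real
        Uh[:, :, j] = Uh[:, :, n - j].conj()
    return np.fft.ifft(Uh, axis=2).real

def classify(B, bases):
    """Assign the lateral slice B (l x 1 x n) to the class whose
    projection residual ||B - U_i * U_i^T * B||_F is smallest, as in (7)."""
    resid = [np.linalg.norm(B - _tprod(U, _tprod(_ttrans(U), B)))
             for U in bases]
    return int(np.argmin(resid))

# Tiny synthetic demo: two classes of 8 x 8 "images" stored as 8 x 1 x 8
# lateral slices, six noisy training samples around each class prototype.
rng = np.random.default_rng(1)
proto = [rng.standard_normal((8, 1, 8)) for _ in range(2)]
classes = [np.concatenate([p + 0.01 * rng.standard_normal(p.shape)
                           for _ in range(6)], axis=1) for p in proto]
bases = [local_basis(A, k=2) for A in classes]
test = proto[0] + 0.01 * rng.standard_normal((8, 1, 8))
label = classify(test, bases)  # expected to match class 0
```

Because a noisy copy of a class prototype lies nearly in the t-linear span of that class's truncated basis, its residual against the correct class is small while the residual against the other class remains large.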
To test our local tSVD classifier, we use the public MNIST dataset of handwritten digits as a benchmark [4]. The MNIST dataset contains 60,000 training images and 10,000 test images. Each image is a 28 × 28 grayscale image consisting of a single hand-written digit (i.e., 0 through 9). We organize the training images by digit, resulting in 10 different classes with the distribution of digits displayed in Figure 2.

Fig. 2. Table of MNIST digit distribution.
Each training image is stored as a 28 × 1 × 28 lateral slice, so the class tensor for digit i is of size 28 × m_i × 28. Using (6), we independently form a local tSVD basis for each class, U_0, U_1, ..., U_9, where U_i is the basis for the digit i and of size 28 × k × 28 for some truncation k. For simplicity, we use the same truncation k for all bases.
Our first objective is to use these local tSVD bases to determine the digit in each test image. Suppose P⃗_j is the 28 × 1 × 28 lateral slice of the jth test image. We determine how similar P⃗_j is to each digit using the metric (7):

    arg min_{i=0,...,9} ||P⃗_j − U_i ∗ U_i^T ∗ P⃗_j||_F.   (8)

To measure the accuracy of our classification, we compute the recognition rate for the entire test data as follows:

    r = (number of correctly classified test images) / (total number of test images) × 100%.   (9)

For various truncation values k, we obtain the recognition rates shown in Figure 3.

Fig. 3. Classification accuracy for various truncation values.
    Truncation   k = 3   k = 4   k = 5   k = 10
    r (%)        87.99   88.51   87.14   75.31

From Figure 3, we notice that smaller truncation values yield greater classification accuracy. This indicates that the magnitude of the tubes of singular values in S (i.e., ||s_ii||_F) decays rapidly for the early truncation values, as demonstrated in Figure 4. (Note that the tSVD offers flexibility in prescription of the truncation level per basis [3].)
Fig. 4. Magnitude decay of norm of singular value tubes for digits 0-4.
Notice in Figure 4 that the magnitude of the tubes of S decays rapidly for the first few indices i and decays more slowly starting at the index i = 5. This implies we can optimize our storage costs by truncating at about k = 5 without losing significant classification accuracy.

In addition to the overall classification accuracy, we can measure the accuracy of classifying each digit as

    r_i = (number of images of digit i correctly classified) / (total number of images of digit i).   (10)

We show the per-digit accuracy results for k = 4 in Figure 5.

Fig. 5. Classification accuracy per digit for truncation k = 4.

    Digit   Most Freq.   2nd Most   r_i (%)
    0       0            1          91.12
    1       1            4          96.56
    2       2            0          83.92
    3       3            8          82.77
    4       4            1          96.13
    5       5            8          79.48
    6       6            1          93.32
    7       7            9          90.95
    8       8            5          82.14
    9       9            4          87.02

In Figure 5, the "Most Freq." column indicates the class to which the images of each digit were most frequently classified. The "2nd Most" column indicates the second class to which the images of each digit were most frequently classified. We illustrate some of the mis-classifications that occur in Figure 6 for truncation k = 4.

Fig. 6. Examples of incorrect classification of images that should be 7.

We notice that the images in Figures 6a and 6b do have qualitative similarities to the classes to which they were incorrectly assigned. We can likely improve for ambiguous digits by adding additional features for each class and/or employing slightly different metrics.

B. Numerical Results: Identification
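The identification task below compares two test images by the cosine of the angle between their feature vectors of per-class residuals (8). The scoring step can be sketched as follows; the function name and the small example feature matrix are ours, for illustration only.

```python
import numpy as np

def cosine_scores(F):
    """Pairwise cosine similarity of the rows of a feature matrix F,
    where row j holds the per-class residuals (8) of test image j.
    Returns the symmetric similarity score matrix S."""
    G = F / np.linalg.norm(F, axis=1, keepdims=True)  # unit-norm rows
    return G @ G.T

# Illustrative residual features for four "images" over three classes:
# rows 0 and 1 are scalar multiples of each other (cosine 1, i.e. the
# same residual profile), while row 3 favors a different class.
F = np.array([[0.1, 0.9, 0.8],
              [0.2, 1.8, 1.6],
              [0.9, 0.1, 0.9],
              [1.0, 0.0, 0.0]])
S = cosine_scores(F)
```

Images containing the same digit share a residual profile (small in the same coordinate), so their cosine score is high; this is what produces the bright diagonal blocks in the similarity matrix discussed below.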
Our second objective is to use our local tSVD feature vectors to determine if a pair of test images contains the same digit. To solve this problem, we consider each comparison (8) to be a feature for a particular image P⃗_j instead of minimizing over the number of classes. More specifically, we construct a 10 × 1 vector of features for each of our 10,000 test images.

We measure the similarity between two images by computing the cosine between the feature vectors. Though other similarity metrics are possible, given what the (non-negative) entries in the feature vector represent, this seemed appropriate for proof of concept.

We compute the similarity for all (i, j)-pairs of test images to form a similarity score matrix S of size 10,000 × 10,000, where S is symmetric.

Fig. 7. Similarity score matrix for truncation k = 4.

In Figure 7, we display only a leading block of the similarity score matrix, and we notice that blocks along the diagonal contain the highest similarity scores, as desired given the ordering of the test data. This illustrates that the cosine metric does enable us to determine if two images contain the same digit.

Fig. 8. ROC curve for various truncation values k.

Using a receiver operating characteristic (ROC) curve in Figure 8, we visualize the effectiveness of our local tSVD classifier. Notice the curve for truncation k = 10 is significantly lower, indicating smaller truncation values (indicative of less storage) yield better accuracy for the MNIST dataset.

IV. CONCLUSIONS AND FUTURE WORK
We have developed a new local truncated tSVD approach to image classification based on provable optimality conditions which is elegant in its straightforward mathematical approach to the problem. Beyond the innate computational and storage efficiency advantages of the proposed approach, it has demonstrated effective performance in classifying MNIST data. The primary purpose of this short paper was a proof of concept of a new method. In the future, we will compare our approach to current state-of-the-art approaches in terms of storage, computation time, and qualitative classification results for larger and different datasets (e.g., subjects from a dataset of facial images). Additionally, we seek an automated strategy for determining the optimal truncation value k, or a varied truncation scheme, denoted tSVDII, as in [3]. We will also explore whether the alternative tensor-tensor products from [8] and their corresponding truncated tSVDs will allow us to obtain more illustrative features, and whether new double-sided tSVD techniques [20] that are insensitive to tensor orientation are useful here as well.

ACKNOWLEDGMENT