Image classification using local tensor singular value decompositions
Elizabeth Newman
Department of Mathematics, Tufts University, Medford, Massachusetts 02155. Email: [email protected]
Misha Kilmer
Department of Mathematics, Tufts University, Medford, Massachusetts 02155. Email: [email protected]
Lior Horesh
IBM TJ Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY. Email: [email protected]
Abstract—From linear classifiers to neural networks, image classification has been a widely explored topic in mathematics, and many algorithms have proven to be effective classifiers. However, the most accurate classifiers typically have significantly high storage costs, or require complicated procedures that may be computationally expensive. We present a novel (nonlinear) classification approach using truncation of local tensor singular value decompositions (tSVD) that robustly offers accurate results, while maintaining manageable storage costs. Our approach takes advantage of the optimality of the representation under the tensor algebra described to determine to which class an image belongs. We extend our approach to a method that can determine specific pairwise match scores, which could be useful in, for example, object recognition problems where pose/position are different. We demonstrate the promise of our new techniques on the MNIST data set.
I. INTRODUCTION
Image classification is a well-explored problem in which an image is identified as belonging to one of a known number of classes. Researchers seek to extract particular features from which to determine patterns comprising an image. Algorithms to determine these essential features include statistical methods such as centroid-based clustering, connectivity/graph-based clustering, distribution-based clustering, and density-based clustering [13], [14], [15], as well as learning algorithms (linear discriminant analysis, support vector machines, neural networks) [5].

Our approach differs significantly from techniques in the literature in that it uses local tensor singular value decompositions (tSVD) to form the feature space of an image. Tensor approaches are gaining increasing popularity for tasks such as image recognition and dictionary learning and reconstruction [3], [9], [7], [10]. These are favored over matrix-vector-based approaches as it has been demonstrated that a tensor-based approach enables retention of the original image structural correlations that are lost by image vectorization. Tensor approaches for image classification appear to be in their infancy, although some approaches based on the tensor HOSVD [11] have been explored in the literature [6].

Here, we are motivated by the work in [3], which employs optimal low tubal-rank tensor factorizations through use of the t-product [1], and by the work in [2] describing tensor orthogonal projections. We present a new approach for classification based on the tensor SVD from [1], called the tSVD, which is elegant for its straightforward mathematical interpretation and implementation, and which has the advantage that it can be easily parallelized for great computational advantage. State-of-the-art matrix decompositions are asymptotically challenged in dealing with the demand to process ever-growing datasets of larger and more complex objects [16], so the importance of this dimension of this study cannot be overstated. Our method is in direct contrast to deep neural network based approaches, which require many layers of complexity and for which theoretical interpretation is not readily available [17]. Our approach is also different from the tensor approach in [6] because truncating the tSVD has optimality properties that truncating the HOSVD does not enjoy. We conclude this study with a demonstration on the MNIST [4] dataset.
A. Notation and Preliminaries
In this paper, a tensor is a third-order tensor, or three-dimensional array of data, denoted by a capital script letter. As depicted in Figure 1, A is an ℓ × m × n tensor. Frontal slices A^(k) for k = 1, ..., n are ℓ × m matrices. Lateral slices A⃗_j for j = 1, ..., m are ℓ × n matrices oriented along the third dimension. Tubes a_ij for i = 1, ..., ℓ and j = 1, ..., m are n × 1 column vectors oriented along the third dimension [2].

Fig. 1. Representations of third-order tensors: (a) tensor A; (b) frontal slices A^(k); (c) lateral slices A⃗_j; (d) tubes a_ij.

To paraphrase the definition by Kilmer et al. [2], the range of a tensor A is the t-linear span of the lateral slices of A:

    R(A) = { A⃗_1 ∗ c_1 + · · · + A⃗_m ∗ c_m | c_i ∈ R^(1×1×n) }.   (1)

Because the lateral slices of A form the range, we store our images as lateral slices. Furthermore, A is real-valued because images are real-valued.

To multiply a pair of tensors, we need to understand the t-product, which requires the following tensor reshaping machinery. Given A ∈ R^(ℓ×m×n), the unfold function reshapes A into an ℓn × m block-column vector (i.e., the first block-column of (2)), while fold folds it back up again. The bcirc function forms an ℓn × mn block-circulant matrix from the frontal slices of A:

    bcirc(A) = ⎡ A^(1)   A^(n)    · · ·  A^(2) ⎤
               ⎢ A^(2)   A^(1)    · · ·  A^(3) ⎥
               ⎢   ⋮       ⋮       ⋱      ⋮   ⎥
               ⎣ A^(n)   A^(n−1)  · · ·  A^(1) ⎦ .   (2)

Now the t-product is defined as follows ([1]):

Definition 1 (t-product): Given A ∈ R^(ℓ×p×n) and B ∈ R^(p×m×n), the t-product is the ℓ × m × n product

    A ∗ B = fold(bcirc(A) · unfold(B)).   (3)

Under the t-product (Definition 1), we need the following from [1].

Definition 2: The tensor transpose A^T ∈ R^(p×ℓ×n) takes the transpose of the frontal slices of A and reverses the order of slices 2 through n.

Definition 3: The identity tensor J is an m × m × n tensor where J^(1) is an m × m identity matrix and all other frontal slices are zero.

Definition 4: An orthogonal tensor Q is an m × m × n tensor such that Q^T ∗ Q = Q ∗ Q^T = J. Analogous to the columns of an orthogonal matrix, the lateral slices of Q are orthonormal [2].

Definition 5: A tensor is f-diagonal if each frontal slice is a diagonal matrix.

II. TENSOR SINGULAR VALUE DECOMPOSITION
Let A be an ℓ × m × n tensor. As defined in [1], the tensor singular value decomposition (tSVD) of A is the following:

    A = U ∗ S ∗ V^T,   (4)

where for p = min(ℓ, m), U is an ℓ × p × n tensor with orthonormal lateral slices, V is an m × p × n tensor with orthonormal lateral slices, and S is a p × p × n f-diagonal tensor. The algorithm for computing the tSVD is given in [1]. Importantly, as noted in that paper, the bulk of the computations are performed on matrices, which are independent and can thus be done in parallel. Furthermore, analogously to matrix computation strategies, randomized variants of the tSVD algorithm have recently been proposed [12], which can be favored when the tensor is particularly large.

A. Range and Tubal-Rank of Tensors
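Before turning to range and tubal rank, it may help to make the machinery above concrete. The following NumPy sketch is our illustration, not code from [1]; the names t_product, t_transpose, and tsvd are ours. It exploits the fact that the DFT along the third mode block-diagonalizes bcirc(A), reducing the t-product and the tSVD to independent matrix operations on Fourier-domain frontal slices.

```python
import numpy as np

def t_product(A, B):
    """t-product A * B = fold(bcirc(A) @ unfold(B)), computed in the
    Fourier domain as n independent slice-wise matrix products."""
    Ah, Bh = np.fft.fft(A, axis=2), np.fft.fft(B, axis=2)
    Ch = np.einsum('ipk,pjk->ijk', Ah, Bh)  # slice-wise matrix multiply
    return np.fft.ifft(Ch, axis=2).real

def t_transpose(A):
    """Tensor transpose (Definition 2): transpose each frontal slice and
    reverse the order of slices 2 through n."""
    At = np.transpose(A, (1, 0, 2))
    return np.concatenate([At[:, :, :1], At[:, :, :0:-1]], axis=2)

def tsvd(A):
    """tSVD A = U * S * V^T as in (4): independent matrix SVDs of the
    Fourier-domain frontal slices; conjugate symmetry is enforced so
    that the factors come back real-valued."""
    l, m, n = A.shape
    p = min(l, m)
    Uh = np.empty((l, p, n), dtype=complex)
    Sh = np.zeros((p, p, n), dtype=complex)
    Vh = np.empty((m, p, n), dtype=complex)
    Ah = np.fft.fft(A, axis=2)
    for k in range(n // 2 + 1):  # remaining slices follow by conjugation
        u, s, vh = np.linalg.svd(Ah[:, :, k], full_matrices=False)
        Uh[:, :, k], Sh[:, :, k], Vh[:, :, k] = u, np.diag(s), vh.conj().T
    for k in range(n // 2 + 1, n):
        Uh[:, :, k] = Uh[:, :, n - k].conj()
        Sh[:, :, k] = Sh[:, :, n - k].conj()
        Vh[:, :, k] = Vh[:, :, n - k].conj()
    return (np.fft.ifft(Uh, axis=2).real,
            np.fft.ifft(Sh, axis=2).real,
            np.fft.ifft(Vh, axis=2).real)
```

Reconstructing U ∗ S ∗ V^T recovers A to machine precision, and since each Fourier-domain SVD is independent, the loop parallelizes trivially, which is the computational advantage noted above.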
As proven in Kilmer et al. [2], the range of A is determined via t-linear combinations of the lateral slices of U, for appropriate tensor coefficients c_i:

    R(A) = { U⃗_1 ∗ c_1 + · · · + U⃗_p ∗ c_p | c_i ∈ R^(1×1×n) }.   (5)

The lateral slices of U form an orthonormal basis for the range of A. More details related to the definition and the rest of the linear-algebraic framework can be found in [2].

The definition of the range of a tensor leads to the notion of projection. Given a lateral slice B⃗ ∈ R^(ℓ×1×n), the orthogonal projection into the range of A is defined as U ∗ U^T ∗ B⃗. We require the following theorem to understand the tubal-rank of tensors:

Theorem 1 ([1]):
For k ≤ min(ℓ, m), define

    A_k = Σ_{i=1}^{k} U⃗_i ∗ s_ii ∗ V⃗_i^T,

where U⃗_i and V⃗_i are the ith lateral slices of U and V, respectively, and s_ii is the (i, i)-tube of S. Then

    A_k = arg min_{Ã ∈ M} ||A − Ã||_F,

where M = { C = X ∗ Y | X ∈ R^(ℓ×k×n), Y ∈ R^(k×m×n) }.

From Theorem 1, we say A_k is a tensor of tubal-rank k; the definition of tubal rank is from [2]. It follows from the above that A_k is the best tubal-rank-k approximation to A.

B. The Algorithm
Suppose we have a set of training images and each image in the set belongs to one of N different classes. First, we form a third-order tensor for each class, A_1, A_2, ..., A_N, where A_i contains all the training images belonging to class i, stored as lateral slices. We assume all the training images are ℓ × n and that there are m_i images in class i; i.e., A_i is an ℓ × m_i × n tensor. Note that the m_i need not be the same. We then form a tubal-rank-k local tSVD (Theorem 1) for each tensor:

    A_i ≈ U_i ∗ S_i ∗ V_i^T  for i = 1, ..., N,   (6)

where U_i is an ℓ × k × n tensor and k ≪ m_i. Now, instead of storing all the training images, we need only store an ℓ × k × n tensor for each class. The training basis is thus an optimal basis in the sense of Theorem 1. The tensor operator U_i ∗ U_i^T is an orthogonal projection tensor [2] onto the space which is the t-linear combination of the lateral slices of U_i. Likewise, (I − U_i ∗ U_i^T) projects orthogonally to this space.

Next, suppose a test image belongs to one of the N classes and we want to determine the class to which it belongs. We re-orient this image as a lateral slice B⃗ and use our local tSVD bases to compute the norms of the residuals of the image after projection onto each training space:

    arg min_{i=1,...,N} ||B⃗ − U_i ∗ U_i^T ∗ B⃗||_F.   (7)

If B⃗ is a member of class i, we expect the ith term in (7) to be small. We determine the class to which B⃗ belongs by which projection is the closest to the original image in the Frobenius norm. (We note that extensions of the t-product and corresponding decompositions are possible for higher-order tensor representations, e.g., for color image training data, as well [18], [19].)

III. EXPERIMENTS AND RESULTS
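Before describing the experiments, the classification procedure of the previous section can be sketched end-to-end in NumPy on synthetic data. This is our illustration, not the code used for the experiments below; the helper names (local_basis, classify) are ours.

```python
import numpy as np

def _tprod(A, B):
    # t-product via the FFT along the third mode (slice-wise products)
    Ah, Bh = np.fft.fft(A, axis=2), np.fft.fft(B, axis=2)
    return np.fft.ifft(np.einsum('ipk,pjk->ijk', Ah, Bh), axis=2).real

def _ttrans(A):
    # tensor transpose: transpose frontal slices, reverse slices 2..n
    At = np.transpose(A, (1, 0, 2))
    return np.concatenate([At[:, :, :1], At[:, :, :0:-1]], axis=2)

def local_basis(A, k):
    """Truncated tSVD basis U_i of a class tensor A (l x m_i x n), as in
    (6): keep the k leading lateral slices of U."""
    l, _, n = A.shape
    Uh = np.empty((l, k, n), dtype=complex)
    Ah = np.fft.fft(A, axis=2)
    for j in range(n // 2 + 1):
        u, _, _ = np.linalg.svd(Ah[:, :, j], full_matrices=False)
        Uh[:, :, j] = u[:, :k]
    for j in range(n // 2 + 1, n):  # conjugate symmetry keeps U real
        Uh[:, :, j] = Uh[:, :, n - j].conj()
    return np.fft.ifft(Uh, axis=2).real

def classify(B, bases):
    """Assign the lateral slice B (l x 1 x n) to the class whose
    projection residual ||B - U_i * U_i^T * B||_F is smallest, as in (7)."""
    resid = [np.linalg.norm(B - _tprod(U, _tprod(_ttrans(U), B)))
             for U in bases]
    return int(np.argmin(resid))

# Tiny synthetic demo: two classes of 8 x 8 "images" stored as 8 x 1 x 8
# lateral slices, six noisy training samples around each class prototype.
rng = np.random.default_rng(1)
proto = [rng.standard_normal((8, 1, 8)) for _ in range(2)]
classes = [np.concatenate([p + 0.01 * rng.standard_normal(p.shape)
                           for _ in range(6)], axis=1) for p in proto]
bases = [local_basis(A, k=2) for A in classes]
test = proto[0] + 0.01 * rng.standard_normal((8, 1, 8))
label = classify(test, bases)  # expected to match class 0
```

Because a noisy copy of a class prototype lies nearly in the t-linear span of that class's truncated basis, its residual against the correct class is small while the residual against the other class remains large.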
To test our local tSVD classifier, we use the public MNIST dataset of handwritten digits as a benchmark [4]. The MNIST dataset contains 60,000 training images and 10,000 test images. Each image is a 28 × 28 grayscale image consisting of a single hand-written digit (i.e., 0 through 9). We organize the training images by digit, resulting in 10 different classes with the distribution of digits displayed in Figure 2.

Fig. 2. Table of MNIST digit distribution.
Each training image is stored as a 28 × 1 × 28 lateral slice, so the class tensor for digit i is of size 28 × m_i × 28. Using (6), we independently form a local tSVD basis for each class, U_0, U_1, ..., U_9, where U_i is the basis for the digit i and of size 28 × k × 28 for some truncation k. For simplicity, we use the same truncation k for all bases.
Our first objective is to use these local tSVD bases to determine the digit in each test image. Suppose P⃗_j is the 28 × 1 × 28 lateral slice of the jth test image. We determine how similar P⃗_j is to each digit using the metric (7):

    arg min_{i=0,...,9} ||P⃗_j − U_i ∗ U_i^T ∗ P⃗_j||_F.   (8)

To measure the accuracy of our classification, we compute the recognition rate for the entire test data as follows:

    r = (number of correctly classified test images) / (total number of test images) × 100%.   (9)

For various truncation values k, we obtain the recognition rates shown in Figure 3.

Fig. 3. Classification accuracy for various truncation values.
    Truncation   k = 3   k = 4   k = 5   k = 10
    r (%)        87.99   88.51   87.14   75.31

From Figure 3, we notice that smaller truncation values yield greater classification accuracy. This indicates that the magnitude of the tubes of singular values in S (i.e., ||s_ii||_F) decays rapidly for the early truncation values, as demonstrated in Figure 4. (Note that the tSVD offers flexibility in prescription of the truncation level per basis [3].)
Fig. 4. Magnitude decay of norm of singular value tubes for digits 0-4.
Notice in Figure 4 that the magnitude of the tubes of S decays rapidly for the first few indices i and decays more slowly starting at the index i = 5. This implies we can optimize our storage costs by truncating at about k = 5 without losing significant classification accuracy.

In addition to the overall classification accuracy, we can measure the accuracy of classifying each digit as

    r_i = (number of images of digit i correctly classified) / (total number of images of digit i).   (10)

We show the per-digit accuracy results for k = 4 in Figure 5.

Fig. 5. Classification accuracy per digit for truncation k = 4.

    Digit   Most Freq.   2nd Most   r_i (%)
    0       0            1          91.12
    1       1            4          96.56
    2       2            0          83.92
    3       3            8          82.77
    4       4            1          96.13
    5       5            8          79.48
    6       6            1          93.32
    7       7            9          90.95
    8       8            5          82.14
    9       9            4          87.02

In Figure 5, the "Most Freq." column indicates the class to which the images of each digit were most frequently classified. The "2nd Most" column indicates the second class to which the images of each digit were most frequently classified. We illustrate some of the mis-classifications that occur in Figure 6 for truncation k = 4.

Fig. 6. Examples of incorrect classification of images that should be 7.

We notice that the images in Figures 6a and 6b do have qualitative similarities to the classes to which they were incorrectly assigned. We can likely improve for ambiguous digits by adding additional features for each class and/or employing slightly different metrics.

B. Numerical Results: Identification
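The identification task below compares two test images by the cosine of the angle between their feature vectors of per-class residuals (8). The scoring step can be sketched as follows; the function name and the small example feature matrix are ours, for illustration only.

```python
import numpy as np

def cosine_scores(F):
    """Pairwise cosine similarity of the rows of a feature matrix F,
    where row j holds the per-class residuals (8) of test image j.
    Returns the symmetric similarity score matrix S."""
    G = F / np.linalg.norm(F, axis=1, keepdims=True)  # unit-norm rows
    return G @ G.T

# Illustrative residual features for four "images" over three classes:
# rows 0 and 1 are scalar multiples of each other (cosine 1, i.e. the
# same residual profile), while row 3 favors a different class.
F = np.array([[0.1, 0.9, 0.8],
              [0.2, 1.8, 1.6],
              [0.9, 0.1, 0.9],
              [1.0, 0.0, 0.0]])
S = cosine_scores(F)
```

Images containing the same digit share a residual profile (small in the same coordinate), so their cosine score is high; this is what produces the bright diagonal blocks in the similarity matrix discussed below.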
Our second objective is to use our local tSVD feature vectors to determine if a pair of test images contains the same digit. To solve this problem, we consider each comparison (8) to be a feature for a particular image P⃗_j instead of minimizing over the number of classes. More specifically, we construct a 10 × 1 vector of features for each of our 10,000 test images.

We measure the similarity between two images by computing the cosine between the feature vectors. Though other similarity metrics are possible, given what the (non-negative) entries in the feature vector represent, this seemed appropriate for proof of concept.

We compute the similarity for all (i, j)-pairs of test images to form a similarity score matrix S of size 10,000 × 10,000, where S is symmetric.

Fig. 7. Similarity score matrix for truncation k = 4.

In Figure 7, we display only a leading block of the similarity score matrix, and we notice that blocks along the diagonal contain the highest similarity scores, as desired given the ordering of the test data. This illustrates that the cosine metric does enable us to determine if two images contain the same digit.

Fig. 8. ROC curve for various truncation values k.

Using a receiver operating characteristic (ROC) curve in Figure 8, we visualize the effectiveness of our local tSVD classifier. Notice the curve for truncation k = 10 is significantly lower, indicating smaller truncation values (indicative of less storage) yield better accuracy for the MNIST dataset.

IV. CONCLUSIONS AND FUTURE WORK
We have developed a new local truncated tSVD approach to image classification based on provable optimality conditions which is elegant in its straightforward mathematical approach to the problem. Beyond the innate computational and storage efficiency advantages of the proposed approach, it has demonstrated effective performance in classifying MNIST data. The primary purpose of this short paper was a proof of concept of a new method. In the future, we will compare our approach to current state-of-the-art approaches in terms of storage, computation time, and qualitative classification results for larger and different datasets (e.g., subjects from a dataset of facial images). Additionally, we seek an automated strategy for determining the optimal truncation value k, or a varied truncation scheme, denoted tSVDII, as in [3]. We will also explore whether the alternative tensor-tensor products from [8] and their corresponding truncated tSVDs will allow us to obtain more illustrative features, and whether new double-sided tSVD techniques [20] that are insensitive to tensor orientation are useful here as well.

ACKNOWLEDGMENT