TOP-SPIN: TOPic discovery via Sparse Principal component INterference
Martin Takáč, Selin Damla Ahipaşaoğlu, Ngai-Man Cheung, Peter Richtárik
University of Edinburgh and Singapore University of Technology and Design
August 29, 2018
Abstract
We propose a novel topic discovery algorithm for unlabeled images based on the bag-of-words (BoW) framework. We first extract a dictionary of visual words and subsequently for each image compute a visual word occurrence histogram. We view these histograms as rows of a large matrix from which we extract sparse principal components (PCs). Each PC identifies a sparse combination of visual words which co-occur frequently in some images but seldom appear in others. Each sparse PC corresponds to a topic, and images whose interference with the PC is high belong to that topic, revealing the common parts possessed by the images. We propose to solve the associated sparse PCA problems using an Alternating Maximization (AM) method, which we modify for the purpose of efficiently extracting multiple PCs in a deflation scheme. Our approach attacks the maximization problem in sparse PCA directly and is scalable to high-dimensional data. Experiments on automatic topic discovery and category prediction demonstrate encouraging performance of our approach.
The goal of this paper is to design a method performing the following:
Given a database of n images, identify k (not necessarily disjoint) collections of images, S_1, ..., S_k, each covering a certain "topic". A definition of a topic is not provided, and hence we are looking for an unsupervised learning method able to first i) automatically identify the topics from the images, and then ii) form collections of images belonging to these topics [5, 14, 1, 7].

For instance, consider a database of photos of people, cars and buildings (without knowing this). Some photos may contain people and no cars or buildings, some may have people and cars, and some may be photos of buildings unspoiled by cars or people. From the viewpoint of the "cars" topic, people and buildings are clutter/background. From the viewpoint of the "people" topic, cars and buildings are background and not essential. We would wish to be able to automatically discover these three topics. Note that it may be that people and cars always occur together in an image, while people and buildings also always occur together. In that case the topics we would wish to discover are "people and cars" and "people and buildings".

It has recently been demonstrated [17] that sparse PCA is able to discover topics in a database of articles. The approach is applied to a data matrix A whose rows correspond to articles and whose columns correspond to words, with A_ij equal to the frequency of word j in article i.

∗ The work of Martin Takáč was supported by the Centre for Numerical Algorithms and Intelligent Software (funded by EPSRC grant EP/G036136/1 and the Scottish Funding Council) and by the EPSRC grant EP/I017127/1 (Mathematics for Vast Digital Resources). The work of Peter Richtárik was supported by EPSRC grants EP/J020567/1 (Algorithms for Data Simplicity) and EP/I017127/1 (Mathematics for Vast Digital Resources).
For example, [17] showed that in a NYTimes article dataset, the words associated with the first and second sparse PCs are million, percent, business, company, market, companies and point, play, team, season, game, respectively. These words discover two of the most important topics in the articles: business and sports.

One of our contributions is to show that a similar approach can be successfully applied to images. As we shall see, identification of topics in image databases can be performed by extracting sparse principal components of a matrix whose rows correspond to all images in the database and whose columns correspond to visual words (obtained by quantization of local descriptors such as SIFT, via clustering), with the (i, j) entry representing the frequency of visual word j in image i. Images are subsequently assigned to the identified topics using a simple technique we call interference: images whose interference with a PC is high form natural topics.

Contents:
We start in Section 2 by briefly reviewing some of the relevant literature. In Section 3 we propose and describe TOP-SPIN, an algorithm for topic discovery. Further, in Section 4 we provide some background on sparse PCA and present a scalable algorithm for extracting sparse PCs. In Section 5 we provide numerical evidence for the efficacy and efficiency of our approach. Finally, we conclude in Section 6 with a brief summary of our main contributions.
In the unsupervised visual object categorization problem, we attempt to uncover the category information of an image dataset without relying on any information capturing image content [5, 14, 1, 7]. Unsupervised categorization relieves the burden of human labeling and removes subjective bias. Grauman and Darrell [5] proposed a graph-based method for unsupervised object categorization. In their work, the sets of local feature descriptors extracted from individual database images are graph nodes, while graph edges are weighted by the number of correspondences between images. A spectral clustering algorithm is then applied to the graph's affinity matrix to produce image groupings. Sivic et al. [14] demonstrated unsupervised learning of object hierarchy from datasets of unlabeled images. In their work, the generative Hierarchical Latent Dirichlet Allocation (hLDA) model, previously used for text analysis [2], is adapted to the visual domain. Images are represented by a visual vocabulary of quantized SIFT descriptors. A "coarse-to-fine" description of the images with varying degrees of appearance and spatial localization granularity is proposed to facilitate discovery of visual object class hierarchies. Bart et al. [1] also proposed unsupervised learning of visual taxonomies, independently of Sivic et al. [14]. They use a modified nonparametric prior over tree structures of a certain depth [2]. Their modified model allows several topics to be represented at each node in the taxonomy and makes all topics available at every node to facilitate inference of visual taxonomies. Images are represented using space-color histograms. Based on the BoW framework, Kinnunen et al. [7] applied the self-organization principle and the Kohonen map to solve unsupervised visual object categorization.

Our work is also related to object recognition.
One important difference is that we do not assume any prior category information: as will be discussed, we discover object categories automatically from the dataset, and the testing images are assigned to these object categories using the same framework.

In object recognition, the use of local descriptors with a high degree of invariance has become one of the dominant approaches [16]. In particular, in the BoW approach, an image is represented by a bag of highly-invariant local feature descriptors (e.g., [8]). These local descriptors may be further clustered or quantized into a dictionary of visual words [15]. A visual word occurrence histogram of an image is used to determine a distance for classification of object categories. To generate a large dictionary of vocabularies, hierarchical quantization can be used to produce a vocabulary tree with the leaf nodes being the visual words [12]. A recent work of Naikal et al. [11] used sparse PCA to select informative visual words to improve object recognition. Given the prior object category information, they apply sparse PCA to each object category separately to select informative (more useful) visual words within individual categories. The union of all the informative visual words selected from individual categories forms the overall refined visual dictionary. Different from Naikal et al. [11], our work discovers object categorization automatically by applying sparse PCA in a different way (and with a different philosophy). Also, we propose to perform category prediction by projecting the test image's occurrence histogram vector directly onto the principal components (PCs) associated with the discovered categories, which differs from previously proposed BoW-based object recognition systems. We argue that with our approach each PC selects and associates co-occurring visual words that are signatures for a category.
The projection of the test image's histogram onto a PC quantifies the extent of visual word co-occurrence in the test image, which is useful for predicting the category.
We propose TOP-SPIN (Algorithm 1), a method for TOPic discovery via Sparse Principal component INterference.
Algorithm 1
TOP-SPIN
Input: n images; p = dictionary size (number of visual words); k = number of topics; s = sparsity level.
1. Representation:
   a) represent image i by a row vector h_i ∈ R^p (normalized visual word histogram);
   b) choose visual word weights w ∈ R^p, w ≥ 0.
2. Extract topics via sparse PCA:
   form A = H Diag(w) ∈ R^{n×p}, where the i-th row of H is h_i;
   extract s-sparse PCs x^1, ..., x^k ∈ R^p from A.
3. Detect topic images via interference:
   a) choose thresholds δ_1, ..., δ_k > 0;
   b) set S_l ← { i : Intf(i, x^l) > δ_l }, l = 1, ..., k.
Output: S_l (images associated with topic l), l = 1, ..., k.

In Step 1a we utilize the standard Bag of Words (BoW) approach: for each image we identify keypoints (e.g., by Maximally Stable Extremal Regions (MSER)), and then find local feature descriptors for them (e.g., by the SIFT algorithm; SIFT descriptors are 128-dimensional vectors). We identify a large number of descriptors for each image, select a random subset, and perform clustering, obtaining p cluster centers ("visual words"). Each local descriptor in an image is then substituted by the closest visual word (distances are measured in the L2 norm). Therefore, image i can be described by a histogram vector f_i ∈ R^p as follows: f_ij is the number of appearances of visual word j in image i. For normalization purposes (e.g., sharpness, size) we instead represent each image i by the normalized histogram h_i = f_i / Σ_j f_ij. While in this paper we focus on this particular image representation, our framework also applies to other representations.

Some visual words may be more important than others. For instance, a word appearing in all images with identical frequency is not informative and hence can be excluded from further analysis. In Step 1b we associate with each visual word j = 1, 2, ..., p a weight w_j ≥ 0, forming a vector w ∈ R^p_+. In the experiments in this paper we work with the Term Frequency Inverse Document Frequency (TF-IDF) weights [12] defined by w_j = ln(n / n_j), where n_j = |{ i : h_ij > 0 }|, i.e., the number of images containing visual word j. If word j occurs in many images, then w_j is small, and vice versa.
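Steps 1a–1b above can be sketched in a few lines of NumPy (a minimal sketch; we assume a raw count matrix F is available from the BoW quantization step, and the function name is ours):

```python
import numpy as np

def build_weighted_matrix(F):
    """Given an n x p matrix F of raw visual-word counts (F[i, j] = number of
    occurrences of visual word j in image i), return A = H Diag(w), where H
    holds the normalized histograms and w the TF-IDF weights w_j = ln(n/n_j)."""
    n, p = F.shape
    # Step 1a: normalized histograms h_i = f_i / sum_j f_ij
    H = F / F.sum(axis=1, keepdims=True)
    # Step 1b: TF-IDF weights; n_j = number of images containing word j
    n_j = np.count_nonzero(F, axis=0)
    w = np.log(n / n_j)          # assumes every word occurs in at least one image
    return H * w                 # equivalent to H @ np.diag(w)

# toy example: 3 images, 4 visual words
F = np.array([[2, 0, 1, 1],
              [0, 3, 1, 0],
              [1, 1, 1, 1]], dtype=float)
A = build_weighted_matrix(F)
print(A.shape)  # (3, 4)
```

Note that visual word 2 occurs in all three toy images, so its TF-IDF weight is ln(3/3) = 0 and the corresponding column of A vanishes, exactly as the "uninformative word" discussion above suggests.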
However, different weights might be preferable depending on the dataset.

In Step 2 we extract k leading sparse principal components (sparse PCs) of the matrix A = H Diag(w), where the i-th row of H ∈ R^{n×p} is h_i and Diag(w) is the p×p diagonal matrix with the vector w on its diagonal. Various sparse PCA formulations have been suggested in the literature. Here we propose that the s-sparse PC x^l be obtained as the solution of the following optimization problem:

    maximize ||A_l x||_2  subject to  ||x||_2 ≤ 1, ||x||_0 ≤ s,    (1)

where ||·||_2 is the standard Euclidean norm, ||x||_0 = |{ i : x_i ≠ 0 }| (the number of nonzero elements of x), and A_{l+1} = A_l − A_l x^l (x^l)^T with A_1 = A. Further, we propose that (1) be solved by the simple yet powerful Alternating Maximization (AM) framework presented in [13]. The authors of [13] provide a source code called "24AM": the method is scalable, fast and parallel, and can be run on multicore machines, GPUs and clusters. However, 24AM does not implement the solution of a sequence of problems (1) for l = 1, 2, ..., k (deflation techniques for sparse PCA are described in [9]). A naive approach would be to simply solve (1) in a loop, forming A_{l+1} from A_l as described above. However, this is not efficient due to the structure and sparsity of the problem. We therefore implement our own multicore version of the method in C++ suitable for the task. Our sparse PCA solver is three orders of magnitude faster than the Augmented Lagrangian Method (ALM) proposed by [11] for p = 500, and its advantage grows with p. More details on sparse PCA, AM, our modifications of AM and a comparison with ALM are given in Section 4.

Define the interference between PC x^l and image i via

    Intf(i, x^l) := | Σ_{j=1}^p h_ij w_j x^l_j |.

That is, it is the absolute value of the inner (dot) product between x^l and a_i := (h_i1 w_1, h_i2 w_2, ..., h_ip w_p)^T (the i-th row of A).
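In matrix form, both the interference of every image with a PC and the deflation step A_{l+1} = A_l − A_l x^l (x^l)^T are one line each (a minimal NumPy sketch; function names are ours):

```python
import numpy as np

def interference(A, x):
    """Intf(i, x) = |a_i . x|: absolute inner product of each row of A with x."""
    return np.abs(A @ x)

def deflate(A, x):
    """Deflation step A_{l+1} = A_l - A_l x x^T used between PC extractions."""
    return A - np.outer(A @ x, x)

# toy check on a tiny block-structured A (as in the idealized example of Figure 1)
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.0, 1.0]])
x = np.array([0.6, 0.4, 0.0])
x = x / np.linalg.norm(x)        # unit-norm PC supported on the first "topic"
print(interference(A, x))        # only image 0 interferes with this PC
print(interference(deflate(A, x), x))  # after deflation, interference with x vanishes
```

Since ||x||_2 = 1, the deflated matrix satisfies A_{l+1} x^l = A_l x^l (1 − x^l·x^l) = 0, so subsequent PCs are driven toward directions not yet explained.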
It is easy to check that Intf(i, x^l) is in fact the length of the projection of a_i onto x^l: it quantifies the extent to which image i contains the visual words associated with PC x^l. In Step 3b we define S_l to be the set of images i having large enough interference with x^l, where the precise quantitative meaning of "large enough" is controlled by the parameter δ_l chosen in Step 3a. This parameter can be chosen as follows. We compute the interferences of all images with x^l and subsequently cluster them into two clusters: "high" and "small". We then pick δ_l so that it separates the two clusters, which leads to topic collections S_l adapted to the data at hand. (The 24AM code is available at https://code.google.com/p/24am/.)

We illustrate the method on a simplified artificial example (see Figure 1). We have n = 9 images which naturally belong to 3 categories/topics: guns, mice and bicycles. In Step 1 we identify 8 visual words: 3 for guns (green, brown and pink dots), 2 for mice (blue and dark green dots) and 3 for bicycles (light blue, purple and orange dots). In this case the situation is perfect, as no two images in different topics contain the same visual word. Here we choose w to be the vector of all ones. As a consequence, A is block diagonal, with rows a_1, ..., a_9 as depicted in Step 2 in Figure 1. In
Step 2 of TOP-SPIN, sparse PCs x^1, x^2 and x^3 are computed (we can choose s = 3). Each sparse PC has zero values on the visual words of two of the topics and nonzero values on a single topic. In this sense, each sparse PC (perfectly) identifies a topic. In particular, x^1 represents the "mice" topic, x^2 represents the "bicycles" topic and x^3 represents the "guns" topic. Finally, in Step 3, for each x^l we compute the interferences with each normalized histogram vector a_i. The last step in Figure 1 plots each image in a 3D space, with the coordinates of image i being (Intf(i, x^1), Intf(i, x^2), Intf(i, x^3)). In this example the interference of i with x^l is nonzero if and only if i belongs to the topic represented by PC x^l. Hence, each of the sets S_l, l = 1, 2, 3, will consist of images depicted on a single axis of the 3D space. The three sets S_1, S_2, S_3 identified by TOP-SPIN correspond perfectly to the natural topics inherent in the image database.

Real data sets differ from the simplified example depicted in Figure 1 in several ways. First, there will be many images and many visual words. Second, A will not be block diagonal: images will naturally share visual words with other images, since they may share multiple objects. As a consequence, the topics discovered by TOP-SPIN will not be as perfect as in the simplified example. Please see Section 5 for numerical experiments with real datasets.

Principal Component Analysis (PCA) is an important tool for dimension reduction and data analysis. Let A ∈ R^{n×p} denote a data matrix whose rows correspond to measurements of p variables. PCA finds linear combinations of the columns of A, called principal components (PCs), pointing in mutually orthogonal directions, together explaining as much variance in the data as possible. If the rows of A are centered, the problem of extracting the first PC can be written as max { ||Ax|| : ||x||_2 ≤ 1 }, where ||·|| is any norm for measuring variance.
Although classical PCA employs the L2 norm, the L1 norm can also be used; this is especially useful when the data are contaminated (e.g., by outliers). Further PCs can be obtained by deflation, as explained in the previous section.

PCA usually produces PCs that are combinations of all variables. In many applications, however, including topic discovery, it is desirable to induce sparsity in the PCs. The problem of finding PCs with few nonzero components is known as sparse PCA or SPCA (see [3], [4], [6] and [18]). Sparsity is usually incorporated either by directly enforcing a constraint on the number of nonzero components in a PC, as in (1), or by adding a penalty term to the objective function.

We use the open-source 24AM framework [13] for solving the SPCA problem. 24AM is a unifying Alternating Maximization framework for large scale PCA and SPCA problems, capable of solving various formulations of SPCA. It also includes parallel implementations of the method for various architectures. In particular, we find that the cardinality-constrained formulation (1) works best, and hence we present 24AM for that case only: Algorithm 2. The method's name comes from the fact that, for a certain function F(x, y) and convex sets X and Y, the two steps of 24AM are of the following alternating maximization form [13]:

    y = arg max_y { F(x, y) : y ∈ Y }  and  x = arg max_x { F(x, y) : x ∈ X }.

Algorithm 2 (24AM for formulation (1))
  Initialize x^(0) ∈ R^p and set t ← 0.
  Repeat:
    y^(t) = A x^(t) / ||A x^(t)||_2
    x^(t+1) ← T_s(A^T y^(t)) / ||T_s(A^T y^(t))||_2
    t ← t + 1
  Until a stopping criterion is satisfied.

By T_s(a) we denote the vector obtained from a by keeping the s largest elements a_j in absolute value and setting the rest to zero.

ALM is an Augmented Lagrangian Method proposed in [11] for object recognition, applied to an SDP relaxation of the sparse PCA formulation of [4]. 24AM, on the other hand, works with the most natural formulation of sparse PCA directly.
ALM does not control the sparsity level of the solution directly, but via a penalty parameter whose value is a very poor predictor of sparsity. If a particular target sparsity is sought, one needs to run ALM repeatedly with different values of the penalty parameter, effectively fine-tuning for it. 24AM, on the other hand, does not suffer from this issue, as sparsity is controlled directly by s. (A simple scaling argument shows that the solution of (1) must satisfy ||x||_2 = 1.)
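The two alternating steps of Algorithm 2 can be sketched in a few lines of NumPy (a simplified, single-threaded sketch with our own function names; the production 24AM code is parallel C++, and since AM is a local method it is typically run from several random starts):

```python
import numpy as np

def hard_threshold(a, s):
    """T_s(a): keep the s largest entries of a in absolute value, zero the rest."""
    out = np.zeros_like(a)
    keep = np.argsort(np.abs(a))[-s:]
    out[keep] = a[keep]
    return out

def am_sparse_pc(A, s, iters=200, seed=0):
    """Alternating maximization for: max ||Ax||_2 s.t. ||x||_2 <= 1, ||x||_0 <= s."""
    rng = np.random.default_rng(seed)
    x = hard_threshold(rng.standard_normal(A.shape[1]), s)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = A @ x                     # y-step: y = Ax / ||Ax||_2
        y /= np.linalg.norm(y)
        x = hard_threshold(A.T @ y, s)  # x-step: x = T_s(A^T y) / ||T_s(A^T y)||_2
        x /= np.linalg.norm(x)
    return x

# sanity check: the returned vector is feasible for problem (1)
A = np.array([[3.0, 0.0, 1.0, 0.2],
              [0.0, 2.0, 0.0, 0.1],
              [1.0, 0.0, 2.0, 0.0]])
x = am_sparse_pc(A, s=2)
print(np.count_nonzero(x), round(np.linalg.norm(x), 6))  # at most 2 nonzeros, unit norm
```

The normalization to unit norm in each step matches the scaling argument above: the constraint ||x||_2 ≤ 1 is always active at a solution.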
[Figure 2 panels: computational time [sec] versus p (left) and f(x)/f* versus p (right), each comparing 24AM and ALM.]
Figure 2: 24AM vs ALM.

In Figure 2 we compare the performance of 24AM and ALM on artificial random matrices A ∈ R^{n×p} with n = p and p ∈ {100, 200, ..., 500}. For each problem we fixed a penalty parameter and obtained a single leading sparse PC using the ALM method. We then measured the resulting sparsity s of the solution. Subsequently, we ran 24AM with the target sparsity level set to s. Here are our findings. First, 24AM terminates three orders of magnitude faster than ALM for p = 500, with the gap growing with p (left plot). Hence, 24AM is well suited for problems where it is beneficial to work with a large number of visual words. Second, the 24AM solutions for all problem instances are of better quality than those obtained by ALM (right plot), in the sense that they explain more of the optimal variance. That is, the ratio f(x)/f* is larger, where f* = ||A x*||, x* is the optimal non-sparse PC, f(x) = ||Ax||, and x is the s-sparse PC found by the methods.

In this section we highlight, on a sequence of carefully chosen experiments, the efficacy and efficiency of TOP-SPIN. We work with the BMW (Berkeley Multiview Wireless) dataset [10, 11], consisting of 20 image categories (Berkeley campus buildings) with 16 × 5 = 80 images in each. In each category the same building is captured repeatedly from different distances and angles 16 times, each time simultaneously by 5 cameras attached to a fixed frame in close proximity to one another. Hence, there is a total of 20 × 16 × 5 = 1,600 images. In all experiments we use MSER keypoints and SIFT descriptors; our codes were implemented in C++. We used the OpenCV library v2.4.4.0 to find the keypoints, extract local descriptors, and perform hierarchical clustering (using FLANN) to obtain a dictionary of visual words.

In this section we empirically show that sparse PCs can identify topics.
We took all images from a single camera, used p = 5,000 visual words, and extracted sparse PCs with s = 20. For illustration purposes we limit our attention to just 2 topics; the message of course applies to more topics. The top row of Figure 4 depicts the interference between sparse PCs x^1, x^2 and images belonging to three different topics/categories (red, blue and green bars). One can observe that indeed both sparse PCs have high interference with a single image topic. The next two rows show the same image three times: in the 1st column with all visual words, in the 2nd column with only those visual words selected by x^1 (i.e., in the set { j : x^1_j ≠ 0 }), and in the 3rd column with only those visual words selected by x^2. Clearly, x^1 selects a substantial number of visual words in the second-row image and selects hardly any visual words in the third-row image. The top image has high interference with x^1, while the bottom image has low interference. The situation with x^2 is reversed. Indeed, the top image belongs to S_1, the topic attached to x^1, while the bottom image belongs to S_2.

Figure 3: Random vectors do not identify topics, sparse PCs do. (a) Projection of {h_i} onto a random 3D subspace. (b) Projection of {h_i} onto the 3D space spanned by three sparse PCs.

In Figure 3 we focus on the same three categories as in the previous test, but in this case we visualize them in 3D space, as in Figure 1. Because each image is represented by a p = 5,000 dimensional vector (h_i), a naïve approach for visualizing h_i in 3D would be to project the vectors h_i onto a random 3D subspace of R^p (Figure 3, left). No apparent separation of the images belonging to the three topics (represented by different colors and markers) is present. However, if we project onto the space spanned by the PCs corresponding to the three topics, we can clearly see the images belonging to different topics coalescing around different axes (Figure 3, right).

Let us now look at (a portion of) the actual output of TOP-SPIN for k = 7, with a dictionary of size p = 5,000 and s = 50. Figure 5 depicts the sets S_1 and S_2 for δ_1 and δ_2 chosen so that |S_1| = |S_2| = 8. It is clear that the method is able to identify the categories. We would like to stress that sparse PCA is applied to the entire training dataset, and that testing is done on different images. In contrast, the approach in [11] presupposes knowledge of the categories, as sparse PCA is applied to test images from each category. Moreover, as we shall see later, TOP-SPIN is also able to give better categorization accuracy.

In this section we consider the problem of category prediction (object recognition). While this is a different problem from topic discovery, the main focus of this paper, we will show that our framework can also be used to perform category prediction.

Figure 4: Different principal components select visual words prevalent in different categories.

Figure 5: The top part (plot + 8 images) corresponds to sparse PC x^1, the bottom part (plot + 8 images) corresponds to sparse PC x^2. The plots show the interferences of all images (horizontal axis) with the given PC. Images from the same category/topic (not known to our method!) are represented by the same color (but each color is used three times, and each time it represents a different category/topic). For each PC we show the 8 images having the largest interference with it; these are the sets S_1 and S_2 for appropriate choices of δ_1 and δ_2. One can observe that not only does TOP-SPIN select important features (visual words), but it also correctly identifies topics.

Moreover, we demonstrate that our approach yields prediction accuracy superior to the state of the art [11]. Each image in the BMW dataset can be represented by a triple (a, b, c), where c is the category number (0–19), b is the camera number (0–4) and a is the shot number (0–15). Let M consist of all images with odd a and b = 2 (i.e., 8 images per category). The remaining images are partitioned into two groups: T, consisting of images with even a and b ≠ 2 (32 images per category), and D (the rest; 40 images per category). Finally, let L be the set of all images with b = 2. We set aside L for "learning", M for "matching" and T for "testing", as described below.

In the following we describe and compare four methods, two from the literature (Baseline and NYS [11]) and two new ones (Method 1 and Method 2), all of which perform the following category prediction task. Using images in L, learn a classifier which matches each image i in the testing set T to an image m(i) in the matching set M. We then compute the prediction accuracy of each method, defined as the percentage of images i ∈ T for which i and m(i) have the same category.
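The split of the BMW images into the sets L, M, T and D follows directly from the (a, b, c) indexing (a sketch; the triples here stand in for the actual images):

```python
from itertools import product

# every image is a triple (a, b, c): shot 0-15, camera 0-4, category 0-19
images = list(product(range(16), range(5), range(20)))

L = [(a, b, c) for (a, b, c) in images if b == 2]                  # learning
M = [(a, b, c) for (a, b, c) in images if a % 2 == 1 and b == 2]   # matching
T = [(a, b, c) for (a, b, c) in images if a % 2 == 0 and b != 2]   # testing
D = [img for img in images if img not in M and img not in T]       # the rest

print(len(L) // 20, len(M) // 20, len(T) // 20, len(D) // 20)  # per-category sizes
```

The per-category sizes come out as 16, 8, 32 and 40, and M and T are disjoint by construction: they share no shot number a (odd vs even) and no camera b.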
The results are summarized in Table 1; the description of the methods follows. In all methods, a dictionary of p = 5,000 visual words is first extracted from images in L, and then normalized histograms are computed for all images. In the sparse PCA based methods we used 24AM for PC extraction.
1. Baseline.
This classifier is given by m(i) = arg min { ||h_i − h_m||_1 : m ∈ M }. That is, we assign i to the image m(i) whose histogram is closest to that of i in the L1 norm.
2. NYS.
In [11], the authors form, for each category c, a matrix A_c of normalized histograms corresponding to images in L having category c, and then extract several sparse PCs of A_c. Let the union of the supports of the PCs for category c be I_c, and let I = ∪_c I_c. The NYS classifier is given by m(i) = arg min { Σ_{j ∈ I} |h_ij − h_mj| : m ∈ M }. This is similar to Baseline, with the difference that only the important features (I) are used when computing the L1 distance.
3. Method 1.
Here we propose a classifier similar to NYS, with the exception that I is obtained as the union of the supports of 160 sparse PCs of the matrix A_L whose rows are the normalized histograms of all images in L.
4. Method 2.
Here we compute 160 50-sparse PCs from A_L and assign each PC to the image in M with which it has the highest interference. Then, when querying an image from T, we assign it to the PC with which it has the highest interference and, through this, using the mapping just described, to an image in M.

We can see from Table 1 that Method 2 is best (i.e., interference works better than L1), followed by Method 1 (i.e., computing PCs using all of A_L is better than computing PCs separately for each class), which is in turn superior to both NYS and Baseline. Method 2 outperforms Baseline by circa 4%.
Remark 1:
Note that T consists precisely of those images which have neither a nor b in common with any image in M. This is crucial, as for every image in D there is a (very) similar image in M (one with the same a but taken by a different camera b), which may skew the results. In fact, the prediction accuracy on images from group T is noticeably lower than on images from group D when SURF descriptors and the smaller dictionary of [11] are used; this gap is also present when SIFT and p = 5,000 are used. This is the reason why we discarded D and used only T for testing.

Table 1: Category prediction accuracy of four methods, by category and in total. Our approach improves on Baseline by 4%.

Cat.   Baseline   NYS       Method 1   Method 2
0      100.00%    100.00%   100.00%    100.00%
1       90.62%     93.75%    90.62%     87.50%
2       68.75%     71.88%    68.75%     87.50%
3       96.88%     96.88%   100.00%     96.89%
4       81.25%     81.25%    81.25%    100.00%
5      100.00%    100.00%   100.00%    100.00%
6      100.00%    100.00%   100.00%     81.25%
7       81.25%     81.25%    84.38%     96.88%
8       37.50%     43.75%    37.50%     81.25%
9       40.62%     46.88%    46.88%     75.00%
10      93.75%     90.62%    90.62%     81.25%
11     100.00%    100.00%   100.00%     90.62%
12      40.62%     40.62%    43.75%     37.50%
13     100.00%    100.00%   100.00%    100.00%
14      78.12%     75.00%    75.00%     78.12%
15      96.88%     93.75%    96.88%     96.88%
16      90.62%     90.62%    90.62%     78.12%
17     100.00%    100.00%   100.00%    100.00%
18      93.75%     93.75%    96.88%    100.00%
19     100.00%    100.00%   100.00%    100.00%
Total   84.53%     84.68%    85.16%     —

Remark 2: Also note that a perfect comparison of our results with [11] is not possible, as not all the data needed to reproduce the experiments exactly as in [11] are available to us. After implementing their method and setting all available options, we obtained a Baseline prediction accuracy of 80.69%, whereas the figure reported in [11] is 80.02%.
Moreover, we decided to use SIFT descriptors rather than SURF, as this way we obtained better results.

Figure 6: Features (visual words) selected by Baseline (first row), NYS per category = I_c (second row), NYS in aggregate = I (third row), and Method 1 / Method 2 (last row).

Let us now look at the features (visual words) selected by the four methods described above. In Figure 6 we show 3 images from different categories. The first row shows all features in the dictionary appearing in these images. These are the features used by Baseline. The second row shows only the features in I_c, for the three different values of the category c the three images belong to. In the third row we show the aggregate features I = ∪_c I_c. Finally, the last row shows the features selected by our approach (Method 1 / Method 2). Note that we are able to achieve a better selection of features than NYS (third row) without knowledge of the image categories. For fairness of comparison, the number of selected features was chosen to be the same for both NYS and Method 1 / Method 2.

Figure 7: 160 PCs represented as 160 lines with unique formatting, and their interference with 96 images (32 test images from each of 3 categories). We see that each PC has high interference with a subset of images of a single category only, effectively selecting it.

In Figure 7 we give additional insight into why Method 2 and TOP-SPIN work. The horizontal axis represents all test images belonging to three categories, CAT1, CAT2 and CAT3, the categories the images in Figure 6 belong to. That is, we consider 3 × 32 = 96 images. The first 32 images correspond to CAT1, images 33–64 to CAT2 and images 65–96 to CAT3. Now, for each of the 160 sparse PCs we plot a unique line, representing the interference of that PC with the images. For instance, the PC represented by the solid red line has high interference with images 45–64. Notice that all these images belong to CAT2. The PC corresponding to the solid blue line has high interference with images 34–45, again a subset of the images of CAT2. Note that neither the solid blue nor the solid red line has peaks in either of the other two regions/categories. This means that these PCs effectively represent some object common to a subset of images in CAT2. The same is true for all other lines and the PCs they represent.
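For concreteness, the interference-based assignment of Method 2 can be sketched as follows (a simplified sketch with hypothetical inputs: `pcs` is a k × p array of sparse PCs extracted from A_L, while `A_M` and `a_test` hold the weighted histogram rows for the matching set and a query image; all names are ours):

```python
import numpy as np

def method2_assign(pcs, A_M, a_test):
    """Method 2: map each PC to its highest-interference image in M, then map a
    query image to the PC it interferes with most, and through that PC to M."""
    # interference of every matching image with every PC: |A_M @ pcs.T|
    pc_to_image = np.argmax(np.abs(A_M @ pcs.T), axis=0)   # one image of M per PC
    best_pc = np.argmax(np.abs(pcs @ a_test))              # PC for the query image
    return pc_to_image[best_pc]

# toy example: 2 PCs, 2 matching images, query resembling matching image 1
pcs = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])
A_M = np.array([[0.9, 0.1, 0.0],   # matching image 0
                [0.0, 0.8, 0.2]])  # matching image 1
a_test = np.array([0.1, 0.7, 0.2])
print(method2_assign(pcs, A_M, a_test))  # -> 1
```

The query interferes most with the second PC, which in turn interferes most with matching image 1, so the query inherits that image's category.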
We now summarize some of our main contributions:
1. We developed an algorithm (TOP-SPIN) for topic discovery in a collection of unlabeled images. Our algorithm applies sparse PCA to identify co-occurring visual words that can be used as topic signatures.
2. We demonstrated on real datasets that TOP-SPIN is able to discover topics and correctly assign images to them.
3. When used for category prediction, our framework gives higher accuracy than that of [11]. Moreover, this is achieved without knowing what the categories are, as sparse PCA is applied to data coming from all (test) images of all categories, not to (test) images of each category individually as in [11].
4. Our sparse PCA solver is three or more orders of magnitude faster than ALM. It solves the sparse PCA problem directly (i.e., not a relaxation) and, unlike ALM, has direct control over the sparsity of the PCs (via s). It is parallel in nature and scalable to high dimensions.

References

[1] Evgeniy Bart, Ian Porteous, Pietro Perona, and Max Welling. Unsupervised learning of visual taxonomies. In CVPR, 2008.
[2] David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In NIPS, 2004.
[3] Alexandre d'Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269–1294, 2008.
[4] Alexandre d'Aspremont, Laurent El Ghaoui, Michael I. Jordan, and Gert R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 48(3):434–448, 2007.
[5] Kristen Grauman and Trevor Darrell. Unsupervised learning of categories from sets of partially matching image features. In CVPR, 2006.
[6] Michel Journée, Yurii Nesterov, Peter Richtárik, and Rodolphe Sepulchre. Generalized power method for sparse principal component analysis. Journal of Machine Learning Research, 11:517–553, 2010.
[7] Teemu Kinnunen, Joni-Kristian Kamarainen, Lasse Lensu, and Heikki Kalviainen. Unsupervised visual object categorisation via self-organisation. In ICPR, 2010.
[8] David G. Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.
[9] Lester Mackey. Deflation methods for sparse PCA. In NIPS, 2008.
[10] Nikhil Naikal, Allen Yang, and Shankar Sastry. Towards an efficient distributed object recognition system in wireless smart camera networks. In International Conference on Information Fusion, 2010.
[11] Nikhil Naikal, Allen Y. Yang, and S. Shankar Sastry. Informative feature selection for object recognition via sparse PCA. In ICCV, 2011.
[12] David Nistér and Henrik Stewénius. Scalable recognition with a vocabulary tree. In CVPR, 2006.
[13] Peter Richtárik, Martin Takáč, and Selin Damla Ahipasaoglu. Alternating maximization: unifying framework for 8 sparse PCA formulations and efficient parallel codes. arXiv:1212.4137, 2012.
[14] Josef Sivic, Bryan C. Russell, Andrew Zisserman, William T. Freeman, and Alexei A. Efros. Unsupervised discovery of visual object class hierarchies. In CVPR, 2008.
[15] Josef Sivic and Andrew Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
[16] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 2007.
[17] Youwei Zhang and Laurent El Ghaoui. Large-scale sparse principal component analysis with application to text data.