Image Data Compression for Covariance and Histogram Descriptors
Matt J. Kusner, Nicholas I. Kolkin, Stephen Tyree, Kilian Q. Weinberger
Matt J. Kusner, Nicholas I. Kolkin (Washington University in St. Louis) {mkusner,n.kolkin}@wustl.edu
Stephen Tyree (NVIDIA Research) [email protected]
Kilian Q. Weinberger (Washington University in St. Louis) [email protected]
Abstract
Covariance and histogram image descriptors provide an effective way to capture information about images. Both excel when used in combination with special purpose distance metrics. For covariance descriptors these metrics measure the distance along the non-Euclidean Riemannian manifold of symmetric positive definite matrices. For histogram descriptors the Earth Mover's distance measures the optimal transport between two histograms. Although more precise, these distance metrics are very expensive to compute, making them impractical in many applications, even for data sets of only a few thousand examples. In this paper we present two methods to compress the size of covariance and histogram datasets with only marginal increases in test error for k-nearest neighbor classification. Specifically, we show that we can reduce data sets to a small fraction, and in some cases only a few percent, of their original size, while approximately matching the test error of kNN classification on the full training set. In fact, because the compressed set is learned in a supervised fashion, it sometimes even outperforms the full data set, while requiring only a fraction of the space and drastically reducing test-time computation.
1. Introduction
In the absence of sufficient data to learn image descriptors directly, two of the most influential classes of feature descriptors are (i) the histogram and (ii) the covariance descriptor. Histogram descriptors are ubiquitous in computer vision [2, 10, 27, 32, 33]. These descriptors may be designed to capture the distribution of image gradients throughout an image, or they may result from a visual bag-of-words representation [11, 12, 29]. Covariance descriptors, and more generally, symmetric positive definite (SPD) matrices, are often used to describe structure tensors [15], diffusion tensors [37] or region covariances [45]. The latter are particularly well suited for the task of object detection from a variety of viewpoints and illuminations.

In this paper, we focus on the k-nearest neighbor (kNN) classifier with histogram or covariance image descriptors. Computing a nearest neighbor or simply comparing a pair of histograms or SPD matrices is non-trivial. For histogram descriptors, certain bins may be individually similar/dissimilar to other bins. Therefore, the Euclidean distance is often a poor measure of distance as it cannot capture such bin-wise dissimilarity. In the case of covariance descriptors, SPD matrices lie on a convex half-cone, a non-Euclidean Riemannian manifold embedded inside a Euclidean space. Measuring distances between SPD matrices with the straightforward Euclidean metric ignores the underlying manifold structure of the data and tends to systematically under-perform in classification tasks [47].

Histogram and covariance descriptors excel if their underlying structure is incorporated into the distance metric. Recently, there have been a number of proposed histogram distances [34, 36, 38, 43]. Although these yield strong improvements in kNN classification accuracy (versus the Euclidean distance), these distances are often very costly to compute (e.g., super-cubic in the histogram dimensionality). Similarly, for SPD matrices there are specialized geodesic distances [6] and algorithms [22, 45] developed that operate on the SPD covariance manifold. Because of the SPD constraint, these methods often require significantly more time to make predictions on test data. This is especially true for kNN, as the computation of the geodesic distance along the SPD manifold requires an eigen-decomposition for each individual pairwise distance, a computation that needs to be repeated for all training inputs to classify a single test input.

Cherian et al. [6] improved the running time (in practice) of test classification by approximating the Riemannian distance with a symmetrized log-determinant divergence. For low dimensional data, Bregman Ball Trees [4] can be adapted, however the performance deteriorates quickly as the dimensionality increases.

In this paper, we develop a novel technique to speed up k-nearest neighbor applications on covariance and histogram image features, one that can be used in concert with many other speedup methods. Our methods, called Stochastic Covariance Compression (SCC) and
Stochastic Histogram Compression (SHC), learn a compressed training set of size m, such that m is much smaller than n, which approximately matches the performance of the kNN classifier on the original data. This new data set does not consist of original training samples; instead it contains new, artificially generated inputs which are explicitly designed for low kNN error on the training data. The original training set can be discarded after training and during test-time, as we only find the k-nearest neighbors among these artificial samples. This drastically reduces computation time and shrinks storage requirements. To facilitate learning compressed data sets we borrow the concept of stochastic neighborhoods, used in data visualization [21, 46] and metric learning [16], and leverage recent results from the machine learning community on data compression in Euclidean spaces [24].

We make three novel contributions: 1. we derive SCC and SHC, two new methods for compression of covariance and histogram data; 2. we devise efficient methods for solving the SCC and SHC optimizations using the Cholesky decomposition and a normalized change of variable; 3. we carefully evaluate both methods on several real world data sets and compare against an extensive set of state-of-the-art baselines. Our experiments show that SCC can often compress a covariance data set to a small fraction of its original size without an increase in kNN test error. In some cases, SHC and SCC can match the kNN test error with only a few percent of the training set size, leading to order-of-magnitude speedups during test time. Finally, because we learn the compressed set explicitly to minimize kNN error, in a few cases it even outperforms the full data set by achieving lower test error.
2. Covariance and Histogram Descriptors
We assume that we are given a set of d-dimensional feature vectors $\mathcal{F} = \{x_1, \ldots, x_{|\mathcal{F}|}\} \subset \mathbb{R}^d$ computed from a single input image. From these features, we compute covariance or histogram image descriptors.

Covariance descriptors represent $\mathcal{F}$ through the covariance matrix of the features,
$X = \frac{1}{|\mathcal{F}|-1} \sum_{r=1}^{|\mathcal{F}|} (x_r - \mu)(x_r - \mu)^\top$, where $\mu = \frac{1}{|\mathcal{F}|} \sum_{r=1}^{|\mathcal{F}|} x_r$.
For vectorial data, 'nearness' is often computed via the Euclidean distance or a learned Mahalanobis metric [48]. However, the Euclidean/Mahalanobis distance between two covariance matrices is a poor approximation to their true distance along the manifold of SPD matrices. A natural distance for covariance matrices is the Affine-Invariant Riemannian metric [6], a geodesic distance on the SPD manifold.

Definition 1.
Let $\mathcal{S}^d_+$ be the positive definite cone of matrices of rank d. The Affine-Invariant Riemannian metric (AIRM) between any two matrices $X, \hat{X} \in \mathcal{S}^d_+$ is $D_R(X, \hat{X}) = \|\log(\hat{X}^{-1/2} X \hat{X}^{-1/2})\|_F$.

While the AIRM accurately describes the dissimilarity between two covariances along the SPD manifold, it requires an eigenvalue decomposition for every input $\hat{X}$. The metric becomes intractable to compute even for moderately-sized covariance matrices (e.g., computing $\hat{X}^{-1/2} \in \mathcal{S}^d_+$ requires roughly $O(d^3)$ time). To alleviate this computational burden, a distance metric with similar theoretical properties has been proposed by [6], called the Jensen-Bregman LogDet Divergence (JBLD),

$D_J(X, \hat{X}) = \log\left|\frac{X + \hat{X}}{2}\right| - \frac{1}{2}\log|X\hat{X}|$.   (1)
Cherian et al. [6] demonstrate that for nearest neighbor classification, using JBLD as a distance has performance nearly identical to the AIRM but is much faster in practice and asymptotically requires only $O(d^\omega)$ computation, where $\omega < 2.4$ is the exponent of fast matrix multiplication [7].

Histogram descriptors are a popular alternative to covariance representations. Assume we again have a set of d-dimensional features for an image, $\mathcal{F} = \{x_1, \ldots, x_{|\mathcal{F}|}\}$. Further, let the collection of all such features for all n images in a training set be $F = \mathcal{F}_1 \cup \ldots \cup \mathcal{F}_n$ (where $\mathcal{F}_i$ are the features for image i). To construct the visual bag-of-words representation we cluster all features in $F$ into K centroids $c_1, \ldots, c_K$ (e.g., via k-means), where these centroids are often referred to as a codebook [12]. Using this codebook the visual bag-of-words representation $h_i$ of an image i is a K-dimensional vector, where element $h_{ij}$ is a count of how many features in the bag $\mathcal{F}_i$ have $c_j$ as the nearest centroid.

Arguably one of the most successful histogram distances is the Earth Mover's Distance (EMD) [39], which has been used to achieve impressive results for image classification and retrieval [30, 31, 35, 39]. EMD constructs a distance between two histograms by 'lifting' a bin-to-bin distance, called the ground distance $M$, where $M_{ij} \geq 0$, to a full histogram distance. Specifically, for two histogram vectors $h$ and $h'$ the EMD distance is the solution to the following linear program:

$\min_{T \geq 0} \operatorname{tr}(TM) \quad \text{s.t.} \quad T\mathbf{1} = h \ \text{ and } \ T^\top\mathbf{1} = h'$,   (2)

where $T$ is the transportation matrix and $\mathbf{1}$ is a vector of ones. Each element $T_{ij}$ describes the amount of mass moved from $h_i$ to $h'_j$ for the vectors to match exactly. One example ground distance for the visual bag-of-words representation is the Euclidean distance between the centroid vectors, $M_{ij} \triangleq \|c_i - c_j\|_2$. When the ground distance is a metric, it can be shown that the EMD is also a metric [39].

In practice, one limitation of the EMD distance is its high computational complexity. Cuturi et al. [8] therefore introduce the Sinkhorn Distance, which involves a regularized version of the EMD optimization problem:

$\min_{T \geq 0} \operatorname{tr}(TM) - \frac{1}{\lambda} h(T) \quad \text{s.t.} \quad T\mathbf{1} = h \ \text{ and } \ T^\top\mathbf{1} = h'$,   (3)

where $h(T) = -\sum_{k,l} T_{kl}\log(T_{kl})$ is the entropy of the transport $T$. The Sinkhorn distance between $h$ and $h'$ is $D_S(h, h') = \operatorname{tr}(T_\lambda M)$, where $T_\lambda$ is the solution to (3). The solution is an arbitrarily close upper bound to the exact EMD solution (tightened by increasing $\lambda$) and the optimization problem is shown to be at least an order of magnitude faster to compute than the EMD linear program (2). Specifically, Cuturi et al. introduce a simple iterative algorithm to solve eq. (3) in time $O(d^2 i)$, where d is the size of the histograms and i is the number of iterations of the algorithm. This compares favorably to the $O(d^3 \log d)$ complexity of the EMD optimization problem [35]. In practice, each algorithm iteration is a matrix scaling computation that can be performed between multiple histograms simultaneously. This means that the algorithm is parallel and can be efficiently computed on modern hardware architectures (i.e., multi-core CPUs and GPUs) [8].
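For illustration, the following is a minimal NumPy sketch of the building blocks above: a region-covariance descriptor, the JBLD divergence of eq. (1), and the Sinkhorn matrix-scaling iteration behind eq. (3). The function names, the fixed iteration count, and the regularization value are our own illustrative choices and are not taken from the paper (whose implementation is in Matlab).

```python
# Minimal sketch (not the paper's implementation): region covariance
# descriptor, JBLD divergence (eq. (1)), and Sinkhorn distance (eq. (3)).
import numpy as np

def region_covariance(F):
    """F: (|F|, d) array of per-pixel features; returns the d x d descriptor."""
    return np.cov(F, rowvar=False)

def jbld(X, Y):
    """Jensen-Bregman LogDet divergence between SPD matrices X and Y."""
    # log|(X + Y)/2| - 0.5 * log|X Y|, via slogdet for numerical stability.
    _, ld_mean = np.linalg.slogdet((X + Y) / 2.0)
    _, ld_x = np.linalg.slogdet(X)
    _, ld_y = np.linalg.slogdet(Y)
    return ld_mean - 0.5 * (ld_x + ld_y)

def sinkhorn(h1, h2, M, lam=50.0, n_iter=100):
    """Entropy-regularized transport cost between histograms h1 and h2.

    M is the ground-distance matrix between bins; lam and n_iter are
    illustrative hyper-parameter choices.  Returns (cost, transport plan).
    """
    K = np.exp(-lam * M)                  # element-wise kernel
    u = np.ones_like(h1, dtype=float)
    for _ in range(n_iter):               # Sinkhorn matrix-scaling updates
        v = h2 / (K.T @ u)
        u = h1 / (K @ v)
    T = np.diag(u) @ K @ np.diag(v)       # approximately optimal transport plan
    return np.sum(T * M), T
```

Since the ground distance is symmetric for any metric $M$, the returned cost $\sum_{k,l} T_{kl} M_{kl}$ coincides with the $\operatorname{tr}(TM)$ objective of eq. (3).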
3. Covariance compression
In this section we detail our covariance compression technique:
Stochastic Covariance Compression (SCC). SCC uses a stochastic neighborhood to compress the training set from n input covariances to m 'compressed' covariances. After learning, the original training set can be discarded and all future classifications are made using only the compressed inputs. Since $m \ll n$, the complexity of test-time classification is drastically reduced, from $O(nd^\omega)$ to $O(md^\omega)$, where $O(d^\omega)$ is the asymptotic complexity of computing a single JBLD distance in eq. (1).

Assume we are given a training set of n covariance matrices $\{X_1, \ldots, X_n\} \subset \mathbb{R}^{d \times d}$ with corresponding labels $y_1, \ldots, y_n$. Our goal is to learn a compressed set of m covariance matrices $\{\hat{X}_1, \ldots, \hat{X}_m\} \subset \mathbb{R}^{d \times d}$ with labels $\hat{y}_1, \ldots, \hat{y}_m$. To initialize $\hat{X}_j$, we randomly sample m covariance matrices from our training data set and copy their associated labels for each $\hat{y}_j$. We optimize these synthetic inputs $\hat{X}_j$ to minimize the kNN classification error. The kNN classification error is non-continuous and non-differentiable with respect to $\hat{X}_j$, but we can introduce a stochastic neighborhood, as proposed by Hinton and Roweis [21], to "soften" the neighborhood assignment and allow optimization of the kNN error. Specifically, we place a radial basis function around each input $X_i$ and proceed as if the nearest prototypes $\hat{X}_j$ are assigned randomly. For a given $X_i$, the probability that $\hat{X}_j$ is picked as the nearest neighbor is denoted

$p_{ij} = \frac{e^{-\gamma D_J(X_i, \hat{X}_j)}}{\sum_{k=1}^m e^{-\gamma D_J(X_i, \hat{X}_k)}} = \frac{1}{\Omega_i} e^{-\gamma D_J(X_i, \hat{X}_j)}$,   (4)

where $D_J(X_i, \hat{X}_j)$ is the JBLD divergence in eq. (1) and $\Omega_i$ denotes the normalization term. The constant $\gamma > 0$ is a hyper-parameter defining the "sharpness" of the neighborhood distribution. (We set $\gamma$ by cross-validation.)

Objective.
Inspired by Neighborhood Components Analysis [16], we can compute the probability $p_i$ that an input $X_i$ will be classified correctly by this stochastic nearest neighbor classifier under the compressed set $\{\hat{X}_1, \ldots, \hat{X}_m\}$,

$p_i = \sum_{j: \hat{y}_j = y_i} p_{ij}$.   (5)

Ideally, $p_i = 1$ for all $X_i$, implying the compressed set yields perfect predictions on the training set. The KL-divergence between this ideal "1-distribution" and $p_i$ is simply $KL(1\,\|\,p_i) = -\log(p_i)$. Our objective is to minimize the sum of these KL-divergences with respect to our compressed set of covariance matrices $\{\hat{X}_1, \ldots, \hat{X}_m\}$,

$\min_{\{\hat{X}_1, \ldots, \hat{X}_m\}} \ -\sum_{i=1}^n \log(p_i)$.   (6)

Gradient.
To ensure that the learned matrices $\hat{X}_j$ are SPD, we decompose each matrix $\hat{X}_j$ by its unique Cholesky decomposition: $\hat{X}_j = B_j^\top B_j$, where $B_j$ is an upper triangular matrix. To ensure that $\hat{X}_j$ remains SPD we perform gradient descent w.r.t. $B_j$. The gradient of $\mathcal{L}$ w.r.t. $B_j$ is

$\frac{\partial \mathcal{L}}{\partial B_j} = \sum_{i=1}^n \frac{p_{ij}}{p_i} (\delta_{y_i \hat{y}_j} - p_i)\, \gamma\, \frac{\partial D_J(X_i, B_j^\top B_j)}{\partial B_j}$,   (7)

where $\delta_{y_i \hat{y}_j} = 1$ if $y_i = \hat{y}_j$ and is 0 otherwise, and $D_J(X_i, B_j^\top B_j)$ is the JBLD divergence between $X_i$ and $B_j^\top B_j = \hat{X}_j$. The gradient of the JBLD w.r.t. $B_j$ is:

$\frac{\partial D_J(X_i, B_j^\top B_j)}{\partial B_j} = 2 B_j (X_i + B_j^\top B_j)^{-1} - (B_j^\top)^{-1}$.   (8)

We substitute (8) into (7) to obtain the final gradient. For a single compressed input $\hat{X}_j = B_j^\top B_j$, each step of gradient descent requires $O(d^3)$ to compute $\frac{\partial D_J(X_i, B_j^\top B_j)}{\partial B_j}$ and $O(d^\omega)$ to compute $D_J(X_i, \hat{X}_j)$. It requires $O(d^\omega m)$ to compute $p_{ij}$ and an additional $O(m)$ for $p_i$. Thus the overall complexity of computing the gradient with respect to the compressed set is $O(d^3 m n)$. We minimize our objective in eq. (6) via conjugate gradient descent (using the minimize.m routine, http://tinyurl.com/minimize-m). A Matlab implementation of SCC is available at: http://anonymized.
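Below is a small NumPy sketch of one SCC objective/gradient evaluation under the Cholesky parametrization, reusing the jbld() helper from the sketch in Section 2. The array layout, the numerical-stability shift, and the function name are illustrative choices of ours; they are not the paper's Matlab implementation.

```python
import numpy as np

def scc_objective_and_grad(X, y, B, y_hat, gamma):
    """X: (n, d, d) training SPD matrices, y: (n,) labels,
    B: (m, d, d) upper-triangular Cholesky factors of the compressed set,
    y_hat: (m,) compressed labels.  Returns (loss, dL/dB) per eqs. (4)-(8)."""
    n, m = len(X), len(B)
    Xhat = np.stack([Bj.T @ Bj for Bj in B])          # compressed SPD matrices
    D = np.array([[jbld(X[i], Xhat[j]) for j in range(m)] for i in range(n)])
    # stochastic neighborhood, eq. (4); subtract the row minimum for stability
    P = np.exp(-gamma * (D - D.min(axis=1, keepdims=True)))
    P /= P.sum(axis=1, keepdims=True)
    same = (y[:, None] == y_hat[None, :])             # delta_{y_i, yhat_j}
    p = np.maximum((P * same).sum(axis=1), 1e-12)     # eq. (5)
    loss = -np.log(p).sum()                           # eq. (6)

    grad = np.zeros_like(B)
    for j in range(m):
        coeff = (P[:, j] / p) * (same[:, j] - p) * gamma   # weights from eq. (7)
        Binv_T = np.linalg.inv(B[j].T)                # (B_j^T)^{-1}
        for i in range(n):
            dD = 2.0 * B[j] @ np.linalg.inv(X[i] + Xhat[j]) - Binv_T   # eq. (8)
            grad[j] += coeff[i] * np.triu(dD)         # B_j stays upper triangular
    return loss, grad
```

This loss/gradient pair can then be handed to any conjugate gradient routine; after convergence the compressed covariances are re-assembled as $\hat{X}_j = B_j^\top B_j$.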
4. Histogram compression
Analogous to covariance compression, we can also compress histogram descriptors, which we refer to as
Stochastic Histogram Compression (SHC). Our aim is to learn a compressed set of $m \ll n$ histograms $\{\hat{h}_1, \ldots, \hat{h}_m\} \subset \Sigma^d$ with labels $\hat{y}_1, \ldots, \hat{y}_m$ from a training set of histograms $\{h_1, \ldots, h_n\} \subset \Sigma^d$ with labels $y_1, \ldots, y_n$, where $\Sigma^d$ is the $(d-1)$-dimensional simplex.

Objective.
As before we place a stochastic neighborhood distribution over compressed histograms and define the probability that $\hat{h}_j$ is the nearest neighbor of $h_i$ via

$p_{ij} = \frac{1}{\Omega_i} e^{-\gamma D_S(h_i, \hat{h}_j)}$,   (9)

where $D_S$ is the Sinkhorn distance and $\Omega_i$ normalizes $p_{ij}$ so that it is a valid probability. As in SCC, we define the probability that a training histogram $h_i$ is predicted correctly as $p_i$ via eq. (5), by summing over the compressed inputs $\hat{h}_k$ that share the same label. We then minimize the KL-divergence between the perfect distribution and $p_i$ as in eq. (6) to learn our set of compressed histograms $\hat{h}_j$.

Gradient.
As in the covariance setting, the gradient of the objective in eq. (6) w.r.t. a compressed histogram $\hat{h}_j$ is

$\frac{\partial \mathcal{L}}{\partial \hat{h}_j} = \sum_{i=1}^n \frac{p_{ij}}{p_i} (\delta_{y_i \hat{y}_j} - p_i)\, \gamma\, \frac{\partial D_S(h_i, \hat{h}_j)}{\partial \hat{h}_j}$.   (10)

The gradient of the Sinkhorn distance $D_S$ w.r.t. $\hat{h}_j$ introduces two challenges: 1. the distance $D_S$ itself is a nested optimization problem; and 2. the learned vector $\hat{h}_j$ must remain a well-defined histogram throughout the optimization, i.e., it must be non-negative and sum to 1, so that $\hat{h}_j \in \Sigma^d$.

We first address the gradient of the nested optimization problem w.r.t. $\hat{h}_j$, i.e., $\frac{\partial D_S(h_i, \hat{h}_j)}{\partial \hat{h}_j}$. In the primal formulation, as stated in eq. (3), the histogram $\hat{h}_j$ occurs within the constraints, which complicates the gradient computation. Instead, we form the dual [8],

$\max_{\alpha, \beta \in \mathbb{R}^d} \ \alpha^\top h_i + \beta^\top \hat{h}_j - \sum_{k,l=1}^d \frac{e^{-\lambda(M_{kl} - \alpha_k - \beta_l)}}{\lambda}$,   (11)

where $\alpha, \beta$ are the corresponding dual variables. Due to strong duality, the primal and dual formulations are identical at the optimum, however the dual formulation (11) is unconstrained. The gradient of the dual objective (11) is linear w.r.t. $\hat{h}_j$. If we consider $\beta$ fixed, it follows that at the optimum $\frac{\partial D_S(h_i, \hat{h}_j)}{\partial \hat{h}_j} \approx \beta^*$, where $\beta^*$ is the optimal value of $\beta$. This optimal dual variable is easily computed with the iterative Sinkhorn algorithm [9] mentioned in Section 2. This approximation ignores that $\beta^*$ itself is a function of $\hat{h}_j$, which is a reasonable approximation for small step-sizes.

To address the second problem and simultaneously perform (approximated) gradient descent while ensuring that $\hat{h}_j$ always lies in the simplex (i.e., is normalized), we propose a change of variable in which we redefine each compressed histogram $\hat{h}_j$ as a positive, normalized quantity: $\hat{h}_j = e^{w_j} / \sum_{k=1}^d e^{w_{jk}}$ for $w_j \in \mathbb{R}^d$. Then the gradient of the Sinkhorn distance can be taken with respect to $w_j$,

$\frac{\partial D_S(h_i, \hat{h}_j)}{\partial w_j} \approx \beta^* \circ \left(\frac{s\, e^{w_j} - e^{2 w_j}}{s^2}\right)$,

where $s$ is the normalizing term $s = \sum_{k=1}^d e^{w_{jk}}$ and $\circ$ is the Hadamard (element-wise) product. The complexity of computing the full SHC gradient in eq. (10) is $O(nmd^2\hat{i})$: each Sinkhorn gradient above requires time $O(d^2\hat{i})$ (where $\hat{i}$ is the number of Sinkhorn iterations), and $p_{ij}$ requires time $O(\hat{i}d^2 m)$. We use gradient descent with an updating learning rate to learn $\hat{h}_j$, selecting the compressed set that yields the best training error across all iterations.
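A compact NumPy sketch of this Sinkhorn-based gradient for a single pair is given below, mirroring the scaling loop of the Section 2 sketch. Recovering $\beta^*$ from the scaling vector as $\log(v)/\lambda$ is a standard identity for the entropic dual, but the exact recovery and the hyper-parameter values are assumptions on our part, not statements from the paper.

```python
import numpy as np

def shc_sinkhorn_grad(h_i, w_j, M, lam=50.0, n_iter=100):
    """Approximate gradient of D_S(h_i, h_hat_j) w.r.t. w_j, where
    h_hat_j = exp(w_j) / sum(exp(w_j)) keeps the histogram on the simplex."""
    s = np.exp(w_j).sum()
    h_hat = np.exp(w_j) / s
    # Sinkhorn scaling; v is the scaling vector attached to the h_hat marginal.
    K = np.exp(-lam * M)
    u = np.ones_like(h_i, dtype=float)
    for _ in range(n_iter):
        v = h_hat / (K.T @ u)
        u = h_i / (K @ v)
    beta_star = np.log(v) / lam           # dual variable of eq. (11) (assumed recovery)
    # chain rule through the soft-max re-parametrization (element-wise terms);
    # note (s*exp(w) - exp(2w)) / s^2 equals h_hat * (1 - h_hat) element-wise.
    dh_dw = (s * np.exp(w_j) - np.exp(2.0 * w_j)) / s**2
    return beta_star * dh_dw              # approximates dD_S / dw_j
```

These per-pair gradients are then combined with the stochastic-neighborhood weights of eq. (10), exactly as in the covariance case.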
5. Results
We evaluate both algorithms on a series of real-world data sets and compare with state-of-the-art algorithms for kNN compression.
Datasets.
We evaluate our covariance compression method on six benchmark data sets. The ETH80 dataset has images of 8 object categories, each pictured against a solid blue background. For each category there are 10 exemplar objects and for each exemplar the camera is placed in a number of different positions. We use the covariance descriptors of [5], who segment the images and use per-pixel color and texture features to construct covariances. The ETHZ dataset is a low-resolution set of images from surveillance cameras of varying sizes. The original task is to identify the person in a given image from a set of different individuals. The original dataset has many classes with only a few images each. Therefore, to better demonstrate a wide range of compression ratios, we filter the dataset to include only the most popular classes, so that each remaining individual has a large number of images. We use the pixel-wise features of [5] to construct covariance matrices. The FERET face recognition dataset has gray-scale images of the faces of many individuals, oriented at various angles. As the majority of individuals have only a few images in the training set, we also limit the dataset to the most popular individuals and use a larger set of compression ratios (described further in the error analysis subsection). We use the Gabor-filter covariances of [6]. Our version of the RGBD Object dataset [25] contains point cloud frames of objects from three different views. The task is to classify an object as one of 51 object categories. We use the covariance features of [5], which consist of intensity and depth-map gradients, as well as surface normal information. The SCENE15 dataset consists of black and white images of 15 different indoor and outdoor scenes. We split the dataset into training and test sets as per [28]. To create covariance features we compute a dense set of SIFT descriptors centered at each pixel in the image; the SIFT features use a grid of spatial bins in the horizontal and vertical directions plus orientation bins, and are pooled into a single covariance descriptor per image. Via the work of [19] we can learn a rank-r projection matrix $U \in \mathbb{R}^{d \times r}$, where $r \ll d$, to reduce the size of these covariance matrices to $r \times r$ via the transformation $X_i \rightarrow U^\top X_i U$. We use Bayesian optimization [13] (https://bitbucket.org/mlcircus/bayesopt.m) to select values for the covariance size r, as well as two hyperparameters in [19], $\nu_v$ and $\nu_w$, by minimizing the 1-NN error on a small validation set. The KTH-TIPS2b dataset is a material classification dataset of 11 materials. Each material has 4 different samples, each imaged at a number of fixed poses, scales, and lighting conditions. We follow the procedure of [18] to extract covariance descriptors using color information and Gabor filter responses.

Figure 1. Montages of the datasets used in our evaluations, from top down, left to right: (a) ETH80 objects at different orientations; (b) ETHZ person recognition; (c) RGBD objects from point clouds; (d) FERET face detection; (e) KTH-TIPS2b material categorization; (f) SCENE15 scene classification.

Experimental Setup.
We compare all methods against the test error of 1-nearest neighbor classification that uses the entire training set. For results that depend on random initialization or sampling we report the average and standard deviation across several random runs (save KTH-TIPS2b, for which we use 4 splits by holding out each of the four provided samples, one at a time). For datasets RGBD, ETHZ, and ETH80 we report results averaged over several different train/test splits.

Baselines.
We compare our method, Stochastic Covariance Compression (SCC), against a number of methods aimed at reducing the size of the training set, which we adapt for the covariance feature setting: 1. kNN using the full training set, 2. kNN using a class-based subsampled training set, which we use as initialization for SCC, 3. Condensed Nearest Neighbor (CNN) [20], 4. Reduced Nearest Neighbor (RNN) [14], 5. Random Mutation Hill Climbing (RMHC) [42], and 6. Fast CNN (FCNN) [1]. Both CNN and FCNN select subsets of the training set that have the same leave-one-out training error as the full training set, and are very well-known in the fast kNN literature. RNN works by post-processing the output of CNN to improve the training error and RMHC is a random subset selection method. (We give further details on these algorithms in Section 6.) For FCNN we must make a modification to accommodate covariance matrix features. Specifically, FCNN requires computing the centroid of each class at regular intervals during the selection. A centroid of class y is given by solving the following optimization,

$X_y = \arg\min_{X} \sum_{i: y_i = y} D_J(X, X_i)$,

where $D_J(X, X_i)$ is the JBLD divergence. Cherian et al. [6] give an efficient iterative procedure for solving the above optimization, which we use in our covariance FCNN implementation.
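For illustration, setting the gradient of this centroid objective to zero suggests the fixed-point update sketched below. Whether this matches the exact iterative procedure of Cherian et al. [6] is an assumption on our part, and the initialization and stopping rule are arbitrary choices.

```python
import numpy as np

def jbld_centroid(Xs, n_iter=50, tol=1e-6):
    """Approximate argmin_X sum_i D_J(X, X_i) for SPD matrices Xs: (n, d, d)."""
    X = Xs.mean(axis=0)                              # initialize with the arithmetic mean
    for _ in range(n_iter):
        # stationarity condition:  X^{-1} = (1/n) * sum_i ((X + X_i)/2)^{-1}
        avg_inv = np.mean([np.linalg.inv((X + Xi) / 2.0) for Xi in Xs], axis=0)
        X_new = np.linalg.inv(avg_inv)
        if np.linalg.norm(X_new - X) < tol * np.linalg.norm(X):
            return X_new
        X = X_new
    return X
```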
Classification error. Figure 2 compares the test error of SCC to the baselines for sizes of the compressed set equal to 2%, 4%, 8%, and 16% of the training set. For the FERET dataset, which has a large number of classes, we use larger compression ratios: 10%, 20%, 30%, and 40%. Although CNN, RNN, and FCNN only output a single reduced training set (the final point on each curve), we plot the intermediate test errors of each method at the above compression ratios as well. On each dataset SCC is able to reduce the test error to nearly that of kNN applied to the full dataset using only a small fraction of the training data. Only on ETHZ and RGBD could SCC not match the full kNN error up to significance, however the error rates are only marginally higher. For small compression ratios SCC is superior to all of the baselines, as well as the subsampling initialization. On datasets ETH80 and ETHZ the final outputs of CNN, RNN and FCNN are roughly equivalent to the SCC curve. However, one notable downside is that these algorithms have no control over the size of these final sets, which for FERET remain a large fraction of the training set for both CNN/RNN and FCNN. In contrast, SCC allows one to regulate the compressed set size precisely. RMHC is also able to regulate the compressed set size. However, because it is based on random sampling, its performance can be very poor, as on FERET. Surprisingly, for SCENE15 and KTH, learning a compressed covariance dataset with SCC reduces the test error below the kNN error of the full training set. We suspect that this occurs because (a) the training set may have some amount of label noise, which is partially alleviated by subsampling, and (b) SCC essentially learns a new, supervised covariance representation, versus a label-agnostic set of covariance descriptors. For these datasets, there is no reason not to shrink the dataset to a small fraction of its original size and discard the original data, yielding large speedups during test-time.

Figure 2. kNN test error rates after training set compression for covariance descriptors (SCC, subsampling, kNN, CNN [20], FCNN [1], RNN [14], RMHC [42]) on SCENE15, FERET, ETHZ, ETH80, KTH-TIPS2b, and RGBD. See text for details.

Test-time speedup.
Table 1 shows the speedup of SCC over kNN classification using the full training set for various compression ratios (the datasets marked by a (C) are learned with SCC). In general the speedups are roughly $1/\delta$, where $\delta$ denotes the compression ratio. Results that match or exceed the accuracy of the full training set (up to significance) are shown in blue. At the mildest compression ratio, several of the datasets match the test error of full kNN classification; in effect we have removed neighbor redundancies in the dataset and gained a substantial speedup. Much larger speedups can be obtained at the smaller compression ratios, although at a small increase in classification error. For the data set with many classes (FERET), "loss-free" compression can still yield a sizable speedup.

Training time.
Table 2 describes the average training times for SCC (again, (C) denotes SCC results). For maximum compression the training time is on the order of minutes. As the size of the compressed set gets larger the time increases, but only by small amounts; the longest training time, for RGBD, is on the order of hours. Furthermore, the entire compression can be done completely off-line prior to testing. The contributions of the training points to the gradient are independent and have a high computation to memory load ratio. The SCC training could therefore potentially be sped up significantly through parallelization on clusters or GPUs.

Table 1. Speed-up of kNN testing through SCC and SHC compression. The SCC datasets are denoted with a (C) and the SHC datasets with an (H). Results where SCC/SHC matches or exceeds the accuracy of full kNN (up to statistical significance) are in blue. Rows: ETH80 (C), ETHZ (C), RGBD (C), SCENE15 (C), KTH (C), COIL20 (H), and KYLBERG (H) at compression ratios 2%, 4%, 8%, and 16% (few classes); FERET (C) and MPEG7 at ratios 10%, 20%, 30%, and 40% (many classes).
We evaluate our technique for compressing histogram datasets, Stochastic Histogram Compression (SHC), against current baseline methods for constructing a reduced training set. As a benchmark, we compare the k-nearest neighbor accuracies for compressed sets of different sizes, and report the test-time speedups achieved by our method. We start by describing the datasets we use for comparison.
20 3D object recognition; (b)
KYLBERG textureclassification; (c) MPEG7 shape detection.Table 2. SCC and SHC training times. T RAINING T IMES D ATASET C OMPRESSION R ATIO (few classes) 2% 4% 8% 16%
ETH
80 (C) m s m s m s m s ETHZ (C) m s m s m s m s RGBD (C) m s m s h m h m SCENE
15 (C) m s m s m s m s KTH (C) m s m s m s m s COIL
20 (H) s m s m s m s KYLBERG (H) m s m s m s m sD ATASET C OMPRESSION R ATIO (many classes) 10% 20% 30% 40%
FERET (C) m s m s m s m s MPEG s m s m s m s Datasets.
The COIL20 dataset consists of 20 grayscale image objects with the background masked out in black. Each object was rotated through 360 degrees and an image was taken every 5 degrees, yielding 72 images per class. To construct histogram features we follow the procedure of [2] to extract shape context log-polar histograms from randomly sampled edge points, yielding histograms of dimensionality d = 60. As a ground distance M we use the distance between bins of the log-polar histogram. The MPEG7 dataset contains 70 different shape classes, each with 20 images. Each image has a black background with a solid white shape such as bat, cellular phone, fountain, and octopus, among others. We follow the same procedure as for COIL20 to construct shape context histograms for MPEG7, again using the distance between bins as the ground distance. The KYLBERG texture dataset contains images of a number of different surfaces. We used the version of the dataset without rotations, which contains the same number of images for each class. We follow the feature-extraction technique of [18], which uses first and second order image-gradient features at every 4 pixels, after resizing. We then construct a visual bag-of-words representation by first clustering all features into a set of codewords. We represent each image as a count vector with one dimension per codeword: the i-th entry corresponds to the number of times a gradient feature was closest (in the Euclidean sense) to the i-th codeword. As a ground distance between bins we use the Euclidean distance between each pair of codewords.

Experimental setup.
As for covariance features, our benchmark for comparison of all methods is the test error of 1-nearest neighbor classification with the full training set. Similarly, for each dataset we report results over several different train/test splits. For our algorithm, SHC, we use Bayesian optimization [13] to tune the $\gamma$ parameter in the definition of $p_{ij}$, eq. (9), as well as the initial gradient descent learning rate, to minimize the training error. Additionally, we initialize SHC with the results of RMHC, which in the covariance setting appears to largely outperform the subsampling approach. We use the exact same baselines as for covariance features, except now we use the Sinkhorn distance as our dissimilarity measure. The only subtlety is that FCNN needs to be able to compute the centroid of a set of histograms with respect to the Sinkhorn distance. The centroid of a set of histogram measures with respect to the EMD is called the Wasserstein barycenter [9]. It is shown how to compute this barycenter for the Sinkhorn distance in [8], and we use their accelerated gradient approach to solve for each Sinkhorn centroid.
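As an illustration of this centroid step, the sketch below uses an iterative Bregman projection scheme for entropy-regularized barycenters. This is a simpler alternative to the accelerated gradient approach of [8] that we actually use, so treat it (and its hyper-parameters) purely as an assumption-laden stand-in.

```python
import numpy as np

def sinkhorn_barycenter(Hs, M, lam=50.0, n_iter=200):
    """Approximate barycenter of the histograms in Hs (list of length-d vectors)
    under the entropy-regularized transport cost with ground metric M,
    via alternating KL (Bregman) projections."""
    K = np.exp(-lam * M)
    d = len(Hs[0])
    us = [np.ones(d) for _ in Hs]                  # scaling vectors, barycenter side
    a = np.full(d, 1.0 / d)                        # barycenter estimate
    for _ in range(n_iter):
        vs = [h / (K.T @ u) for h, u in zip(Hs, us)]        # match each input marginal
        # geometric mean of the barycenter-side marginals implied by each plan
        a = np.exp(np.mean([np.log(u * (K @ v)) for u, v in zip(us, vs)], axis=0))
        us = [a / (K @ v) for v in vs]             # match the common barycenter marginal
    return a / a.sum()
```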
Classification error.
Figure 4 shows the average test error and standard deviation for compression ratios of 2%, 4%, 8%, and 16% for COIL20 and KYLBERG, and 10%, 20%, 30%, and 40% for the many-class dataset MPEG7. Apart from full kNN classification (i.e., a compression ratio of 100%), SHC is able to achieve the lowest test error (possibly matched by other methods) throughout. On KYLBERG, SHC can reduce the training set to a small fraction of its size without an increase in test error. On COIL20 and MPEG7, matching the full kNN error requires comparatively large compression ratios, which lead to only very modest speedups. We did not evaluate SHC in these arguably least interesting settings, but nevertheless show the error rates of the baselines for completeness.

Test-time speedup.
The kNN test-time speedups of SHC over the full training set are shown in Table 1 (the (H) datasets). Similar to SCC, the speedups are largest at maximum compression and still reach an order of magnitude in the worst case, on the MPEG7 dataset. On KYLBERG the SHC error is lower than that of the full data set, even at the smallest compression ratio.

Figure 4. kNN test error rates after training set compression for histogram descriptors (SHC, subsampling, kNN, CNN [20], FCNN [1], RNN [14], RMHC [42]) on COIL20, MPEG7, and KYLBERG. See text for details.
Training time.
The training times of SHC are shown in Table 2. SHC is very fast (training takes only minutes at the highest compression setting on KYLBERG), especially considering that we are solving a nested optimization problem over the compressed histograms and the Sinkhorn distance. We believe that the speed of our implementation can be further improved with the use of GPUs for the Sinkhorn computation and with approximate second-order hill-climbing methods.
6. Related Work
Training set reduction has been considered in the context of kNN for vector data and the Euclidean distance with three primary methods: (a) training consistent sampling, (b) prototype generation, and (c) prototype positioning (for a survey see [44]). Training consistent sampling iteratively adds inputs from the training set to a reduced 'reference set' until the reference set perfectly classifies the training set. This is precisely the technique of Condensed Nearest Neighbors (CNN) [20]. There have been a number of extensions of CNN, notably Reduced Nearest Neighbor (RNN) [14], which searches for the smallest subset of the result of CNN that correctly classifies the training data. Additionally, Fast CNN (FCNN) [1] finds a set close to that of CNN but has training time linear (instead of cubic) in the size of the training set. Prototype generation creates new inputs to represent the training set, usually via clustering [23, 40]. Prototype positioning learns a reduced training set by optimizing an appropriate objective. The method most similar to SCC and SHC is the recently proposed Stochastic Neighbor Compression [24], which uses a stochastic neighborhood to learn prototypes in Euclidean space (and thus is unsuitable for covariance and histogram features). Finally, Bucilua et al. [3] may have been the first to study model compression for machine learning algorithms by compressing neural networks. To our knowledge, SCC and SHC are the first methods to explicitly consider training set reduction for SPD covariance and histogram descriptors.

Work towards speeding up test-time classification for kNN on covariance-valued data is somewhat limited. The JBLD divergence is proposed to speed up individual distance computations. Cherian et al. [6] show that it is possible to adapt Bregman Ball Trees (BBTs), a generalization of the Euclidean ball tree to Bregman divergences, to the JBLD divergence. This is done using a clever iterative K-means method followed by a leaf node projection technique onto relevant Bregman balls. Both of these techniques are complementary to our dataset compression method.

There has been a large amount of work devoted toward improving the complexity of the Earth Mover's distance using approximations [8, 17, 30, 31, 35, 41]. For instance, [35] point out that if an upper bound can be placed on the transport $T_{ij}$ between any two bins i and j, then the (thresholded) EMD can be solved much more efficiently. Ling and Okada [31] show that if the ground distance is the $\ell_1$ distance between bins, the EMD can be reformulated exactly as a tree-based optimization problem with d unknown variables (instead of $d^2$) and only d constraints (instead of $2d$). We use the Sinkhorn approximation [8], which has the added advantage of an unconstrained dual formulation.
7. Conclusion
In many classification settings the sheer number of distance computations has previously prohibited the use of kNN for covariance and histogram features. We have shown that these data sets can be compressed to a small fraction of their original sizes while often only slightly increasing the test error. This drastically speeds up nearest neighbor search and has the potential to unlock new applications for covariance and histogram features on large datasets.
References
[1] F. Angiulli. Fast condensed nearest neighbor rule. In ICML, pages 25–32, 2005.
[2] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. TPAMI, 24(4):509–522, 2002.
[3] C. Bucilua, R. Caruana, and A. Niculescu-Mizil. Model compression. In SIGKDD, pages 535–541. ACM, 2006.
[4] L. Cayton. Fast nearest neighbor retrieval for Bregman divergences. In ICML, pages 112–119. ACM, 2008.
[5] A. Cherian and S. Sra. Riemannian sparse coding for positive definite matrices. In ECCV, pages 299–314. Springer, 2014.
[6] A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Efficient similarity search for covariance matrices via the Jensen-Bregman LogDet divergence. In ICCV, pages 2399–2406, 2011.
[7] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. Journal of Symbolic Computation, 9(3):251–280, 1990.
[8] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pages 2292–2300, 2013.
[9] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In ICML, 2014.
[10] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886–893, 2005.
[11] L. Fei-Fei, R. Fergus, and A. Torralba. Recognizing and learning object categories. CVPR Short Course, 2, 2007.
[12] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, volume 2, pages 524–531, 2005.
[13] J. Gardner, M. Kusner, Z. Xu, K. Q. Weinberger, and J. Cunningham. Bayesian optimization with inequality constraints. In ICML, pages 937–945, 2014.
[14] G. Gates. The reduced nearest neighbor rule. IEEE Transactions on Information Theory, 18:431–433, 1972.
[15] A. Goh and R. Vidal. Clustering and dimensionality reduction on Riemannian manifolds. In CVPR, pages 1–7, 2008.
[16] J. Goldberger, G. Hinton, S. Roweis, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS, pages 513–520, 2004.
[17] K. Grauman and T. Darrell. Fast contour matching using approximate earth mover's distance. In CVPR, volume 1, pages I–220, 2004.
[18] M. Harandi, M. Salzmann, and F. Porikli. Bregman divergences for infinite dimensional covariance matrices. In CVPR, pages 1003–1010, 2014.
[19] M. T. Harandi, M. Salzmann, and R. Hartley. From manifold to manifold: geometry-aware dimensionality reduction for SPD matrices. In ECCV, pages 17–32, 2014.
[20] P. Hart. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 14:515–516, 1968.
[21] G. Hinton and S. Roweis. Stochastic neighbor embedding. In NIPS, pages 833–840, 2002.
[22] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi. Kernel methods on the Riemannian manifold of symmetric positive definite matrices. In CVPR, pages 73–80, 2013.
[23] T. Kohonen. Improved versions of learning vector quantization. In IJCNN, pages 545–550. IEEE, 1990.
[24] M. Kusner, S. Tyree, K. Q. Weinberger, and K. Agrawal. Stochastic neighbor compression. In ICML, pages 622–630, 2014.
[25] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view RGB-D object dataset. In ICRA, pages 1817–1824, 2011.
[26] K. I. Laws. Rapid texture identification. In Proc. SPIE, pages 376–381. International Society for Optics and Photonics, 1980.
[27] S. Lazebnik, C. Schmid, and J. Ponce. A sparse texture representation using local affine regions. TPAMI, pages 1265–1278, 2005.
[28] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.
[29] T. Leung and J. Malik. Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV, 43(1):29–44, 2001.
[30] E. Levina and P. Bickel. The earth mover's distance is the Mallows distance: Some insights from statistics. In ICCV, pages 251–256, 2001.
[31] H. Ling and K. Okada. An efficient earth mover's distance algorithm for robust histogram comparison. TPAMI, 29(5):840–853, 2007.
[32] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[33] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. TPAMI, 27(10):1615–1630, 2005.
[34] C. W. Niblack, R. Barber, W. Equitz, M. D. Flickner, E. H. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin. The QBIC project: querying images by content, using color, texture, and shape. In IS&T/SPIE Symposium on Electronic Imaging, pages 173–187, 1993.
[35] O. Pele and M. Werman. Fast and robust earth mover's distances. In ICCV, pages 460–467. IEEE, 2009.
[36] O. Pele and M. Werman. The quadratic-chi histogram distance family. In ECCV, pages 749–762. Springer, 2010.
[37] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing. IJCV, 66(1):41–66, 2006.
[38] Y. Rubner, J. Puzicha, C. Tomasi, and J. M. Buhmann. Empirical evaluation of dissimilarity measures for color and texture. Computer Vision and Image Understanding, 84(1):25–43, 2001.
[39] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000.
[40] S. Salzberg, A. Delcher, D. Heath, and S. Kasif. Best-case results for nearest-neighbor learning. PAMI, 17(6):599–608, 1995.
[41] S. Shirdhonkar and D. W. Jacobs. Approximate earth mover's distance in linear time. In CVPR, pages 1–8, 2008.
[42] D. B. Skalak. Prototype and feature selection by sampling and random mutation hill climbing algorithms. In ICML, pages 293–301, 1994.
[43] M. A. Stricker and M. Orengo. Similarity of color images. In IS&T/SPIE Symposium on Electronic Imaging, pages 381–392, 1995.
[44] G. T. Toussaint. Proximity graphs for nearest neighbor decision rules: recent progress. Interface, 34, 2002.
[45] O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast descriptor for detection and classification. In ECCV, pages 589–600. Springer, 2006.
[46] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.
[47] R. Vemulapalli and D. W. Jacobs. Riemannian metric learning for symmetric positive definite matrices. arXiv preprint arXiv:1501.02393, 2015.
[48] K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207–244, 2009.