Boosting k-NN for categorization of natural scenes
Paolo Piro · Richard Nock · Frank Nielsen · Michel Barlaud

P. Piro · M. Barlaud
University of Nice-Sophia Antipolis / CNRS, 2000 route des Lucioles, 06903 Sophia Antipolis Cedex, France
E-mail: [email protected]

R. Nock
CEREGMIA Department, University of Antilles-Guyane, Martinique, France
E-mail: [email protected]

F. Nielsen
Department of Fundamental Research, Sony Computer Science Laboratories, Inc., Tokyo, Japan
LIX Department, Ecole Polytechnique, Palaiseau, France
E-mail: [email protected]
Abstract
The k-nearest neighbors (k-NN) classification rule has proven extremely successful in countless computer vision applications. For example, image categorization often relies on uniform voting among the nearest prototypes in the space of descriptors. In spite of its good generalization properties and its natural extension to multi-class problems, the classic k-NN rule suffers from high variance when dealing with sparse prototype datasets in high dimensions. A few techniques have been proposed in order to improve k-NN classification, which rely on either deforming the nearest neighborhood relationship by learning a distance function or modifying the input space by means of subspace selection. In this paper, we propose a novel boosting algorithm, called UNN (Universal Nearest Neighbors), which induces a leveraged k-NN rule, thus generalizing the classic k-NN rule. Our approach consists in redefining the voting rule as a strong classifier that linearly combines predictions from the k closest prototypes. Therefore, the k nearest neighbor examples act as weak classifiers and their weights, called leveraging coefficients, are learned by UNN so as to minimize a surrogate risk, which upper bounds the empirical misclassification rate over the training data. A major feature of UNN is the ability to learn which prototypes are the most relevant for a given class, thus allowing for effective data reduction by filtering the training data. Experimental results on the synthetic two-class dataset of Ripley show that such a filtering strategy is able to reject "noisy" prototypes, and yields a classification error close to the optimal Bayes error. We carried out image categorization experiments on a database containing eight classes of natural scenes. We show that our method significantly outperforms the classic k-NN classification, while enabling a significant reduction of the computational cost by means of data filtering.
Keywords Boosting · k-nearest neighbors · Image categorization · Scene classification

1 Introduction

1.1 Image categorization

Image categorization consists in assigning images to predefined semantic categories, e.g., indoor vs outdoor, beaches vs mountains, churches vs towers. Generic categorization is distinct from object and scene recognition, which are classification tasks concerning particular instances of objects or scenes (e.g. Notre Dame Cathedral vs St. Peter's Basilica). It is also distinct from other related computer vision tasks, such as content-based image retrieval (which aims at finding images from a database that are semantically related or visually similar to a given query image) and object detection (which requires finding both the presence and the position of a target object in an image, e.g. person detection).

Automatic categorization of generic scenes is still a challenging task, due to the huge number of natural categories that should be considered in general. In addition, natural image categories may exhibit high intra-class variability (i.e., visually different images may belong to the same category) and low inter-class variability (i.e., distinct categories may contain visually similar images).

Classifying images requires a reliable description of the content relevant for an application (e.g., location and shape of specific objects, or overall scene appearance). Examples of suitable image descriptors for categorization purposes are Gist, i.e. global image features representing the overall scene (Oliva and Torralba, 2001), and SIFT descriptors, i.e. descriptors of local features extracted at salient patches (Lowe, 2004).

The Gist descriptor is based on the so-called "spatial envelope" (Oliva and Torralba, 2001), which is a very effective low-dimensional representation of the overall scene based on spectral information. Such a representation bypasses segmentation, extraction of keypoints and processing of individual objects and regions, thus enabling a compact global description of images. Gist descriptors have been successfully used for categorizing locations and environments, showing their ability to provide relevant priors for more specific tasks, like object recognition and detection (Rubin et al, 2003).

1.2 k-NN classification

Apart from the descriptors used to compactly represent images, most image categorization methods rely on supervised learning techniques for exploiting information about known samples when classifying an unlabeled sample. Among these techniques, k-NN classification has proven successful, thanks to its easy implementation and its good generalization properties (Shakhnarovich et al, 2006). Indeed, the k-NN rule does not require explicit construction of the feature space and is naturally adapted to multi-class problems. Moreover, from the theoretical point of view, k-NN classification provably tends to the Bayes optimal rule as the sample size increases. Although such advantages make k-NN classification very attractive to practitioners, it is an algorithmic challenge to speed up k-NN queries and to design schemes that scale up well with large-dimensional datasets (Shakhnarovich et al, 2006). Moreover, it is yet another challenge to reduce the misclassification rate of the k-NN rule, usually tackled by data reduction techniques (Hart, 1968).

In a number of works, the classification problem has been reduced to tracking ill-defined categories of neighbors, interpreted as "noisy" (Brighton and Mellish, 2002).
Most of these recent techniques are in fact partial solutions to a larger problem related to the nearest neighbors' error, which does not have to be the discrete prediction of labels, but rather a continuous estimation of class membership probabilities (Holmes and Adams, 2003). This problem has been reformulated by Marin et al (2009) as a strong advocacy for the formal transposition of boosting to nearest neighbors classification. Such a formalization is challenging, as nearest neighbors rules are indeed not induced, whereas all formal boosting algorithms induce so-called strong classifiers by combining weak classifiers (also induced, say by decision stumps).

A survey of the literature shows that at least four different categories of approaches have been proposed in order to improve k-NN classification:

– learning local or global adaptive distance metrics;
– embedding data in a feature space (kernel nearest neighbors);
– distance-weighted and difference-weighted nearest neighbors;
– boosting nearest neighbors.

The earliest approaches to generalizing the k-NN classification rule relied on learning an adaptive distance metric from training data. Refer to the seminal work of Fukunaga and Flick (1984), who presented an optimal global metric for k-NN. An analogous approach was later adopted by Hastie and Tibshirani (1996), who carried out linear discriminant analysis to adaptively deform the distance metric. More recently, Paredes (2006) has proposed a method for learning a weighted distance, where weights can be either global (i.e., only depending on classes and features) or local (i.e., depending on each individual prototype as well).

Other more recent techniques apply the nearest neighbors rule to data embedded in a high-dimensional feature space, following the kernel trick approach of support vector machines. For example, Yu et al (2002) have proposed a straightforward adaptation of the kernel mapping to the nearest neighbors rule, which yields a significant improvement in terms of classification accuracy. In the context of vision, a successful technique has been proposed by Zhang et al (2006), which involves a "refinement" step at classification time, without relying on explicitly learning the distance metric. This method trains a local support vector machine on the nearest neighbors of a given query, thus limiting the most expensive computations to a reduced subset of prototypes.

Another class of k-NN methods relies on weighting nearest neighbors votes based on their distances to the query sample (Dudani, 1976). Recently, Zuo et al (2008) have proposed a similar weighting approach, where the nearest neighbors are weighted based on their vector difference to the query. Such a difference-weight assignment is defined as a constrained optimization problem of sample reconstruction from its neighborhood. The same authors have proposed a kernel-based non-linear version of this algorithm as well.

Finally, only very few works have proposed the use of boosting techniques for k-NN classification. For instance, Amores et al (2006) use AdaBoost for learning a distance function to be used for k-NN search. On the other hand, García-Pedrajas and Ortiz-Boyer (2009) adopt the boosting approach in a non-conventional way: at each iteration a different k-NN classifier is trained over a modified input space. Namely, the authors propose two variants of the method, depending on the way the input space is modified.
Their first algorithm is based on optimal subspace selection, i.e., at each boosting iteration the most relevant subset of input data is computed. The second algorithm relies on modifying the input space by means of non-linear projections. But neither method is strictly an algorithm for inducing weak classifiers from the k-NN rule, thus not directly addressing the problem of boosting k-NN classifiers. Moreover, such approaches are computationally expensive, as they rely on a genetic algorithm and a neural network, respectively.

Conversely, we propose a complete solution to the problem of boosting k-NN classifiers in the general multi-class setting. Namely, we propose a novel boosting algorithm, called UNN, which induces a leveraged nearest neighbors rule that generalizes the uniform k-NN rule. Indeed, the voting rule is redefined as a strong classifier that linearly combines weak classifiers induced by the k-NN rule. Therefore, our approach does not need to learn a distance function, as it directly operates on top of k-nearest neighbors search. At the same time, it does not require an explicit computation of the feature space, thus preserving one of the main advantages of prototype-based methods. Our UNN boosting algorithm is an iterative procedure that learns the weights of the weak classifiers, called leveraging coefficients. We show that this algorithm converges to the global minimum of any chosen classification-calibrated surrogate (Bartlett et al, 2006). (A surrogate is a function which suitably upper-bounds another function, here the non-convex, non-differentiable empirical risk.) Hence, our framework handles the most popular losses in the machine learning literature: squared loss, exponential loss, logistic loss, etc. In particular, we prove a specific convergence rate for the exponential loss (reported in our experiments) far better than the general rate of Nock and Nielsen (2009). Another important characteristic of UNN is that it is able to discriminate the most relevant prototypes for a given class, thus allowing for significant data reduction while improving classification performance at the same time.

1.3 Overview of the paper

In the following sections we present our approach to k-NN boosting. Sections 2.1-2.3 present key definitions for k-NN boosting. These sections also describe how to replace the classic uniform k-NN rule by a leveraged k-NN rule. Leveraged k-NN classifiers are induced by the UNN algorithm, which is detailed in Sec. 2.4 for the case of the exponential risk. Sec. 2.5 presents the generic convergence theorem of UNN and the upper bound on performance for exponential risk minimization. Our experiments on both synthetic and image categorization datasets are reported in Sec. 3. Then, Sec. 4 discusses results and mentions future work. In order not to burden the body of the paper, the general form of the UNN algorithm and proof sketches of our theorems have been postponed to an appendix in Sec. 5.

2 Boosting k-NN

2.1 Definitions

We address multi-class, single-label image categorization. Hence, several categories of images are predefined, whereas each image is constrained to belong to a single category. The number of categories (or classes) may range from a few to hundreds, depending on the application. E.g., categorization with 67 indoor categories has been recently studied by Quattoni and Torralba (2009). We treat the multi-class problem as multiple binary classification problems, as is customary in machine learning.
I.e., for each class c, a query image is classified either to c or to c̄ (the complement class of c, which contains all classes but c) with a certain confidence (classification score). Then the label with the maximum score is assigned to the query. Images are represented by descriptors related to given local or global features. We refer to an image descriptor as an observation $o \in O$, which is a vector of n features and belongs to a domain O (e.g., $\mathbb{R}^n$ or $[0,1]^n$). A label is associated to each image descriptor according to a predefined set of C classes. Hence, an observation with the corresponding label leads to an example, which is the ordered pair $(o, y) \in O \times \mathbb{R}^C$, where y is termed the class vector that specifies the class memberships of o. In particular, the sign of $y_c$ gives the membership of example $(o, y)$ to class c, such that $y_c$ is negative iff the observation does not belong to class c, and positive otherwise. At the same time, the absolute value of $y_c$ may be interpreted as a relative confidence in the membership. Inspired by the multi-class boosting analysis of Zhu et al (2006), we constrain class vectors to be symmetric, that is:

$$\sum_{c=1}^{C} y_c = 0 . \qquad (1)$$

Hence, in the single-label framework, the class vector of an observation o belonging to class c̃ is defined as: $y_{\tilde c} = 1$, $y_{c \neq \tilde c} = -\frac{1}{C-1}$. This setting turns out to be necessary when treating multi-class classification as multiple binary classifications, as it balances negative and positive labels of a given example over all classes. We are given an input set of m examples $S = \{(o_i, y_i),\ i = 1, 2, \ldots, m\}$, arising from annotated images, which form the training set.

2.2 Boosting k-NN for minimization of surrogate risks

We aim at defining a one-versus-all classifier for each category, which is to be trained over the set of examples. This classifier is expected to correctly classify as many new observations as possible, i.e. to predict their true labels. Therefore, we aim at determining a classification rule h from the example dataset, which is able to minimize the classification error over all possible new observations. But since the underlying class probability densities are generally unknown and difficult to estimate, defining a classifier in the framework of supervised learning can be viewed as fitting a classification rule onto a training set S without overfitting. This corresponds to defining a classifier that correctly classifies most of the example data themselves, thus minimizing the classification error over the example dataset (empirical or true classification loss). Therefore, in the most basic framework of supervised classification, one wishes to train a classifier on S, i.e. build a function $h : O \rightarrow \mathbb{R}^C$, with the objective to minimize its empirical risk on S, defined as:

$$\varepsilon(h, S) \doteq \frac{1}{mC} \sum_{c=1}^{C} \sum_{i=1}^{m} \big[ \varrho(h, i, c) < 0 \big] , \qquad (2)$$

with [.] the indicator function (1 iff its argument is true, 0 otherwise), and:

$$\varrho(h, i, c) \doteq y_{ic}\, h_c(o_i) \qquad (3)$$

the edge of classifier h on example $(o_i, y_i)$ for class c. Taking the sign of $h_c$ in $\{-1, +1\}$ as its membership prediction for class c, one sees that when the edge is positive (resp. negative), the membership predicted by the classifier and the actual example's membership agree (resp. disagree). Therefore, (2) averages over all classes the number of mismatches for the membership predictions, thus measuring the goodness-of-fit of the classification rule on the training dataset.
Provided that the example dataset has good generalization properties with respect to the unknown distribution of possible observations, minimizing this empirical risk is expected to yield good accuracy when classifying unlabeled observations. Unfortunately, minimizing the empirical risk is mathematically not tractable, as it involves non-convex optimization. In order to bypass this cumbersome optimization challenge, the current trend of supervised learning (including boosting and support vector machines) has replaced the minimization of the empirical risk (2) by that of a so-called surrogate risk (Bartlett et al, 2006), to make the optimization problem amenable. In boosting, it amounts to summing (or averaging) over classes and examples a real-valued function called the surrogate loss, thus ending up with the following rewriting of (2):

$$\varepsilon_\psi(h, S) \doteq \frac{1}{mC} \sum_{c=1}^{C} \sum_{i=1}^{m} \psi(\varrho(h, i, c)) . \qquad (4)$$

Important choices available for ψ include:

$$\psi_{\mathrm{sqr}} \doteq (1 - x)^2 , \qquad (5)$$
$$\psi_{\exp} \doteq \exp(-x) , \qquad (6)$$
$$\psi_{\log} \doteq \log(1 + \exp(-x)) ; \qquad (7)$$

(5) is the squared loss (Bartlett et al, 2006), (6) is the exponential loss (Schapire and Singer, 1999), and (7) is the logistic loss (Bartlett et al, 2006).

Surrogates play a fundamental role in supervised learning. They are upper bounds of the empirical risk with desirable convexity properties. Their minimization remarkably impacts that of the empirical risk, thus making it possible to devise minimization algorithms with good generalization properties (Nock and Nielsen, 2009).

In this paper, we build on recent advances in boosting with surrogate risks to redefine the k-NN classification rule. In particular, we concentrate on the exponential risk and provide a novel algorithm that learns a leveraged k-NN classifier, while provably converging to the global optimum of a surrogate risk. Our algorithm, called UNN (Universal Nearest Neighbors), meets boosting-type convergence properties under two mild assumptions on the training set: a weak learning and a weak coverage property. In the Appendix, we also describe how the UNN algorithm generalizes to any surrogate loss, and provide the most general analysis.

2.3 Leveraged k-NN rule

In the following, we denote by $\mathrm{NN}_k(o_{i'})$ the set of the k nearest neighbors (with integer constant k > 0) of an example $(o_{i'}, y_{i'})$ in the set S, with respect to a non-negative real-valued "distance" function. This function is defined on the domain O and measures how much two observations differ from each other. This dissimilarity function thus may not necessarily satisfy the triangle inequality of metrics. (All experiments in this paper refer to nearest neighbors with respect to the Euclidean distance.) For the sake of readability, we let $i \sim_k i'$ denote an example $(o_i, y_i)$ that belongs to $\mathrm{NN}_k(o_{i'})$. This neighborhood relationship is intrinsically asymmetric, i.e., $i \sim_k i'$ does not necessarily imply $i' \sim_k i$. Indeed, a nearest neighbor of i' does not necessarily contain i' among its own nearest neighbors.

The k-nearest neighbors rule (k-NN) is the following multi-class classifier $h = \{h_c : c = 1, 2, \ldots, C\}$ (k appears in the summation indices):

$$h_c(o_{i'}) = \sum_{j \sim_k i'} [y_{jc} > 0] , \qquad (8)$$

where $h_c$ is the one-versus-all classifier for class c and the square brackets denote the indicator function. Hence, classic nearest neighbors classification is based on a majority vote among the k closest prototypes.
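As a concrete illustration of the definitions above, the following minimal Python sketch (ours, for illustration only; array and function names are not from any released code) builds the symmetric class vectors of Eq. (1) and evaluates the empirical risk (2) and the surrogate risks (4) from a hypothetical matrix of edges:

```python
import numpy as np

def symmetric_class_vectors(labels, num_classes):
    """Class vectors of Eq. (1): y_c = 1 for the true class, -1/(C-1) otherwise."""
    m = len(labels)
    Y = np.full((m, num_classes), -1.0 / (num_classes - 1))
    Y[np.arange(m), labels] = 1.0
    return Y  # each row sums to zero, as required by Eq. (1)

# Surrogate losses of Eqs. (5)-(7).
psi_sqr = lambda x: (1.0 - x) ** 2               # squared loss
psi_exp = lambda x: np.exp(-x)                   # exponential loss
psi_log = lambda x: np.log(1.0 + np.exp(-x))     # logistic loss

def empirical_risk(edges):
    """Eq. (2): fraction of negative edges rho(h, i, c) over all (example, class) pairs."""
    return (edges < 0).mean()

def surrogate_risk(edges, psi):
    """Eq. (4): the surrogate loss psi averaged over all (example, class) edges."""
    return psi(edges).mean()
```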
In this paper, we propose to weight the votes of the nearest neighbors by means of real coefficients, thus generalizing (8) to the following leveraged k-NN rule $h^\ell = \{h^\ell_c : c = 1, 2, \ldots, C\}$:

$$h^\ell_c(o_{i'}) = \sum_{j \sim_k i'} \alpha_{jc}\, y_{jc} , \qquad (9)$$

where $\alpha_{jc} \in \mathbb{R}$ is the leveraging coefficient of example j for class c, with $j = 1, 2, \ldots, m$ and $c = 1, 2, \ldots, C$. Hence, (9) linearly combines the class labels of the k nearest neighbors (defined in Sec. 2.1) with their leveraging coefficients.

The main contribution of our work is to define a general algorithm (UNN) for learning these leveraging coefficients from training data. This algorithm operates on top of classic k-NN methods, for it does not affect the nearest neighbors search when inducing the weak classifiers of (9). Indeed, it is independent of the way nearest neighbors are computed, unlike most of the approaches mentioned in Sec. 1.2, which rely on modifying the neighborhood relationship via metric distance deformations or kernel transformations. Our approach is thus fully compatible with any underlying (metric) distance and data structure for k-NN search, as well as with possible kernel transformations of the input space.

For a given training set S of m labeled examples, we define the k-NN edge matrix $R^{(c)} \in \mathbb{R}^{m \times m}$ for each class $c = 1, 2, \ldots, C$ (Nock and Nielsen, 2009):

$$r^{(c)}_{ij} \doteq \begin{cases} y_{ic}\, y_{jc} & \text{if } j \sim_k i \\ 0 & \text{otherwise} \end{cases} . \qquad (10)$$

The name of $R^{(c)}$ is justified by an immediate parallel with (3). Indeed, each example j serves as a classifier for each example i, predicting 0 if $j \notin \mathrm{NN}_k(o_i)$, and $y_{jc}$ otherwise, for the membership to class c. Hence, the j-th column of matrix $R^{(c)}$, $r^{(c)}_j$, which is different from 0 when choosing k > 0, collects all edges of "classifier" j for class c. Note that the non-zero entries of this column correspond to the so-called reciprocal nearest neighbors (Rk-NN) of j, i.e., those examples for which j is a neighbor (Fig. 1). It finally follows that the edge of the leveraged k-NN rule on example i for class c is:

$$\varrho(h^\ell, i, c) = \big(R^{(c)} \alpha^{(c)}\big)_i , \quad c = 1, 2, \ldots, C , \qquad (11)$$

where $\alpha^{(c)}$ collects all leveraging coefficients in vector form for class c: $\alpha^{(c)}_i \doteq \alpha_{ic}$, $i = 1, 2, \ldots, m$. The expression of the surrogate loss (4) can be written as follows after replacing the argument of ψ(·) in (4) by (11):

$$\varepsilon_\psi(h^\ell, S) \doteq \frac{1}{mC} \sum_{c=1}^{C} \sum_{i=1}^{m} \psi\Big(\sum_{j=1}^{m} r^{(c)}_{ij}\, \alpha_{jc}\Big) . \qquad (12)$$
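To make the leveraged rule (9) and the edge matrix (10) concrete, here is a minimal NumPy sketch (ours; it assumes a brute-force Euclidean nearest neighbors search and excludes each training example from its own neighborhood, a detail the text leaves implicit):

```python
import numpy as np

def knn_indices(X_train, query, k, exclude=None):
    """Indices of the k nearest training observations (Euclidean distance)."""
    d = np.linalg.norm(X_train - query, axis=1)
    if exclude is not None:
        d[exclude] = np.inf
    return np.argsort(d)[:k]

def leveraged_knn_scores(X_train, Y, alpha, query, k):
    """Leveraged k-NN rule of Eq. (9): h_c(query) = sum over neighbors j of alpha[j, c] * Y[j, c]."""
    nn = knn_indices(X_train, query, k)
    return (alpha[nn] * Y[nn]).sum(axis=0)   # one real-valued score per class

def edge_matrix(X_train, Y, k, c):
    """k-NN edge matrix R^(c) of Eq. (10): r_ij = Y[i, c] * Y[j, c] iff j ~_k i, else 0."""
    m = X_train.shape[0]
    R = np.zeros((m, m))
    for i in range(m):
        nn_i = knn_indices(X_train, X_train[i], k, exclude=i)
        R[i, nn_i] = Y[i, c] * Y[nn_i, c]
    return R
```

Setting alpha[j, c] = 1 whenever Y[j, c] > 0 (and 0 otherwise) recovers the uniform rule (8); UNN instead learns alpha from the training data, as described next.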
Fig. 1
A toy example of direct (left) and reciprocal (right) k-nearest neighbors (k = 1) of an example j. Squares and circles represent examples of the positive and negative classes. Each arrow connects an example to its 1-NN.

Therefore, fitting all the $\alpha_{jc}$'s so as to minimize the surrogate loss (12) is the main goal of our learning algorithm UNN for inducing the leveraged k-NN classifier $h^\ell$.

2.4 UNN: learning the $\alpha_{jc}$ of the leveraged k-NN classifier

We propose a novel classification algorithm which induces the leveraged nearest neighbors classifier $h^\ell$ (Eq. 9) in the multi-class one-versus-all framework. In this section, we explain UNN specialized for exponential risk minimization, with pseudo-code shown in Alg. 1. However, our analysis is much more general, as it involves the broad class of classification-calibrated surrogate risks (Bartlett et al, 2006), and is postponed to the Appendix in order not to burden the methodology. Like common boosting algorithms, UNN operates on a set of weights $w_i$ ($i = 1, 2, \ldots, m$) defined over the training data. Such weights are repeatedly updated to fit all leveraging coefficients $\alpha^{(c)}$ for class c ($c = 1, 2, \ldots, C$). At each iteration, the index to leverage, $j \in \{1, 2, \ldots, m\}$, is obtained by a call to a weak index chooser oracle WIC(., ., .), whose implementations [I.0.a] and [I.0.b] are detailed later on in this section.

The training phase is implemented in a one-versus-all fashion, i.e. C learning problems are solved independently, and for each class c the training examples are considered as belonging to either class c or the complement class c̄, i.e. any other class. Eventually, one leveraging coefficient ($\alpha_{jc}$) per class is learned for each weak classifier (indexed by j). In the Appendix, we show that Alg. 1 is a specialization of a very general classification algorithm, thus justifying the name "Universal Nearest Neighbors". In particular, Alg. 1 induces the leveraged k-NN classifier by minimizing the exponential surrogate risk (6), very much like regular boosting does for inducing a weighted voting rule over a set of weak classifiers.

The key observation when training weak classifiers with UNN is that, at each iteration, one single example (indexed by j) is considered as a prototype to be leveraged. Indeed, all the other training data are to be viewed as observations for which j may possibly vote. In particular, due to k-NN voting, j can be a classifier only for its reciprocal nearest neighbors (i.e., those data for which j itself is a neighbor, corresponding to non-zero entries of matrix (10) on column j). This leads to a remarkable simplification when computing $\delta_j$ in step [I.1] and updating the weights $w_i$ in step [I.2] (Eqs. 16, 17). Indeed, only the weights of the reciprocal nearest neighbors of j are involved in these computations, thus allowing us not to store the entire matrix $R^{(c)}$, $c = 1, 2, \ldots, C$. Note that the set of Rk-NN is split into two subsets, each containing examples that agree (resp. disagree) with the class membership of j, thus yielding the partial sums $w_j^+$ and $w_j^-$ of (15).

Note that when either $w_j^+$ or $w_j^-$ is zero, $\delta_j$ in (16) is not finite. There is however a simple alternative, inspired by Schapire and Singer (1999), which consists in smoothing out $\delta_j$ when necessary, thus guaranteeing its finiteness without impairing convergence. More precisely, we suggest the replacements:

$$w_j^+ \leftarrow w_j^+ + \frac{1}{m} , \qquad (13)$$
$$w_j^- \leftarrow w_j^- + \frac{1}{m} . \qquad (14)$$

Also note that step [I.0] relies on the oracle WIC(., ., .)
for selecting index j of the next weak classifier. We propose two alternative implementations of this oracle, as follows:

[I.0.a] a lazy approach: we set T = m and let j be chosen by WIC({1, 2, ..., m}, t, c) either (1) randomly, or (2) following the alphabetic order of classes;

[I.0.b] the boosting approach: we pick T ≥ m, and let j be chosen by WIC({1, 2, ..., m}, t, c) such that $\delta_j$ is large enough. Each j can be chosen more than once.

There are also schemes mixing [I.0.a] and [I.0.b]: for example, we may pick T = m and choose j as in [I.0.b], but exactly once as in [I.0.a].

2.5 Properties of UNN

In this section, we state two fundamental theorems for UNN. The first theorem reports a general monotonic convergence property of UNN to the optimal loss, for any given surrogate function. The second theorem further refines this general convergence result by providing an effective convergence bound for the exponential loss.

Theorem 1
As the number of iteration steps T increases, UNN converges to the $h^\ell$ realizing the global minimum of the surrogate risk at hand (4), for any ψ meeting conditions (i), (ii) and (iii) above. (Proof sketch in Appendix.)

Although we prove the boosting ability of UNN for all applicable surrogate losses, we choose to show in particular its behavior for the exponential loss $\psi_{\exp}$, which features a far better convergence bound than the general one of Nock and Nielsen (2009).

Algorithm 1: Universal Nearest Neighbors
UNN(S) for $\psi = \psi_{\exp}$
Input: $S = \{(o_i, y_i),\ i = 1, 2, \ldots, m,\ o_i \in O,\ y_i \in \{-\frac{1}{C-1}, 1\}^C\}$;
Let $r^{(c)}_{ij} \doteq y_{ic} y_{jc}$ if $j \sim_k i$, and $0$ otherwise, $\forall i, j = 1, 2, \ldots, m$, $c = 1, 2, \ldots, C$;
for $c = 1, 2, \ldots, C$ do
  Let $\alpha_{jc} \leftarrow 0$, $\forall j = 1, 2, \ldots, m$;
  Let $w_i \leftarrow 1$, $\forall i = 1, 2, \ldots, m$;
  for $t = 1, 2, \ldots, T$ do
    [I.0] Weak index chooser oracle: let $j \leftarrow \mathrm{WIC}(\{1, 2, \ldots, m\}, t)$;
    [I.1] Let
      $$w_j^+ = \sum_{i :\, r^{(c)}_{ij} > 0} w_i , \qquad w_j^- = \sum_{i :\, r^{(c)}_{ij} < 0} w_i , \qquad (15)$$
      $$\delta_j \leftarrow \frac{1}{2} \log\Big(\frac{w_j^+}{w_j^-}\Big) ; \qquad (16)$$
    [I.2] Let $w_i \leftarrow w_i \exp\big(-\delta_j r^{(c)}_{ij}\big)$, $\forall i : j \sim_k i$; \qquad (17)
    [I.3] Let $\alpha_{jc} \leftarrow \alpha_{jc} + \delta_j$;
Output: $h^\ell_c(o_{i'}) = \sum_{j \sim_k i'} \alpha_{jc}\, y_{jc}$, $\forall c = 1, 2, \ldots, C$

Computing this bound is based on defining a weak index assumption (WIA), which is to nearest neighbors what the conventional weak learning assumption is to general induced classifiers (Schapire and Singer, 1999):
(WIA) Let $p_j^{(c)} \doteq w_j^{(c)+} / \big(w_j^{(c)+} + w_j^{(c)-}\big)$. There exist some $\gamma > 0$ and $\eta > 0$ such that the following two inequalities hold for the index j returned by WIC(., ., .):

$$\big| p_j^{(c)} - 1/2 \big| \geq \gamma , \qquad (18)$$
$$\big( w_j^{(c)+} + w_j^{(c)-} \big) \big/ \|w\|_1 \geq \eta . \qquad (19)$$

Theorem 2
If the
WIA holds for τ ≤ T steps of UNN (for each c), then $\varepsilon(h^\ell, S) \leq \exp(-2\eta\gamma^2\tau)$. (Proof sketch in Appendix.)

Inequality (18) is the usual weak learning assumption (Schapire and Singer, 1999), when considering examples as weak classifiers. But a weak coverage assumption (19) is needed as well, because insufficient coverage of the reciprocal neighbors could easily wipe out even the surrogate risk reduction potentially due to a large γ. In addition, even when classes are significantly overlapping, choosing k not too small is enough for the WIA to be met for a large number of boosting rounds τ, thus determining a potentially harsh decrease of $\varepsilon(h^\ell, S)$. This is important, as there are at most m different weak classifiers available to WIC(., ., .), even when each one may be chosen more than once under the WIA. Last but not least, Theorem 2 also displays the fact that classification (18) may be more important than coverage (19).
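To fix ideas, the following self-contained Python sketch (ours, not the authors' code) implements Alg. 1 for the exponential loss, with a brute-force Euclidean k-NN search, the smoothing of Eqs. (13)-(14), and the lazy oracle [I.0.a] sweeping each prototype once per class:

```python
import numpy as np

def unn_exponential(X, Y, k, n_rounds=None):
    """Universal Nearest Neighbors (Alg. 1) for psi = exp.
    X: (m, d) training observations; Y: (m, C) symmetric class vectors.
    Returns the (m, C) matrix of leveraging coefficients alpha."""
    m, C = Y.shape
    n_rounds = m if n_rounds is None else n_rounds
    # Reciprocal-neighbor lists: rnn[j] = indices i such that j is among the k-NN of i.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # assumption: an example is not its own neighbor
    nn = np.argsort(D, axis=1)[:, :k]    # nn[i] = k nearest neighbors of example i
    rnn = [np.where((nn == j).any(axis=1))[0] for j in range(m)]
    alpha = np.zeros((m, C))
    for c in range(C):
        w = np.ones(m)                               # w_i <- 1 (exponential loss)
        for t in range(n_rounds):
            j = t % m                                # lazy oracle [I.0.a]
            i_r = rnn[j]
            r = Y[i_r, c] * Y[j, c]                  # edges r_ij of Eq. (10)
            w_plus = w[i_r][r > 0].sum() + 1.0 / m   # smoothing, Eq. (13)
            w_minus = w[i_r][r < 0].sum() + 1.0 / m  # smoothing, Eq. (14)
            delta = 0.5 * np.log(w_plus / w_minus)   # Eq. (16)
            w[i_r] *= np.exp(-delta * r)             # Eq. (17)
            alpha[j, c] += delta                     # step [I.3]
    return alpha
```

A query is then classified by the leveraged vote of Eq. (9) using the learned alpha, e.g. via the leveraged_knn_scores sketch given earlier.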
3 Experiments

In this section, we present experimental results of UNN vs. plain k-NN on both synthetic and real datasets. Such experiments allowed us to quantify the gains brought by boosting on nearest neighbors voting (Marin et al, 2009). For this purpose, we first performed tests on two-class synthetic data to drill down into the performances of UNN (Sec. 3.1). In Sec. 3.2 we discuss the data reduction ability of our technique. Then, we carried out experiments of multi-class scene categorization on a dataset of natural images and compared the results of UNN to plain k-NN classification (Sec. 3.3).

3.1 Synthetic datasets

We have drilled down into the experimental behavior of UNN using the synthetic Ripley's dataset (Ripley, 1994), with two classes denoted by P and N. Each population of this dataset is an equal mixture of two two-dimensional normally distributed populations, which are equally likely. The training and test datasets (consisting of 250 and 1000 points, respectively) are shown in Figure 2, where the optimal classification boundary of the Bayes rule is also displayed. This corresponds to the best theoretical error rate of 8.0% (Ripley, 1994).

Fig. 3 validates on this dataset the monotonic decay of the exponential risk (6), mathematically proved in Theorem 2 under the two basic weak index/learning assumptions. It also shows the effect of three different implementations of the WIC oracle (Sec. 2.5). Note that the boosting approach for selecting weak classifiers provides a much faster decay of the surrogate risk, thus outperforming the two tested "lazy" implementations. In these latter cases, the index j of the weak classifier at each UNN iteration was chosen either randomly or following the order of examples in their respective categories.

Classification results for a range of values of k are shown in Fig. 4. They enable us to draw two main conclusions. First, test errors display a robustness of UNN against variations of k. Second, filtering out even a large proportion 1 − θ of the examples with the smallest $\|\alpha_{j\cdot}\|$ does not degrade classification performance, and can even significantly improve it. As witnessed by Fig. 4, values as small as θ = 0.25 yield improvements that make the test error close to Bayes'. (E.g., see the minimum error of boosted k-NN for θ = 0.25, k = 9.) We investigate such a data reduction ability of UNN in the following section.

Fig. 2
Training and validation data for the Ripley's dataset. The Bayes boundary is also drawn, as reported in (Ripley, 1994).
Fig. 3
Decrease of $\varepsilon_{\psi_{\exp}}(h^\ell, S)$ as a function of T in UNN for the Ripley's dataset, for different oracle implementations (WIC = boosting, random order, alphabetic order). Note that the boosting implementation ([I.0.b], Sec. 2.4) always guarantees monotonic decrease of the surrogate loss, until the weak assumptions are matched (red curve). Conversely, the lazy implementation ([I.0.a], Sec. 2.4) may select, at a given step, a classifier that does not match those assumptions, thus preventing the loss from strictly decreasing (see green and blue curves).

Fig. 4
Test error for UNN as a function of k for boosted k-NN (k-NN, and UNN with θ = 1, 0.75, 0.5, 0.25). The Bayes rule yields an 8% optimal misclassification rate.

3.2 Data reduction

Filtering out the least relevant prototypes has two main beneficial effects. The first is a margin effect, well known for induced classifiers (Schapire et al, 1998). The goodness-of-fit of the k-NN rule is driven by the most accurate examples, i.e. those surrounded by examples of the same class, which get the largest $\|\alpha_{j\cdot}\|$. The least accurate ones, e.g. those located in overlapping regions between two classes, get the smallest. Discarding these latter examples tends to increase a gap between class clouds, but each cloud may shelter examples of different classes. Fortunately, filtering with boosting is accompanied by a subtle local repolarization of predictions which, as explained in Figure 5 for the smallest value of θ, makes this gap maximization translate to margin maximization, for which positive effects on learning are known (Schapire et al, 1998). The second effect is structural: in nearest neighbors rules, the frontier between classes stems from the Voronoi cells of those least accurate examples. Discarding them separates the classes better, as witnessed by Fig. 5. Above all, it reduces the number of Voronoi cells involved in the class frontiers, thus reducing structural parameters (VC-dimension) of the classifier, possibly buying a reduction of the test error as well (Schapire et al, 1998).

3.3 Image Categorization

We tested our k-NN boosting algorithm for image categorization. In particular, we used the global Gist descriptor of Oliva and Torralba (2001) in order to obtain a meaningful representation of images. This descriptor provides a global representation of a scene, while not requiring explicit segmentation of image regions and objects. In the typical setting, an image is represented by a single vector of dimension 512, which collects features related to the spatial organization of dominant scales and orientations in the image. This correspondence between images and descriptors is one of the main advantages of using global descriptors over representations based on bags of local features (Grauman and Darrell, 2005). Indeed, global descriptors are straightforwardly adapted to image categorization methods relying on machine learning techniques, as most of these techniques, from prototype-based to kernel-based, require any instance of a particular category to be represented by a single vector. In particular, this is the case of k-NN classification, which explicitly relies on measuring one-to-one similarity between a query image and prototype images. In addition, Gist descriptors have proven successful in representing relevant contextual information of natural scenes, which allows computing meaningful priors for exploration tasks like object detection and localization (Rubin et al, 2003).

The dataset we used contains 2688 color images of outdoor scenes of size 256x256 pixels, divided into 8 categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways. One example image of each category is shown in Fig. 6.

To extract global descriptors from these images we used the Matlab implementation by Torralba (publicly available at http://people.csail.mit.edu/torralba/code/spatialenvelope/sceneRecognition.m), with the most common settings: 4 resolution levels of the Gabor pyramid, 8 orientations per scale and 4 × 4 blocks.

We used this database to validate UNN for different values of k. In particular, we concentrated on evaluating classification performances when filtering the prototype dataset, i.e. retaining a proportion θ of the most relevant examples as prototypes for classification.
Such a data reduction capability is one of the most interesting properties of UNN, as it favourably impacts the computational cost of classification, which grows at least logarithmically (at most linearly) with the dataset size. Indeed, classification roughly amounts to searching for the k nearest neighbors among the prototypes, which is O(kdθm) for linear exhaustive search and O(kd log(θm)) for fast kD-tree based search (Arya et al, 1998) (d being the dimension of the feature vectors, θ the proportion of retained prototypes).

Fig. 7 shows results of 3-fold cross-validation in terms of the mean Average Precision (mAP) as a function of θ, for different values of k. (The mAP was computed by averaging classification rates over categories, i.e., the diagonal of the confusion matrix, and then averaging those values over the 3 cross-validation folds (Oliva and Torralba, 2001).) Indeed, we randomly split the database into 3 distinct subsets, each containing 896 images. Then, for each fold, we used one of these subsets as the training set, while validating on the two remaining subsets. In each experiment, UNN was run over the training set and a subset of the trained weak classifiers was retained as prototypes for classifying the test images. In particular, we selected all training images j with leveraging coefficients $\alpha_{jc}$, $c = 1, 2, \ldots, C$, such that $\alpha_{jc} > \tilde\alpha > 0$. Note from Fig. 7 that, even when fixing the threshold $\tilde\alpha$ so as to retain all the examples, the actual proportion θ of prototypes is less than one, because UNN always discards the examples with null leveraging coefficients, which do not match assumptions (18, 19).

We compared UNN with classic k-NN classification. Namely, in order for the classification cost of k-NN to be roughly the same as that of UNN, we carried out random sampling of the prototype dataset for selecting proportion θ (between 10% and the whole set of examples). UNN significantly outperforms classic k-NN, increasingly so with k, as shown in Fig. 8(a).
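As an illustration of this filtering step, here is a short sketch (ours; alphas stands for the m × C matrix of leveraging coefficients learned by UNN, and the "keep if any class coefficient passes the threshold" criterion is a simplification of the per-class selection described above):

```python
import numpy as np

def filter_prototypes(X_train, Y, alphas, alpha_min=0.0):
    """Retain prototypes j such that alpha[j, c] > alpha_min for at least one class c.
    Examples with only null (or too small) leveraging coefficients are discarded."""
    keep = (alphas > alpha_min).any(axis=1)
    theta = keep.mean()                      # actual proportion of retained prototypes
    return X_train[keep], Y[keep], alphas[keep], theta
```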
Fig. 5
Maps of positive/negative leveraging coefficients $\alpha_j$ over training data for k = 3 and three different values of θ. Examples of class N with negative $\alpha_j$ (filled squares) and those of class P with positive $\alpha_j$ (empty circles) predict class P; similarly, empty squares and filled circles both correspond to membership prediction in N. For this reason, for the smallest value of θ, filtering produces a clear-cut gap between the two possible membership predictions (but not between the original classes). The optimal Bayes boundary between classes is shown as well. Interestingly, while this frontier still does not separate the original classes (without error), it does separate the membership predictions, with a much larger minimal margin. The combination of the data reduction and polarity reversal for memberships has thus simplified the learning of S, and eased the capture of the optimal frontier with nearest neighbors.

Fig. 6
Examples of annotated images of the database containing 2688 images classified into 8 categories (coast, forest, highway, inside city, mountain, open country, street, tall buildings).
Image categorization results confirm the trend observed on the synthetic data when filtering the prototype dataset. Hence, selecting a reduced set of prototypes limits overfitting on the training data, while improving classification performance on the test set (typically a 3% improvement). Most interestingly, the classification precision of UNN is very stable as a function of θ, as shown in Fig. 8(b), where the drop of UNN precision for the largest values of θ is due to including prototypes with negative leveraging coefficients as well. To summarize, UNN displays the ability to discriminate the most relevant images of each class, thus inducing a classification rule robust to "noisy" prototypes arising from low inter-class variations. Adjusting the value of the threshold $\tilde\alpha$ enables one to remove those confusing prototypes, thus reducing the representation of each category to a sparse subset of meaningful prototype images.

Fig. 7
Classification performances of UNN compared to k-NN in 3-fold cross-validation (mAP as a function of θ, for k = 5, 9, 13, 17).

Fig. 8
Performances of k-NN and UNN classification as a function of (a) k and (b) θ. (The best results obtained with each of the two methods are plotted.)

Fig. 10 shows two examples of how the leveraged k-NN rule may correct misclassifications due to the uniform k-NN voting. E.g., in the first example, the classic and the boosted k-NN methods are compared when classifying an image belonging to class coast, with k = 11. The leveraged rule with as few as 20% of the prototype images is able to correctly label the query image (first row). Below each nearest neighbor image we show its contribution to the classifier of (9): note that negative votes are significantly smaller than positive ones (up to an order of magnitude), thus determining positive labeling with a high prediction score $h^\ell_c$, according to (9). On the contrary, the uniform voting rule with all prototypes misclassifies the test image, not being able to reject contributions by "noisy" neighbor images. An example of prototypes selected by filtering the dataset is shown in Fig. 11, where the leveraging coefficients refer to the first category (tall buildings) versus the remaining ones.

4 Conclusion

In this paper, we contribute to fill an important void of NN methods, showing how boosting can be transferred to k-NN classification. Namely, we propose a novel boosting algorithm, UNN (Universal Nearest Neighbors), for inducing a leveraged k-NN rule. This rule generalizes classic k-NN to weighted voting, where the weights, the so-called leveraging coefficients, are iteratively learned by UNN. We prove that this algorithm converges to the global optimum of surrogate risks under very mild assumptions.

Experiments on both synthetic and image categorization databases show that UNN provides significant performance improvements (up to the best possible performance of the Bayes rule). Moreover, UNN exhibits a consistent data reduction ability, which results in significant speed-ups for classification (up to a factor of 16 when removing 3/4 of the coefficients).

Our approach is built on top of k-NN search, thus being fully compatible with existing techniques relying on metric distance learning (Zhang et al, 2006) as well as subspace projections like PCA (Jain, 2008) or kernel transformations of the input space, which are expected to enable significant improvements of categorization performance.

References
Amores J, Sebe N, Radeva P (2006) Boosting the distance estimation: Application to the k-nearest neighbor classifier. Pattern Recognition Letters 27(3):201–209
Arya S, Mount DM, Netanyahu NS, Silverman R, Wu AY (1998) An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM 45(6):891–923
Bartlett P, Jordan M, McAuliffe JD (2006) Convexity, classification, and risk bounds. Journal of the American Statistical Association 101:138–156
Brighton H, Mellish C (2002) Advances in instance selection for instance-based learning algorithms. Data Mining and Knowledge Discovery 6:153–172
Dudani S (1976) The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics 6(4):325–327
Fukunaga K, Flick T (1984) An optimal global nearest neighbor metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(3):314–318
García-Pedrajas N, Ortiz-Boyer D (2009) Boosting k-nearest neighbor classifier by means of input space projection. Expert Systems with Applications 36(7):10570–10582
Grauman K, Darrell T (2005) The pyramid match kernel: Discriminative classification with sets of image features. In: IEEE International Conference on Computer Vision, Beijing, China
Hart PE (1968) The Condensed Nearest Neighbor rule. IEEE Transactions on Information Theory 14:515–516
Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(6):607–616
Holmes CC, Adams NM (2003) Likelihood inference in nearest-neighbour classification models. Biometrika 90:99–112
Jain AK (2008) Data clustering: 50 years beyond k-means. In: ECML PKDD '08: Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I, Springer-Verlag, Berlin, Heidelberg, pp 3–4
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2):91–110
Marin JM, Robert CP, Titterington DM (2009) A Bayesian reassessment of nearest-neighbor classification. Journal of the American Statistical Association
Nock R, Nielsen F (2009) On the efficient minimization of classification calibrated surrogates. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in Neural Information Processing Systems 21, MIT Press, pp 1201–1208
Oliva A, Torralba A (2001) Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42(3):145–175
Paredes R (2006) Learning weighted metrics to minimize nearest-neighbor classification error. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7):1100–1110
Quattoni A, Torralba A (2009) Recognizing indoor scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Ripley B (1994) Neural networks and related methods for classification. Journal of the Royal Statistical Society Series B 56:409–456
Rubin M, Freeman W, Murphy K, Torralba A (2003) Context-based vision system for place and object recognition. In: MIT AIM
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning Journal 37:297–336
Schapire RE, Freund Y, Bartlett P, Lee WS (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics 26:1651–1686
Shakhnarovich G, Darrell T, Indyk P (2006) Nearest-Neighbors Methods in Learning and Vision. MIT Press
Yu K, Ji L, Zhang X (2002) Kernel nearest-neighbor algorithm. Neural Processing Letters 15(2):147–156
Zhang H, Berg AC, Maire M, Malik J (2006) SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In: CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC, USA, pp 2126–2136
Zhu J, Rosset S, Zou H, Hastie T (2006) Multi-class AdaBoost. Tech. rep., Department of Statistics, University of Michigan, Ann Arbor, MI 48109
Zuo W, Zhang D, Wang K (2008) On kernel difference-weighted k-nearest neighbor classification. Pattern Analysis and Applications 11(3-4):247–257

5 Appendix

Generic UNN algorithm
The general version of UNN is shown in Alg. 2. This algorithm induces the leveraged k-NN rule (9) for the broad class of surrogate losses meeting the conditions of Bartlett et al (2006), thus generalizing Alg. 1. Namely, we constrain ψ to meet the following conditions: (i) $\mathrm{im}(\psi) = \mathbb{R}_+$, (ii) $\nabla\psi(0) < 0$ ($\nabla\psi$ is the conventional derivative of the loss function ψ), and (iii) ψ is strictly convex and differentiable. (i) and (ii) imply that ψ is classification-calibrated: its local minimization is roughly tied up to that of the empirical risk (Bartlett et al, 2006). (iii) implies convenient algorithmic properties for the minimization of the surrogate risk (Nock and Nielsen, 2009). Three common examples have been shown in Eqs. (5)-(7).

The main bottleneck of UNN is step [I.1], as Eq. (21) is non-linear, but it always has a solution, finite under mild assumptions (Nock and Nielsen, 2009): in our case, $\delta_j$ is guaranteed to be finite when there is no total matching or mismatching of example j's memberships with its reciprocal neighbors', for the class at hand. The second column of Table 1 contains the solutions to (21) for the surrogate losses mentioned in Sec. 2.2. Those solutions are always exact for the exponential loss ($\psi_{\exp}$) and the squared loss ($\psi_{\mathrm{squ}}$); for the logistic loss ($\psi_{\log}$) the solution is exact when the weights in the reciprocal neighborhood of j are the same, and approximate otherwise. Since the starting weights are all the same, exactness can be guaranteed during a large number of inner rounds, depending on the order in which the examples are chosen. Table 1 helps to formalize the finiteness condition on $\delta_j$ mentioned above: when either sum of weights in (20) is zero, the solutions in the first and third lines of Table 1 are not finite. A simple strategy to cope with numerical problems arising from such situations is the one proposed by Schapire and Singer (1999) (see Sec. 2.4). Table 1 also shows how the weight update rule (22) specializes for the mentioned losses.
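As an illustration of step [I.1] of Alg. 2, a minimal numerical sketch (ours; grad_psi and grad_psi_inv stand for ∇ψ and its inverse, to be supplied for the chosen loss) solves (21) by bisection, exploiting the fact that its left-hand side is non-decreasing in $\delta_j$ for a strictly convex ψ:

```python
import numpy as np

def solve_delta(r_j, w, grad_psi, grad_psi_inv, span=50.0, iters=100):
    """Solve Eq. (21) for delta_j by bisection.
    r_j: column j of R^(c); w: current weights.
    F(delta) = sum_i r_ij * grad_psi(delta * r_ij + grad_psi_inv(-w_i))
    is non-decreasing in delta because psi is strictly convex."""
    base = grad_psi_inv(-w)
    F = lambda delta: np.sum(r_j * grad_psi(delta * r_j + base))
    lo, hi = -span, span               # assumption: the root lies in [-span, span]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Exponential loss: grad_psi(x) = -exp(-x), grad_psi_inv(u) = -log(-u).
delta_exp = lambda r_j, w: solve_delta(r_j, w,
                                       grad_psi=lambda x: -np.exp(-x),
                                       grad_psi_inv=lambda u: -np.log(-u))
```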
Proof sketch of Theorem 1

We show that UNN converges to the global optimum of any surrogate risk (Sec. 2.2). So, let us consider the surrogate risk (4) for any fixed class $c = 1, 2, \ldots, C$:
Algorithm 2: Universal Nearest Neighbors
UNN(S, ψ)
Input: $S = \{(o_i, y_i),\ i = 1, 2, \ldots, m,\ o_i \in O,\ y_i \in \{-\frac{1}{C-1}, 1\}^C\}$, with ψ meeting (i), (ii), (iii) (Sec. 5);
Let $r^{(c)}_{ij} \doteq y_{ic} y_{jc}$ if $j \sim_k i$, and $0$ otherwise, $\forall i, j = 1, 2, \ldots, m$, $c = 1, 2, \ldots, C$;
for $c = 1, 2, \ldots, C$ do
  Let $\alpha_{jc} \leftarrow 0$, $\forall j = 1, 2, \ldots, m$;
  Let $w_i \leftarrow -\nabla\psi(0) > 0$, $\forall i = 1, 2, \ldots, m$;
  for $t = 1, 2, \ldots, T$ do
    [I.0] Let $j \leftarrow \mathrm{WIC}(\{1, 2, \ldots, m\}, t)$;
    [I.1] Let
      $$w_j^+ = \sum_{i :\, r^{(c)}_{ij} > 0} w_i , \qquad w_j^- = \sum_{i :\, r^{(c)}_{ij} < 0} w_i , \qquad (20)$$
    [I.1] Let $\delta_j \in \mathbb{R}$ be a solution of:
      $$\sum_{i=1}^{m} r^{(c)}_{ij}\, \nabla\psi\Big(\delta_j r^{(c)}_{ij} + \nabla^{-1}\psi(-w_i)\Big) = 0 ; \qquad (21)$$
    [I.2] $\forall i : j \sim_k i$, let
      $$w_i \leftarrow -\nabla\psi\Big(\delta_j r^{(c)}_{ij} + \nabla^{-1}\psi(-w_i)\Big) . \qquad (22)$$
    [I.3] Let $\alpha_{jc} \leftarrow \alpha_{jc} + \delta_j$;
Output: $h^\ell_c(o_{i'}) = \sum_{j \sim_k i'} \alpha_{jc}\, y_{jc}$, $\forall c = 1, 2, \ldots, C$

Fig. 9
A geometric view of how UNN converges to the global optimum of (4). (See Appendix for details and notations.)

$$\varepsilon^c_\psi(h, S) \doteq \frac{1}{m} \sum_{i=1}^{m} \psi(\varrho(h, i, c)) . \qquad (23)$$

Let $w_t$ denote the t-th weight vector inside the "for c" loop of Alg. 2 (assuming $w_0$ is the initialization of w); similarly, $h^\ell_t$ denotes the t-th leveraged k-NN rule obtained after the update in [I.3]. The following identity holds, whose proof follows from Nock and Nielsen (2009):

$$\psi(\varrho(h^\ell_t, i, c)) = g + D_{\tilde\psi}(0 \,\|\, w_{ti}) , \qquad (24)$$

Table 1
Three common loss functions and the corresponding solutions $\delta_j$ of (21) and $w_i$ of (22). (Vector $r^{(c)}_j$ designates column j of $R^{(c)}$ and $\|.\|_2$ is the L2 norm.) The rightmost column says whether the expression is (A)lways the solution, or whether it is so when the weights of the reciprocal neighbors of j are the (S)ame.

loss function | $\delta_j$ in (21) | $w_i$ in (22) | Opt
$\psi_{\exp} \doteq \exp(-x)$ | $\frac{1}{2}\log\big(w_j^{(c)+}/w_j^{(c)-}\big)$ | $w_i \exp\big(-\delta_j r^{(c)}_{ij}\big)$ | A
$\psi_{\mathrm{squ}} \doteq (1-x)^2$ | $\big(w_j^{(c)+} - w_j^{(c)-}\big) / \|r^{(c)}_j\|_2^2$ | $w_i - \delta_j r^{(c)}_{ij}$ | A
$\psi_{\log} \doteq \log(1+\exp(-x))$ | $\log\big(w_j^{(c)+}/w_j^{(c)-}\big)$ | $w_i \exp\big(-\delta_j r^{(c)}_{ij}\big) \big/ \big(1 - w_i\big(1 - \exp(-\delta_j r^{(c)}_{ij})\big)\big)$ | S

where $g(m) \doteq -\tilde\psi(0)$ does not depend on the k-NN rule. Eq. (24) makes the connection between the real-valued classification problem and a geometric problem in the non-metric space of weights. Here, we have made use of the following notations: $\tilde\psi(x) \doteq \psi^\star(-x)$, where $\psi^\star(x) \doteq x \nabla^{-1}\psi(x) - \psi(\nabla^{-1}\psi(x))$ is the Legendre conjugate of ψ; $D_{\tilde\psi}(w_i \| w'_i) \doteq \tilde\psi(w_i) - \tilde\psi(w'_i) - (w_i - w'_i)\nabla\tilde\psi(w'_i)$ is the Bregman divergence with generator $\tilde\psi$ (Nock and Nielsen, 2009). $\psi^\star$ is related to ψ in such a way that $\nabla\tilde\psi(x) = -\nabla^{-1}\psi(-x)$. Eq. (24) proves handy as one computes the difference $\varepsilon^c_\psi(h^\ell_{t+1}, S) - \varepsilon^c_\psi(h^\ell_t, S)$. Indeed, using (24) in (23), and computing $\delta_j$ in (21) so as to bring $h^\ell_{t+1}$ from $h^\ell_t$, we obtain:

$$\varepsilon^c_\psi(h^\ell_{t+1}, S) - \varepsilon^c_\psi(h^\ell_t, S) = -\frac{1}{m}\sum_{i=1}^m D_{\tilde\psi}\big(w_i^{(t+1)} \,\|\, w_i^t\big) . \qquad (25)$$

Since Bregman divergences are non-negative and meet the identity of the indiscernibles, (25) implies that steps [I.1]-[I.3] guarantee the decrease of (23) as long as $\delta_j \neq 0$. But (23) is lower-bounded, hence UNN must converge. In addition, it converges to the global optimum of (23). Since predictions for each class are independent, the proof consists in showing that (23) converges to its global minimum for each c. Assume this convergence for the current class c. Then, following Nock and Nielsen (2009), (21) and (22) imply that, when any possible $\delta_j = 0$, the weight vector, say $w^\infty$, satisfies $R^{(c)\top} w^\infty = \mathbf{0}$, i.e., $w^\infty \in \ker R^{(c)\top}$, and $w^\infty$ is unique. But the kernel of $R^{(c)\top}$ and $\overline{W}$, the closure of W, are provably Bregman-orthogonal (Nock and Nielsen, 2009), thus yielding:

$$\underbrace{\sum_{i=1}^m D_{\tilde\psi}(0 \,\|\, w_i)}_{m\varepsilon^c_\psi(h^\ell, S) - mg} = \underbrace{\sum_{i=1}^m D_{\tilde\psi}(0 \,\|\, w^\infty_i)}_{m\varepsilon^c_\psi(h^{\ell\infty}, S) - mg} + \underbrace{\sum_{i=1}^m D_{\tilde\psi}(w^\infty_i \,\|\, w_i)}_{\geq 0} , \quad \forall w \in W . \qquad (26)$$

Underbraces use (24) in (23), and $h^\ell$ is a leveraged k-NN rule corresponding to w. One obtains that $h^{\ell\infty}$ achieves the global minimum of (23), as claimed.

The proof sketch is graphically summarized in Figure 9. In particular, two crucial Bregman orthogonalities are mentioned (Nock and Nielsen, 2009). The red one symbolizes:

$$\sum_{i=1}^m D_{\tilde\psi}(0 \,\|\, w^t_i) = \sum_{i=1}^m D_{\tilde\psi}\big(0 \,\|\, w^{(t+1)}_i\big) + \sum_{i=1}^m D_{\tilde\psi}\big(w^{(t+1)}_i \,\|\, w^t_i\big) , \qquad (27)$$

which is equivalent to (25). The black one, on $w^\infty$, is (26).
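For concreteness, the following short derivation (ours, under the simplifying assumption that the edges $r^{(c)}_{ij}$ take values in $\{-1, 0, +1\}$) shows how the generic condition (21) specializes to the closed form in the first line of Table 1, i.e. Eq. (16), for the exponential loss. For $\psi_{\exp}(x) = e^{-x}$ one has $\nabla\psi(x) = -e^{-x}$ and $\nabla^{-1}\psi(-w_i) = -\log w_i$, so that $\nabla\psi\big(\delta_j r^{(c)}_{ij} + \nabla^{-1}\psi(-w_i)\big) = -w_i\, e^{-\delta_j r^{(c)}_{ij}}$. Plugging this into (21) yields

$$\sum_{i=1}^{m} r^{(c)}_{ij}\, w_i\, e^{-\delta_j r^{(c)}_{ij}} = 0 \;\Longleftrightarrow\; e^{-\delta_j}\!\!\sum_{i:\, r^{(c)}_{ij}>0}\!\! w_i = e^{\delta_j}\!\!\sum_{i:\, r^{(c)}_{ij}<0}\!\! w_i \;\Longleftrightarrow\; \delta_j = \frac{1}{2}\log\frac{w_j^{+}}{w_j^{-}} ,$$

which is exactly (16).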
Proof sketch of Theorem 2

Using developments analogous to those of Nock and Nielsen (2009), UNN can be shown to be equivalent to AdaBoost in which m weak classifiers are available, each one being an example. Each weak classifier returns a value in $\{-1, 0, +1\}$, where 0 is reserved for examples outside the reciprocal neighborhood. Theorem 3 of Schapire and Singer (1999) brings in our case:

$$\varepsilon(h^\ell, S) \leq \frac{1}{C}\sum_{c=1}^{C}\prod_{t=1}^{T} Z^{(c)}_t , \qquad (28)$$

where $Z^{(c)}_t \doteq \sum_{i=1}^{m} \tilde w^{(c)}_{it}$ is the normalizing coefficient for each weight vector in UNN. ($\tilde w^{(c)}_{it}$ denotes the weight of example i at iteration (t, c) of UNN, and the tilde notation refers to weights normalized to unity at each step.) It follows that:

$$Z^{(c)}_t = 1 - \tilde w^{(c)+-}_{jt}\Big(1 - 2\sqrt{p^{(c)}_{jt}\big(1 - p^{(c)}_{jt}\big)}\Big) \leq \exp\Big(-\tilde w^{(c)+-}_{jt}\Big(1 - 2\sqrt{p^{(c)}_{jt}\big(1 - p^{(c)}_{jt}\big)}\Big)\Big) \leq \exp\Big(-\eta\big(1 - \sqrt{1 - 4\gamma^2}\big)\Big) \leq \exp(-2\eta\gamma^2) ,$$

where $\tilde w^{(c)+-}_{jt} \doteq \tilde w^{(c)+}_{jt} + \tilde w^{(c)-}_{jt}$ and $p^{(c)}_{jt} \doteq \tilde w^{(c)+}_{jt} / \tilde w^{(c)+-}_{jt} = w^{(c)+}_{jt} / w^{(c)+-}_{jt}$. The first inequality uses $1 - x \leq \exp(-x)$, and the second the WIA. Since, even when the
WIA does not hold, we still observe $Z^{(c)}_t \leq 1$, plugging the last inequality into (28) yields the statement of the theorem.

Fig. 10
Two examples where UNN corrects misclassifications of k-NN. The query image is shown in the leftmost column. The 11 nearest prototype images are shown on the right: the first row refers to UNN with 20% of retained prototypes (θ = 0.2), whereas the second row refers to classic k-NN classification over all prototypes (θ = 1). Neighbors in the same category as the query image are surrounded by black boxes. Votes given by each prototype for the true category (coast, in the first example) are shown below each image (such values correspond to $\alpha_{jc} y_{jc}$ in (9), where c is the ground-truth category). In the examples shown, UNN predicts the correct categories (coast and tall buildings), whereas uniform k-NN predicts highway and open country, respectively.

Fig. 11
Examples of image prototypes with their leveraging coefficients for category 1 (tall buildings) versus the remaining ones.