Boosting k-NN for categorization of natural scenes
Paolo Piro · Richard Nock · Frank Nielsen · Michel Barlaud

P. Piro · M. Barlaud
University of Nice-Sophia Antipolis / CNRS, 2000 route des Lucioles, 06903 Sophia Antipolis Cedex, France
E-mail: [email protected]

R. Nock
CEREGMIA Department, University of Antilles-Guyane, Martinique, France
E-mail: [email protected]

F. Nielsen
Department of Fundamental Research, Sony Computer Science Laboratories, Inc., Tokyo, Japan
LIX Department, Ecole Polytechnique, Palaiseau, France
E-mail: [email protected]
Abstract
The k-nearest neighbors (k-NN) classification rule has proven extremely successful in countless computer vision applications. For example, image categorization often relies on uniform voting among the nearest prototypes in the space of descriptors. In spite of its good generalization properties and its natural extension to multi-class problems, the classic k-NN rule suffers from high variance when dealing with sparse prototype datasets in high dimensions. A few techniques have been proposed in order to improve k-NN classification, which rely on either deforming the nearest neighborhood relationship by learning a distance function or modifying the input space by means of subspace selection. In this paper, we propose a novel boosting algorithm, called UNN (Universal Nearest Neighbors), which induces a leveraged k-NN rule, thus generalizing the classic k-NN rule. Our approach consists in redefining the voting rule as a strong classifier that linearly combines predictions from the k closest prototypes. Therefore, the k nearest neighbor examples act as weak classifiers and their weights, called leveraging coefficients, are learned by UNN so as to minimize a surrogate risk, which upper bounds the empirical misclassification rate over the training data. A major feature of UNN is the ability to learn which prototypes are the most relevant for a given class, thus allowing for effective data reduction by filtering the training data. Experimental results on the synthetic two-class dataset of Ripley show that such a filtering strategy is able to reject "noisy" prototypes, and yields a classification error close to the optimal Bayes error. We carried out image categorization experiments on a database containing eight classes of natural scenes. We show that our method significantly outperforms the classic k-NN classification, while enabling a significant reduction of the computational cost by means of data filtering.
Keywords Boosting · k-nearest neighbors · Image categorization · Scene classification

1 Introduction

1.1 Image categorization

Image categorization consists in assigning images to predefined semantic categories, e.g., indoor vs outdoor, beaches vs mountains, churches vs towers. Generic categorization is distinct from object and scene recognition, which are classification tasks concerning particular instances of objects or scenes (e.g. Notre Dame Cathedral vs St. Peter's Basilica). It is also distinct from other related computer vision tasks, such as content-based image retrieval (which aims at finding images from a database that are semantically related or visually similar to a given query image) and object detection (which requires finding both the presence and the position of a target object in an image, e.g. person detection).

Automatic categorization of generic scenes is still a challenging task, due to the huge number of natural categories that should be considered in general. In addition, natural image categories may exhibit high intra-class variability (i.e., visually different images may belong to the same category) and low inter-class variability (i.e., distinct categories may contain visually similar images).

Classifying images requires a reliable description of the content relevant for an application (e.g., location and shape of specific objects, or overall scene appearance). Examples of suitable image descriptors for categorization purposes are Gist, i.e. global image features representing the overall scene (Oliva and Torralba, 2001), and SIFT descriptors, i.e. descriptors of local features extracted at salient patches (Lowe, 2004).

The Gist descriptor is based on the so-called "spatial envelope" (Oliva and Torralba, 2001), which is a very effective low-dimensional representation of the overall scene based on spectral information. Such a representation bypasses segmentation, extraction of keypoints and processing of individual objects and regions, thus enabling a compact global description of images. Gist descriptors have been successfully used for categorizing locations and environments, showing their ability to provide relevant priors for more specific tasks, like object recognition and detection (Rubin et al, 2003).

1.2 k-NN classification

Apart from the descriptors used to compactly represent images, most image categorization methods rely on supervised learning techniques for exploiting information about known samples when classifying an unlabeled sample. Among these techniques, k-NN classification has proven successful, thanks to its easy implementation and its good generalization properties (Shakhnarovich et al, 2006). Indeed, the k-NN rule does not require explicit construction of the feature space and is naturally adapted to multi-class problems. Moreover, from the theoretical point of view, k-NN classification provably tends to the Bayes optimal rule as the sample size increases. Although such advantages make k-NN classification very attractive to practitioners, it is an algorithmic challenge to speed up k-NN queries and to design schemes that scale up well with large-dimensional datasets (Shakhnarovich et al, 2006). Moreover, it is yet another challenge to reduce the misclassification rate of the k-NN rule, usually tackled by data reduction techniques (Hart, 1968).

In a number of works, the classification problem has been reduced to tracking ill-defined categories of neighbors, interpreted as "noisy" (Brighton and Mellish, 2002).
Most of these recent techniques are in fact partial solutions to a larger problem related to the nearest neighbors' error, which does not have to be the discrete prediction of labels, but rather a continuous estimation of class membership probabilities (Holmes and Adams, 2003). This problem has been reformulated by Marin et al (2009) as a strong advocacy for the formal transposition of boosting to nearest neighbors classification. Such a formalization is challenging, as nearest neighbors rules are indeed not induced, whereas all formal boosting algorithms induce so-called strong classifiers by combining weak classifiers (also induced, say by decision stumps).

A survey of the literature shows that at least four different categories of approaches have been proposed in order to improve k-NN classification:

– learning local or global adaptive distance metrics;
– embedding data in a feature space (kernel nearest neighbors);
– distance-weighted and difference-weighted nearest neighbors;
– boosting nearest neighbors.

The earliest approaches to generalizing the k-NN classification rule relied on learning an adaptive distance metric from training data. Refer to the seminal work of Fukunaga and Flick (1984), who presented an optimal global metric for k-NN. An analogous approach was later adopted by Hastie and Tibshirani (1996), who carried out linear discriminant analysis to adaptively deform the distance metric. More recently, Paredes (2006) has proposed a method for learning a weighted distance, where weights can be either global (i.e., only depending on classes and features) or local (i.e., depending on each individual prototype as well).

Other more recent techniques apply the nearest neighbors rule to data embedded in a high-dimensional feature space, following the kernel trick approach of support vector machines. For example, Yu et al (2002) have proposed a straightforward adaptation of the kernel mapping to the nearest neighbors rule, which yields a significant improvement in terms of classification accuracy. In the context of vision, a successful technique has been proposed by Zhang et al (2006), which involves a "refinement" step at classification time, without relying on explicitly learning the distance metric. This method trains a local support vector machine on the nearest neighbors of a given query, thus limiting the most expensive computations to a reduced subset of prototypes.

Another class of k-NN methods relies on weighting nearest neighbors votes based on their distances to the query sample (Dudani, 1976). Recently, Zuo et al (2008) have proposed a similar weighting approach, where the nearest neighbors are weighted based on their vector difference to the query. Such a difference-weight assignment is defined as a constrained optimization problem of sample reconstruction from its neighborhood. The same authors have proposed a kernel-based non-linear version of this algorithm as well.

Finally, only very few works have proposed the use of boosting techniques for k-NN classification. For instance, Amores et al (2006) use AdaBoost for learning a distance function to be used for k-NN search. On the other hand, García-Pedrajas and Ortiz-Boyer (2009) adopt the boosting approach in a non-conventional way: at each iteration a different k-NN classifier is trained over a modified input space. Namely, the authors propose two variants of the method, depending on the way the input space is modified.
Their first algorithm is based on optimal subspace selection, i.e., at each boosting iteration the most relevant subset of input data is computed. The second algorithm relies on modifying the input space by means of non-linear projections. But neither method is strictly an algorithm for inducing weak classifiers from the k-NN rule, thus not directly addressing the problem of boosting k-NN classifiers. Moreover, such approaches are computationally expensive, as they rely on a genetic algorithm and a neural network, respectively.

Conversely, we propose a complete solution to the problem of boosting k-NN classifiers in the general multi-class setting. Namely, we propose a novel boosting algorithm, called UNN, which induces a leveraged nearest neighbors rule that generalizes the uniform k-NN rule. Indeed, the voting rule is redefined as a strong classifier that linearly combines weak classifiers induced by the k-NN rule. Therefore, our approach does not need to learn a distance function, as it directly operates on top of k-nearest neighbors search. At the same time, it does not require an explicit computation of the feature space, thus preserving one of the main advantages of prototype-based methods. Our UNN boosting algorithm is an iterative procedure that learns the weights of the weak classifiers, called leveraging coefficients. We show that this algorithm converges to the global minimum of any chosen classification-calibrated surrogate (Bartlett et al, 2006). (A surrogate is a function which suitably upper-bounds another function, here the non-convex, non-differentiable empirical risk.) Hence, our framework handles the most popular losses in the machine learning literature: squared loss, exponential loss, logistic loss, etc. In particular, we prove a specific convergence rate for the exponential loss (reported in our experiments) far better than the general rate of Nock and Nielsen (2009). Another important characteristic of UNN is that it is able to discriminate the most relevant prototypes for a given class, thus allowing for significant data reduction while improving classification performance at the same time.

1.3 Overview of the paper

In the following sections we present our approach to k-NN boosting. Sections 2.1-2.3 present key definitions for k-NN boosting. These sections also describe how to replace the classic uniform k-NN rule by a leveraged k-NN rule. Leveraged k-NN classifiers are induced by the UNN algorithm, which is detailed in Sec. 2.4 for the case of the exponential risk. Sec. 2.5 presents the generic convergence theorem of UNN and the upper bound on performance for exponential risk minimization. Our experiments on both synthetic and image categorization datasets are reported in Sec. 3. Then, Sec. 4 discusses results and mentions future work. In order not to burden the body of the paper, the general form of the UNN algorithm and proof sketches of our theorems have been postponed to an appendix in Sec. 5.

2 Boosting k-NN

2.1 Definitions

We address multi-class, single-label image categorization. Hence, several categories of images are predefined, whereas each image is constrained to belong to a single category. The number of categories (or classes) may range from a few to hundreds, depending on the application. E.g., categorization with 67 indoor categories has been recently studied by Quattoni and Torralba (2009). We treat the multi-class problem as multiple binary classification problems, as is customary in machine learning.
I.e., for each class c, a query image is classified either to c or to c̄ (the complement class of c, which contains all classes but c) with a certain confidence (classification score). Then the label with the maximum score is assigned to the query. Images are represented by descriptors related to given local or global features. We refer to an image descriptor as an observation $o \in O$, which is a vector of n features and belongs to a domain O (e.g., $\mathbb{R}^n$ or $[0,1]^n$). A label is associated to each image descriptor according to a predefined set of C classes. Hence, an observation with the corresponding label leads to an example, which is the ordered pair $(o, y) \in O \times \mathbb{R}^C$, where y is termed the class vector that specifies the class memberships of o. In particular, the sign of $y_c$ gives the membership of example $(o, y)$ to class c, such that $y_c$ is negative iff the observation does not belong to class c, and positive otherwise. At the same time, the absolute value of $y_c$ may be interpreted as a relative confidence in the membership. Inspired by the multi-class boosting analysis of Zhu et al (2006), we constrain class vectors to be symmetric, that is:

$$\sum_{c=1}^{C} y_c = 0 . \qquad (1)$$

Hence, in the single-label framework, the class vector of an observation o belonging to class c̃ is defined as: $y_{\tilde c} = 1$, $y_{c \neq \tilde c} = -\frac{1}{C-1}$. This setting turns out to be necessary when treating multi-class classification as multiple binary classifications, as it balances negative and positive labels of a given example over all classes. We are given an input set of m examples $S = \{(o_i, y_i),\ i = 1, 2, \ldots, m\}$, arising from annotated images, which form the training set.

2.2 Boosting k-NN for minimization of surrogate risks

We aim at defining a one-versus-all classifier for each category, which is to be trained over the set of examples. This classifier is expected to correctly classify as many new observations as possible, i.e. to predict their true labels. Therefore, we aim at determining a classification rule h from the example dataset, which is able to minimize the classification error over all possible new observations. But since the underlying class probability densities are generally unknown and difficult to estimate, defining a classifier in the framework of supervised learning can be viewed as fitting a classification rule onto a training set S without overfitting. This corresponds to defining a classifier that correctly classifies most of the example data themselves, thus minimizing the classification error over the example dataset (empirical or true classification loss). Therefore, in the most basic framework of supervised classification, one wishes to train a classifier on S, i.e. build a function $h : O \rightarrow \mathbb{R}^C$, with the objective to minimize its empirical risk on S, defined as:

$$\varepsilon(h, S) \doteq \frac{1}{mC} \sum_{c=1}^{C} \sum_{i=1}^{m} \big[ \varrho(h, i, c) < 0 \big] , \qquad (2)$$

with [.] the indicator function (1 iff its argument is true, 0 otherwise), and:

$$\varrho(h, i, c) \doteq y_{ic}\, h_c(o_i) \qquad (3)$$

the edge of classifier h on example $(o_i, y_i)$ for class c. Taking the sign of $h_c$ in $\{-1, +1\}$ as its membership prediction for class c, one sees that when the edge is positive (resp. negative), the membership predicted by the classifier and the actual example's membership agree (resp. disagree). Therefore, (2) averages over all classes the number of mismatches for the membership predictions, thus measuring the goodness-of-fit of the classification rule on the training dataset.
Provided that the example dataset has good generalization properties with respect to the unknown distribution of possible observations, minimizing this empirical risk is expected to yield good accuracy when classifying unlabeled observations. Unfortunately, minimizing the empirical risk is mathematically not tractable, as it involves non-convex optimization. In order to bypass this cumbersome optimization challenge, the current trend of supervised learning (including boosting and support vector machines) has replaced the minimization of the empirical risk (2) by that of a so-called surrogate risk (Bartlett et al, 2006), to make the optimization problem amenable. In boosting, it amounts to summing (or averaging) over classes and examples a real-valued function called the surrogate loss, thus ending up with the following rewriting of (2):

$$\varepsilon_\psi(h, S) \doteq \frac{1}{mC} \sum_{c=1}^{C} \sum_{i=1}^{m} \psi(\varrho(h, i, c)) . \qquad (4)$$

Important choices available for ψ include:

$$\psi_{\mathrm{sqr}} \doteq (1 - x)^2 , \qquad (5)$$
$$\psi_{\exp} \doteq \exp(-x) , \qquad (6)$$
$$\psi_{\log} \doteq \log(1 + \exp(-x)) ; \qquad (7)$$

(5) is the squared loss (Bartlett et al, 2006), (6) is the exponential loss (Schapire and Singer, 1999), and (7) is the logistic loss (Bartlett et al, 2006).

Surrogates play a fundamental role in supervised learning. They are upper bounds of the empirical risk with desirable convexity properties. Their minimization remarkably impacts that of the empirical risk, thus making it possible to devise minimization algorithms with good generalization properties (Nock and Nielsen, 2009).

In this paper, we build on recent advances in boosting with surrogate risks to redefine the k-NN classification rule. In particular, we concentrate on the exponential risk and provide a novel algorithm that learns a leveraged k-NN classifier, while provably converging to the global optimum of a surrogate risk. Our algorithm, called UNN (Universal Nearest Neighbors), meets boosting-type convergence properties under two mild assumptions on the training set: a weak learning and a weak coverage property. In the Appendix, we also describe how the UNN algorithm generalizes to any surrogate loss, and provide the most general analysis.

2.3 Leveraged k-NN rule

In the following, we denote by $\mathrm{NN}_k(o_{i'})$ the set of the k nearest neighbors (with integer constant k > 0) of an example $(o_{i'}, y_{i'})$ in the set S, with respect to a non-negative real-valued "distance" function. This function is defined on the domain O and measures how much two observations differ from each other. This dissimilarity function thus may not necessarily satisfy the triangle inequality of metrics. (All experiments in this paper refer to nearest neighbors with respect to the Euclidean distance.) For the sake of readability, we let $i \sim_k i'$ denote an example $(o_i, y_i)$ that belongs to $\mathrm{NN}_k(o_{i'})$. This neighborhood relationship is intrinsically asymmetric, i.e., $i \sim_k i'$ does not necessarily imply $i' \sim_k i$. Indeed, a nearest neighbor of i' does not necessarily contain i' among its own nearest neighbors.

The k-nearest neighbors rule (k-NN) is the following multi-class classifier $h = \{h_c : c = 1, 2, \ldots, C\}$ (k appears in the summation indices):

$$h_c(o_{i'}) = \sum_{j \sim_k i'} [y_{jc} > 0] , \qquad (8)$$

where $h_c$ is the one-versus-all classifier for class c and the square brackets denote the indicator function. Hence, classic nearest neighbors classification is based on a majority vote among the k closest prototypes.
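As a concrete illustration of the definitions above, the following minimal Python sketch (ours, for illustration only; array and function names are not from any released code) builds the symmetric class vectors of Eq. (1) and evaluates the empirical risk (2) and the surrogate risks (4) from a hypothetical matrix of edges:

```python
import numpy as np

def symmetric_class_vectors(labels, num_classes):
    """Class vectors of Eq. (1): y_c = 1 for the true class, -1/(C-1) otherwise."""
    m = len(labels)
    Y = np.full((m, num_classes), -1.0 / (num_classes - 1))
    Y[np.arange(m), labels] = 1.0
    return Y  # each row sums to zero, as required by Eq. (1)

# Surrogate losses of Eqs. (5)-(7).
psi_sqr = lambda x: (1.0 - x) ** 2               # squared loss
psi_exp = lambda x: np.exp(-x)                   # exponential loss
psi_log = lambda x: np.log(1.0 + np.exp(-x))     # logistic loss

def empirical_risk(edges):
    """Eq. (2): fraction of negative edges rho(h, i, c) over all (example, class) pairs."""
    return (edges < 0).mean()

def surrogate_risk(edges, psi):
    """Eq. (4): the surrogate loss psi averaged over all (example, class) edges."""
    return psi(edges).mean()
```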
In this paper, we propose to weight the votes of the nearest neighbors by means of real coefficients, thus generalizing (8) to the following leveraged k-NN rule $h^\ell = \{h^\ell_c : c = 1, 2, \ldots, C\}$:

$$h^\ell_c(o_{i'}) = \sum_{j \sim_k i'} \alpha_{jc}\, y_{jc} , \qquad (9)$$

where $\alpha_{jc} \in \mathbb{R}$ is the leveraging coefficient of example j for class c, with $j = 1, 2, \ldots, m$ and $c = 1, 2, \ldots, C$. Hence, (9) linearly combines the class labels of the k nearest neighbors (defined in Sec. 2.1) with their leveraging coefficients.

The main contribution of our work is to define a general algorithm (UNN) for learning these leveraging coefficients from training data. This algorithm operates on top of classic k-NN methods, for it does not affect the nearest neighbors search when inducing the weak classifiers of (9). Indeed, it is independent of the way nearest neighbors are computed, unlike most of the approaches mentioned in Sec. 1.2, which rely on modifying the neighborhood relationship via metric distance deformations or kernel transformations. Our approach is thus fully compatible with any underlying (metric) distance and data structure for k-NN search, as well as with possible kernel transformations of the input space.

For a given training set S of m labeled examples, we define the k-NN edge matrix $R^{(c)} \in \mathbb{R}^{m \times m}$ for each class $c = 1, 2, \ldots, C$ (Nock and Nielsen, 2009):

$$r^{(c)}_{ij} \doteq \begin{cases} y_{ic}\, y_{jc} & \text{if } j \sim_k i \\ 0 & \text{otherwise} \end{cases} . \qquad (10)$$

The name of $R^{(c)}$ is justified by an immediate parallel with (3). Indeed, each example j serves as a classifier for each example i, predicting 0 if $j \notin \mathrm{NN}_k(o_i)$, and $y_{jc}$ otherwise, for the membership to class c. Hence, the j-th column of matrix $R^{(c)}$, $r^{(c)}_j$, which is different from 0 when choosing k > 0, collects all edges of "classifier" j for class c. Note that the non-zero entries of this column correspond to the so-called reciprocal nearest neighbors (Rk-NN) of j, i.e., those examples for which j is a neighbor (Fig. 1). It finally follows that the edge of the leveraged k-NN rule on example i for class c is:

$$\varrho(h^\ell, i, c) = \big(R^{(c)} \alpha^{(c)}\big)_i , \quad c = 1, 2, \ldots, C , \qquad (11)$$

where $\alpha^{(c)}$ collects all leveraging coefficients in vector form for class c: $\alpha^{(c)}_i \doteq \alpha_{ic}$, $i = 1, 2, \ldots, m$. The expression of the surrogate loss (4) can be written as follows after replacing the argument of ψ(·) in (4) by (11):

$$\varepsilon_\psi(h^\ell, S) \doteq \frac{1}{mC} \sum_{c=1}^{C} \sum_{i=1}^{m} \psi\Big(\sum_{j=1}^{m} r^{(c)}_{ij}\, \alpha_{jc}\Big) . \qquad (12)$$
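To make the leveraged rule (9) and the edge matrix (10) concrete, here is a minimal NumPy sketch (ours; it assumes a brute-force Euclidean nearest neighbors search and excludes each training example from its own neighborhood, a detail the text leaves implicit):

```python
import numpy as np

def knn_indices(X_train, query, k, exclude=None):
    """Indices of the k nearest training observations (Euclidean distance)."""
    d = np.linalg.norm(X_train - query, axis=1)
    if exclude is not None:
        d[exclude] = np.inf
    return np.argsort(d)[:k]

def leveraged_knn_scores(X_train, Y, alpha, query, k):
    """Leveraged k-NN rule of Eq. (9): h_c(query) = sum over neighbors j of alpha[j, c] * Y[j, c]."""
    nn = knn_indices(X_train, query, k)
    return (alpha[nn] * Y[nn]).sum(axis=0)   # one real-valued score per class

def edge_matrix(X_train, Y, k, c):
    """k-NN edge matrix R^(c) of Eq. (10): r_ij = Y[i, c] * Y[j, c] iff j ~_k i, else 0."""
    m = X_train.shape[0]
    R = np.zeros((m, m))
    for i in range(m):
        nn_i = knn_indices(X_train, X_train[i], k, exclude=i)
        R[i, nn_i] = Y[i, c] * Y[nn_i, c]
    return R
```

Setting alpha[j, c] = 1 whenever Y[j, c] > 0 (and 0 otherwise) recovers the uniform rule (8); UNN instead learns alpha from the training data, as described next.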
Fig. 1
A toy example of direct (left) and reciprocal (right) k-nearest neighbors (k = 1) of an example j. Squares and circles represent examples of the positive and negative classes. Each arrow connects an example to its 1-NN.

Therefore, fitting all the $\alpha_{jc}$'s so as to minimize the surrogate loss (12) is the main goal of our learning algorithm UNN for inducing the leveraged k-NN classifier $h^\ell$.

2.4 UNN: learning the $\alpha_{jc}$ of the leveraged k-NN classifier

We propose a novel classification algorithm which induces the leveraged nearest neighbors classifier $h^\ell$ (Eq. 9) in the multi-class one-versus-all framework. In this section, we explain UNN specialized for exponential risk minimization, with pseudo-code shown in Alg. 1. However, our analysis is much more general, as it involves the broad class of classification-calibrated surrogate risks (Bartlett et al, 2006), and is postponed to the Appendix in order not to burden the methodology. Like common boosting algorithms, UNN operates on a set of weights $w_i$ ($i = 1, 2, \ldots, m$) defined over the training data. Such weights are repeatedly updated to fit all leveraging coefficients $\alpha^{(c)}$ for class c ($c = 1, 2, \ldots, C$). At each iteration, the index to leverage, $j \in \{1, 2, \ldots, m\}$, is obtained by a call to a weak index chooser oracle WIC(., ., .), whose implementations [I.0.a] and [I.0.b] are detailed later on in this section.

The training phase is implemented in a one-versus-all fashion, i.e. C learning problems are solved independently, and for each class c the training examples are considered as belonging to either class c or the complement class c̄, i.e. any other class. Eventually, one leveraging coefficient ($\alpha_{jc}$) per class is learned for each weak classifier (indexed by j). In the Appendix, we show that Alg. 1 is a specialization of a very general classification algorithm, thus justifying the name "Universal Nearest Neighbors". In particular, Alg. 1 induces the leveraged k-NN classifier by minimizing the exponential surrogate risk (6), very much like regular boosting does for inducing a weighted voting rule over a set of weak classifiers.

The key observation when training weak classifiers with UNN is that, at each iteration, one single example (indexed by j) is considered as a prototype to be leveraged. Indeed, all the other training data are to be viewed as observations for which j may possibly vote. In particular, due to k-NN voting, j can be a classifier only for its reciprocal nearest neighbors (i.e., those data for which j itself is a neighbor, corresponding to non-zero entries of matrix (10) on column j). This leads to a remarkable simplification when computing $\delta_j$ in step [I.1] and updating the weights $w_i$ in step [I.2] (Eqs. 16, 17). Indeed, only the weights of the reciprocal nearest neighbors of j are involved in these computations, thus allowing us not to store the entire matrix $R^{(c)}$, $c = 1, 2, \ldots, C$. Note that the set of Rk-NN is split into two subsets, each containing examples that agree (resp. disagree) with the class membership of j, thus yielding the partial sums $w_j^+$ and $w_j^-$ of (15).

Note that when either $w_j^+$ or $w_j^-$ is zero, $\delta_j$ in (16) is not finite. There is however a simple alternative, inspired by Schapire and Singer (1999), which consists in smoothing out $\delta_j$ when necessary, thus guaranteeing its finiteness without impairing convergence. More precisely, we suggest the replacements:

$$w_j^+ \leftarrow w_j^+ + \frac{1}{m} , \qquad (13)$$
$$w_j^- \leftarrow w_j^- + \frac{1}{m} . \qquad (14)$$

Also note that step [I.0] relies on the oracle WIC(., ., .)
for selecting index j of the next weak classifier. We propose two alternative implementations of this oracle, as follows:

[I.0.a] a lazy approach: we set T = m and let j be chosen by WIC({1, 2, ..., m}, t, c) either (1) randomly, or (2) following the alphabetic order of classes;

[I.0.b] the boosting approach: we pick T ≥ m, and let j be chosen by WIC({1, 2, ..., m}, t, c) such that $\delta_j$ is large enough. Each j can be chosen more than once.

There are also schemes mixing [I.0.a] and [I.0.b]: for example, we may pick T = m and choose j as in [I.0.b], but exactly once as in [I.0.a].

2.5 Properties of UNN

In this section, we state two fundamental theorems for UNN. The first theorem reports a general monotonic convergence property of UNN to the optimal loss, for any given surrogate function. The second theorem further refines this general convergence result by providing an effective convergence bound for the exponential loss.

Theorem 1
As the number of iteration steps T increases, UNN converges to the $h^\ell$ realizing the global minimum of the surrogate risk at hand (4), for any ψ meeting conditions (i), (ii) and (iii) above. (Proof sketch in Appendix.)

Although we prove the boosting ability of UNN for all applicable surrogate losses, we choose to show in particular its behavior for the exponential loss $\psi_{\exp}$, which features a far better convergence bound than the general one of Nock and Nielsen (2009).

Algorithm 1: Universal Nearest Neighbors
UNN(S) for $\psi = \psi_{\exp}$
Input: $S = \{(o_i, y_i),\ i = 1, 2, \ldots, m,\ o_i \in O,\ y_i \in \{-\frac{1}{C-1}, 1\}^C\}$;
Let $r^{(c)}_{ij} \doteq y_{ic} y_{jc}$ if $j \sim_k i$, and $0$ otherwise, $\forall i, j = 1, 2, \ldots, m$, $c = 1, 2, \ldots, C$;
for $c = 1, 2, \ldots, C$ do
  Let $\alpha_{jc} \leftarrow 0$, $\forall j = 1, 2, \ldots, m$;
  Let $w_i \leftarrow 1$, $\forall i = 1, 2, \ldots, m$;
  for $t = 1, 2, \ldots, T$ do
    [I.0] Weak index chooser oracle: let $j \leftarrow \mathrm{WIC}(\{1, 2, \ldots, m\}, t)$;
    [I.1] Let
      $$w_j^+ = \sum_{i :\, r^{(c)}_{ij} > 0} w_i , \qquad w_j^- = \sum_{i :\, r^{(c)}_{ij} < 0} w_i , \qquad (15)$$
      $$\delta_j \leftarrow \frac{1}{2} \log\Big(\frac{w_j^+}{w_j^-}\Big) ; \qquad (16)$$
    [I.2] Let $w_i \leftarrow w_i \exp\big(-\delta_j r^{(c)}_{ij}\big)$, $\forall i : j \sim_k i$; \qquad (17)
    [I.3] Let $\alpha_{jc} \leftarrow \alpha_{jc} + \delta_j$;
Output: $h^\ell_c(o_{i'}) = \sum_{j \sim_k i'} \alpha_{jc}\, y_{jc}$, $\forall c = 1, 2, \ldots, C$

Computing this bound is based on defining a weak index assumption (WIA), which is to nearest neighbors what the conventional weak learning assumption is to general induced classifiers (Schapire and Singer, 1999):
(WIA) Let $p_j^{(c)} \doteq w_j^{(c)+} / \big(w_j^{(c)+} + w_j^{(c)-}\big)$. There exist some $\gamma > 0$ and $\eta > 0$ such that the following two inequalities hold for the index j returned by WIC(., ., .):

$$\big| p_j^{(c)} - 1/2 \big| \geq \gamma , \qquad (18)$$
$$\big( w_j^{(c)+} + w_j^{(c)-} \big) \big/ \|w\|_1 \geq \eta . \qquad (19)$$

Theorem 2
If the
WIA holds for τ ≤ T steps of UNN (for each c), then $\varepsilon(h^\ell, S) \leq \exp(-2\eta\gamma^2\tau)$. (Proof sketch in Appendix.)

Inequality (18) is the usual weak learning assumption (Schapire and Singer, 1999), when considering examples as weak classifiers. But a weak coverage assumption (19) is needed as well, because insufficient coverage of the reciprocal neighbors could easily wipe out even the surrogate risk reduction potentially due to a large γ. In addition, even when classes are significantly overlapping, choosing k not too small is enough for the WIA to be met for a large number of boosting rounds τ, thus determining a potentially harsh decrease of $\varepsilon(h^\ell, S)$. This is important, as there are at most m different weak classifiers available to WIC(., ., .), even when each one may be chosen more than once under the WIA. Last but not least, Theorem 2 also displays the fact that classification (18) may be more important than coverage (19).
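To fix ideas, the following self-contained Python sketch (ours, not the authors' code) implements Alg. 1 for the exponential loss, with a brute-force Euclidean k-NN search, the smoothing of Eqs. (13)-(14), and the lazy oracle [I.0.a] sweeping each prototype once per class:

```python
import numpy as np

def unn_exponential(X, Y, k, n_rounds=None):
    """Universal Nearest Neighbors (Alg. 1) for psi = exp.
    X: (m, d) training observations; Y: (m, C) symmetric class vectors.
    Returns the (m, C) matrix of leveraging coefficients alpha."""
    m, C = Y.shape
    n_rounds = m if n_rounds is None else n_rounds
    # Reciprocal-neighbor lists: rnn[j] = indices i such that j is among the k-NN of i.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)          # assumption: an example is not its own neighbor
    nn = np.argsort(D, axis=1)[:, :k]    # nn[i] = k nearest neighbors of example i
    rnn = [np.where((nn == j).any(axis=1))[0] for j in range(m)]
    alpha = np.zeros((m, C))
    for c in range(C):
        w = np.ones(m)                               # w_i <- 1 (exponential loss)
        for t in range(n_rounds):
            j = t % m                                # lazy oracle [I.0.a]
            i_r = rnn[j]
            r = Y[i_r, c] * Y[j, c]                  # edges r_ij of Eq. (10)
            w_plus = w[i_r][r > 0].sum() + 1.0 / m   # smoothing, Eq. (13)
            w_minus = w[i_r][r < 0].sum() + 1.0 / m  # smoothing, Eq. (14)
            delta = 0.5 * np.log(w_plus / w_minus)   # Eq. (16)
            w[i_r] *= np.exp(-delta * r)             # Eq. (17)
            alpha[j, c] += delta                     # step [I.3]
    return alpha
```

A query is then classified by the leveraged vote of Eq. (9) using the learned alpha, e.g. via the leveraged_knn_scores sketch given earlier.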
3 Experiments

In this section, we present experimental results of UNN vs. plain k-NN on both synthetic and real datasets. Such experiments allowed us to quantify the gains brought by boosting on nearest neighbors voting (Marin et al, 2009). For this purpose, we first performed tests on two-class synthetic data to drill down into the performances of UNN (Sec. 3.1). In Sec. 3.2 we discuss the data reduction ability of our technique. Then, we carried out experiments of multi-class scene categorization on a dataset of natural images and compared the results of UNN to plain k-NN classification (Sec. 3.3).

3.1 Synthetic datasets

We have drilled down into the experimental behavior of UNN using the synthetic Ripley's dataset (Ripley, 1994), with two classes denoted by P and N. Each population of this dataset is an equal mixture of two two-dimensional normally distributed populations, which are equally likely. The training and test datasets (consisting of 250 and 1000 points, respectively) are shown in Figure 2, where the optimal classification boundary of the Bayes rule is also displayed. This corresponds to the best theoretical error rate of 8.0% (Ripley, 1994).

Fig. 3 validates on this dataset the monotonic decay of the exponential risk (6), mathematically proved in Theorem 2 under the two basic weak index/learning assumptions. It also shows the effect of three different implementations of the WIC oracle (Sec. 2.5). Note that the boosting approach for selecting weak classifiers provides a much faster decay of the surrogate risk, thus outperforming the two tested "lazy" implementations. In these latter cases, the index j of the weak classifier at each UNN iteration was chosen either randomly or following the order of examples in their respective categories.

Classification results for a range of values of k are shown in Fig. 4. They enable us to draw two main conclusions. First, test errors display a robustness of UNN against variations of k. Second, filtering out even a large proportion 1 − θ of the examples with the smallest $\|\alpha_{j\cdot}\|$ does not degrade classification performance, and can even significantly improve it. As witnessed by Fig. 4, values as small as θ = 0.25 yield improvements that make the test error close to Bayes'. (E.g., see the minimum error of boosted k-NN for θ = 0.25, k = 9.) We investigate such a data reduction ability of UNN in the following section.

Fig. 2
Training and validation data for the Ripley's dataset. The Bayes boundary is also drawn, as reported in (Ripley, 1994).
Fig. 3
Decrease of $\varepsilon_{\psi_{\exp}}(h^\ell, S)$ as a function of T in UNN for the Ripley's dataset, for different oracle implementations (WIC = boosting, random order, alphabetic order). Note that the boosting implementation ([I.0.b], Sec. 2.4) always guarantees monotonic decrease of the surrogate loss, until the weak assumptions are matched (red curve). Conversely, the lazy implementation ([I.0.a], Sec. 2.4) may select, at a given step, a classifier that does not match those assumptions, thus preventing the loss from strictly decreasing (see green and blue curves).

Fig. 4
Test error for UNN as a function of k for boosted k-NN (k-NN, and UNN with θ = 1, 0.75, 0.5, 0.25). The Bayes rule yields an 8% optimal misclassification rate.

3.2 Data reduction

Filtering out the least relevant prototypes has two main beneficial effects. The first is a margin effect, well known for induced classifiers (Schapire et al, 1998). The goodness-of-fit of the k-NN rule is driven by the most accurate examples, i.e. those surrounded by examples of the same class, which get the largest $\|\alpha_{j\cdot}\|$. The least accurate ones, e.g. those located in overlapping regions between two classes, get the smallest. Discarding these latter examples tends to increase a gap between class clouds, but each cloud may shelter examples of different classes. Fortunately, filtering with boosting is accompanied by a subtle local repolarization of predictions which, as explained in Figure 5 for the smallest value of θ, makes this gap maximization translate to margin maximization, for which positive effects on learning are known (Schapire et al, 1998). The second effect is structural: in nearest neighbors rules, the frontier between classes stems from the Voronoi cells of those least accurate examples. Discarding them separates the classes better, as witnessed by Fig. 5. Above all, it reduces the number of Voronoi cells involved in the class frontiers, thus reducing structural parameters (VC-dimension) of the classifier, possibly buying a reduction of the test error as well (Schapire et al, 1998).

3.3 Image Categorization

We tested our k-NN boosting algorithm for image categorization. In particular, we used the global Gist descriptor of Oliva and Torralba (2001) in order to obtain a meaningful representation of images. This descriptor provides a global representation of a scene, while not requiring explicit segmentation of image regions and objects. In the typical setting, an image is represented by a single vector of dimension 512, which collects features related to the spatial organization of dominant scales and orientations in the image. This correspondence between images and descriptors is one of the main advantages of using global descriptors over representations based on bags of local features (Grauman and Darrell, 2005). Indeed, global descriptors are straightforwardly adapted to image categorization methods relying on machine learning techniques, as most of these techniques, from prototype-based to kernel-based, require any instance of a particular category to be represented by a single vector. In particular, this is the case of k-NN classification, which explicitly relies on measuring one-to-one similarity between a query image and prototype images. In addition, Gist descriptors have proven successful in representing relevant contextual information of natural scenes, which allows computing meaningful priors for exploration tasks like object detection and localization (Rubin et al, 2003).

The dataset we used contains 2688 color images of outdoor scenes of size 256x256 pixels, divided into 8 categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways. One example image of each category is shown in Fig. 6.

To extract global descriptors from these images we used the Matlab implementation by Torralba (publicly available at http://people.csail.mit.edu/torralba/code/spatialenvelope/sceneRecognition.m), with the most common settings: 4 resolution levels of the Gabor pyramid, 8 orientations per scale and 4 × 4 blocks.

We used this database to validate UNN for different values of k. In particular, we concentrated on evaluating classification performances when filtering the prototype dataset, i.e. retaining a proportion θ of the most relevant examples as prototypes for classification.
Such a data reduction capability is one of the most interesting properties of UNN, as it favourably impacts the computational cost of classification, which grows at least logarithmically (at most linearly) with the dataset size. Indeed, classification roughly amounts to searching for the k nearest neighbors among the prototypes, which is O(kdθm) for linear exhaustive search and O(kd log(θm)) for fast kD-tree based search (Arya et al, 1998) (d being the dimension of the feature vectors, θ the proportion of retained prototypes).

Fig. 7 shows results of 3-fold cross-validation in terms of the mean Average Precision (mAP) as a function of θ, for different values of k. (The mAP was computed by averaging classification rates over categories, i.e., the diagonal of the confusion matrix, and then averaging those values over the 3 cross-validation folds (Oliva and Torralba, 2001).) Indeed, we randomly split the database into 3 distinct subsets, each containing 896 images. Then, for each fold, we used one of these subsets as the training set, while validating on the two remaining subsets. In each experiment, UNN was run over the training set and a subset of the trained weak classifiers was retained as prototypes for classifying the test images. In particular, we selected all training images j with leveraging coefficients $\alpha_{jc}$, $c = 1, 2, \ldots, C$, such that $\alpha_{jc} > \tilde\alpha > 0$. Note from Fig. 7 that, even when fixing the threshold $\tilde\alpha$ so as to retain all the examples, the actual proportion θ of prototypes is less than one, because UNN always discards the examples with null leveraging coefficients, which do not match assumptions (18, 19).

We compared UNN with classic k-NN classification. Namely, in order for the classification cost of k-NN to be roughly the same as that of UNN, we carried out random sampling of the prototype dataset for selecting proportion θ (between 10% and the whole set of examples). UNN significantly outperforms classic k-NN, increasingly so with k, as shown in Fig. 8(a).
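As an illustration of this filtering step, here is a short sketch (ours; alphas stands for the m × C matrix of leveraging coefficients learned by UNN, and the "keep if any class coefficient passes the threshold" criterion is a simplification of the per-class selection described above):

```python
import numpy as np

def filter_prototypes(X_train, Y, alphas, alpha_min=0.0):
    """Retain prototypes j such that alpha[j, c] > alpha_min for at least one class c.
    Examples with only null (or too small) leveraging coefficients are discarded."""
    keep = (alphas > alpha_min).any(axis=1)
    theta = keep.mean()                      # actual proportion of retained prototypes
    return X_train[keep], Y[keep], alphas[keep], theta
```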
Fig. 5
Maps of positive/negative leveraging coefficients $\alpha_j$ over training data for k = 3 and three different values of θ. Examples of class N with negative $\alpha_j$ (filled squares) and those of class P with positive $\alpha_j$ (empty circles) predict class P; similarly, empty squares and filled circles both correspond to membership prediction in N. For this reason, for the smallest value of θ, filtering produces a clear-cut gap between the two possible membership predictions (but not between the original classes). The optimal Bayes boundary between classes is shown as well. Interestingly, while this frontier still does not separate the original classes (without error), it does separate the membership predictions, with a much larger minimal margin. The combination of the data reduction and polarity reversal for memberships has thus simplified the learning of S, and eased the capture of the optimal frontier with nearest neighbors.

Fig. 6
Examples of annotated images of the database containing 2688 images classified into 8 categories (coast, forest, highway, inside city, mountain, open country, street, tall buildings).
Image categorization results confirm the trend observed on the synthetic data when filtering the prototype dataset. Hence, selecting a reduced set of prototypes limits overfitting on the training data, while improving classification performance on the test set (typically a 3% improvement). Most interestingly, the classification precision of UNN is very stable as a function of θ, as shown in Fig. 8(b), where the drop of UNN precision for the largest values of θ is due to including prototypes with negative leveraging coefficients as well. To summarize, UNN displays the ability to discriminate the most relevant images of each class, thus inducing a classification rule robust to "noisy" prototypes arising from low inter-class variations. Adjusting the value of the threshold $\tilde\alpha$ enables one to remove those confusing prototypes, thus reducing the representation of each category to a sparse subset of meaningful prototype images.

Fig. 7
Classification performances of UNN compared to k-NN in 3-fold cross-validation (mAP as a function of θ, for k = 5, 9, 13, 17).

Fig. 8
Performances of k-NN and UNN classification as a function of (a) k and (b) θ. (The best results obtained with each of the two methods are plotted.)

Fig. 10 shows two examples of how the leveraged k-NN rule may correct misclassifications due to the uniform k-NN voting. E.g., in the first example, the classic and the boosted k-NN methods are compared when classifying an image belonging to class coast, with k = 11. The leveraged rule with as few as 20% of the prototype images is able to correctly label the query image (first row). Below each nearest neighbor image we show its contribution to the classifier of (9): note that negative votes are significantly smaller than positive ones (up to an order of magnitude), thus determining positive labeling with a high prediction score $h^\ell_c$, according to (9). On the contrary, the uniform voting rule with all prototypes misclassifies the test image, not being able to reject contributions by "noisy" neighbor images. An example of prototypes selected by filtering the dataset is shown in Fig. 11, where the leveraging coefficients refer to the first category (tall buildings) versus the remaining ones.

4 Conclusion

In this paper, we contribute to fill an important void of NN methods, showing how boosting can be transferred to k-NN classification. Namely, we propose a novel boosting algorithm, UNN (Universal Nearest Neighbors), for inducing a leveraged k-NN rule. This rule generalizes classic k-NN to weighted voting, where the weights, the so-called leveraging coefficients, are iteratively learned by UNN. We prove that this algorithm converges to the global optimum of surrogate risks under very mild assumptions.

Experiments on both synthetic and image categorization databases show that UNN provides significant performance improvements (up to the best possible performance of the Bayes rule). Moreover, UNN exhibits a consistent data reduction ability, which results in significant speed-ups for classification (up to a factor of 16 when removing 3/4 of the coefficients).

Our approach is built on top of k-NN search, thus being fully compatible with existing techniques relying on metric distance learning (Zhang et al, 2006) as well as subspace projections like PCA (Jain, 2008) or kernel transformations of the input space, which are expected to enable significant improvements of categorization performance.

References
Amores J, Sebe N, Radeva P (2006) Boosting the distance estimation: Application to the k-nearest neighbor classifier. Pattern Recognition Letters 27(3):201–209
Arya S, Mount DM, Netanyahu NS, Silverman R, Wu AY (1998) An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM 45(6):891–923
Bartlett P, Jordan M, McAuliffe JD (2006) Convexity, classification, and risk bounds. Journal of the American Statistical Association 101:138–156
Brighton H, Mellish C (2002) Advances in instance selection for instance-based learning algorithms. Data Mining and Knowledge Discovery 6:153–172
Dudani S (1976) The distance-weighted k-nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics 6(4):325–327
Fukunaga K, Flick T (1984) An optimal global nearest neighbor metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(3):314–318
García-Pedrajas N, Ortiz-Boyer D (2009) Boosting k-nearest neighbor classifier by means of input space projection. Expert Systems with Applications 36(7):10570–10582
Grauman K, Darrell T (2005) The pyramid match kernel: Discriminative classification with sets of image features. In: IEEE International Conference on Computer Vision, Beijing, China
Hart PE (1968) The Condensed Nearest Neighbor rule. IEEE Transactions on Information Theory 14:515–516
Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(6):607–616
Holmes CC, Adams NM (2003) Likelihood inference in nearest-neighbour classification models. Biometrika 90:99–112
Jain AK (2008) Data clustering: 50 years beyond k-means. In: ECML PKDD '08: Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I, Springer-Verlag, Berlin, Heidelberg, pp 3–4
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2):91–110
Marin JM, Robert CP, Titterington DM (2009) A Bayesian reassessment of nearest-neighbor classification. Journal of the American Statistical Association
Nock R, Nielsen F (2009) On the efficient minimization of classification calibrated surrogates. In: Koller D, Schuurmans D, Bengio Y, Bottou L (eds) Advances in Neural Information Processing Systems 21, MIT Press, pp 1201–1208
Oliva A, Torralba A (2001) Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42(3):145–175
Paredes R (2006) Learning weighted metrics to minimize nearest-neighbor classification error. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(7):1100–1110
Quattoni A, Torralba A (2009) Recognizing indoor scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Ripley B (1994) Neural networks and related methods for classification. Journal of the Royal Statistical Society Series B 56:409–456
Rubin M, Freeman W, Murphy K, Torralba A (2003) Context-based vision system for place and object recognition. In: MIT AIM
Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Machine Learning Journal 37:297–336
Schapire RE, Freund Y, Bartlett P, Lee WS (1998) Boosting the margin: a new explanation for the effectiveness of voting methods. Annals of Statistics 26:1651–1686
Shakhnarovich G, Darrell T, Indyk P (2006) Nearest-Neighbors Methods in Learning and Vision. MIT Press
Yu K, Ji L, Zhang X (2002) Kernel nearest-neighbor algorithm. Neural Processing Letters 15(2):147–156
Zhang H, Berg AC, Maire M, Malik J (2006) SVM-KNN: Discriminative nearest neighbor classification for visual category recognition. In: CVPR '06: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Washington, DC, USA, pp 2126–2136
Zhu J, Rosset S, Zou H, Hastie T (2006) Multi-class AdaBoost. Tech. rep., Department of Statistics, University of Michigan, Ann Arbor, MI 48109
Zuo W, Zhang D, Wang K (2008) On kernel difference-weighted k-nearest neighbor classification. Pattern Analysis and Applications 11(3-4):247–257

5 Appendix

Generic UNN algorithm
The general version of UNN is shown in Alg. 2. This algorithm induces the leveraged k-NN rule (9) for the broad class of surrogate losses meeting the conditions of Bartlett et al (2006), thus generalizing Alg. 1. Namely, we constrain ψ to meet the following conditions: (i) $\mathrm{im}(\psi) = \mathbb{R}_+$, (ii) $\nabla\psi(0) < 0$ ($\nabla\psi$ is the conventional derivative of the loss function ψ), and (iii) ψ is strictly convex and differentiable. (i) and (ii) imply that ψ is classification-calibrated: its local minimization is roughly tied up to that of the empirical risk (Bartlett et al, 2006). (iii) implies convenient algorithmic properties for the minimization of the surrogate risk (Nock and Nielsen, 2009). Three common examples have been shown in Eqs. (5)-(7).

The main bottleneck of UNN is step [I.1], as Eq. (21) is non-linear, but it always has a solution, finite under mild assumptions (Nock and Nielsen, 2009): in our case, $\delta_j$ is guaranteed to be finite when there is no total matching or mismatching of example j's memberships with its reciprocal neighbors', for the class at hand. The second column of Table 1 contains the solutions to (21) for the surrogate losses mentioned in Sec. 2.2. Those solutions are always exact for the exponential loss ($\psi_{\exp}$) and the squared loss ($\psi_{\mathrm{squ}}$); for the logistic loss ($\psi_{\log}$) the solution is exact when the weights in the reciprocal neighborhood of j are the same, and approximate otherwise. Since the starting weights are all the same, exactness can be guaranteed during a large number of inner rounds, depending on the order in which the examples are chosen. Table 1 helps to formalize the finiteness condition on $\delta_j$ mentioned above: when either sum of weights in (20) is zero, the solutions in the first and third lines of Table 1 are not finite. A simple strategy to cope with numerical problems arising from such situations is the one proposed by Schapire and Singer (1999) (see Sec. 2.4). Table 1 also shows how the weight update rule (22) specializes for the mentioned losses.
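As an illustration of step [I.1] of Alg. 2, a minimal numerical sketch (ours; grad_psi and grad_psi_inv stand for ∇ψ and its inverse, to be supplied for the chosen loss) solves (21) by bisection, exploiting the fact that its left-hand side is non-decreasing in $\delta_j$ for a strictly convex ψ:

```python
import numpy as np

def solve_delta(r_j, w, grad_psi, grad_psi_inv, span=50.0, iters=100):
    """Solve Eq. (21) for delta_j by bisection.
    r_j: column j of R^(c); w: current weights.
    F(delta) = sum_i r_ij * grad_psi(delta * r_ij + grad_psi_inv(-w_i))
    is non-decreasing in delta because psi is strictly convex."""
    base = grad_psi_inv(-w)
    F = lambda delta: np.sum(r_j * grad_psi(delta * r_j + base))
    lo, hi = -span, span               # assumption: the root lies in [-span, span]
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if F(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Exponential loss: grad_psi(x) = -exp(-x), grad_psi_inv(u) = -log(-u).
delta_exp = lambda r_j, w: solve_delta(r_j, w,
                                       grad_psi=lambda x: -np.exp(-x),
                                       grad_psi_inv=lambda u: -np.log(-u))
```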
Proof sketch of Theorem 1

We show that UNN converges to the global optimum of any surrogate risk (Sec. 2.2). So, let us consider the surrogate risk (4) for any fixed class $c = 1, 2, \ldots, C$:
Algorithm 2: Universal Nearest Neighbors
UNN(S, ψ)
Input: $S = \{(o_i, y_i),\ i = 1, 2, \ldots, m,\ o_i \in O,\ y_i \in \{-\frac{1}{C-1}, 1\}^C\}$, with ψ meeting (i), (ii), (iii) (Sec. 5);
Let $r^{(c)}_{ij} \doteq y_{ic} y_{jc}$ if $j \sim_k i$, and $0$ otherwise, $\forall i, j = 1, 2, \ldots, m$, $c = 1, 2, \ldots, C$;
for $c = 1, 2, \ldots, C$ do
  Let $\alpha_{jc} \leftarrow 0$, $\forall j = 1, 2, \ldots, m$;
  Let $w_i \leftarrow -\nabla\psi(0) > 0$, $\forall i = 1, 2, \ldots, m$;
  for $t = 1, 2, \ldots, T$ do
    [I.0] Let $j \leftarrow \mathrm{WIC}(\{1, 2, \ldots, m\}, t)$;
    [I.1] Let
      $$w_j^+ = \sum_{i :\, r^{(c)}_{ij} > 0} w_i , \qquad w_j^- = \sum_{i :\, r^{(c)}_{ij} < 0} w_i , \qquad (20)$$
    [I.1] Let $\delta_j \in \mathbb{R}$ be a solution of:
      $$\sum_{i=1}^{m} r^{(c)}_{ij}\, \nabla\psi\Big(\delta_j r^{(c)}_{ij} + \nabla^{-1}\psi(-w_i)\Big) = 0 ; \qquad (21)$$
    [I.2] $\forall i : j \sim_k i$, let
      $$w_i \leftarrow -\nabla\psi\Big(\delta_j r^{(c)}_{ij} + \nabla^{-1}\psi(-w_i)\Big) . \qquad (22)$$
    [I.3] Let $\alpha_{jc} \leftarrow \alpha_{jc} + \delta_j$;
Output: $h^\ell_c(o_{i'}) = \sum_{j \sim_k i'} \alpha_{jc}\, y_{jc}$, $\forall c = 1, 2, \ldots, C$

Fig. 9
A geometric view of how UNN converges to the global optimum of (4). (See Appendix for details and notations.)

$$\varepsilon^c_\psi(h, S) \doteq \frac{1}{m} \sum_{i=1}^{m} \psi(\varrho(h, i, c)) . \qquad (23)$$

Let $w_t$ denote the t-th weight vector inside the "for c" loop of Alg. 2 (assuming $w_0$ is the initialization of w); similarly, $h^\ell_t$ denotes the t-th leveraged k-NN rule obtained after the update in [I.3]. The following identity holds, whose proof follows from Nock and Nielsen (2009):

$$\psi(\varrho(h^\ell_t, i, c)) = g + D_{\tilde\psi}(0 \,\|\, w_{ti}) , \qquad (24)$$

Table 1
Three common loss functions and the corresponding solutions $\delta_j$ of (21) and $w_i$ of (22). (Vector $r^{(c)}_j$ designates column j of $R^{(c)}$ and $\|.\|_2$ is the L2 norm.) The rightmost column says whether the expression is (A)lways the solution, or whether it is so when the weights of the reciprocal neighbors of j are the (S)ame.

loss function | $\delta_j$ in (21) | $w_i$ in (22) | Opt
$\psi_{\exp} \doteq \exp(-x)$ | $\frac{1}{2}\log\big(w_j^{(c)+}/w_j^{(c)-}\big)$ | $w_i \exp\big(-\delta_j r^{(c)}_{ij}\big)$ | A
$\psi_{\mathrm{squ}} \doteq (1-x)^2$ | $\big(w_j^{(c)+} - w_j^{(c)-}\big) / \|r^{(c)}_j\|_2^2$ | $w_i - \delta_j r^{(c)}_{ij}$ | A
$\psi_{\log} \doteq \log(1+\exp(-x))$ | $\log\big(w_j^{(c)+}/w_j^{(c)-}\big)$ | $w_i \exp\big(-\delta_j r^{(c)}_{ij}\big) \big/ \big(1 - w_i\big(1 - \exp(-\delta_j r^{(c)}_{ij})\big)\big)$ | S

where $g(m) \doteq -\tilde\psi(0)$ does not depend on the k-NN rule. Eq. (24) makes the connection between the real-valued classification problem and a geometric problem in the non-metric space of weights. Here, we have made use of the following notations: $\tilde\psi(x) \doteq \psi^\star(-x)$, where $\psi^\star(x) \doteq x \nabla^{-1}\psi(x) - \psi(\nabla^{-1}\psi(x))$ is the Legendre conjugate of ψ; $D_{\tilde\psi}(w_i \| w'_i) \doteq \tilde\psi(w_i) - \tilde\psi(w'_i) - (w_i - w'_i)\nabla\tilde\psi(w'_i)$ is the Bregman divergence with generator $\tilde\psi$ (Nock and Nielsen, 2009). $\psi^\star$ is related to ψ in such a way that $\nabla\tilde\psi(x) = -\nabla^{-1}\psi(-x)$. Eq. (24) proves handy as one computes the difference $\varepsilon^c_\psi(h^\ell_{t+1}, S) - \varepsilon^c_\psi(h^\ell_t, S)$. Indeed, using (24) in (23), and computing $\delta_j$ in (21) so as to bring $h^\ell_{t+1}$ from $h^\ell_t$, we obtain:

$$\varepsilon^c_\psi(h^\ell_{t+1}, S) - \varepsilon^c_\psi(h^\ell_t, S) = -\frac{1}{m}\sum_{i=1}^m D_{\tilde\psi}\big(w_i^{(t+1)} \,\|\, w_i^t\big) . \qquad (25)$$

Since Bregman divergences are non-negative and meet the identity of the indiscernibles, (25) implies that steps [I.1]-[I.3] guarantee the decrease of (23) as long as $\delta_j \neq 0$. But (23) is lower-bounded, hence UNN must converge. In addition, it converges to the global optimum of (23). Since predictions for each class are independent, the proof consists in showing that (23) converges to its global minimum for each c. Assume this convergence for the current class c. Then, following Nock and Nielsen (2009), (21) and (22) imply that, when any possible $\delta_j = 0$, the weight vector, say $w^\infty$, satisfies $R^{(c)\top} w^\infty = \mathbf{0}$, i.e., $w^\infty \in \ker R^{(c)\top}$, and $w^\infty$ is unique. But the kernel of $R^{(c)\top}$ and $\overline{W}$, the closure of W, are provably Bregman-orthogonal (Nock and Nielsen, 2009), thus yielding:

$$\underbrace{\sum_{i=1}^m D_{\tilde\psi}(0 \,\|\, w_i)}_{m\varepsilon^c_\psi(h^\ell, S) - mg} = \underbrace{\sum_{i=1}^m D_{\tilde\psi}(0 \,\|\, w^\infty_i)}_{m\varepsilon^c_\psi(h^{\ell\infty}, S) - mg} + \underbrace{\sum_{i=1}^m D_{\tilde\psi}(w^\infty_i \,\|\, w_i)}_{\geq 0} , \quad \forall w \in W . \qquad (26)$$

Underbraces use (24) in (23), and $h^\ell$ is a leveraged k-NN rule corresponding to w. One obtains that $h^{\ell\infty}$ achieves the global minimum of (23), as claimed.

The proof sketch is graphically summarized in Figure 9. In particular, two crucial Bregman orthogonalities are mentioned (Nock and Nielsen, 2009). The red one symbolizes:

$$\sum_{i=1}^m D_{\tilde\psi}(0 \,\|\, w^t_i) = \sum_{i=1}^m D_{\tilde\psi}\big(0 \,\|\, w^{(t+1)}_i\big) + \sum_{i=1}^m D_{\tilde\psi}\big(w^{(t+1)}_i \,\|\, w^t_i\big) , \qquad (27)$$

which is equivalent to (25). The black one, on $w^\infty$, is (26).
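For concreteness, the following short derivation (ours, under the simplifying assumption that the edges $r^{(c)}_{ij}$ take values in $\{-1, 0, +1\}$) shows how the generic condition (21) specializes to the closed form in the first line of Table 1, i.e. Eq. (16), for the exponential loss. For $\psi_{\exp}(x) = e^{-x}$ one has $\nabla\psi(x) = -e^{-x}$ and $\nabla^{-1}\psi(-w_i) = -\log w_i$, so that $\nabla\psi\big(\delta_j r^{(c)}_{ij} + \nabla^{-1}\psi(-w_i)\big) = -w_i\, e^{-\delta_j r^{(c)}_{ij}}$. Plugging this into (21) yields

$$\sum_{i=1}^{m} r^{(c)}_{ij}\, w_i\, e^{-\delta_j r^{(c)}_{ij}} = 0 \;\Longleftrightarrow\; e^{-\delta_j}\!\!\sum_{i:\, r^{(c)}_{ij}>0}\!\! w_i = e^{\delta_j}\!\!\sum_{i:\, r^{(c)}_{ij}<0}\!\! w_i \;\Longleftrightarrow\; \delta_j = \frac{1}{2}\log\frac{w_j^{+}}{w_j^{-}} ,$$

which is exactly (16).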
Proof sketch of Theorem 2

Using developments analogous to those of Nock and Nielsen (2009), UNN can be shown to be equivalent to AdaBoost in which m weak classifiers are available, each one being an example. Each weak classifier returns a value in $\{-1, 0, +1\}$, where 0 is reserved for examples outside the reciprocal neighborhood. Theorem 3 of Schapire and Singer (1999) brings in our case:

$$\varepsilon(h^\ell, S) \leq \frac{1}{C}\sum_{c=1}^{C}\prod_{t=1}^{T} Z^{(c)}_t , \qquad (28)$$

where $Z^{(c)}_t \doteq \sum_{i=1}^{m} \tilde w^{(c)}_{it}$ is the normalizing coefficient for each weight vector in UNN. ($\tilde w^{(c)}_{it}$ denotes the weight of example i at iteration (t, c) of UNN, and the tilde notation refers to weights normalized to unity at each step.) It follows that:

$$Z^{(c)}_t = 1 - \tilde w^{(c)+-}_{jt}\Big(1 - 2\sqrt{p^{(c)}_{jt}\big(1 - p^{(c)}_{jt}\big)}\Big) \leq \exp\Big(-\tilde w^{(c)+-}_{jt}\Big(1 - 2\sqrt{p^{(c)}_{jt}\big(1 - p^{(c)}_{jt}\big)}\Big)\Big) \leq \exp\Big(-\eta\big(1 - \sqrt{1 - 4\gamma^2}\big)\Big) \leq \exp(-2\eta\gamma^2) ,$$

where $\tilde w^{(c)+-}_{jt} \doteq \tilde w^{(c)+}_{jt} + \tilde w^{(c)-}_{jt}$ and $p^{(c)}_{jt} \doteq \tilde w^{(c)+}_{jt} / \tilde w^{(c)+-}_{jt} = w^{(c)+}_{jt} / w^{(c)+-}_{jt}$. The first inequality uses $1 - x \leq \exp(-x)$, and the second the WIA. Since, even when the
WIA does not hold, we still observe $Z^{(c)}_t \leq 1$, plugging the last inequality into (28) yields the statement of the theorem.

Fig. 10
Two examples where UNN corrects misclassifications of k-NN. The query image is shown in the leftmost column. The 11 nearest prototype images are shown on the right: the first row refers to UNN with 20% of retained prototypes (θ = 0.2), whereas the second row refers to classic k-NN classification over all prototypes (θ = 1). Neighbors in the same category as the query image are surrounded by black boxes. Votes given by each prototype for the true category (coast, in the first example) are shown below each image (such values correspond to $\alpha_{jc} y_{jc}$ in (9), where c is the ground-truth category). In the examples shown, UNN predicts the correct categories (coast and tall buildings), whereas uniform k-NN predicts highway and open country, respectively.

Fig. 11
Examples of image prototypes with their leveraging coefficients for category 1 (tall buildings) versus the remaining ones.