Nonparametric adaptive active learning under local smoothness condition
Boris Ndjia Njike, Xavier Siebert
Université de Mons, Faculté polytechnique, Département de Mathématique et Recherche opérationnelle. e-mail: [email protected], e-mail: [email protected]
Abstract
Active learning is typically used to label data when the labeling process is expensive. Several active learning algorithms have been theoretically proved to perform better than their passive counterpart. However, these algorithms rely on some assumptions, which themselves contain some specific parameters. This paper addresses the problem of adaptive active learning in a nonparametric setting with minimal assumptions. We present a novel algorithm that is valid under more general assumptions than the previously known algorithms, and that can moreover adapt to the parameters used in these assumptions. This allows us to work with a larger class of distributions, thereby avoiding to exclude important densities like Gaussians. Our algorithm achieves a minimax rate of convergence, and therefore performs almost as well as the best known non-adaptive algorithms.
The paradigm of passive learning consists in providing a classifier based on labelled data, independently and identically distributed from a large pool of data. Due to a huge increase in the volume of the data available, we are sometimes constrained, from the point of view of the process of labeling data only, to look beyond standard passive learning. In this context, one of the most studied techniques is active learning, where the algorithm is presented with a large unlabelled pool of data and can iteratively request, at a certain cost, the label Y ∈ {0, 1} of an instance X ∈ R^d from the pool. We are constrained to use at most a budget of n requests to a so-called oracle. The goal is to use this interaction to drastically reduce the number of labels needed to provide a classifier whose excess error is as small as possible.

Over the past decade, there has been a large body of work aiming at understanding theoretically the benefits and limits of active learning over passive learning [1, 2]. One of the seminal works, due to Castro and Nowak [1], analyzed various scenarios and provided one in which active learning outperforms passive learning. This situation corresponds to a common assumption called the Tsybakov noise assumption [3], which characterizes the noise near the decision boundary. Together with a smoothness assumption related to the decision boundary, they provided an active learning strategy that is better than passive learning, in the sense that it uses fewer label requests to reach a low error. Also, in the parametric setting, Castro and Nowak [1] studied the effectiveness of active learning for the one-dimensional threshold classifier. Under the Tsybakov noise assumption, and given the knowledge of certain noise parameters, they provided an active learning algorithm more effective than passive learning. However, one of the practical limitations of these active learning strategies is that the knowledge of the noise and smoothness parameters is required [1]. This is unrealistic in many practical cases, so that it would be interesting to provide algorithms that adapt to these parameters.

This paper is organized as follows. In Section 2, we provide a review of the main adaptive active learning algorithms, both in parametric and nonparametric settings. In Section 3 we describe some related works that inspired us and highlight the main contributions of our work. In Section 4 we provide the main definitions that will be used throughout this work. In Section 5 we explain the different assumptions and highlight their practical implications. In Section 6 we describe our new adaptive algorithm called AKALLS. Section 7 provides upper and lower bounds on the excess risk of our algorithm. Section 8 is the conclusion of the paper.

In a parametric setting, Hanneke [4] opened the possibility of adaptation to certain key parameters, such as the noise parameters, by extending the work of Castro et al. [5] to a general class of hypotheses with finite complexity (VC-class, finite disagreement coefficient). Active learning strategies were designed on general classes of hypotheses and were proved to adapt to the noise parameters [4]. In particular, one of these adaptive active learning algorithms achieves the same minimax rate as in the problem of learning a threshold classifier studied in [5]. Also, Balcan and Hanneke [6] introduced some theoretical aspects of a variant of standard active learning.
Their algorithm allows to select an unlabelled subset of the pool and then to request a point that has a given label within this subset, if one exists. Under the Tsybakov noise assumption, some algorithms adaptive to the noise parameter were designed [6], based on a general class of hypotheses with finite complexity (finite disagreement coefficient, VC-class, or more generally finite Natarajan dimension). The rate of convergence achieved is as good as in the non-adaptive setting, up to a logarithmic factor.

In the nonparametric setting, Minsker [7] assumed that the regression function η(X) = E(Y | X) belongs to the Hölder class (with a fixed parameter α) and satisfies the Tsybakov noise assumption (with a fixed parameter β). By using a geometrical assumption called the strong density assumption, and under the condition αβ ≤ d, he designed an adaptive active learning strategy that nearly achieves the minimax rate of convergence n^{−α(β+1)/(2α+d−αβ)}, better than the passive learning rate n^{−α(β+1)/(2α+d)}, where n is the number of labels requested and d the dimension of the instance space. However, Minsker's active learning strategy works with an additional assumption compared to the passive setting: a condition relating the L_2 and L_∞ approximation losses of certain piecewise constant or polynomial approximations of the regression function in the vicinity of the decision boundary. This algorithm is based on a model selection related to a dyadic partition of the cube [0, 1]^d, where d is the dimension of the data space, and he used a powerful oracle inequality that allows adaptation to the smoothness parameter α. Remarkably, this algorithm adapts naturally to the noise parameter β.

Locatelli et al. [8] also consider a dyadic partition along with the Hölder smoothness and the Tsybakov noise assumption on the regression function. By using the strong density assumption, and under the condition αβ ≤ d, they provided an active learning strategy that adapts both to the smoothness and naturally to the noise parameters and that achieves the same minimax rate as obtained in [7]. They assumed that the smoothness parameter α belongs to a range of values I and considered a finite increasing sequence (α_i) ⊂ I. Their adaptive algorithm with respect to the smoothness parameter is based on a non-adaptive algorithm that iteratively takes as input a smoothness parameter α_i (i = 1, 2, ...) and outputs a labeled set S_i. Because the Hölder class is a nested class, the label of a point does not change between two consecutive iterations, so that S_i ⊂ S_{i+1}. Finally, Locatelli et al. proved that it is possible to control the error rate beyond the maximum α_i such that α_i ≤ α, and obtained the optimal rate of convergence up to a logarithmic factor.

In the context of nonparametric active learning under a smoothness assumption, the problem of designing adaptive algorithms that achieve optimal rates under minimal assumptions is still evolving. In this paper, we aim at designing an adaptive active learning strategy that achieves an optimal rate, but under a more general smoothness assumption than that used previously [7, 8].

Chaudhuri and Dasgupta [9] studied the problem of passive learning (more specifically, k-NN classification) under minimal assumptions. Their motivation was to design a smoothness assumption related to the underlying marginal density P_X, which therefore allows to overcome some disadvantages of the Hölder smoothness assumption.
Under this new smoothness assumption, they provided a k-NN classifier, and designed a region of confidence that can reliably be classified, outside of which the error rate is controlled by the Tsybakov noise assumption. This allows to achieve a rate of convergence as good as under Hölder smoothness in passive learning.

This work was previously extended to the context of active learning [10]. Under the new smoothness assumption and the Tsybakov noise assumption, an active learning algorithm was designed which achieves the same rate of convergence as was obtained under the Hölder smoothness assumption in [8, 7]. This algorithm is based on a pool of unlabeled examples K, and consists in providing a labeled subset Ŝ ⊂ K, called the active set, and finally considering the 1-NN classifier on Ŝ. Instead of asking directly for the label of an example in Ŝ, it infers it by asking the labels of its neighbors, and thereby obtains the correct label for a point relatively far from the Bayes decision boundary. Finally, [10] proved that for each example in the interior of the support of the underlying marginal distribution, relatively far away from the Bayes decision boundary {x, η(x) = 1/2}, its label coincides with both the true label and the inferred label of its nearest neighbor in Ŝ. However, from a practical point of view, their algorithm may sometimes not be applicable, because it requires the knowledge of both the smoothness and noise parameters.

In this work, we establish two main results. First, we provide an active learning algorithm that adapts both to the smoothness and noise parameters, and we prove theoretically that it achieves the same rate of convergence as that of non-adaptive algorithms which require the knowledge of the smoothness and noise parameters. It is important to underline that our smoothness assumption is more general than the ones used in previous works, in particular the Hölder smoothness assumption. Second, we also extend the work of [10] by providing a lower bound that matches (up to a logarithmic factor) the upper bound established in this paper.
Let X ⊂ R^d be the data space, called the instance space, and Y = {0, 1} the label space. Let w ∈ N* and let K ⊂ X × Y be an i.i.d. sample K = {(X_1, Y_1), ..., (X_w, Y_w)} drawn according to a probability P over X × Y, and K_x = {X_1, ..., X_w} its corresponding sequence of unlabeled points. The probability P can be decomposed as a couple (P_X, η), where P_X is the marginal probability on X and η the regression function defined by η(x) = P(Y = 1 | X = x) for all x in the support of P_X. We define a classifier as a measurable function f : X → Y. Standard (passive) learning based on the sample K consists in designing an algorithm that provides a classifier f̂_w. The performance of f̂_w is measured by R(f̂_w), where the function R, called the classification error, is defined by R(f) = P(f(X) ≠ Y) over all measurable functions f : X → Y. It is known [11] that the Bayes classifier, defined by f*(x) = 1{η(x) ≥ 1/2}, minimizes the classification error. Then, for a classifier f, the quantity R(f) − R(f*) is called the excess risk of f.

In active learning, we do not directly have access to the label of X ∈ K_x, and requesting its label is considered costly. At the beginning, the label budget n is thus fixed. The challenge consists in designing a strategy that requests at most n labels while achieving a performance competitive with that of passive learning, where the label budget would correspond to n = w. At any time, we choose to request the label of a point X ∈ K_x according to the previous observations. The point X is chosen to be the most "informative", which amounts to belonging to a region where classification is difficult and requires more labeled data to be collected.
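To fix ideas, the following small sketch (purely illustrative; the particular η, marginal distribution and classifier are arbitrary choices, not taken from the paper) estimates the classification error R(f) = P(f(X) ≠ Y) by Monte Carlo and compares an arbitrary classifier with the Bayes classifier f*(x) = 1{η(x) ≥ 1/2}.

```python
import numpy as np

# Illustrative (assumed) one-dimensional example: P_X standard normal,
# eta(x) = P(Y = 1 | X = x) an arbitrary smooth regression function.
rng = np.random.default_rng(0)
eta = lambda x: 1.0 / (1.0 + np.exp(-3.0 * x))        # regression function
bayes = lambda x: (eta(x) >= 0.5).astype(int)          # Bayes classifier f*
some_classifier = lambda x: (x >= 0.3).astype(int)     # arbitrary classifier f to evaluate

def classification_error(f, n_mc=200_000):
    """Monte Carlo estimate of R(f) = P(f(X) != Y)."""
    x = rng.standard_normal(n_mc)                      # X ~ P_X
    y = (rng.random(n_mc) < eta(x)).astype(int)        # Y | X ~ Bernoulli(eta(X))
    return np.mean(f(x) != y)

excess_risk = classification_error(some_classifier) - classification_error(bayes)
print(f"estimated excess risk R(f) - R(f*) ~ {excess_risk:.4f}")
```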
4.2 Definitions

In this section, we present some definitions of the important concepts we use throughout this paper. First let us recall that X ⊂ R^d is the instance space, and ρ the Euclidean metric on X. For x ∈ X and r > 0, we define B̄(x, r) = {z ∈ R^d, ρ(z, x) ≤ r} and B(x, r) = {z ∈ R^d, ρ(z, x) < r}.

Definition 4.1 ((α, L)-Hölder smoothness). Let η : X → [0, 1] be the regression function (defined in Section 4.1). We say that η is (α, L)-Hölder continuous (0 < α ≤ 1 and L > 0) if

    ∀ x, x′ ∈ X,   |η(x) − η(x′)| ≤ L ρ(x, x′)^α.   (H1)

Definition 4.2 ((α, L)-smoothness). Let 0 < α ≤ 1 and L > 0. The regression function is (α, L)-smooth if for all x, x′ ∈ supp(P_X) we have:

    |η(x) − η(x′)| ≤ L · P_X(B(x, ρ(x, x′)))^{α/d},   (H2)

where d is the dimension of the instance space.
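As a rough numerical illustration of how (H2) differs from (H1), the toy check below (an assumption-laden sketch, not taken from the paper: η, α and the Gaussian marginal are arbitrary choices) estimates the ratio |η(x) − η(x′)| / P_X(B(x, ρ(x, x′)))^{α/d} by Monte Carlo for d = 1; under (H2) this ratio stays bounded by L even in the tails of the Gaussian, where the strong density assumption (H3) fails.

```python
import numpy as np
from scipy.special import erf

# Toy one-dimensional check of the (H2)-smoothness condition (d = 1); eta and
# alpha below are arbitrary choices made only for this illustration.
alpha = 0.5
eta = lambda x: np.clip(0.5 + 0.5 * np.sign(x - 0.5) * np.abs(x - 0.5) ** alpha, 0.0, 1.0)

def ball_mass(x, r):
    """P_X(B(x, r)) for P_X = N(0, 1): Gaussian mass of the interval (x - r, x + r)."""
    cdf = lambda t: 0.5 * (1.0 + erf(t / np.sqrt(2.0)))
    return cdf(x + r) - cdf(x - r)

rng = np.random.default_rng(1)
x, z = rng.standard_normal(10_000), rng.standard_normal(10_000)
ratio = np.abs(eta(x) - eta(z)) / ball_mass(x, np.abs(x - z)) ** alpha   # exponent alpha/d, d = 1
print("largest observed (H2) ratio (a lower bound on a valid L):", ratio.max())
```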
Definition 4.3 (Margin noise). We say that P satisfies the margin noise, or Tsybakov noise assumption, with parameter β ≥ 0 if for all 0 < ε ≤ 1,

    P_X(x ∈ X, |η(x) − 1/2| < ε) < C ε^β,   (H4)

for C := C(β) ∈ [1, +∞[.
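As an illustrative computation (not taken from the paper): if P_X is uniform on [0, 1] and η(x) = x, then P_X(|η(X) − 1/2| < ε) = min(2ε, 1), so (H4) holds with β = 1 and, for instance, C = 3.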
Definition 4.4 (Strong density). Let P be the probability distribution defined over X × Y and P_X the marginal distribution of P over X. We say that P satisfies the strong density assumption if there exist constants r_0 > 0, c_0 > 0, p_min > 0 such that for all x ∈ supp(P_X):

    λ(B(x, r) ∩ supp(P_X)) ≥ c_0 λ(B(x, r))  for all r ≤ r_0,  and  p_X(x) > p_min,   (H3)

where p_X is the density function of the marginal distribution P_X and λ is the Lebesgue measure.

In this work, we use two main assumptions, described in detail in this section.
5.1 First assumption

Assumption 1:
We suppose that the regression function satisfies (H2).

This assumption was introduced by Chaudhuri and Dasgupta [9], who pointed out some disadvantages of the assumption (H1). Their motivation was to define a smoothness assumption that measures the change of the regression function with respect to the marginal distribution P_X, instead of the Hölder smoothness assumption (H1), which measures the change of the regression function with respect to the instance x. They also proved that (H2) generalizes (H1) along with (H3), as stated in the following theorem.

Theorem 5.1 (Chaudhuri and Dasgupta [9]). Suppose that X ⊂ R^d, that the regression function η is (α_h, L_h)-Hölder smooth, and that P_X satisfies (H3). Then there is a constant L > 0 such that for any x, z ∈ supp(P_X), we have:

    |η(x) − η(z)| ≤ L · P_X(B(x, ρ(x, z)))^{α_h/d}.

This theorem states that a regression function which satisfies (H1) and (H3) also satisfies (H2). To illustrate the importance of this assumption, we provide an example of a regression function that does not satisfy (H1) and (H3) simultaneously, but satisfies (H2).
Example 5.2 (Distribution that satisfies (H2)). Let P = (η, P_X) be the distribution defined as follows:

• The marginal distribution P_X is such that X ∼ N(0, 1), the univariate normal distribution.

• For α ≤ 1, the regression function η : R → [0, 1] is taken proportional to |x − 1/2|^α on [0, 1] and constant outside this interval. This regression function is represented together with the density function of the univariate normal distribution in Figure 1.

Figure 1: Example of a regression function η(x) (blue) that satisfies (H2), along with the marginal distribution P_X (red).

The probability P does not satisfy (H3) because the marginal density is not bounded below, and it can easily be shown that it satisfies (H2) with parameters (α, exp(−α)), or more formally (α, 1), because the constant L in (H2) is greater than 1.

5.2 Second assumption

Assumption 2: (Tsybakov noise assumption)
We suppose that P satisfies the Tsybakov noise assumption with parameters (β, C) such that β > 0 and C ≥ 1.

This assumption was introduced in [3] and characterizes the behavior of the regression function near the decision boundary using a parameter β. For a large value of β, we can observe a "jump" of the regression function at the decision boundary, while a small value of β covers the interesting case where the regression function crosses the decision boundary.

In Section 6.1 we provide a general description of the AKALLS algorithm. Then in Section 6.2 we introduce some notations that will be used throughout the remainder of this paper. The pseudo-code of the AKALLS algorithm is provided in Section 6.3, and the main subroutines are explained in Section 6.4.
Our active learning algorithm adapts to the smoothness and noise parameters (α and β, respectively), at least in a reasonable range of these parameters. The algorithm takes as input a pool of unlabelled data K, a label budget n, the constant parameters L and C used respectively in Assumption 1 and Assumption 2, a confidence parameter δ ∈ (0, 1), and an accuracy parameter ε ∈ (0, 1). For handling the adaptivity to the parameters L, C, we suppose they are both bounded by a logarithmic factor in 1/ε.

We design a decreasing sequence of smoothness parameters (α_i) such that at each step i, we execute a non-adaptive algorithm similar to that introduced in [10]. Each step produces a set Ŝ_i of informative points. The sequence (Ŝ_i) is increasing, and at the end of step i, the points added to Ŝ_i potentially improve the classification compared to the previous step. At the end of our algorithm, we obtain an aggregate set Ŝ on which we apply a 1-NN classifier.

For X_s ∈ K_x = {X_1, ..., X_w}, we denote by X_s^{(k)} its k-th nearest neighbor in K_x, and Y_s^{(k)} the corresponding label. For an integer k ≥ 1, let

    η̂_k(X_s) = (1/k) Σ_{i=1}^{k} Y_s^{(i)},    η̄_k(X_s) = (1/k) Σ_{i=1}^{k} η(X_s^{(i)}).   (1)

For a set S ⊂ X × Y, we denote by S_x the set S_x = {X ∈ X, (X, Y) ∈ S}. Let ε, δ, Δ ∈ (0, 1). The following quantities (2), (3), (4), (5) are derived from the detailed convergence proofs:

    k(δ, Δ) = (c/Δ²) [ log(1/δ) + log log(1/δ) + log log(√e/Δ) ],   (2)

where c ≥ 1.

    b_{δ,k} = sqrt( (1/k) ( log(1/δ) + log log(1/δ) + log log(ek) ) ).   (3)

    Δ = max( ε, (ε/C)^{1/(β+1)} ),   (4)

where (β, C) are the parameters introduced in Assumption 2.

    φ_n = sqrt( (1/n) ( log(1/δ) + log log(1/δ) ) ).   (5)
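For concreteness, the snippet below (a sketch based on the reconstruction of (2)–(5) above; the exact constants should be checked against the detailed proofs) evaluates these quantities for sample values of δ, ε, β and C.

```python
import math

def k_budget(delta, gap_, c=1.0):
    """Number of label requests k(delta, Delta) from (2), as reconstructed."""
    return (c / gap_**2) * (math.log(1/delta) + math.log(math.log(1/delta))
                            + math.log(math.log(math.sqrt(math.e) / gap_)))

def b(delta, k):
    """Confidence radius b_{delta,k} from (3), as reconstructed."""
    return math.sqrt((math.log(1/delta) + math.log(math.log(1/delta))
                      + math.log(math.log(math.e * k))) / k)

def gap(eps, beta, C):
    """Delta from (4): the margin level targeted for accuracy eps."""
    return max(eps, (eps / C) ** (1.0 / (beta + 1)))

delta, eps, beta, C = 0.05, 0.01, 1.0, 1.0
d_ = gap(eps, beta, C)
print(f"Delta = {d_:.3f}, k(delta, Delta) ~ {k_budget(delta, d_):.0f}, "
      f"b_(delta,100) = {b(delta, 100):.3f}")
```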
Algorithm 1: Adaptive Active Learning under Local Smoothness (AKALLS)

Input: a pool K_x = {X_1, ..., X_w}, label budget n, constants L, C, confidence parameter δ, accuracy parameter ε.
Output: f̂_{n,w}
Initialization: n̄ = n / log(1/ε); I_0 = ∅; current active set Ŝ_0 = ∅; current "noisy" points Ŝ_nois = ∅; i = 1
repeat
    s = 1                          ⊲ index of the point currently examined
    t = n̄                          ⊲ current label budget
    α_i = 2^{−i}
    Ĉ_i = ∅                        ⊲ current informative set at the i-th step
    I = ∅
    repeat
        if X_s ∈ (Ŝ_{i−1})_x ∪ (Ŝ_nois)_x then
            s = s + 1
        else
            T = Reliable(X_s, δ_s, α_i, L, I ∪ I_{i−1})
            if T = True then
                s = s + 1
            else
                Let δ_s = δ / (s² log(1/ε))
                [Ŷ, Q_s] = ConfidentAdapt(X_s, ε, t, δ_s)
                LB_s = | (1/|Q_s|) Σ_{(X,Y) ∈ Q_s} Y − 1/2 | − b_{δ_s, |Q_s|}    ⊲ lower-bound guarantee on |η(X_s) − 1/2|
                t = t − |Q_s|
                if LB_s ≥ b_{δ_s, |Q_s|} then
                    Ĉ_i = Ĉ_i ∪ {(X_s, Ŷ)}
                    I = I ∪ {(X_s, LB_s)}
                else
                    Ŝ_nois = Ŝ_nois ∪ {(X_s, Ŷ)}
    until t < 1 and s > w
    Ŝ_i = Ĉ_i ∪ Ŝ_{i−1}
    I_i = I ∪ I_{i−1}
    i = i + 1
until i > log(1/ε)
Ŝ = Ŝ_{log(1/ε)}
f̂_{n,w} ← Learn(Ŝ)
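To make the control flow above easier to parse, here is a compact Python transcription of the outer loop. It is only a sketch: `reliable` and `confident_adapt` are assumed to be supplied by the caller (they are sketched in the next subsection, and the constant C of Assumption 2 is assumed to be handled inside `confident_adapt`), and the per-step budget n̄, the confidence split δ_s and the informativeness threshold follow the reconstruction above rather than the paper's exact constants.

```python
import math

def akalls(pool, n, L, delta, eps, reliable, confident_adapt, learn):
    """Sketch of the AKALLS outer loop (Algorithm 1); constants follow the reconstruction above."""
    n_bar = n / math.log(1 / eps)                  # per-step label budget (assumed split)
    n_steps = int(math.ceil(math.log(1 / eps)))    # number of smoothness levels
    S_active, S_nois, I_prev = [], [], []          # informative points, noisy points, guarantees
    examined = set()                               # indices already handled by ConfidentAdapt
    for i in range(1, n_steps + 1):
        alpha_i = 2.0 ** (-i)                      # current smoothness level
        C_i, I_cur = [], []
        s, t = 0, n_bar
        while t >= 1 and s < len(pool):
            if s in examined or reliable(pool[s], delta, alpha_i, L, I_cur + I_prev):
                s += 1
                continue
            delta_s = delta / ((s + 1) ** 2 * math.log(1 / eps))     # union-bound split (assumed)
            y_hat, Q_s = confident_adapt(pool[s], eps, t, delta_s)    # inferred label + requested labels
            k = len(Q_s)
            b = math.sqrt((math.log(1 / delta_s) + math.log(math.log(1 / delta_s))
                           + math.log(math.log(math.e * k))) / k)
            lb = abs(sum(y for _, y in Q_s) / k - 0.5) - b            # lower bound on |eta(X_s) - 1/2|
            t -= k
            examined.add(s)
            if lb >= b:                                               # informative (threshold as reconstructed)
                C_i.append((pool[s], y_hat))
                I_cur.append((pool[s], lb))
            else:
                S_nois.append((pool[s], y_hat))
            s += 1
        S_active += C_i
        I_prev += I_cur
    return learn(S_active)                                            # e.g. a 1-NN classifier on the active set
```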
6.4 Main subroutines

The AKALLS algorithm uses two main subroutines, called Reliable and ConfidentAdapt.

The Reliable subroutine is a boolean test that checks whether the label of the current point X_s can be inferred with high confidence using the information collected on the previous points examined by the subroutine ConfidentAdapt. If the Reliable subroutine returns True at point X_s, the latter is not considered to be informative, and therefore is not considered further by the subroutine ConfidentAdapt. Conversely, if the Reliable subroutine returns False at point X_s, the ConfidentAdapt subroutine is used to determine its label with a given level of confidence. The ConfidentAdapt subroutine infers the label of X_s by using the labels of its nearest neighbors, with respect to a sequence of noise parameters (β_i), in an adaptive way.

Reliable subroutine

The Reliable subroutine takes as inputs an instance point X, a confidence parameter δ, the smoothness parameters α, L, and a set I ⊂ X × R. For (X′, c) ∈ I, X′ represents a point whose label we have already inferred with a guarantee c. The Reliable subroutine then allows us to know whether we can guess, with high probability, the label of the point X by using the set I. The Reliable subroutine uses the marginal distribution P_X, which is supposed to be known by the learner. This is not a limitation, since we can assume that our pool of data is large enough so that P_X can be estimated to any desired accuracy, as was done in [10].
Algorithm 2: Reliable subroutine

Input: an instance X, a confidence parameter δ, smoothness parameters α, L, a set I ⊂ X × R
Output: a boolean value T
if ∃ (X′, c) ∈ I such that P_X(B(X, ρ(X, X′))) ≤ (c/L)^{d/α} then
    T = True
else
    T = False
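In Python, this test is a one-liner once the marginal masses are available; in the sketch below, `ball_mass(x, r)` stands for the known (or estimated) quantity P_X(B(x, r)), `dist` for the metric ρ, and `d` for the dimension, all assumed to be supplied by the caller.

```python
def reliable(X, delta, alpha, L, I, ball_mass, dist, d):
    """Return True if some previously inferred point (X', c) in I certifies the label of X.

    Rationale (from (H2)): if P_X(B(X, rho(X, X'))) <= (c / L)**(d / alpha), then
    |eta(X) - eta(X')| <= c, so eta(X) lies on the same side of 1/2 as eta(X').
    delta is kept to mirror Algorithm 2's inputs; it is not needed in this simplified test.
    """
    return any(ball_mass(X, dist(X, Xp)) <= (c / L) ** (d / alpha) for Xp, c in I)
```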
ConfidentAdapt subroutine

ConfidentAdapt takes as input an instance X, an accuracy parameter ε, a budget parameter t ≥ 1 and a confidence parameter δ. ConfidentAdapt infers the label of an instance X ∈ K_x by requesting the labels of its neighbors in the pool K_x. The output Ŷ corresponds to the majority vote of the requested labels. ConfidentAdapt operates in an adaptive way, so that we do not have to know beforehand the smoothness and the noise parameters. Indeed, we introduce in the subroutine several noise levels β_i, and we expect that if the noise parameter β ≥ β_i, ConfidentAdapt uses at most k(δ, Δ_i) label requests. ConfidentAdapt is designed such that the inferred label produced at point X_s with a given value of α_i does not change subsequently (for α_j, j > i). Consequently, at iteration i (relative to the smoothness parameter α_i), any point that has already been examined previously by ConfidentAdapt can no longer be introduced into the Reliable and ConfidentAdapt subroutines in future iterations (α_j, j > i).

Algorithm 3: ConfidentAdapt subroutine

Input: an instance X, accuracy parameter ε, budget parameter t ≥ 1, confidence parameter δ.
Output: (Ŷ, Q)
Initialization: Q = ∅, k = 1
for i = 1 to log(1/ε) do
    β_i = i / log(1/ε)
    Δ_i = max( ε, (ε/C)^{1/(β_i+1)} )
for i = 1 to log(1/ε) do
    repeat
        Request the label Y^{(k)} of X^{(k)}
        Q = Q ∪ {(X^{(k)}, Y^{(k)})}
        if | (1/k) Σ_{j=1}^{k} Y^{(j)} − 1/2 | > b_{δ,k} then
            exit                    ⊲ cut-off condition
        k = k + 1
    until k > min( k(δ, Δ_i), t )
η̂ ← (1/|Q|) Σ_{(X,Y) ∈ Q} Y
Ŷ = 1{η̂ ≥ 1/2}
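The following sketch captures the essential behaviour of ConfidentAdapt: request the labels of the nearest neighbours of X one by one, and stop as soon as the empirical mean separates from 1/2 by the confidence radius b_{δ,k} or the cap on the number of requests is reached. For simplicity, the staged caps k(δ, Δ_i) of Algorithm 3 are collapsed here into a single cap `k_max`, so this is a simplified reading rather than a literal transcription; `neighbors` and `request_label` are assumed to be supplied by the caller.

```python
import math

def b_radius(delta, k):
    """Confidence radius b_{delta,k} from (3), as reconstructed."""
    return math.sqrt((math.log(1 / delta) + math.log(math.log(1 / delta))
                      + math.log(math.log(math.e * k))) / k)

def confident_adapt(X, eps, t, delta, neighbors, request_label, k_max):
    """Infer the label of X from its nearest neighbours' labels (simplified ConfidentAdapt).

    neighbors(X) yields the pool points ordered by distance to X;
    request_label(Z) queries the labelling oracle;
    k_max plays the role of min_i k(delta, Delta_i) in Algorithm 3.
    """
    Q, total = [], 0.0
    for k, Z in enumerate(neighbors(X), start=1):
        if k > min(k_max, t):                            # budget / cap exhausted
            break
        y = request_label(Z)                             # one call to the oracle
        Q.append((Z, y))
        total += y
        if abs(total / k - 0.5) > b_radius(delta, k):    # cut-off condition
            break
    eta_hat = total / len(Q) if Q else 0.5
    return int(eta_hat >= 0.5), Q                        # majority vote and requested pairs
```

In AKALLS, this routine is called with the per-point confidence δ_s and the remaining budget t of the current step.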
In this section we provide the upper and lower bounds on the excess risk of our algorithm. We state these bounds in a more practical form by using the label complexity.

In this section we show that the rate of convergence achieved by AKALLS is nearly the same (up to a logarithmic factor) as that achieved by non-adaptive algorithms. It is important to note that this rate of convergence covers only the case αβ ≤ d, especially when the regression function crosses the decision boundary in the interior of the support of P_X. Let us write P(α, β) for the set of probability distributions that satisfy Assumption 1 and Assumption 2, where the parameters α and β respectively come from (H2) and (H4). The following theorem states the upper bound on the excess risk of the classifier provided by the AKALLS algorithm.

Theorem 7.1. Let ε, δ ∈ (0, 1), n ∈ N, and d the dimension of the instance space. Let α ∈ (2ε, 1] and β ∈ [1/log(1/ε), log(1/ε)]. Let K = {X_1, ..., X_w} be a pool of data. There exists an active learning algorithm based on K, that is independent of α and β, which provides a classifier f̂_{n,w} by using at most n label requests, such that if αβ ≤ d, the number of label requests satisfies

    n ≥ Õ( (1/ε)^{(2α + d − αβ)/(α(β+1))} ),   (6)

and w satisfies

    w ≥ Õ( (1/ε)^{(2α + d)/(α(β+1))} ),   (7)

then with probability at least 1 − δ we have:

    sup_{P ∈ P(α,β)} [ R(f̂_{n,w}) − R(f*) ] ≤ ε.   (8)

We can equivalently express Theorem 7.1 only as a function of the number of label requests n. Specifically, for values of n and w sufficiently large, we have:

    sup_{P ∈ P(α,β)} [ R(f̂_{n,w}) − R(f*) ] ≤ Õ( n^{−α(β+1)/(2α+d−αβ)} ).   (9)
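As a quick sanity check of how (9) compares with the passive rate recalled in Section 2, the snippet below evaluates both exponents for an illustrative choice of parameters (d = 2, α = 1, β = 1, chosen only for the example).

```python
def active_exponent(alpha, beta, d):
    """Exponent of n in the active rate n^{-alpha(beta+1)/(2 alpha + d - alpha beta)} of (9)."""
    return alpha * (beta + 1) / (2 * alpha + d - alpha * beta)

def passive_exponent(alpha, beta, d):
    """Exponent of n in the passive minimax rate n^{-alpha(beta+1)/(2 alpha + d)}."""
    return alpha * (beta + 1) / (2 * alpha + d)

alpha, beta, d = 1.0, 1.0, 2             # illustrative values with alpha * beta <= d
print(active_exponent(alpha, beta, d))    # 0.666... : excess risk decays like n^(-2/3)
print(passive_exponent(alpha, beta, d))   # 0.5      : passive learning only gives n^(-1/2)
```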
In this section we state that, for a given probability P ∈ P(α, β), no active learner can provide a classifier whose expected excess risk (with respect to the sample) decreases to 0 faster than Õ(n^{−α(β+1)/(2α+d−αβ)}). Combined with (9), this therefore provides a minimax rate of the form Õ(n^{−α(β+1)/(2α+d−αβ)}). The following theorem is inspired by the minimax bounds of [12, 7, 8].

Theorem 7.2. Let α, β be the smoothness and noise parameters respectively introduced in (H2) and (H4), and d the dimension of the instance space. Let us assume that αβ ≤ d and that, for any P ∈ P(α, β), supp(P_X) ⊂ [0, 1]^d. Then there exists a constant γ > 0 such that for all n large enough and for any active classifier f̂_n we have:

    sup_{P ∈ P(α,β)} [ R(f̂_{n,w}) − R(f*) ] ≥ γ n^{−α(β+1)/(2α+d−αβ)}.   (10)

In this paper, we described an active learning algorithm with minimal regularity assumptions, that adapts to the parameters used in these assumptions. This algorithm achieves a better rate of convergence than its passive counterpart. Additionally, we provided a lower bound on the excess risk, and therefore obtained a minimax rate of convergence. Interesting future directions include an extension to multi-class instead of binary classification. Also, due to the computational issues in high-dimensional feature spaces, we could assume that the data is constrained to a lower-dimensional manifold, a setting in which the nearest neighbors method of our algorithm is expected to work particularly well [13].
References

[1] Rui M. Castro and Robert D. Nowak. Upper and lower error bounds for active learning.
[2] Sanjoy Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781, 2011.
[3] Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. The Annals of Statistics, 27(6):1808–1829, 1999.
[4] Steve Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.
[5] Rui M. Castro and Robert D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.
[6] Maria Florina Balcan and Steve Hanneke. Robust interactive learning. In Conference on Learning Theory, pages 20–1, 2012.
[7] Stanislav Minsker. Plug-in approach to active learning. Journal of Machine Learning Research, 13(Jan):67–90, 2012.
[8] Andrea Locatelli, Alexandra Carpentier, and Samory Kpotufe. Adaptivity to noise parameters in nonparametric active learning. Proceedings of Machine Learning Research, 65:1–34, 2017.
[9] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.
[10] Boris Ndjia Njike and Xavier Siebert. K-NN active learning under local smoothness assumption. arXiv preprint arXiv:2001, 2020.
[11] Gábor Lugosi. Pattern classification and learning theory. In Principles of Nonparametric Learning, pages 1–56. Springer, 2002.
[12] Jean-Yves Audibert and Alexandre B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.
[13] Ata Kabán. A new look at nearest neighbours: Identifying benign input geometries via random projections. In