Minimax semi-supervised confidence sets for multi-class classification
Evgenii Chzhen∗, Christophe Denis and Mohamed Hebiri
Université Paris-Est – Marne-la-Vallée
Cité Descartes, Bâtiment Copernic, 5 boulevard Descartes, 77454 Marne-la-Vallée cedex 2
e-mail: [email protected]; [email protected]; [email protected]

∗ This work was partially supported by "Labex Bézout" of Université Paris-Est.
Abstract:
In this work we study the semi-supervised framework of confidence set classification with controlled expected size in minimax settings. We obtain semi-supervised minimax rates of convergence under the margin assumption and a Hölder condition on the regression function. Besides, we show that if no further assumptions are made, there is no supervised method that outperforms the semi-supervised estimator proposed in this work. We establish that the best achievable rate for any supervised method is n^{−1/2}, even if the margin assumption is extremely favorable. On the contrary, semi-supervised estimators can achieve faster rates of convergence provided that sufficiently many unlabeled samples are available. We additionally perform a numerical evaluation of the proposed algorithms, empirically confirming our theoretical findings.

MSC 2010 subject classifications: Primary 62G05; secondary 62G30, 62H05, 68T10.
Keywords and phrases: multi-class classification, confidence sets, minimax optimality, semi-supervised classification.
1. Introduction
Let K ≥ 2 and let (X, Y) ∈ R^d × [K] := {1, . . . , K} be a random couple distributed according to a distribution P on R^d × [K], where X ∈ R^d is seen as the feature vector and Y ∈ [K] as the class. This problem falls within the scope of the multi-class setting where the goal is to predict the label Y for a given feature. Commonly, prediction is performed by a classifier that outputs a single label. However, in the confidence set framework, the objective differs: we aim at predicting a set of labels instead of a single one. This problem has been studied in a few works, and we consider in this contribution the setup put forward by Denis and Hebiri (2017). The essential feature of their perspective is the control of the size of confidence sets in expectation. While they provided a procedure to build confidence sets based on Empirical Risk Minimization (ERM) and established upper bounds, the present work aims at giving a general analysis of the confidence set problem in the minimax sense.

All along the paper, we denote by P_X the marginal distribution of X ∈ R^d and by p(·) := (p_1(·), . . . , p_K(·))^⊤ the regression function defined for all k ∈ [K] and all x ∈ R^d as p_k(x) := P(Y = k | X = x). For any sets A, A' ⊂ [K] we denote by A △ A' their symmetric difference. We assume that two data samples D_n, D_N are available. The first sample D_n = {(X_i, Y_i)}_{i=1}^n consists of n ∈ ℕ i.i.d. copies of (X, Y) ∈ R^d × [K], and the second sample D_N = {X_i}_{i=n+1}^{n+N} consists of N ∈ ℕ i.i.d. copies of X ∈ R^d.

A confidence set classifier Γ is a measurable function from R^d to 2^{[K]} := {A : A ⊂ [K]}, that is, Γ : R^d → 2^{[K]}, and we denote by Υ the set of all such functions. For any confidence set Γ : R^d → 2^{[K]} we define its error and its information as

  P(Γ) = P(Y ∉ Γ(X))   (error),    I(Γ) = E_{P_X} |Γ(X)|   (information),

respectively, where E_{P_X} stands for the expectation w.r.t. the marginal distribution of X ∈ R^d and |Γ(x)| is the cardinality of Γ(x) at x ∈ R^d.

For a fixed integer β ∈ [K] a β-Oracle confidence set Γ*_β is defined as

  Γ*_β ∈ arg min { P(Γ) : Γ ∈ Υ s.t. I(Γ) = β }.

The set {Γ ∈ Υ : I(Γ) = β} is always non-empty, as it always contains those confidence sets whose cardinality equals β for every x ∈ R^d.

The description of the β-Oracle confidence set in general situations might be complicated. Hence, we introduce the following mild assumption, which allows us to obtain an explicit expression.

Assumption 1.1 (Continuity of CDF). For all k ∈ [K] the cumulative distribution function (CDF) F_{p_k}(·) := P_X(p_k(X) ≤ ·) of p_k(X) is continuous on (0, 1).

Proposition 1.2 (β-Oracle confidence set). Fix β ∈ [K − 1], and let the function G : [0, 1] → [0, K] be defined for all t ∈ [0, 1] as

  G(t) := Σ_{k=1}^K (1 − F_{p_k}(t)) = Σ_{k=1}^K P_X(p_k(X) > t);

then under Assumption 1.1 a β-Oracle confidence set Γ*_β can be obtained as

  Γ*_β(x) = { k ∈ [K] : p_k(x) ≥ G^{-1}(β) },   (1.1)

where we denote by G^{-1} the generalized inverse of G, defined for all β ∈ [0, K] as G^{-1}(β) := inf{ t ∈ [0, 1] : G(t) ≤ β }.
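For concreteness, the following minimal sketch (our own illustration, not from the paper) evaluates an empirical version of G on a grid of thresholds and extracts the generalized inverse G^{-1}(β); the Dirichlet-generated matrix of class-probability values is an illustrative assumption standing in for p_k(X_i).

```python
import numpy as np

def G(t, probs):
    # Empirical G(t) = sum_k P_X(p_k(X) > t), averaged over feature points;
    # `probs` has shape (n_points, K) with probs[i, k] = p_k(x_i).
    return (probs > t).sum(axis=1).mean()

def G_inverse(beta, probs, grid_size=1_000):
    # Generalized inverse: G^{-1}(beta) = inf{ t in [0, 1] : G(t) <= beta };
    # G is non-increasing, so we return the first grid point where G <= beta.
    for t in np.linspace(0.0, 1.0, grid_size):
        if G(t, probs) <= beta:
            return t
    return 1.0

rng = np.random.default_rng(0)
# Illustrative model: K = 3 classes, p(x) drawn from a Dirichlet distribution
# (rows sum to one, as regression functions must).
probs = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=5_000)
beta = 2
t_beta = G_inverse(beta, probs)
# beta-Oracle at a point x, Eq. (1.1): keep classes whose score clears the threshold.
oracle_set = np.where(probs[0] >= t_beta)[0]
print(t_beta, oracle_set)
```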
Proposition 1.3. Assume that Assumption 1.1 is fulfilled; then the β-Oracle defined in Eq. (1.1) is a minimizer of the following risk

  R_β(Γ) = P(Γ) + G^{-1}(β) I(Γ).   (1.2)

These propositions have been proven in (Denis and Hebiri, 2017, Propositions 4 and 7). Consequently, the accuracy of a confidence set Γ can, for instance, be quantified through its excess risk

  R_β(Γ) − R_β(Γ*_β) = Σ_{k=1}^K E_{P_X}[ |p_k(X) − G^{-1}(β)| 1{k ∈ Γ(X) △ Γ*_β(X)} ].

The statistical learning problem is then to estimate Γ*_β given the data samples D_n and D_N. The formulation in Eq. (1.1) of the β-Oracle appears to be closely related to the level set estimation problem (Hartigan, 1987; Polonik, 1995; Tsybakov, 1997; Rigollet and Vert, 2009). Hence, at first sight, the introduction of an unlabeled sample may be surprising. However, in our setup the estimation of the β-Oracle does not rely only on the regression function but also on the threshold G^{-1}(β), which is unknown beforehand and can be estimated in a semi-supervised way (Denis and Hebiri, 2017). To fix ideas, we give some examples of possible estimation procedures for Γ*_β.

An estimator Γ̂ is a measurable function that maps any given data samples into a confidence set classifier. We shall distinguish two types of estimators, supervised and semi-supervised, whose formal definitions are provided below.
Definition 1.4 (Supervised and semi-supervised estimators). A measurable mapping

  Γ̂ : ⋃_{n,N ∈ ℕ} (R^d × [K])^n × (R^d)^N → Υ

is called a supervised estimator if for any n, N ∈ ℕ and any data samples D_n = {(X_i, Y_i)}_{i=1}^n, D_N = {X_i}_{i=n+1}^{n+N}, and D'_N = {X'_i}_{i=n+1}^{n+N} it holds that

  Γ̂(x; D_n, D_N) = Γ̂(x; D_n, D'_N),   a.e. x ∈ R^d w.r.t. the Lebesgue measure.

Otherwise the estimator is called semi-supervised. In the sequel, for simplicity of notation, we write Γ̂(x) instead of Γ̂(x; D_n, D_N) where no ambiguity is present.

Intuitively, supervised estimators do not take into account the information provided by the unlabeled sample. Besides, if we denote by Υ̂ the set of all estimators, Definition 1.4 generates a natural partition of Υ̂ into two disjoint sets: the supervised estimators Υ̂_SE and the semi-supervised estimators Υ̂_SSE. Hereafter, we provide three different examples of estimation procedures which are at the core of our study. All these methods rely on the plug-in principle.

• Top-β procedure. This method is the most intuitive estimator in the considered context. It is a supervised procedure, that is, based only on D_n. Consider an estimator p̂ of the regression function p and let (p̂_{σ_k(x)}(x))_{k ∈ [K]} be the order statistics associated with p̂(x), such that for all x ∈ R^d we have p̂_{σ_1(x)}(x) ≥ . . . ≥ p̂_{σ_K(x)}(x). A top-β confidence set is then defined as

  Γ̂_top(x) = {σ_1(x), . . . , σ_β(x)},   ∀x ∈ R^d.   (1.3)

• Supervised procedure. Formally, in this type of method we only use D_n (we forget about D_N). We split D_n into two independent samples such that D_n = D_{⌊n/2⌋} ∪ D_{⌈n/2⌉}. Based on the first sample D_{⌊n/2⌋}, we consider an estimator p̂ of the regression function p. Furthermore, we define

  Ĝ(·) = (1/⌈n/2⌉) Σ_{i ∈ D_{⌈n/2⌉}} Σ_{k=1}^K 1{p̂_k(X_i) ≥ ·},

and one type of supervised estimator is then defined as follows:

  Γ̂_SE(x) = {k ∈ [K] : p̂_k(x) ≥ Ĝ^{-1}(β)},   ∀x ∈ R^d.   (1.4)

Interestingly, conditional on the data sample D_{⌊n/2⌋}, the definition of the estimator Ĝ does not involve the labels associated with D_{⌈n/2⌉}. As a consequence, we can naturally consider a semi-supervised version of this estimator.

• Semi-supervised procedure. Based on D_n, we consider an estimator p̂ of the regression function p. Furthermore, we define

  Ĝ(·) = (1/N) Σ_{i ∈ D_N} Σ_{k=1}^K 1{p̂_k(X_i) ≥ ·},

and one type of semi-supervised estimator is then defined as follows:

  Γ̂_SSE(x) = {k ∈ [K] : p̂_k(x) ≥ Ĝ^{-1}(β)},   ∀x ∈ R^d.   (1.5)

One can note that these procedures are based on a preliminary estimator of p built from D_n, that is, all of them are plug-in type procedures. However, they differ by the construction of the output set. The top-β procedure and the supervised procedure rely only on the labeled data, while the semi-supervised estimator takes advantage of the information provided by the unlabeled data. The top-β procedure is the simplest among them: it naturally satisfies |Γ̂(x)| = β for all x ∈ R^d. At the same time, the others are more involved and can have different cardinalities for different values of x ∈ R^d. Nevertheless, for the other two procedures one can guarantee I(Γ̂) ≈ β. A schematic implementation of the three procedures is sketched below.
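The following sketch is our own illustration (the Dirichlet-distributed score matrices stand in for hypothetical estimates p̂_k(X_i)): the top-β rule (1.3) only sorts the scores, while (1.4) and (1.5) share the same thresholding step and differ only in the sample used to calibrate Ĝ.

```python
import numpy as np

def top_beta(scores, beta):
    # Eq. (1.3): keep the beta largest estimated scores at each point.
    order = np.argsort(-scores, axis=1)   # descending sort of p-hat
    return order[:, :beta]                # labels sigma_1(x), ..., sigma_beta(x)

def empirical_G(scores):
    # t -> (1/m) sum_i sum_k 1{p-hat_k(X_i) >= t}.
    return lambda t: (scores >= t).sum(axis=1).mean()

def G_inverse(G_hat, beta, grid_size=1_000):
    # Generalized inverse G^{-1}(beta) = inf{t : G(t) <= beta}.
    for t in np.linspace(0.0, 1.0, grid_size):
        if G_hat(t) <= beta:
            return t
    return 1.0

def threshold_set(scores_at_x, t):
    # Eqs. (1.4)/(1.5): keep classes whose estimated score clears the threshold.
    return np.where(scores_at_x >= t)[0]

rng = np.random.default_rng(1)
scores_labeled = rng.dirichlet([1.0] * 5, size=200)     # p-hat on the held-out labeled half
scores_unlabeled = rng.dirichlet([1.0] * 5, size=5_000) # p-hat on the unlabeled sample D_N
beta = 2
t_SE = G_inverse(empirical_G(scores_labeled), beta)     # supervised calibration, Eq. (1.4)
t_SSE = G_inverse(empirical_G(scores_unlabeled), beta)  # semi-supervised calibration, Eq. (1.5)
x_scores = scores_unlabeled[0]
print(top_beta(x_scores[None, :], beta)[0],
      threshold_set(x_scores, t_SE),
      threshold_set(x_scores, t_SSE))
```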
These examples give rise to natural questions which form the core of our theoretical study and which are summarized below.

1. The first question is the statistical performance of these plug-in procedures, which is assessed through rates of convergence and their optimality in the minimax sense.
2. The second question focuses on the benefit of the semi-supervised approach. Roughly speaking, are there situations where the semi-supervised approach outperforms the supervised one, and how can this be quantified?
3. The third question concentrates on the reason why it is more relevant for this problem to consider more involved estimators than the simple top-β method.

For a given family P of joint distributions on R^d × [K], a given estimator Γ̂ ∈ Υ̂, and fixed integers K ≥ 2, β ∈ [K], n, N ∈ ℕ, we are interested in the following maximal risks:

  Ψ^H_{n,N}(Γ̂; P) := sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)|   (Hamming risk),
  Ψ^E_{n,N}(Γ̂; P) := sup_{P ∈ P} E_{(D_n, D_N)} [ R_β(Γ̂) − R_β(Γ*_β) ]   (excess risk),
  Ψ^D_{n,N}(Γ̂; P) := sup_{P ∈ P} E_{(D_n, D_N)} [ |P(Γ̂) − P(Γ*_β)| + |β − I(Γ̂)| ]   (discrepancy),

where E_{(D_n, D_N)} denotes the expectation w.r.t. P^{⊗n} ⊗ P_X^{⊗N}. These maximal risks arise naturally in the context of confidence set estimation with controlled expected size. The risk Ψ^H_{n,N}(Γ̂; P) corresponds to the estimation of the β-Oracle through the Hamming distance. The second risk is directly connected with Proposition 1.3, which describes the β-Oracle as a minimizer of R_β(·). As the goal in this problem is to construct a procedure Γ̂ that exhibits a low error P(Γ̂) and a low cardinality discrepancy |β − I(Γ̂)|, it is natural to consider Ψ^D_{n,N}(Γ̂; P), which is composed of both.

Finally, we are in position to define the notion of minimax rate. The minimax rate in this context is determined not only by the family of distributions P but also by the family of estimators Γ̂ ⊂ Υ̂ that we consider.

Definition 1.5 (Minimax rate of convergence). For a given family P of joint distributions on R^d × [K] and a given family of estimators Γ̂ ⊂ Υ̂, the minimax rates are defined as

  Ψ^□_{n,N}(Γ̂; P) := inf_{Γ̂ ∈ Γ̂} Ψ^□_{n,N}(Γ̂; P),

where □ is H, E or D.

The main families of estimators that we study are the supervised estimators Υ̂_SE and the semi-supervised estimators Υ̂_SSE. Obviously, since Υ̂ = Υ̂_SE ∪ Υ̂_SSE and Υ̂_SE ∩ Υ̂_SSE = ∅, we have the following relation:

  Ψ^□_{n,N}(Υ̂; P) = Ψ^□_{n,N}(Υ̂_SE; P) ∧ Ψ^□_{n,N}(Υ̂_SSE; P).

As a consequence, lower and upper bounds on Ψ^□_{n,N}(Υ̂_SE; P) and Ψ^□_{n,N}(Υ̂_SSE; P) yield bounds on the minimax rate over all estimators.

The confidence set approach to classification was pioneered by Vovk (2002a,b); Vovk, Gammerman and Shafer (2005) by means of conformal prediction theory. They rely on non-conformity measures which are based on some pattern recognition methods, and develop an asymptotic theory. In this work, we consider a statistical perspective on confidence set classification and put our focus on non-asymptotic minimax theory.

The problem of confidence set multi-class classification has strong ties with binary classification with reject option, also known as binary classification with abstention in the machine learning literature. In binary classification with rejection, a classifier is allowed to output some special symbol which indicates rejection.
Such classifiers can be seen as confidence sets which are allowed to output ∅ or {1, 2}, both interpreted as rejection. This line of research was initiated by Chow (1957, 1970) in the context of information retrieval, where a predefined cost of rejection was considered. An extensive statistical study of this framework was carried out in (Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008; Wegkamp and Yuan, 2011).

Instead of considering a fixed cost for rejection, which might be too restrictive, one may define two entities: the probability of rejection and the probability of misclassification. In the spirit of conformal prediction, Lei (2014) aims at minimizing the probability of rejection under a fixed upper bound on the probability of misclassification. In contrast, Denis and Hebiri (2015) consider the reversed problem of minimizing the probability of misclassification given a fixed upper bound on the probability of rejection.

Once multi-class classification is considered, there are several possible ways to extend the binary case: the confidence set approach and the rejection approach. The reject counterpart is the more studied and known version, though it lacks statistical analysis. To the best of our knowledge, the only work which provides statistical guarantees is (Ramaswamy, Tewari and Agarwal, 2018).

As for the confidence set approach there are again two possibilities, similar to the binary case. The one considered in this work was proposed by Denis and Hebiri (2017), where the authors analyse an ERM algorithm and derive oracle inequalities under the margin assumption (Tsybakov, 2004). More specifically, they consider a convex surrogate of the error P(·) which relies on a convex real-valued loss function φ. For a suitable choice of the convex function φ they show that, under Assumption 1.1, their β-Oracle satisfies

  Γ*_β(·) = {k ∈ [K] : f*_k(·) ≥ G^{-1}_{f*}(β)},

where the function f* depends on φ and the value of G^{-1}_{f*}(β) is defined similarly to the present manuscript. They propose a two-step estimation procedure for the β-Oracle set. Based on the ERM algorithm, they first estimate f* and, in the second step, they estimate the threshold G^{-1}_{f*}(β) with an unlabeled sample. This procedure is in the same spirit as the semi-supervised procedure (1.5). Under mild assumptions, they provide an upper bound on the excess risk and obtain a rate of convergence of order (n/log n)^{−α/(α+s)} + N^{−1/2}, with s being a parameter that depends on the function φ and α being the margin parameter. Note that this rate is slower than the rate obtained in the standard classification framework.

Conformal prediction theory (Vovk, Gammerman and Shafer, 2005) suggests minimizing the information level with a fixed budget on the error level. Statistical properties of this framework were considered in the work of Sadinle, Lei and Wasserman (2018). Their objective is formulated for some a ∈ (0, 1) as

  Γ*_a ∈ arg min {I(Γ) : Γ ∈ Υ s.t. P(Γ) ≤ a},

and such a confidence set is called a least ambiguous confidence set with bounded error rate. The authors show that under Assumption 1.1 this oracle set can be described as a thresholding of the regression function

  Γ*_a(·) = {k ∈ [K] : p_k(·) ≥ t_a},

where the threshold t_a is defined as

  t_a = sup{ t ∈ [0, 1] : Σ_{k=1}^K P(p_k(X) ≥ t | Y = k) P(Y = k) ≥ 1 − a }.

Notice that this framework is very similar to (Denis and Hebiri, 2017) in the treatment of the Bayes optimal confidence set, as in both cases it is obtained via thresholding of the posterior distribution of the labels. Sadinle, Lei and Wasserman (2018) also proceed in two steps as here, that is, they first estimate the posterior distribution p_k(·) for all k ∈ [K] and estimate the threshold t_a afterwards. However, they require a second labeled dataset for the estimation of t_a, due to the presence of P(Y = k), the marginal distribution of the labels. Besides, their theoretical analysis is carried out under a different set of assumptions on the joint distribution P. Apart from the standard margin assumption, they require so-called detectability, that is, they require that the upper bound in the margin assumption is tight. Under these assumptions they provide an upper bound on the Hamming excess risk and obtain a rate of convergence of order O((n/log n)^{−1/2}).

Interestingly, both approaches can be encompassed in the constrained estimation framework (Anbar, 1977; Lepskii, 1990; Brown and Low, 1996), where one would like to construct an estimator with some prescribed properties. These properties are typically reflected by the form of the risk, which in our case is the discrepancy measure, that is, the sum of error and information discrepancies. Thus, both frameworks of Sadinle, Lei and Wasserman (2018) and Denis and Hebiri (2017) can be seen as extensions of constrained estimation to classification problems. From the modeling point of view, we believe that the two frameworks can co-exist nicely and the particular choice depends on the considered application. The major difference between the present work and those by Denis and Hebiri (2017) and Sadinle, Lei and Wasserman (2018) is the minimax analysis which we provide here and our treatment of semi-supervised techniques.

As already pointed out, the confidence set estimation problem is closely related to the level set estimation setup (Hartigan, 1987; Polonik, 1995; Tsybakov, 1997; Rigollet and Vert, 2009). This problem focuses on the estimation of a level set defined as

  Γ_p(λ) = {x ∈ R^d : p(x) ≥ λ},

where p is the density of the observations and λ > 0 is fixed beforehand. Given X_1, . . . , X_n distributed according to the density p, the goal is to estimate Γ_p(λ). In (Rigollet and Vert, 2009), the authors study plug-in density level set estimators through the measure of symmetric differences and the excess mass. In confidence set estimation, the analogue of the measure of symmetric differences is the Hamming risk, whereas the analogue of the excess mass is the excess risk. They show that kernel-based estimators are optimal in the minimax sense over a Hölder class of densities and under a margin type assumption (Polonik, 1995; Tsybakov, 2004). In particular, they derive fast rates of convergence, that is, faster than n^{−1/2}, for the excess mass. In the level set estimation problem, the threshold λ is chosen beforehand, whereas in our work the threshold G^{-1}(β) depends on the distribution of the data, which makes the statistical analysis more difficult.

On the other hand, the confidence set estimation problem is directly related to the standard classification setting. This problem has been widely studied from a theoretical point of view in the binary classification framework.
Audibert and Tsybakov (2007) study the statistical performance of plug-in classification rules under assumptions which involve the smoothness of the regression function and the margin condition. In particular, they derive fast rates of convergence for plug-in classifiers based on local polynomial estimators (Stone, 1977; Tsybakov, 1986; Audibert and Tsybakov, 2007) and show their optimality in the minimax sense. One of the aims of the present work is to extend these results to the confidence set classification framework.

Another part of our work is to provide a comparison between supervised and semi-supervised procedures. Semi-supervised methods are studied in several papers (Vapnik, 1998; Rigollet, 2007; Singh, Nowak and Zhu, 2009; Bellec et al., 2018) and references therein. A simple intuition can be provided on whether or not one should expect a superior performance of the semi-supervised approach. Imagine a situation where the unlabeled sample D_N is so large that one can approximate P_X up to any desired precision; then, if the optimal decision is independent of P_X, the semi-supervised estimators are not to be considered superior to supervised estimation. This is the case in a lot of classical problems of statistics, where the inference is solely governed by the behavior of the conditional distribution P_{Y|X} (for instance regression or binary classification). The situation might be different once the optimal decision relies on the marginal distribution P_X. In this case, as suggested by our findings, the semi-supervised approach may or may not outperform the supervised one, even in the context of the same problem. Similar conclusions were stated by Singh, Nowak and Zhu (2009) in the context of learning under the cluster assumption (Rigollet, 2007).

Below we summarize our contributions.

• Our results focus on the case where the regression function p belongs to a Hölder class and satisfies the margin condition. Under these assumptions, we establish lower bounds on the minimax rates, defined in Section 1.3, in the confidence set framework.
• As important consequences of our results, we first show that top-β type procedures are in general inconsistent. Furthermore, by providing a rigorous definition of semi-supervised and supervised estimators, we describe the situations when semi-supervised estimation should be considered superior to its supervised counterpart. Interestingly, our analysis suggests that these regimes are governed by the interplay between the family of distributions and the considered measure of performance. Besides, we show that in our setting supervised procedures cannot achieve fast rates, that is, their rate cannot be faster than n^{−1/2}. In contrast, some other classical settings (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009; Herbei and Wegkamp, 2006) allow faster rates for supervised methods.
• We provide supervised and semi-supervised estimation procedures which are optimal, or optimal up to an extra logarithmic factor. Importantly, our results show that the semi-supervised plug-in procedure based on local polynomial estimators can achieve fast rates, provided that the size of the unlabeled sample is large enough.
• Finally, we perform a numerical evaluation of the proposed plug-in algorithms against their top-β counterparts. This part supports our theoretical results and empirically demonstrates the reason to consider more involved procedures.

The paper is organized as follows. In Section 2, we introduce additional notation and the family of distributions P that we consider. Section 3 is devoted to the lower bounds on the minimax rates and their implications. In Section 4 we introduce the proposed algorithms, establish upper bounds for them, and evaluate their numerical performance. We conclude the paper with Sections 5 and 6, where we discuss and summarize our results.
2. Class of confidence sets
First, let us introduce some generic notation used throughout this work. For two numbers a, a' ∈ R we denote by a ∨ a' (resp. a ∧ a') the maximum (resp. minimum) of a and a'. For a positive real number a we denote by ⌊a⌋ (resp. ⌈a⌉) the largest (resp. smallest) non-negative integer that is less than or equal (resp. greater than or equal) to a. The standard Euclidean norm of a vector x ∈ R^d is denoted by ‖x‖ and the Lebesgue measure is denoted by Leb(·). A Euclidean ball centered at x ∈ R^d of radius r > 0 is denoted by B(x, r). For an arbitrary Borel measure µ on R^d that is absolutely continuous w.r.t. the Lebesgue measure, we denote by supp(µ) its support, that is, the set where the Radon–Nikodym derivative of µ w.r.t. Leb is strictly positive. For a vector function p : R^d → R^K and a Borel measure µ on R^d we define the infinity norm of p as

  ‖p‖_{∞,µ} := inf{ C ≥ 0 : max_{k ∈ [K]} |p_k(x)| ≤ C, a.e. x ∈ R^d w.r.t. µ }.

In this work, C and its lower-cased versions always refer to constants which might differ from line to line. Importantly, all these constants are independent of n, N but may depend on K, d and other parameters which are assumed to be fixed. Before introducing the families of distributions P considered in this work, we need the following definitions.

Assumption 2.1 (α-margin assumption). We say that the distribution P of the pair (X, Y) ∈ R^d × [K] satisfies the α-margin assumption if there exist C₀ > 0 and t₀ ∈ (0, 1) such that for every positive t ≤ t₀

  P_X(0 < |p_k(X) − G^{-1}(β)| ≤ t) ≤ C₀ t^α.

Let us point out an important consequence of Assumption 1.1: under it, the condition

  P_X(|p_k(X) − G^{-1}(β)| ≤ t) ≤ C₀ t^α,   for all t ∈ [0, t₀],

is equivalent to Assumption 2.1. Indeed, under Assumption 1.1 the random variables p_k(X) cannot concentrate at a constant level, in particular at G^{-1}(β). Moreover, again due to the continuity Assumption 1.1, we have

  lim_{t→0+} P_X(|p_k(X) − G^{-1}(β)| ≤ t) = 0,

thus the α-margin Assumption 2.1 specifies the rate of this convergence. Finally, restricting the range of t to [0, t₀] in the α-margin Assumption 2.1 does not affect its global behavior, as for all t ∈ [0, 1]

  P_X(0 < |p_k(X) − G^{-1}(β)| ≤ t) ≤ c₀ t^α,   with c₀ = C₀ ∨ t₀^{−α}.

Let c₀ and r₀ be two positive constants. We say that a Borel set A ⊂ R^d is a (c₀, r₀)-regular set if

  Leb(A ∩ B(x, r)) ≥ c₀ Leb(B(x, r)),   ∀r ∈ (0, r₀], ∀x ∈ A.

Definition 2.2 (Strong density). We say that the probability measure P_X on R^d satisfies the (µ_min, µ_max, c₀, r₀)-strong density assumption if it is supported on a compact (c₀, r₀)-regular set A ⊂ R^d and has a density µ w.r.t. the Lebesgue measure such that µ(x) = 0 for all x ∈ R^d \ A and

  0 < µ_min ≤ µ(x) ≤ µ_max < ∞,   ∀x ∈ A.
Definition 2.3 (Hölder class, Tsybakov (2008)). We say that a function h : R^d → R is (γ, L)-Hölder for γ > 0 and L > 0 if h is ⌊γ⌋ times continuously differentiable and for all x, x' ∈ R^d we have

  |h(x') − h_x(x')| ≤ L ‖x − x'‖^γ,

where h_x(·) is the Taylor polynomial of degree ⌊γ⌋ of h(·) at the point x ∈ R^d. Consequently, the set of all functions from R^d to R satisfying the above conditions is called the (γ, L, R^d)-Hölder class and is denoted by H(γ, L, R^d).

Definition 2.4.
We denote by P(L, γ, α) the set of joint distributions on R^d × [K] which satisfy the following conditions:

• the marginal P_X satisfies the (µ_min, µ_max, c₀, r₀)-strong density assumption,
• for all k ∈ [K] the k-th regression function p_k(·) = P(Y = k | X = ·) belongs to the (γ, L, R^d)-Hölder class, that is, p_k ∈ H(γ, L, R^d),
• for all k ∈ [K] the regression function p_k satisfies the (C₀, α, β)-margin assumption,
• for all k ∈ [K], the cumulative distribution function F_{p_k} of p_k(X) is continuous.

The family of distributions P(L, γ, α) is similar to the one considered in (Audibert and Tsybakov, 2007) in the context of binary classification. The only major difference is the continuity Assumption 1.1, which does not allow us to re-use their construction for lower bounds in a straightforward way.
3. Lower bounds
The main results of the present work are the lower bounds we provide in this section. In particular, we establish in Section 3.1 the inconsistency of top-β procedures (see Eq. (1.3) for the definition of the method). Therefore, more elaborate methods are required in this framework. As pointed out in the introduction, we distinguish two types of estimators, supervised and semi-supervised, for which we provide lower bounds in Section 3.2. The obtained rates highlight the benefit of the semi-supervised approach in the context of confidence set classification. Before considering the lower bounds, let us first display connections between the different minimax rates. Such links are used in the proofs of the lower bounds.
Proposition 3.1. Let Γ be a measurable function from R^d to 2^{[K]}, let β ∈ [K], and assume that Assumption 1.1 is fulfilled; then

  P(Γ) − P(Γ*_β) = R_β(Γ) − R_β(Γ*_β) + G^{-1}(β)(β − I(Γ)),
  R_β(Γ) − R_β(Γ*_β) = Σ_{k=1}^K E_{P_X}[ |p_k(X) − G^{-1}(β)| 1{k ∈ Γ(X) △ Γ*_β(X)} ].

Furthermore, if additionally Assumption 2.1 is satisfied with α > 0, then there exists C > 0, depending only on K, α, C₀, such that for any pair of confidence set classifiers Γ, Γ' it holds that

  E_{P_X} |Γ(X) △ Γ'(X)| ≤ C (R_β(Γ) − R_β(Γ'))^{α/(α+1)}.   (3.1)
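For completeness, the first identity is a one-line consequence of the definition (1.2), which gives P(Γ) = R_β(Γ) − G^{-1}(β) I(Γ), together with the fact that I(Γ*_β) = β under Assumption 1.1:

```latex
\begin{align*}
\mathrm{P}(\Gamma) - \mathrm{P}(\Gamma^{*}_{\beta})
  &= \bigl(R_{\beta}(\Gamma) - G^{-1}(\beta)\,\mathrm{I}(\Gamma)\bigr)
   - \bigl(R_{\beta}(\Gamma^{*}_{\beta}) - G^{-1}(\beta)\,\mathrm{I}(\Gamma^{*}_{\beta})\bigr)\\
  &= R_{\beta}(\Gamma) - R_{\beta}(\Gamma^{*}_{\beta})
   + G^{-1}(\beta)\bigl(\beta - \mathrm{I}(\Gamma)\bigr).
\end{align*}
```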
Proposition 3.2. For any K ≥ 2, β ∈ [K] and n, N ∈ ℕ, the following relation between minimax rates holds:

  Ψ^H_{n,N}(Γ̂; P) ≥ Ψ^D_{n,N}(Γ̂; P) ≥ Ψ^E_{n,N}(Γ̂; P).

Proposition 3.1, and in particular Eq. (3.1), gives an easy way to establish a lower bound on Ψ^E_{n,N}(Γ̂; P) via a lower bound on the Hamming distance Ψ^H_{n,N}(Γ̂; P). However, this approach does not allow us to get the (n + N)^{−1/2} (resp. n^{−1/2}) part of the rate in the lower bound on Ψ^E_{n,N}(Υ̂_SSE; P) (resp. Ψ^E_{n,N}(Υ̂_SE; P)). Besides, Proposition 3.2 allows us to prove a lower bound on the discrepancy Ψ^D_{n,N}(Γ̂; P) with the correct rate via the lower bound on the excess risk Ψ^E_{n,N}(Γ̂; P).

3.1. Inconsistency of the top-β procedure

Before stating our results on the supervised and the semi-supervised estimators, we discuss another interesting class of confidence sets, which might be a natural choice at first sight. We consider estimators which consist of β classes at every point x ∈ R^d, since such estimators naturally satisfy I(Γ̂) = β. Let us denote by Υ̂_β the set of all estimators Γ̂ such that |Γ̂(x)| = β for all x ∈ R^d, that is,

  Υ̂_β = { Γ̂ ∈ Υ̂ : |Γ̂(x)| = β, a.e. x ∈ R^d w.r.t. Leb }.

Despite the obvious restriction on the cardinality of the confidence sets, the family of estimators Υ̂_β is rather broad. Indeed, every procedure which estimates the regression functions p_k(·) and outputs the top β scores is included in Υ̂_β. The nature of the estimator can also be different, that is, the estimates could be based on ERM, non-parametric or parametric approaches. Clearly, the family Υ̂_β is neither included in Υ̂_SE nor in Υ̂_SSE and has a non-trivial intersection with both. The next result states that there is no uniformly consistent estimator Γ̂ ∈ Υ̂_β over the family of distributions P(L, γ, α).

Proposition 3.3.
Assume that K ≥ 6, β ∈ [⌊K/2⌋ − 1] and β ≥ 2; then for all n, N ∈ ℕ we have

  Ψ^E_{n,N}(Υ̂_β; P(L, γ, α)) ≥ (β − 1)/(2K).
The proof builds an explicit construction of a distribution P whose β-Oracle satisfies |Γ*_β(x)| > β for all x in some A ⊂ R^d with P_X(A) > 0. Indeed, if such a distribution exists, then there is no estimator in Υ̂_β that would consistently estimate its β-Oracle. The negative result established in Proposition 3.3 is rather instructive by itself, as it advocates that a more involved estimation procedure ought to be constructed.

3.2. Lower bounds for supervised and semi-supervised estimators

Clearly, estimators which achieve the infimum in the minimax rates are either supervised or semi-supervised; thus a lower bound on Ψ^□_{n,N}(Υ̂_SE; P) together with a lower bound on Ψ^□_{n,N}(Υ̂_SSE; P) yields a lower bound on Ψ^□_{n,N}(Υ̂; P). However, a lower bound on Ψ^□_{n,N}(Υ̂; P) does not discriminate between the supervised and the semi-supervised estimators.

Theorem 3.4 (Supervised estimation). Let K ≥ 4 and β ∈ [⌊K/2⌋ − 1]. If 2α⌈γ⌉ ≤ d, then there exist constants c, c', c'' > 0 such that for all n, N ∈ ℕ

  Ψ^H_{n,N}(Υ̂_SE; P(L, γ, α)) ≥ c (n^{−αγ/(2γ+d)} ∨ n^{−1/2}),
  Ψ^E_{n,N}(Υ̂_SE; P(L, γ, α)) ≥ c' (n^{−(1+α)γ/(2γ+d)} ∨ n^{−1/2}),
  Ψ^D_{n,N}(Υ̂_SE; P(L, γ, α)) ≥ c'' (n^{−(1+α)γ/(2γ+d)} ∨ n^{−1/2}).

Based on these results, we observe that the lower bound for the Hamming risk Ψ^H_{n,N} is slower than those for the other risks. It is even more significant that the best rate a supervised estimator can achieve for all of the risks is n^{−1/2}, even if the margin assumption holds. This is the major difference with the classical settings where the value of the threshold is known (such as classification and level set estimation). Indeed, under the same assumptions on the family of distributions, besides the continuity Assumption 1.1, the minimax rate in those frameworks is n^{−(1+α)γ/(2γ+d)}, as proved for instance in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009). The next theorem deals with semi-supervised procedures and displays another behavior.

Theorem 3.5 (Semi-supervised estimation). Let K ≥ 4 and β ∈ [⌊K/2⌋ − 1]. If 2α⌈γ⌉ ≤ d, then there exist constants c, c', c'' > 0 such that for all n, N ∈ ℕ

  Ψ^H_{n,N}(Υ̂_SSE; P(L, γ, α)) ≥ c (n^{−αγ/(2γ+d)} ∨ (n + N)^{−1/2}),
  Ψ^E_{n,N}(Υ̂_SSE; P(L, γ, α)) ≥ c' (n^{−(1+α)γ/(2γ+d)} ∨ (n + N)^{−1/2}),
  Ψ^D_{n,N}(Υ̂_SSE; P(L, γ, α)) ≥ c'' (n^{−(1+α)γ/(2γ+d)} ∨ (n + N)^{−1/2}).

First, observe that the lower bound for the Hamming distance is, as in the supervised setting, worse than for the other measures of performance. However, there is a major difference with the supervised case: as compared to Theorem 3.4, it is possible for a semi-supervised estimator to achieve rates faster than n^{−1/2} if the size N ∈ ℕ of the unlabeled dataset is large enough. In particular, when we consider Ψ^E_{n,N} or Ψ^D_{n,N}, the following relations are necessary to get fast rates:

  (n + N)^{−1/2} = o(n^{−(1+α)γ/(2γ+d)}),   n^{−(1+α)γ/(2γ+d)} = o(n^{−1/2}).

In this case, we recover the same fast rates as in the classical settings of classification and level set estimation. This suggests that the lack of knowledge of the threshold G^{-1}(β) does not alter the quality of estimation for the semi-supervised procedure, provided that N is sufficiently large. The next corollary makes these observations clearer.

Corollary 3.6.
Assume that the rates in Theorem 3.5 (resp. Theorem 3.4) are minimax, that is, there exists a confidence set Γ̂_SSE (resp. Γ̂_SE) that achieves these rates. Regarding Ψ^E_{n,N} and Ψ^D_{n,N}, the following conclusions hold.

• There is no semi-supervised estimator that achieves a faster rate than Γ̂_SE if

  (1+α)γ/(2γ+d) ≤ 1/2 and N ∈ ℕ,   or   (1+α)γ/(2γ+d) > 1/2 and N = O(n).

• The rate of Γ̂_SSE is faster than the rate of any supervised estimator if

  (1+α)γ/(2γ+d) > 1/2 and n = o(N).

Moreover, if there exists ρ > 1 such that n^ρ = o(N), then the rate of Γ̂_SSE is polynomially faster than n^{−1/2}.

• The rate of Γ̂_SSE is fast, similarly to the classical frameworks, if

  (1+α)γ/(2γ+d) > 1/2 and N = Ω(n^{2(1+α)γ/(2γ+d)}).

Clearly, a similar observation is true for the Hamming risk Ψ^H_{n,N}; however, the regime where improvement is possible thanks to the semi-supervised approach is narrower, as n^{−(1+α)γ/(2γ+d)} = o(n^{−αγ/(2γ+d)}). We summarize Corollary 3.6 in Table 1.
SE rate SSE rate SSE > SE ≤ N ∈ N , n ∈ N n − (1+ α ) γ γ + d n − (1+ α ) γ γ + d NO > N = O ( n ) n − n − NO > n = o ( N ) n − N − (cid:87) n − (1+ α ) γ γ + d YES > N = Ω (cid:18) n α ) γ γ + d (cid:19) n − n − (1+ α ) γ γ + d YES
Table 1
This table summarizes observations of Corollary 3.6 for Ψ E n,N and Ψ D n,N . Depending on therelations between α, γ, d and N, n the semi-supervised approach can significantly improve therates of convergence.
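To read Table 1 concretely, here is a worked instance (ours, not from the paper): take γ = 1, α = 1 and d = 1, so that

```latex
\[
\frac{(1+\alpha)\gamma}{2\gamma+d}=\frac{2}{3}>\frac{1}{2},
\qquad
N=n^{2}\ \Longrightarrow\
(n+N)^{-1/2}\vee n^{-2/3}\asymp n^{-2/3}.
\]
```

Then n = o(N) and N = Ω(n^{4/3}) both hold, so a semi-supervised estimator attains the rate n^{−2/3} (up to logarithms, by Theorem 4.2), while every supervised estimator is capped at n^{−1/2} by Theorem 3.4.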
Essentially, the above results suggest that the advantage of semi-supervised approaches over supervised ones depends not only on the underlying family of distributions P but also on the metric that is considered. Yet, necessary and sufficient conditions that must be imposed in general on the problem and the metric so that semi-supervised estimation provably improves upon supervised estimation remain an open problem.

A final remark before going further concerns the assumption on the parameters α and γ. The condition 2α⌈γ⌉ ≤ d in the lower bounds is slightly more restrictive than the conditions given in (Audibert and Tsybakov, 2007) (they have αγ ≤ d). We believe that this is an artifact of our proof and could be avoided with a finer choice of hypotheses. Simple modifications of the lower bound of Audibert and Tsybakov (2007) do not work in our settings because their hypotheses do not satisfy Assumption 1.1. In contrast, the construction of Rigollet and Vert (2009), modified properly to fit the classification framework, satisfies Assumption 1.1, but their lower bound is limited by the condition αγ ≤ 1, that is, it does not cover the fast rates as long as the dimension d > 1.

In order to prove the lower bounds of Theorems 3.4 and 3.5, we actually prove two separate lower bounds on the minimax rates. The two lower bounds are naturally connected with the proposed two-step estimator in Eq. (1.5): the first lower bound is connected with the problem of non-parametric estimation of p_k for all k ∈ [K], and the second describes the estimation of the unknown threshold G^{-1}(β).

In particular, the first lower bound is closely related to the ones provided in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009); however, the continuity Assumption 1.1 makes the proof more involved and results in a final construction of hypotheses that differs significantly. This part of our lower bound relies on Fano's inequality in the form of Birgé (2005). The second lower bound is based on two-hypotheses testing and is derived by constructing two different marginal distributions of X ∈ R^d which are sufficiently close, together with a fixed regression function p(·). Crucially, these marginal distributions admit two different values of the threshold G^{-1}(β) and thus two different β-Oracles. In this part we make use of Pinsker's inequality, see for instance (Tsybakov, 2008).

In order to discriminate between the supervised and the semi-supervised procedures, we make use of Definition 1.4. Notice that, thanks to Definition 1.4, every supervised procedure is not "sensitive" to the expectation taken w.r.t. the unlabeled dataset D_N, that is, randomness is only induced by the labeled dataset D_n. This strategy allows us to eliminate the dependence of the lower bound on the size of the unlabeled dataset D_N for supervised procedures. Informally, the lower bound on Ψ^□_{n,N}(Υ̂_SE; P) is obtained from the lower bound on Ψ^□_{n,N}(Υ̂_SSE; P) by setting N = 0.
4. Upper bounds
In this section, we show that one can build confidence set estimators that achieve, up to a logarithmic factor, the lower bounds stated in Theorems 3.4 and 3.5. In other words, those estimators are nearly optimal in the minimax sense. To come straight to the point, we delay the construction of the estimators to Section 4.1 and their properties to Section 4.2, and focus right now on the upper bounds.
Theorem 4.1 (Supervised estimation). Let K ∈ ℕ and β ∈ [K − 1]. Then there exist a supervised estimator Γ̂_SE ∈ Υ̂_SE and constants C, C', C'' > 0 such that for all n, N ∈ ℕ we have

  Ψ^H_{n,N}(Γ̂_SE; P(L, γ, α)) ≤ C (n^{−αγ/(2γ+d)} ∨ n^{−1/2}),
  Ψ^E_{n,N}(Γ̂_SE; P(L, γ, α)) ≤ C' ((n/log n)^{−(1+α)γ/(2γ+d)} ∨ n^{−1/2}),
  Ψ^D_{n,N}(Γ̂_SE; P(L, γ, α)) ≤ C'' ((n/log n)^{−(1+α)γ/(2γ+d)} ∨ n^{−1/2}).

Theorem 4.2 (Semi-supervised estimation). Let K ∈ ℕ and β ∈ [K − 1]. Then there exist a semi-supervised estimator Γ̂_SSE ∈ Υ̂_SSE and constants C, C', C'' > 0 such that for all n, N ∈ ℕ we have

  Ψ^H_{n,N}(Γ̂_SSE; P(L, γ, α)) ≤ C (n^{−αγ/(2γ+d)} ∨ (n + N)^{−1/2}),
  Ψ^E_{n,N}(Γ̂_SSE; P(L, γ, α)) ≤ C' ((n/log n)^{−(1+α)γ/(2γ+d)} ∨ (n + N)^{−1/2}),
  Ψ^D_{n,N}(Γ̂_SSE; P(L, γ, α)) ≤ C'' ((n/log n)^{−(1+α)γ/(2γ+d)} ∨ (n + N)^{−1/2}).

We show here that the lower bounds of Theorems 3.4 and 3.5 are achievable. In particular, in the case of the Hamming risk the upper bounds are optimal, whereas for the excess risk and the discrepancy the upper bounds match the lower bounds up to a logarithmic factor. Thus, the conclusions drawn in Corollary 3.6 are valid. Let us mention that the presence of the logarithmic factor in these upper bounds is due to ℓ∞-norm estimation (see Lemma 4.5).

The Hamming risk as a measure of performance was considered in the setting of Sadinle, Lei and Wasserman (2018). They also establish upper bounds for this measure, though they do not assess their optimality. Besides, as already mentioned, Denis and Hebiri (2017) provide an upper bound on the excess risk in the context of ERM. Let us point out that the comparison with these two works is not entirely fair, as the assumptions, and even the frameworks, under which the results are formulated are different.

4.1. Construction of the estimators

Building estimators Γ̂_SE and Γ̂_SSE that reach the rates in the former upper bounds involves preliminary estimators p̂_k of the regression functions p_k, k ∈ [K]. These estimators p̂_k are constructed using an arbitrary half D_{⌊n/2⌋} of the labeled dataset D_n, and they satisfy the following assumptions.

Assumption 4.3 (Exponential concentration). There exist estimators p̂_k for all k ∈ [K] based on D_{⌊n/2⌋} and positive constants C₁, C₂ such that for all k ∈ [K], all n ≥ 1 and all δ > 0

  sup_{P ∈ P(L,γ,α)} P^{⊗⌊n/2⌋}( |p̂_k(x) − p_k(x)| ≥ δ ) ≤ C₁ exp(−C₂ n^{2γ/(2γ+d)} δ²),

for almost all x ∈ R^d w.r.t. P_X.

Assumption 4.4 (Continuity of CDF). For all k ∈ [K], the cumulative distribution function F_{p̂_k}(t) := P_X(p̂_k(X) ≤ t) of p̂_k(X) is almost surely P^{⊗⌊n/2⌋} continuous on (0, 1).

First, let us point out that Assumption 4.3 implies that there exists a constant C > 0 such that for all n ≥ 1 and all α > 0

  sup_{P ∈ P(L,γ,α)} E_{D_{⌊n/2⌋}} ‖p − p̂‖^{1+α}_{∞,P_X} ≤ C (n/log n)^{−(1+α)γ/(2γ+d)}.

Assumption 4.3 is commonly used in the statistical community when dealing with rates of convergence in classification settings (Audibert and Tsybakov, 2007; Lei, 2014; Sadinle, Lei and Wasserman, 2018). It is for instance satisfied by the local polynomial estimator (Stone, 1977; Tsybakov, 1986; Audibert and Tsybakov, 2007). Assumption 4.4 can always be satisfied by slightly processing any estimator p̂. Indeed, assume Assumption 4.4 fails to be satisfied by some estimator p̂. This means that there exists a subset of R^d of non-zero measure such that at least one p̂_k, with k ∈ [K], is constant on this set. Then, if we add to p̂ a deterministic continuous function of sufficiently bounded variation (making sure that the addition preserves its statistical properties, that is, Assumption 4.3), such regions can no longer exist.

Since the threshold level G^{-1}(β) is not known beforehand, it ought to be estimated using data. A straightforward estimator of this threshold can be constructed using the unlabeled dataset D_N. To make our presentation mathematically correct, we introduce the following notation: D_n = D_{⌊n/2⌋} ∪ D_{⌈n/2⌉}, where D_{⌊n/2⌋} is the dataset used to build the estimators p̂_k for k ∈ [K]. Now, all the labels are removed from D_{⌈n/2⌉}, that is, it consists of ⌈n/2⌉ i.i.d. samples from P_X. The supervised and semi-supervised estimators of G(·) are defined as

  Ĝ_SE(·) = (1/⌈n/2⌉) Σ_{X ∈ D_{⌈n/2⌉}} Σ_{k=1}^K 1{p̂_k(X) > ·},
  Ĝ_SSE(·) = (1/(⌈n/2⌉ + N)) Σ_{X ∈ D_N ∪ D_{⌈n/2⌉}} Σ_{k=1}^K 1{p̂_k(X) > ·},

respectively. Finally, we are in position to define Γ̂_SE and Γ̂_SSE as

  Γ̂_SE(x) = {k ∈ [K] : p̂_k(x) ≥ Ĝ_SE^{-1}(β)},   Γ̂_SSE(x) = {k ∈ [K] : p̂_k(x) ≥ Ĝ_SSE^{-1}(β)},

for all x ∈ R^d. Note that Γ̂_SE is clearly supervised in the sense of Definition 1.4, as it is independent of the unlabeled sample D_N. In contrast, Γ̂_SSE is semi-supervised, since one can find two samples D_N and D'_N which induce different confidence sets.

4.2. Properties of the estimators

To show that the estimators introduced in this section satisfy the statements of Theorems 4.1 and 4.2, we refine the proof technique used in (Denis and Hebiri, 2017). That is, we introduce an intermediate quantity

  G̃(·) := Σ_{k=1}^K P_X(p̂_k(X) > ·),

and the associated confidence set, which we refer to as the pseudo-Oracle confidence set, given for all x ∈ R^d by

  Γ̃(x) := {k ∈ [K] : p̂_k(x) ≥ G̃^{-1}(β)}.

The confidence set Γ̃ assumes knowledge of the marginal distribution P_X and is seen as an idealized version of both Γ̂_SE and Γ̂_SSE; note, however, that the pseudo-Oracle Γ̃ is not an estimator.
An important step of our analysis is the following lemma, which bounds the difference between G̃^{-1}(β) and G^{-1}(β).

Lemma 4.5 (Upper bound on the thresholds). Let Assumption 1.1 be satisfied; then for all β ∈ [K]

  |G^{-1}(β) − G̃^{-1}(β)| ≤ ‖p − p̂‖_{∞,P_X},

almost surely P^{⊗n} ⊗ P_X^{⊗N}.

The proof of Lemma 4.5 uses elementary properties of generalized inverse functions which are provided in the Appendix. Besides, let us mention that the difference |G^{-1}(β) − G̃^{-1}(β)| resembles the Wasserstein infinity distance, which gives an alternative approach to prove Lemma 4.5, see (Bobkov and Ledoux, 2016). Lemma 4.5 explains the extra log n factor that appears in the upper bound, as the minimax estimation in sup-norm contains a log n factor, see for instance (Stone, 1982; Tsybakov, 2008). Another important property of the introduced estimators Γ̂_SE and Γ̂_SSE is obtained via Assumption 4.4. It describes the deviation of the information of Γ̂_SE and Γ̂_SSE from the desired level β.

Proposition 4.6 (Denis and Hebiri (2017)). Let p̂_k for all k ∈ [K] be arbitrary estimators of the regression functions, constructed using D_{⌊n/2⌋}, that satisfy Assumption 4.4; then there exist constants C, C' > 0 such that for all n, N ∈ ℕ it holds that

  E_{(D_n, D_N)} |β − I(Γ̂_SE)| ≤ C n^{−1/2},   E_{(D_n, D_N)} |β − I(Γ̂_SSE)| ≤ C' (N + n)^{−1/2}.

Note that if p̂_k satisfies Assumption 4.4 for all k ∈ [K], then β = I(Γ̃). This simple fact is a step in the proof of Proposition 4.6. Finally, the combination of Lemma 4.5, Proposition 4.6 and Assumption 4.3 with the peeling argument used in (Audibert and Tsybakov, 2007, Lemma 3.1) yields the results of Theorems 4.1 and 4.2.

4.3. Numerical experiments

The goal of this part is to numerically address the following points.

1) Is it advantageous to go outside of the classical multi-class classification setting and consider the confidence set framework? To respond to this question, we compute the Bayes optimal multi-class classifier and view it as a confidence set with one label. We compare this Bayes rule with the β-Oracle in terms of the error P(·) using various values of β ∈ [K] and K ∈ ℕ.
2) How does the β-Oracle confidence set compare to another "Oracle" (the top-β Oracle) which simply includes the classes corresponding to the largest values of the p_k(·)'s?
3) Does the proposed plug-in approach indeed give a good approximation of the β-Oracle through the error P(·) and the information I(·)?
4) Despite having demonstrated the minimax inconsistency of the top-β approach, we wonder whether in some scenarios it can achieve a comparable performance against our semi-supervised plug-in procedure.

We consider two simulation schemes depending on the parameter K ∈ {10, 100}. For each K, we generate (X, Y) according to a mixture model. More precisely,

i) the label Y follows the uniform distribution on [K];
ii) conditional on Y = k, the feature X is generated according to a multivariate Gaussian distribution with mean µ_k and identity covariance matrix.

For each k ∈ [K], the vectors µ_k are i.i.d. realizations of a uniform distribution on a fixed hypercube. For this distribution, we have

  p_k(X) = f_k(X) / Σ_{j=1}^K f_j(X),

where for each k ∈ [K], f_k is the density function of a multivariate Gaussian distribution with mean parameter µ_k and identity covariance matrix.

For each K, the misclassification error of the classical multi-class classification Bayes rule is evaluated on a sufficiently large dataset. It is valued at 0.22 for K = 10 and at 0.60 for K = 100. These values are relatively high, which suggests that confusion is induced by the large number of classes. Hence, it is reasonable to apply the confidence set approach to this problem.

In the sequel, we aim at estimating the error of the β-Oracle. To this end, for β ∈ {2, 5, 10, 20} and each K, we repeat B = 100 times the following steps:

i) simulate two datasets D_N and D_M with N = 10000 and M = 1000;
ii) based on D_N, compute the empirical counterpart of G and provide an approximation of the β-Oracle Γ*_β given in Eq. (1.1) (we recall that this step requires a dataset which contains only unlabeled features);
iii) finally, over D_M, compute the empirical counterparts P_M (of P(Γ*_β)) and I_M (of I(Γ*_β)).

From these estimates, we compute the mean and the standard deviation of P_M and I_M. Tables 2 and 3 present the values of the error and of the information achieved by the β-Oracle and by the top-β Oracle.

Table 2
For each of the B = 100 repetitions and each model, we derive the estimated errors P_M of the β-Oracle and of the top-β Oracle w.r.t. β. We compute the means and standard deviations (in parentheses) over the B = 100 repetitions. Top: the data are generated according to K = 10 – Bottom: the data are generated according to K = 100.

K = 10
β     β-Oracle      top-β Oracle
2     0.05 (0.01)   0.09 (0.01)
5     0.00 (0.00)   0.01 (0.00)

K = 100
β     β-Oracle      top-β Oracle
2     0.39 (0.01)   0.42 (0.01)
5     0.20 (0.01)   0.22 (0.01)
10    0.09 (0.01)   0.11 (0.01)
20    0.03 (0.01)   0.04 (0.01)

Table 3
For each of the B = 100 repetitions and each model, we derive the estimated information levels I_M of the β-Oracle set w.r.t. β. We compute the means and standard deviations (in parentheses) over the B = 100 repetitions. Left: the data are generated according to K = 10 – Right: the data are generated according to K = 100.

β     K = 10        K = 100
2     2.00 (0.03)   2.00 (0.03)
5     5.00 (0.08)   5.00 (0.06)
10    ·             ·
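A minimal reimplementation of the protocol i)–iii) above (ours; the dimension d = 2 and the range [0, 5] of the uniform draw for the means are illustrative assumptions, since the exact simulation constants are not essential):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
K, d, beta = 10, 2, 2
mus = rng.uniform(0.0, 5.0, size=(K, d))         # illustrative class means

def sample(m):
    y = rng.integers(K, size=m)                  # Y uniform on [K]
    x = mus[y] + rng.standard_normal((m, d))     # X | Y=k ~ N(mu_k, I_d)
    return x, y

def posteriors(x):
    # p_k(x) = f_k(x) / sum_j f_j(x) for the Gaussian mixture.
    dens = np.stack([multivariate_normal(mus[k], np.eye(d)).pdf(x)
                     for k in range(K)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

# i) unlabeled calibration sample and labeled test sample
XN, _ = sample(10_000)
XM, yM = sample(1_000)
# ii) empirical G and its generalized inverse on the unlabeled sample
pN = posteriors(XN)
grid = np.linspace(0.0, 1.0, 1_000)
G_vals = [(pN > t).sum(axis=1).mean() for t in grid]
t_beta = grid[next(i for i, g in enumerate(G_vals) if g <= beta)]
# iii) empirical error P_M and information I_M of the approximate beta-Oracle
pM = posteriors(XM)
inside = pM[np.arange(len(yM)), yM] >= t_beta
print("error:", 1 - inside.mean(), "information:", (pM >= t_beta).sum(axis=1).mean())
```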
We now move towards the construction of our semi-supervised plug-in estimators Γ̂_SSE. For each K and each β, we evaluate the performance of Γ̂_SSE according to three different estimations of the regression function: the p̂_k's are based on random forests, softmax regression and deep learning procedures. Let us point out that, for the random forests and softmax regression algorithms, the random variables p̂_k(X) appear not to be continuous; hence Assumption 4.4 is violated. To alleviate this issue, we add to p̂_k(X) a small independent perturbation, taken for simplicity as the absolute value of a centered Gaussian with negligible variance. The evaluation of the performance of Γ̂_SSE relies on the following steps:

i) simulate three datasets D_n, D_N and D_M;
ii) based on D_n, compute the estimators p̂_k of p_k according to the considered procedure;
iii) based on D_N and p̂_k, compute the function Ĝ and the estimator Γ̂_SSE as in Eq. (1.5) (we recall that this step requires a dataset which contains only unlabeled features);
iv) finally, compute over D_M the empirical counterparts of P and of I for the considered Γ̂_SSE.

Again, during these experiments, we compute means and standard deviations. The parameters K, n, N are fixed as follows: for K = 10, we fix n = 1000 and N ∈ {100, 10000}; for K = 100, we fix n = 10000 and N ∈ {100, 10000}. Finally, the size of D_M is fixed to M = 1000. The results are illustrated in Tables 4 and 5.

Table 4
For each of the B = 100 repetitions and for each model, we derive the estimated errors P of three different Γ̂_SSE's w.r.t. β. We compute the means and standard deviations (in parentheses) over the B = 100 repetitions. For each β and for each N, the Γ̂_SSE's, as well as the top-β procedures, are based on, from left to right, rforest, softmax reg and deep learn, which are respectively the random forest, the softmax regression and the deep learning methods. Top: the data are generated according to K = 10 – Bottom: the data are generated according to K = 100.

Table 5
For each of the B = 100 repetitions and for each model, we derive the estimated information levels I of three different Γ̂_SSE's w.r.t. β and the sample size N ∈ {100, 10000}. We compute the means and standard deviations (in parentheses) over the B = 100 repetitions. For each β and each N, the Γ̂_SSE's are based on, from left to right, rforest, softmax reg and deep learn, which are respectively the random forest, the softmax regression and the deep learning procedures. Top: the data are generated according to K = 10 – Bottom: the data are generated according to K = 100.

As a benchmark for the continuation of our experiments, the classical misclassification errors of the multi-class classifiers based on random forests and softmax regression are valued at 0.28 and 0.24 for K = 10, while for K = 100 the errors of the random forest, softmax regression and deep learning methods are valued respectively at 0.65, 0.98 and 0.63.

Turning to Table 2, we confirm the intuition that the error of the β-Oracle decreases as the value of the parameter β increases. Nevertheless, for moderate values of β compared to K, we obtain a satisfactory improvement over the standard multi-class classification Bayes rule. For instance, when K = 10 and β = 2 the error of the 2-Oracle confidence set is 0.05, whereas the Bayes classifier has 0.22; likewise, when K = 100 and β = 5 the classification error decreases from 0.60 to 0.20. Table 2 also shows that the top-β Oracle is slightly outperformed by the β-Oracle in terms of the error, but still performs well.

From Tables 3 and 5, we observe that the approximation of the information is reasonably good and gets better with N, the number of unlabeled data. Besides, Tables 2 and 4 demonstrate that our algorithm is sensitive to the choice of the underlying estimator p̂_k. Indeed, when p̂_k is estimated via the softmax regression, our algorithm fails to give a good approximation of the error of the β-Oracle. Table 4 provides similar conclusions regarding Γ̂_SSE, though, unlike for the theoretical quantities, there are more scenarios where our method is better than its top-β counterpart. Let us point out that for K = 100 the methods based on softmax regression perform poorly in this setup.
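Finally, a sketch of the plug-in pipeline evaluated above — ours rather than the authors' exact code: scikit-learn's random forest plays the role of p̂, and a tiny half-Gaussian perturbation enforces the continuity required by Assumption 4.4 (the noise scale 1e-5, the number of trees and the grid size are illustrative assumptions).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_sse_confidence_set(Xn, yn, XN, beta, rng):
    # Plug-in step: p-hat fitted on the labeled sample D_n.
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xn, yn)

    def scores(X):
        s = clf.predict_proba(X)
        # Tiny perturbation |N(0, sigma^2)| to make the score CDFs continuous
        # (Assumption 4.4); the scale is an illustrative choice.
        return s + np.abs(rng.normal(0.0, 1e-5, size=s.shape))

    # Calibration step: empirical G on the unlabeled sample D_N, as in Eq. (1.5);
    # the loop stops at the first grid point where G-hat(t) <= beta.
    sN = scores(XN)
    for t in np.linspace(0.0, 1.0, 1_000):
        if (sN >= t).sum(axis=1).mean() <= beta:
            break
    return lambda X: [np.where(row >= t)[0] for row in scores(X)]
```

Calling fit_sse_confidence_set(Xn, yn, XN, beta=2, rng=np.random.default_rng(0)) returns a function mapping feature rows to label sets, whose empirical error and information can then be measured on D_M as in step iv) above.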
5. Discussions
5.1. The continuity assumption

The bedrock of this paper is Assumption 1.1. Based on it, we ensure that the β-Oracle confidence set given by Eq. (1.1) is indeed of information β. On top of that, the explicit formulation of the excess risk in Proposition 3.1 relies on the continuity of the function G(·). Should Assumption 1.1 fail to be satisfied, there might be no β-Oracle given by thresholding at some level θ ∈ (0, 1). Indeed, assuming that there exists a β-Oracle of the form Γ*_β(·) = {k ∈ [K] : p_k(·) > θ} for some θ, then β = I(Γ*_β) = G(θ). However, without the continuity, the function G(·) is not surjective and therefore the equation G(θ) = β may have no solution, which contradicts the fact that I(Γ*_β) = β. Therefore, the settings without the continuity of G(·) deserve a separate study. Let us also point out that the continuity assumption implies that the β-Oracle can also be defined as
\[
\Gamma^*_\beta \in \arg\min\{\mathrm{P}(\Gamma) : \Gamma \in \Upsilon \text{ s.t. } \mathrm{I}(\Gamma) \le \beta\},
\]
where the inequality is used in place of the equality. Indeed, under the continuity assumption, thanks to Propositions 1.3 and 3.1, we have for all confidence sets Γ such that I(Γ) ≤ β
\[
\mathrm{P}(\Gamma) - \mathrm{P}(\Gamma^*_\beta) = \underbrace{\mathrm{R}_\beta(\Gamma) - \mathrm{R}_\beta(\Gamma^*_\beta)}_{\ge 0} + \underbrace{G^{-1}(\beta)\bigl(\beta - \mathrm{I}(\Gamma)\bigr)}_{\ge 0},
\]
which implies that the β-Oracle Γ*_β is a minimizer.

5.2. Lipschitz continuity of G^{-1}(·)

Under the assumptions needed in this work, and in particular the continuity assumption, we showed two important facts: i) no supervised approach can achieve fast rates, that is, rates faster than n^{-1/2}; ii) some semi-supervised approaches can achieve fast rates. One might wonder whether extra assumptions on the problem allow a supervised method to get faster rates than n^{-1/2}. We give a partial answer to this question following the recent work of Bobkov and Ledoux (2016), and more precisely their Theorem 5.11. Applying this result to our framework, we can state that there exists a positive constant c such that
\[
\mathbb{E}_{\mathcal{D}_N}\bigl|G^{-1}(\beta) - G_N^{-1}(\beta)\bigr| \le c\,\mathrm{Lip}(G^{-1})\,N^{-1/2},
\]
where Lip(G^{-1}) is the Lipschitz constant of G^{-1}(·) and G_N^{-1}(·) is the generalized inverse of
\[
G_N(\cdot) = \frac{1}{N}\sum_{X \in \mathcal{D}_N}\sum_{k=1}^{K}\mathbf{1}\{p_k(X) > \cdot\}.
\]
If, on top of the above, one can show that for any α > 0 there exists c′ > 0 such that
\[
\mathbb{E}_{\mathcal{D}_N}\bigl|G^{-1}(\beta) - G_N^{-1}(\beta)\bigr|^{1+\alpha} \le c'\,\mathrm{Lip}^{1+\alpha}(G^{-1})\,N^{-(1+\alpha)/2},
\]
then, under Lipschitz continuity of G^{-1}(·), we can prove that
\[
\Psi^{\mathrm{E}}_{n,N}\bigl(\hat\Gamma_{\square};\ \mathcal{P}(L,\gamma,\alpha)\bigr) \lesssim \Bigl(\frac{n}{\log n}\Bigr)^{-\frac{(1+\alpha)\gamma}{2\gamma+d}},
\]
where □ stands for SE or SSE. This would illustrate that both ˆΓ_SE and ˆΓ_SSE are statistically equivalent under a Lipschitz condition on G^{-1}(·): both reach the same rate, and the impact of the unlabeled data D_N is negligible. We plan to further investigate the influence of this Lipschitz condition on the minimax rates of convergence in future work. Since in the present contribution we do not impose this assumption on G^{-1}(·), the upper bound of Bobkov and Ledoux (2016) is not applicable and we had to rely on a different approach. A toy illustration of the N^{-1/2} behavior of the empirical threshold is sketched below.
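The simulation below is our own construction, not the paper's: we take K = 2 with p₁(X) uniform on [0, 1], so that G(t) = 2(1 − t) and G^{-1}(β) = 1 − β/2 are available in closed form, and the empirical inverse G_N^{-1}(β) reduces to an order statistic of the pooled scores.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 1.0                      # target information level, K = 2 classes
true_t = 1 - beta / 2           # closed form: G(t) = 2(1 - t) when p_1 ~ U(0, 1)

def G_N_inv(p1, beta):
    # G_N(t) = (1/N) * #{scores > t} with scores {p_1(X_i), 1 - p_1(X_i)};
    # its generalized inverse is an order statistic of the pooled scores.
    s = np.sort(np.concatenate([p1, 1 - p1]))[::-1]
    return s[int(beta * len(p1))]

for N in [100, 1000, 10000, 100000]:
    devs = [abs(G_N_inv(rng.uniform(size=N), beta) - true_t) for _ in range(200)]
    print(f"N = {N:6d}: E|G^-1(beta) - G_N^-1(beta)| ~ {np.mean(devs):.5f}, "
          f"N^(-1/2) = {N ** -0.5:.5f}")
```

The printed deviations shrink at roughly the N^{-1/2} pace, in line with the Bobkov-Ledoux bound quoted above.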
Theorems 3.5 and 4.2 demonstrate that, for the excess risk and the discrepancy, the upper and the lower bounds differ by a logarithmic factor. As we have already pointed out, this factor appears in the upper bounds due to Lemma 4.5, which relates the difference between two thresholds to the infinity norm. One might hope that, if we managed to replace the infinity norm by another ℓ_q-norm on the right-hand side of the inequality in Lemma 4.5, this logarithm could be eliminated. Unfortunately, this bound is actually tight, in the sense that one can construct a distribution P and estimators ˆp_k, k ∈ [K], for which equality is achieved in Lemma 4.5. These arguments suggest that the obtained upper bound should be optimal. They also imply that the lower bounds could be refined further to gain an extra logarithmic factor. Let us also mention that the continuity Assumption 1.1, in combination with the margin Assumption 2.1, is the main obstacle that prevented us from providing better lower bounds. Nevertheless, our proofs are already involved, and our results allow us to draw non-trivial conclusions even without going into the details concerning the logarithms.
6. Conclusion
In this work we have studied the minimax settings of confidence set multi-class classification. First of all, following previous works, we have shown that a top-β type procedure is inconsistent in our settings and that more involved estimators are required. Besides, we have demonstrated that no supervised estimator can achieve rates faster than n^{-1/2}, which stands in contrast to other classical settings. Additionally, we have shown that fast rates are achievable for semi-supervised techniques, provided that the size of the unlabeled sample is large enough. Consequently, we have established that our lower bounds are either optimal or nearly optimal by providing a supervised and a semi-supervised estimator which are tractable in practice. Our future work shall focus on the Lipschitz condition on G^{-1}(·) discussed in Section 5.2; in particular, we want to understand how this extra assumption affects our lower bounds.

References
Anbar, D. (1977). A modified Robbins-Monro procedure approximating the zero of a regression function from below. Ann. Statist.
Audibert, J. and Tsybakov, A. (2007). Fast learning rates for plug-in classifiers. Ann. Statist.
Bartlett, P. and Wegkamp, M. (2008). Classification with a reject option using a hinge loss. J. Mach. Learn. Res.
Bellec, P., Dalalyan, A., Grappin, E. and Paris, Q. (2018). On the prediction loss of the lasso in the partially labeled setting. Electron. J. Statist.
Birgé, L. (2005). A new lower bound for multiple hypothesis testing. IEEE Trans. Inform. Theory.
Bobkov, S. and Ledoux, M. (2016). One-dimensional empirical measures, order statistics and Kantorovich transport distances. To appear in the Memoirs of the Amer. Math. Soc.
Brown, L. and Low, M. (1996). A constrained risk inequality with applications to nonparametric functional estimation. Ann. Statist.
Chow, C. (1957). An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers.
Chow, C. (1970). On optimum error and reject trade-off. IEEE Trans. Inform. Theory.
de Finetti, B. (1972). Probability, Induction and Statistics. The Art of Guessing. John Wiley & Sons, London-New York-Sydney, Wiley Series in Probability and Mathematical Statistics.
de Finetti, B. (1974). Theory of Probability: A Critical Introductory Treatment. Vol. 1. John Wiley & Sons, London-New York-Sydney.
Denis, C. and Hebiri, M. (2015). Consistency of plug-in confidence sets for classification in semi-supervised learning. Preprint.
Denis, C. and Hebiri, M. (2017). Confidence sets with expected sizes for multiclass classification. J. Mach. Learn. Res.
Gilbert, E. (1952). A comparison of signalling alphabets. The Bell System Technical Journal.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. J. Amer. Statist. Assoc.
Herbei, R. and Wegkamp, M. (2006). Classification with reject option. Canad. J. Statist.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc.
Lei, J. (2014). Classification with confidence. Biometrika.
Lepskii, O. (1990). Asymptotic minimax estimation with prescribed properties. Theory Probab. Appl.
Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters: an excess mass approach. Ann. Statist.
Ramaswamy, H., Tewari, A. and Agarwal, S. (2018). Consistent algorithms for multiclass classification with an abstain option. Electron. J. Stat.
Rigollet, P. (2007). Generalization error bounds in semi-supervised classification under the cluster assumption. J. Mach. Learn. Res.
Rigollet, P. and Vert, R. (2009). Optimal rates for plug-in estimators of density level sets. Bernoulli.
Sadinle, M., Lei, J. and Wasserman, L. (2018). Least ambiguous set-valued classifiers with bounded error levels. J. Amer. Statist. Assoc.
Singh, A., Nowak, R. and Zhu, J. (2009). Unlabeled data: Now it helps, now it doesn't. In NIPS.
Stone, C. (1977). Consistent nonparametric regression. Ann. Statist.
Stone, C. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist.
Tsybakov, A. (1986). Robust reconstruction of functions by the local-approximation method. Problemy Peredachi Informatsii.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Ann. Statist.
Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist.
Tsybakov, A. (2008). Introduction to Nonparametric Estimation. Springer Ser. Statist. Springer, New York.
van der Vaart, A. (1998). Asymptotic Statistics. Camb. Ser. Stat. Probab. Math. Cambridge University Press, Cambridge.
Vapnik, V. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons, New York.
Varshamov, R. (1957). Estimate of the number of signals in error correcting codes. Dokl. Akad. Nauk SSSR.
Vovk, V. (2002a). Asymptotic optimality of transductive confidence machine. In Algorithmic Learning Theory. Lecture Notes in Comput. Sci.
Vovk, V. (2002b). On-line confidence machines are well-calibrated. In Proceedings of the Forty-Third Annual Symposium on Foundations of Computer Science.
Vovk, V., Gammerman, A. and Shafer, G. (2005). Algorithmic Learning in a Random World. Springer, New York.
Wegkamp, M. and Yuan, M. (2011). Support vector machines with a reject option. Bernoulli.

Appendix A: Technical results
Here we provide the proofs of our results. This appendix is composed of the following parts: in Appendix A we introduce some technical results used in our proofs; Appendix B is devoted to the proofs of the upper bounds; Appendix C provides the proofs of our main lower bounds; finally, in Appendix D we prove the inconsistency of top-β approaches.

In this section we gather several technical results which are used to derive the contributions of this work. Let us start by introducing the notation used in the appendix. Given any two probability measures P₀, P₁ on some measurable space (X, A), the Kullback-Leibler divergence between P₀ and P₁ is defined as
\[
\mathrm{KL}(P_0, P_1) := \begin{cases}\int_{\mathcal{X}}\log\Bigl(\dfrac{dP_0}{dP_1}\Bigr)\,dP_0, & \mathrm{supp}(P_0) \subset \mathrm{supp}(P_1),\\ +\infty, & \text{otherwise},\end{cases} \tag{A.1}
\]
and the total variation distance is defined as
\[
\mathrm{TV}(P_0, P_1) := \sup_{A \in \mathcal{A}}|P_0(A) - P_1(A)|. \tag{A.2}
\]
We start with Fano's inequality in the form proved by Birgé (2005).

Lemma A.1 (Fano's inequality (Birgé, 2005)). Let {P_i}_{i=0}^m be a finite family of probability measures on (X, A) and let {A_i}_{i=0}^m be a finite family of disjoint events such that A_i ∈ A for each i = 0, ..., m. Then,
\[
\min_{i \in \{0,1,\dots,m\}} P_i(A_i) \le 0.71 \vee \frac{\frac{1}{m}\sum_{i=1}^{m}\mathrm{KL}(P_i, P_0)}{\log(m+1)}.
\]

Lemma A.2 (Pinsker's inequality). Given any two probability measures P₀, P₁ on some measurable space (X, A), we have
\[
\mathrm{TV}(P_0, P_1) \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P_0, P_1)}.
\]

Lemma A.3 (Hoeffding's inequality (Hoeffding, 1963)). Let b > 0 be a real number and N a positive integer. Let X₁, ..., X_N be N random variables with values in [0, b]; then
\[
\mathbb{P}\Biggl(\Biggl|\frac{1}{N}\sum_{i=1}^{N}\bigl(X_i - \mathbb{E}[X_i]\bigr)\Biggr| \ge t\Biggr) \le 2\exp\Bigl(-\frac{2N t^2}{b^2}\Bigr), \qquad \forall t > 0.
\]

Proposition A.4 (Properties of the generalized inverse). Let X ∈ R^d and P_X be a Borel measure on R^d, let p : R^d → [0,1]^K be a vector function, and define for all t ∈ [0,1] and all β ∈ (0, K)
\[
G(t) := \sum_{k=1}^{K} P_X\bigl(p_k(X) > t\bigr), \qquad G^{-1}(\beta) := \inf\{t \in [0,1] : G(t) \le \beta\}.
\]
Then,
• for all t ∈ (0,1) and β ∈ (0,K) we have G^{-1}(β) ≤ t ⟺ G(t) ≤ β;
• if for all k ∈ [K] the mappings t ↦ P_X(p_k(X) > t) are continuous on (0,1), then for all β ∈ (0,K) we have G(G^{-1}(β)) = β.

The next result is an analogue of the classical inverse transform theorem (van der Vaart, 1998, Lemma 21.1) and was already established by Denis and Hebiri (2017).
Lemma A.5.
Let ε be uniformly distributed on [K] and let Z₁, ..., Z_K be K real-valued random variables independent of ε, such that the function t ↦ H(t) defined as
\[
H(t) := \frac{1}{K}\sum_{k=1}^{K}\mathbb{P}(Z_k \le t)
\]
is continuous. Consider the random variable Z = Σ_{k=1}^K Z_k 1{ε = k} and let U be distributed according to the uniform distribution on [0,1]. Then H(Z) =_L U and H^{-1}(U) =_L Z, where H^{-1} denotes the generalized inverse of H.

Proof. First we note that for every t ∈ [0,1], P(H(Z) ≤ t) = P(Z ≤ H^{-1}(t)). Moreover, we have
\[
\mathbb{P}(H(Z) \le t) = \sum_{k=1}^{K}\mathbb{P}\bigl(Z \le H^{-1}(t),\ \varepsilon = k\bigr) = \frac{1}{K}\sum_{k=1}^{K}\mathbb{P}\bigl(Z_k \le H^{-1}(t)\bigr) \quad (\varepsilon \text{ independent of the } Z_k) \ = H\bigl(H^{-1}(t)\bigr) = t \quad (H \text{ continuous}).
\]
To conclude the proof, we observe that
\[
\mathbb{P}\bigl(H^{-1}(U) \le t\bigr) = \mathbb{P}\bigl(U \le H(t)\bigr) = \frac{1}{K}\sum_{k=1}^{K}\mathbb{P}(Z_k \le t) = \sum_{k=1}^{K}\mathbb{P}(Z_k \le t,\ \varepsilon = k) = \mathbb{P}(Z \le t).
\]
A quick Monte Carlo check of this lemma is sketched below.
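The following sanity check is our own illustration; the mixture components are arbitrary continuous distributions (our choice, not the paper's). It verifies numerically that H(Z) is uniform on [0, 1]:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
K, n = 5, 100000

# arbitrary continuous choices for the laws of Z_1, ..., Z_K
dists = [stats.norm(loc=k, scale=1.0 + 0.3 * k) for k in range(K)]

eps = rng.integers(K, size=n)                # uniform label on [K]
draws = np.stack([d.rvs(size=n, random_state=rng) for d in dists])
Z = draws[eps, np.arange(n)]                 # Z = Z_eps

def H(t):
    # H(t) = (1/K) * sum_k P(Z_k <= t), the mixture CDF of the lemma
    return np.mean([d.cdf(t) for d in dists], axis=0)

print(stats.kstest(H(Z), "uniform"))   # large p-value: H(Z) ~ U[0, 1]
```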
Appendix B: Upper bounds

In this section we prove Theorems 4.1 and 4.2. It will be clear from our analysis that the proof of Theorem 4.1 follows directly from Theorem 4.2 by setting N = 0 in the statement of Theorem 4.2. Thus, in this section, for simplicity, we omit the subscript SSE from ˆΓ_SSE. Recall that our dataset consists of three parts D_{⌊n/2⌋}, D_{⌈n/2⌉}, D_N. The set D_{⌊n/2⌋} is used to construct an estimator ˆp of the regression function p; that is, ˆp is independent from both D_{⌈n/2⌉} and D_N. The other two sets D_{⌈n/2⌉}, D_N are used in a semi-supervised manner to estimate the threshold; that is, we erase the labels from D_{⌈n/2⌉}. Let β ∈ [K−1] and define for all x ∈ R^d
\[
\hat\Gamma(x) = \bigl\{k \in [K] : \hat p_k(x) \ge \hat G^{-1}(\beta)\bigr\},
\]
with ˆp_k(x) satisfying Assumptions 4.4 and 4.3 for all k ∈ [K]. Moreover, ˆG^{-1}(β) is defined as the generalized inverse of
\[
\hat G(t) = \frac{1}{\lceil n/2\rceil + N}\sum_{X \in \mathcal{D}_N \cup \mathcal{D}_{\lceil n/2\rceil}}\sum_{k=1}^{K}\mathbf{1}\{\hat p_k(X) > t\},
\]
where t ∈ [0,1]. Recall that the β-Oracle is given as
\[
\Gamma^*_\beta(x) = \bigl\{k \in [K] : p_k(x) \ge G^{-1}(\beta)\bigr\}, \tag{B.1}
\]
where G^{-1}(·) is the generalized inverse of G(t) := Σ_{k=1}^K P(p_k(X) ≥ t). Lastly, let us re-introduce an idealized version ˜Γ of the proposed estimator ˆΓ which 'knows' the marginal distribution P_X of the feature vector X ∈ R^d:
\[
\tilde\Gamma(x) = \bigl\{k \in [K] : \hat p_k(x) \ge \tilde G^{-1}(\beta)\bigr\},
\]
with ˜G(t) := Σ_{k=1}^K P_X(ˆp_k(X) > t), conditionally on the data. The following result is needed to relate the threshold ˜G^{-1}(β) of ˜Γ to the true value of the threshold G^{-1}(β).

Lemma B.1 (Upper bound on the thresholds). Let X ∈ R^d and P_X be a Borel measure on R^d. For two vector functions p, ˆp : R^d → [0,1]^K, we define
\[
G(\cdot) := \sum_{k=1}^{K} P_X\bigl(p_k(X) > \cdot\bigr), \qquad \tilde G(\cdot) := \sum_{k=1}^{K} P_X\bigl(\hat p_k(X) > \cdot\bigr).
\]
If for all k ∈ [K] the mapping t ↦ P_X(p_k(X) > t) is continuous on (0,1), then for every β ∈ (0,K)
\[
\bigl|G^{-1}(\beta) - \tilde G^{-1}(\beta)\bigr| \le \|\hat p - p\|_{\infty, P_X}.
\]
Proof.
The proof of this result is very similar to the proof of (Bobkov and Ledoux, 2016, Theorem 2.12). We start by defining the quantity
\[
h^* = \inf\bigl\{h \ge 0 : \forall t \in [0,1],\ \tilde G(t+h) \le G(t) \le \tilde G(t-h)\bigr\}.
\]
By the definition of h^*, we have for all t ∈ [0,1]
\[
\tilde G(t+h^*) \le G(t) \le \tilde G(t-h^*);
\]
applying Proposition A.4 to the second inequality gives, for all t ∈ [0,1], t − h^* ≤ ˜G^{-1}(G(t)). Thus, for t = G^{-1}(β) with β ∈ (0,K), thanks to Proposition A.4 we get
\[
G^{-1}(\beta) - \tilde G^{-1}(\beta) \le h^*.
\]
The inequality ˜G^{-1}(β) − G^{-1}(β) ≤ h^* is obtained in the same way; thus we have proved that |G^{-1}(β) − ˜G^{-1}(β)| ≤ h^*. Finally, notice that for all t ∈ [0,1]
\[
\underbrace{\sum_{k=1}^{K}P_X\bigl(\hat p_k(X) > t + \|\hat p - p\|_{\infty,P_X}\bigr)}_{\tilde G(t+\|\hat p - p\|_{\infty,P_X})} \le \underbrace{\sum_{k=1}^{K}P_X\bigl(p_k(X) > t\bigr)}_{G(t)} \le \underbrace{\sum_{k=1}^{K}P_X\bigl(\hat p_k(X) > t - \|\hat p - p\|_{\infty,P_X}\bigr)}_{\tilde G(t-\|\hat p - p\|_{\infty,P_X})},
\]
where we used the fact that, for all k ∈ [K],
\[
P_X\bigl(\hat p_k(X) > t + |\hat p_k(X)-p_k(X)|\bigr) \le P_X\bigl(p_k(X) > t\bigr) \le P_X\bigl(\hat p_k(X) > t - |\hat p_k(X)-p_k(X)|\bigr),
\]
and P_X(|ˆp_k(X) − p_k(X)| ≤ ‖ˆp − p‖_{∞,P_X}) = 1. Therefore, by the definition of h^*, we can write h^* ≤ ‖ˆp − p‖_{∞,P_X} and we conclude. A quick numerical check of this stability bound is sketched below.
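The toy check below is our illustration (random Dirichlet scores and a coarse grid inversion, both hypothetical choices, so the comparison holds up to the grid resolution): it perturbs p within a sup-norm ball of radius ε and confirms that the two generalized inverses never drift apart by more than ‖ˆp − p‖_∞.

```python
import numpy as np

rng = np.random.default_rng(3)
K, N, beta, eps = 3, 20000, 1.0, 0.02

p = rng.dirichlet(np.ones(K), size=N)              # values p_k(X_i)
p_hat = np.clip(p + rng.uniform(-eps, eps, p.shape), 0.0, 1.0)
sup_dev = np.abs(p_hat - p).max()                  # ||p_hat - p||_inf on the sample

def G_inv(q, beta, grid=np.linspace(0.0, 1.0, 1001)):
    # generalized inverse of G(t) = sum_k P_X(q_k(X) > t), via grid search
    G = np.array([(q > t).sum(axis=1).mean() for t in grid])
    return grid[np.argmax(G <= beta)]

gap = abs(G_inv(p, beta) - G_inv(p_hat, beta))
print(f"|G^-1(beta) - G~^-1(beta)| = {gap:.4f}  <=  {sup_dev:.4f} = sup-norm gap")
```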
We are now in position to prove Theorem 4.2; let us point out that the most difficult part of Theorem 4.2 is the upper bound on the excess risk. The upper bound on the discrepancy follows the same arguments as those used for the excess risk.

Excess risk and discrepancy: to upper-bound the excess risk, we first separate it into two parts as
\[
\mathrm{R}_\beta(\hat\Gamma) - \mathrm{R}_\beta(\Gamma^*_\beta) = \underbrace{\bigl(\mathrm{R}_\beta(\tilde\Gamma) - \mathrm{R}_\beta(\Gamma^*_\beta)\bigr)}_{R_1} + \underbrace{\bigl(\mathrm{R}_\beta(\hat\Gamma) - \mathrm{R}_\beta(\tilde\Gamma)\bigr)}_{R_2}.
\]
Recall that, thanks to Proposition 3.1, we have
\[
R_1 = \sum_{k=1}^{K}\mathbb{E}\Bigl[\bigl|p_k(X) - G^{-1}(\beta)\bigr|\,\mathbf{1}\{k \in \tilde\Gamma(X)\,\triangle\,\Gamma^*_\beta(X)\}\Bigr].
\]
Moreover, if some k ∈ ˜Γ(X) △ Γ*_β(X), then either
\[
\begin{cases}p_k(X) - G^{-1}(\beta) \ge 0,\\ \hat p_k(X) - \tilde G^{-1}(\beta) < 0,\end{cases}
\qquad\text{or}\qquad
\begin{cases}p_k(X) - G^{-1}(\beta) < 0,\\ \hat p_k(X) - \tilde G^{-1}(\beta) \ge 0,\end{cases}
\]
holds. Thus, on the event {k ∈ ˜Γ(X) △ Γ*_β(X)} we have
\[
\bigl|p_k(X) - G^{-1}(\beta)\bigr| \le \bigl|\hat p_k(X) - p_k(X) + G^{-1}(\beta) - \tilde G^{-1}(\beta)\bigr| \le |\hat p_k(X) - p_k(X)| + \bigl|G^{-1}(\beta) - \tilde G^{-1}(\beta)\bigr|.
\]
Therefore, for R₁, using Lemma B.1 and the observations above, we can write
\[
R_1 \le \sum_{k=1}^{K}\mathbb{E}\Bigl[|p_k(X)-G^{-1}(\beta)|\,\mathbf{1}\bigl\{|p_k(X)-G^{-1}(\beta)| \le 2\|\hat p-p\|_{\infty,P_X}\bigr\}\Bigr] \le 2\|\hat p-p\|_{\infty,P_X}\sum_{k=1}^{K}P_X\bigl(|p_k(X)-G^{-1}(\beta)| \le 2\|\hat p-p\|_{\infty,P_X}\bigr);
\]
finally, using the margin Assumption 2.1, we get almost surely w.r.t. the data
\[
R_1 \le c_0\,2^{1+\alpha}\,K\,\|\hat p-p\|^{1+\alpha}_{\infty,P_X}.
\]
Integrating both sides over the data and using Assumption 4.3, we get
\[
\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)} R_1 \le c_0\,2^{1+\alpha}\,C\,K\,\Bigl(\frac{n}{\log n}\Bigr)^{-\frac{(1+\alpha)\gamma}{2\gamma+d}}.
\]
For R₂ the following upper bound holds:
\[
R_2 = \bigl(\mathrm{P}(\hat\Gamma)-\mathrm{P}(\tilde\Gamma)\bigr) + G^{-1}(\beta)\bigl(\mathrm{I}(\hat\Gamma)-\mathrm{I}(\tilde\Gamma)\bigr) = \sum_{k=1}^{K}\mathbb{E}\bigl(p_k(X)-G^{-1}(\beta)\bigr)\bigl(\mathbf{1}\{k\in\tilde\Gamma(X)\}-\mathbf{1}\{k\in\hat\Gamma(X)\}\bigr) \le \underbrace{\sum_{k=1}^{K}\mathbb{E}\bigl|\mathbf{1}\{k\in\tilde\Gamma(X)\}-\mathbf{1}\{k\in\hat\Gamma(X)\}\bigr|}_{\mathbb{E}|\tilde\Gamma(X)\triangle\hat\Gamma(X)|} \tag{B.2}
\]
\[
= \sum_{k=1}^{K}\mathbb{E}\bigl|\mathbf{1}\{\hat p_k(X) \ge \hat G^{-1}(\beta)\}-\mathbf{1}\{\hat p_k(X) \ge \tilde G^{-1}(\beta)\}\bigr|;
\]
now, thanks to the first property of Proposition A.4, we can write
\[
R_2 \le \sum_{k=1}^{K}\mathbb{E}\bigl|\mathbf{1}\{\hat G(\hat p_k(X)) \le \beta\}-\mathbf{1}\{\tilde G(\hat p_k(X)) \le \beta\}\bigr| \le \sum_{k=1}^{K}P_X\Bigl(\bigl|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))\bigr| \ge \bigl|\tilde G(\hat p_k(X))-\beta\bigr|\Bigr).
\]
To finish the proof we make use of the peeling technique of (Audibert and Tsybakov, 2007, Lemma 3.1). That is, we define for δ > 0 and k ∈ [K]
\[
A_k^0 = \bigl\{|\tilde G(\hat p_k(X))-\beta| \le \delta\bigr\}, \qquad A_k^j = \bigl\{2^{j-1}\delta < |\tilde G(\hat p_k(X))-\beta| \le 2^j\delta\bigr\}, \quad j \ge 1.
\]
Since, for every k ∈ [K], the events (A_k^j)_{j≥0} are mutually exclusive, we deduce
\[
\sum_{k=1}^{K}P_X\Bigl(|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))| \ge |\tilde G(\hat p_k(X))-\beta|\Bigr) = \sum_{k=1}^{K}\sum_{j\ge 0}P_X\Bigl(|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))| \ge |\tilde G(\hat p_k(X))-\beta|,\ A_k^j\Bigr). \tag{B.3}
\]
Now, consider ε uniformly distributed on [K], independent of the data and of X. Conditionally on the data and under Assumption 4.4, we apply Lemma A.5 with Z_k = ˆp_k(X) and Z = Σ_{k=1}^K Z_k 1{ε = k}, and obtain that ˜G(Z) is uniformly distributed on [0, K]. Therefore, for all j ≥ 0 and δ > 0, we deduce
\[
\frac{1}{K}\sum_{k=1}^{K}P_X\bigl(|\tilde G(\hat p_k(X))-\beta| \le 2^j\delta\bigr) = P_X\bigl(|\tilde G(Z)-\beta| \le 2^j\delta\bigr) \le \frac{2^{j+1}\delta}{K}.
\]
Hence, for all j ≥ 0, we obtain
\[
\sum_{k=1}^{K}P_X(A_k^j) \le \sum_{k=1}^{K}P_X\bigl(|\tilde G(\hat p_k(X))-\beta| \le 2^j\delta\bigr) \le 2^{j+1}\delta. \tag{B.4}
\]
Next, we observe that for all j ≥ 1
\[
\sum_{k=1}^{K}P_X\bigl(|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))| \ge |\tilde G(\hat p_k(X))-\beta|,\ A_k^j\bigr) \le \sum_{k=1}^{K}P_X\bigl(|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))| \ge 2^{j-1}\delta,\ A_k^j\bigr). \tag{B.5}
\]
Thus, treating the events A_k^0 via Eq. (B.4), we obtain almost surely w.r.t. the data
\[
R_2 \le 2\delta + \sum_{k=1}^{K}\sum_{j\ge 1}P_X\bigl(|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))| \ge 2^{j-1}\delta,\ A_k^j\bigr).
\]
Integrating both sides with respect to the data, and recalling that the indicator 1{A_k^j} is, for all j ≥ 0 and k ∈ [K], independent of D_{⌈n/2⌉} and D_N, we can write
\[
\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N,X\sim P_X)}\Bigl[\mathbf{1}_{\{|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))|\ge 2^{j-1}\delta\}}\mathbf{1}_{\{A_k^j\}}\Bigr] = \mathbb{E}_{(\mathcal{D}_{\lfloor n/2\rfloor},X\sim P_X)}\Bigl[\mathbb{E}_{(\mathcal{D}_{\lceil n/2\rceil},\mathcal{D}_N)}\bigl[\mathbf{1}_{\{|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))|\ge 2^{j-1}\delta\}}\ \big|\ \mathcal{D}_{\lfloor n/2\rfloor},X\bigr]\,\mathbf{1}_{\{A_k^j\}}\Bigr].
\]
Now, since conditionally on (D_{⌊n/2⌋}, X) the quantity ˆG(ˆp_k(X)) is an empirical mean of i.i.d. random variables with common mean ˜G(ˆp_k(X)) and values in [0, K], we deduce from Hoeffding's inequality (Lemma A.3) that
\[
\mathbb{E}_{(\mathcal{D}_{\lceil n/2\rceil},\mathcal{D}_N)}\bigl[\mathbf{1}_{\{|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))|\ge 2^{j-1}\delta\}}\ \big|\ \mathcal{D}_{\lfloor n/2\rfloor},X\bigr] \le 2\exp\Bigl(-\frac{2(N+\lceil n/2\rceil)\,\delta^2\,4^{j-1}}{K^2}\Bigr).
\]
Therefore, combining the inequalities of Eqs. (B.3), (B.4) and (B.5), we get
\[
\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)} R_2 \le 2\delta + 2\delta\sum_{j\ge 1}2^{j+2}\exp\Bigl(-\frac{2(N+\lceil n/2\rceil)\,\delta^2\,4^{j-1}}{K^2}\Bigr).
\]
Finally, choosing δ = K/√(N+⌈n/2⌉) in the above inequality finishes the proof.

Hamming risk: here we provide an upper bound on the Hamming risk. First, by the triangle inequality, we can write for the proposed estimator ˆΓ and the pseudo-oracle ˜Γ
\[
\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\mathbb{E}_{X\sim P_X}\bigl|\hat\Gamma(X)\,\triangle\,\Gamma^*_\beta(X)\bigr| \le \mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\mathbb{E}_{X\sim P_X}\bigl|\tilde\Gamma(X)\,\triangle\,\Gamma^*_\beta(X)\bigr| + \mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\mathbb{E}_{X\sim P_X}\bigl|\hat\Gamma(X)\,\triangle\,\tilde\Gamma(X)\bigr|.
\]
Notice that for the term E|ˆΓ(X) △ ˜Γ(X)| we can re-use the proof technique applied to the term R₂ in Eq. (B.2). Thus, it remains to upper-bound the term E|˜Γ(X) △ Γ*_β(X)|. This part of the proof closely follows the machinery used in Denis and Hebiri (2017); let us mention, however, that they used this method to obtain a bound on the discrepancy, which leads to a sub-optimal rate, whereas their approach gives the correct rate if, instead of the discrepancy, we bound the Hamming distance.
For the sake of completeness we write the principal parts of the proof here. First of all, by the definition of the sets Γ*_β and ˜Γ, we can write for (∗) = E_{X∼P_X}|˜Γ(X) △ Γ*_β(X)| that
\[
(*) = \sum_{k=1}^{K}\mathbb{E}_{X\sim P_X}\bigl|\mathbf{1}\{\hat p_k(X) \ge \tilde G^{-1}(\beta)\}-\mathbf{1}\{p_k(X) \ge G^{-1}(\beta)\}\bigr|.
\]
Now, if ˆp_k(X) ≥ ˜G^{-1}(β) and p_k(X) < G^{-1}(β), we can have the following situations:
• if ˜G^{-1}(β) > G^{-1}(β), then |p_k(X) − G^{-1}(β)| ≤ |ˆp_k(X) − p_k(X)|;
• if ˜G^{-1}(β) ≤ G^{-1}(β), then either |p_k(X) − G^{-1}(β)| ≤ |ˆp_k(X) − p_k(X)| or ˆp_k(X) ∈ (˜G^{-1}(β), G^{-1}(β)).
Similar conditions are satisfied if ˆp_k(X) < ˜G^{-1}(β) and p_k(X) ≥ G^{-1}(β). Using the above arguments we can upper-bound (∗) as
\[
(*) \le \sum_{k=1}^{K}P_X\bigl(|p_k(X)-G^{-1}(\beta)| \le |\hat p_k(X)-p_k(X)|\bigr) + \mathbf{1}\{\tilde G^{-1}(\beta) \le G^{-1}(\beta)\}\sum_{k=1}^{K}P_X\bigl(\tilde G^{-1}(\beta) < \hat p_k(X) < G^{-1}(\beta)\bigr) + \mathbf{1}\{G^{-1}(\beta) < \tilde G^{-1}(\beta)\}\sum_{k=1}^{K}P_X\bigl(G^{-1}(\beta) < \hat p_k(X) < \tilde G^{-1}(\beta)\bigr)
\]
\[
= \sum_{k=1}^{K}P_X\bigl(|p_k(X)-G^{-1}(\beta)| \le |\hat p_k(X)-p_k(X)|\bigr) + \bigl|\tilde G\bigl(\tilde G^{-1}(\beta)\bigr)-\tilde G\bigl(G^{-1}(\beta)\bigr)\bigr|.
\]
Thanks to the continuity Assumption 4.4 on the estimator and the continuity Assumption 1.1 on the distribution, we clearly have ˜G(˜G^{-1}(β)) = β = G(G^{-1}(β)). Moreover, we can write
\[
\bigl|\tilde G\bigl(\tilde G^{-1}(\beta)\bigr)-\tilde G\bigl(G^{-1}(\beta)\bigr)\bigr| = \bigl|G\bigl(G^{-1}(\beta)\bigr)-\tilde G\bigl(G^{-1}(\beta)\bigr)\bigr| \le \sum_{k=1}^{K}\mathbb{E}_{X\sim P_X}\bigl|\mathbf{1}\{\hat p_k(X) \ge G^{-1}(\beta)\}-\mathbf{1}\{p_k(X) \ge G^{-1}(\beta)\}\bigr| \le \sum_{k=1}^{K}P_X\bigl(|p_k(X)-G^{-1}(\beta)| \le |\hat p_k(X)-p_k(X)|\bigr).
\]
Thus, our bound reads
\[
(*) \le 2\sum_{k=1}^{K}P_X\bigl(|p_k(X)-G^{-1}(\beta)| \le |\hat p_k(X)-p_k(X)|\bigr).
\]
Finally, in order to upper-bound the term above, one can use the peeling argument of Audibert and Tsybakov (2007) applied with the exponential concentration inequality provided by Assumption 4.3. We omit this part of the proof here and refer the reader to Denis and Hebiri (2017) or to Audibert and Tsybakov (2007) for the complete argument. Let us emphasize that the argument above is only possible thanks to the continuity Assumptions 1.1 and 4.4 on the distribution and the estimator, respectively.
Appendix C: Proof of the lower bounds
This section is devoted to the proofs of the lower bounds provided by Theorems 3.4-3.5. Before proceeding to the proofs, let us briefly sketch the high-level strategy used in this work. In order to prove the lower bounds of Theorems 3.4-3.5, we actually prove two separate lower bounds on the minimax risk: clearly, if some non-negative quantity is lower-bounded by two different values, then it is lower-bounded by the maximum of the two. The two lower bounds that we prove are naturally connected with the proposed two-step estimator: the first is connected with the problem of non-parametric estimation of p_k for all k ∈ [K], while the second describes the estimation of the unknown threshold G^{-1}(β). In particular, the first lower bound is closely related to those provided in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009), though, crucially, the continuity Assumption 1.1 makes the proof more involved. The second lower bound is based on testing two hypotheses and is derived by constructing two different marginal distributions of X ∈ R^d for a fixed regression vector p(·). In this part we make use of Pinsker's inequality, recalled in Lemma A.2.

In order to discriminate between supervised and semi-supervised procedures, we make use of Definition 1.4. Notice that, thanks to Definition 1.4, every supervised procedure is not 'sensitive' to the expectation taken w.r.t. the unlabeled dataset D_N; that is, randomness is only induced by the labeled dataset D_n. This observation allows us to eliminate the dependence of the lower bound on the size of the unlabeled dataset D_N for supervised procedures. Indeed, let ˆΓ be any supervised estimator in the sense of Definition 1.4; then for any real-valued function Z of confidence sets we have
\[
\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\bigl[\mathbb{E}_{P_X}Z(\hat\Gamma(X;\mathcal{D}_n,\mathcal{D}_N))\bigr] = \mathbb{E}_{\mathcal{D}_n}\bigl[\mathbb{E}_{P_X}Z(\hat\Gamma(X;\mathcal{D}_n,\mathcal{D}'_N))\bigr],
\]
with D′_N being an arbitrary set of N points in R^d.

C.1. Part I: (N + n)^{-1/2}

Here we prove that the rate (N+n)^{-1/2} is optimal for semi-supervised methods; as already mentioned, the rate for supervised methods is obtained by formally setting N = 0. The constants C′, C, c are always assumed to be independent of N, n and can differ from line to line. Let us fix β ∈ {1, ..., ⌊K/2⌋ − 1} and K ≥ 5. For a positive constant C, taken small enough for the final bounds below, we set
\[
\kappa_{N,n} = C\,(N+n)^{-1/2}.
\]
To prove the lower bound, we construct two distributions P₀ and P₁ on R^d sharing the same regression function p(·) = (p₁(·), ..., p_K(·))^⊤ and with different marginals admitting densities µ₀, µ₁. First, for a fixed parameter ρ > 0 and radii 0 < r₀ < r₁ < r₂ < r₃ < r₄ to be specified, we define the sets
\[
\mathcal{X}_0 = \{x \in \mathbb{R}^d : \|x\| \le r_0\}, \qquad \mathcal{X}_i = \{x \in \mathbb{R}^d : \|x - o_i\| \le \rho/2\},\ i = 1,2,3, \qquad \mathcal{X}_4 = \{x \in \mathbb{R}^d : r_3 \le \|x\| \le r_4\},
\]
where o_i = (r_i + ρ, 0, ..., 0)^⊤ ∈ R^d for i = 1, 2, 3.

[Fig 1: Bump function x ↦ ψ_{a,b}(x); this function is supported on (a, b) and is infinitely smooth.]

Using these sets we define the regression vector as
\[
p_1(x) = \dots = p_\beta(x) = \begin{cases}\dfrac{1}{\beta}-\dfrac{\varphi_0(x)}{2\beta}, & x \in \mathcal{X}_0,\\[4pt] \dfrac{K+2\beta}{2K\beta}-\dfrac{\varphi_1(x)}{2\beta}, & x \in \mathcal{X}_1,\\[4pt] \dfrac{K+4\beta}{2K\beta}-\dfrac{\varphi_2(x)}{2\beta}, & x \in \mathcal{X}_2,\\[4pt] \dfrac{K+6\beta}{2K\beta}-\dfrac{\varphi_3(x)}{2\beta}, & x \in \mathcal{X}_3,\\[4pt] \dfrac{1}{K}-\dfrac{\varphi_4(x)}{2\beta}, & x \in \mathcal{X}_4,\end{cases}
\qquad
p_{\beta+1}(x) = \dots = p_K(x) = \frac{1-\beta\,p_1(x)}{K-\beta},
\]
so that the components always sum to one. In order to define the functions ϕ_i for i = 0, ..., 4, we first set, for a < b,
\[
\psi_{a,b}(x) = \begin{cases}\exp\Bigl(-\dfrac{1}{(b-x)(x-a)}\Bigr), & x \in (a,b),\\ 0, & \text{otherwise}.\end{cases}
\]
Figure 1 illustrates the behavior of ψ_{a,b} in one dimension; note that for every a, b ∈ R this function is infinitely smooth. Using ψ_{a,b}, we define the functions ϕ_i for i = 0, ..., 4 as
\[
\varphi_0(x) = C'\Bigl(\tfrac{K-\beta}{2K\beta}\wedge\tfrac{1}{2K}\Bigr)\psi_{-r_0,r_0}(\|x\|), \qquad \varphi_i(x) = C'\rho^{\gamma}\Bigl(\tfrac{K-\beta}{2K\beta}\wedge\tfrac{1}{2K}\Bigr)\Bigl(\tfrac{\|x-o_i\|}{\rho}\Bigr)^{2\lceil\gamma\rceil}\psi_{-\frac12,\frac12}\Bigl(\tfrac{\|x-o_i\|}{\rho}\Bigr),\ i = 1,3,
\]
\[
\varphi_2(x) = C'\rho^{\gamma}\Bigl(\tfrac{K-\beta}{2K\beta}\wedge\tfrac{1}{2K}\Bigr)\psi_{-\frac12,\frac12}\Bigl(\tfrac{\|x-o_2\|}{\rho}\Bigr), \qquad \varphi_4(x) = C'\Bigl(\tfrac{K-\beta}{2K\beta}\wedge\tfrac{1}{2K}\Bigr)\psi_{r_3,r_4}(\|x\|),
\]
where the constant C′ ≤ 1 is chosen so that the functions ϕ_i, i = 0, ..., 4, are (γ, L)-Hölder. Let us point out that such a value of C′ exists and is independent of n, N: indeed, the mapping x ↦ C′‖x‖^{2⌈γ⌉}ψ_{−1/2,1/2}(‖x‖) is infinitely smooth, and thus (γ, L)-Hölder for a properly chosen C′. Figure 2 demonstrates the behavior of the considered construction in one dimension.

[Fig 2: Dumped bump function x ↦ (x − (a+b)/2)^{2⌈γ⌉} ψ_{a,b}(x); this function behaves as a polynomial of even degree 2⌈γ⌉ in the vicinity of (a+b)/2, while being infinitely smooth and supported on (a, b). If we select a measure which is supported in the vicinity of (a+b)/2, the function on the plot is essentially polynomial w.r.t. such a measure.]

Since β < K/2, the constant levels of the last K − β components are all below the constant levels of the first β components, and the latter are ordered as
\[
\frac{1}{K} < \frac{K+2\beta}{2K\beta} < \frac{K+4\beta}{2K\beta} < \frac{K+6\beta}{2K\beta} < \frac{1}{\beta},
\]
which will help us to ensure that the thresholds under P₀ and P₁ are (K+2β)/(2Kβ) and (K+6β)/(2Kβ), respectively. Now, we define the two marginal distributions µ₀, µ₁ by their densities as
\[
\mu_0(x) = \begin{cases}\dfrac{1/2}{\mathrm{Leb}(\mathcal{X}_0)}, & x \in \mathcal{X}_0,\\[2pt] \dfrac{\kappa_{N,n}}{\mathrm{Leb}(\mathcal{X}_i)}, & x \in \mathcal{X}_i,\ i = 1,2,3,\\[2pt] \dfrac{1/2-3\kappa_{N,n}}{\mathrm{Leb}(\mathcal{X}_4)}, & x \in \mathcal{X}_4,\end{cases}
\qquad
\mu_1(x) = \begin{cases}\dfrac{1/2-3\kappa_{N,n}}{\mathrm{Leb}(\mathcal{X}_0)}, & x \in \mathcal{X}_0,\\[2pt] \dfrac{\kappa_{N,n}}{\mathrm{Leb}(\mathcal{X}_i)}, & x \in \mathcal{X}_i,\ i = 1,2,3,\\[2pt] \dfrac{1/2}{\mathrm{Leb}(\mathcal{X}_4)}, & x \in \mathcal{X}_4,\end{cases}
\]
and both µ₀, µ₁ are equal to zero in the unspecified regions. Clearly, the strong density assumption is satisfied on X₀ and X₄, since there the density is lower- and upper-bounded by constants independent of both N and n. The parameter ρ is chosen such that the strong density assumption also holds on X_i for i = 1, 2, 3; since Leb(X_i) = cρ^d for some constant c > 0 independent of N, n, we set ρ = C(N+n)^{-1/(2d)}. For these hypotheses one can check that the thresholds G₀^{-1}(β), G₁^{-1}(β) and the optimal β-sets Γ*₀, Γ*₁ are given as
\[
G_0^{-1}(\beta) = \frac{K+2\beta}{2K\beta}, \qquad G_1^{-1}(\beta) = \frac{K+6\beta}{2K\beta},
\]
\[
\Gamma^*_0(x) = \begin{cases}\{1,\dots,\beta\}, & x \in \mathcal{X}_0,\\ \emptyset, & \text{otherwise},\end{cases}
\qquad
\Gamma^*_1(x) = \begin{cases}\{1,\dots,\beta\}, & x \in \mathcal{X}_0\cup\mathcal{X}_2\cup\mathcal{X}_3,\\ \emptyset, & \text{otherwise}.\end{cases}
\]
The margin assumption: we are in position to check the margin Assumption 2.1. Let t₀ = ((K−β)/(2Kβ)) ∧ (1/(2K)); then for every k ∈ {β+1, ..., K} and every t ≤ t₀ we have
\[
P_0\bigl(|p_k(X)-G_0^{-1}(\beta)| \le t\bigr) = 0, \qquad P_1\bigl(|p_k(X)-G_1^{-1}(\beta)| \le t\bigr) = 0;
\]
moreover, for every k ∈ {1, ..., β} and every t ≤ t₀ we can write
\[
P_0\bigl(|p_k(X)-G_0^{-1}(\beta)| \le t\bigr) = P_0\Bigl(C'\rho^{\gamma}\Bigl(\tfrac{K-\beta}{2K\beta}\wedge\tfrac{1}{2K}\Bigr)\Bigl(\tfrac{\|X-o_1\|}{\rho}\Bigr)^{2\lceil\gamma\rceil}\psi_{-\frac12,\frac12}\Bigl(\tfrac{\|X-o_1\|}{\rho}\Bigr) \le 2\beta t,\ X \in \mathcal{X}_1\Bigr),
\]
and the analogous identity holds under P₁ with o₃ and X₃. Hence, for the hypothesis P₀, there exists c independent of N, n such that
\[
P_0\bigl(|p_k(X)-G_0^{-1}(\beta)| \le t\bigr) \le P_0\Bigl(\Bigl(\tfrac{\|X-o_1\|}{\rho}\Bigr)^{2\lceil\gamma\rceil}\psi_{-\frac12,\frac12}\Bigl(\tfrac{\|X-o_1\|}{\rho}\Bigr) \le c\,\rho^{-\gamma}t,\ X \in \mathcal{X}_1\Bigr).
\]
Therefore, using the strong density assumption and the change of variables x ↦ x/ρ, we can write
\[
P_0\bigl(|p_k(X)-G_0^{-1}(\beta)| \le t\bigr) \le C\rho^{d}\int_{\|x\|\le 1/2}\mathbf{1}\Bigl\{\|x\|^{2\lceil\gamma\rceil}\psi_{-\frac12,\frac12}(\|x\|) \le c\,\rho^{-\gamma}t\Bigr\}\,dx.
\]
Finally, notice that there exists C₁ > 0 such that, for every x ∈ R^d with ‖x‖ ≤ 1/4, we have ψ_{−1/2,1/2}(‖x‖) ≥ ψ_{−1/2,1/2}(1/4) ≥ C₁,
which implies that, for some positive constants C, C′ independent of N, n, we can write
\[
P_0\bigl(|p_k(X)-G_0^{-1}(\beta)| \le t\bigr) \le C\rho^{d}\int_{\|x\|\le 1/2}\mathbf{1}\bigl\{\|x\|^{2\lceil\gamma\rceil} \le C'\rho^{-\gamma}t\bigr\}\,dx = C\rho^{d}\int_{\|x\|\le 1/2}\mathbf{1}\bigl\{\|x\| \le C'^{1/(2\lceil\gamma\rceil)}\rho^{-\gamma/(2\lceil\gamma\rceil)}t^{1/(2\lceil\gamma\rceil)}\bigr\}\,dx \le C\rho^{d(1-\gamma/(2\lceil\gamma\rceil))}\,t^{d/(2\lceil\gamma\rceil)}.
\]
This implies that, as long as α ≤ d/(2⌈γ⌉) (and since γ ≤ ⌈γ⌉), the margin assumption is satisfied; moreover, these conditions imply αγ ≤ d, which we will also require while proving the supervised part of the rate. The same reasoning can be carried out for the first hypothesis P₁ on the set X₃. Finally, the parameters r₀, r₁, r₂, r₃, r₄ are chosen as constants, independent of n, N, such that there exists a smooth connection between the parts of the regression functions p_k(·) defined on X₀, X₁, X₂, X₃, X₄. Notice that such a choice is possible since, by the construction of the functions ϕ_i for i = 0, ..., 4, the regression function is constant near the boundary of each of X₀, ..., X₄; thus, in the region R^d \ ∪_{i=0}^4 X_i it suffices to connect four different constants smoothly. We avoid this complication here and hope that the guidelines provided above are sufficient for the understanding. Notice also that the constructed distributions satisfy Assumption 1.1, since the measures are supported on X₀, ..., X₄ and the regression functions on these sets are not concentrated around any constant.

Before proceeding to the final stage of the proof, let us mention that in what follows we use the de Finetti notation (de Finetti, 1972, 1974), which is common in probability: given a probability measure P on some measurable space (Ω, A) and a measurable function X : (Ω, A) → (R, Borel(R)), we write P[X] := E[X].

Bound on the KL-divergence: we start by computing the KL-divergence between µ₀ and µ₁. Since µ₀ = µ₁ on X₁, X₂, X₃, only X₀ and X₄ contribute:
\[
\mathrm{KL}(\mu_0,\mu_1) = \int_{\mathbb{R}^d}\mu_0(x)\log\Bigl(\frac{\mu_0}{\mu_1}\Bigr)dx = \frac{1}{2}\log\Bigl(\frac{1/2}{1/2-3\kappa_{N,n}}\Bigr) + \Bigl(\frac{1}{2}-3\kappa_{N,n}\Bigr)\log\Bigl(\frac{1/2-3\kappa_{N,n}}{1/2}\Bigr) = -3\kappa_{N,n}\log(1-6\kappa_{N,n}) \le 36\,\kappa^2_{N,n},
\]
where the last inequality holds for κ_{N,n} ≤ 1/12.

Lower bound for the Hamming risk: first, let us introduce the following notation for i = 0, 1:
\[
\mathrm{H}(\hat\Gamma,\Gamma^*_i) := \mu_i\bigl|\hat\Gamma(X)\,\triangle\,\Gamma^*_i(X)\bigr|.
\]
Recall that we are interested in the quantity
\[
\inf_{\hat\Gamma}\sup_{P\in\mathcal{P}}\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\mathbb{E}_{X\sim P_X}\bigl|\hat\Gamma(X)\,\triangle\,\Gamma^*_\beta(X)\bigr|;
\]
since the hypotheses P₀, P₁ ∈ P, we can write
\[
2\sup_{P\in\mathcal{P}}\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\mathbb{E}_{X\sim P_X}\bigl|\hat\Gamma(X)\,\triangle\,\Gamma^*_\beta(X)\bigr| \ge (*), \qquad (*) = \mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\,\mathrm{H}(\hat\Gamma,\Gamma^*_0) + \mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\,\mathrm{H}(\hat\Gamma,\Gamma^*_1);
\]
thus, for the Hamming risk we can write
\[
(*) \ge \mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\Biggl(\frac{d\mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}}{d\mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}}\wedge 1\Biggr)\bigl(\mathrm{H}(\hat\Gamma,\Gamma^*_0)+\mathrm{H}(\hat\Gamma,\Gamma^*_1)\bigr).
\]
We now focus on the sum of the two Hamming differences appearing on the right-hand side. By the triangle inequality and since the two oracles disagree exactly on the first β labels over X₂ ∪ X₃, where µ₀ = µ₁,
\[
\mathrm{H}(\hat\Gamma,\Gamma^*_0)+\mathrm{H}(\hat\Gamma,\Gamma^*_1) \ge \mu_0\Bigl(\frac{d\mu_1}{d\mu_0}\wedge 1\Bigr)\sum_{k=1}^{K}\mathbf{1}\{k \in \Gamma^*_0(X)\,\triangle\,\Gamma^*_1(X)\} = 2\beta\,\mu_0\bigl(\mathbf{1}_{\mathcal{X}_2}+\mathbf{1}_{\mathcal{X}_3}\bigr) = 2\beta\,P_0(\mathcal{X}_2\cup\mathcal{X}_3) = 4\beta\kappa_{n,N}.
\]
Substituting this lower bound into the initial inequality, we arrive at
\[
(*) \ge 4\beta\kappa_{n,N}\Bigl(1-\mathrm{TV}\bigl(\mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n},\ \mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\bigr)\Bigr) = 4\beta\kappa_{n,N}\Bigl(1-\mathrm{TV}\bigl(\mu_0^{\otimes(n+N)},\mu_1^{\otimes(n+N)}\bigr)\Bigr) \ge 4\beta\kappa_{n,N}\Biggl(1-\sqrt{\tfrac{1}{2}\mathrm{KL}\bigl(\mu_0^{\otimes(n+N)},\mu_1^{\otimes(n+N)}\bigr)}\Biggr) \ge 4\beta\kappa_{n,N}\bigl(1-\sqrt{18}\,\kappa_{n,N}\sqrt{n+N}\bigr),
\]
by Pinsker's inequality, which implies the desired lower bound on the Hamming risk for C small enough.

Lower bound for the β excess risk: this part is analogous to the case of the Hamming distance. Recall that for every ˆΓ we have, for i = 0, 1,
\[
\mathrm{D}(\hat\Gamma,\Gamma^*_i) := \mathrm{R}_\beta(\hat\Gamma)-\mathrm{R}_\beta(\Gamma^*_i) = \mu_i\sum_{k=1}^{K}\bigl|p_k(X)-G_i^{-1}(\beta)\bigr|\,\mathbf{1}\{k\in\hat\Gamma(X)\,\triangle\,\Gamma^*_i(X)\}.
\]
Again, since the hypotheses P₀, P₁ ∈ P, we can write
\[
2\sup_{P\in\mathcal{P}}\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\bigl[\mathrm{R}_\beta(\hat\Gamma)\bigr]-\mathrm{R}_\beta(\Gamma^*_\beta) \ge (**), \qquad (**) = \mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\,\mathrm{D}(\hat\Gamma,\Gamma^*_0)+\mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\,\mathrm{D}(\hat\Gamma,\Gamma^*_1),
\]
and, as before,
\[
(**) \ge \mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\Biggl(\frac{d\mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}}{d\mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}}\wedge 1\Biggr)\bigl(\mathrm{D}(\hat\Gamma,\Gamma^*_0)+\mathrm{D}(\hat\Gamma,\Gamma^*_1)\bigr).
\]
Restricting both terms to X₂, where µ₀ = µ₁ and where the two oracles disagree on the first β labels, and using the triangle inequality together with the fact that |p_k(x) − (K+2β)/(2Kβ)| ∧ |p_k(x) − (K+6β)/(2Kβ)| ≥ (K−β)/(16Kβ) for all x ∈ X₂ and k ≤ β (the perturbation ϕ₂ being bounded accordingly), we obtain
\[
\mathrm{D}(\hat\Gamma,\Gamma^*_0)+\mathrm{D}(\hat\Gamma,\Gamma^*_1) \ge 2\beta\,\frac{K-\beta}{16K\beta}\,\mu_0(\mathcal{X}_2) = \frac{K-\beta}{8K}\,\kappa_{n,N}.
\]
Thus,
\[
(**) \ge \frac{K-\beta}{8K}\,\kappa_{n,N}\Bigl(1-\mathrm{TV}\bigl(\mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n},\ \mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\bigr)\Bigr) \ge \frac{K-\beta}{8K}\,\kappa_{n,N}\Biggl(1-\sqrt{\tfrac{1}{2}\mathrm{KL}\bigl(\mu_0^{\otimes(n+N)},\mu_1^{\otimes(n+N)}\bigr)}\Biggr) \ge \frac{K-\beta}{8K}\,\kappa_{n,N}\bigl(1-\sqrt{18}\,\kappa_{n,N}\sqrt{n+N}\bigr),
\]
which concludes the first part of the lower bounds. The arithmetic behind the choice of κ_{N,n} is recapped below.
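For completeness, the arithmetic that turns the KL bound above into the (N+n)^{-1/2} rate can be summarized as follows (a recap with the constants as reconstructed above; only the order in κ_{N,n} matters):
\[
\mathrm{KL}\bigl(\mu_0^{\otimes(n+N)},\mu_1^{\otimes(n+N)}\bigr) = (n+N)\,\mathrm{KL}(\mu_0,\mu_1) \le 36\,(n+N)\,\kappa_{N,n}^2,
\]
so that, by Pinsker's inequality (Lemma A.2),
\[
\mathrm{TV}\bigl(\mu_0^{\otimes(n+N)},\mu_1^{\otimes(n+N)}\bigr) \le \sqrt{18\,(n+N)}\,\kappa_{N,n} = \sqrt{18}\,C.
\]
Hence, with κ_{N,n} = C(N+n)^{-1/2} and C < 1/√18, the factors 1 − TV in both two-point bounds stay bounded away from zero, and both lower bounds scale as κ_{N,n} ≍ (N+n)^{-1/2}.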
C.2. Part II: n^{−αγ/(2γ+d)}

In this section we prove that, in the case of the Hamming risk Ψ^H, the rate n^{−αγ/(2γ+d)} is minimax optimal. Notice that, thanks to Proposition 3.1, a lower bound of order n^{−αγ/(2γ+d)} on the Hamming risk Ψ^H immediately implies a lower bound of order n^{−(α+1)γ/(2γ+d)} on both Ψ^E and Ψ^D. The proof is based on a reduction of the Hamming risk to a multiple hypothesis testing problem and an application of Fano's inequality in the form provided by Birgé (2005), recalled in Lemma A.1.

[Fig 3: Integrated bump x ↦ ∫_x^∞ ψ_{a,b}(t) dt / ∫_a^b ψ_{a,b}(t) dt; this function is infinitely smooth and is equal to one or zero only outside of the interval (a, b).]

Assume that K ≥ 4 and β ∈ {2, ..., (K−2) ∧ ⌊K/2⌋}, and define the regular grid on [0,1]^d as
\[
G_q := \Bigl\{\Bigl(\tfrac{2k_1+1}{2q},\dots,\tfrac{2k_d+1}{2q}\Bigr)^{\!\top} : k_i \in \{0,\dots,q-1\},\ i = 1,\dots,d\Bigr\},
\]
and denote by n_q(x) ∈ G_q the closest point of the grid G_q to the point x ∈ R^d. Such a grid defines a partition of the unit cube [0,1]^d ⊂ R^d, denoted by X′₁, ..., X′_{q^d}. Besides, denote X′_{−j} := {x ∈ R^d : −x ∈ X′_j} for all j = 1, ..., q^d. For a fixed integer m ≤ q^d and any j ∈ {1, ..., m}, define X_j := X′_j and X_{−j} := X′_{−j}. Additionally, we introduce the set X₀ = B(0, (4q)^{−1}). For every w ∈ W := {−1, 1}^m we build a distribution P_w ∈ P_W such that the marginal distribution P_{w,X} is independent of w ∈ {−1, 1}^m and the regression vector (p^w_1(x), ..., p^w_K(x)) is constructed as
\[
p^w_1(x) = \dots = p^w_{\beta-1}(x) = v + c' + \frac{g(x)}{\beta-1}, \qquad p^w_{\beta+2}(x) = \dots = p^w_K(x) = v - \frac{(\beta-1)c'}{K-\beta-1} - \frac{g(x)}{K-\beta-1},
\]
\[
p^w_\beta(x) = \begin{cases}v+\phi(x), & x \in \mathcal{X}_0,\\ v+w_i\,\varphi(x-n_q(x)), & x \in \mathcal{X}_i,\\ v-w_i\,\varphi(x-n_q(x)), & x \in \mathcal{X}_{-i},\\ v, & x \in B(0,\sqrt d)\setminus\bigl(\bigcup_{i=-m,\,i\neq 0}^{m}\mathcal{X}_i\bigr),\\ v+g(x), & x \in \mathbb{R}^d\setminus B(0,\sqrt d+\rho),\\ v+\xi(x), & x \in B(0,\sqrt d+\rho)\setminus B(0,\sqrt d),\end{cases}
\qquad
p^w_{\beta+1}(x) = \begin{cases}v-\phi(x), & x \in \mathcal{X}_0,\\ v-w_i\,\varphi(x-n_q(x)), & x \in \mathcal{X}_i,\\ v+w_i\,\varphi(x-n_q(x)), & x \in \mathcal{X}_{-i},\\ v, & x \in B(0,\sqrt d)\setminus\bigl(\bigcup_{i=-m,\,i\neq 0}^{m}\mathcal{X}_i\bigr),\\ v-g(x), & x \in \mathbb{R}^d\setminus B(0,\sqrt d+\rho),\\ v-\xi(x), & x \in B(0,\sqrt d+\rho)\setminus B(0,\sqrt d),\end{cases}
\]
where v ∈ [0,1] and the functions g : R^d → R and φ, ϕ, ξ : R^d → R₊ are specified below. The constants v, c′ are set as
\[
v = \frac{1}{K}, \qquad c' = \frac{K-\beta-1}{(\beta-1)K^2}.
\]
The function ξ is constructed as
\[
\xi(x) = v\,\bar u\Bigl(\frac{\|x\|-\sqrt d}{\rho}\Bigr), \qquad \bar u(x) = 1 - \frac{\int_x^{\infty}\psi_{0,1}(t)\,dt}{\int_0^1\psi_{0,1}(t)\,dt};
\]
the function ū is infinitely many times differentiable, equal to zero on (−∞, 0] and to one on [1, +∞). Figure 3 shows the behavior of 1 − ū. Taking the constant ρ > 0 independent of N, n, we can ensure that the function ξ is (γ, L)-Hölder. The function φ is constructed similarly to the previous part of the proof; that is, we choose
\[
\phi(x) = C_\phi\,(2q)^{-\gamma}\bigl(2q\|x\|\bigr)^{2\lceil\gamma\rceil}\psi_{-\frac12,\frac12}\bigl(2q\|x\|\bigr),
\]
with C_φ sufficiently small so that φ(·) is (γ, L)-Hölder and upper-bounded by c′/2 ∧ v/4. For the function ϕ we consider the construction
\[
\varphi(x) = C_\varphi\,q^{-\gamma}\Bigl(u\bigl(q\|x\|\bigr) + \psi_{-\frac12,\frac12}\bigl(q\|x\|\bigr)\Bigr), \qquad u(x) = \frac{\int_x^{\infty}\psi_{\frac14,\frac12}(t)\,dt}{\int_{1/4}^{1/2}\psi_{\frac14,\frac12}(t)\,dt}.
\]

[Fig 4: The function x ↦ u(|x|) + ψ_{−1/2,1/2}(x); this function is infinitely smooth and nowhere concentrates at any constant on (−1/2, 1/2).]

Figure 4 explains the behavior of this function and helps towards a better understanding of our results. The constant C_ϕ is chosen in such a way that the constructed function ϕ(·) is (γ, L)-Hölder and upper-bounded by c′/2 ∧ v/4. Notice that the function ϕ satisfies, for all x ∈ B(0, (4q)^{−1}) and some absolute constant c > 0,
\[
c\,C_\varphi\,q^{-\gamma} \le \varphi(x) \le C_\varphi\,q^{-\gamma}\bigl(1+\psi_{-\frac12,\frac12}(0)\bigr) \le 2\,C_\varphi\,q^{-\gamma}.
\]
Finally, the function g is any (γ, L)-Hölder function with sufficiently bounded variation which is not concentrated around any constant, for example
\[
g(x) = C_g\,\bar u\bigl(\|x\|-\sqrt d-\rho\bigr)\cos\bigl(\|x\|-\sqrt d-\rho\bigr),
\]
with C_g chosen small enough to ensure that g is (γ, L)-Hölder and bounded by c′/2 ∧ v/4. It remains to specify the marginal distribution of X ∈ R^d. We select a Euclidean ball in R^d, denoted by A, that has an empty intersection with B(0, √d + ρ) and whose Lebesgue measure compensates the mass placed on the grid balls below. The density µ of the marginal distribution of X ∈ R^d is constructed as
• µ(x) = τ/Leb(B(0, (4q)^{−1})) for every x ∈ B(z, (4q)^{−1}) or x ∈ B(−z, (4q)^{−1}), with z ranging over the m grid points defining X₁, ..., X_m together with z = 0,
• µ(x) = (1 − (2m+1)τ)/Leb(A) for every x ∈ A,
• µ(x) = 0 for every other x ∈ R^d,
for some τ to be specified. Now we check that the distributions constructed above belong to the set P for every w ∈ W. Namely, we check the following list of assumptions:
• the functions p^w_1, ..., p^w_K define a regression function for every w ∈ W, that is, for each x ∈ R^d we have Σ_{k=1}^K p^w_k(x) = 1 and 0 ≤ p^w_k(x) ≤ 1;
• the functions p^w_1, ..., p^w_K are (γ, L)-Hölder;
• the function G^w(t) := Σ_{k=1}^K ∫_{R^d} 1{p^w_k(x) ≥ t} µ(x) dx is continuous;
• the threshold G^{-1}(β) is equal to v for every w ∈ W;
• the marginal distribution satisfies the strong density assumption;
• the regression function satisfies the α-margin assumption.

The regression function is well defined: to see this, notice that for every w ∈ W and every x ∈ R^d we have, by construction,
\[
p^w_{\beta+1}(x)+p^w_\beta(x) = 2v, \qquad \sum_{k=1}^{\beta-1}p^w_k(x)+\sum_{k=\beta+2}^{K}p^w_k(x) = (K-2)v,
\]
and the combination of both with v = 1/K implies Σ_{k=1}^K p^w_k(x) = 1. Moreover, as long as sup_{x∈X_i} ϕ(x) ≤ v/2 for i = −m, ..., −1, 1, ..., m, we have for every x ∈ R^d
\[
0 < v/2 \le p^w_{\beta+1}(x) \le 3v/2 \le 1, \qquad 0 < v/2 \le p^w_\beta(x) \le 3v/2 \le 1,
\]
and by construction of the function g we have, for every k = 1, ..., β−1, every x ∈ R^d and every w ∈ W, 0 ≤ p^w_k(x) ≤ v + 3c′; due to the choice of c′, v we have
\[
v + 3c' \le \frac{1}{K} + \frac{3(K-\beta-1)}{K^2(\beta-1)} \le \frac{4}{K} \le 1.
\]
Similarly, for every k = β+2, ..., K, every x ∈ R^d and every w ∈ W we have v − 3c′(β−1)/(K−β−1) ≤ p^w_k(x) ≤ 1, and with the choice of v, c′ specified above and the constraint β ≤ ⌊K/2⌋ we have
\[
v - \frac{3c'(\beta-1)}{K-\beta-1} = \frac{1}{K} - \frac{3}{K^2} \ge 0.
\]
Thus, the construction above defines a regression function for every w ∈ W.

The regression function is (γ, L)-Hölder: this follows immediately from the construction of ϕ, φ, ξ and g.

Continuity of G(t): first let us show that ∫_{R^d} 1{p^w_k(x) ≥ t} µ(x) dx is continuous for every k ∈ [K]. For k = 1, ..., β−1, β+2, ..., K the continuity follows from the fact that g is not concentrated around any constant. For k = β, β+1, decomposing the integral over the 2m+1 balls carrying the grid masses and over A, the continuity follows from the fact that ϕ, φ and g are not concentrated around any constant.

Threshold G^{-1}(β) = v: to see this, notice that for every w ∈ W
\[
\sum_{k=1}^{K}\mathbf{1}\{p^w_k(x) \ge v\} = \beta, \qquad \text{a.e. } \mu,
\]
and the condition on the threshold follows from the continuity of G(·). Besides, the corresponding β-Oracle sets Γ*_w are given for every w ∈ W as
\[
\Gamma^*_w(x) = \begin{cases}\{1,\dots,\beta-1,\beta\}, & x \in \mathcal{X}_i,\ w_i = 1,\\ \{1,\dots,\beta-1,\beta+1\}, & x \in \mathcal{X}_i,\ w_i = -1,\\ \{1,\dots,\beta-1,\beta\}, & x \in \mathcal{X}_{-i},\ w_i = -1,\\ \{1,\dots,\beta-1,\beta+1\}, & x \in \mathcal{X}_{-i},\ w_i = 1,\\ \{1,\dots,\beta-1,\beta\}, & x \in \mathbb{R}^d\setminus\bigl(\bigcup_{i=-m}^{m}\mathcal{X}_i\bigr).\end{cases}
\]

The strong density assumption: this can be checked following the proof of (Audibert and Tsybakov, 2007, Theorem 3.5), where an analogous construction of the marginal distribution was considered.

The α-margin assumption:
4, all k ∈ [ K ] \ { β, β + 1 } and all w ∈ W we have µ ( | p wk ( X ) − v | ≤ t ) = 0 , thus for k ∈ [ K ] \ { β, β + 1 } the margin assumption is satisfied. It remains tocheck that the margin assumption is satisfied for k ∈ { β, β +1 } . Fix an arbitrary w ∈ W and k = β , then for all t ≤ t we can write µ (cid:32) | p wk ( X ) − v | ≤ t (cid:33) = m (cid:88) i = − m µ ( | p wk ( X ) − v | ≤ t, X ∈ X i )= m (cid:88) i = − m,i (cid:54) =0 µ ( ϕ ( X − n q ( X )) ≤ t, X ∈ X i ) + µ ( φ ( X ) ≤ t, X ∈ X ) . hzhen, Denis and Hebiri/Confidence Sets We separately upper-bound both terms which appear on the right hand side ofthe equality. µ ( φ ( X ) ≤ t, X ∈ X ) = τ Leb( B (0 , (4 q ) − )) (cid:90) B (0 , (4 q ) − ) { φ ( X ) ≤ t } dx = τ Leb( B (0 , (4 q ) − )) (cid:90) B (0 , (4 q ) − ) (cid:26) C φ (2 q ) − γ (cid:16) (cid:107) x (cid:107) (2 q ) − (cid:17) (cid:100) γ (cid:101) ψ − , (cid:16) (cid:107) x (cid:107) (2 q ) − (cid:17) ≤ t (cid:27) dx = Cτ q − d Leb( B (0 , (4 q ) − )) (cid:90) B (0 , / (cid:110) ( (cid:107) x (cid:107) ) (cid:100) γ (cid:101) ψ − , ( (cid:107) x (cid:107) ) ≤ C − φ (2 q ) γ t (cid:111) dx , clearly there exists a constant C such that for all x ∈ B (0 , /
2) we have ψ − , ( (cid:107) x (cid:107) ) ≥ C ,
Therefore for some constant
C > µ ( φ ( X ) ≤ t, X ∈ X ) ≤ Cτ q − d Leb( B (0 , (4 q ) − )) (cid:90) B (0 , / (cid:110) (cid:107) x (cid:107)≤ C ( q ) γ/ (cid:100) γ (cid:101) t / (cid:100) γ (cid:101) (cid:111) dx ≤ Cτ q − d ( − γ/ (cid:100) γ (cid:101) )Leb( B (0 , (4 q ) − )) t d/ (cid:100) γ (cid:101) , thanks to the strong density assumption we can write for some C > µ ( φ ( X ) ≤ t, X ∈ X ) ≤ Cq − d ( − γ/ (cid:100) γ (cid:101) ) t d/ (cid:100) γ (cid:101) . Thus since 1 − γ/ (cid:100) γ (cid:101) ≥ d/ (cid:100) γ (cid:101) ≥ α we can write for some C > µ ( φ ( X ) ≤ t, X ∈ X ) ≤ Ct α . To finish this part it remains to upper-bound the other term in the marginassumption m (cid:88) i = − m,i (cid:54) =0 µ ( ϕ ( X − n q ( X )) ≤ t, X ∈ X i ) = 2 mτ Leb( B (0 , (4 q ) − )) (cid:90) B (0 , (4 q ) − ) { ϕ ( X ) ≤ t } dx , using the fact that the function ϕ ( x ) for all x ∈ B (0 , (4 q ) − ) satisfies C ϕ q − γ ≤ ϕ ( x ) ≤ C ϕ q − γ (cid:16) ψ − , (0) (cid:17) ≤ C ϕ q − γ , we can write for all t ≤ C ϕ q − γm (cid:88) i = − m,i (cid:54) =0 µ ( ϕ ( X − n q ( X )) ≤ t, X ∈ X i ) = 0 , moreover, for all t ≥ C ϕ q − γ we can write m (cid:88) i = − m,i (cid:54) =0 µ ( ϕ ( X − n q ( X )) ≤ t, X ∈ X i ) ≤ mτ , hzhen, Denis and Hebiri/Confidence Sets and finally for t ∈ ( C ϕ q − γ , C ϕ q − γ ) we can write m (cid:88) i = − m,i (cid:54) =0 µ ( ϕ ( X − n q ( X )) ≤ t, X ∈ X i ) = 2 mτ Leb( B (0 , (4 q ) − )) (cid:90) B (0 , (4 q ) − ) { ϕ ( X ) ≤ t } dx ≤ mτ Leb( B (0 , (4 q ) − )) (cid:90) B (0 , (4 q ) − ) { C ϕ q − γ ≤ t } dx = 2 mτ . The above implies that for some constant
C > 0 and all t ≤ t₀,

Σ_{i=−m, i≠0}^{m} µ(ϕ(X − n_q(X)) ≤ t, X ∈ X_i) ≤ 2τm 1{t ≥ c_ϕ q^{−γ}} ≤ C τ m q^{γα} t^α,

where the last inequality uses 1{t ≥ c_ϕ q^{−γ}} ≤ (t/(c_ϕ q^{−γ}))^α. Thus the margin assumption is satisfied as long as
• τm = O(q^{−γα});
• ⌈γ⌉α ≤ d.
Similarly, one can check that the margin assumption is satisfied for k = β + 1.

Bound on the KL-divergence: we are now in position to upper-bound the KL-divergence between any two hypotheses. Fix some w, w′ ∈ W; then, using the upper bound on ϕ(·), we can write for some C > 0

KL(P_w, P_{w′}) ≤ 2 Σ_{i=−m, i≠0}^{m} µ( ϕ(X − n_q(X)) log( (v + ϕ(X − n_q(X)))/(v − ϕ(X − n_q(X))) ), X ∈ X_i ) ≤ C m τ q^{−2γ}.
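To make the last inequality transparent, here is the standard two-point computation it relies on (a sketch; we only use log(1 + u) ≤ u, together with ϕ ≤ C_ϕ q^{−γ}, which keeps v − ϕ bounded away from zero for q large enough):

2ϕ log( (v + ϕ)/(v − ϕ) ) = 2ϕ log( 1 + 2ϕ/(v − ϕ) ) ≤ 4ϕ²/(v − ϕ) ≤ C ϕ² ≤ C C_ϕ² q^{−2γ}.

Integrating this pointwise bound over the 2m cells X_i, each of µ-mass τ, yields the stated bound C m τ q^{−2γ}.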
How many hypotheses to take: let us recall the following result, which is a version of the Varshamov–Gilbert bound (Gilbert, 1952; Varshamov, 1957).

Lemma C.1. Let δ(w, w′) denote the Hamming distance between w, w′ ∈ W, given by

δ(w, w′) := Σ_{i=1}^{m} 1{w_i ≠ w′_i}.

There exists W′ ⊂ W such that for all w ≠ w′ ∈ W′ we have δ(w, w′) ≥ m/8 and log|W′| ≥ (m/8) log 2.
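As a quick numerical illustration of Lemma C.1 (not of its proof, which is a counting argument), the following sketch greedily extracts from the cube {−1, 1}^m a family that is pairwise m/8-separated in Hamming distance and checks that it is exponentially large; the brute-force greedy routine and the toy value of m are ours.

```python
from itertools import product
from math import log

# Brute-force sanity check of the Varshamov-Gilbert-type packing for tiny m:
# greedily keep sign vectors lying at Hamming distance >= m/8 from every
# vector kept so far, then compare log|W'| with (m/8) * log 2.
m = 10
cube = product((-1, 1), repeat=m)        # the full hypercube {-1, +1}^m

def delta(w, u):
    """Hamming distance between two sign vectors."""
    return sum(a != b for a, b in zip(w, u))

packing = []
for w in cube:
    if all(delta(w, u) >= m / 8 for u in packing):
        packing.append(w)

print(f"|W'| = {len(packing)}")
print(f"log|W'| = {log(len(packing)):.2f} >= (m/8) log 2 = {m / 8 * log(2):.2f}")
```

The greedy packing is of course far from optimal, but it already exhibits the exponential size guaranteed by the lemma.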
Denote by W′ ⊂ W the set provided by Lemma C.1 and by P_{W′} the set of distributions P_w with w ∈ W′. Taking into account all of the above, we conclude that P_{W′} satisfies the assumptions of our result.

Lower bound on the Hamming risk (applying Birgé's Lemma A.1): finally, we are in position to lower-bound the Hamming risk. Recall that we are interested in the quantity

inf_{Γ̂} sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)|.

The rest of the proof follows standard arguments, which, again using the de Finetti notation, read as

inf_{Γ̂} sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)| ≥ inf_{Γ̂} sup_{w ∈ W′} µ^{⊗N} ⊗ P_w^{⊗n} [ µ( |Γ̂(X) △ Γ*_w(X)| ) ].

Denote by ŵ the following minimizer:

ŵ ∈ argmin_{w ∈ W′} µ( |Γ̂(X) △ Γ*_w(X)| );

thus, if w ≠ ŵ, we can write, using the definition of ŵ and the triangle inequality,

2 µ( |Γ̂(X) △ Γ*_w(X)| ) ≥ µ( |Γ̂(X) △ Γ*_w(X)| ) + µ( |Γ̂(X) △ Γ*_ŵ(X)| ) ≥ µ( |Γ*_ŵ(X) △ Γ*_w(X)| ) ≥ 2 δ(w, ŵ) µ(X₁) = 2 δ(w, ŵ) τ ≥ mτ/4.

These arguments and Birgé's Lemma A.1 imply that

sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)| ≥ (mτ/8) max_{w ∈ W′} µ^{⊗N} ⊗ P_w^{⊗n}( w ≠ ŵ )
 ≥ (mτ/8) [ 0.36 ∧ ( 1 − Σ_{w ∈ W′ \ {w′}} KL( µ^{⊗N} ⊗ P_w^{⊗n}, µ^{⊗N} ⊗ P_{w′}^{⊗n} ) / ( (|W′| − 1) log|W′| ) ) ].

Since the marginal distribution of the vector X ∈ R^d is shared among the hypotheses, using the upper bound on the KL-divergence and the conditions on W′ we get, for some C > 0,

sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)| ≥ (mτ/8) ( 1 − C n τ q^{−2γ} ).

Finally, let q = ⌊C̄ n^{1/(2γ+d)}⌋, τ = C′ q^{−d} and m = ⌊C″ q^{d−αγ}⌋ for some C̄, C′, C″ > 0. Then, for some constants C > 0 and c < 1,

sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)| ≥ C n^{−αγ/(2γ+d)} (1 − c).

One can easily verify that this choice of the parameters τ, m, q is possible as long as 2⌈γ⌉α ≤ d, and that with this choice we indeed have τm = O(q^{−αγ}). As already mentioned, the lower bound for the excess risk and the discrepancy follows from Propositions 3.1 and 3.2.
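For the reader's convenience, the exponent arithmetic behind this choice of parameters is the following direct substitution (constants absorbed into C̄, C′, C″):

mτ ≍ q^{d−αγ} q^{−d} = q^{−αγ} ≍ n^{−αγ/(2γ+d)},   n τ q^{−2γ} ≍ C′ n q^{−(2γ+d)} ≍ C′ C̄^{−(2γ+d)},

so that taking C′ small enough (equivalently, C̄ large enough) guarantees C n τ q^{−2γ} ≤ c < 1, while the leading factor mτ delivers the rate n^{−αγ/(2γ+d)}.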
Appendix D: Inconsistency of the top-β approach

In this section we prove Proposition 3.3. The proof builds an explicit construction of a distribution P whose β-Oracle satisfies |Γ*_β(x)| > β for all x in some A ⊂ R^d with P_X(A) > 0. Clearly, if such a distribution exists, then there is no estimator in Υ̂_β that consistently estimates this β-Oracle. Let β ∈ {0, 1, …, ⌊K/2⌋ − 1} be a fixed integer and let K ≥ 3.
For the proof of the theorem we shall construct a single distribution P for which no estimator with fixed information can perform well. We start by specifying the density µ of the marginal distribution P_X. For some positive r ≤ r′, define the annulus

D(r, r′) = { x ∈ R^d : r ≤ ‖x‖ ≤ r′ }.

First of all, fix some parameters 0 < r₀ < r₁ < r₂ < r₃ < r₄ which are independent of n, N. The density µ is supported on B(0, r₀) ∪ D(r₁, r₂) ∪ D(r₃, r₄). Moreover,
• µ(x) = ( β/(β+1) − Leb(B(0, r₀)) ) / Leb(D(r₁, r₂)) for all x ∈ D(r₁, r₂),
• µ(x) = 1 / ( (β+1) Leb(D(r₃, r₄)) ) for all x ∈ D(r₃, r₄),
• µ(x) = 1 for all x ∈ B(0, r₀),
• µ(x) = 0 otherwise,
where r₀ > 0 is chosen so that β/(β+1) − Leb(B(0, r₀)) > 0. The regression functions p(·) = (p₁(·), …, p_K(·))⊤ are defined as

p₁(x) = … = p_{β+1}(x) =
 1/(2(β+1)) + C_L (1 − cos(π‖x‖/r₀))/(β+1),    x ∈ B(0, r₀),
 1/(2(β+1)) + g(x)/(2(β+1)),                    x ∈ D(r₀, r₁),
 1/(β+1) − C_L (1 − cos(π‖x‖/r₂))/(β+1),        x ∈ D(r₁, r₂),
 1/(β+1) − ξ(x)/(2(β+1)),                       x ∈ D(r₂, r₃),
 1/(2(β+1)) − C_L (1 − cos(π‖x‖/r₃))/(β+1),     x ∈ R^d \ B(0, r₃),

p_{β+2}(x) = … = p_K(x) =
 1/(2(K−β−1)) − C_L (1 − cos(π‖x‖/r₀))/(K−β−1),    x ∈ B(0, r₀),
 1/(2(K−β−1)) − g(x)/(2(K−β−1)),                    x ∈ D(r₀, r₁),
 C_L (1 − cos(π‖x‖/r₂))/(K−β−1),                    x ∈ D(r₁, r₂),
 ξ(x)/(2(K−β−1)),                                   x ∈ D(r₂, r₃),
 1/(2(K−β−1)) + C_L (1 − cos(π‖x‖/r₃))/(K−β−1),     x ∈ R^d \ B(0, r₃),

so that Σ_{k=1}^{K} p_k(x) = 1 for every x ∈ R^d, where the constant C_L is chosen small enough to ensure that these functions are (γ, L)-Hölder and have sufficiently small variation. Consider an arbitrary infinitely differentiable function v : R → [0, 1]
which satisfies v(x) = 0 for all x ≤ 0 and v(x) = 1 for all x ≥ 1.
Then, the functions g(·) and ξ(·) are defined as

g(x) = v( (‖x‖ − r₀)/(r₁ − r₀) ),   ξ(x) = v( (‖x‖ − r₂)/(r₃ − r₂) ).

The above construction defines a distribution P for which we have

G^{−1}(β) = 1/(2(β+1)),
Γ*_β(x) = {1, …, β+1} if x ∈ B(0, r₀) ∪ D(r₁, r₂), and Γ*_β(x) = ∅ otherwise.

Indeed, let us evaluate the following quantity under the assumption that β ≤ ⌊K/2⌋ − 1:

Σ_{k=1}^{K} ∫ 1{ p_k(x) ≥ G^{−1}(β) } µ(x) dx = (β+1) ( ∫_{B(0,r₀)} µ(x) dx + ∫_{D(r₁,r₂)} µ(x) dx )
 = (β+1) ( Leb(B(0, r₀)) + β/(β+1) − Leb(B(0, r₀)) ) = β.

Thus, using this distribution, we can write for any classifier Γ̂ ∈ Υ̂_β with fixed cardinality

P(Γ̂) − P(Γ*_β) = ∫_{R^d} Σ_{k=1}^{K} | p_k(x) − G^{−1}(β) | 1{ k ∈ Γ̂(x) △ Γ*(x) } µ(x) dx
 ≥ ∫_{D(r₁,r₂)} | 1/(β+1) − C_L (1 − cos(π‖x‖/r₂))/(β+1) − 1/(2(β+1)) | µ(x) dx
 = ∫_{D(r₁,r₂)} | 1/(2(β+1)) − C_L (1 − cos(π‖x‖/r₂))/(β+1) | · ( β/(β+1) − Leb(B(0, r₀)) ) / Leb(D(r₁, r₂)) dx,

where the first inequality follows from the observation that for x ∈ D(r₁, r₂) there is always at least one label k such that k ∈ Γ̂(x) △ Γ*(x). Thus, since the constant C_L is chosen to satisfy 2C_L/(β+1) ≤ 1/(4(β+1)), we have for any Γ̂ ∈ Υ̂_β

P(Γ̂) − P(Γ*_β) ≥ ( β/(β+1) − Leb(B(0, r₀)) ) / ( 4(β+1) ).

If r₀ is such that Leb(B(0, r₀)) ≤ β/(2(β+1)), we get

P(Γ̂) − P(Γ*_β) ≥ β / ( 8(β+1)² ),   almost surely.

By construction, the regression vector is (γ, L)-Hölder and the density is lower- and upper-bounded by positive constants on its support. Hence, it remains to check that the constructed distribution satisfies the α-margin assumption. This can be achieved by an appropriate choice of r₀. Indeed, on the sets D(r₁, r₂) ∪ D(r₃, r₄) there is a "corridor" of constant size between the regression functions and the threshold G^{−1}(β); the threshold G^{−1}(β) is only approached by the regression functions on the set B(0, r₀). As all the parameters in our construction are independent of n, N ∈ N, we can choose r₀ small enough so that the α-margin assumption is verified for a fixed α > 0.
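As a complement, the following toy numerical sketch illustrates the mechanism exploited above in a simplified one-dimensional setting; the specific choices (K = 4, β = 1, X uniform on [0, 1], sinusoidal regression functions) are ours for illustration and are not the construction of the proof. Classes 1 and 2 are tied everywhere, so the β-Oracle keeps either both of them or neither: its cardinality exceeds β on a set of positive mass even though its expected size equals β, and any estimator constrained to output exactly β labels must therefore disagree with the oracle on that set.

```python
import numpy as np

# Toy illustration (hypothetical setup, not the proof's construction):
# K = 4 classes, beta = 1 <= floor(K/2) - 1, X ~ Uniform[0, 1] on a grid.
K, beta = 4, 1
x = np.linspace(0.0, 1.0, 20_001)          # fine uniform grid approximating P_X
p1 = 0.25 + 0.20 * np.sin(np.pi * x)       # p_1 = p_2: tied regression functions
p3 = (1.0 - 2.0 * p1) / 2.0                # p_3 = p_4, so sum_k p_k(x) = 1
P = np.vstack([p1, p1, p3, p3])

def G(t):
    """G(t) = sum_k P_X(p_k(X) > t), estimated on the grid."""
    return sum((pk > t).mean() for pk in P)

# Generalized inverse: the smallest t with G(t) <= beta (G is nonincreasing).
ts = np.linspace(0.0, 1.0, 2_001)
thr = ts[np.argmax([G(t) <= beta for t in ts])]

sizes = (P >= thr).sum(axis=0)             # |Gamma*_beta(x)| along the grid
print(f"G^-(beta) ~ {thr:.4f}")            # ~ 0.39 in this toy example
print(f"E|Gamma*| ~ {sizes.mean():.3f} (target: beta = {beta})")
print(f"P_X(|Gamma*(X)| > beta) ~ {(sizes > beta).mean():.3f}")  # ~ 0.5 > 0
```

The printout shows an oracle of expected size β whose cardinality is 2 on roughly half of the support, mirroring the role played by B(0, r₀) ∪ D(r₁, r₂) in the construction above.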