Minimax semi-supervised confidence sets for multi-class classification
Evgenii Chzhen∗, Christophe Denis and Mohamed Hebiri
Université Paris-Est – Marne-la-Vallée
Cité Descartes, Bâtiment Copernic, 5 boulevard Descartes, 77454 Marne-la-Vallée cedex 2
e-mail: [email protected]; [email protected]; [email protected]

∗ This work was partially supported by "Labex Bézout" of Université Paris-Est.
Abstract:
In this work we study the semi-supervised framework of confidence set classification with controlled expected size in minimax settings. We obtain semi-supervised minimax rates of convergence under the margin assumption and a Hölder condition on the regression function. Besides, we show that if no further assumptions are made, there is no supervised method that outperforms the semi-supervised estimator proposed in this work. We establish that the best achievable rate for any supervised method is n^{−1/2}, even if the margin assumption is extremely favorable. On the contrary, semi-supervised estimators can achieve faster rates of convergence provided that sufficiently many unlabeled samples are available. We additionally perform a numerical evaluation of the proposed algorithms, empirically confirming our theoretical findings.

MSC 2010 subject classifications: Primary 62G05; secondary 62G30, 62H05, 68T10.
Keywords and phrases: multi-class classification, confidence sets, minimax optimality, semi-supervised classification.
1. Introduction
Let K ≥ 2 and let (X, Y) ∈ R^d × [K] := {1, . . . , K} be a random couple distributed according to a distribution P on R^d × [K], where X ∈ R^d is seen as the feature vector and Y ∈ [K] as the class. This problem falls within the scope of the multi-class setting where the goal is to predict the label Y for a given feature. Commonly, prediction is performed by a classifier that outputs a single label. However, in the confidence set framework, the objective differs: we aim at predicting a set of labels instead of a single one. This problem has been studied in a few works, and we consider in this contribution the setup put forward by Denis and Hebiri (2017). The essential feature of their perspective is the control of the size of confidence sets in expectation. While they provided a procedure to build confidence sets based on Empirical Risk Minimization (ERM) and established upper bounds, the present work aims at giving a general analysis of the confidence set problem in the minimax sense.

All along the paper, we denote by P_X the marginal distribution of X ∈ R^d and by p(·) := (p_1(·), . . . , p_K(·))^⊤ the regression function defined for all k ∈ [K] and all x ∈ R^d as p_k(x) := P(Y = k | X = x). For any sets A, A' ⊂ [K] we denote by A △ A' their symmetric difference. We assume that two data samples D_n, D_N are available. The first sample D_n = {(X_i, Y_i)}_{i=1}^n consists of n ∈ ℕ i.i.d. copies of (X, Y) ∈ R^d × [K], and the second sample D_N = {X_i}_{i=n+1}^{n+N} consists of N ∈ ℕ i.i.d. copies of X ∈ R^d.

A confidence set classifier Γ is a measurable function from R^d to 2^{[K]} := {A : A ⊂ [K]}, that is, Γ : R^d → 2^{[K]}, and we denote by Υ the set of all such functions. For any confidence set Γ : R^d → 2^{[K]} we define its error and its information as

  P(Γ) = P(Y ∉ Γ(X))   (error),    I(Γ) = E_{P_X} |Γ(X)|   (information),

respectively, where E_{P_X} stands for the expectation w.r.t. the marginal distribution of X ∈ R^d and |Γ(x)| is the cardinality of Γ(x) at x ∈ R^d.

For a fixed integer β ∈ [K] a β-Oracle confidence set Γ*_β is defined as

  Γ*_β ∈ arg min { P(Γ) : Γ ∈ Υ s.t. I(Γ) = β }.

The set {Γ ∈ Υ : I(Γ) = β} is always non-empty, as it always contains those confidence sets whose cardinality equals β for every x ∈ R^d.

The description of the β-Oracle confidence set in general situations might be complicated. Hence, we introduce the following mild assumption, which allows us to obtain an explicit expression.

Assumption 1.1 (Continuity of CDF). For all k ∈ [K] the cumulative distribution function (CDF) F_{p_k}(·) := P_X(p_k(X) ≤ ·) of p_k(X) is continuous on (0, 1).

Proposition 1.2 (β-Oracle confidence set). Fix β ∈ [K − 1], and let the function G : [0, 1] → [0, K] be defined for all t ∈ [0, 1] as

  G(t) := Σ_{k=1}^K (1 − F_{p_k}(t)) = Σ_{k=1}^K P_X(p_k(X) > t);

then under Assumption 1.1 a β-Oracle confidence set Γ*_β can be obtained as

  Γ*_β(x) = { k ∈ [K] : p_k(x) ≥ G^{-1}(β) },   (1.1)

where we denote by G^{-1} the generalized inverse of G, defined for all β ∈ [0, K] as G^{-1}(β) := inf{ t ∈ [0, 1] : G(t) ≤ β }.
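For concreteness, the following minimal sketch (our own illustration, not from the paper) evaluates an empirical version of G on a grid of thresholds and extracts the generalized inverse G^{-1}(β); the Dirichlet-generated matrix of class-probability values is an illustrative assumption standing in for p_k(X_i).

```python
import numpy as np

def G(t, probs):
    # Empirical G(t) = sum_k P_X(p_k(X) > t), averaged over feature points;
    # `probs` has shape (n_points, K) with probs[i, k] = p_k(x_i).
    return (probs > t).sum(axis=1).mean()

def G_inverse(beta, probs, grid_size=1_000):
    # Generalized inverse: G^{-1}(beta) = inf{ t in [0, 1] : G(t) <= beta };
    # G is non-increasing, so we return the first grid point where G <= beta.
    for t in np.linspace(0.0, 1.0, grid_size):
        if G(t, probs) <= beta:
            return t
    return 1.0

rng = np.random.default_rng(0)
# Illustrative model: K = 3 classes, p(x) drawn from a Dirichlet distribution
# (rows sum to one, as regression functions must).
probs = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=5_000)
beta = 2
t_beta = G_inverse(beta, probs)
# beta-Oracle at a point x, Eq. (1.1): keep classes whose score clears the threshold.
oracle_set = np.where(probs[0] >= t_beta)[0]
print(t_beta, oracle_set)
```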
Proposition 1.3. Assume that Assumption 1.1 is fulfilled; then the β-Oracle defined in Eq. (1.1) is a minimizer of the following risk

  R_β(Γ) = P(Γ) + G^{-1}(β) I(Γ).   (1.2)

These propositions have been proven in (Denis and Hebiri, 2017, Propositions 4 and 7). Consequently, the accuracy of a confidence set Γ can, for instance, be quantified through its excess risk

  R_β(Γ) − R_β(Γ*_β) = Σ_{k=1}^K E_{P_X}[ |p_k(X) − G^{-1}(β)| 1{k ∈ Γ(X) △ Γ*_β(X)} ].

The statistical learning problem is then to estimate Γ*_β given the data samples D_n and D_N. The formulation in Eq. (1.1) of the β-Oracle appears to be closely related to the level set estimation problem (Hartigan, 1987; Polonik, 1995; Tsybakov, 1997; Rigollet and Vert, 2009). Hence, at first sight, the introduction of an unlabeled sample may be surprising. However, in our setup the estimation of the β-Oracle does not rely only on the regression function but also on the threshold G^{-1}(β), which is unknown beforehand and can be estimated in a semi-supervised way (Denis and Hebiri, 2017). To fix ideas, we give some examples of possible estimation procedures for Γ*_β.

An estimator Γ̂ is a measurable function that maps any given data samples into a confidence set classifier. We shall distinguish two types of estimators, supervised and semi-supervised, whose formal definitions are provided below.
Definition 1.4 (Supervised and semi-supervised estimators). A measurable mapping

  Γ̂ : ⋃_{n,N ∈ ℕ} (R^d × [K])^n × (R^d)^N → Υ

is called a supervised estimator if for any n, N ∈ ℕ and any data samples D_n = {(X_i, Y_i)}_{i=1}^n, D_N = {X_i}_{i=n+1}^{n+N}, and D'_N = {X'_i}_{i=n+1}^{n+N} it holds that

  Γ̂(x; D_n, D_N) = Γ̂(x; D_n, D'_N),   a.e. x ∈ R^d w.r.t. the Lebesgue measure.

Otherwise the estimator is called semi-supervised. In the sequel, for simplicity of notation, we write Γ̂(x) instead of Γ̂(x; D_n, D_N) where no ambiguity is present.

Intuitively, supervised estimators do not take into account the information provided by the unlabeled sample. Besides, if we denote by Υ̂ the set of all estimators, Definition 1.4 generates a natural partition of Υ̂ into two disjoint sets: the supervised estimators Υ̂_SE and the semi-supervised estimators Υ̂_SSE. Hereafter, we provide three different examples of estimation procedures which are at the core of our study. All these methods rely on the plug-in principle.

• Top-β procedure. This method is the most intuitive estimator in the considered context. It is a supervised procedure, that is, based only on D_n. Consider an estimator p̂ of the regression function p and let (p̂_{σ_k(x)}(x))_{k ∈ [K]} be the order statistics associated with p̂(x), such that for all x ∈ R^d we have p̂_{σ_1(x)}(x) ≥ . . . ≥ p̂_{σ_K(x)}(x). A top-β confidence set is then defined as

  Γ̂_top(x) = {σ_1(x), . . . , σ_β(x)},   ∀x ∈ R^d.   (1.3)

• Supervised procedure. Formally, in this type of method we only use D_n (we forget about D_N). We split D_n into two independent samples such that D_n = D_{⌊n/2⌋} ∪ D_{⌈n/2⌉}. Based on the first sample D_{⌊n/2⌋}, we consider an estimator p̂ of the regression function p. Furthermore, we define

  Ĝ(·) = (1/⌈n/2⌉) Σ_{i ∈ D_{⌈n/2⌉}} Σ_{k=1}^K 1{p̂_k(X_i) ≥ ·},

and one type of supervised estimator is then defined as follows:

  Γ̂_SE(x) = {k ∈ [K] : p̂_k(x) ≥ Ĝ^{-1}(β)},   ∀x ∈ R^d.   (1.4)

Interestingly, conditional on the data sample D_{⌊n/2⌋}, the definition of the estimator Ĝ does not involve the labels associated with D_{⌈n/2⌉}. As a consequence, we can naturally consider a semi-supervised version of this estimator.

• Semi-supervised procedure. Based on D_n, we consider an estimator p̂ of the regression function p. Furthermore, we define

  Ĝ(·) = (1/N) Σ_{i ∈ D_N} Σ_{k=1}^K 1{p̂_k(X_i) ≥ ·},

and one type of semi-supervised estimator is then defined as follows:

  Γ̂_SSE(x) = {k ∈ [K] : p̂_k(x) ≥ Ĝ^{-1}(β)},   ∀x ∈ R^d.   (1.5)

One can note that these procedures are based on a preliminary estimator of p built from D_n, that is, all of them are plug-in type procedures. However, they differ by the construction of the output set. The top-β procedure and the supervised procedure rely only on the labeled data, while the semi-supervised estimator takes advantage of the information provided by the unlabeled data. The top-β procedure is the simplest among them: it naturally satisfies |Γ̂(x)| = β for all x ∈ R^d. At the same time, the others are more involved and can have different cardinalities for different values of x ∈ R^d. Nevertheless, for the other two procedures one can guarantee I(Γ̂) ≈ β. A schematic implementation of the three procedures is sketched below.
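The following sketch is our own illustration (the Dirichlet-distributed score matrices stand in for hypothetical estimates p̂_k(X_i)): the top-β rule (1.3) only sorts the scores, while (1.4) and (1.5) share the same thresholding step and differ only in the sample used to calibrate Ĝ.

```python
import numpy as np

def top_beta(scores, beta):
    # Eq. (1.3): keep the beta largest estimated scores at each point.
    order = np.argsort(-scores, axis=1)   # descending sort of p-hat
    return order[:, :beta]                # labels sigma_1(x), ..., sigma_beta(x)

def empirical_G(scores):
    # t -> (1/m) sum_i sum_k 1{p-hat_k(X_i) >= t}.
    return lambda t: (scores >= t).sum(axis=1).mean()

def G_inverse(G_hat, beta, grid_size=1_000):
    # Generalized inverse G^{-1}(beta) = inf{t : G(t) <= beta}.
    for t in np.linspace(0.0, 1.0, grid_size):
        if G_hat(t) <= beta:
            return t
    return 1.0

def threshold_set(scores_at_x, t):
    # Eqs. (1.4)/(1.5): keep classes whose estimated score clears the threshold.
    return np.where(scores_at_x >= t)[0]

rng = np.random.default_rng(1)
scores_labeled = rng.dirichlet([1.0] * 5, size=200)     # p-hat on the held-out labeled half
scores_unlabeled = rng.dirichlet([1.0] * 5, size=5_000) # p-hat on the unlabeled sample D_N
beta = 2
t_SE = G_inverse(empirical_G(scores_labeled), beta)     # supervised calibration, Eq. (1.4)
t_SSE = G_inverse(empirical_G(scores_unlabeled), beta)  # semi-supervised calibration, Eq. (1.5)
x_scores = scores_unlabeled[0]
print(top_beta(x_scores[None, :], beta)[0],
      threshold_set(x_scores, t_SE),
      threshold_set(x_scores, t_SSE))
```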
These examples give rise to natural questions which form the core of our theoretical study and which are summarized below.

1. The first question is the statistical performance of these plug-in procedures, which is assessed through rates of convergence and their optimality in the minimax sense.
2. The second question focuses on the benefit of the semi-supervised approach. Roughly speaking, are there situations where the semi-supervised approach outperforms the supervised one, and how can this be quantified?
3. The third question concentrates on the reason why it is more relevant for this problem to consider more involved estimators than the simple top-β method.

For a given family P of joint distributions on R^d × [K], a given estimator Γ̂ ∈ Υ̂, and fixed integers K ≥ 2, β ∈ [K], n, N ∈ ℕ, we are interested in the following maximal risks:

  Ψ^H_{n,N}(Γ̂; P) := sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)|   (Hamming risk),
  Ψ^E_{n,N}(Γ̂; P) := sup_{P ∈ P} E_{(D_n, D_N)} [ R_β(Γ̂) − R_β(Γ*_β) ]   (excess risk),
  Ψ^D_{n,N}(Γ̂; P) := sup_{P ∈ P} E_{(D_n, D_N)} [ |P(Γ̂) − P(Γ*_β)| + |β − I(Γ̂)| ]   (discrepancy),

where E_{(D_n, D_N)} denotes the expectation w.r.t. P^{⊗n} ⊗ P_X^{⊗N}. These maximal risks arise naturally in the context of confidence set estimation with controlled expected size. The risk Ψ^H_{n,N}(Γ̂; P) corresponds to the estimation of the β-Oracle through the Hamming distance. The second risk is directly connected with Proposition 1.3, which describes the β-Oracle as a minimizer of R_β(·). As the goal in this problem is to construct a procedure Γ̂ that exhibits a low error P(Γ̂) and a low cardinality discrepancy |β − I(Γ̂)|, it is natural to consider Ψ^D_{n,N}(Γ̂; P), which is composed of both.

Finally, we are in position to define the notion of minimax rate. The minimax rate in this context is determined not only by the family of distributions P but also by the family of estimators Γ̂ ⊂ Υ̂ that we consider.

Definition 1.5 (Minimax rate of convergence). For a given family P of joint distributions on R^d × [K] and a given family of estimators Γ̂ ⊂ Υ̂, the minimax rates are defined as

  Ψ^□_{n,N}(Γ̂; P) := inf_{Γ̂ ∈ Γ̂} Ψ^□_{n,N}(Γ̂; P),

where □ is H, E or D.

The main families of estimators that we study are the supervised estimators Υ̂_SE and the semi-supervised estimators Υ̂_SSE. Obviously, since Υ̂ = Υ̂_SE ∪ Υ̂_SSE and Υ̂_SE ∩ Υ̂_SSE = ∅, we have the following relation:

  Ψ^□_{n,N}(Υ̂; P) = Ψ^□_{n,N}(Υ̂_SE; P) ∧ Ψ^□_{n,N}(Υ̂_SSE; P).

As a consequence, lower and upper bounds on Ψ^□_{n,N}(Υ̂_SE; P) and Ψ^□_{n,N}(Υ̂_SSE; P) yield bounds on the minimax rate over all estimators.

The confidence set approach to classification was pioneered by Vovk (2002a,b); Vovk, Gammerman and Shafer (2005) by means of conformal prediction theory. They rely on non-conformity measures which are based on some pattern recognition methods, and develop an asymptotic theory. In this work, we consider a statistical perspective on confidence set classification and put our focus on non-asymptotic minimax theory.

The problem of confidence set multi-class classification has strong ties with binary classification with reject option, also known as binary classification with abstention in the machine learning literature. In binary classification with rejection, a classifier is allowed to output some special symbol which indicates rejection.
Such classifiers can be seen as confidence sets which are allowed to output ∅ or {1, 2}, both interpreted as rejection. This line of research was initiated by Chow (1957, 1970) in the context of information retrieval, where a predefined cost of rejection was considered. An extensive statistical study of this framework was carried out in (Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008; Wegkamp and Yuan, 2011).

Instead of considering a fixed cost for rejection, which might be too restrictive, one may define two entities: the probability of rejection and the probability of misclassification. In the spirit of conformal prediction, Lei (2014) aims at minimizing the probability of rejection under a fixed upper bound on the probability of misclassification. In contrast, Denis and Hebiri (2015) consider the reversed problem of minimizing the probability of misclassification given a fixed upper bound on the probability of rejection.

Once multi-class classification is considered, there are several possible ways to extend the binary case: the confidence set approach and the rejection approach. The reject counterpart is the more studied and known version, though it lacks statistical analysis. To the best of our knowledge, the only work which provides statistical guarantees is (Ramaswamy, Tewari and Agarwal, 2018).

As for the confidence set approach there are again two possibilities, similar to the binary case. The one considered in this work was proposed by Denis and Hebiri (2017), where the authors analyse an ERM algorithm and derive oracle inequalities under the margin assumption (Tsybakov, 2004). More specifically, they consider a convex surrogate of the error P(·) which relies on a convex real-valued loss function φ. For a suitable choice of the convex function φ they show that, under Assumption 1.1, their β-Oracle satisfies

  Γ*_β(·) = {k ∈ [K] : f*_k(·) ≥ G^{-1}_{f*}(β)},

where the function f* depends on φ and the value of G^{-1}_{f*}(β) is defined similarly to the present manuscript. They propose a two-step estimation procedure for the β-Oracle set. Based on the ERM algorithm, they first estimate f* and, in the second step, they estimate the threshold G^{-1}_{f*}(β) with an unlabeled sample. This procedure is in the same spirit as the semi-supervised procedure (1.5). Under mild assumptions, they provide an upper bound on the excess risk and obtain a rate of convergence of order (n/log n)^{−α/(α+s)} + N^{−1/2}, with s being a parameter that depends on the function φ and α being the margin parameter. Note that this rate is slower than the rate obtained in the standard classification framework.

Conformal prediction theory (Vovk, Gammerman and Shafer, 2005) suggests minimizing the information level with a fixed budget on the error level. Statistical properties of this framework were considered in the work of Sadinle, Lei and Wasserman (2018). Their objective is formulated for some a ∈ (0, 1) as

  Γ*_a ∈ arg min {I(Γ) : Γ ∈ Υ s.t. P(Γ) ≤ a},

and such a confidence set is called a least ambiguous confidence set with bounded error rate. The authors show that under Assumption 1.1 this oracle set can be described as a thresholding of the regression function

  Γ*_a(·) = {k ∈ [K] : p_k(·) ≥ t_a},

where the threshold t_a is defined as

  t_a = sup{ t ∈ [0, 1] : Σ_{k=1}^K P(p_k(X) ≥ t | Y = k) P(Y = k) ≥ 1 − a }.

Notice that this framework is very similar to (Denis and Hebiri, 2017) in the treatment of the Bayes optimal confidence set, as in both cases it is obtained via thresholding of the posterior distribution of the labels. Sadinle, Lei and Wasserman (2018) also proceed in two steps as here, that is, they first estimate the posterior distribution p_k(·) for all k ∈ [K] and estimate the threshold t_a afterwards. However, they require a second labeled dataset for the estimation of t_a, due to the presence of P(Y = k), the marginal distribution of the labels. Besides, their theoretical analysis is carried out under a different set of assumptions on the joint distribution P. Apart from the standard margin assumption, they require so-called detectability, that is, they require that the upper bound in the margin assumption is tight. Under these assumptions they provide an upper bound on the Hamming excess risk and obtain a rate of convergence of order O((n/log n)^{−1/2}).

Interestingly, both approaches can be encompassed in the constrained estimation framework (Anbar, 1977; Lepskii, 1990; Brown and Low, 1996), where one would like to construct an estimator with some prescribed properties. These properties are typically reflected by the form of the risk, which in our case is the discrepancy measure, that is, the sum of error and information discrepancies. Thus, both frameworks of Sadinle, Lei and Wasserman (2018) and Denis and Hebiri (2017) can be seen as extensions of constrained estimation to classification problems. From the modeling point of view, we believe that the two frameworks can co-exist nicely and the particular choice depends on the considered application. The major difference between the present work and those by Denis and Hebiri (2017) and Sadinle, Lei and Wasserman (2018) is the minimax analysis which we provide here and our treatment of semi-supervised techniques.

As already pointed out, the confidence set estimation problem is closely related to the level set estimation setup (Hartigan, 1987; Polonik, 1995; Tsybakov, 1997; Rigollet and Vert, 2009). This problem focuses on the estimation of a level set defined as

  Γ_p(λ) = {x ∈ R^d : p(x) ≥ λ},

where p is the density of the observations and λ > 0 is fixed beforehand. Given X_1, . . . , X_n distributed according to the density p, the goal is to estimate Γ_p(λ). In (Rigollet and Vert, 2009), the authors study plug-in density level set estimators through the measure of symmetric differences and the excess mass. In confidence set estimation, the analogue of the measure of symmetric differences is the Hamming risk, whereas the analogue of the excess mass is the excess risk. They show that kernel-based estimators are optimal in the minimax sense over a Hölder class of densities and under a margin type assumption (Polonik, 1995; Tsybakov, 2004). In particular, they derive fast rates of convergence, that is, faster than n^{−1/2}, for the excess mass. In the level set estimation problem, the threshold λ is chosen beforehand, whereas in our work the threshold G^{-1}(β) depends on the distribution of the data, which makes the statistical analysis more difficult.

On the other hand, the confidence set estimation problem is directly related to the standard classification setting. This problem has been widely studied from a theoretical point of view in the binary classification framework.
Audibert and Tsybakov (2007) study the statistical performance of plug-in classification rules under assumptions which involve the smoothness of the regression function and the margin condition. In particular, they derive fast rates of convergence for plug-in classifiers based on local polynomial estimators (Stone, 1977; Tsybakov, 1986; Audibert and Tsybakov, 2007) and show their optimality in the minimax sense. One of the aims of the present work is to extend these results to the confidence set classification framework.

Another part of our work is to provide a comparison between supervised and semi-supervised procedures. Semi-supervised methods are studied in several papers (Vapnik, 1998; Rigollet, 2007; Singh, Nowak and Zhu, 2009; Bellec et al., 2018) and references therein. A simple intuition can be provided on whether or not one should expect a superior performance of the semi-supervised approach. Imagine a situation where the unlabeled sample D_N is so large that one can approximate P_X up to any desired precision; then, if the optimal decision is independent of P_X, the semi-supervised estimators are not to be considered superior to supervised estimation. This is the case in a lot of classical problems of statistics, where the inference is solely governed by the behavior of the conditional distribution P_{Y|X} (for instance regression or binary classification). The situation might be different once the optimal decision relies on the marginal distribution P_X. In this case, as suggested by our findings, the semi-supervised approach may or may not outperform the supervised one, even in the context of the same problem. Similar conclusions were stated by Singh, Nowak and Zhu (2009) in the context of learning under the cluster assumption (Rigollet, 2007).

Below we summarize our contributions.

• Our results focus on the case where the regression function p belongs to a Hölder class and satisfies the margin condition. Under these assumptions, we establish lower bounds on the minimax rates, defined in Section 1.3, in the confidence set framework.
• As important consequences of our results, we first show that top-β type procedures are in general inconsistent. Furthermore, by providing a rigorous definition of semi-supervised and supervised estimators, we describe the situations when semi-supervised estimation should be considered superior to its supervised counterpart. Interestingly, our analysis suggests that these regimes are governed by the interplay between the family of distributions and the considered measure of performance. Besides, we show that in our setting supervised procedures cannot achieve fast rates, that is, their rate cannot be faster than n^{−1/2}. In contrast, some other classical settings (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009; Herbei and Wegkamp, 2006) allow faster rates for supervised methods.
• We provide supervised and semi-supervised estimation procedures which are optimal, or optimal up to an extra logarithmic factor. Importantly, our results show that the semi-supervised plug-in procedure based on local polynomial estimators can achieve fast rates, provided that the size of the unlabeled sample is large enough.
• Finally, we perform a numerical evaluation of the proposed plug-in algorithms against their top-β counterparts. This part supports our theoretical results and empirically demonstrates the reason to consider more involved procedures.

The paper is organized as follows. In Section 2, we introduce additional notation and the family of distributions P that we consider. Section 3 is devoted to the lower bounds on the minimax rates and their implications. In Section 4 we introduce the proposed algorithms, establish upper bounds for them, and evaluate their numerical performance. We conclude the paper with Sections 5 and 6, where we discuss and summarize our results.
2. Class of confidence sets
First, let us introduce some generic notation used throughout this work. For two numbers a, a' ∈ R we denote by a ∨ a' (resp. a ∧ a') the maximum (resp. minimum) of a and a'. For a positive real number a we denote by ⌊a⌋ (resp. ⌈a⌉) the largest (resp. smallest) non-negative integer that is less than or equal (resp. greater than or equal) to a. The standard Euclidean norm of a vector x ∈ R^d is denoted by ‖x‖ and the Lebesgue measure is denoted by Leb(·). A Euclidean ball centered at x ∈ R^d of radius r > 0 is denoted by B(x, r). For an arbitrary Borel measure µ on R^d that is absolutely continuous w.r.t. the Lebesgue measure, we denote by supp(µ) its support, that is, the set where the Radon–Nikodym derivative of µ w.r.t. Leb is strictly positive. For a vector function p : R^d → R^K and a Borel measure µ on R^d we define the infinity norm of p as

  ‖p‖_{∞,µ} := inf{ C ≥ 0 : max_{k ∈ [K]} |p_k(x)| ≤ C, a.e. x ∈ R^d w.r.t. µ }.

In this work, C and its lower-cased versions always refer to constants which might differ from line to line. Importantly, all these constants are independent of n, N but may depend on K, d and other parameters which are assumed to be fixed. Before introducing the families of distributions P considered in this work, we need the following definitions.

Assumption 2.1 (α-margin assumption). We say that the distribution P of the pair (X, Y) ∈ R^d × [K] satisfies the α-margin assumption if there exist C₀ > 0 and t₀ ∈ (0, 1) such that for every positive t ≤ t₀

  P_X(0 < |p_k(X) − G^{-1}(β)| ≤ t) ≤ C₀ t^α.

Let us point out an important consequence of Assumption 1.1: under it, the condition

  P_X(|p_k(X) − G^{-1}(β)| ≤ t) ≤ C₀ t^α,   for all t ∈ [0, t₀],

is equivalent to Assumption 2.1. Indeed, under Assumption 1.1 the random variables p_k(X) cannot concentrate at a constant level, in particular at G^{-1}(β). Moreover, again due to the continuity Assumption 1.1, we have

  lim_{t→0+} P_X(|p_k(X) − G^{-1}(β)| ≤ t) = 0,

thus the α-margin Assumption 2.1 specifies the rate of this convergence. Finally, restricting the range of t to [0, t₀] in the α-margin Assumption 2.1 does not affect its global behavior, as for all t ∈ [0, 1]

  P_X(0 < |p_k(X) − G^{-1}(β)| ≤ t) ≤ c₀ t^α,   with c₀ = C₀ ∨ t₀^{−α}.

Let c₀ and r₀ be two positive constants. We say that a Borel set A ⊂ R^d is a (c₀, r₀)-regular set if

  Leb(A ∩ B(x, r)) ≥ c₀ Leb(B(x, r)),   ∀r ∈ (0, r₀], ∀x ∈ A.

Definition 2.2 (Strong density). We say that the probability measure P_X on R^d satisfies the (µ_min, µ_max, c₀, r₀)-strong density assumption if it is supported on a compact (c₀, r₀)-regular set A ⊂ R^d and has a density µ w.r.t. the Lebesgue measure such that µ(x) = 0 for all x ∈ R^d \ A and

  0 < µ_min ≤ µ(x) ≤ µ_max < ∞,   ∀x ∈ A.
Definition 2.3 (Hölder class, Tsybakov (2008)). We say that a function h : R^d → R is (γ, L)-Hölder for γ > 0 and L > 0 if h is ⌊γ⌋ times continuously differentiable and for all x, x' ∈ R^d we have

  |h(x') − h_x(x')| ≤ L ‖x − x'‖^γ,

where h_x(·) is the Taylor polynomial of degree ⌊γ⌋ of h(·) at the point x ∈ R^d. Consequently, the set of all functions from R^d to R satisfying the above conditions is called the (γ, L, R^d)-Hölder class and is denoted by H(γ, L, R^d).

Definition 2.4.
We denote by P(L, γ, α) the set of joint distributions on R^d × [K] which satisfy the following conditions:

• the marginal P_X satisfies the (µ_min, µ_max, c₀, r₀)-strong density assumption,
• for all k ∈ [K] the k-th regression function p_k(·) = P(Y = k | X = ·) belongs to the (γ, L, R^d)-Hölder class, that is, p_k ∈ H(γ, L, R^d),
• for all k ∈ [K] the regression function p_k satisfies the (C₀, α, β)-margin assumption,
• for all k ∈ [K], the cumulative distribution function F_{p_k} of p_k(X) is continuous.

The family of distributions P(L, γ, α) is similar to the one considered in (Audibert and Tsybakov, 2007) in the context of binary classification. The only major difference is the continuity Assumption 1.1, which does not allow us to re-use their construction for lower bounds in a straightforward way.
3. Lower bounds
The main results of the present work are the lower bounds we provide in this section. In particular, we establish in Section 3.1 the inconsistency of top-β procedures (see Eq. (1.3) for the definition of the method). Therefore, more elaborate methods are required in this framework. As pointed out in the introduction, we distinguish two types of estimators, supervised and semi-supervised, for which we provide lower bounds in Section 3.2. The obtained rates highlight the benefit of the semi-supervised approach in the context of confidence set classification. Before considering the lower bounds, let us first display connections between the different minimax rates. Such links are used in the proofs of the lower bounds.
Proposition 3.1. Let Γ be a measurable function from R^d to 2^{[K]}, let β ∈ [K], and assume that Assumption 1.1 is fulfilled; then

  P(Γ) − P(Γ*_β) = R_β(Γ) − R_β(Γ*_β) + G^{-1}(β)(β − I(Γ)),
  R_β(Γ) − R_β(Γ*_β) = Σ_{k=1}^K E_{P_X}[ |p_k(X) − G^{-1}(β)| 1{k ∈ Γ(X) △ Γ*_β(X)} ].

Furthermore, if additionally Assumption 2.1 is satisfied with α > 0, then there exists C > 0, depending only on K, α, C₀, such that for any pair of confidence set classifiers Γ, Γ' it holds that

  E_{P_X} |Γ(X) △ Γ'(X)| ≤ C (R_β(Γ) − R_β(Γ'))^{α/(α+1)}.   (3.1)
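For completeness, the first identity is a one-line consequence of the definition (1.2), which gives P(Γ) = R_β(Γ) − G^{-1}(β) I(Γ), together with the fact that I(Γ*_β) = β under Assumption 1.1:

```latex
\begin{align*}
\mathrm{P}(\Gamma) - \mathrm{P}(\Gamma^{*}_{\beta})
  &= \bigl(R_{\beta}(\Gamma) - G^{-1}(\beta)\,\mathrm{I}(\Gamma)\bigr)
   - \bigl(R_{\beta}(\Gamma^{*}_{\beta}) - G^{-1}(\beta)\,\mathrm{I}(\Gamma^{*}_{\beta})\bigr)\\
  &= R_{\beta}(\Gamma) - R_{\beta}(\Gamma^{*}_{\beta})
   + G^{-1}(\beta)\bigl(\beta - \mathrm{I}(\Gamma)\bigr).
\end{align*}
```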
Proposition 3.2. For any K ≥ 2, β ∈ [K] and n, N ∈ ℕ, the following relation between minimax rates holds:

  Ψ^H_{n,N}(Γ̂; P) ≥ Ψ^D_{n,N}(Γ̂; P) ≥ Ψ^E_{n,N}(Γ̂; P).

Proposition 3.1, and in particular Eq. (3.1), gives an easy way to establish a lower bound on Ψ^E_{n,N}(Γ̂; P) via a lower bound on the Hamming distance Ψ^H_{n,N}(Γ̂; P). However, this approach does not allow us to get the (n + N)^{−1/2} (resp. n^{−1/2}) part of the rate in the lower bound on Ψ^E_{n,N}(Υ̂_SSE; P) (resp. Ψ^E_{n,N}(Υ̂_SE; P)). Besides, Proposition 3.2 allows us to prove a lower bound on the discrepancy Ψ^D_{n,N}(Γ̂; P) with the correct rate via the lower bound on the excess risk Ψ^E_{n,N}(Γ̂; P).

3.1. Inconsistency of the top-β procedure

Before stating our results on the supervised and the semi-supervised estimators, we discuss another interesting class of confidence sets, which might be a natural choice at first sight. We consider estimators which consist of β classes at every point x ∈ R^d, since such estimators naturally satisfy I(Γ̂) = β. Let us denote by Υ̂_β the set of all estimators Γ̂ such that |Γ̂(x)| = β for all x ∈ R^d, that is,

  Υ̂_β = { Γ̂ ∈ Υ̂ : |Γ̂(x)| = β, a.e. x ∈ R^d w.r.t. Leb }.

Despite the obvious restriction on the cardinality of the confidence sets, the family of estimators Υ̂_β is rather broad. Indeed, every procedure which estimates the regression functions p_k(·) and outputs the top β scores is included in Υ̂_β. The nature of the estimator can also be different, that is, the estimates could be based on ERM, non-parametric or parametric approaches. Clearly, the family Υ̂_β is neither included in Υ̂_SE nor in Υ̂_SSE and has a non-trivial intersection with both. The next result states that there is no uniformly consistent estimator Γ̂ ∈ Υ̂_β over the family of distributions P(L, γ, α).

Proposition 3.3.
Assume that K ≥ 6, β ∈ [⌊K/2⌋ − 1] and β ≥ 2; then for all n, N ∈ ℕ we have

  Ψ^E_{n,N}(Υ̂_β; P(L, γ, α)) ≥ (β − 1)/(2K).
The proof builds an explicit construction of a distribution P whose β-Oracle satisfies |Γ*_β(x)| > β for all x in some A ⊂ R^d with P_X(A) > 0. Indeed, if such a distribution exists, then there is no estimator in Υ̂_β that would consistently estimate its β-Oracle. The negative result established in Proposition 3.3 is rather instructive by itself, as it advocates that a more involved estimation procedure ought to be constructed.

3.2. Lower bounds for supervised and semi-supervised estimators

Clearly, estimators which achieve the infimum in the minimax rates are either supervised or semi-supervised; thus a lower bound on Ψ^□_{n,N}(Υ̂_SE; P) together with a lower bound on Ψ^□_{n,N}(Υ̂_SSE; P) yields a lower bound on Ψ^□_{n,N}(Υ̂; P). However, a lower bound on Ψ^□_{n,N}(Υ̂; P) does not discriminate between the supervised and the semi-supervised estimators.

Theorem 3.4 (Supervised estimation). Let K ≥ 4 and β ∈ [⌊K/2⌋ − 1]. If 2α⌈γ⌉ ≤ d, then there exist constants c, c', c'' > 0 such that for all n, N ∈ ℕ

  Ψ^H_{n,N}(Υ̂_SE; P(L, γ, α)) ≥ c (n^{−αγ/(2γ+d)} ∨ n^{−1/2}),
  Ψ^E_{n,N}(Υ̂_SE; P(L, γ, α)) ≥ c' (n^{−(1+α)γ/(2γ+d)} ∨ n^{−1/2}),
  Ψ^D_{n,N}(Υ̂_SE; P(L, γ, α)) ≥ c'' (n^{−(1+α)γ/(2γ+d)} ∨ n^{−1/2}).

Based on these results, we observe that the lower bound for the Hamming risk Ψ^H_{n,N} is slower than those for the other risks. It is even more significant that the best rate a supervised estimator can achieve for all of the risks is n^{−1/2}, even if the margin assumption holds. This is the major difference with the classical settings where the value of the threshold is known (such as classification and level set estimation). Indeed, under the same assumptions on the family of distributions, besides the continuity Assumption 1.1, the minimax rate in those frameworks is n^{−(1+α)γ/(2γ+d)}, as proved for instance in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009). The next theorem deals with semi-supervised procedures and displays another behavior.

Theorem 3.5 (Semi-supervised estimation). Let K ≥ 4 and β ∈ [⌊K/2⌋ − 1]. If 2α⌈γ⌉ ≤ d, then there exist constants c, c', c'' > 0 such that for all n, N ∈ ℕ

  Ψ^H_{n,N}(Υ̂_SSE; P(L, γ, α)) ≥ c (n^{−αγ/(2γ+d)} ∨ (n + N)^{−1/2}),
  Ψ^E_{n,N}(Υ̂_SSE; P(L, γ, α)) ≥ c' (n^{−(1+α)γ/(2γ+d)} ∨ (n + N)^{−1/2}),
  Ψ^D_{n,N}(Υ̂_SSE; P(L, γ, α)) ≥ c'' (n^{−(1+α)γ/(2γ+d)} ∨ (n + N)^{−1/2}).

First, observe that the lower bound for the Hamming distance is, as in the supervised setting, worse than for the other measures of performance. However, there is a major difference with the supervised case: as compared to Theorem 3.4, it is possible for a semi-supervised estimator to achieve rates faster than n^{−1/2} if the size N ∈ ℕ of the unlabeled dataset is large enough. In particular, when we consider Ψ^E_{n,N} or Ψ^D_{n,N}, the following relations are necessary to get fast rates:

  (n + N)^{−1/2} = o(n^{−(1+α)γ/(2γ+d)}),   n^{−(1+α)γ/(2γ+d)} = o(n^{−1/2}).

In this case, we recover the same fast rates as in the classical settings of classification and level set estimation. This suggests that the lack of knowledge of the threshold G^{-1}(β) does not alter the quality of estimation for the semi-supervised procedure, provided that N is sufficiently large. The next corollary makes these observations clearer.

Corollary 3.6.
Assume that the rates in Theorem 3.5 (resp. Theorem 3.4) are minimax, that is, there exists a confidence set Γ̂_SSE (resp. Γ̂_SE) that achieves these rates. Regarding Ψ^E_{n,N} and Ψ^D_{n,N}, the following conclusions hold.

• There is no semi-supervised estimator that achieves a faster rate than Γ̂_SE if

  (1+α)γ/(2γ+d) ≤ 1/2 and N ∈ ℕ,   or   (1+α)γ/(2γ+d) > 1/2 and N = O(n).

• The rate of Γ̂_SSE is faster than the rate of any supervised estimator if

  (1+α)γ/(2γ+d) > 1/2 and n = o(N).

Moreover, if there exists ρ > 1 such that n^ρ = o(N), then the rate of Γ̂_SSE is polynomially faster than n^{−1/2}.

• The rate of Γ̂_SSE is fast, similarly to the classical frameworks, if

  (1+α)γ/(2γ+d) > 1/2 and N = Ω(n^{2(1+α)γ/(2γ+d)}).

Clearly, a similar observation is true for the Hamming risk Ψ^H_{n,N}; however, the regime where improvement is possible thanks to the semi-supervised approach is narrower, as n^{−(1+α)γ/(2γ+d)} = o(n^{−αγ/(2γ+d)}). We summarize Corollary 3.6 in Table 1.
SE rate SSE rate SSE > SE ≤ N ∈ N , n ∈ N n − (1+ α ) γ γ + d n − (1+ α ) γ γ + d NO > N = O ( n ) n − n − NO > n = o ( N ) n − N − (cid:87) n − (1+ α ) γ γ + d YES > N = Ω (cid:18) n α ) γ γ + d (cid:19) n − n − (1+ α ) γ γ + d YES
Table 1
This table summarizes observations of Corollary 3.6 for Ψ E n,N and Ψ D n,N . Depending on therelations between α, γ, d and N, n the semi-supervised approach can significantly improve therates of convergence.
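To read Table 1 concretely, here is a worked instance (ours, not from the paper): take γ = 1, α = 1 and d = 1, so that

```latex
\[
\frac{(1+\alpha)\gamma}{2\gamma+d}=\frac{2}{3}>\frac{1}{2},
\qquad
N=n^{2}\ \Longrightarrow\
(n+N)^{-1/2}\vee n^{-2/3}\asymp n^{-2/3}.
\]
```

Then n = o(N) and N = Ω(n^{4/3}) both hold, so a semi-supervised estimator attains the rate n^{−2/3} (up to logarithms, by Theorem 4.2), while every supervised estimator is capped at n^{−1/2} by Theorem 3.4.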
Essentially, the above results suggest that the advantage of semi-supervised approaches over supervised ones depends not only on the underlying family of distributions P but also on the metric that is considered. Yet, necessary and sufficient conditions that must be imposed in general on the problem and the metric so that semi-supervised estimation provably improves upon supervised estimation remain an open problem.

A final remark before going further concerns the assumption on the parameters α and γ. The condition 2α⌈γ⌉ ≤ d in the lower bounds is slightly more restrictive than the conditions given in (Audibert and Tsybakov, 2007) (they have αγ ≤ d). We believe that this is an artifact of our proof and could be avoided with a finer choice of hypotheses. Simple modifications of the lower bound of Audibert and Tsybakov (2007) do not work in our settings because their hypotheses do not satisfy Assumption 1.1. In contrast, the construction of Rigollet and Vert (2009), modified properly to fit the classification framework, satisfies Assumption 1.1, but their lower bound is limited by the condition αγ ≤ 1, that is, it does not cover the fast rates as long as the dimension d > 1.

In order to prove the lower bounds of Theorems 3.4 and 3.5, we actually prove two separate lower bounds on the minimax rates. The two lower bounds are naturally connected with the proposed two-step estimator in Eq. (1.5): the first lower bound is connected with the problem of non-parametric estimation of p_k for all k ∈ [K], and the second describes the estimation of the unknown threshold G^{-1}(β).

In particular, the first lower bound is closely related to the ones provided in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009); however, the continuity Assumption 1.1 makes the proof more involved and results in a final construction of hypotheses that differs significantly. This part of our lower bound relies on Fano's inequality in the form of Birgé (2005). The second lower bound is based on two-hypotheses testing and is derived by constructing two different marginal distributions of X ∈ R^d which are sufficiently close, together with a fixed regression function p(·). Crucially, these marginal distributions admit two different values of the threshold G^{-1}(β) and thus two different β-Oracles. In this part we make use of Pinsker's inequality, see for instance (Tsybakov, 2008).

In order to discriminate between the supervised and the semi-supervised procedures, we make use of Definition 1.4. Notice that, thanks to Definition 1.4, every supervised procedure is not "sensitive" to the expectation taken w.r.t. the unlabeled dataset D_N, that is, randomness is only induced by the labeled dataset D_n. This strategy allows us to eliminate the dependence of the lower bound on the size of the unlabeled dataset D_N for supervised procedures. Informally, the lower bound on Ψ^□_{n,N}(Υ̂_SE; P) is obtained from the lower bound on Ψ^□_{n,N}(Υ̂_SSE; P) by setting N = 0.
4. Upper bounds
In this section, we show that one can build confidence set estimators that achieve, up to a logarithmic factor, the lower bounds stated in Theorems 3.4 and 3.5. In other words, those estimators are nearly optimal in the minimax sense. To come straight to the point, we delay the construction of the estimators to Section 4.1 and their properties to Section 4.2, and focus right now on the upper bounds.
Theorem 4.1 (Supervised estimation). Let K ∈ ℕ and β ∈ [K − 1]. Then there exist a supervised estimator Γ̂_SE ∈ Υ̂_SE and constants C, C', C'' > 0 such that for all n, N ∈ ℕ we have

  Ψ^H_{n,N}(Γ̂_SE; P(L, γ, α)) ≤ C (n^{−αγ/(2γ+d)} ∨ n^{−1/2}),
  Ψ^E_{n,N}(Γ̂_SE; P(L, γ, α)) ≤ C' ((n/log n)^{−(1+α)γ/(2γ+d)} ∨ n^{−1/2}),
  Ψ^D_{n,N}(Γ̂_SE; P(L, γ, α)) ≤ C'' ((n/log n)^{−(1+α)γ/(2γ+d)} ∨ n^{−1/2}).

Theorem 4.2 (Semi-supervised estimation). Let K ∈ ℕ and β ∈ [K − 1]. Then there exist a semi-supervised estimator Γ̂_SSE ∈ Υ̂_SSE and constants C, C', C'' > 0 such that for all n, N ∈ ℕ we have

  Ψ^H_{n,N}(Γ̂_SSE; P(L, γ, α)) ≤ C (n^{−αγ/(2γ+d)} ∨ (n + N)^{−1/2}),
  Ψ^E_{n,N}(Γ̂_SSE; P(L, γ, α)) ≤ C' ((n/log n)^{−(1+α)γ/(2γ+d)} ∨ (n + N)^{−1/2}),
  Ψ^D_{n,N}(Γ̂_SSE; P(L, γ, α)) ≤ C'' ((n/log n)^{−(1+α)γ/(2γ+d)} ∨ (n + N)^{−1/2}).

We show here that the lower bounds of Theorems 3.4 and 3.5 are achievable. In particular, in the case of the Hamming risk the upper bounds are optimal, whereas for the excess risk and the discrepancy the upper bounds match the lower bounds up to a logarithmic factor. Thus, the conclusions drawn in Corollary 3.6 are valid. Let us mention that the presence of the logarithmic factor in these upper bounds is due to ℓ∞-norm estimation (see Lemma 4.5).

The Hamming risk as a measure of performance was considered in the setting of Sadinle, Lei and Wasserman (2018). They also establish upper bounds for this measure, though they do not assess their optimality. Besides, as already mentioned, Denis and Hebiri (2017) provide an upper bound on the excess risk in the context of ERM. Let us point out that the comparison with these two works is not entirely fair, as the assumptions, and even the frameworks, under which the results are formulated are different.

4.1. Construction of the estimators

Building estimators Γ̂_SE and Γ̂_SSE that reach the rates in the former upper bounds involves preliminary estimators p̂_k of the regression functions p_k, k ∈ [K]. These estimators p̂_k are constructed using an arbitrary half D_{⌊n/2⌋} of the labeled dataset D_n, and they satisfy the following assumptions.

Assumption 4.3 (Exponential concentration). There exist estimators p̂_k for all k ∈ [K] based on D_{⌊n/2⌋} and positive constants C₁, C₂ such that for all k ∈ [K], all n ≥ 1 and all δ > 0

  sup_{P ∈ P(L,γ,α)} P^{⊗⌊n/2⌋}( |p̂_k(x) − p_k(x)| ≥ δ ) ≤ C₁ exp(−C₂ n^{2γ/(2γ+d)} δ²),

for almost all x ∈ R^d w.r.t. P_X.

Assumption 4.4 (Continuity of CDF). For all k ∈ [K], the cumulative distribution function F_{p̂_k}(t) := P_X(p̂_k(X) ≤ t) of p̂_k(X) is almost surely P^{⊗⌊n/2⌋} continuous on (0, 1).

First, let us point out that Assumption 4.3 implies that there exists a constant C > 0 such that for all n ≥ 1 and all α > 0

  sup_{P ∈ P(L,γ,α)} E_{D_{⌊n/2⌋}} ‖p − p̂‖^{1+α}_{∞,P_X} ≤ C (n/log n)^{−(1+α)γ/(2γ+d)}.

Assumption 4.3 is commonly used in the statistical community when dealing with rates of convergence in classification settings (Audibert and Tsybakov, 2007; Lei, 2014; Sadinle, Lei and Wasserman, 2018). It is for instance satisfied by the local polynomial estimator (Stone, 1977; Tsybakov, 1986; Audibert and Tsybakov, 2007). Assumption 4.4 can always be satisfied by slightly processing any estimator p̂. Indeed, assume Assumption 4.4 fails to be satisfied by some estimator p̂. This means that there exists a subset of R^d of non-zero measure such that at least one p̂_k, with k ∈ [K], is constant on this set. Then, if we add to p̂ a deterministic continuous function of sufficiently bounded variation (making sure that the addition preserves its statistical properties, that is, Assumption 4.3), such regions can no longer exist.

Since the threshold level G^{-1}(β) is not known beforehand, it ought to be estimated using data. A straightforward estimator of this threshold can be constructed using the unlabeled dataset D_N. To make our presentation mathematically correct, we introduce the following notation: D_n = D_{⌊n/2⌋} ∪ D_{⌈n/2⌉}, where D_{⌊n/2⌋} is the dataset used to build the estimators p̂_k for k ∈ [K]. Now, all the labels are removed from D_{⌈n/2⌉}, that is, it consists of ⌈n/2⌉ i.i.d. samples from P_X. The supervised and semi-supervised estimators of G(·) are defined as

  Ĝ_SE(·) = (1/⌈n/2⌉) Σ_{X ∈ D_{⌈n/2⌉}} Σ_{k=1}^K 1{p̂_k(X) > ·},
  Ĝ_SSE(·) = (1/(⌈n/2⌉ + N)) Σ_{X ∈ D_N ∪ D_{⌈n/2⌉}} Σ_{k=1}^K 1{p̂_k(X) > ·},

respectively. Finally, we are in position to define Γ̂_SE and Γ̂_SSE as

  Γ̂_SE(x) = {k ∈ [K] : p̂_k(x) ≥ Ĝ_SE^{-1}(β)},   Γ̂_SSE(x) = {k ∈ [K] : p̂_k(x) ≥ Ĝ_SSE^{-1}(β)},

for all x ∈ R^d. Note that Γ̂_SE is clearly supervised in the sense of Definition 1.4, as it is independent of the unlabeled sample D_N. In contrast, Γ̂_SSE is semi-supervised, since one can find two samples D_N and D'_N which induce different confidence sets.

4.2. Properties of the estimators

To show that the estimators introduced in this section satisfy the statements of Theorems 4.1 and 4.2, we refine the proof technique used in (Denis and Hebiri, 2017). That is, we introduce an intermediate quantity

  G̃(·) := Σ_{k=1}^K P_X(p̂_k(X) > ·),

and the associated confidence set, which we refer to as the pseudo-Oracle confidence set, given for all x ∈ R^d by

  Γ̃(x) := {k ∈ [K] : p̂_k(x) ≥ G̃^{-1}(β)}.

The confidence set Γ̃ assumes knowledge of the marginal distribution P_X and is seen as an idealized version of both Γ̂_SE and Γ̂_SSE; note, however, that the pseudo-Oracle Γ̃ is not an estimator.
An important step of our analysis is the following lemma, which bounds the difference between G̃^{-1}(β) and G^{-1}(β).

Lemma 4.5 (Upper bound on the thresholds). Let Assumption 1.1 be satisfied; then for all β ∈ [K]

  |G^{-1}(β) − G̃^{-1}(β)| ≤ ‖p − p̂‖_{∞,P_X},

almost surely P^{⊗n} ⊗ P_X^{⊗N}.

The proof of Lemma 4.5 uses elementary properties of generalized inverse functions which are provided in the Appendix. Besides, let us mention that the difference |G^{-1}(β) − G̃^{-1}(β)| resembles the Wasserstein infinity distance, which gives an alternative approach to prove Lemma 4.5, see (Bobkov and Ledoux, 2016). Lemma 4.5 explains the extra log n factor that appears in the upper bound, as the minimax estimation in sup-norm contains a log n factor, see for instance (Stone, 1982; Tsybakov, 2008). Another important property of the introduced estimators Γ̂_SE and Γ̂_SSE is obtained via Assumption 4.4. It describes the deviation of the information of Γ̂_SE and Γ̂_SSE from the desired level β.

Proposition 4.6 (Denis and Hebiri (2017)). Let p̂_k for all k ∈ [K] be arbitrary estimators of the regression functions, constructed using D_{⌊n/2⌋}, that satisfy Assumption 4.4; then there exist constants C, C' > 0 such that for all n, N ∈ ℕ it holds that

  E_{(D_n, D_N)} |β − I(Γ̂_SE)| ≤ C n^{−1/2},   E_{(D_n, D_N)} |β − I(Γ̂_SSE)| ≤ C' (N + n)^{−1/2}.

Note that if p̂_k satisfies Assumption 4.4 for all k ∈ [K], then β = I(Γ̃). This simple fact is a step in the proof of Proposition 4.6. Finally, the combination of Lemma 4.5, Proposition 4.6 and Assumption 4.3 with the peeling argument used in (Audibert and Tsybakov, 2007, Lemma 3.1) yields the results of Theorems 4.1 and 4.2.

4.3. Numerical experiments

The goal of this part is to numerically address the following points.

1) Is it advantageous to go outside of the classical multi-class classification setting and consider the confidence set framework? To respond to this question, we compute the Bayes optimal multi-class classifier and view it as a confidence set with one label. We compare this Bayes rule with the β-Oracle in terms of the error P(·) using various values of β ∈ [K] and K ∈ ℕ.
2) How does the β-Oracle confidence set compare to another "Oracle" (the top-β Oracle) which simply includes the classes corresponding to the largest values of the p_k(·)'s?
3) Does the proposed plug-in approach indeed give a good approximation of the β-Oracle through the error P(·) and the information I(·)?
4) Despite having demonstrated the minimax inconsistency of the top-β approach, we wonder whether in some scenarios it can achieve a comparable performance against our semi-supervised plug-in procedure.

We consider two simulation schemes depending on the parameter K ∈ {10, 100}. For each K, we generate (X, Y) according to a mixture model. More precisely,

i) the label Y follows the uniform distribution on [K];
ii) conditional on Y = k, the feature X is generated according to a multivariate Gaussian distribution with mean µ_k and identity covariance matrix.

For each k ∈ [K], the vectors µ_k are i.i.d. realizations of a uniform distribution on a fixed hypercube. For this distribution, we have

  p_k(X) = f_k(X) / Σ_{j=1}^K f_j(X),

where for each k ∈ [K], f_k is the density function of a multivariate Gaussian distribution with mean parameter µ_k and identity covariance matrix.

For each K, the misclassification error of the classical multi-class classification Bayes rule is evaluated on a sufficiently large dataset. It is valued at 0.22 for K = 10 and at 0.60 for K = 100. These values are relatively high, which suggests that confusion is induced by the large number of classes. Hence, it is reasonable to apply the confidence set approach to this problem.

In the sequel, we aim at estimating the error of the β-Oracle. To this end, for β ∈ {2, 5, 10, 20} and each K, we repeat B = 100 times the following steps:

i) simulate two datasets D_N and D_M with N = 10000 and M = 1000;
ii) based on D_N, compute the empirical counterpart of G and provide an approximation of the β-Oracle Γ*_β given in Eq. (1.1) (we recall that this step requires a dataset which contains only unlabeled features);
iii) finally, over D_M, compute the empirical counterparts P_M (of P(Γ*_β)) and I_M (of I(Γ*_β)).

From these estimates, we compute the mean and the standard deviation of P_M and I_M. Tables 2 and 3 present the values of the error and of the information achieved by the β-Oracle and by the top-β Oracle.

Table 2
For each of the B = 100 repetitions and each model, we derive the estimated errors P_M of the β-Oracle and of the top-β Oracle w.r.t. β. We compute the means and standard deviations (in parentheses) over the B = 100 repetitions. Top: the data are generated according to K = 10 – Bottom: the data are generated according to K = 100.

K = 10
β     β-Oracle      top-β Oracle
2     0.05 (0.01)   0.09 (0.01)
5     0.00 (0.00)   0.01 (0.00)

K = 100
β     β-Oracle      top-β Oracle
2     0.39 (0.01)   0.42 (0.01)
5     0.20 (0.01)   0.22 (0.01)
10    0.09 (0.01)   0.11 (0.01)
20    0.03 (0.01)   0.04 (0.01)

Table 3
For each of the B = 100 repetitions and each model, we derive the estimated information levels I_M of the β-Oracle set w.r.t. β. We compute the means and standard deviations (in parentheses) over the B = 100 repetitions. Left: the data are generated according to K = 10 – Right: the data are generated according to K = 100.

β     K = 10        K = 100
2     2.00 (0.03)   2.00 (0.03)
5     5.00 (0.08)   5.00 (0.06)
10    ·             ·
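A minimal reimplementation of the protocol i)–iii) above (ours; the dimension d = 2 and the range [0, 5] of the uniform draw for the means are illustrative assumptions, since the exact simulation constants are not essential):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
K, d, beta = 10, 2, 2
mus = rng.uniform(0.0, 5.0, size=(K, d))         # illustrative class means

def sample(m):
    y = rng.integers(K, size=m)                  # Y uniform on [K]
    x = mus[y] + rng.standard_normal((m, d))     # X | Y=k ~ N(mu_k, I_d)
    return x, y

def posteriors(x):
    # p_k(x) = f_k(x) / sum_j f_j(x) for the Gaussian mixture.
    dens = np.stack([multivariate_normal(mus[k], np.eye(d)).pdf(x)
                     for k in range(K)], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

# i) unlabeled calibration sample and labeled test sample
XN, _ = sample(10_000)
XM, yM = sample(1_000)
# ii) empirical G and its generalized inverse on the unlabeled sample
pN = posteriors(XN)
grid = np.linspace(0.0, 1.0, 1_000)
G_vals = [(pN > t).sum(axis=1).mean() for t in grid]
t_beta = grid[next(i for i, g in enumerate(G_vals) if g <= beta)]
# iii) empirical error P_M and information I_M of the approximate beta-Oracle
pM = posteriors(XM)
inside = pM[np.arange(len(yM)), yM] >= t_beta
print("error:", 1 - inside.mean(), "information:", (pM >= t_beta).sum(axis=1).mean())
```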
We now move towards the construction of our semi-supervised plug-in estimators Γ̂_SSE. For each K and each β, we evaluate the performance of Γ̂_SSE according to three different estimations of the regression function: the p̂_k's are based on random forests, softmax regression and deep learning procedures. Let us point out that, for the random forests and softmax regression algorithms, the random variables p̂_k(X) appear not to be continuous; hence Assumption 4.4 is violated. To alleviate this issue, we add to p̂_k(X) a small independent perturbation, taken for simplicity as the absolute value of a centered Gaussian with negligible variance. The evaluation of the performance of Γ̂_SSE relies on the following steps:

i) simulate three datasets D_n, D_N and D_M;
ii) based on D_n, compute the estimators p̂_k of p_k according to the considered procedure;
iii) based on D_N and p̂_k, compute the function Ĝ and the estimator Γ̂_SSE as in Eq. (1.5) (we recall that this step requires a dataset which contains only unlabeled features);
iv) finally, compute over D_M the empirical counterparts of P and of I for the considered Γ̂_SSE.

Again, during these experiments, we compute means and standard deviations. The parameters K, n, N are fixed as follows: for K = 10, we fix n = 1000 and N ∈ {100, 10000}; for K = 100, we fix n = 10000 and N ∈ {100, 10000}. Finally, the size of D_M is fixed to M = 1000. The results are illustrated in Tables 4 and 5.

Table 4
For each of the B = 100 repetitions and for each model, we derive the estimated errors P of three different Γ̂_SSE's w.r.t. β. We compute the means and standard deviations (in parentheses) over the B = 100 repetitions. For each β and for each N, the Γ̂_SSE's, as well as the top-β procedures, are based on, from left to right, rforest, softmax reg and deep learn, which are respectively the random forest, the softmax regression and the deep learning methods. Top: the data are generated according to K = 10 – Bottom: the data are generated according to K = 100.

Table 5
For each of the B = 100 repetitions and for each model, we derive the estimated information levels I of three different Γ̂_SSE's w.r.t. β and the sample size N ∈ {100, 10000}. We compute the means and standard deviations (in parentheses) over the B = 100 repetitions. For each β and each N, the Γ̂_SSE's are based on, from left to right, rforest, softmax reg and deep learn, which are respectively the random forest, the softmax regression and the deep learning procedures. Top: the data are generated according to K = 10 – Bottom: the data are generated according to K = 100.

As a benchmark for the continuation of our experiments, the classical misclassification errors of the multi-class classifiers based on random forests and softmax regression are valued at 0.28 and 0.24 for K = 10, while for K = 100 the errors of the random forest, softmax regression and deep learning methods are valued respectively at 0.65, 0.98 and 0.63.

Turning to Table 2, we confirm the intuition that the error of the β-Oracle decreases as the value of the parameter β increases. Nevertheless, for moderate values of β compared to K, we obtain a satisfactory improvement over the standard multi-class classification Bayes rule. For instance, when K = 10 and β = 2 the error of the 2-Oracle confidence set is 0.05, whereas the Bayes classifier has 0.22; likewise, when K = 100 and β = 5 the classification error decreases from 0.60 to 0.20. Table 2 also shows that the top-β Oracle is slightly outperformed by the β-Oracle in terms of the error, but still performs well.

From Tables 3 and 5, we observe that the approximation of the information is reasonably good and gets better with N, the number of unlabeled data. Besides, Tables 2 and 4 demonstrate that our algorithm is sensitive to the choice of the underlying estimator p̂_k. Indeed, when p̂_k is estimated via the softmax regression, our algorithm fails to give a good approximation of the error of the β-Oracle. Table 4 provides similar conclusions regarding Γ̂_SSE, though, unlike for the theoretical quantities, there are more scenarios where our method is better than its top-β counterpart. Let us point out that for K = 100 the methods based on softmax regression perform poorly in this setup.
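Finally, a sketch of the plug-in pipeline evaluated above — ours rather than the authors' exact code: scikit-learn's random forest plays the role of p̂, and a tiny half-Gaussian perturbation enforces the continuity required by Assumption 4.4 (the noise scale 1e-5, the number of trees and the grid size are illustrative assumptions).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_sse_confidence_set(Xn, yn, XN, beta, rng):
    # Plug-in step: p-hat fitted on the labeled sample D_n.
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xn, yn)

    def scores(X):
        s = clf.predict_proba(X)
        # Tiny perturbation |N(0, sigma^2)| to make the score CDFs continuous
        # (Assumption 4.4); the scale is an illustrative choice.
        return s + np.abs(rng.normal(0.0, 1e-5, size=s.shape))

    # Calibration step: empirical G on the unlabeled sample D_N, as in Eq. (1.5);
    # the loop stops at the first grid point where G-hat(t) <= beta.
    sN = scores(XN)
    for t in np.linspace(0.0, 1.0, 1_000):
        if (sN >= t).sum(axis=1).mean() <= beta:
            break
    return lambda X: [np.where(row >= t)[0] for row in scores(X)]
```

Calling fit_sse_confidence_set(Xn, yn, XN, beta=2, rng=np.random.default_rng(0)) returns a function mapping feature rows to label sets, whose empirical error and information can then be measured on D_M as in step iv) above.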
5. Discussions
5.1. The continuity assumption

The bedrock of this paper is Assumption 1.1. Based on it, we ensure that the β-Oracle confidence set given by Eq. (1.1) is indeed of information β. On top of that, the explicit formulation of the excess risk in Proposition 3.1 relies on the continuity of the function G(·). Should Assumption 1.1 fail to be satisfied, there might be no β-Oracle given by thresholding at some level θ ∈ (0, 1). Indeed, assuming that there exists a β-Oracle of the form Γ*_β(·) = {k ∈ [K] : p_k(·) > θ} for some θ, then β = I(Γ*_β) = G(θ). However, without the continuity, the function G(·) is not surjective and therefore the equation G(θ) = β may have no solution, which contradicts the fact that I(Γ*_β) = β. Therefore, the settings without the continuity of G(·) deserve a separate study. Let us also point out that the continuity assumption implies that the β-Oracle can also be defined as
\[
\Gamma^*_\beta \in \arg\min\{\mathrm{P}(\Gamma) : \Gamma \in \Upsilon \text{ s.t. } \mathrm{I}(\Gamma) \le \beta\},
\]
where the inequality is used in place of the equality. Indeed, under the continuity assumption, thanks to Propositions 1.3 and 3.1, we have for all confidence sets Γ such that I(Γ) ≤ β
\[
\mathrm{P}(\Gamma) - \mathrm{P}(\Gamma^*_\beta) = \underbrace{\mathrm{R}_\beta(\Gamma) - \mathrm{R}_\beta(\Gamma^*_\beta)}_{\ge 0} + \underbrace{G^{-1}(\beta)\bigl(\beta - \mathrm{I}(\Gamma)\bigr)}_{\ge 0},
\]
which implies that the β-Oracle Γ*_β is a minimizer.

5.2. Lipschitz continuity of G^{-1}(·)

Under the assumptions needed in this work, and in particular the continuity assumption, we showed two important facts: i) no supervised approach can achieve fast rates, that is, rates faster than n^{-1/2}; ii) some semi-supervised approaches can achieve fast rates. One might wonder whether extra assumptions on the problem allow a supervised method to get faster rates than n^{-1/2}. We give a partial answer to this question following the recent work of Bobkov and Ledoux (2016), and more precisely their Theorem 5.11. Applying this result to our framework, we can state that there exists a positive constant c such that
\[
\mathbb{E}_{\mathcal{D}_N}\bigl|G^{-1}(\beta) - G_N^{-1}(\beta)\bigr| \le c\,\mathrm{Lip}(G^{-1})\,N^{-1/2},
\]
where Lip(G^{-1}) is the Lipschitz constant of G^{-1}(·) and G_N^{-1}(·) is the generalized inverse of
\[
G_N(\cdot) = \frac{1}{N}\sum_{X \in \mathcal{D}_N}\sum_{k=1}^{K}\mathbf{1}\{p_k(X) > \cdot\}.
\]
If, on top of the above, one can show that for any α > 0 there exists c′ > 0 such that
\[
\mathbb{E}_{\mathcal{D}_N}\bigl|G^{-1}(\beta) - G_N^{-1}(\beta)\bigr|^{1+\alpha} \le c'\,\mathrm{Lip}^{1+\alpha}(G^{-1})\,N^{-(1+\alpha)/2},
\]
then, under Lipschitz continuity of G^{-1}(·), we can prove that
\[
\Psi^{\mathrm{E}}_{n,N}\bigl(\hat\Gamma_{\square};\ \mathcal{P}(L,\gamma,\alpha)\bigr) \lesssim \Bigl(\frac{n}{\log n}\Bigr)^{-\frac{(1+\alpha)\gamma}{2\gamma+d}},
\]
where □ stands for SE or SSE. This would illustrate that both ˆΓ_SE and ˆΓ_SSE are statistically equivalent under a Lipschitz condition on G^{-1}(·): both reach the same rate, and the impact of the unlabeled data D_N is negligible. We plan to further investigate the influence of this Lipschitz condition on the minimax rates of convergence in future work. Since in the present contribution we do not impose this assumption on G^{-1}(·), the upper bound of Bobkov and Ledoux (2016) is not applicable and we had to rely on a different approach. A toy illustration of the N^{-1/2} behavior of the empirical threshold is sketched below.
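The simulation below is our own construction, not the paper's: we take K = 2 with p₁(X) uniform on [0, 1], so that G(t) = 2(1 − t) and G^{-1}(β) = 1 − β/2 are available in closed form, and the empirical inverse G_N^{-1}(β) reduces to an order statistic of the pooled scores.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 1.0                      # target information level, K = 2 classes
true_t = 1 - beta / 2           # closed form: G(t) = 2(1 - t) when p_1 ~ U(0, 1)

def G_N_inv(p1, beta):
    # G_N(t) = (1/N) * #{scores > t} with scores {p_1(X_i), 1 - p_1(X_i)};
    # its generalized inverse is an order statistic of the pooled scores.
    s = np.sort(np.concatenate([p1, 1 - p1]))[::-1]
    return s[int(beta * len(p1))]

for N in [100, 1000, 10000, 100000]:
    devs = [abs(G_N_inv(rng.uniform(size=N), beta) - true_t) for _ in range(200)]
    print(f"N = {N:6d}: E|G^-1(beta) - G_N^-1(beta)| ~ {np.mean(devs):.5f}, "
          f"N^(-1/2) = {N ** -0.5:.5f}")
```

The printed deviations shrink at roughly the N^{-1/2} pace, in line with the Bobkov-Ledoux bound quoted above.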
Theorems 3.5 and 4.2 demonstrate that, for the excess risk and the discrepancy, the upper and the lower bounds differ by a logarithmic factor. As we have already pointed out, this factor appears in the upper bounds due to Lemma 4.5, which relates the difference between two thresholds to the infinity norm. One might hope that, if we managed to replace the infinity norm by another ℓ_q-norm on the right-hand side of the inequality in Lemma 4.5, this logarithm could be eliminated. Unfortunately, this bound is actually tight, in the sense that one can construct a distribution P and estimators ˆp_k, k ∈ [K], for which equality is achieved in Lemma 4.5. These arguments suggest that the obtained upper bound should be optimal. They also imply that the lower bounds could be refined further to gain an extra logarithmic factor. Let us also mention that the continuity Assumption 1.1, in combination with the margin Assumption 2.1, is the main obstacle that prevented us from providing better lower bounds. Nevertheless, our proofs are already involved, and our results allow us to draw non-trivial conclusions even without going into the details concerning the logarithms.
6. Conclusion
In this work we have studied the minimax settings of confidence set multi-class classification. First of all, following previous works, we have shown that a top-β type procedure is inconsistent in our settings and that more involved estimators are required. Besides, we have demonstrated that no supervised estimator can achieve rates faster than n^{-1/2}, which stands in contrast to other classical settings. Additionally, we have shown that fast rates are achievable for semi-supervised techniques, provided that the size of the unlabeled sample is large enough. Consequently, we have established that our lower bounds are either optimal or nearly optimal by providing a supervised and a semi-supervised estimator which are tractable in practice. Our future work shall focus on the Lipschitz condition on G^{-1}(·) discussed in Section 5.2; in particular, we want to understand how this extra assumption affects our lower bounds.

References
Anbar, D. (1977). A modified Robbins-Monro procedure approximating the zero of a regression function from below. Ann. Statist.
Audibert, J. and Tsybakov, A. (2007). Fast learning rates for plug-in classifiers. Ann. Statist.
Bartlett, P. and Wegkamp, M. (2008). Classification with a reject option using a hinge loss. J. Mach. Learn. Res.
Bellec, P., Dalalyan, A., Grappin, E. and Paris, Q. (2018). On the prediction loss of the lasso in the partially labeled setting. Electron. J. Statist.
Birgé, L. (2005). A new lower bound for multiple hypothesis testing. IEEE Trans. Inform. Theory.
Bobkov, S. and Ledoux, M. (2016). One-dimensional empirical measures, order statistics and Kantorovich transport distances. To appear in the Memoirs of the Amer. Math. Soc.
Brown, L. and Low, M. (1996). A constrained risk inequality with applications to nonparametric functional estimation. Ann. Statist.
Chow, C. (1957). An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers.
Chow, C. (1970). On optimum error and reject trade-off. IEEE Trans. Inform. Theory.
de Finetti, B. (1972). Probability, Induction and Statistics. The Art of Guessing. John Wiley & Sons, London-New York-Sydney, Wiley Series in Probability and Mathematical Statistics.
de Finetti, B. (1974). Theory of Probability: A Critical Introductory Treatment. Vol. 1. John Wiley & Sons, London-New York-Sydney.
Denis, C. and Hebiri, M. (2015). Consistency of plug-in confidence sets for classification in semi-supervised learning. Preprint.
Denis, C. and Hebiri, M. (2017). Confidence sets with expected sizes for multiclass classification. J. Mach. Learn. Res.
Gilbert, E. (1952). A comparison of signalling alphabets. The Bell System Technical Journal.
Hartigan, J. A. (1987). Estimation of a convex density contour in two dimensions. J. Amer. Statist. Assoc.
Herbei, R. and Wegkamp, M. (2006). Classification with reject option. Canad. J. Statist.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc.
Lei, J. (2014). Classification with confidence. Biometrika.
Lepskii, O. (1990). Asymptotic minimax estimation with prescribed properties. Theory Probab. Appl.
Polonik, W. (1995). Measuring mass concentrations and estimating density contour clusters: an excess mass approach. Ann. Statist.
Ramaswamy, H., Tewari, A. and Agarwal, S. (2018). Consistent algorithms for multiclass classification with an abstain option. Electron. J. Stat.
Rigollet, P. (2007). Generalization error bounds in semi-supervised classification under the cluster assumption. J. Mach. Learn. Res.
Rigollet, P. and Vert, R. (2009). Optimal rates for plug-in estimators of density level sets. Bernoulli.
Sadinle, M., Lei, J. and Wasserman, L. (2018). Least ambiguous set-valued classifiers with bounded error levels. J. Amer. Statist. Assoc.
Singh, A., Nowak, R. and Zhu, J. (2009). Unlabeled data: Now it helps, now it doesn't. In NIPS.
Stone, C. (1977). Consistent nonparametric regression. Ann. Statist.
Stone, C. (1982). Optimal global rates of convergence for nonparametric regression. Ann. Statist.
Tsybakov, A. (1986). Robust reconstruction of functions by the local-approximation method. Problemy Peredachi Informatsii.
Tsybakov, A. B. (1997). On nonparametric estimation of density level sets. Ann. Statist.
Tsybakov, A. (2004). Optimal aggregation of classifiers in statistical learning. Ann. Statist.
Tsybakov, A. (2008). Introduction to Nonparametric Estimation. Springer Ser. Statist. Springer, New York.
van der Vaart, A. (1998). Asymptotic Statistics. Camb. Ser. Stat. Probab. Math. Cambridge University Press, Cambridge.
Vapnik, V. (1998). Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons, New York.
Varshamov, R. (1957). Estimate of the number of signals in error correcting codes. Dokl. Akad. Nauk SSSR.
Vovk, V. (2002a). Asymptotic optimality of transductive confidence machine. In Algorithmic Learning Theory. Lecture Notes in Comput. Sci.
Vovk, V. (2002b). On-line confidence machines are well-calibrated. In Proceedings of the Forty-Third Annual Symposium on Foundations of Computer Science.
Vovk, V., Gammerman, A. and Shafer, G. (2005). Algorithmic Learning in a Random World. Springer, New York.
Wegkamp, M. and Yuan, M. (2011). Support vector machines with a reject option. Bernoulli.

Appendix A: Technical results
Here we provide the proofs of our results. This appendix is composed of the following parts: in Appendix A we introduce some technical results used in our proofs; Appendix B is devoted to the proofs of the upper bounds; Appendix C provides the proofs of our main lower bounds; finally, in Appendix D we prove the inconsistency of top-β approaches.

In this section we gather several technical results which are used to derive the contributions of this work. Let us start by introducing the notation used in the appendix. Given any two probability measures P₀, P₁ on some measurable space (X, A), the Kullback-Leibler divergence between P₀ and P₁ is defined as
\[
\mathrm{KL}(P_0, P_1) := \begin{cases}\int_{\mathcal{X}}\log\Bigl(\dfrac{dP_0}{dP_1}\Bigr)\,dP_0, & \mathrm{supp}(P_0) \subset \mathrm{supp}(P_1),\\ +\infty, & \text{otherwise},\end{cases} \tag{A.1}
\]
and the total variation distance is defined as
\[
\mathrm{TV}(P_0, P_1) := \sup_{A \in \mathcal{A}}|P_0(A) - P_1(A)|. \tag{A.2}
\]
We start with Fano's inequality in the form proved by Birgé (2005).

Lemma A.1 (Fano's inequality (Birgé, 2005)). Let {P_i}_{i=0}^m be a finite family of probability measures on (X, A) and let {A_i}_{i=0}^m be a finite family of disjoint events such that A_i ∈ A for each i = 0, ..., m. Then,
\[
\min_{i \in \{0,1,\dots,m\}} P_i(A_i) \le 0.71 \vee \frac{\frac{1}{m}\sum_{i=1}^{m}\mathrm{KL}(P_i, P_0)}{\log(m+1)}.
\]

Lemma A.2 (Pinsker's inequality). Given any two probability measures P₀, P₁ on some measurable space (X, A), we have
\[
\mathrm{TV}(P_0, P_1) \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P_0, P_1)}.
\]

Lemma A.3 (Hoeffding's inequality (Hoeffding, 1963)). Let b > 0 be a real number and N a positive integer. Let X₁, ..., X_N be N random variables with values in [0, b]; then
\[
\mathbb{P}\Biggl(\Biggl|\frac{1}{N}\sum_{i=1}^{N}\bigl(X_i - \mathbb{E}[X_i]\bigr)\Biggr| \ge t\Biggr) \le 2\exp\Bigl(-\frac{2N t^2}{b^2}\Bigr), \qquad \forall t > 0.
\]

Proposition A.4 (Properties of the generalized inverse). Let X ∈ R^d and P_X be a Borel measure on R^d, let p : R^d → [0,1]^K be a vector function, and define for all t ∈ [0,1] and all β ∈ (0, K)
\[
G(t) := \sum_{k=1}^{K} P_X\bigl(p_k(X) > t\bigr), \qquad G^{-1}(\beta) := \inf\{t \in [0,1] : G(t) \le \beta\}.
\]
Then,
• for all t ∈ (0,1) and β ∈ (0,K) we have G^{-1}(β) ≤ t ⟺ G(t) ≤ β;
• if for all k ∈ [K] the mappings t ↦ P_X(p_k(X) > t) are continuous on (0,1), then for all β ∈ (0,K) we have G(G^{-1}(β)) = β.

The next result is an analogue of the classical inverse transform theorem (van der Vaart, 1998, Lemma 21.1) and was already established by Denis and Hebiri (2017).
Lemma A.5.
Let ε be uniformly distributed on [K] and let Z₁, ..., Z_K be K real-valued random variables independent of ε, such that the function t ↦ H(t) defined as
\[
H(t) := \frac{1}{K}\sum_{k=1}^{K}\mathbb{P}(Z_k \le t)
\]
is continuous. Consider the random variable Z = Σ_{k=1}^K Z_k 1{ε = k} and let U be distributed according to the uniform distribution on [0,1]. Then H(Z) =_L U and H^{-1}(U) =_L Z, where H^{-1} denotes the generalized inverse of H.

Proof. First we note that for every t ∈ [0,1], P(H(Z) ≤ t) = P(Z ≤ H^{-1}(t)). Moreover, we have
\[
\mathbb{P}(H(Z) \le t) = \sum_{k=1}^{K}\mathbb{P}\bigl(Z \le H^{-1}(t),\ \varepsilon = k\bigr) = \frac{1}{K}\sum_{k=1}^{K}\mathbb{P}\bigl(Z_k \le H^{-1}(t)\bigr) \quad (\varepsilon \text{ independent of the } Z_k) \ = H\bigl(H^{-1}(t)\bigr) = t \quad (H \text{ continuous}).
\]
To conclude the proof, we observe that
\[
\mathbb{P}\bigl(H^{-1}(U) \le t\bigr) = \mathbb{P}\bigl(U \le H(t)\bigr) = \frac{1}{K}\sum_{k=1}^{K}\mathbb{P}(Z_k \le t) = \sum_{k=1}^{K}\mathbb{P}(Z_k \le t,\ \varepsilon = k) = \mathbb{P}(Z \le t).
\]
A quick Monte Carlo check of this lemma is sketched below.
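The following sanity check is our own illustration; the mixture components are arbitrary continuous distributions (our choice, not the paper's). It verifies numerically that H(Z) is uniform on [0, 1]:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
K, n = 5, 100000

# arbitrary continuous choices for the laws of Z_1, ..., Z_K
dists = [stats.norm(loc=k, scale=1.0 + 0.3 * k) for k in range(K)]

eps = rng.integers(K, size=n)                # uniform label on [K]
draws = np.stack([d.rvs(size=n, random_state=rng) for d in dists])
Z = draws[eps, np.arange(n)]                 # Z = Z_eps

def H(t):
    # H(t) = (1/K) * sum_k P(Z_k <= t), the mixture CDF of the lemma
    return np.mean([d.cdf(t) for d in dists], axis=0)

print(stats.kstest(H(Z), "uniform"))   # large p-value: H(Z) ~ U[0, 1]
```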
Appendix B: Upper bounds

In this section we prove Theorems 4.1 and 4.2. It will be clear from our analysis that the proof of Theorem 4.1 follows directly from Theorem 4.2 by setting N = 0 in the statement of Theorem 4.2. Thus, in this section, for simplicity, we omit the subscript SSE from ˆΓ_SSE. Recall that our dataset consists of three parts D_{⌊n/2⌋}, D_{⌈n/2⌉}, D_N. The set D_{⌊n/2⌋} is used to construct an estimator ˆp of the regression function p; that is, ˆp is independent from both D_{⌈n/2⌉} and D_N. The other two sets D_{⌈n/2⌉}, D_N are used in a semi-supervised manner to estimate the threshold; that is, we erase the labels from D_{⌈n/2⌉}. Let β ∈ [K−1] and define for all x ∈ R^d
\[
\hat\Gamma(x) = \bigl\{k \in [K] : \hat p_k(x) \ge \hat G^{-1}(\beta)\bigr\},
\]
with ˆp_k(x) satisfying Assumptions 4.4 and 4.3 for all k ∈ [K]. Moreover, ˆG^{-1}(β) is defined as the generalized inverse of
\[
\hat G(t) = \frac{1}{\lceil n/2\rceil + N}\sum_{X \in \mathcal{D}_N \cup \mathcal{D}_{\lceil n/2\rceil}}\sum_{k=1}^{K}\mathbf{1}\{\hat p_k(X) > t\},
\]
where t ∈ [0,1]. Recall that the β-Oracle is given as
\[
\Gamma^*_\beta(x) = \bigl\{k \in [K] : p_k(x) \ge G^{-1}(\beta)\bigr\}, \tag{B.1}
\]
where G^{-1}(·) is the generalized inverse of G(t) := Σ_{k=1}^K P(p_k(X) ≥ t). Lastly, let us re-introduce an idealized version ˜Γ of the proposed estimator ˆΓ which 'knows' the marginal distribution P_X of the feature vector X ∈ R^d:
\[
\tilde\Gamma(x) = \bigl\{k \in [K] : \hat p_k(x) \ge \tilde G^{-1}(\beta)\bigr\},
\]
with ˜G(t) := Σ_{k=1}^K P_X(ˆp_k(X) > t), conditionally on the data. The following result is needed to relate the threshold ˜G^{-1}(β) of ˜Γ to the true value of the threshold G^{-1}(β).

Lemma B.1 (Upper bound on the thresholds). Let X ∈ R^d and P_X be a Borel measure on R^d. For two vector functions p, ˆp : R^d → [0,1]^K, we define
\[
G(\cdot) := \sum_{k=1}^{K} P_X\bigl(p_k(X) > \cdot\bigr), \qquad \tilde G(\cdot) := \sum_{k=1}^{K} P_X\bigl(\hat p_k(X) > \cdot\bigr).
\]
If for all k ∈ [K] the mapping t ↦ P_X(p_k(X) > t) is continuous on (0,1), then for every β ∈ (0,K)
\[
\bigl|G^{-1}(\beta) - \tilde G^{-1}(\beta)\bigr| \le \|\hat p - p\|_{\infty, P_X}.
\]
Proof.
The proof of this result is very similar to the proof of (Bobkov and Ledoux, 2016, Theorem 2.12). We start by defining the quantity
\[
h^* = \inf\bigl\{h \ge 0 : \forall t \in [0,1],\ \tilde G(t+h) \le G(t) \le \tilde G(t-h)\bigr\}.
\]
By the definition of h^*, we have for all t ∈ [0,1]
\[
\tilde G(t+h^*) \le G(t) \le \tilde G(t-h^*);
\]
applying Proposition A.4 to the second inequality gives, for all t ∈ [0,1], t − h^* ≤ ˜G^{-1}(G(t)). Thus, for t = G^{-1}(β) with β ∈ (0,K), thanks to Proposition A.4 we get
\[
G^{-1}(\beta) - \tilde G^{-1}(\beta) \le h^*.
\]
The inequality ˜G^{-1}(β) − G^{-1}(β) ≤ h^* is obtained in the same way; thus we have proved that |G^{-1}(β) − ˜G^{-1}(β)| ≤ h^*. Finally, notice that for all t ∈ [0,1]
\[
\underbrace{\sum_{k=1}^{K}P_X\bigl(\hat p_k(X) > t + \|\hat p - p\|_{\infty,P_X}\bigr)}_{\tilde G(t+\|\hat p - p\|_{\infty,P_X})} \le \underbrace{\sum_{k=1}^{K}P_X\bigl(p_k(X) > t\bigr)}_{G(t)} \le \underbrace{\sum_{k=1}^{K}P_X\bigl(\hat p_k(X) > t - \|\hat p - p\|_{\infty,P_X}\bigr)}_{\tilde G(t-\|\hat p - p\|_{\infty,P_X})},
\]
where we used the fact that, for all k ∈ [K],
\[
P_X\bigl(\hat p_k(X) > t + |\hat p_k(X)-p_k(X)|\bigr) \le P_X\bigl(p_k(X) > t\bigr) \le P_X\bigl(\hat p_k(X) > t - |\hat p_k(X)-p_k(X)|\bigr),
\]
and P_X(|ˆp_k(X) − p_k(X)| ≤ ‖ˆp − p‖_{∞,P_X}) = 1. Therefore, by the definition of h^*, we can write h^* ≤ ‖ˆp − p‖_{∞,P_X} and we conclude. A quick numerical check of this stability bound is sketched below.
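The toy check below is our illustration (random Dirichlet scores and a coarse grid inversion, both hypothetical choices, so the comparison holds up to the grid resolution): it perturbs p within a sup-norm ball of radius ε and confirms that the two generalized inverses never drift apart by more than ‖ˆp − p‖_∞.

```python
import numpy as np

rng = np.random.default_rng(3)
K, N, beta, eps = 3, 20000, 1.0, 0.02

p = rng.dirichlet(np.ones(K), size=N)              # values p_k(X_i)
p_hat = np.clip(p + rng.uniform(-eps, eps, p.shape), 0.0, 1.0)
sup_dev = np.abs(p_hat - p).max()                  # ||p_hat - p||_inf on the sample

def G_inv(q, beta, grid=np.linspace(0.0, 1.0, 1001)):
    # generalized inverse of G(t) = sum_k P_X(q_k(X) > t), via grid search
    G = np.array([(q > t).sum(axis=1).mean() for t in grid])
    return grid[np.argmax(G <= beta)]

gap = abs(G_inv(p, beta) - G_inv(p_hat, beta))
print(f"|G^-1(beta) - G~^-1(beta)| = {gap:.4f}  <=  {sup_dev:.4f} = sup-norm gap")
```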
We are now in position to prove Theorem 4.2; let us point out that the most difficult part of Theorem 4.2 is the upper bound on the excess risk. The upper bound on the discrepancy follows the same arguments as those used for the excess risk.

Excess risk and discrepancy: to upper-bound the excess risk, we first separate it into two parts as
\[
\mathrm{R}_\beta(\hat\Gamma) - \mathrm{R}_\beta(\Gamma^*_\beta) = \underbrace{\bigl(\mathrm{R}_\beta(\tilde\Gamma) - \mathrm{R}_\beta(\Gamma^*_\beta)\bigr)}_{R_1} + \underbrace{\bigl(\mathrm{R}_\beta(\hat\Gamma) - \mathrm{R}_\beta(\tilde\Gamma)\bigr)}_{R_2}.
\]
Recall that, thanks to Proposition 3.1, we have
\[
R_1 = \sum_{k=1}^{K}\mathbb{E}\Bigl[\bigl|p_k(X) - G^{-1}(\beta)\bigr|\,\mathbf{1}\{k \in \tilde\Gamma(X)\,\triangle\,\Gamma^*_\beta(X)\}\Bigr].
\]
Moreover, if some k ∈ ˜Γ(X) △ Γ*_β(X), then either
\[
\begin{cases}p_k(X) - G^{-1}(\beta) \ge 0,\\ \hat p_k(X) - \tilde G^{-1}(\beta) < 0,\end{cases}
\qquad\text{or}\qquad
\begin{cases}p_k(X) - G^{-1}(\beta) < 0,\\ \hat p_k(X) - \tilde G^{-1}(\beta) \ge 0,\end{cases}
\]
holds. Thus, on the event {k ∈ ˜Γ(X) △ Γ*_β(X)} we have
\[
\bigl|p_k(X) - G^{-1}(\beta)\bigr| \le \bigl|\hat p_k(X) - p_k(X) + G^{-1}(\beta) - \tilde G^{-1}(\beta)\bigr| \le |\hat p_k(X) - p_k(X)| + \bigl|G^{-1}(\beta) - \tilde G^{-1}(\beta)\bigr|.
\]
Therefore, for R₁, using Lemma B.1 and the observations above, we can write
\[
R_1 \le \sum_{k=1}^{K}\mathbb{E}\Bigl[|p_k(X)-G^{-1}(\beta)|\,\mathbf{1}\bigl\{|p_k(X)-G^{-1}(\beta)| \le 2\|\hat p-p\|_{\infty,P_X}\bigr\}\Bigr] \le 2\|\hat p-p\|_{\infty,P_X}\sum_{k=1}^{K}P_X\bigl(|p_k(X)-G^{-1}(\beta)| \le 2\|\hat p-p\|_{\infty,P_X}\bigr);
\]
finally, using the margin Assumption 2.1, we get almost surely w.r.t. the data
\[
R_1 \le c_0\,2^{1+\alpha}\,K\,\|\hat p-p\|^{1+\alpha}_{\infty,P_X}.
\]
Integrating both sides over the data and using Assumption 4.3, we get
\[
\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)} R_1 \le c_0\,2^{1+\alpha}\,C\,K\,\Bigl(\frac{n}{\log n}\Bigr)^{-\frac{(1+\alpha)\gamma}{2\gamma+d}}.
\]
For R₂ the following upper bound holds:
\[
R_2 = \bigl(\mathrm{P}(\hat\Gamma)-\mathrm{P}(\tilde\Gamma)\bigr) + G^{-1}(\beta)\bigl(\mathrm{I}(\hat\Gamma)-\mathrm{I}(\tilde\Gamma)\bigr) = \sum_{k=1}^{K}\mathbb{E}\bigl(p_k(X)-G^{-1}(\beta)\bigr)\bigl(\mathbf{1}\{k\in\tilde\Gamma(X)\}-\mathbf{1}\{k\in\hat\Gamma(X)\}\bigr) \le \underbrace{\sum_{k=1}^{K}\mathbb{E}\bigl|\mathbf{1}\{k\in\tilde\Gamma(X)\}-\mathbf{1}\{k\in\hat\Gamma(X)\}\bigr|}_{\mathbb{E}|\tilde\Gamma(X)\triangle\hat\Gamma(X)|} \tag{B.2}
\]
\[
= \sum_{k=1}^{K}\mathbb{E}\bigl|\mathbf{1}\{\hat p_k(X) \ge \hat G^{-1}(\beta)\}-\mathbf{1}\{\hat p_k(X) \ge \tilde G^{-1}(\beta)\}\bigr|;
\]
now, thanks to the first property of Proposition A.4, we can write
\[
R_2 \le \sum_{k=1}^{K}\mathbb{E}\bigl|\mathbf{1}\{\hat G(\hat p_k(X)) \le \beta\}-\mathbf{1}\{\tilde G(\hat p_k(X)) \le \beta\}\bigr| \le \sum_{k=1}^{K}P_X\Bigl(\bigl|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))\bigr| \ge \bigl|\tilde G(\hat p_k(X))-\beta\bigr|\Bigr).
\]
To finish the proof we make use of the peeling technique of (Audibert and Tsybakov, 2007, Lemma 3.1). That is, we define for δ > 0 and k ∈ [K]
\[
A_k^0 = \bigl\{|\tilde G(\hat p_k(X))-\beta| \le \delta\bigr\}, \qquad A_k^j = \bigl\{2^{j-1}\delta < |\tilde G(\hat p_k(X))-\beta| \le 2^j\delta\bigr\}, \quad j \ge 1.
\]
Since, for every k ∈ [K], the events (A_k^j)_{j≥0} are mutually exclusive, we deduce
\[
\sum_{k=1}^{K}P_X\Bigl(|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))| \ge |\tilde G(\hat p_k(X))-\beta|\Bigr) = \sum_{k=1}^{K}\sum_{j\ge 0}P_X\Bigl(|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))| \ge |\tilde G(\hat p_k(X))-\beta|,\ A_k^j\Bigr). \tag{B.3}
\]
Now, consider ε uniformly distributed on [K], independent of the data and of X. Conditionally on the data and under Assumption 4.4, we apply Lemma A.5 with Z_k = ˆp_k(X) and Z = Σ_{k=1}^K Z_k 1{ε = k}, and obtain that ˜G(Z) is uniformly distributed on [0, K]. Therefore, for all j ≥ 0 and δ > 0, we deduce
\[
\frac{1}{K}\sum_{k=1}^{K}P_X\bigl(|\tilde G(\hat p_k(X))-\beta| \le 2^j\delta\bigr) = P_X\bigl(|\tilde G(Z)-\beta| \le 2^j\delta\bigr) \le \frac{2^{j+1}\delta}{K}.
\]
Hence, for all j ≥ 0, we obtain
\[
\sum_{k=1}^{K}P_X(A_k^j) \le \sum_{k=1}^{K}P_X\bigl(|\tilde G(\hat p_k(X))-\beta| \le 2^j\delta\bigr) \le 2^{j+1}\delta. \tag{B.4}
\]
Next, we observe that for all j ≥ 1
\[
\sum_{k=1}^{K}P_X\bigl(|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))| \ge |\tilde G(\hat p_k(X))-\beta|,\ A_k^j\bigr) \le \sum_{k=1}^{K}P_X\bigl(|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))| \ge 2^{j-1}\delta,\ A_k^j\bigr). \tag{B.5}
\]
Thus, treating the events A_k^0 via Eq. (B.4), we obtain almost surely w.r.t. the data
\[
R_2 \le 2\delta + \sum_{k=1}^{K}\sum_{j\ge 1}P_X\bigl(|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))| \ge 2^{j-1}\delta,\ A_k^j\bigr).
\]
Integrating both sides with respect to the data, and recalling that the indicator 1{A_k^j} is, for all j ≥ 0 and k ∈ [K], independent of D_{⌈n/2⌉} and D_N, we can write
\[
\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N,X\sim P_X)}\Bigl[\mathbf{1}_{\{|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))|\ge 2^{j-1}\delta\}}\mathbf{1}_{\{A_k^j\}}\Bigr] = \mathbb{E}_{(\mathcal{D}_{\lfloor n/2\rfloor},X\sim P_X)}\Bigl[\mathbb{E}_{(\mathcal{D}_{\lceil n/2\rceil},\mathcal{D}_N)}\bigl[\mathbf{1}_{\{|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))|\ge 2^{j-1}\delta\}}\ \big|\ \mathcal{D}_{\lfloor n/2\rfloor},X\bigr]\,\mathbf{1}_{\{A_k^j\}}\Bigr].
\]
Now, since conditionally on (D_{⌊n/2⌋}, X) the quantity ˆG(ˆp_k(X)) is an empirical mean of i.i.d. random variables with common mean ˜G(ˆp_k(X)) and values in [0, K], we deduce from Hoeffding's inequality (Lemma A.3) that
\[
\mathbb{E}_{(\mathcal{D}_{\lceil n/2\rceil},\mathcal{D}_N)}\bigl[\mathbf{1}_{\{|\hat G(\hat p_k(X))-\tilde G(\hat p_k(X))|\ge 2^{j-1}\delta\}}\ \big|\ \mathcal{D}_{\lfloor n/2\rfloor},X\bigr] \le 2\exp\Bigl(-\frac{2(N+\lceil n/2\rceil)\,\delta^2\,4^{j-1}}{K^2}\Bigr).
\]
Therefore, combining the inequalities of Eqs. (B.3), (B.4) and (B.5), we get
\[
\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)} R_2 \le 2\delta + 2\delta\sum_{j\ge 1}2^{j+2}\exp\Bigl(-\frac{2(N+\lceil n/2\rceil)\,\delta^2\,4^{j-1}}{K^2}\Bigr).
\]
Finally, choosing δ = K/√(N+⌈n/2⌉) in the above inequality finishes the proof.

Hamming risk: here we provide an upper bound on the Hamming risk. First, by the triangle inequality, we can write for the proposed estimator ˆΓ and the pseudo-oracle ˜Γ
\[
\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\mathbb{E}_{X\sim P_X}\bigl|\hat\Gamma(X)\,\triangle\,\Gamma^*_\beta(X)\bigr| \le \mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\mathbb{E}_{X\sim P_X}\bigl|\tilde\Gamma(X)\,\triangle\,\Gamma^*_\beta(X)\bigr| + \mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\mathbb{E}_{X\sim P_X}\bigl|\hat\Gamma(X)\,\triangle\,\tilde\Gamma(X)\bigr|.
\]
Notice that for the term E|ˆΓ(X) △ ˜Γ(X)| we can re-use the proof technique applied to the term R₂ in Eq. (B.2). Thus, it remains to upper-bound the term E|˜Γ(X) △ Γ*_β(X)|. This part of the proof closely follows the machinery used in Denis and Hebiri (2017); let us mention, however, that they used this method to obtain a bound on the discrepancy, which leads to a sub-optimal rate, whereas their approach gives the correct rate if, instead of the discrepancy, we bound the Hamming distance.
For the sake of completeness we write the principal parts of the proof here. First of all, by the definition of the sets Γ*_β and ˜Γ, we can write for (∗) = E_{X∼P_X}|˜Γ(X) △ Γ*_β(X)| that
\[
(*) = \sum_{k=1}^{K}\mathbb{E}_{X\sim P_X}\bigl|\mathbf{1}\{\hat p_k(X) \ge \tilde G^{-1}(\beta)\}-\mathbf{1}\{p_k(X) \ge G^{-1}(\beta)\}\bigr|.
\]
Now, if ˆp_k(X) ≥ ˜G^{-1}(β) and p_k(X) < G^{-1}(β), we can have the following situations:
• if ˜G^{-1}(β) > G^{-1}(β), then |p_k(X) − G^{-1}(β)| ≤ |ˆp_k(X) − p_k(X)|;
• if ˜G^{-1}(β) ≤ G^{-1}(β), then either |p_k(X) − G^{-1}(β)| ≤ |ˆp_k(X) − p_k(X)| or ˆp_k(X) ∈ (˜G^{-1}(β), G^{-1}(β)).
Similar conditions are satisfied if ˆp_k(X) < ˜G^{-1}(β) and p_k(X) ≥ G^{-1}(β). Using the above arguments we can upper-bound (∗) as
\[
(*) \le \sum_{k=1}^{K}P_X\bigl(|p_k(X)-G^{-1}(\beta)| \le |\hat p_k(X)-p_k(X)|\bigr) + \mathbf{1}\{\tilde G^{-1}(\beta) \le G^{-1}(\beta)\}\sum_{k=1}^{K}P_X\bigl(\tilde G^{-1}(\beta) < \hat p_k(X) < G^{-1}(\beta)\bigr) + \mathbf{1}\{G^{-1}(\beta) < \tilde G^{-1}(\beta)\}\sum_{k=1}^{K}P_X\bigl(G^{-1}(\beta) < \hat p_k(X) < \tilde G^{-1}(\beta)\bigr)
\]
\[
= \sum_{k=1}^{K}P_X\bigl(|p_k(X)-G^{-1}(\beta)| \le |\hat p_k(X)-p_k(X)|\bigr) + \bigl|\tilde G\bigl(\tilde G^{-1}(\beta)\bigr)-\tilde G\bigl(G^{-1}(\beta)\bigr)\bigr|.
\]
Thanks to the continuity Assumption 4.4 on the estimator and the continuity Assumption 1.1 on the distribution, we clearly have ˜G(˜G^{-1}(β)) = β = G(G^{-1}(β)). Moreover, we can write
\[
\bigl|\tilde G\bigl(\tilde G^{-1}(\beta)\bigr)-\tilde G\bigl(G^{-1}(\beta)\bigr)\bigr| = \bigl|G\bigl(G^{-1}(\beta)\bigr)-\tilde G\bigl(G^{-1}(\beta)\bigr)\bigr| \le \sum_{k=1}^{K}\mathbb{E}_{X\sim P_X}\bigl|\mathbf{1}\{\hat p_k(X) \ge G^{-1}(\beta)\}-\mathbf{1}\{p_k(X) \ge G^{-1}(\beta)\}\bigr| \le \sum_{k=1}^{K}P_X\bigl(|p_k(X)-G^{-1}(\beta)| \le |\hat p_k(X)-p_k(X)|\bigr).
\]
Thus, our bound reads
\[
(*) \le 2\sum_{k=1}^{K}P_X\bigl(|p_k(X)-G^{-1}(\beta)| \le |\hat p_k(X)-p_k(X)|\bigr).
\]
Finally, in order to upper-bound the term above, one can use the peeling argument of Audibert and Tsybakov (2007) applied with the exponential concentration inequality provided by Assumption 4.3. We omit this part of the proof here and refer the reader to Denis and Hebiri (2017) or to Audibert and Tsybakov (2007) for the complete argument. Let us emphasize that the argument above is only possible thanks to the continuity Assumptions 1.1 and 4.4 on the distribution and the estimator, respectively.
Appendix C: Proof of the lower bounds
This section is devoted to the proofs of the lower bounds provided by Theorems 3.4-3.5. Before proceeding to the proofs, let us briefly sketch the high-level strategy used in this work. In order to prove the lower bounds of Theorems 3.4-3.5, we actually prove two separate lower bounds on the minimax risk: clearly, if some non-negative quantity is lower-bounded by two different values, then it is lower-bounded by the maximum of the two. The two lower bounds that we prove are naturally connected with the proposed two-step estimator: the first is connected with the problem of non-parametric estimation of p_k for all k ∈ [K], while the second describes the estimation of the unknown threshold G^{-1}(β). In particular, the first lower bound is closely related to those provided in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009), though, crucially, the continuity Assumption 1.1 makes the proof more involved. The second lower bound is based on testing two hypotheses and is derived by constructing two different marginal distributions of X ∈ R^d for a fixed regression vector p(·). In this part we make use of Pinsker's inequality, recalled in Lemma A.2.

In order to discriminate between supervised and semi-supervised procedures, we make use of Definition 1.4. Notice that, thanks to Definition 1.4, every supervised procedure is not 'sensitive' to the expectation taken w.r.t. the unlabeled dataset D_N; that is, randomness is only induced by the labeled dataset D_n. This observation allows us to eliminate the dependence of the lower bound on the size of the unlabeled dataset D_N for supervised procedures. Indeed, let ˆΓ be any supervised estimator in the sense of Definition 1.4; then for any real-valued function Z of confidence sets we have
\[
\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\bigl[\mathbb{E}_{P_X}Z(\hat\Gamma(X;\mathcal{D}_n,\mathcal{D}_N))\bigr] = \mathbb{E}_{\mathcal{D}_n}\bigl[\mathbb{E}_{P_X}Z(\hat\Gamma(X;\mathcal{D}_n,\mathcal{D}'_N))\bigr],
\]
with D′_N being an arbitrary set of N points in R^d.

C.1. Part I: (N + n)^{-1/2}

Here we prove that the rate (N+n)^{-1/2} is optimal for semi-supervised methods; as already mentioned, the rate for supervised methods is obtained by formally setting N = 0. The constants C′, C, c are always assumed to be independent of N, n and can differ from line to line. Let us fix β ∈ {1, ..., ⌊K/2⌋ − 1} and K ≥ 5. For a positive constant C, taken small enough for the final bounds below, we set
\[
\kappa_{N,n} = C\,(N+n)^{-1/2}.
\]
To prove the lower bound, we construct two distributions P₀ and P₁ on R^d sharing the same regression function p(·) = (p₁(·), ..., p_K(·))^⊤ and with different marginals admitting densities µ₀, µ₁. First, for a fixed parameter ρ > 0 and radii 0 < r₀ < r₁ < r₂ < r₃ < r₄ to be specified, we define the sets
\[
\mathcal{X}_0 = \{x \in \mathbb{R}^d : \|x\| \le r_0\}, \qquad \mathcal{X}_i = \{x \in \mathbb{R}^d : \|x - o_i\| \le \rho/2\},\ i = 1,2,3, \qquad \mathcal{X}_4 = \{x \in \mathbb{R}^d : r_3 \le \|x\| \le r_4\},
\]
where o_i = (r_i + ρ, 0, ..., 0)^⊤ ∈ R^d for i = 1, 2, 3.

[Fig 1: Bump function x ↦ ψ_{a,b}(x); this function is supported on (a, b) and is infinitely smooth.]

Using these sets we define the regression vector as
\[
p_1(x) = \dots = p_\beta(x) = \begin{cases}\dfrac{1}{\beta}-\dfrac{\varphi_0(x)}{2\beta}, & x \in \mathcal{X}_0,\\[4pt] \dfrac{K+2\beta}{2K\beta}-\dfrac{\varphi_1(x)}{2\beta}, & x \in \mathcal{X}_1,\\[4pt] \dfrac{K+4\beta}{2K\beta}-\dfrac{\varphi_2(x)}{2\beta}, & x \in \mathcal{X}_2,\\[4pt] \dfrac{K+6\beta}{2K\beta}-\dfrac{\varphi_3(x)}{2\beta}, & x \in \mathcal{X}_3,\\[4pt] \dfrac{1}{K}-\dfrac{\varphi_4(x)}{2\beta}, & x \in \mathcal{X}_4,\end{cases}
\qquad
p_{\beta+1}(x) = \dots = p_K(x) = \frac{1-\beta\,p_1(x)}{K-\beta},
\]
so that the components always sum to one. In order to define the functions ϕ_i for i = 0, ..., 4, we first set, for a < b,
\[
\psi_{a,b}(x) = \begin{cases}\exp\Bigl(-\dfrac{1}{(b-x)(x-a)}\Bigr), & x \in (a,b),\\ 0, & \text{otherwise}.\end{cases}
\]
Figure 1 illustrates the behavior of ψ_{a,b} in one dimension; note that for every a, b ∈ R this function is infinitely smooth. Using ψ_{a,b}, we define the functions ϕ_i for i = 0, ..., 4 as
\[
\varphi_0(x) = C'\Bigl(\tfrac{K-\beta}{2K\beta}\wedge\tfrac{1}{2K}\Bigr)\psi_{-r_0,r_0}(\|x\|), \qquad \varphi_i(x) = C'\rho^{\gamma}\Bigl(\tfrac{K-\beta}{2K\beta}\wedge\tfrac{1}{2K}\Bigr)\Bigl(\tfrac{\|x-o_i\|}{\rho}\Bigr)^{2\lceil\gamma\rceil}\psi_{-\frac12,\frac12}\Bigl(\tfrac{\|x-o_i\|}{\rho}\Bigr),\ i = 1,3,
\]
\[
\varphi_2(x) = C'\rho^{\gamma}\Bigl(\tfrac{K-\beta}{2K\beta}\wedge\tfrac{1}{2K}\Bigr)\psi_{-\frac12,\frac12}\Bigl(\tfrac{\|x-o_2\|}{\rho}\Bigr), \qquad \varphi_4(x) = C'\Bigl(\tfrac{K-\beta}{2K\beta}\wedge\tfrac{1}{2K}\Bigr)\psi_{r_3,r_4}(\|x\|),
\]
where the constant C′ ≤ 1 is chosen so that the functions ϕ_i, i = 0, ..., 4, are (γ, L)-Hölder. Let us point out that such a value of C′ exists and is independent of n, N: indeed, the mapping x ↦ C′‖x‖^{2⌈γ⌉}ψ_{−1/2,1/2}(‖x‖) is infinitely smooth, and thus (γ, L)-Hölder for a properly chosen C′. Figure 2 demonstrates the behavior of the considered construction in one dimension.

[Fig 2: Dumped bump function x ↦ (x − (a+b)/2)^{2⌈γ⌉} ψ_{a,b}(x); this function behaves as a polynomial of even degree 2⌈γ⌉ in the vicinity of (a+b)/2, while being infinitely smooth and supported on (a, b). If we select a measure which is supported in the vicinity of (a+b)/2, the function on the plot is essentially polynomial w.r.t. such a measure.]

Since β < K/2, the constant levels of the last K − β components are all below the constant levels of the first β components, and the latter are ordered as
\[
\frac{1}{K} < \frac{K+2\beta}{2K\beta} < \frac{K+4\beta}{2K\beta} < \frac{K+6\beta}{2K\beta} < \frac{1}{\beta},
\]
which will help us to ensure that the thresholds under P₀ and P₁ are (K+2β)/(2Kβ) and (K+6β)/(2Kβ), respectively. Now, we define the two marginal distributions µ₀, µ₁ by their densities as
\[
\mu_0(x) = \begin{cases}\dfrac{1/2}{\mathrm{Leb}(\mathcal{X}_0)}, & x \in \mathcal{X}_0,\\[2pt] \dfrac{\kappa_{N,n}}{\mathrm{Leb}(\mathcal{X}_i)}, & x \in \mathcal{X}_i,\ i = 1,2,3,\\[2pt] \dfrac{1/2-3\kappa_{N,n}}{\mathrm{Leb}(\mathcal{X}_4)}, & x \in \mathcal{X}_4,\end{cases}
\qquad
\mu_1(x) = \begin{cases}\dfrac{1/2-3\kappa_{N,n}}{\mathrm{Leb}(\mathcal{X}_0)}, & x \in \mathcal{X}_0,\\[2pt] \dfrac{\kappa_{N,n}}{\mathrm{Leb}(\mathcal{X}_i)}, & x \in \mathcal{X}_i,\ i = 1,2,3,\\[2pt] \dfrac{1/2}{\mathrm{Leb}(\mathcal{X}_4)}, & x \in \mathcal{X}_4,\end{cases}
\]
and both µ₀, µ₁ are equal to zero in the unspecified regions. Clearly, the strong density assumption is satisfied on X₀ and X₄, since there the density is lower- and upper-bounded by constants independent of both N and n. The parameter ρ is chosen such that the strong density assumption also holds on X_i for i = 1, 2, 3; since Leb(X_i) = cρ^d for some constant c > 0 independent of N, n, we set ρ = C(N+n)^{-1/(2d)}. For these hypotheses one can check that the thresholds G₀^{-1}(β), G₁^{-1}(β) and the optimal β-sets Γ*₀, Γ*₁ are given as
\[
G_0^{-1}(\beta) = \frac{K+2\beta}{2K\beta}, \qquad G_1^{-1}(\beta) = \frac{K+6\beta}{2K\beta},
\]
\[
\Gamma^*_0(x) = \begin{cases}\{1,\dots,\beta\}, & x \in \mathcal{X}_0,\\ \emptyset, & \text{otherwise},\end{cases}
\qquad
\Gamma^*_1(x) = \begin{cases}\{1,\dots,\beta\}, & x \in \mathcal{X}_0\cup\mathcal{X}_2\cup\mathcal{X}_3,\\ \emptyset, & \text{otherwise}.\end{cases}
\]
The margin assumption: we are in position to check the margin Assumption 2.1. Let t₀ = ((K−β)/(2Kβ)) ∧ (1/(2K)); then for every k ∈ {β+1, ..., K} and every t ≤ t₀ we have
\[
P_0\bigl(|p_k(X)-G_0^{-1}(\beta)| \le t\bigr) = 0, \qquad P_1\bigl(|p_k(X)-G_1^{-1}(\beta)| \le t\bigr) = 0;
\]
moreover, for every k ∈ {1, ..., β} and every t ≤ t₀ we can write
\[
P_0\bigl(|p_k(X)-G_0^{-1}(\beta)| \le t\bigr) = P_0\Bigl(C'\rho^{\gamma}\Bigl(\tfrac{K-\beta}{2K\beta}\wedge\tfrac{1}{2K}\Bigr)\Bigl(\tfrac{\|X-o_1\|}{\rho}\Bigr)^{2\lceil\gamma\rceil}\psi_{-\frac12,\frac12}\Bigl(\tfrac{\|X-o_1\|}{\rho}\Bigr) \le 2\beta t,\ X \in \mathcal{X}_1\Bigr),
\]
and the analogous identity holds under P₁ with o₃ and X₃. Hence, for the hypothesis P₀, there exists c independent of N, n such that
\[
P_0\bigl(|p_k(X)-G_0^{-1}(\beta)| \le t\bigr) \le P_0\Bigl(\Bigl(\tfrac{\|X-o_1\|}{\rho}\Bigr)^{2\lceil\gamma\rceil}\psi_{-\frac12,\frac12}\Bigl(\tfrac{\|X-o_1\|}{\rho}\Bigr) \le c\,\rho^{-\gamma}t,\ X \in \mathcal{X}_1\Bigr).
\]
Therefore, using the strong density assumption and the change of variables x ↦ x/ρ, we can write
\[
P_0\bigl(|p_k(X)-G_0^{-1}(\beta)| \le t\bigr) \le C\rho^{d}\int_{\|x\|\le 1/2}\mathbf{1}\Bigl\{\|x\|^{2\lceil\gamma\rceil}\psi_{-\frac12,\frac12}(\|x\|) \le c\,\rho^{-\gamma}t\Bigr\}\,dx.
\]
Finally, notice that there exists C₁ > 0 such that, for every x ∈ R^d with ‖x‖ ≤ 1/4, we have ψ_{−1/2,1/2}(‖x‖) ≥ ψ_{−1/2,1/2}(1/4) ≥ C₁,
which implies that, for some positive constants C, C′ independent of N, n, we can write
\[
P_0\bigl(|p_k(X)-G_0^{-1}(\beta)| \le t\bigr) \le C\rho^{d}\int_{\|x\|\le 1/2}\mathbf{1}\bigl\{\|x\|^{2\lceil\gamma\rceil} \le C'\rho^{-\gamma}t\bigr\}\,dx = C\rho^{d}\int_{\|x\|\le 1/2}\mathbf{1}\bigl\{\|x\| \le C'^{1/(2\lceil\gamma\rceil)}\rho^{-\gamma/(2\lceil\gamma\rceil)}t^{1/(2\lceil\gamma\rceil)}\bigr\}\,dx \le C\rho^{d(1-\gamma/(2\lceil\gamma\rceil))}\,t^{d/(2\lceil\gamma\rceil)}.
\]
This implies that, as long as α ≤ d/(2⌈γ⌉) (and since γ ≤ ⌈γ⌉), the margin assumption is satisfied; moreover, these conditions imply αγ ≤ d, which we will also require while proving the supervised part of the rate. The same reasoning can be carried out for the first hypothesis P₁ on the set X₃. Finally, the parameters r₀, r₁, r₂, r₃, r₄ are chosen as constants, independent of n, N, such that there exists a smooth connection between the parts of the regression functions p_k(·) defined on X₀, X₁, X₂, X₃, X₄. Notice that such a choice is possible since, by the construction of the functions ϕ_i for i = 0, ..., 4, the regression function is constant near the boundary of each of X₀, ..., X₄; thus, in the region R^d \ ∪_{i=0}^4 X_i it suffices to connect four different constants smoothly. We avoid this complication here and hope that the guidelines provided above are sufficient for the understanding. Notice also that the constructed distributions satisfy Assumption 1.1, since the measures are supported on X₀, ..., X₄ and the regression functions on these sets are not concentrated around any constant.

Before proceeding to the final stage of the proof, let us mention that in what follows we use the de Finetti notation (de Finetti, 1972, 1974), which is common in probability: given a probability measure P on some measurable space (Ω, A) and a measurable function X : (Ω, A) → (R, Borel(R)), we write P[X] := E[X].

Bound on the KL-divergence: we start by computing the KL-divergence between µ₀ and µ₁. Since µ₀ = µ₁ on X₁, X₂, X₃, only X₀ and X₄ contribute:
\[
\mathrm{KL}(\mu_0,\mu_1) = \int_{\mathbb{R}^d}\mu_0(x)\log\Bigl(\frac{\mu_0}{\mu_1}\Bigr)dx = \frac{1}{2}\log\Bigl(\frac{1/2}{1/2-3\kappa_{N,n}}\Bigr) + \Bigl(\frac{1}{2}-3\kappa_{N,n}\Bigr)\log\Bigl(\frac{1/2-3\kappa_{N,n}}{1/2}\Bigr) = -3\kappa_{N,n}\log(1-6\kappa_{N,n}) \le 36\,\kappa^2_{N,n},
\]
where the last inequality holds for κ_{N,n} ≤ 1/12.

Lower bound for the Hamming risk: first, let us introduce the following notation for i = 0, 1:
\[
\mathrm{H}(\hat\Gamma,\Gamma^*_i) := \mu_i\bigl|\hat\Gamma(X)\,\triangle\,\Gamma^*_i(X)\bigr|.
\]
Recall that we are interested in the quantity
\[
\inf_{\hat\Gamma}\sup_{P\in\mathcal{P}}\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\mathbb{E}_{X\sim P_X}\bigl|\hat\Gamma(X)\,\triangle\,\Gamma^*_\beta(X)\bigr|;
\]
since the hypotheses P₀, P₁ ∈ P, we can write
\[
2\sup_{P\in\mathcal{P}}\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\mathbb{E}_{X\sim P_X}\bigl|\hat\Gamma(X)\,\triangle\,\Gamma^*_\beta(X)\bigr| \ge (*), \qquad (*) = \mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\,\mathrm{H}(\hat\Gamma,\Gamma^*_0) + \mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\,\mathrm{H}(\hat\Gamma,\Gamma^*_1);
\]
thus, for the Hamming risk we can write
\[
(*) \ge \mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\Biggl(\frac{d\mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}}{d\mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}}\wedge 1\Biggr)\bigl(\mathrm{H}(\hat\Gamma,\Gamma^*_0)+\mathrm{H}(\hat\Gamma,\Gamma^*_1)\bigr).
\]
We now focus on the sum of the two Hamming differences appearing on the right-hand side. By the triangle inequality and since the two oracles disagree exactly on the first β labels over X₂ ∪ X₃, where µ₀ = µ₁,
\[
\mathrm{H}(\hat\Gamma,\Gamma^*_0)+\mathrm{H}(\hat\Gamma,\Gamma^*_1) \ge \mu_0\Bigl(\frac{d\mu_1}{d\mu_0}\wedge 1\Bigr)\sum_{k=1}^{K}\mathbf{1}\{k \in \Gamma^*_0(X)\,\triangle\,\Gamma^*_1(X)\} = 2\beta\,\mu_0\bigl(\mathbf{1}_{\mathcal{X}_2}+\mathbf{1}_{\mathcal{X}_3}\bigr) = 2\beta\,P_0(\mathcal{X}_2\cup\mathcal{X}_3) = 4\beta\kappa_{n,N}.
\]
Substituting this lower bound into the initial inequality, we arrive at
\[
(*) \ge 4\beta\kappa_{n,N}\Bigl(1-\mathrm{TV}\bigl(\mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n},\ \mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\bigr)\Bigr) = 4\beta\kappa_{n,N}\Bigl(1-\mathrm{TV}\bigl(\mu_0^{\otimes(n+N)},\mu_1^{\otimes(n+N)}\bigr)\Bigr) \ge 4\beta\kappa_{n,N}\Biggl(1-\sqrt{\tfrac{1}{2}\mathrm{KL}\bigl(\mu_0^{\otimes(n+N)},\mu_1^{\otimes(n+N)}\bigr)}\Biggr) \ge 4\beta\kappa_{n,N}\bigl(1-\sqrt{18}\,\kappa_{n,N}\sqrt{n+N}\bigr),
\]
by Pinsker's inequality, which implies the desired lower bound on the Hamming risk for C small enough.

Lower bound for the β excess risk: this part is analogous to the case of the Hamming distance. Recall that for every ˆΓ we have, for i = 0, 1,
\[
\mathrm{D}(\hat\Gamma,\Gamma^*_i) := \mathrm{R}_\beta(\hat\Gamma)-\mathrm{R}_\beta(\Gamma^*_i) = \mu_i\sum_{k=1}^{K}\bigl|p_k(X)-G_i^{-1}(\beta)\bigr|\,\mathbf{1}\{k\in\hat\Gamma(X)\,\triangle\,\Gamma^*_i(X)\}.
\]
Again, since the hypotheses P₀, P₁ ∈ P, we can write
\[
2\sup_{P\in\mathcal{P}}\mathbb{E}_{(\mathcal{D}_n,\mathcal{D}_N)}\bigl[\mathrm{R}_\beta(\hat\Gamma)\bigr]-\mathrm{R}_\beta(\Gamma^*_\beta) \ge (**), \qquad (**) = \mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\,\mathrm{D}(\hat\Gamma,\Gamma^*_0)+\mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\,\mathrm{D}(\hat\Gamma,\Gamma^*_1),
\]
and, as before,
\[
(**) \ge \mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\Biggl(\frac{d\mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}}{d\mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}}\wedge 1\Biggr)\bigl(\mathrm{D}(\hat\Gamma,\Gamma^*_0)+\mathrm{D}(\hat\Gamma,\Gamma^*_1)\bigr).
\]
Restricting both terms to X₂, where µ₀ = µ₁ and where the two oracles disagree on the first β labels, and using the triangle inequality together with the fact that |p_k(x) − (K+2β)/(2Kβ)| ∧ |p_k(x) − (K+6β)/(2Kβ)| ≥ (K−β)/(16Kβ) for all x ∈ X₂ and k ≤ β (the perturbation ϕ₂ being bounded accordingly), we obtain
\[
\mathrm{D}(\hat\Gamma,\Gamma^*_0)+\mathrm{D}(\hat\Gamma,\Gamma^*_1) \ge 2\beta\,\frac{K-\beta}{16K\beta}\,\mu_0(\mathcal{X}_2) = \frac{K-\beta}{8K}\,\kappa_{n,N}.
\]
Thus,
\[
(**) \ge \frac{K-\beta}{8K}\,\kappa_{n,N}\Bigl(1-\mathrm{TV}\bigl(\mu_0^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n},\ \mu_1^{\otimes(n+N)}\otimes P_{Y|X}^{\otimes n}\bigr)\Bigr) \ge \frac{K-\beta}{8K}\,\kappa_{n,N}\Biggl(1-\sqrt{\tfrac{1}{2}\mathrm{KL}\bigl(\mu_0^{\otimes(n+N)},\mu_1^{\otimes(n+N)}\bigr)}\Biggr) \ge \frac{K-\beta}{8K}\,\kappa_{n,N}\bigl(1-\sqrt{18}\,\kappa_{n,N}\sqrt{n+N}\bigr),
\]
which concludes the first part of the lower bounds. The arithmetic behind the choice of κ_{N,n} is recapped below.
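For completeness, the arithmetic that turns the KL bound above into the (N+n)^{-1/2} rate can be summarized as follows (a recap with the constants as reconstructed above; only the order in κ_{N,n} matters):
\[
\mathrm{KL}\bigl(\mu_0^{\otimes(n+N)},\mu_1^{\otimes(n+N)}\bigr) = (n+N)\,\mathrm{KL}(\mu_0,\mu_1) \le 36\,(n+N)\,\kappa_{N,n}^2,
\]
so that, by Pinsker's inequality (Lemma A.2),
\[
\mathrm{TV}\bigl(\mu_0^{\otimes(n+N)},\mu_1^{\otimes(n+N)}\bigr) \le \sqrt{18\,(n+N)}\,\kappa_{N,n} = \sqrt{18}\,C.
\]
Hence, with κ_{N,n} = C(N+n)^{-1/2} and C < 1/√18, the factors 1 − TV in both two-point bounds stay bounded away from zero, and both lower bounds scale as κ_{N,n} ≍ (N+n)^{-1/2}.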
C.2. Part II: n^{−αγ/(2γ+d)}

In this section we prove that, in the case of the Hamming risk Ψ^H, the rate n^{−αγ/(2γ+d)} is minimax optimal. Notice that, thanks to Proposition 3.1, a lower bound of order n^{−αγ/(2γ+d)} on the Hamming risk Ψ^H immediately implies a lower bound of order n^{−(α+1)γ/(2γ+d)} on both Ψ^E and Ψ^D. The proof is based on a reduction of the Hamming risk to a multiple hypothesis testing problem and an application of Fano's inequality in the form provided by Birgé (2005), recalled in Lemma A.1.

[Fig 3: Integrated bump x ↦ ∫_x^∞ ψ_{a,b}(t) dt / ∫_a^b ψ_{a,b}(t) dt; this function is infinitely smooth and is equal to one or zero only outside of the interval (a, b).]

Assume that K ≥ 4 and β ∈ {2, ..., (K−2) ∧ ⌊K/2⌋}, and define the regular grid on [0,1]^d as
\[
G_q := \Bigl\{\Bigl(\tfrac{2k_1+1}{2q},\dots,\tfrac{2k_d+1}{2q}\Bigr)^{\!\top} : k_i \in \{0,\dots,q-1\},\ i = 1,\dots,d\Bigr\},
\]
and denote by n_q(x) ∈ G_q the closest point of the grid G_q to the point x ∈ R^d. Such a grid defines a partition of the unit cube [0,1]^d ⊂ R^d, denoted by X′₁, ..., X′_{q^d}. Besides, denote X′_{−j} := {x ∈ R^d : −x ∈ X′_j} for all j = 1, ..., q^d. For a fixed integer m ≤ q^d and any j ∈ {1, ..., m}, define X_j := X′_j and X_{−j} := X′_{−j}. Additionally, we introduce the set X₀ = B(0, (4q)^{−1}). For every w ∈ W := {−1, 1}^m we build a distribution P_w ∈ P_W such that the marginal distribution P_{w,X} is independent of w ∈ {−1, 1}^m and the regression vector (p^w_1(x), ..., p^w_K(x)) is constructed as
\[
p^w_1(x) = \dots = p^w_{\beta-1}(x) = v + c' + \frac{g(x)}{\beta-1}, \qquad p^w_{\beta+2}(x) = \dots = p^w_K(x) = v - \frac{(\beta-1)c'}{K-\beta-1} - \frac{g(x)}{K-\beta-1},
\]
\[
p^w_\beta(x) = \begin{cases}v+\phi(x), & x \in \mathcal{X}_0,\\ v+w_i\,\varphi(x-n_q(x)), & x \in \mathcal{X}_i,\\ v-w_i\,\varphi(x-n_q(x)), & x \in \mathcal{X}_{-i},\\ v, & x \in B(0,\sqrt d)\setminus\bigl(\bigcup_{i=-m,\,i\neq 0}^{m}\mathcal{X}_i\bigr),\\ v+g(x), & x \in \mathbb{R}^d\setminus B(0,\sqrt d+\rho),\\ v+\xi(x), & x \in B(0,\sqrt d+\rho)\setminus B(0,\sqrt d),\end{cases}
\qquad
p^w_{\beta+1}(x) = \begin{cases}v-\phi(x), & x \in \mathcal{X}_0,\\ v-w_i\,\varphi(x-n_q(x)), & x \in \mathcal{X}_i,\\ v+w_i\,\varphi(x-n_q(x)), & x \in \mathcal{X}_{-i},\\ v, & x \in B(0,\sqrt d)\setminus\bigl(\bigcup_{i=-m,\,i\neq 0}^{m}\mathcal{X}_i\bigr),\\ v-g(x), & x \in \mathbb{R}^d\setminus B(0,\sqrt d+\rho),\\ v-\xi(x), & x \in B(0,\sqrt d+\rho)\setminus B(0,\sqrt d),\end{cases}
\]
where v ∈ [0,1] and the functions g : R^d → R and φ, ϕ, ξ : R^d → R₊ are specified below. The constants v, c′ are set as
\[
v = \frac{1}{K}, \qquad c' = \frac{K-\beta-1}{(\beta-1)K^2}.
\]
The function ξ is constructed as
\[
\xi(x) = v\,\bar u\Bigl(\frac{\|x\|-\sqrt d}{\rho}\Bigr), \qquad \bar u(x) = 1 - \frac{\int_x^{\infty}\psi_{0,1}(t)\,dt}{\int_0^1\psi_{0,1}(t)\,dt};
\]
the function ū is infinitely many times differentiable, equal to zero on (−∞, 0] and to one on [1, +∞). Figure 3 shows the behavior of 1 − ū. Taking the constant ρ > 0 independent of N, n, we can ensure that the function ξ is (γ, L)-Hölder. The function φ is constructed similarly to the previous part of the proof; that is, we choose
\[
\phi(x) = C_\phi\,(2q)^{-\gamma}\bigl(2q\|x\|\bigr)^{2\lceil\gamma\rceil}\psi_{-\frac12,\frac12}\bigl(2q\|x\|\bigr),
\]
with C_φ sufficiently small so that φ(·) is (γ, L)-Hölder and upper-bounded by c′/2 ∧ v/4. For the function ϕ we consider the construction
\[
\varphi(x) = C_\varphi\,q^{-\gamma}\Bigl(u\bigl(q\|x\|\bigr) + \psi_{-\frac12,\frac12}\bigl(q\|x\|\bigr)\Bigr), \qquad u(x) = \frac{\int_x^{\infty}\psi_{\frac14,\frac12}(t)\,dt}{\int_{1/4}^{1/2}\psi_{\frac14,\frac12}(t)\,dt}.
\]

[Fig 4: The function x ↦ u(|x|) + ψ_{−1/2,1/2}(x); this function is infinitely smooth and nowhere concentrates at any constant on (−1/2, 1/2).]

Figure 4 explains the behavior of this function and helps towards a better understanding of our results. The constant C_ϕ is chosen in such a way that the constructed function ϕ(·) is (γ, L)-Hölder and upper-bounded by c′/2 ∧ v/4. Notice that the function ϕ satisfies, for all x ∈ B(0, (4q)^{−1}) and some absolute constant c > 0,
\[
c\,C_\varphi\,q^{-\gamma} \le \varphi(x) \le C_\varphi\,q^{-\gamma}\bigl(1+\psi_{-\frac12,\frac12}(0)\bigr) \le 2\,C_\varphi\,q^{-\gamma}.
\]
Finally, the function g is any (γ, L)-Hölder function with sufficiently bounded variation which is not concentrated around any constant, for example
\[
g(x) = C_g\,\bar u\bigl(\|x\|-\sqrt d-\rho\bigr)\cos\bigl(\|x\|-\sqrt d-\rho\bigr),
\]
with C_g chosen small enough to ensure that g is (γ, L)-Hölder and bounded by c′/2 ∧ v/4. It remains to specify the marginal distribution of X ∈ R^d. We select a Euclidean ball in R^d, denoted by A, that has an empty intersection with B(0, √d + ρ) and whose Lebesgue measure compensates the mass placed on the grid balls below. The density µ of the marginal distribution of X ∈ R^d is constructed as
• µ(x) = τ/Leb(B(0, (4q)^{−1})) for every x ∈ B(z, (4q)^{−1}) or x ∈ B(−z, (4q)^{−1}), with z ranging over the m grid points defining X₁, ..., X_m together with z = 0,
• µ(x) = (1 − (2m+1)τ)/Leb(A) for every x ∈ A,
• µ(x) = 0 for every other x ∈ R^d,
for some τ to be specified. Now we check that the distributions constructed above belong to the set P for every w ∈ W. Namely, we check the following list of assumptions:
• the functions p^w_1, ..., p^w_K define a regression function for every w ∈ W, that is, for each x ∈ R^d we have Σ_{k=1}^K p^w_k(x) = 1 and 0 ≤ p^w_k(x) ≤ 1;
• the functions p^w_1, ..., p^w_K are (γ, L)-Hölder;
• the function G^w(t) := Σ_{k=1}^K ∫_{R^d} 1{p^w_k(x) ≥ t} µ(x) dx is continuous;
• the threshold G^{-1}(β) is equal to v for every w ∈ W;
• the marginal distribution satisfies the strong density assumption;
• the regression function satisfies the α-margin assumption.

The regression function is well defined: to see this, notice that for every w ∈ W and every x ∈ R^d we have, by construction,
\[
p^w_{\beta+1}(x)+p^w_\beta(x) = 2v, \qquad \sum_{k=1}^{\beta-1}p^w_k(x)+\sum_{k=\beta+2}^{K}p^w_k(x) = (K-2)v,
\]
and the combination of both with v = 1/K implies Σ_{k=1}^K p^w_k(x) = 1. Moreover, as long as sup_{x∈X_i} ϕ(x) ≤ v/2 for i = −m, ..., −1, 1, ..., m, we have for every x ∈ R^d
\[
0 < v/2 \le p^w_{\beta+1}(x) \le 3v/2 \le 1, \qquad 0 < v/2 \le p^w_\beta(x) \le 3v/2 \le 1,
\]
and by construction of the function g we have, for every k = 1, ..., β−1, every x ∈ R^d and every w ∈ W, 0 ≤ p^w_k(x) ≤ v + 3c′; due to the choice of c′, v we have
\[
v + 3c' \le \frac{1}{K} + \frac{3(K-\beta-1)}{K^2(\beta-1)} \le \frac{4}{K} \le 1.
\]
Similarly, for every k = β+2, ..., K, every x ∈ R^d and every w ∈ W we have v − 3c′(β−1)/(K−β−1) ≤ p^w_k(x) ≤ 1, and with the choice of v, c′ specified above and the constraint β ≤ ⌊K/2⌋ we have
\[
v - \frac{3c'(\beta-1)}{K-\beta-1} = \frac{1}{K} - \frac{3}{K^2} \ge 0.
\]
Thus, the construction above defines a regression function for every w ∈ W.

The regression function is (γ, L)-Hölder: this follows immediately from the construction of ϕ, φ, ξ and g.

Continuity of G(t): first let us show that ∫_{R^d} 1{p^w_k(x) ≥ t} µ(x) dx is continuous for every k ∈ [K]. For k = 1, ..., β−1, β+2, ..., K the continuity follows from the fact that g is not concentrated around any constant. For k = β, β+1, decomposing the integral over the 2m+1 balls carrying the grid masses and over A, the continuity follows from the fact that ϕ, φ and g are not concentrated around any constant.

Threshold G^{-1}(β) = v: to see this, notice that for every w ∈ W
\[
\sum_{k=1}^{K}\mathbf{1}\{p^w_k(x) \ge v\} = \beta, \qquad \text{a.e. } \mu,
\]
and the condition on the threshold follows from the continuity of G(·). Besides, the corresponding β-Oracle sets Γ*_w are given for every w ∈ W as
\[
\Gamma^*_w(x) = \begin{cases}\{1,\dots,\beta-1,\beta\}, & x \in \mathcal{X}_i,\ w_i = 1,\\ \{1,\dots,\beta-1,\beta+1\}, & x \in \mathcal{X}_i,\ w_i = -1,\\ \{1,\dots,\beta-1,\beta\}, & x \in \mathcal{X}_{-i},\ w_i = -1,\\ \{1,\dots,\beta-1,\beta+1\}, & x \in \mathcal{X}_{-i},\ w_i = 1,\\ \{1,\dots,\beta-1,\beta\}, & x \in \mathbb{R}^d\setminus\bigl(\bigcup_{i=-m}^{m}\mathcal{X}_i\bigr).\end{cases}
\]

The strong density assumption: this can be checked following the proof of (Audibert and Tsybakov, 2007, Theorem 3.5), where an analogous construction of the marginal distribution was considered.

The α-margin assumption:
4, all k ∈ [ K ] \ { β, β + 1 } and all w ∈ W we have µ ( | p wk ( X ) − v | ≤ t ) = 0 , thus for k ∈ [ K ] \ { β, β + 1 } the margin assumption is satisfied. It remains tocheck that the margin assumption is satisfied for k ∈ { β, β +1 } . Fix an arbitrary w ∈ W and k = β , then for all t ≤ t we can write µ (cid:32) | p wk ( X ) − v | ≤ t (cid:33) = m (cid:88) i = − m µ ( | p wk ( X ) − v | ≤ t, X ∈ X i )= m (cid:88) i = − m,i (cid:54) =0 µ ( ϕ ( X − n q ( X )) ≤ t, X ∈ X i ) + µ ( φ ( X ) ≤ t, X ∈ X ) . hzhen, Denis and Hebiri/Confidence Sets We separately upper-bound both terms which appear on the right hand side ofthe equality. µ ( φ ( X ) ≤ t, X ∈ X ) = τ Leb( B (0 , (4 q ) − )) (cid:90) B (0 , (4 q ) − ) { φ ( X ) ≤ t } dx = τ Leb( B (0 , (4 q ) − )) (cid:90) B (0 , (4 q ) − ) (cid:26) C φ (2 q ) − γ (cid:16) (cid:107) x (cid:107) (2 q ) − (cid:17) (cid:100) γ (cid:101) ψ − , (cid:16) (cid:107) x (cid:107) (2 q ) − (cid:17) ≤ t (cid:27) dx = Cτ q − d Leb( B (0 , (4 q ) − )) (cid:90) B (0 , / (cid:110) ( (cid:107) x (cid:107) ) (cid:100) γ (cid:101) ψ − , ( (cid:107) x (cid:107) ) ≤ C − φ (2 q ) γ t (cid:111) dx , clearly there exists a constant C such that for all x ∈ B (0 , /
2) we have ψ − , ( (cid:107) x (cid:107) ) ≥ C ,
Therefore for some constant
C > µ ( φ ( X ) ≤ t, X ∈ X ) ≤ Cτ q − d Leb( B (0 , (4 q ) − )) (cid:90) B (0 , / (cid:110) (cid:107) x (cid:107)≤ C ( q ) γ/ (cid:100) γ (cid:101) t / (cid:100) γ (cid:101) (cid:111) dx ≤ Cτ q − d ( − γ/ (cid:100) γ (cid:101) )Leb( B (0 , (4 q ) − )) t d/ (cid:100) γ (cid:101) , thanks to the strong density assumption we can write for some C > µ ( φ ( X ) ≤ t, X ∈ X ) ≤ Cq − d ( − γ/ (cid:100) γ (cid:101) ) t d/ (cid:100) γ (cid:101) . Thus since 1 − γ/ (cid:100) γ (cid:101) ≥ d/ (cid:100) γ (cid:101) ≥ α we can write for some C > µ ( φ ( X ) ≤ t, X ∈ X ) ≤ Ct α . To finish this part it remains to upper-bound the other term in the marginassumption m (cid:88) i = − m,i (cid:54) =0 µ ( ϕ ( X − n q ( X )) ≤ t, X ∈ X i ) = 2 mτ Leb( B (0 , (4 q ) − )) (cid:90) B (0 , (4 q ) − ) { ϕ ( X ) ≤ t } dx , using the fact that the function ϕ ( x ) for all x ∈ B (0 , (4 q ) − ) satisfies C ϕ q − γ ≤ ϕ ( x ) ≤ C ϕ q − γ (cid:16) ψ − , (0) (cid:17) ≤ C ϕ q − γ , we can write for all t ≤ C ϕ q − γm (cid:88) i = − m,i (cid:54) =0 µ ( ϕ ( X − n q ( X )) ≤ t, X ∈ X i ) = 0 , moreover, for all t ≥ C ϕ q − γ we can write m (cid:88) i = − m,i (cid:54) =0 µ ( ϕ ( X − n q ( X )) ≤ t, X ∈ X i ) ≤ mτ , hzhen, Denis and Hebiri/Confidence Sets and finally for t ∈ ( C ϕ q − γ , C ϕ q − γ ) we can write m (cid:88) i = − m,i (cid:54) =0 µ ( ϕ ( X − n q ( X )) ≤ t, X ∈ X i ) = 2 mτ Leb( B (0 , (4 q ) − )) (cid:90) B (0 , (4 q ) − ) { ϕ ( X ) ≤ t } dx ≤ mτ Leb( B (0 , (4 q ) − )) (cid:90) B (0 , (4 q ) − ) { C ϕ q − γ ≤ t } dx = 2 mτ . The above implies that for some constant
C > 0 and all t ≤ t₀,

Σ_{i=−m, i≠0}^{m} µ(ϕ(X − n_q(X)) ≤ t, X ∈ X_i) ≤ 2τm 1{t ≥ c_ϕ q^{−γ}} ≤ C τ m q^{γα} t^α,

where the last inequality uses 1{t ≥ c_ϕ q^{−γ}} ≤ (t/(c_ϕ q^{−γ}))^α. Thus the margin assumption is satisfied as long as
• τm = O(q^{−γα});
• ⌈γ⌉α ≤ d.
Similarly, one can check that the margin assumption is satisfied for k = β + 1.

Bound on the KL-divergence: we are now in position to upper-bound the KL-divergence between any two hypotheses. Fix some w, w′ ∈ W; then, using the upper bound on ϕ(·), we can write for some C > 0

KL(P_w, P_{w′}) ≤ 2 Σ_{i=−m, i≠0}^{m} µ( ϕ(X − n_q(X)) log( (v + ϕ(X − n_q(X)))/(v − ϕ(X − n_q(X))) ), X ∈ X_i ) ≤ C m τ q^{−2γ}.
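To make the last inequality transparent, here is the standard two-point computation it relies on (a sketch; we only use log(1 + u) ≤ u, together with ϕ ≤ C_ϕ q^{−γ}, which keeps v − ϕ bounded away from zero for q large enough):

2ϕ log( (v + ϕ)/(v − ϕ) ) = 2ϕ log( 1 + 2ϕ/(v − ϕ) ) ≤ 4ϕ²/(v − ϕ) ≤ C ϕ² ≤ C C_ϕ² q^{−2γ}.

Integrating this pointwise bound over the 2m cells X_i, each of µ-mass τ, yields the stated bound C m τ q^{−2γ}.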
How many hypotheses to take: let us recall the following result, which is a version of the Varshamov–Gilbert bound (Gilbert, 1952; Varshamov, 1957).

Lemma C.1. Let δ(w, w′) denote the Hamming distance between w, w′ ∈ W, given by

δ(w, w′) := Σ_{i=1}^{m} 1{w_i ≠ w′_i}.

There exists W′ ⊂ W such that for all w ≠ w′ ∈ W′ we have δ(w, w′) ≥ m/8 and log|W′| ≥ (m/8) log 2.
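As a quick numerical illustration of Lemma C.1 (not of its proof, which is a counting argument), the following sketch greedily extracts from the cube {−1, 1}^m a family that is pairwise m/8-separated in Hamming distance and checks that it is exponentially large; the brute-force greedy routine and the toy value of m are ours.

```python
from itertools import product
from math import log

# Brute-force sanity check of the Varshamov-Gilbert-type packing for tiny m:
# greedily keep sign vectors lying at Hamming distance >= m/8 from every
# vector kept so far, then compare log|W'| with (m/8) * log 2.
m = 10
cube = product((-1, 1), repeat=m)        # the full hypercube {-1, +1}^m

def delta(w, u):
    """Hamming distance between two sign vectors."""
    return sum(a != b for a, b in zip(w, u))

packing = []
for w in cube:
    if all(delta(w, u) >= m / 8 for u in packing):
        packing.append(w)

print(f"|W'| = {len(packing)}")
print(f"log|W'| = {log(len(packing)):.2f} >= (m/8) log 2 = {m / 8 * log(2):.2f}")
```

The greedy packing is of course far from optimal, but it already exhibits the exponential size guaranteed by the lemma.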
Denote by W′ ⊂ W the set provided by Lemma C.1 and by P_{W′} the set of distributions P_w with w ∈ W′. Taking into account all of the above, we conclude that P_{W′} satisfies the assumptions of our result.

Lower bound on the Hamming risk (applying Birgé's Lemma A.1): finally, we are in position to lower-bound the Hamming risk. Recall that we are interested in the quantity

inf_{Γ̂} sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)|.

The rest of the proof follows standard arguments, which, again using the de Finetti notation, read as

inf_{Γ̂} sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)| ≥ inf_{Γ̂} sup_{w ∈ W′} µ^{⊗N} ⊗ P_w^{⊗n} [ µ( |Γ̂(X) △ Γ*_w(X)| ) ].

Denote by ŵ the following minimizer:

ŵ ∈ argmin_{w ∈ W′} µ( |Γ̂(X) △ Γ*_w(X)| );

thus, if w ≠ ŵ, we can write, using the definition of ŵ and the triangle inequality,

2 µ( |Γ̂(X) △ Γ*_w(X)| ) ≥ µ( |Γ̂(X) △ Γ*_w(X)| ) + µ( |Γ̂(X) △ Γ*_ŵ(X)| ) ≥ µ( |Γ*_ŵ(X) △ Γ*_w(X)| ) ≥ 2 δ(w, ŵ) µ(X₁) = 2 δ(w, ŵ) τ ≥ mτ/4.

These arguments and Birgé's Lemma A.1 imply that

sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)| ≥ (mτ/8) max_{w ∈ W′} µ^{⊗N} ⊗ P_w^{⊗n}( w ≠ ŵ )
 ≥ (mτ/8) [ 0.36 ∧ ( 1 − Σ_{w ∈ W′ \ {w′}} KL( µ^{⊗N} ⊗ P_w^{⊗n}, µ^{⊗N} ⊗ P_{w′}^{⊗n} ) / ( (|W′| − 1) log|W′| ) ) ].

Since the marginal distribution of the vector X ∈ R^d is shared among the hypotheses, using the upper bound on the KL-divergence and the conditions on W′ we get, for some C > 0,

sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)| ≥ (mτ/8) ( 1 − C n τ q^{−2γ} ).

Finally, let q = ⌊C̄ n^{1/(2γ+d)}⌋, τ = C′ q^{−d} and m = ⌊C″ q^{d−αγ}⌋ for some C̄, C′, C″ > 0. Then, for some constants C > 0 and c < 1,

sup_{P ∈ P} E_{(D_n, D_N)} E_{P_X} |Γ̂(X) △ Γ*_β(X)| ≥ C n^{−αγ/(2γ+d)} (1 − c).

One can easily verify that this choice of the parameters τ, m, q is possible as long as 2⌈γ⌉α ≤ d, and that with this choice we indeed have τm = O(q^{−αγ}). As already mentioned, the lower bound for the excess risk and the discrepancy follows from Propositions 3.1 and 3.2.
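For the reader's convenience, the exponent arithmetic behind this choice of parameters is the following direct substitution (constants absorbed into C̄, C′, C″):

mτ ≍ q^{d−αγ} q^{−d} = q^{−αγ} ≍ n^{−αγ/(2γ+d)},   n τ q^{−2γ} ≍ C′ n q^{−(2γ+d)} ≍ C′ C̄^{−(2γ+d)},

so that taking C′ small enough (equivalently, C̄ large enough) guarantees C n τ q^{−2γ} ≤ c < 1, while the leading factor mτ delivers the rate n^{−αγ/(2γ+d)}.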
Appendix D: Inconsistency of the top-β approach

In this section we prove Proposition 3.3. The proof builds an explicit construction of a distribution P whose β-Oracle satisfies |Γ*_β(x)| > β for all x in some A ⊂ R^d with P_X(A) > 0. Clearly, if such a distribution exists, then there is no estimator in Υ̂_β that consistently estimates this β-Oracle. Let β ∈ {0, 1, …, ⌊K/2⌋ − 1} be a fixed integer and let K ≥ 3.
For the proof of the theorem we shall construct a single distribution P for which no estimator with fixed information can perform well. We start by specifying the density µ of the marginal distribution P_X. For some positive r ≤ r′, define the annulus

D(r, r′) = { x ∈ R^d : r ≤ ‖x‖ ≤ r′ }.

First of all, fix some parameters 0 < r₀ < r₁ < r₂ < r₃ < r₄ which are independent of n, N. The density µ is supported on B(0, r₀) ∪ D(r₁, r₂) ∪ D(r₃, r₄). Moreover,
• µ(x) = ( β/(β+1) − Leb(B(0, r₀)) ) / Leb(D(r₁, r₂)) for all x ∈ D(r₁, r₂),
• µ(x) = 1 / ( (β+1) Leb(D(r₃, r₄)) ) for all x ∈ D(r₃, r₄),
• µ(x) = 1 for all x ∈ B(0, r₀),
• µ(x) = 0 otherwise,
where r₀ > 0 is chosen so that β/(β+1) − Leb(B(0, r₀)) > 0. The regression functions p(·) = (p₁(·), …, p_K(·))⊤ are defined as

p₁(x) = … = p_{β+1}(x) =
 1/(2(β+1)) + C_L (1 − cos(π‖x‖/r₀))/(β+1),    x ∈ B(0, r₀),
 1/(2(β+1)) + g(x)/(2(β+1)),                    x ∈ D(r₀, r₁),
 1/(β+1) − C_L (1 − cos(π‖x‖/r₂))/(β+1),        x ∈ D(r₁, r₂),
 1/(β+1) − ξ(x)/(2(β+1)),                       x ∈ D(r₂, r₃),
 1/(2(β+1)) − C_L (1 − cos(π‖x‖/r₃))/(β+1),     x ∈ R^d \ B(0, r₃),

p_{β+2}(x) = … = p_K(x) =
 1/(2(K−β−1)) − C_L (1 − cos(π‖x‖/r₀))/(K−β−1),    x ∈ B(0, r₀),
 1/(2(K−β−1)) − g(x)/(2(K−β−1)),                    x ∈ D(r₀, r₁),
 C_L (1 − cos(π‖x‖/r₂))/(K−β−1),                    x ∈ D(r₁, r₂),
 ξ(x)/(2(K−β−1)),                                   x ∈ D(r₂, r₃),
 1/(2(K−β−1)) + C_L (1 − cos(π‖x‖/r₃))/(K−β−1),     x ∈ R^d \ B(0, r₃),

so that Σ_{k=1}^{K} p_k(x) = 1 for every x ∈ R^d, where the constant C_L is chosen small enough to ensure that these functions are (γ, L)-Hölder and have sufficiently small variation. Consider an arbitrary infinitely differentiable function v : R → [0, 1]
which satisfies v(x) = 0 for all x ≤ 0 and v(x) = 1 for all x ≥ 1.
Then, the functions g(·) and ξ(·) are defined as

g(x) = v( (‖x‖ − r₀)/(r₁ − r₀) ),   ξ(x) = v( (‖x‖ − r₂)/(r₃ − r₂) ).

The above construction defines a distribution P for which we have

G^{−1}(β) = 1/(2(β+1)),
Γ*_β(x) = {1, …, β+1} if x ∈ B(0, r₀) ∪ D(r₁, r₂), and Γ*_β(x) = ∅ otherwise.

Indeed, let us evaluate the following quantity under the assumption that β ≤ ⌊K/2⌋ − 1:

Σ_{k=1}^{K} ∫ 1{ p_k(x) ≥ G^{−1}(β) } µ(x) dx = (β+1) ( ∫_{B(0,r₀)} µ(x) dx + ∫_{D(r₁,r₂)} µ(x) dx )
 = (β+1) ( Leb(B(0, r₀)) + β/(β+1) − Leb(B(0, r₀)) ) = β.

Thus, using this distribution, we can write for any classifier Γ̂ ∈ Υ̂_β with fixed cardinality

P(Γ̂) − P(Γ*_β) = ∫_{R^d} Σ_{k=1}^{K} | p_k(x) − G^{−1}(β) | 1{ k ∈ Γ̂(x) △ Γ*(x) } µ(x) dx
 ≥ ∫_{D(r₁,r₂)} | 1/(β+1) − C_L (1 − cos(π‖x‖/r₂))/(β+1) − 1/(2(β+1)) | µ(x) dx
 = ∫_{D(r₁,r₂)} | 1/(2(β+1)) − C_L (1 − cos(π‖x‖/r₂))/(β+1) | · ( β/(β+1) − Leb(B(0, r₀)) ) / Leb(D(r₁, r₂)) dx,

where the first inequality follows from the observation that for x ∈ D(r₁, r₂) there is always at least one label k such that k ∈ Γ̂(x) △ Γ*(x). Thus, since the constant C_L is chosen to satisfy 2C_L/(β+1) ≤ 1/(4(β+1)), we have for any Γ̂ ∈ Υ̂_β

P(Γ̂) − P(Γ*_β) ≥ ( β/(β+1) − Leb(B(0, r₀)) ) / ( 4(β+1) ).

If r₀ is such that Leb(B(0, r₀)) ≤ β/(2(β+1)), we get

P(Γ̂) − P(Γ*_β) ≥ β / ( 8(β+1)² ),   almost surely.

By construction, the regression vector is (γ, L)-Hölder and the density is lower- and upper-bounded by positive constants on its support. Hence, it remains to check that the constructed distribution satisfies the α-margin assumption. This can be achieved by an appropriate choice of r₀. Indeed, on the sets D(r₁, r₂) ∪ D(r₃, r₄) there is a "corridor" of constant size between the regression functions and the threshold G^{−1}(β); the threshold G^{−1}(β) is only approached by the regression functions on the set B(0, r₀). As all the parameters in our construction are independent of n, N ∈ N, we can choose r₀ small enough so that the α-margin assumption is verified for a fixed α > 0.
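As a complement, the following toy numerical sketch illustrates the mechanism exploited above in a simplified one-dimensional setting; the specific choices (K = 4, β = 1, X uniform on [0, 1], sinusoidal regression functions) are ours for illustration and are not the construction of the proof. Classes 1 and 2 are tied everywhere, so the β-Oracle keeps either both of them or neither: its cardinality exceeds β on a set of positive mass even though its expected size equals β, and any estimator constrained to output exactly β labels must therefore disagree with the oracle on that set.

```python
import numpy as np

# Toy illustration (hypothetical setup, not the proof's construction):
# K = 4 classes, beta = 1 <= floor(K/2) - 1, X ~ Uniform[0, 1] on a grid.
K, beta = 4, 1
x = np.linspace(0.0, 1.0, 20_001)          # fine uniform grid approximating P_X
p1 = 0.25 + 0.20 * np.sin(np.pi * x)       # p_1 = p_2: tied regression functions
p3 = (1.0 - 2.0 * p1) / 2.0                # p_3 = p_4, so sum_k p_k(x) = 1
P = np.vstack([p1, p1, p3, p3])

def G(t):
    """G(t) = sum_k P_X(p_k(X) > t), estimated on the grid."""
    return sum((pk > t).mean() for pk in P)

# Generalized inverse: the smallest t with G(t) <= beta (G is nonincreasing).
ts = np.linspace(0.0, 1.0, 2_001)
thr = ts[np.argmax([G(t) <= beta for t in ts])]

sizes = (P >= thr).sum(axis=0)             # |Gamma*_beta(x)| along the grid
print(f"G^-(beta) ~ {thr:.4f}")            # ~ 0.39 in this toy example
print(f"E|Gamma*| ~ {sizes.mean():.3f} (target: beta = {beta})")
print(f"P_X(|Gamma*(X)| > beta) ~ {(sizes > beta).mean():.3f}")  # ~ 0.5 > 0
```

The printout shows an oracle of expected size β whose cardinality is 2 on roughly half of the support, mirroring the role played by B(0, r₀) ∪ D(r₁, r₂) in the construction above.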