Minimax rates for cost-sensitive learning on manifolds with approximate nearest neighbours
Proceedings of Machine Learning Research 1:1–45, 2017 Algorithmic Learning Theory 2017
Minimax rates for cost-sensitive learning on manifolds with approximate nearest neighbours∗

Henry W. J. Reeve† (henry.reeve@manchester.ac.uk)
Gavin Brown (gavin.brown@manchester.ac.uk)
School of Computer Science, University of Manchester, Manchester M13 9PL

∗ The authors would like to thank Ata Kabán for numerous useful conversations which greatly improved the paper. We would also like to thank Joe Mellor and the anonymous reviewers for useful feedback. In addition we gratefully acknowledge the support of the EPSRC for the LAMBDA project (EP/N035127/1) and the Manchester Centre for Doctoral Training (EP/1038099/1).
† Corresponding author
Editors: Steve Hanneke and Lev Reyzin
Abstract
We study the approximate nearest neighbour method for cost-sensitive classification on low-dimensional manifolds embedded within a high-dimensional feature space. We determine the minimax learning rates for distributions on a smooth manifold, in a cost-sensitive setting. This generalises a classic result of Audibert and Tsybakov. Building upon recent work of Chaudhuri and Dasgupta we prove that these minimax rates are attained by the approximate nearest neighbour algorithm, where neighbours are computed in a randomly projected low-dimensional space. In addition, we give a bound on the number of dimensions required for the projection which depends solely upon the reach and dimension of the manifold, combined with the regularity of the marginal.
1. Introduction
The nearest neighbour method is a simple and intuitive approach to classification with numerous strong theoretical properties. A classical result of Stone (1977) gives convergence to the Bayes error. More recently, Chaudhuri and Dasgupta (2014) demonstrated that the nearest neighbour method adapts to the unknown level of noise, expressed as a margin condition. Indeed, under Tsybakov's margin condition, the risk converges at the minimax optimal rates for binary classification identified by Audibert et al. (2007), which require that the marginal is absolutely continuous with respect to the Lebesgue measure. In this work we move the analysis of minimax rates for classification closer to practical settings encountered in machine learning applications.

High dimensional feature spaces occur in many machine learning applications, from computer vision through to genome analysis and natural language processing. Whilst the dimensionality of the feature space may be high, the data itself is often constrained to a low-dimensional manifold. This renders the assumption of an absolutely continuous marginal distribution inappropriate. As we shall see, optimal classification rates are dependent upon the intrinsic complexity of the manifold.

High-dimensional feature spaces also give rise to computational challenges. Indeed, an exact nearest neighbour search is often prohibitively expensive when the feature space is high dimensional (Indyk and Motwani (1998)). An efficient approach to dealing with these computational challenges is random projections (Dirksen (2016)). Kabán (2015) demonstrated that the randomly projected nearest neighbour method is capable of exploiting low intrinsic complexity within the data set in the distribution-free setting. However, these distribution-free bounds are non-optimal when Tsybakov's margin condition holds. We provide optimal distribution-dependent bounds for the approximate nearest neighbour method.

The seminal works of Audibert et al. (2007) and Chaudhuri and Dasgupta (2014) target the overall classification risk. In doing so they implicitly make the assumption of symmetric costs. However, many real-world applications, from medical diagnosis through to fraud detection, have asymmetric costs: different mis-classification errors incur different costs (see Elkan (2001); Dmochowski et al. (2010)). The nearest neighbour method can be straightforwardly adapted to the cost-sensitive setting in which asymmetric costs are taken into account (see Section 5). In this work we analyse the nearest neighbour method in high-dimensional cost-sensitive settings, in keeping with our goal of bringing the analysis of nearest neighbour methods closer to practical settings encountered in machine learning applications.

Building upon previous work of Audibert et al. (2007); Chaudhuri and Dasgupta (2014); Kabán (2015), we give optimal, distribution-dependent, and cost-sensitive bounds for the approximate nearest neighbour method, when the data is concentrated on a low-dimensional manifold. Specifically, we provide the following.
• We determine the minimax learning rates for a natural family of distributions supported on embedded manifolds, in a multi-class cost-sensitive setting (Section 4);

• We demonstrate that these rates are attained by an approximate nearest neighbour algorithm, where neighbours are computed in a low-dimensional randomly projected space (Section 5);

• We give a bound on the number of dimensions required for the projection to attain optimal learning rates, which depends solely upon the reach and dimension of the manifold, combined with the regularity of the marginal distribution (Section 5).

We begin by introducing some notation in Sections 2 and 3 before stating our main results in Sections 4, 5 and 6. Detailed proofs are provided in the Appendix: Sections A, B, C and D.
2. Background I: Classification with nearest neighbours
In this section we introduce our notation and give the relevant background on the nearest neighbour method and distribution-dependent bounds. This will lead us into a discussion of high dimensional data and manifolds. We will bring these two strands together in Sections 4 and 5.
Suppose we have a distribution $P$ over $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $(\mathcal{X}, \rho)$ is a metric space and $\mathcal{Y} = \{1, \cdots, L\}$ is a discrete space of labels. We let $\Delta(\mathcal{Y}) \subset \mathbb{R}^L$ denote the $(L-1)$-simplex consisting of probability vectors over $\mathcal{Y}$. The distribution $P$ over $\mathcal{Z}$ is determined by a marginal distribution $\mu$ on $\mathcal{X}$, and a conditional probability specified by $\eta : \mathcal{X} \rightarrow \Delta(\mathcal{Y})$, where for each $x \in \mathcal{X}$, $y \in \mathcal{Y}$, $\eta(x)_y = P[Y = y \,|\, X = x]$. We let $\mathbb{E}$ denote expectation over random pairs $(X, Y) \sim P$. We take a fixed cost matrix $\Phi$ with entries $\phi_{ij} \geq 0$ denoting the cost incurred by predicting class $i$ when the true label is class $j$. Following Elkan (2001) we shall say that a cost matrix $\Phi$ is reasonable when $\phi_{ii} < \phi_{ji}$ for all $i, j$ with $j \neq i$. Often it is assumed that all mistakes are equally expensive and the cost matrix $\Phi_{0,1}$, with all diagonal entries equal to $0$ and all non-diagonal entries equal to $1$, is used. However, there are many application domains where the assumption of a symmetric cost matrix is highly inaccurate (Dmochowski et al. (2010)). The class of reasonable cost matrices provides a more general assumption applicable to a wide range of cost-sensitive scenarios. In particular, the class of reasonable cost matrices generalises the class considered by Zhang (2004).

Given a cost matrix $\Phi$, the risk $\mathcal{R}(f)$ of a classifier $f : \mathcal{X} \rightarrow \mathcal{Y}$ is defined by $\mathcal{R}(f) = \mathbb{E}\left[\phi_{f(X), Y}\right]$. The Bayes risk $\mathcal{R}^*$ is defined by $\mathcal{R}^* = \inf\{\mathcal{R}(f) : f : \mathcal{X} \rightarrow \mathcal{Y} \text{ is Borel}\}$. Our goal is to obtain a classifier $f : \mathcal{X} \rightarrow \mathcal{Y}$ with $\mathcal{R}(f)$ as close as possible to $\mathcal{R}^*$. Whilst we do not have direct access to the distribution $P$, we do have access to a random data set $\mathcal{D}_n = \{Z_1, \cdots, Z_n\}$ with $Z_i$ selected independently according to $P$. Equivalently, $\mathcal{D}_n \sim P^n$ where $P^n$ denotes the product measure $\prod_{i=1}^n P$ on $\mathcal{Z}^n$. We let $\mathbb{E}_n$ denote expectation over data sets $\mathcal{D}_n \sim P^n$ and let $\mathcal{F}_n$ denote the set of feature vectors in the training data, i.e. $\mathcal{F}_n = \{X_1, \cdots, X_n\}$ where $\mathcal{D}_n = \{(X_1, Y_1), \cdots, (X_n, Y_n)\}$.

One of the central goals of statistical learning theory is to establish optimal bounds on the risk of a classifier, under natural assumptions on the distribution. The seminal work of Audibert et al. (2007) established optimal bounds for a class of distributions supported on regular sets of positive Lebesgue measure, satisfying a margin condition. To recall these results we require some notation.
Definition 2.1 (Regular sets and measures)
Suppose we have a measure $\upsilon$ on the metric space $(\mathcal{X}, \rho)$. A subset $A \subseteq \mathcal{X}$ is said to be a $(c_0, r_0)$-regular set with respect to the measure $\upsilon$ if for all $x \in A$ and all $r \in (0, r_0)$ we have $\upsilon(A \cap B_r(x)) \geq c_0 \cdot \upsilon(B_r(x))$, where $B_r(x)$ denotes the open metric ball of radius $r$, centred at $x$. A measure $\mu$ with support $\mathrm{supp}(\mu) \subseteq \mathcal{X}$ is said to be a $(c_0, r_0, \nu_{\min}, \nu_{\max})$-regular measure with respect to $\upsilon$ if $\mathrm{supp}(\mu)$ is a $(c_0, r_0)$-regular set with respect to $\upsilon$ and $\mu$ is absolutely continuous with respect to $\upsilon$ with Radon–Nikodym derivative $\nu(x) = d\mu(x)/d\upsilon(x)$, such that for all $x \in \mathrm{supp}(\mu)$ we have $\nu_{\min} \leq \nu(x) \leq \nu_{\max}$.

The assumption of a regular marginal ensures that with high probability there are a large number of training points in the vicinity of the average test point.
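For example (our own illustration, not from the paper), the unit cube $[0,1]^d$ is a $(2^{-d}, 1/2)$-regular set with respect to the Lebesgue measure $\mathcal{L}_d$: for every $x \in [0,1]^d$ and $r \in (0, 1/2)$, at least one orthant of $B_r(x)$ is contained in the cube (take the orthant whose $i$-th sign is $+$ if $x_i \leq 1/2$ and $-$ otherwise), so that
$$\mathcal{L}_d\left([0,1]^d \cap B_r(x)\right) \geq 2^{-d} \cdot \mathcal{L}_d(B_r(x)).$$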
Definition 2.2 (Hölder conditional)
The conditional $\eta$ is said to be Hölder continuous with constants $(\alpha, C_\alpha)$ if for $\mu$ almost every $x_0, x_1 \in \mathcal{X}$ we have $\|\eta(x_0) - \eta(x_1)\|_\infty \leq C_\alpha \cdot \rho(x_0, x_1)^\alpha$.

The assumption of a Hölder conditional ensures that the proximity between a test point and its closest training points is reflected in a similar value for the conditional probability. Hence, higher Hölder exponents correspond to faster learning rates. Audibert and Tsybakov also considered higher order smoothness conditions and showed that in such settings even faster learning rates are attainable (see (Audibert et al., 2007, Section 2 & Section 3, Theorem 3.3)). However, in this work we restrict our attention to the Hölder condition given in Definition 2.2.
Definition 2.3 (Tsybakov’s Margin condition)
Suppose that $\mathcal{Y} = \{1, 2\}$ and $\Phi = \Phi_{0,1}$. We shall say that $P$ satisfies the margin condition with constants $(C_\beta, \beta)$ if for all $\zeta > 0$ we have
$$\mu\left(\left\{x \in \mathcal{X} : 0 < \left|\eta(x)_1 - \tfrac{1}{2}\right| \leq \zeta\right\}\right) \leq C_\beta \cdot \zeta^\beta.$$
The margin condition bounds the probability that the labels of training points in the vicinity of the test point will disagree with the mode of the conditional label distribution. We generalise the condition to an arbitrary $L \times L$ cost matrix as follows. Given $n \in \Delta(\mathcal{Y})$ we let $\mathcal{Y}^*_\Phi(n) \subseteq \mathcal{Y}$ denote the set of labels with minimal associated cost. That is, $\mathcal{Y}^*_\Phi(n) = \mathrm{argmin}_{y \in \mathcal{Y}}\left\{e(y)^T \Phi\, n\right\}$, where $e(y) \in \{0,1\}^{L \times 1}$ denotes the 'one-hot encoding', satisfying $e(y)_y = 1$ and $e(y)_l = 0$ for $l \neq y$. Given $n \in \Delta(\mathcal{Y})$ we define
$$M_\Phi(n) = \min\left\{(e(y_1) - e(y_0))^T \Phi\, n \;:\; y_0 \in \mathcal{Y}^*_\Phi(n),\; y_1 \in \mathcal{Y} \setminus \mathcal{Y}^*_\Phi(n)\right\}.$$
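To make these quantities concrete, the following short Python sketch computes $\mathcal{Y}^*_\Phi(n)$ and $M_\Phi(n)$ for a toy reasonable cost matrix (our own illustrative code, not from the paper; labels are 0-indexed here, whereas the paper uses $\{1, \cdots, L\}$):

```python
import numpy as np

def optimal_labels(Phi, n):
    """Return the set of cost-minimising labels Y*_Phi(n) (0-indexed)."""
    costs = Phi @ n  # costs[y] = e(y)^T Phi n
    return set(np.flatnonzero(np.isclose(costs, costs.min())))

def margin(Phi, n):
    """Return M_Phi(n): the cost gap between the cheapest suboptimal
    label and an optimal label."""
    costs = Phi @ n
    opt = optimal_labels(Phi, n)
    sub = [costs[y] for y in range(len(costs)) if y not in opt]
    return min(sub) - costs.min() if sub else np.inf

# A reasonable 2x2 cost matrix: each diagonal entry is cheaper than
# the corresponding off-diagonal entry (phi_ii < phi_ji).
Phi = np.array([[0.0, 5.0],
                [1.0, 0.0]])
n = np.array([0.7, 0.3])  # a probability vector in the simplex
print(optimal_labels(Phi, n), margin(Phi, n))  # {1} 0.8
```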
Definition 2.4 (Margin condition)

We shall say that $P$ satisfies the margin condition with constants $(C_\beta, \beta, \zeta_{\max})$ if for all $\zeta \in (0, \zeta_{\max})$ we have
$$\mu\left(\left\{x \in \mathcal{X} : M_\Phi(\eta(x)) \leq \zeta\right\}\right) \leq C_\beta \cdot \zeta^\beta.$$

Definitions 2.1, 2.2 and 2.4 characterise a natural family of distributions as follows.
Definition 2.5 (Measure classes)
Fix positive constants $\alpha, \beta, r_0, c_0, \nu_{\min}, \nu_{\max}, \zeta_{\max}, C_\alpha, C_\beta$ and let $\Gamma = \left[(c_0, r_0, \nu_{\min}, \nu_{\max}), (\beta, C_\beta, \zeta_{\max}), (\alpha, C_\alpha)\right]$. We let $\mathcal{P}_\Phi(\upsilon, \Gamma)$ denote the class of all probability measures $P$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ such that

• $\mu$ is a $(c_0, r_0, \nu_{\min}, \nu_{\max})$-regular measure with respect to $\upsilon$,

• $\eta$ is Hölder continuous with constants $(\alpha, C_\alpha)$,

• $P$ satisfies the margin condition with constants $(\beta, C_\beta, \zeta_{\max})$.

Audibert et al. (2007) gave the following minimax result for classes of the form $\mathcal{P}_{\Phi_{0,1}}(\mathcal{L}_d, \Gamma)$, where $\mathcal{L}_d$ denotes the $d$-dimensional Lebesgue measure on $\mathbb{R}^d$. The Euclidean metric $\rho$ on $\mathbb{R}^d$ is given by $\rho(x_0, x_1) = \|x_0 - x_1\|$, where $\|\cdot\|$ denotes the Euclidean norm. To state the result we must distinguish between the estimation procedure $\hat{f}^{\mathrm{est}}_n \in \left(\mathcal{Y}^\mathcal{X}\right)^{\mathcal{Z}^n}$ and the classifier $\hat{f}_n \in \mathcal{Y}^\mathcal{X}$. The estimation procedure $\hat{f}^{\mathrm{est}}_n : \mathcal{Z}^n \rightarrow \mathcal{Y}^\mathcal{X}$ is a mapping which takes a data set $\mathcal{D}_n \in \mathcal{Z}^n$ and outputs a classifier $\hat{f}_n : \mathcal{X} \rightarrow \mathcal{Y}$, which implicitly depends upon the particular data set $\mathcal{D}_n$.

Theorem 1 (Audibert et al. (2007))
Take $d \in \mathbb{N}$, $\mathcal{X} = \mathbb{R}^d$, let $\rho$ denote the Euclidean metric and set $\mathcal{Y} = \{1, 2\}$. There exist positive constants $C_0, R_0, V_-, V_+$ such that for all $c_0 \in (0, C_0)$, $r_0 \in (0, R_0)$, $\nu_{\min} \in (0, V_-)$, $\nu_{\max} \in (V_+, \infty)$, $\alpha \in (0, 1]$, $\beta \in (0, d/\alpha)$, $C_\alpha, C_\beta > 0$, $\zeta_{\max} > 0$, if we take $\Gamma = \left[(c_0, r_0, \nu_{\min}, \nu_{\max}), (\beta, C_\beta, \zeta_{\max}), (\alpha, C_\alpha)\right]$, we have
$$\inf\left\{\sup\left\{\mathbb{E}_{P^n}\left[\mathcal{R}\left(\hat{f}_n\right)\right] - \mathcal{R}^* : P \in \mathcal{P}_{\Phi_{0,1}}(\mathcal{L}_d, \Gamma)\right\} : \hat{f}^{\mathrm{est}}_n \in \left(\mathcal{Y}^\mathcal{X}\right)^{\mathcal{Z}^n}\right\} = \Theta\left(n^{-\frac{\alpha(1+\beta)}{2\alpha + d}}\right),$$
with upper and lower constants determined solely by $d$ and $\Gamma$.

Here we use standard complexity notation (Cormen, 2009, Chapter 3). In the proof of Theorem 1, Audibert et al. also identified a classifier based on kernel density estimation which attains the minimax optimal convergence rates (Audibert et al. (2007)). This raises the interesting question of which other classifiers attain these rates.
The nearest neighbour classifier is constructed as follows. Given $k \leq n$ we let $S^\circ_k(x, \mathcal{F}_n) \subseteq \{1, \cdots, n\}$ denote a set of $k$-nearest neighbour indices. That is, $\mathcal{I} = S^\circ_k(x, \mathcal{F}_n)$ minimises $\max\{\rho(x, X_i) : i \in \mathcal{I}\}$ over all sets $\mathcal{I} \subseteq \{1, \cdots, n\}$ with $|\mathcal{I}| = k$. The $k$-nearest neighbour classifier is defined by $\hat{f}_{k,n}(x) = \mathrm{Mode}(\{Y_i : i \in S^\circ_k(x, \mathcal{F}_n)\})$.

Despite its simplicity the nearest neighbour classifier has strong theoretical properties. Shalev-Shwartz and Ben-David gave an elegant proof that the generalisation error of the $1$-nearest neighbour classifier converges to at most $2 \cdot \mathcal{R}^* + O(n^{-1/(1+d)})$, for Lipschitz $\eta$, without any assumptions on the marginal $\mu$ (Shalev-Shwartz and Ben-David, 2014, Chapter 19). This approach can be extended to all metric spaces of doubling dimension $d$ (see Kontorovich and Weiss (2014)). Chaudhuri and Dasgupta gave distribution-dependent bounds for the nearest neighbour method on general metric spaces (Chaudhuri and Dasgupta (2014)). Rather than relying directly upon the Hölder continuity of the conditional, Chaudhuri and Dasgupta introduced the following smoothness condition, which is especially suited to non-parametric classification. Given $x \in \mathcal{X}$ and $r > 0$ we let $B_r(x) = \{\tilde{x} \in \mathcal{X} : \rho(x, \tilde{x}) < r\}$, and $\bar{B}_r(x) = \{\tilde{x} \in \mathcal{X} : \rho(x, \tilde{x}) \leq r\}$.

Definition 2.6 (Measure-smooth conditional)
The conditional $\eta$ is said to be measure-smooth with constants $(\lambda, C_\lambda)$ if for $\mu$ almost every $x_0, x_1 \in \mathcal{X}$ we have
$$\|\eta(x_0) - \eta(x_1)\|_\infty \leq C_\lambda \cdot \mu\left(\bar{B}_{\rho(x_0, x_1)}(x_0)\right)^\lambda.$$

Chaudhuri and Dasgupta (2014) proved that whenever the conditional is measure-smooth and the Tsybakov margin condition holds, then for $\Phi = \Phi_{0,1}$ and $\mathcal{Y} = \{1, 2\}$, the risk of the nearest neighbour method converges to the Bayes error at a rate $O\left(n^{-\lambda(1+\beta)/(2\lambda+1)}\right)$. It follows that the nearest neighbour classifier attains the optimal convergence rates for $P \in \mathcal{P}_{\Phi_{0,1}}\left(\mathcal{L}_d, \Gamma\right)$ given in Theorem 1 (Chaudhuri and Dasgupta, 2014, Lemma 2).
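As an illustration, here is a minimal sketch of the exact $k$-nearest neighbour rule described above (our own illustrative code, plain NumPy with brute-force search; not from the paper):

```python
import numpy as np
from collections import Counter

def knn_indices(x, X_train, k):
    """Exact k-nearest neighbour indices S_k(x, F_n) under the
    Euclidean metric, by brute-force search."""
    dists = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(dists)[:k]

def knn_classify(x, X_train, y_train, k):
    """The k-nearest neighbour classifier: the mode of the labels
    of the k nearest training points."""
    idx = knn_indices(x, X_train, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]
```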
3. Background II: High dimensional data
A wide variety of machine learning application domains, from computer vision through to genome analysis, involve extremely high-dimensional feature spaces. Nonetheless, statistical regularities in the data often mean that its intrinsic complexity is much lower than the number of features. A natural approach to modelling this low intrinsic complexity is to assume that the data lies on a manifold (see Roweis and Saul (2000); Tenenbaum et al. (2000); Park et al. (2015)).
We shall consider a compact $C^\infty$-smooth submanifold $\mathcal{M} \subset \mathbb{R}^d$ of dimension $\gamma$ (see Lee (1997)). The manifold is endowed with two natural metrics. Given a pair of points $x_0, x_1 \in \mathcal{M}$, distances may be computed either with respect to the Euclidean metric $\rho(x_0, x_1) = \|x_0 - x_1\|$ (since $\mathcal{M} \subset \mathbb{R}^d$), or with respect to the geodesic distance induced by the manifold,
$$\rho_g(x_0, x_1) := \inf\left\{\int_0^1 \|\dot{c}(t)\|\, dt \;:\; c : [0, 1] \rightarrow \mathcal{M} \text{ is piecewise } C^1 \text{ with } c(0) = x_0 \;\&\; c(1) = x_1\right\}.$$
We shall make use of the concept of reach $\tau$ introduced by Federer (1959) and investigated by Niyogi et al. (2008). The reach $\tau$ of a manifold $\mathcal{M}$ is defined by
$$\tau := \sup\left\{r > 0 \;:\; \forall z \in \mathbb{R}^d,\; \inf_{q \in \mathcal{M}}\{\|z - q\|\} < r \implies \exists!\, p \in \mathcal{M} \text{ with } \|z - p\| = \inf_{q \in \mathcal{M}}\{\|z - q\|\}\right\}.$$
Note that Niyogi et al. (2008) refer to the condition number $1/\tau$, which is the reciprocal of the reach $\tau$. We let $dV_\mathcal{M}$ denote the Riemannian volume element and $V_\mathcal{M}$ the Riemannian volume.

The study of minimax rates for data lying on a low-dimensional manifold has received substantial attention in the regression setting. In particular Kpotufe (2011) determined the minimax optimal rates for regression on metric spaces of a given intrinsic dimension and showed that $k$-nearest neighbour regression attains these rates. Regression with nearest neighbours has also been studied in the semi-supervised domain (Goldberg et al. (2009); Moscovich et al. (2017)). By combining Theorem 4 from Chaudhuri and Dasgupta (2014) with (Eftekhari and Wakin, 2015, Lemma 12) one obtains an upper bound on the risk for the binary $k$-nearest neighbour classifier on a manifold. However, Chaudhuri and Dasgupta (2014) do not present minimax lower bounds for classification on non-Euclidean spaces. Proving lower bounds on non-Euclidean spaces for classification is complicated by the necessity of constructing distributions which simultaneously satisfy the margin condition and have a marginal with a regular support.

High dimensional data is computationally problematic for the nearest neighbour method. The naive nearest neighbour search depends linearly on the number of dimensions. More sophisticated solutions which are logarithmic in the number of examples lead to either a time or space complexity which is exponential in the number of features (Andoni and Indyk (2006)). Thus, nearest neighbour classification based on an exact nearest neighbour search is computationally prohibitive for high dimensional data sets with many examples. A popular and computationally tractable alternative is to use approximate nearest neighbours (Indyk and Motwani (1998)).

Given $\theta \geq 1$, a family of mappings $S = \{S_k\}_{k \in \mathbb{N}}$ is said to generate $\theta$-approximate nearest neighbours if for each $x \in \mathcal{X}$, $n \in \mathbb{N}$ and $k \leq n$, $S_k(x, \mathcal{F}_n)$ is a subset of $\{1, \cdots, n\}$ with cardinality $k$ such that $\max\{\rho(x, X_i) : i \in S_k(x, \mathcal{F}_n)\} \leq \theta \cdot \max\{\rho(x, X_i) : i \in S^\circ_k(x, \mathcal{F}_n)\}$. The associated approximate nearest neighbour classifier is given by $\hat{f}^S_{k,n}(x) = \mathrm{Mode}(\{Y_i : i \in S_k(x, \mathcal{F}_n)\})$.

A highly efficient approach to generating approximate nearest neighbours is to apply a subgaussian random projection (see Dirksen (2016)). Given a random variable $u$, the subgaussian norm is given by $\|u\|_\psi := \inf\left\{\psi > 0 \;:\; \mathbb{E}_u\left[\exp\left(\|u\|^2/\psi^2\right)\right] \leq 2\right\}$. Whenever $\|u\|_\psi < \infty$ the random variable $u$ is said to be subgaussian. More generally, a $d$-dimensional random vector $v$ is said to be subgaussian if $\|v\|_\psi := \sup\left\{\|v^T w\|_\psi : w \in \mathbb{R}^d,\; \|w\| \leq 1\right\} < \infty$. A random vector $v$ is said to be isotropic if for all $w \in \mathbb{R}^d$ we have $\mathbb{E}_v\left[\left(v^T w\right)^2\right] = \|w\|^2$. By a subgaussian random projection $\varphi : \mathbb{R}^d \rightarrow \mathbb{R}^h$ we shall mean a random mapping constructed by taking $h$ independent and identically distributed subgaussian and isotropic random vectors $v_1, \cdots, v_h$, taking a random matrix $V := [v_1, \cdots, v_h]^T$ and letting $\varphi(x) = \sqrt{1/h} \cdot V x$. The subgaussian norm is extended to subgaussian random projections $\varphi$ by defining $\|\varphi\|_\psi := \|v_1\|_\psi$.
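A minimal sketch of such a projection (using Gaussian rows, which are subgaussian and isotropic; our own illustrative code, not from the paper):

```python
import numpy as np

def subgaussian_projection(d, h, rng):
    """Draw a random projection phi(x) = sqrt(1/h) * V x, where the rows
    of V are i.i.d. isotropic subgaussian vectors (here: standard Gaussian)."""
    V = rng.standard_normal((h, d))
    return lambda x: np.sqrt(1.0 / h) * (V @ x)

rng = np.random.default_rng(0)
phi = subgaussian_projection(d=1000, h=20, rng=rng)
x = rng.standard_normal(1000)
print(phi(x).shape)  # (20,)
```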
Example:
An interesting family of subgaussian random projections is the set of 'database-friendly' projections introduced by Achlioptas (2003). The entries of the matrix are chosen i.i.d. from the set $\{-\sqrt{3}, 0, +\sqrt{3}\}$, with respective probabilities $1/6$, $2/3$, $1/6$. The high (expected) level of sparsity gives rise to projections which are efficient both to store and to apply.

Given a subgaussian random projection $\varphi$, we define an associated family of mappings $S(\varphi) = \{S^\varphi_k\}_{k \in \mathbb{N}}$ by letting $S^\varphi_k(x, \mathcal{F}_n)$ denote the set of $k$ nearest neighbours of $\varphi(x)$ in the randomly projected feature space $\mathbb{R}^h$, i.e. $S^\varphi_k(x, \mathcal{F}_n) = S^\circ_k(\varphi(x), \varphi(\mathcal{F}_n))$ where $\varphi(\mathcal{F}_n) = \{\varphi(X_1), \cdots, \varphi(X_n)\}$. When $S(\varphi)$ generates approximate nearest neighbours we let $\hat{f}^\varphi_{k,n}$ denote $\hat{f}^{S(\varphi)}_{k,n}$. The celebrated Johnson–Lindenstrauss theorem states that, given a finite data set $A \subset \mathbb{R}^d$, for $h = \Omega(\log(\#A))$, with high probability, a subgaussian random projection $\varphi : \mathbb{R}^d \rightarrow \mathbb{R}^h$ will be bi-Lipschitz on $A$ (Johnson and Lindenstrauss (1984); Matoušek (2008)). Moreover, the bi-Lipschitz property implies that the associated family of mappings $S(\varphi)$ generates approximate nearest neighbours (see Lemma D.4), a fact that was utilised by Indyk and Motwani (1998). Klartag and Mendelson (2005) showed that it is sufficient to take $h = \Omega\left(\gamma_{\mathrm{tal}}(A)\right)$, where $\gamma_{\mathrm{tal}}(A)$ quantifies the metric complexity of $A$ (see Section G for details). Recently, Dirksen (2016) has built upon the work of Klartag and Mendelson to give a unified theory of dimensionality reduction, which will be critical for our main results. The results of Klartag and Mendelson (2005) and Dirksen (2016) are highly significant from a statistical learning theory perspective, since for many natural examples, such as when $\mathcal{X} \subset \mathbb{R}^d$ lies within a smooth manifold, the metric complexity $\gamma_{\mathrm{tal}}(\mathcal{X})$ may be bounded independently of the cardinality of $\mathcal{X}$. In such cases we may bound the required number of projection dimensions $h$ independently of the number of examples $n$. Indeed, Kabán (2015) applied the results of Klartag and Mendelson (2005) to obtain improved bounds on the generalisation error of the approximate nearest neighbour classifier $\hat{f}^\varphi_{k,n}$ when $\mathcal{X} \subset \mathbb{R}^d$ and the metric complexity $\gamma_{\mathrm{tal}}(\mathcal{X}) \ll d$. Kabán (2015) showed that, given a subgaussian random projection $\varphi : \mathbb{R}^d \rightarrow \mathbb{R}^h$ with $h = \Omega\left(\gamma_{\mathrm{tal}}(\mathcal{X})\right)$, with high probability the generalisation error of the approximate $1$-nearest neighbour classifier $\hat{f}^\varphi_{1,n}$ is bounded above by a constant multiple of $\mathcal{R}^*$ plus $O(n^{-1/(1+h)})$. Kabán (2015) also gives the same bound for the exact nearest neighbour classifier, dependent upon $\gamma_{\mathrm{tal}}(\mathcal{X})$. Hence, convergence rates for both the approximate nearest neighbour classifier and the exact nearest neighbour classifier may be improved under the assumption of low metric complexity in the distribution-free setting.

This raises the following questions. If we combine an assumption of low intrinsic complexity with regularity assumptions analogous to those in Theorem 1, then what are the best possible rates? How do these rates depend upon geometric properties of the manifold? Are these rates the same for all reasonable cost matrices? Which algorithms attain these bounds?
4. Minimax rates for cost-sensitive learning on manifolds
In this section we shall give minimax learning rates for cost-sensitive learning on manifolds. Recall that a cost matrix $\Phi$ is said to be reasonable when $\phi_{ii} < \phi_{ji}$ for all $i, j$ with $j \neq i$ (see Section 2.1). As in Section 2.2 we distinguish between the estimation procedure $\hat{f}^{\mathrm{est}}_n : \mathcal{Z}^n \rightarrow \mathcal{Y}^\mathcal{X}$ and the classifier $\hat{f}_n : \mathcal{X} \rightarrow \mathcal{Y}$, which implicitly depends on the data set $\mathcal{D}_n$. A key feature of our bound is that the constants are uniform over all manifolds $\mathcal{M}$ of a given reach $\tau$ and intrinsic dimension $\gamma$.

Theorem 2
Take $d \in \mathbb{N}$ and let $\rho$ be the Euclidean metric on $\mathbb{R}^d$. Suppose that $\Phi$ is a reasonable cost matrix and $\mathcal{M} \subseteq \mathbb{R}^d$ is a compact smooth submanifold with dimension $\gamma$, reach $\tau$ and Riemannian volume $V_\mathcal{M}$. There exists a positive constant $Z_\Phi$, determined by $\Phi$, and positive constants $C_0, R_0, V_-, V_+$ determined by $\gamma, \tau$, such that for all $c_0 \in (0, C_0)$, $r_0 \in (0, R_0)$, $\nu_{\min} \in (0, V_-)$, $\nu_{\max} \in (V_+, \infty)$, $\zeta_{\max} \in (0, Z_\Phi)$, $\alpha \in (0, 1]$, $\beta \in (0, \gamma/\alpha)$, $C_\alpha, C_\beta > 0$, if we take $\Gamma = \left[(c_0, r_0, \nu_{\min}, \nu_{\max}), (\beta, C_\beta, \zeta_{\max}), (\alpha, C_\alpha)\right]$ we have
$$\inf\left\{\sup\left\{\mathbb{E}_{P^n}\left[\mathcal{R}\left(\hat{f}_n\right)\right] - \mathcal{R}^* : P \in \mathcal{P}_\Phi(V_\mathcal{M}, \Gamma)\right\} : \hat{f}^{\mathrm{est}}_n \in \left(\mathcal{Y}^\mathcal{X}\right)^{\mathcal{Z}^n}\right\} = \Theta\left(n^{-\frac{\alpha(1+\beta)}{2\alpha + \gamma}}\right),$$
with upper and lower constants determined solely by $\Phi$, $\gamma$, $\tau$ and $\Gamma$.

The proof of Theorem 2 consists of a lower bound and an upper bound. The proof strategy for the lower bound is based on the proof of (Audibert et al., 2007, Theorem 3.5), where the result is proved in the special setting of binary classification with the zero-one loss $\Phi_{0,1}$ and a Lebesgue absolutely continuous marginal distribution $\mu$. In the proof of (Audibert et al., 2007, Theorem 3.5) the authors exploit the following fact: suppose that you are given one of a pair of Bernoulli measures with parameter $p = 1/2 + \Delta$ or $p = 1/2 - \Delta$ for some $\Delta \in (0, 1/2)$. If you know which of the two measures you have ($p = 1/2 + \Delta$ or $p = 1/2 - \Delta$) then you may make binary predictions in such a way as to get an error rate of $1/2 - \Delta$. However, without knowledge of which Bernoulli measure you have, your average expected error rate must be $1/2$, by symmetry. Hence, the average level of regret due to not knowing which of the two measures you have is $\Delta$. With this in mind, the structure of Euclidean space is then utilised to show that the set $[0, 1]^d$ may be broken up into $q^d$ small well-separated cubes of side length of order $q^{-1}$, for an appropriate integer $q$. Thus, one can construct a family of measures with a marginal $\mu$ supported on those cubes, and conditional equal to $p_\pm = 1/2 \pm \Delta$, with the choice of $\pm$ made independently for each small cube. The number of cubes means that a large number of training examples is required for an estimator to know the true value of the conditional on many of the small cubes. Hence, on average, an estimator based on $\mathcal{D}_n$ must have regret at least $\Delta/2$ on a large proportion of the cubes. This implies a lower bound on the generalisation error minus the Bayes error.

In the proof of the lower bound for our Theorem 2 we have two important differences requiring modifications to the proof. Firstly, we are working in a multi-class cost-sensitive scenario. Secondly, we are working on an embedded Riemannian manifold, rather than Euclidean space. To deal with the fact that our problem is multi-class and cost-sensitive we make use of Elkan's reasonableness assumption (Elkan (2001)) to show that there exist families of measures on the simplex $\Delta(\mathcal{Y})$ such that whatever class one predicts, the average expected cost one incurs, averaged over all the measures, is well separated from the expected cost one would have incurred if one knew which measure in $\Delta(\mathcal{Y})$ one had (see Lemmas A.1 and A.2). The key difficulty in the non-Euclidean setting is the construction of distributions which are both $(c_0, r_0)$-regular and satisfy the margin condition. Our construction consists of a collection of well-separated closed geodesic balls (see Lemmas A.3 and A.5).
To show that the support of the marginal is ( c , r ) -regular we exploit two key properties which follow from the assumption of bounded reach. Firstly,by results of Eftekhari and Wakin (2015) and Chazal (2013) the volume of small geodesic ballsin manifolds of bounded reach is approximately r γ (see Lemma A.6). Secondly, we have a lowerbound on the volume of the intersection of two sufficiently close geodesic balls (see Lemma A.7).We combine these properties to construct families of measures for which the average differencebetween expected risk and Bayes risk is bounded from below for all estimators. The full proof ofthe lower bound is given in Section A. The upper bound follows from Theorem 3 in Section 5 wherewe exhibit an efficient algorithm which attains the optimal rate.
5. The approximate nearest neighbour method on manifolds
In this section we shall see that the minimax rates identified in Section 4 are attained by the approximate nearest neighbour method. Given a data set $\mathcal{D}_n \sim P^n$, an approximate nearest neighbour generating process $S = \{S_k\}_{k \in \mathbb{N}}$ and a query point $x \in \mathcal{X}$, the algorithm proceeds as follows:

1. Compute a set of approximate $k$-nearest neighbours $S_k(x, \mathcal{F}_n)$;

2. Estimate $\eta(x)$ with $\hat{\eta}^S_{n,k}(x) = \frac{1}{k}\sum_{i \in S_k(x, \mathcal{F}_n)} e(Y_i)$;

3. Predict $f^S_{n,k}(x) \in \mathrm{argmin}_{y \in \mathcal{Y}}\left\{e(y)^T \Phi\, \hat{\eta}^S_{n,k}(x)\right\}$.

A minimal code sketch of steps 1–3 is given below, after we introduce two cost-matrix quantities. The following result implies that the approximate nearest neighbour method is minimax optimal. To state the result we introduce the quantities $\mathrm{Asym}(\Phi)$ and $\Lambda(\Phi)$, which depend upon the cost matrix $\Phi$. Given a cost matrix $\Phi$ we let $\mathrm{Asym}(\Phi)$ denote the asymmetry of $\Phi$, given by
$$\mathrm{Asym}(\Phi) := \max\left\{|\phi_{i_0 j} - \phi_{i_1 j}| : i_0 \neq j,\; i_1 \neq j\right\}.$$
Note that with $\Phi_{0,1}$ equal to the cost matrix corresponding to the zero-one loss we have $\mathrm{Asym}(\Phi_{0,1}) = 0$. In addition, we define $\Lambda(\Phi) := (L - 1) \cdot \mathrm{Asym}(\Phi) + 2\|\Phi\|_\infty$.
The constant $\Lambda(\Phi)$ controls the degree of dependence of the cost differentials between classes upon the marginal: the greater $\Lambda(\Phi)$ is, the greater the potential for errors in estimating $\eta$ to translate into incorrect assignments of relative cost.
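Here is a minimal sketch of steps 1–3 above, with the projected neighbour search $S^\varphi_k$ standing in for a generic approximate nearest neighbour process (our own illustrative code, not from the paper; `phi` is any random projection, such as the one sketched in Section 3, and labels are assumed 0-indexed):

```python
import numpy as np

def cost_sensitive_ann_classify(x, X_train, Y_train, Phi, k, phi):
    """Cost-sensitive approximate nearest neighbour prediction.

    Step 1: approximate k-NN via the random projection phi,
            i.e. exact k-NN in the projected space R^h.
    Step 2: estimate eta(x) by the empirical label frequencies
            over the selected neighbours.
    Step 3: predict the label minimising the estimated expected cost.
    """
    # Step 1: neighbours of phi(x) among {phi(X_i)}.
    proj_train = np.array([phi(z) for z in X_train])
    dists = np.linalg.norm(proj_train - phi(x), axis=1)
    idx = np.argsort(dists)[:k]

    # Step 2: eta_hat(x) = (1/k) * sum of one-hot encodings e(Y_i).
    L = Phi.shape[0]
    eta_hat = np.bincount(Y_train[idx], minlength=L) / k

    # Step 3: argmin_y e(y)^T Phi eta_hat(x).
    return int(np.argmin(Phi @ eta_hat))
```

With `phi = lambda z: z` this reduces to the exact cost-sensitive $k$-nearest neighbour rule (the special case $\theta = 1$ discussed after Theorem 3).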
Theorem 3

Take $d \in \mathbb{N}$ and let $\rho$ denote the Euclidean metric on $\mathbb{R}^d$. Let $\Phi$ be a cost matrix and $\mathcal{M} \subseteq \mathbb{R}^d$ a compact smooth submanifold with dimension $\gamma$ and reach $\tau$. Take positive constants $k_0, r_0, c_0, \nu_{\min}, \nu_{\max}, \zeta_{\max}, \alpha, \beta, C_\alpha, C_\beta$ and let $\Gamma = \left[(c_0, r_0, \nu_{\min}, \nu_{\max}), (\beta, C_\beta, \zeta_{\max}), (\alpha, C_\alpha)\right]$. Suppose that $S$ generates $\theta$-approximate nearest neighbours for some $\theta \geq 1$. There exists a constant $C > 0$, depending upon $k_0$, $\gamma$, $\tau$, $\Gamma$, such that for all $P \in \mathcal{P}_\Phi(V_\mathcal{M}, \Gamma)$ and $n \in \mathbb{N}$ the following holds:

(1) Given $\xi \in (0, 1)$ and $k_n = k_0 \cdot n^{\frac{2\alpha}{2\alpha + \gamma}} \cdot (1 + \log(1/\xi))^{\gamma/(2\alpha + \gamma)}$, with probability at least $1 - \xi$ over $\mathcal{D}_n \sim P^n$ we have
$$P\left[f^S_{n,k_n}(X) \notin \mathcal{Y}^*_\Phi(\eta(X))\right] \leq \xi + C \cdot \left(\theta^\alpha \cdot \Lambda(\Phi) \cdot \sqrt{\log(L)}\right)^\beta \cdot \left(\frac{1 + \log(1/\xi)}{n}\right)^{\beta\alpha/(2\alpha + \gamma)}.$$

(2) Given $k_n = k_0 \cdot n^{\frac{2\alpha}{2\alpha + \gamma}}$ we have
$$\mathbb{E}_n\left[\mathcal{R}\left(f^S_{n,k_n}\right)\right] - \mathcal{R}^* \leq C \cdot \left(\theta^\alpha \cdot \Lambda(\Phi)\right)^\beta \cdot L \cdot n^{-\frac{\alpha(1+\beta)}{2\alpha + \gamma}}.$$

Moreover, there exists an absolute constant
$K > 0$ such that whenever $\theta > 1$, given any subgaussian random projection $\varphi : \mathbb{R}^d \rightarrow \mathbb{R}^h$ with
$$h \geq K \cdot \|\varphi\|_\psi \cdot \left(\frac{\theta + 1}{\theta - 1}\right) \cdot \max\left\{\gamma \log_+(\gamma/(r_0 \cdot \tau)) - \log_+(c_0 \cdot \nu_{\min}) + \gamma,\; \log \delta^{-1}\right\},$$
with probability at least $1 - \delta$, $S(\varphi)$ generates $\theta$-approximate nearest neighbours, so both (1) and (2) hold with $f^\varphi_{n,k}$ in place of $f^S_{n,k}$.

We emphasize that the rates are uniform over all manifolds $\mathcal{M}$ of a given reach $\tau$ and intrinsic dimension $\gamma$ (for fixed $k_0$, $\Gamma$). Note that the first part of Theorem 3 includes the special case in which $\theta = 1$ and $S$ generates exact nearest neighbours. Approximate nearest neighbours and random projections are required purely for reducing computational complexity, and not for generalization performance. Note also that Theorem 3 holds for all non-negative cost matrices $\Phi$ (not necessarily reasonable).

A full proof of Theorem 3 is given in Section D. The first part of Theorem 3 follows straightforwardly from a more general result for metric spaces, which we present in Section 6. The second part of Theorem 3 follows from the first part combined with the following result on random projections.

Theorem 4
There exists an absolute constant $K$ such that the following holds. Given a compact smooth submanifold $\mathcal{M} \subseteq \mathbb{R}^d$ with dimension $\gamma$ and reach $\tau$, suppose that $A \subseteq \mathcal{M}$ is $(c_0, r_0)$-regular with respect to the Riemannian volume $V_\mathcal{M}$. Suppose that $\varphi : \mathbb{R}^d \rightarrow \mathbb{R}^h$ is a subgaussian random projection. Take $\epsilon, \delta \in (0, 1)$ and suppose that
$$h \geq K \cdot \|\varphi\|_\psi \cdot \epsilon^{-2} \cdot \max\left\{\gamma \log_+(\gamma/(r_0 \cdot \tau)) + \log_+(V_\mathcal{M}(A)/c_0) + \gamma,\; \log \delta^{-1}\right\}.$$
Then with probability at least $1 - \delta$, for all pairs $x_0, x_1 \in A$ we have
$$(1 - \epsilon) \cdot \|x_0 - x_1\| \leq \|\varphi(x_0) - \varphi(x_1)\| \leq (1 + \epsilon) \cdot \|x_0 - x_1\|.$$
Theorem 4 is a generalisation of (Dirksen, 2016, Theorem 7.9), where the result is given in the special case where $A = \mathcal{M}$. The proof is very similar, but is given in Section G for completeness. Theorem 4 is necessary for dealing with situations where we are able to bound the volume of the support of the marginal via the regularity condition (Definition 2.1), but we have no bound on the volume of the ambient manifold $\mathcal{M}$.
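As a quick empirical illustration of the bi-Lipschitz property (an illustrative experiment of our own, not from the paper), one can sample points from a one-dimensional manifold (a circle) embedded in $\mathbb{R}^d$, project, and inspect the worst-case pairwise distortion:

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, m = 100, 20, 150

# Sample m points on a circle (gamma = 1) isometrically embedded in R^d.
t = rng.uniform(0.0, 2.0 * np.pi, size=m)
basis = np.linalg.qr(rng.standard_normal((d, 2)))[0]    # orthonormal 2-frame
X = np.stack([np.cos(t), np.sin(t)], axis=1) @ basis.T  # shape (m, d)

# Gaussian subgaussian random projection phi(x) = sqrt(1/h) * V x.
V = rng.standard_normal((h, d))
PX = X @ V.T / np.sqrt(h)

# Worst-case multiplicative distortion over all distinct pairs.
orig = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
proj = np.linalg.norm(PX[:, None, :] - PX[None, :, :], axis=2)
mask = orig > 1e-9
ratio = proj[mask] / orig[mask]
print(ratio.min(), ratio.max())  # both close to 1 for moderate h
```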
6. The approximate nearest neighbour method on metric spaces
In this section we give a counterpart to Theorem 3 for arbitrary metric spaces. We first introduce the concept of measure-approximate nearest neighbours. Given $\omega \geq 1$, a family of mappings $S = \{S_k\}_{k \in \mathbb{N}}$ is said to generate $\omega$ measure-approximate nearest neighbours if, for each $x \in \mathcal{X}$, $n \in \mathbb{N}$ and $k \leq n$, $S_k(x, \mathcal{F}_n)$ is a subset of $\{1, \cdots, n\}$ with cardinality $k$ such that taking $r_0 = \max\{\rho(x, X_i) : i \in S^\circ_k(x, \mathcal{F}_n)\}$ and $r_1 = \max\{\rho(x, X_i) : i \in S_k(x, \mathcal{F}_n)\}$ implies $\mu(\bar{B}_{r_1}(x)) \leq \omega \cdot \mu(\bar{B}_{r_0}(x))$.

Theorem 5
Suppose that $P$ satisfies the margin condition with constants $(\beta, C_\beta, \zeta_{\max})$ and that the conditional $\eta$ is measure-smooth with constants $(\lambda, C_\lambda)$. Suppose that $S$ generates $\omega$ measure-approximate nearest neighbours with respect to the measure $\mu$ and take some $k_0 > 0$. There exists a constant $C > 0$, depending upon $k_0$, $\beta$, $\lambda$, $C_\beta$, $C_\lambda$, such that for all $n \in \mathbb{N}$ the following holds:

(1) Given $\xi \in (0, 1)$ and $k_n = k_0 \cdot n^{\frac{2\lambda}{2\lambda + 1}} \cdot (1 + \log(1/\xi))^{1/(2\lambda + 1)}$, with probability at least $1 - \xi$ over $\mathcal{D}_n \sim P^n$ we have
$$P\left[f^S_{n,k_n}(X) \notin \mathcal{Y}^*_\Phi(\eta(X))\right] \leq \xi + C \cdot \left(\omega^\lambda \cdot \Lambda(\Phi) \cdot \sqrt{\log(L)}\right)^\beta \cdot \left(\frac{1 + \log(1/\xi)}{n}\right)^{\beta\lambda/(2\lambda + 1)}.$$

(2) Given $k_n = k_0 \cdot n^{\frac{2\lambda}{2\lambda + 1}}$ we have
$$\mathbb{E}_n\left[\mathcal{R}\left(f^S_{n,k_n}\right)\right] - \mathcal{R}^* \leq C \cdot \left(\omega^\lambda \cdot \Lambda(\Phi)\right)^\beta \cdot L \cdot n^{-\frac{\lambda(1+\beta)}{2\lambda + 1}}.$$

Theorem 5 is an analogue of Theorem 4 in Chaudhuri and Dasgupta (2014), extended to the multi-class, cost-sensitive setting with measure-approximate nearest neighbours. A sketch of the proof is as follows:

(1) By the concentration of measure phenomenon, given a data set $\mathcal{D}_n$ of size $n$ we expect the $k$ nearest neighbours of $x$ to lie in a metric ball of probability mass roughly $k/n$, when $k$ is large. It follows that a set of $k$ $\omega$-measure-approximate nearest neighbours will, with high probability, lie in a ball of probability mass roughly $\omega \cdot k/n$. By the measure-smooth property (Definition 2.6) the conditional probability $\eta$ at those $k$ $\omega$-measure-approximate nearest neighbours will be of the order $(\omega \cdot k/n)^\lambda$ from $\eta(x)$ or less, with high probability. By the margin condition, with high probability, the margin at $x$, $M_\Phi(\eta(x))$, is large. Moreover, if $x$ has large margin then the conditional of the $k$ $\omega$-measure-approximate nearest neighbours would have to be far from $\eta(x)$ to lead to sub-optimal classifications, which is unlikely for small $(\omega \cdot k/n)^\lambda$. A more precise statement of this argument gives the conclusion that the predictions are optimal with high probability.

(2) The argument for (2) is more involved. We begin with the straightforward observation that the difference between the expected risk of the approximate nearest neighbour classifier and the Bayes risk is equal to the average differential between the cost incurred by the approximate nearest neighbour classifier and the cost incurred by the Bayes optimal classifier. Let's denote this differential by $d(x, \mathcal{D}_n)$. The idea is to slice the difference between expected risk and Bayes risk up into regions based upon the value of $d(x, \mathcal{D}_n)$. For each $j$ we consider the event $d(x, \mathcal{D}_n) \in \left(2^{j-1} \cdot \epsilon,\; 2^j \cdot \epsilon\right]$. We note the following: a) by definition $d(x, \mathcal{D}_n) \leq 2^j \cdot \epsilon$ on the $j$-th slice; b) we have
7. Discussion
In this work we determined the minimax learning rates for a natural family of distributions supported on embedded manifolds, in a cost-sensitive setting. We proved that these rates are attained by an approximate nearest neighbour algorithm, where neighbours are computed in a low-dimensional randomly projected space. We also gave a bound on the number of dimensions required for the projection to attain optimal learning rates. Our work raises many questions for future investigation, both theoretical and empirical.

Firstly, whilst we have demonstrated that Theorem 3 is optimal in the number of examples, up to a constant term, the bound depends linearly on the number of classes $L$, even in the cost-symmetric case. Whilst this is superior to the quadratic dependence of Crammer and Singer (2002), the distribution-free bounds of Kontorovich and Weiss (2014) give $O\left(\sqrt{\log(L)/n}\right)$ dependence on the number of classes. It would be interesting to see if the approach of Kontorovich and Weiss (2014) may be adapted to give a better dependency upon $n$, in the presence of the margin condition.

Secondly, as discussed in Section 2, Audibert and Tsybakov also considered higher order smoothness conditions and showed that in the presence of such conditions even faster learning rates are attainable in the Lebesgue absolutely continuous setting (see (Audibert et al., 2007, Section 2 & Section 3, Theorem 3.3)). In future work we intend to prove analogous results for manifolds, extending Theorem 2 to the setting of higher order smoothness conditions. From a more geometric perspective it would be interesting to see what bounds are possible if we relax the assumption that the data is concentrated on a manifold, and instead assume that the data lies near the manifold.

The minimax optimality of the randomly projected nearest neighbour method in the cost-sensitive setting strongly suggests the method as a simple baseline for cost-sensitive problems. Hence, it would be interesting to conduct an empirical investigation to determine how well the method compares on real-world data sets with other approaches to cost-sensitive classification (Dmochowski et al. (2010); Nikolaou et al. (2016)).
References
Dimitris Achlioptas. Database-friendly random projections: Johnson–Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.

Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science, 2006. FOCS'06. 47th Annual IEEE Symposium on, pages 459–468. IEEE, 2006.

Jean-Yves Audibert, Alexandre B. Tsybakov, et al. Fast learning rates for plug-in classifiers. The Annals of Statistics, 35(2):608–633, 2007.

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, pages 3437–3445, 2014.

Frédéric Chazal. An upper bound for the volume of geodesic balls in submanifolds of Euclidean spaces. http://geometrica.saclay.inria.fr/team/Fred.Chazal/BallVolumeJan2013.pdf, 2013.

Thomas H. Cormen. Introduction to Algorithms. MIT Press, 2009.

Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2-3):201–233, 2002.

Sjoerd Dirksen. Dimensionality reduction with subgaussian matrices: a unified theory. Foundations of Computational Mathematics, 16(5):1367–1396, 2016.

Jacek P. Dmochowski, Paul Sajda, and Lucas C. Parra. Maximum likelihood in cost-sensitive learning: model specification, approximations, and upper bounds. Journal of Machine Learning Research, 11(Dec):3313–3332, 2010.

Armin Eftekhari and Michael B. Wakin. New analysis of manifold embeddings and signal recovery from compressive measurements. Applied and Computational Harmonic Analysis, 39(1):67–109, 2015.

Charles Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 2001.

Herbert Federer. Curvature measures. Transactions of the American Mathematical Society, 93(3):418–491, 1959.

Andrew Goldberg, Xiaojin Zhu, Aarti Singh, Zhiting Xu, and Robert Nowak. Multi-manifold semi-supervised learning. In Artificial Intelligence and Statistics, pages 169–176, 2009.

Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.

William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(189-206):1, 1984.

Ata Kabán. A new look at nearest neighbours: Identifying benign input geometries via random projections. In Proceedings of The 7th Asian Conference on Machine Learning, pages 65–80, 2015.

B. Klartag and Shahar Mendelson. Empirical processes and random projections. Journal of Functional Analysis, 225(1):229–245, 2005.

Aryeh Kontorovich and Roi Weiss. Maximum margin multiclass nearest neighbors. In ICML, pages 892–900, 2014.

Samory Kpotufe. k-NN regression adapts to local intrinsic dimension. In Advances in Neural Information Processing Systems, pages 729–737, 2011.

John M. Lee. Riemannian Manifolds: An Introduction to Curvature, volume 176. Springer Science & Business Media, 1997.

Jiří Matoušek. On variants of the Johnson–Lindenstrauss lemma. Random Structures & Algorithms, 33(2):142–156, 2008.

Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

Amit Moscovich, Ariel Jaffe, and Boaz Nadler. Minimax-optimal semi-supervised regression on unknown manifolds. In Artificial Intelligence and Statistics, pages 933–942, 2017.

Nikolaos Nikolaou, Narayanan Edakunni, Meelis Kull, Peter Flach, and Gavin Brown. Cost-sensitive boosting algorithms: Do we really need them? Machine Learning, 104(2-3):359–384, 2016.

Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 39(1-3):419–441, 2008.

Mijung Park, Wittawat Jitkrittum, Ahmad Qamar, Zoltán Szabó, Lars Buesing, and Maneesh Sahani. Bayesian manifold learning: The locally linear latent variable model (LL-LVM). In Advances in Neural Information Processing Systems, pages 154–162, 2015.

Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Charles J. Stone. Consistent nonparametric regression. The Annals of Statistics, pages 595–620, 1977.

Michel Talagrand. The Generic Chaining: Upper and Lower Bounds of Stochastic Processes. Springer Science & Business Media, 2006.

Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Revised and extended from the 2004 French original, translated by Vladimir Zaiats, 2009.

Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225–1251, 2004.
Appendix A. Lower bound for cost-sensitive learning on manifolds
In order to prove Theorem 2 we first prove the following lower bound.
Proposition A.1
Suppose that $\Phi$ is a cost matrix satisfying Elkan's reasonableness assumption and $\mathcal{M} \subseteq \mathbb{R}^d$ is a compact smooth submanifold with dimension $\gamma$ and reach $\tau$. Let $\tilde{\tau} = \min\{\tau, 1\}$ and let $v_\gamma$ denote the volume of the $\gamma$-dimensional Euclidean unit ball. There exists a positive constant $Z_\Phi$, determined by $\Phi$, such that for all $c_0 \in \left(0, 2^{-\gamma}\right)$, $r_0 \in (0, \tilde{\tau}/2)$, $\nu_{\min} \in \left(0, (2/\tilde{\tau})^\gamma \cdot v_\gamma^{-1}\right)$, $\nu_{\max} \in \left((2/\tilde{\tau})^\gamma \cdot v_\gamma^{-1}, \infty\right)$, $\zeta_{\max} \in (0, Z_\Phi)$, $\alpha \in (0, 1]$, $\beta \in (0, \gamma/\alpha)$, $C_\alpha, C_\beta > 0$, if we take $\Gamma = \left[(c_0, r_0, \nu_{\min}, \nu_{\max}), (\beta, C_\beta, \zeta_{\max}), (\alpha, C_\alpha)\right]$ then there exists a constant $C$ determined solely by $\Phi$, $\gamma$, $\tau$ and $\Gamma$ such that for all estimators $\hat{f} : \mathcal{Z}^n \rightarrow \mathcal{Y}^\mathcal{X}$ and $n \in \mathbb{N}$ we have
$$\sup\left\{\mathbb{E}_{P^n}\left[\mathcal{R}\left(\hat{f}_n\right)\right] - \mathcal{R}^* : P \in \mathcal{P}_\Phi(V_\mathcal{M}, \Gamma)\right\} \geq C \cdot n^{-\frac{\alpha(1+\beta)}{2\alpha + \gamma}}.$$

The proof of Proposition A.1 requires several lemmas, so for clarity we begin with an outline. First, some notation: given a class label $y \in \mathcal{Y}$ and a probability vector $n \in \Delta(\mathcal{Y})$ we shall define
$$D_\Phi(y, n) = e(y)^T \Phi\, n - \min_{l \in \mathcal{Y}}\left\{e(l)^T \Phi\, n\right\}.$$
Hence, $D_\Phi(y, n)$ is a form of regret which quantifies how far a prediction $y$ is from the optimal, according to a distribution $n$ over $\mathcal{Y}$. The proof proceeds as follows:

Lemma A.1:
We make use of Elkan's reasonableness assumption to construct a pair of measures in the simplex $\Delta(\mathcal{Y})$ such that 1) both measures have large margin, and 2) for any predicted class label, the regret $D_\Phi(y, n)$ for at least one of the measures must be large.

Lemma A.2:
This lemma is based on Assouad's lemma (Tsybakov, 2009, Chapter 2) and is closely related to the original construction within Audibert et al. (2007). We construct a family of measures $P_\sigma$ on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ such that we can lower bound the average differential between the expected risk and the Bayes risk. This lower bound is carried out by showing that with large probability $\mu$ on $x \in \mathcal{X}$, the corresponding conditionals $\eta_\sigma(x)$ for measures $P_\sigma$ correspond to those constructed in Lemma A.1. Hence, by Lemma A.1 the average regret $D_\Phi(y, \eta_\sigma(x))$ for any prediction $y \in \mathcal{Y}$ is large. In addition it is shown that the different distributions $P_\sigma$ within the family are sufficiently similar that they cannot be effectively distinguished by any estimator $\hat{f}_n$ based on the data set $\mathcal{D}_n$.

Lemma A.3:
We apply Lemma A.1 to give suitable conditions under which the family of measures constructed in Lemma A.2 satisfies the margin condition.
Lemma A.4:
We give suitable conditions under which the family of measures constructed in Lemma A.2 satisfies the Hölder condition.
Lemma A.7:
We use the geodesic structure of the manifold to show that when the centres of two metric balls are sufficiently close, the volume of their intersection is large.
Lemma A.5:
We apply Lemma A.7 to construct $(c_0, r_0)$-regular sets $S(r)$ consisting of many small metric balls of radius $r$.
We give upper and lower bounds on the volume of the sets S ( r ) . This implies upperand lower bounds on the density of the normalised probability measure on S ( r ) . Lemma A.8:
We give a lower bound on the number of small balls in the supporting set S ( r ) .Finally, Lemmas A.1-A.8 are combined. Using Lemma A.2, a family of measures is constructedwith a lower bound on the difference between the risk of any estimator and the Bayes risk, averagedover all the measures. We then use Lemma A.3 to show that each of the measures in the family sat-isfies the margin condition and use Lemma A.4 to show that each measure has a sufficiently Holderconditional. Similarly, we use Lemmas A.5 and A.9 to show that the measure is ( c , r , ν min , ν max ) regular. Finally, Lemma A.8 is combined with other properties of the construction to show that thelower bound in Lemma A.2 implies the lower bound in Proposition A.1.Given p ∈ [0 , we let n ( p ) denote the probability vector n ( p ) = (1 − p ) · e (1) + p · e (2) . Lemma A.1
Let $\Phi$ be a cost matrix with $L \geq 2$, satisfying Elkan's reasonableness assumption. There exist constants $\kappa_\Phi \in (0, 1)$, $t_\Phi \in (0, \min\{\kappa_\Phi, 1 - \kappa_\Phi\})$ and $c_\Phi > 0$, depending solely upon $\Phi$, such that for all $\delta \in (0, t_\Phi)$ and all $y \in \mathcal{Y}$ we have
$$\sum_{\sigma \in \{-1, +1\}} D_\Phi(y, n(\kappa_\Phi + \sigma \cdot \delta)) \geq \min_{\sigma \in \{-1, +1\}}\left\{M_\Phi(n(\kappa_\Phi + \sigma \cdot \delta))\right\} \geq c_\Phi \cdot \delta.$$

Proof
We begin by defining
$$\beta_\Phi = \min\left\{\phi_{y1} - \phi_{11} : y \in \mathcal{Y} \setminus \{1\}\right\}, \qquad \kappa_\Phi = \min\left\{\frac{\phi_{y1} - \phi_{11}}{(\phi_{y1} - \phi_{11}) + (\phi_{12} - \phi_{y2})} : y \in \mathcal{Y} \setminus \{1\},\; \phi_{y2} < \phi_{12}\right\},$$
and $c_\Phi = \beta_\Phi/(2\kappa_\Phi)$. By Elkan's reasonableness assumption we have $\phi_{y1} - \phi_{11} > 0$ for all $y \in \mathcal{Y} \setminus \{1\}$, so $\beta_\Phi > 0$, and $\phi_{22} < \phi_{12}$, so $\kappa_\Phi$ is well-defined and $\kappa_\Phi \in (0, 1)$. We note that
$$\kappa_\Phi = \min\left\{\sup\left\{p \in (0, 1) : e(y)^T \Phi\, n(p) > e(1)^T \Phi\, n(p)\right\} : y \in \mathcal{Y} \setminus \{1\}\right\}.$$
Thus, for all $p < \kappa_\Phi$ we have $\mathcal{Y}^*_\Phi(n(p)) = \{1\}$ and for all $p > \kappa_\Phi$ we have $1 \notin \mathcal{Y}^*_\Phi(n(p))$. Hence, for all $\delta \in (0, \kappa_\Phi)$ we have $\mathcal{Y}^*_\Phi(n(\kappa_\Phi + \delta)) \cap \mathcal{Y}^*_\Phi(n(\kappa_\Phi - \delta)) = \emptyset$. Thus,
$$\sum_{\sigma \in \{-1, +1\}} D_\Phi(y, n(\kappa_\Phi + \sigma \cdot \delta)) \geq \min_{\sigma \in \{-1, +1\}}\left\{M_\Phi(n(\kappa_\Phi + \sigma \cdot \delta))\right\}.$$
Hence, it suffices to find some $t_\Phi \in (0, \min\{\kappa_\Phi, 1 - \kappa_\Phi\})$ and $c_\Phi > 0$ such that for all $\delta \in (0, t_\Phi)$ we have
$$\min_{\sigma \in \{-1, +1\}}\left\{M_\Phi(n(\kappa_\Phi + \sigma \cdot \delta))\right\} \geq c_\Phi \cdot \delta.$$
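For the reader's convenience (this computation is implicit in the definition above): $\kappa_\Phi$ is the smallest crossing point at which some competing label $y$ becomes at least as cheap as label $1$, since solving $e(y)^T \Phi\, n(p) = e(1)^T \Phi\, n(p)$ for $p$ gives
$$(1-p)\,\phi_{y1} + p\,\phi_{y2} = (1-p)\,\phi_{11} + p\,\phi_{12} \iff p = \frac{\phi_{y1} - \phi_{11}}{(\phi_{y1} - \phi_{11}) + (\phi_{12} - \phi_{y2})},$$
which is exactly the quantity minimised over $y$ in the definition of $\kappa_\Phi$.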
We begin by showing that whenever $\delta < \kappa_\Phi$ and $y \in \mathcal{Y} \setminus \{1\}$ we have $D_\Phi(y, n(\kappa_\Phi - \delta)) \geq (\beta_\Phi/\kappa_\Phi) \cdot \delta$. Given $y \in \mathcal{Y} \setminus \{1\}$ we have $e(y)^T \Phi\, n(0) - e(1)^T \Phi\, n(0) \geq \beta_\Phi$ and $e(y)^T \Phi\, n(\kappa_\Phi) - e(1)^T \Phi\, n(\kappa_\Phi) \geq 0$. Hence, by the mean value theorem together with the linearity of
$$p \mapsto \left(e(y)^T \Phi\, n(p) - e(1)^T \Phi\, n(p)\right) - \frac{\beta_\Phi}{\kappa_\Phi} \cdot (\kappa_\Phi - p),$$
we have
$$D_\Phi(y, n(p)) \geq e(y)^T \Phi\, n(p) - e(1)^T \Phi\, n(p) \geq \frac{\beta_\Phi}{\kappa_\Phi} \cdot (\kappa_\Phi - p).$$
Hence, $D_\Phi(y, n(\kappa_\Phi - \delta)) \geq (\beta_\Phi/\kappa_\Phi) \cdot \delta$ for all $\delta \in (0, \kappa_\Phi)$. Since this holds for all $y \in \mathcal{Y} \setminus \{1\}$ we have $M_\Phi(n(\kappa_\Phi - \delta)) \geq (\beta_\Phi/\kappa_\Phi) \cdot \delta$ for all $\delta \in (0, \kappa_\Phi)$.

Now define sets $J^*, K^*, L^* \subseteq \mathcal{Y}$ by
$$J^* = \{j \in \mathcal{Y} : (1 - \kappa_\Phi) \cdot (\phi_{j1} - \phi_{11}) + \kappa_\Phi \cdot (\phi_{j2} - \phi_{12}) = 0\},$$
$$K^* = \{j \in J^* : \phi_{j2} - \phi_{j1} \text{ is minimal}\}, \qquad L^* = \{j \in J^* \setminus K^* : \phi_{j2} - \phi_{j1} \text{ is minimal}\}.$$
By the construction of $\kappa_\Phi$, we have $K^* \neq \emptyset$ and $1 \in J^* \setminus K^*$, so $L^* \neq \emptyset$. Since $\mathcal{Y}^*_\Phi(n(p)) = \{1\}$ for all $p < \kappa_\Phi$, and for each $y \in \mathcal{Y}$, $e(y)^T \Phi\, n(p)$ is linear in $p$, we have $J^* = \mathcal{Y}^*_\Phi(n(\kappa_\Phi))$. Moreover, for each $y \in \mathcal{Y}$ we have
$$\frac{\partial\left(e(y)^T \Phi\, n(p)\right)}{\partial p} = \phi_{y2} - \phi_{y1}.$$
Thus, there exists $t_\Phi \in (0, \min\{\kappa_\Phi, 1 - \kappa_\Phi\})$ such that for all $\delta \in (0, t_\Phi)$ we have $\mathcal{Y}^*_\Phi(n(\kappa_\Phi + \delta)) = K^*$ and, if we take $l \in L^*$, then for all $y \in \mathcal{Y} \setminus \mathcal{Y}^*_\Phi(n(\kappa_\Phi + \delta))$ we have $e(l)^T \Phi\, n(\kappa_\Phi + \delta) \leq e(y)^T \Phi\, n(\kappa_\Phi + \delta)$. Choose $k^* \in K^*$ and $l^* \in L^*$. Then, given $\delta \in (0, t_\Phi)$, for all $y_0 \in \mathcal{Y}^*_\Phi(n(\kappa_\Phi + \delta)) = K^*$ and $y_1 \in \mathcal{Y} \setminus \mathcal{Y}^*_\Phi(n(\kappa_\Phi + \delta))$, we have
$$\left(e(y_1) - e(y_0)\right)^T \Phi\, n(\kappa_\Phi + \delta) \geq e(l^*)^T \Phi\, n(\kappa_\Phi + \delta) - e(k^*)^T \Phi\, n(\kappa_\Phi + \delta) = \left((\phi_{l^*2} - \phi_{l^*1}) - (\phi_{k^*2} - \phi_{k^*1})\right) \cdot \delta.$$
Thus, if we take
$$c_\Phi = \min\left\{\frac{\beta_\Phi}{\kappa_\Phi},\; (\phi_{l^*2} - \phi_{l^*1}) - (\phi_{k^*2} - \phi_{k^*1})\right\} > 0,$$
then for all $\delta \in (0, t_\Phi)$,
$$\min_{\sigma \in \{-1, +1\}}\left\{M_\Phi(n(\kappa_\Phi + \sigma \cdot \delta))\right\} \geq c_\Phi \cdot \delta.$$

Lemma A.2
Let $\kappa_\Phi \in (0, 1)$, $t_\Phi \in (0, \min\{\kappa_\Phi, 1 - \kappa_\Phi\})$ and $c_\Phi > 0$ be as in the statement of Lemma A.1. Fix a distribution $\mu$ on $\mathcal{X}$. Take $m \in \mathbb{N}$, together with positive constants $u \geq v > 0$. Suppose that we have two collections $\{A_j\}_{j=1}^m$, $\{B_j\}_{j=1}^m$, each consisting of $m$ disjoint subsets of $\mathcal{X}$, such that for each $j \in \{1, \cdots, m\}$ we have $B_j \subseteq A_j$ and $v \leq \mu(B_j) \leq \mu(A_j) \leq u$. Suppose further that for each $j \in \{1, \cdots, m\}$ there exists a function $g_j : \mathcal{X} \rightarrow [0, 1]$ with $g_j(x) = 0$ for all $x \notin A_j$ and $g_j(x) = 1$ for all $x \in B_j$. Take $\delta \in (0, t_\Phi)$ and for each $\sigma \in \{-1, 0, +1\}^m$ define $\eta_\sigma : \mathcal{X} \rightarrow \mathbb{R}^L$ by
$$\eta_\sigma(x) = n\left(\kappa_\Phi + \delta \cdot \sum_{j=1}^m \sigma_j \cdot g_j(x)\right).$$
We let $P_\sigma$ denote the distribution on $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$ with marginal $\mu$ and conditional $\eta_\sigma$. Let $\mathcal{P}$ be a set of distributions on $\mathcal{Z}$ with $\{P_\sigma : \sigma \in \{-1, +1\}^m\} \subseteq \mathcal{P}$. Given any estimator $\hat{f} : \mathcal{Z}^n \rightarrow \mathcal{Y}^\mathcal{X}$ we have
$$\sup_{P \in \mathcal{P}}\left\{\mathbb{E}_{P^n}\left[\mathcal{R}_P(\hat{f}_n)\right] - \mathcal{R}_P(f^*_P)\right\} \geq \frac{c_\Phi m v}{2} \cdot \delta \cdot \left(1 - \delta \cdot \sqrt{\frac{nu}{t_\Phi}}\right).$$

Proof
We have
$$\sup_{P \in \mathcal{P}}\left\{\mathbb{E}_{P^n}\left[\mathcal{R}_P(\hat{f}_n)\right] - \mathcal{R}_P(f^*_P)\right\} = \sup_{P \in \mathcal{P}}\left\{\mathbb{E}_{P^n}\left[\mathbb{E}_P\left[\phi_{\hat{f}_n(X), Y} - \phi_{f^*_P(X), Y}\right]\right]\right\}$$
$$\geq \frac{1}{2^m}\sum_{\sigma \in \{-1, +1\}^m} \mathbb{E}_{P^n_\sigma}\left[\mathbb{E}_{P_\sigma}\left[\phi_{\hat{f}_n(X), Y} - \phi_{f^*_{P_\sigma}(X), Y}\right]\right]$$
$$= \frac{1}{2^m}\sum_{\sigma \in \{-1, +1\}^m} \mathbb{E}_{P^n_\sigma}\left[\int \left(e\left(\hat{f}_n(x)\right) - e\left(f^*_{P_\sigma}(x)\right)\right)^T \Phi\, \eta_\sigma(x)\, d\mu(x)\right]$$
$$= \frac{1}{2^m}\sum_{\sigma \in \{-1, +1\}^m} \mathbb{E}_{P^n_\sigma}\left[\int D_\Phi\left(\hat{f}_n(x), \eta_\sigma(x)\right) d\mu(x)\right].$$
Note that $D_\Phi(y, n) \geq 0$ for all $y \in \mathcal{Y}$ and probability vectors $n$. Moreover, $\eta_\sigma(x) = n(\kappa_\Phi + \sigma_j \cdot \delta)$ for all $x \in B_j$. Hence, we have
$$\sup_{P \in \mathcal{P}}\left\{\mathbb{E}_{P^n}\left[\mathcal{R}_P(\hat{f}_n)\right] - \mathcal{R}_P(f^*_P)\right\} \geq \frac{1}{2^m}\sum_{\sigma \in \{-1, +1\}^m} \mathbb{E}_{P^n_\sigma}\left[\sum_{j=1}^m \int_{B_j} D_\Phi\left(\hat{f}_n(x), \eta_\sigma(x)\right) d\mu(x)\right]$$
$$\geq \sum_{j=1}^m \frac{1}{2^m}\sum_{\sigma \in \{-1, +1\}^m} \mathbb{E}_{P^n_\sigma}\left[\int_{B_j} D_\Phi\left(\hat{f}_n(x), n(\kappa_\Phi + \sigma_j \cdot \delta)\right) d\mu(x)\right].$$
Given $j \in \{1, \cdots, m\}$, $\sigma = (\sigma_i)_{i=1}^{m-1} \in \{-1, +1\}^{m-1}$ and $r \in \{-1, 0, +1\}$ define $(\sigma|_j r)$ by
$$(\sigma|_j r)_i = \begin{cases} \sigma_i & \text{if } i \in \{1, \cdots, j-1\} \\ r & \text{if } i = j \\ \sigma_{i-1} & \text{if } i \in \{j+1, \cdots, m\}. \end{cases}$$
From the above we have
$$\sup_{P \in \mathcal{P}}\left\{\mathbb{E}_{P^n}\left[\mathcal{R}_P(\hat{f}_n)\right] - \mathcal{R}_P(f^*_P)\right\} \geq \sum_{j=1}^m \frac{1}{2^{m-1}}\sum_{\sigma \in \{-1, +1\}^{m-1}} \frac{1}{2}\sum_{r \in \{-1, +1\}} \mathbb{E}_{P^n_{(\sigma|_j r)}}\left[\int_{B_j} D_\Phi\left(\hat{f}_n(x), n(\kappa_\Phi + r \cdot \delta)\right) d\mu(x)\right]$$
$$\geq \sum_{j=1}^m \int_{B_j} \left(\frac{1}{2}\right)^{m-1} \sum_{\sigma \in \{-1, +1\}^{m-1}} \frac{1}{2}\sum_{r \in \{-1, +1\}} \mathbb{E}_{P^n_{(\sigma|_j r)}}\left[D_\Phi\left(\hat{f}_n(x), n(\kappa_\Phi + r \cdot \delta)\right)\right] d\mu(x).$$
Hence, it suffices to fix $j \in \{1, \cdots, m\}$, $x \in B_j$, $\sigma \in \{-1, +1\}^{m-1}$ and show that
$$\frac{1}{2}\sum_{r \in \{-1, +1\}} \mathbb{E}_{P^n_{(\sigma|_j r)}}\left[D_\Phi\left(\hat{f}_n(x), n(\kappa_\Phi + r \cdot \delta)\right)\right] \geq \frac{c_\Phi}{2} \cdot \delta \cdot \left(1 - \delta \cdot \sqrt{\frac{nu}{t_\Phi}}\right).$$
For each $r \in \{-1, 0, +1\}$ we let $\pi_{\sigma, j, r}$ denote the Radon–Nikodym derivative of $P_{(\sigma|_j r)}$ with respect to $P_{(\sigma|_j 0)}$. Similarly, we let $\pi^n_{\sigma, j, r}$ denote the Radon–Nikodym derivative of $P^n_{(\sigma|_j r)}$ with respect to $P^n_{(\sigma|_j 0)}$. From the definition of $P_{(\sigma|_j r)}$ we have
$$\pi_{\sigma, j, r}((x, y)) = \begin{cases} 1 & \text{if } x \notin A_j \text{ or } y \notin \{1, 2\}, \\ 1 - r \cdot \delta \cdot g_j(x)/(1 - \kappa_\Phi) & \text{if } x \in A_j \text{ and } y = 1, \\ 1 + r \cdot \delta \cdot g_j(x)/\kappa_\Phi & \text{if } x \in A_j \text{ and } y = 2. \end{cases} \quad (1)$$
We have
$$\frac{1}{2}\sum_{r \in \{-1, +1\}} \mathbb{E}_{P^n_{(\sigma|_j r)}}\left[D_\Phi\left(\hat{f}_n(x), n(\kappa_\Phi + r \cdot \delta)\right)\right] = \frac{1}{2}\sum_{r \in \{-1, +1\}} \mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[D_\Phi\left(\hat{f}_n(x), n(\kappa_\Phi + r \cdot \delta)\right) \pi^n_{\sigma, j, r}\right]$$
$$\geq \mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\frac{1}{2}\sum_{r \in \{-1, +1\}} D_\Phi\left(\hat{f}_n(x), n(\kappa_\Phi + r \cdot \delta)\right) \cdot \min_{r \in \{-1, +1\}}\left\{\pi^n_{\sigma, j, r}\right\}\right]$$
$$\geq \frac{c_\Phi}{2} \cdot \delta \cdot \mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\min_{r \in \{-1, +1\}}\left\{\pi^n_{\sigma, j, r}\right\}\right].$$
The final inequality follows from Lemma A.1. Hence, to complete the proof of the lemma it remains to show that
$$\mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\min_{r \in \{-1, +1\}}\left\{\pi^n_{\sigma, j, r}\right\}\right] \geq 1 - \delta \cdot \sqrt{\frac{nu}{t_\Phi}}.$$
Equivalently, we must show that
$$\mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\left|\pi^n_{\sigma, j, +1} - \pi^n_{\sigma, j, -1}\right|\right] \leq 2\delta \cdot \sqrt{\frac{nu}{t_\Phi}}, \quad (2)$$
where we have used the fact that $\mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\pi^n_{\sigma, j, +1}\right] = \mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\pi^n_{\sigma, j, -1}\right] = 1$. For each $\omega = (\omega_i)_{i=1}^n \in \{0, 1\}^n$ we let
$$S_\omega := \left\{\mathcal{D}_n = (Z_i)_{i=1}^n : X_i \in A_j \text{ if and only if } \omega_i = 1\right\}.$$
Thus, we have
$$\mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\left|\pi^n_{\sigma, j, +1} - \pi^n_{\sigma, j, -1}\right|\right] = \sum_{\omega \in \{0, 1\}^n} P^n_{(\sigma|_j 0)}\left[\mathcal{D}_n \in S_\omega\right] \cdot \mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\left|\pi^n_{\sigma, j, +1} - \pi^n_{\sigma, j, -1}\right| \,\middle|\, S_\omega\right].$$
We now bound $\mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\left|\pi^n_{\sigma, j, +1} - \pi^n_{\sigma, j, -1}\right| \,\middle|\, S_\omega\right]$ for each $\omega \in \{0, 1\}^n$. We first deal with the simple case where $\sum_{i=1}^n \omega_i = 1$. Recall that $t_\Phi < \min\{\kappa_\Phi, 1 - \kappa_\Phi\}$, so applying (1) gives
$$\mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\left|\pi^n_{\sigma, j, +1} - \pi^n_{\sigma, j, -1}\right| \,\middle|\, S_\omega\right] = \mathbb{E}_{P_{(\sigma|_j 0)}}\left[\left|\pi_{\sigma, j, +1} - \pi_{\sigma, j, -1}\right| \,\middle|\, X \in A_j\right] \leq 2\delta \cdot \sqrt{\frac{\sum_{i=1}^n \omega_i}{t_\Phi}}.$$
We now bound $\mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\left|\pi^n_{\sigma, j, +1} - \pi^n_{\sigma, j, -1}\right| \,\middle|\, S_\omega\right]$ for $\omega \in \{0, 1\}^n$ with $\sum_{i=1}^n \omega_i \geq 2$. Note that for each $r \in \{-1, 0, +1\}$ we have
$$P^n_{(\sigma|_j r)}\left[\mathcal{D}_n \in S_\omega\right] = \prod_{i=1}^n \mu(A_j)^{\omega_i} \cdot (1 - \mu(A_j))^{1 - \omega_i}.$$
Hence, $\mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\pi^n_{\sigma, j, +1} \mid S_\omega\right] = \mathbb{E}_{P^n_{(\sigma|_j 0)}}\left[\pi^n_{\sigma, j, -1} \mid S_\omega\right] = 1$. Thus, by the Cauchy–Schwarz inequality we have (abbreviating $\pi^n_{\sigma, j, \pm 1}$ by $\pi^n_{\pm 1}$, with all expectations taken with respect to $P^n_{(\sigma|_j 0)}$)
$$\mathbb{E}\left[\left|\pi^n_{+1} - \pi^n_{-1}\right| \mid S_\omega\right] = \mathbb{E}\left[\left(\sqrt{\pi^n_{+1}} - \sqrt{\pi^n_{-1}}\right)\left(\sqrt{\pi^n_{+1}} + \sqrt{\pi^n_{-1}}\right) \,\middle|\, S_\omega\right]$$
$$\leq \sqrt{\mathbb{E}\left[\left(\sqrt{\pi^n_{+1}} - \sqrt{\pi^n_{-1}}\right)^2 \,\middle|\, S_\omega\right]} \cdot \sqrt{\mathbb{E}\left[\left(\sqrt{\pi^n_{+1}} + \sqrt{\pi^n_{-1}}\right)^2 \,\middle|\, S_\omega\right]}$$
$$= \sqrt{2\left(1 - \mathbb{E}\left[\sqrt{\pi^n_{+1} \cdot \pi^n_{-1}} \,\middle|\, S_\omega\right]\right)} \cdot \sqrt{2\left(1 + \mathbb{E}\left[\sqrt{\pi^n_{+1} \cdot \pi^n_{-1}} \,\middle|\, S_\omega\right]\right)}$$
$$\leq 2\sqrt{2} \cdot \sqrt{1 - \mathbb{E}\left[\sqrt{\pi^n_{+1} \cdot \pi^n_{-1}} \,\middle|\, S_\omega\right]}.$$
To bound this term we first note that for each r ∈ {− , , +1 } , P n ( σ k j r ) is a product measure. Hence, E P n ( σ k j ) hq π n σ ,j, +1 · π n σ ,j, − | S ω i = (cid:18) E P ( σ k j ) (cid:2) √ π σ ,j, +1 · π σ ,j, − | X ∈ A j (cid:3)(cid:19) P ni =1 ω i · (cid:18) E P ( σ k j ) (cid:2) √ π σ ,j, +1 · π σ ,j, − | X / ∈ A j (cid:3)(cid:19) n − P ni =1 ω i By construction, for all x / ∈ A j we have η ( σ k j +1) ( x ) = η ( σ k j +1) ( x ) = η ( σ k j +1) ( x ) , so π σ ,j, +1 ( x ) = π σ ,j, − ( x ) = π σ ,j, ( x ) . Hence, E P ( σ k j ) (cid:2) √ π σ ,j, +1 · π σ ,j, − | X / ∈ A j (cid:3) = E P ( σ k j ) [ π σ ,j, | X / ∈ A j ] = 1 . Consequently, we have E P n ( σ k j ) hq π n σ ,j, +1 · π n σ ,j, − | S ω i = (cid:18) E P ( σ k j ) (cid:2) √ π σ ,j, +1 · π σ ,j, − | X ∈ A j (cid:3)(cid:19) P ni =1 ω i Moreover, by (1) we see that for all ( x, y ) ∈ X × Y , π σ ,j, +1 · π σ ,j, − ≥ − r · δ · g j ( x ) t Φ ≥ − δ t Φ . Combining these inequalities we have, E P n ( σ k j ) (cid:2)(cid:12)(cid:12) π n σ ,j, +1 − π n σ ,j, − (cid:12)(cid:12) | S ω (cid:3) ≤ √ · s − (cid:18) − δ t Φ (cid:19) · P ni =1 ω i . Note that for any l ≥ and x ≥ we have (cid:16) − (cid:0) − x (cid:1) l (cid:17) ≤ lx . Hence, we have E P n ( σ k j ) (cid:2)(cid:12)(cid:12) π n σ ,j, +1 − π n σ ,j, − (cid:12)(cid:12) | S ω (cid:3) ≤ δ · s P ni =1 ω i t Φ . It follows that E P n ( σ k j ) h(cid:12)(cid:12)(cid:12) π n σ ,j, +1 − π n σ ,j, − (cid:12)(cid:12)(cid:12)i = X ω ∈{ , } n P n ( σ k j [ D n ∈ S ω ] · E P n ( σ k j ) (cid:2)(cid:12)(cid:12) π n σ ,j, +1 − π n σ ,j, − (cid:12)(cid:12) | S ω (cid:3) ≤ δ √ t Φ · X ω ∈{ , } n n Y i =1 µ ( A j ) ω i · (1 − µ ( A j )) − ω i · vuut n X i =1 ω i ≤ δ √ t Φ · vuut X ω ∈{ , } n n Y i =1 µ ( A j ) ω i · (1 − µ ( A j )) − ω i · n X i =1 ω i = 2 δ √ t Φ · q n · µ ( A j ) ≤ δ · r nut Φ . This completes the proof of the lemma. INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Lemma A.3
Suppose that Φ satisfies Elkan’s reasonableness assumption and let c Φ , κ Φ , t Φ beas in the statement of Lemma A.1. Take m ∈ N and u > and suppose that there are disjointsets { A j } mj =1 such that for each j ∈ { , · · · , m } there exists B j ⊂ A j with µ ( B j ) ≤ u and µ ( A j \ B j ) = 0 along with a function g j : X → [0 , with g j ( x ) = 0 for all x / ∈ A j and g j ( x ) = 1 for all x ∈ B j . We also take σ ∈ {− , , +1 } m along with η σ and P σ as in the statement ofLemma A.2. Suppose we have constants C β > , β > , ζ max ∈ (0 , M Φ ( n ( κ Φ ))) , and δ ∈ (0 , t Φ ) satisfying m · u ≤ C β · ( c Φ · δ ) β . Then the measure P σ satisfies the margin condition with constants ( β, C β , ζ max ) . Proof
First note that by the construction of η σ (see Lemma A.2), combined with the fact that if x ∈ B j x we have g j x ( x ) = 1 and g j ( x ) = 0 for j = j x , so η σ ( x ) = n κ Φ + δ · m X j =1 σ j · g j ( x ) = n ( κ Φ + δ · σ j x ) . Hence, by Lemma A.1 we have M Φ ( η σ ( x )) ≥ c Φ · δ . On the other hand, if x ∈ X \ S mj =1 A j then g j ( x ) = 0 for all j , so η σ ( x ) = n ( κ Φ ) , so M ( η σ ( x )) > ζ max .We now fix ζ < ζ max < M Φ ( n ( κ Φ )) . We must show that, µ ( { x ∈ X : M Φ ( η ( x )) ≤ ζ } ) ≤ C β · ζ β . Now if ζ < c Φ · δ then µ ( { x ∈ X : M Φ ( η ( x )) ≤ ζ } ) ≤ µ m [ j =1 A j \ B j = 0 . On the other hand, if ζ ∈ ( c Φ · δ, ζ max ) , then µ ( { x ∈ X : M Φ ( η ( x )) ≤ ζ } ) ≤ µ m [ j =1 A j ≤ m · u ≤ C β · ( c Φ δ ) β ≤ C β · ζ β . Lemma A.4
Suppose that W = { w j } mj =1 is a r -separated set for some r > . There exist functions { g j } mj =1 such that for each j ∈ { , · · · , m } , g j ( x ) = 0 for all x / ∈ B r/ ( w j ) , g j ( x ) = 1 for all x ∈ B r/ ( w j ) and for any C α > , α ∈ (0 , , δ ∈ (0 , ( C α / · r α ] and σ ∈ {− , , +1 } m , thefunction η σ : X → R L × by η σ ( x ) = n κ Φ + δ · m X j =1 σ j · g j ( x ) , is H¨older continuous with constants ( α, C α ) . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Proof
Firstly, we define a function u : [0 , → [0 , by u ( t ) = for t ∈ [0 , / − t for t ∈ [1 / , / for t ≥ / . For each j ∈ { , · · · , m } we let g j ( x ) = u ((2 /r ) · ρ ( x, w j )) . Clearly if x / ∈ B r/ ( w j ) then (2 /r ) · ρ ( x, w j ) ≥ / so g j ( x ) = 0 . On the other hand, if x ∈ B r/ ( w j ) then (2 /r ) · ρ ( x, w j ) < / , so g j ( x ) = 1 . We now fix δ ∈ (0 , ( C α / · r α ] and σ ∈ {− , , +1 } m and show that η σ isH¨older continuous with constants ( α, C α ) . By the definition of n it suffices to show that ϕ σ ( x ) = κ Φ + δ · m X j =1 σ j · g j ( x ) is H¨older continuous with constants ( α, C α ) . Since g j ( x ) = 0 for all x / ∈ B r/ ( w j ) and { w j } mj =1 is an r -separated set, we have ϕ σ ( x ) = δ · σ j x · u ((2 /r ) · ρ ( x, w j x )) whenever x ∈ B r/ ( w j x ) for some j x ∈ { , · · · , m } , and ϕ ( x ) = 0 if x ∈ X \ S mj =1 B r/ ( w j ) . Now take x , x ∈ X .If ϕ σ ( x ) = ϕ σ ( x ) = 0 then k ϕ σ ( x ) − ϕ σ ( x ) k ≤ C α · ρ ( x , x ) α holds trivially. Nowsuppose that ϕ σ ( x ) = 0 or ϕ σ ( x ) = 0 . Without loss of generality we assume that ϕ σ ( x ) = 0 .Hence, for some j ∈ { , · · · , m } we have x ∈ B r/ ( w j ) . Now either x ∈ B r/ ( w j ) or x / ∈ B r/ ( w j ) . If x ∈ B r/ ( w j ) then we have ϕ σ ( x ) = δ · σ j · u ((2 /r ) · ρ ( x , w j )) and ϕ σ ( x ) = δ · σ j · u ((2 /r ) · ρ ( x , w j )) . Moreover, | ρ ( x , w j ) − ρ ( x , w j ) | ≤ ρ ( x , x ) , so | u ((2 /r ) · ρ ( x , w j )) − u ((2 /r ) · ρ ( x , w j )) | ≤ (6 /r ) · ρ ( x , x ) . Hence, k ϕ σ ( x ) − ϕ σ ( x ) k ≤ δ · (6 /r ) · ρ ( x , x ) ≤ C α · r α − · ρ ( x , x ) ≤ C α · ρ ( x , x ) α , since α ≤ and ρ ( x , x ) ≤ r .On the other hand, if x / ∈ B r/ ( w j ) then ρ ( x , x ) ≥ ρ ( x , w j ) − ρ ( x , w j ) ≥ r/ whilst k ϕ σ ( x ) − ϕ σ ( x ) k ≤ · δ ≤ · (( C α / · r α ) ≤ C α · ρ ( x , x ) α . Given a smooth manifold x , x ∈ M we let ρ g ( x , x ) denote the geodesic distance andfor r > we let B gr ( x ) denote the geodesic metric ball of radius r . Given a set A ⊂ M and r > , an r -separated subset of A is a set W r = { w j } Qj =1 ⊆ A such that for j = j , ρ ( w j , w j ) k w j , w j k > r . A maximal r -separated subset is an r -separated subset of maximalcardinality. Note that r -separation is with respect to the Euclidean metric, rather than the geodesicmetric. Lemma A.5
Suppose
M ⊆ R d is a compact smooth submanifold with dimension γ and reach τ .Fix x ∗ ∈ M , r ∗ > . For each r ∈ (0 , r ∗ ) we construct S ( r ) by taking a maximal r -separatedsubset of B gr ∗ ( x ∗ ) , W r = { w j } Q ( r ) j =1 , and defining S ( r ) := Q ( r ) [ j =1 B gr/ ( w j ) , For all r ∈ (0 , r ∗ ) , S ( r ) is a (cid:0) − γ , min { r ∗ , τ / } (cid:1) regular set. INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
To prove lemma A.5 we shall utilise the following geometric lemmas, the proof of which isgiven in appendix F. Let v γ denote the volume of the γ -dimensional Euclidean unit ball. Lemma A.6
Let
M ⊆ R d be a compact smooth submanifold with dimension γ , reach τ andRiemannian volume form V M . Then for all x ∈ M and r < τ / we have − γ · v γ · r γ ≤ V M ( B gr ( x )) ≤ V M ( B r ( x )) ≤ γ · v γ · r γ . Lemma A.7
With the assumptions of lemma A.6, for all x, ˜ x ∈ M and ˜ r ≤ r < τ / with ρ g ( x, ˜ x ) ≤ r + ˜ r/ we have V M (cid:0) B gr ( x ) ∩ B g ˜ r (˜ x ) (cid:1) ≥ − γ · v γ · ˜ r γ . Proof [Proof of Lemma A.5] Fix r ∈ (0 , r ∗ ) and take ˜ x ∈ S ( r ) and ˜ r ∈ (0 , min { r ∗ , τ / } ) . Weconsider two cases.Case 1: ˜ r ≥ r/ . Let J (˜ x, ˜ r ) = n j ∈ { , · · · , Q ( r ) } : B r ( w j ) ∩ B ˜ r/ (˜ x ) = ∅ o . Given j ∈ J (˜ x, ˜ r ) , we have k ˜ x − w j k < r + ˜ r/ . Thus, if z ∈ B gr/ ( w j ) then k z − ˜ x k ≤ ρ g ( z, w j ) + k ˜ x − w j k < r/ r/ ≤ ˜ r . Thus, S ( r ) ∩ B ˜ r (˜ x ) ⊇ [ j ∈J (˜ x, ˜ r ) B gr/ ( w j ) . Since W r is r -separated, for j = j we have ρ g ( w j , w j ) ≥ k w j − w j k ≥ r , so B gr/ ( w j ) ∩ B gr/ ( w j ) = ∅ . Hence, we may apply Lemma A.6 to obtain V M ( S ( r ) ∩ B ˜ r (˜ x )) ≥ X j ∈J (˜ x, ˜ r ) V M (cid:16) B gr/ ( w j ) (cid:17) ≥ J (˜ x, ˜ r ) · − γ · v γ · ( r/ γ . Now we shall give a lower bound on J (˜ x, ˜ r ) . First note that since ˜ x ∈ S ( r ) and W r ⊂ B gr ∗ ( x ∗ ) we have ρ g ( x, ˜ x ) ≤ r ∗ + r/ < r ∗ + ˜ r/ . Moreover, ˜ r/ ∈ (0 , min { r ∗ , τ / } ) , so by Lemma A.7we have V M (cid:0) B gr ∗ ( x ∗ ) ∩ B ˜ r/ (˜ x ) (cid:1) ≥ − γ · v γ · ˜ r γ . By the maximality of W r we have B gr ∗ ( x ∗ ) ⊆ S Q ( r ) j =1 B r ( w j ) , so B gr ∗ ( x ∗ ) ∩ B ˜ r/ (˜ x ) ⊆ S j ∈J (˜ x, ˜ r ) B r ( w j ) .Hence, − γ · v γ · ˜ r γ ≤ V M (cid:0) B gr ∗ ( x ∗ ) ∩ B ˜ r/ (˜ x ) (cid:1) ≤ X j ∈J (˜ x, ˜ r ) V M (cid:16) B r ( w j ) (cid:17) ≤ J (˜ x, ˜ r ) · γ · v γ · r γ . Thus, J (˜ x, ˜ r ) ≥ − γ · ( r/ ˜ r ) γ and V M ( S ( r ) ∩ B ˜ r (˜ x )) ≥ − γ · v γ · ˜ r γ ≥ − γ · V M ( B ˜ r (˜ x )) . Case 2: ˜ r < r/ . Since ˜ x ∈ S ( r ) we may take j ˜ x ∈ { , · · · , Q ( r ) } so that ˜ x ∈ B gr/ ( w j ˜ x ) . ByLemma A.7 we have V M ( S ( r ) ∩ B ˜ r (˜ x )) ≥ V M (cid:16) B gr/ ( w j ˜ x ) ∩ B ˜ r/ (˜ x ) (cid:17) ≥ − γ · v γ · ˜ r γ ≥ − γ · V M ( B ˜ r (˜ x )) . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Lemma A.8
Fix x ∗ ∈ M , let ˜ τ = min { τ, } and take r ∗ = ˜ τ / . Choose r < r ∗ and let S ( r ) and Q ( r ) by as in the statement of Lemma A.5. Then Q ( r ) ≥ (cid:0) − · ˜ τ (cid:1) γ · r − γ . Proof
Note that W r = { w j } Q ( r ) j =1 is a maximal ( ρ g , r ) -separated subset of B gr ∗ ( x ∗ ) . Hence, B gr ∗ ( x ∗ ) ⊆ S j =1 B gr ( w j ) . Thus, − γ · v γ · ˜ τ γ ≤ V M (cid:16) B g ˜ τ/ ( x ∗ ) (cid:17) ≤ Q ( r ) X j =1 V M (cid:16) B r ( w j ) (cid:17) ≤ Q ( r ) · γ · v γ · r γ . Lemma A.9
Let S ( r ) be as in the statement of Lemmas A.5 and A.8 with r ∗ = ˜ τ / . Then (3 − · − · ˜ τ ) γ · v γ ≤ V M ( S ( r )) ≤ v γ · (˜ τ / γ . Proof
Since r ∈ (0 , r ∗ ) we must have S ( r ) ⊂ B r ∗ ( x ∗ ) , so it follows from Lemma A.6 that V M ( S ( r )) ≤ γ · v γ · (2 r ∗ ) γ ≤ v γ · (˜ τ / γ . On the other hand, since W r is r -separated the balls n B gr/ ( w j ) o Q ( r ) j =1 are disjoint. Hence, combining Lemmas A.6 and A.8 we have V M ( S ( r )) ≥ Q ( r ) X j =1 V M (cid:16) B gr/ ( w j ) (cid:17) ≥ Q ( r ) · · − γ · v γ · r γ ≥ (3 − · − · ˜ τ ) γ · v γ . We are now well placed to complete the proof of Proposition A.1.
Proof [Proof of Proposition A.1]We take κ Φ as in the statement of Lemma A.1 and define Z Φ = M Φ ( n ( κ Φ )) > . Take c ∈ (cid:0) , − γ (cid:1) , r ∈ (0 , ˜ τ / , ν min ∈ (cid:0) , (2 / ˜ τ ) γ · v − γ (cid:1) , ν max ∈ (cid:0) (2 / ˜ τ ) γ · v − γ , ∞ (cid:1) , ζ max ∈ (0 , Z Φ ) , α ∈ (0 , , β ∈ (0 , γ/α ) , C α , C β > and let Γ = h ( r , c , ν min , ν max ) , ( β, C β , ζ max ) , ( α, C α ) i . For each r ∈ (0 , r ) we shall construct an associated set of probability measures P ( r ) ⊂ P ( M , Γ) as follows. Fix some r ∈ (0 , r ) . We begin by constructing µ , which will be common to all P ∈ P ( r ) . To do so we take S ( r ) as in the statement of Lemma A.5. That is, we fix some x ∗ ∈ M and construct S ( r ) by taking W r = { w j } Q ( r ) j =1 to be a maximal r -separated subset of B gr ∗ ( x ∗ ) anddefine, S ( r ) = Q ( r ) [ j =1 B gr/ ( w j ) . Let ν ∗ ( r ) = ( V M ( S ( r ))) − . We let µ be a probability measure which is absolutely continuouswith respect to V M and has density ν ( x ) = ( ν ∗ ( r ) if x ∈ S ( r )0 if x / ∈ S ( r ) . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Clearly, supp ( µ ) = S ( r ) and by Lemma A.5 the set S ( r ) is ( c , r ) regular. Moreover, by LemmaA.9 we have ν min < (2 / ˜ τ ) γ · v − γ ≤ ν ∗ ( r ) ≤ (2 / ˜ τ ) γ · v − γ < ν max . Hence, the measure µ is ( c , r , ν min , ν max ) regular. For each j = { , · · · , Q ( r ) } we let B j = B gr/ ( w j ) and A j = B r/ ( w j ) . Since W r is r -separated, the balls A j are disjoint. Hence, µ ( A j \ B j ) = 0 , since µ is absolutely continuous and supported on S ( r ) . In addition, by LemmaA.6 we have v ( r ) ≤ µ ( B j ) ≤ µ ( A j ) ≤ u ( r ) with v ( r ) = ν ∗ ( r ) · − γ · v γ · ( r/ γ and u ( r ) = ν ∗ ( r ) · γ · ν γ · ( r/ γ . Take t Φ as in the statement of Lemma A.1 and let δ ( r ) = min { t Φ / , ( C α / · r α } .In addition, we take m ( r ) = min { Q ( r ) , b ( C β · c Φ ) β · u ( r ) − · δ ( r ) β c} . By Lemma A.8 we have Q ( r ) ≥ (cid:0) − · ˜ τ (cid:1) γ · r − γ . Hence, there exists constants R (Γ) , m (Γ) > such that for all r < R (Γ) , δ ( r ) = ( C α / · r α and m ( r ) ≥ m (Γ) · r α · β − γ . By Lemma A.4 there exists functions { g j } mj =1 so that g j ( x ) = 1 for all x ∈ B j , g j ( x ) = 0 for all x / ∈ A j and σ ∈ {− , , +1 } m , the function η σ : X → R L × by η σ ( x ) = n κ Φ + δ · m X j =1 σ j · g j ( x ) , is H¨older continuous with constants ( α, C α ) . For each σ ∈ {− , , +1 } m we let P σ denote themeasure on Z = X × Y formed by taking µ to be the marginal distribution over X and η σ to be theconditional distribution of Y given x ∈ X . Since δ ( r ) ∈ (0 , t Φ ) and m ( r ) · u ( r ) ≤ C β · ( c Φ · δ ( r )) β and ζ max ∈ (0 , Z Φ ) , it follows from Lemma A.3 that each P σ satisfies the margin condition withconstants ( β, C β , ζ max ) . We let P ( r ) := { P σ : σ ∈ {− , +1 } m } . We have shown that µ is ( c , r , ν min , ν max ) regular each η σ is H¨older continuous with constants ( α, C α ) and each P σ satisfies the margin condition with constants ( β, C β , ζ max ) . Thus, P ( r ) ⊂ P ∈ P Φ ( V M , Γ) . Hence, by Lemma A.2, for all classifiers ˆ f and all n ∈ N we have sup P ∈P Φ ( V M , Γ) n E P n h R P ( ˆ f n ) i − R P ( f ∗ P ) o ≥ (2 c Φ · m ( r ) · v ( r )) · δ ( r ) · − δ ( r ) · s n · u ( r ) t Φ . It follows from the construction of δ ( r ) , u ( r ) , v ( r ) , m ( r ) and Lemma A.9 we see that there exists R , C , C > , depending purely upon Φ , γ , τ and Γ , such that for all r < R , δ ( r ) · u ( r ) · t − < C · r γ +2 α c Φ · m ( r ) · v ( r ) · δ ( r ) > C · r ( αβ − γ )+ γ + α = C · r α ( β +1) . Thus, if we take r = min { R , (16 C ) − } · n − α + γ we have δ ( r ) · p n · u ( r ) · t Φ − < / . Hence, sup P ∈P Φ ( V M , Γ) n E P n h R P ( ˆ f n ) i − R P ( f ∗ P ) o ≥ C · (cid:0) min { R , (16 C ) − } (cid:1) α ( β +1) · n − α ( β +1)2 α + γ . This completes the proof of Proposition A.1. INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Appendix B. Bounding the probability of mis-classification
In this section we prove Proposition B.1 which forms the first part of Theorem 5.
Proposition B.1
Suppose that P satisfies the margin condition with constants ( β, C β , ζ max ) andthat the conditional η is measure-smooth, with constants ( λ, C λ ) . Suppose that generates ω measure-approximate nearest neighbours with respect to the measure µ . Take k > and for each n ∈ N take k n = k · n λ λ +1 · (1 + log(1 /δ )) / (2 λ +1) . There exists a constant C > depending purely upon C λ , λ, C β , β, k such that for all n ∈ N , with probability at least − δ over D n ∼ P n we have P (cid:2) f Sn,k ( X ) / ∈ Y ∗ Φ ( η ( X )) (cid:3) ≤ δ + C · (cid:16) ω λ · Λ(Φ) · p log( L ) (cid:17) β · (cid:18) /δ ) n (cid:19) βλ/ (2 λ +1) . To prove Proposition B.1 we first introduce some notation before giving some preliminary lem-mas. We define r p ( x ) = inf { r > µ ( B r ( x )) ≥ p } . We make use of the following standardresults. Lemma B.1
Take x ∈ X . Suppose that p ∈ [0 , , ξ ≤ , k ≤ (1 − ξ ) np . Then P n [max { ρ ( x, X i ) : i ∈ S ◦ k ( x, F n ) } > r p ( x )] ≤ exp( − kξ / . Lemma B.2
Take x ∈ X . For each δ > we have P n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ η Sn,k ( x ) − k X i ∈ S k ( x, F n ) η ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≥ δ |F n ≤ L exp( − kδ ) . The proof of lemmas B.1 and B.2 is given in Appendix E. Recall that we defined
Λ(Φ) by Λ(Φ) := ( L − · Asym (Φ) + 2 k Φ k ∞ . Lemma B.3
Given y , y ∈ Y and n , n ∈ ∆( Y ) we have k ( e ( y ) − e ( y )) T Φ ( n − n ) k ≤ Λ(Φ) · k n − n k ∞ . Proof
This follows immediately from the definitions.Given p ∈ (0 , , and ∆ > we define X p, ∆ = n x ∈ X : ∀ ˜ x ∈ B r p ( x ) ( x ) , y ∈ Y ∗ Φ ( η ( x )) , y ∈ Y\Y ∗ Φ ( η ( x )) , ( e ( y ) − e ( y )) T Φ η (˜ x ) ≥ Λ(Φ) · ∆ o , and let ∂ θp, ∆ = X \X θp, ∆ . Lemma B.4
Suppose that k < n and S generates ω measure approximate nearest neighbours.Take p ∈ (0 , and ∆ > and suppose that x ∈ X p, ∆ satisfies both INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS max { ρ ( x, X i ) : i ∈ S ◦ k ( x, F n ) } ≤ r p/ω ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ η Sn,k ( x ) − k P i ∈ S k ( x, F n ) η ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ < ∆ .Then f Sn,k n ( x ) ∈ Y ∗ Φ ( η ( x )) . Proof
Since max { ρ ( x, X i ) : i ∈ S ◦ k ( x, F n ) } ≤ r p/ω ( x ) and S generates ω measure-approximatenearest neighbours we have max { ρ ( x, X i ) : i ∈ S k ( x, F n ) } ≤ r p ( x ) . Since x ∈ X θp, ∆ for any y ∈ Y ∗ Φ ( η ( x )) , y ∈ Y\Y ∗ Φ ( η ( x )) we have ( e ( y ) − e ( y )) T Φ η ( X i ) ≥ Λ(Φ) · ∆ for all i ∈ S k ( x, F n ) , so ( e ( y ) − e ( y )) T Φ k X i ∈ S k ( x, F n ) η ( X i ) ≥ Λ(Φ) · ∆ . Moreover, since (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ η Sn,k ( x ) − k P i ∈ S k ( x, F n ) η ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ < ∆ , by Lemma B.3, this implies that for all y ∈ Y ∗ Φ ( η ( x )) and y ∈ Y\Y ∗ Φ ( η ( x )) we have ( e ( y )) T Φ ˆ η Sn,k ( x ) < ( e ( y )) T Φ ˆ η Sn,k ( x ) . Thus, Y ∗ Φ ( η ( x )) = Y ∗ Φ (cid:16) ˆ η Sn,k ( x ) (cid:17) . Hence, f Sn,k n ( x ) ∈ Y ∗ Φ ( η ( x )) . Lemma B.5
Take δ ∈ (0 , , k · ω < n and suppose that S generates ω measure approximate k -nearest neighbours. With probability at least − δ over D n ∼ P n we have P (cid:2) f Sn,k ( X ) / ∈ Y ∗ Φ ( η ( X )) (cid:3) ≤ δ + µ ( ∂ p, ∆ ) , where p = kn · ω − p (2 /k ) log(2 /δ ) and ∆ = r k log 4 Lδ . Proof
Given D n ∼ P n we define A ( D n ) = n x ∈ X p, ∆ : f Sn,k ( X ) / ∈ Y ∗ Φ ( η ( X )) o . To prove thelemma it suffices to show that with probability at least − δ over D n ∼ P n we have P [ A ( D n )] ≤ δ .This follows from the definition of ∂ p, ∆ as X \X p, ∆ . Now by Lemma B.4 we have A ( D n ) ⊆ (cid:8) x ∈ X : max { ρ ( x, X i ) : i ∈ S ◦ k ( x, F n ) } > r p/ω ( x ) (cid:9) ∪ x ∈ X : (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ η Sn,k ( x ) − k X i ∈ S k ( x, F n ) η ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≥ ∆ . Now take x ∈ X . By Lemma B.1 with ξ = 1 − ( kω ) / ( np ) we have P n (cid:2) max { ρ ( x, X i ) : i ∈ S ◦ k ( x, F n ) } > r p/ω ( x ) (cid:3) ≤ exp( − kξ /
2) = δ . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
By Lemma B.2 we have P n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ η Sn,k ( x ) − k X i ∈ S k ( x, F n ) η ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≥ ∆ ≤ L exp( − k ∆ ) = δ . Hence, P n [ x ∈ A ( D n )] ≤ δ . Since this holds for all x ∈ X , by Fubini’s theorem we have E n [ P [ A ( D n )]] = E n (cid:2) E (cid:2) X ∈ A ( D n ) (cid:3)(cid:3) = E (cid:2) E n (cid:2) X ∈ A ( D n ) (cid:3)(cid:3) = E [ P n [ X ∈ A ( D n )]] ≤ δ . Thus, by Markov’s inequality P n [ P [ A ( D n )] > δ ] ≤ δ . Thus, with probability at least − δ over D n ∼ P n we have P [ A ( D n )] ≤ δ and the lemma holds. Lemma B.6
Suppose that P satisfies the margin condition with constants ( β, C β , ζ max ) and thatthe conditional η is measure-smooth, with constants ( λ, C λ ) . Given any ∆ > and p ∈ (0 , satisfying Λ(Φ) · (cid:0) ∆ + C λ · p λ (cid:1) < ζ max we have µ ( ∂ p, ∆ ) ≤ C β · (cid:16) Λ(Φ) · (cid:16) ∆ + C λ · p λ (cid:17)(cid:17) β . Proof
Suppose that x ∈ ∂ p, ∆ . Then there exists some ˜ x ∈ X with ρ (˜ x, x ) ≤ r p ( x ) , some y ∈Y ∗ Φ ( η ( x )) and some y ∈ Y\Y ∗ Φ ( η ( x )) such that ( e ( y ) − e ( y )) T Φ η (˜ x ) < Λ(Φ) · ∆ . Since η ismeasure smooth with constants ( λ, C λ ) we have, k η (˜ x ) − η ( x ) k ∞ ≤ C λ · µ (cid:0) B ρ (˜ x,x ) ( x ) (cid:1) λ ≤ C λ · µ (cid:0) B r p ( x ) ( x ) (cid:1) λ ≤ C λ · p λ . By Lemma B.3 this implies that M Φ ( η ( x )) = ( e ( y ) − e ( y )) T Φ η ( x ) < Λ(Φ) · (cid:16) ∆ + C λ · p λ (cid:17) . Hence, µ ( ∂ p, ∆ ) ≤ µ (cid:16)n x ∈ X : M Φ ( η ( x )) < Λ(Φ) · (cid:16) ∆ + C λ · p λ (cid:17)o(cid:17) ≤ C β · (cid:16) Λ(Φ) · (cid:16) ∆ + C λ · p λ (cid:17)(cid:17) β . Proof [Proof of Proposition B.1] First note that without loss of generality we may assume that ζ max = ∞ . Indeed if P satisfies the margin condition with constants ( β, C β , ζ max ) then P alsosatisfies the margin condition with constants ( ˜ C β , β, ∞ ) , where ˜ C β = max { C β , ζ − β max } .To complete the proof, for each n ∈ N , we take k n = k · n λ λ +1 (1 + log(1 /δ )) / (2 λ +1) p n = ( k n ω ) / (cid:16) n (cid:16) − p (2 /k n ) log(2 /δ ) (cid:17)(cid:17) ∆ n = p (1 / k n ) log (4 L/δ ) , INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS and apply lemmas B.5 and B.6. Indeed, suppose that n ≥ (16 /k ) (2 λ +1) / (2 λ ) (1 + log(1 /δ )) . Itfollows that p (2 /k n ) log(2 /δ ) < / , so p n < (2 ω ) · k n n = (2 k ω ) · (cid:18) /δ ) n (cid:19) / (2 λ +1) . In addition, for some constant c k , depending upon k , we have ∆ n ≤ c k · p log( L ) · (cid:18) /δ ) n (cid:19) λ/ (2 λ +1) . Moreover, by lemmas B.5 and B.6, we have P (cid:2) f Sn,k ( X ) / ∈ Y ∗ Φ ( η ( X )) (cid:3) ≤ δ + µ ( ∂ p, ∆ ) ≤ δ + C β · (cid:16) Λ(Φ) · (cid:16) ∆ n + C λ · p λn (cid:17)(cid:17) β ≤ C · (cid:16) ω λ · Λ(Φ) · p log( L ) (cid:17) β · (cid:18) /δ ) n (cid:19) βλ/ (2 λ +1) , where C is a constant depending purely upon k , β, λ, ζ max , C β , C λ . By increasing the constant,depending upon k , β, λ , the bound also holds for n < (16 /k ) (2 λ +1) / (2 λ ) (1 + log(1 /δ )) . Appendix C. Bounding the expected risk
In this section we prove Proposition C.1 which forms the second part of Theorem 5.
Proposition C.1
Suppose that P satisfies the margin condition with constants ( C β , β, ζ max ) andthat the conditional η is measure-smooth, with constants ( λ, C λ ) . Suppose that generates ω measure-approximate nearest neighbours with respect to the measure µ . Take k > and for each n ∈ N take k n = k · n λ λ +1 . There exists a constant C > depending purely upon C λ , λ, C β , β, k suchthat for all n ∈ N we have E n (cid:2) R (cid:0) f Sn,k n (cid:1)(cid:3) − R ∗ ≤ C · L · (cid:16) ω λ · Λ(Φ) (cid:17) β · n − λ (1+ β )2 λ +1 . Proposition C.1 follows from Propositions C.2 and C.3.
Proposition C.2
Take a probability distribution P on Z = X × Y determined by a marginaldistribution µ and conditional probability η . Suppose that for each n ∈ N , ˆ η n : X → R L is anestimator of η determined by D n ∼ P n . Suppose further that there exists constants C , C > , N ∈ N and some decreasing positive sequence ( a n ) n ≥ N such that for each n ≥ N , µ almostevery x ∈ X there exists a set A n ( x ) ⊆ Z n such that P n [ A n ( x )] ≤ a βn and for all ξ ≥ a n wehave P n [ k ˆ η n ( x ) − η ( x ) k ∞ > ξ |D n / ∈ A n ( x )] ≤ C · exp − C · (cid:18) ξa n (cid:19) ! . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Suppose for each n ∈ N we construct a classifier ˆ f n : X → Y , based on ˆ η n : X → R L and definedby ˆ f n ( x ) = min ( Y ∗ Φ (ˆ η n ( x ))) . If P satisfies the margin condition with constants ( C β , β, ζ max ) thenthere exists C = C ( C , C β , β, ζ max ) > , which is monotonically increasing with β , such that forall n ≥ N we have E n h R (cid:16) ˆ f n (cid:17)i − R ∗ ≤ C · (1 + C ) · (Λ(Φ) · a n ) β . Proposition C.3
Suppose that the conditional η is measure-smooth, with constants ( λ, C λ ) , andthat S generates ω measure-approximate nearest neighbours. Then for any n ∈ N , k ≤ n/ if for µ almost every x ∈ X we let A n ( x ) = (cid:8) D n : max { ρ ( x, X i ) : i ∈ S ◦ k ( x, F n ) } > r k/n ( x ) (cid:9) . Then P n [ A n ( x )] ≤ exp( − k/ and for all ξ ≥ C λ · (cid:0) ωkn (cid:1) λ we have P n (cid:20)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ η Sn,k ( x ) − η ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≥ ξ |D n / ∈ A n ( x ) (cid:21) ≤ L exp( − kξ . Proof [Proof of Proposition C.1] We combine Proposition C.2 with Proposition C.3, with a n =2 C λ · (2 ωk ) λ · n − λ λ +1 . Proof [Proof of Proposition C.2] First note that as in the proof of Propostion B.1 we may assumethat ζ max = ∞ , without loss of generality. We define a Bayes optimal classifier f ∗ : X → Y by f ∗ ( x ) = min ( Y ∗ Φ ( η ( x ))) , so R ( f ∗ ) = R ∗ . Hence, E n h R (cid:16) ˆ f n (cid:17)i − R ∗ = E n h R (cid:16) ˆ f n (cid:17) − R ( f ∗ ) i = E n h E h φ ˆ f n ( X ) ,Y − φ f ∗ ( X ) ,Y ii = E n (cid:20)Z (cid:16) e ( ˆ f n ( x )) − e ( f ∗ ( x )) (cid:17) T Φ η ( x ) dµ ( x ) (cid:21) = Z E n (cid:20)(cid:16) e ( ˆ f n ( x )) − e ( f ∗ ( x )) (cid:17) T Φ η ( x ) (cid:21) dµ ( x ) ≤ Z (cid:18) E n (cid:20)(cid:16) e ( ˆ f n ( x )) − e ( f ∗ ( x )) (cid:17) T Φ η ( x ) |D n / ∈ A n ( x ) (cid:21) + 2 · k Φ k ∞ · P n [ A n ( x )] (cid:19) dµ ( x ) ≤ Z E n (cid:20)(cid:16) e ( ˆ f n ( x )) − e ( f ∗ ( x )) (cid:17) T Φ η ( x ) |D n / ∈ A n ( x ) (cid:21) dµ ( x ) + Λ(Φ) · a βn . Thus, it suffices to show that there exists ˜ C = ˜ C ( C , C β , β ) > , which is monotonically increasingwith β , such that for all n ≥ N , Z E n (cid:20)(cid:16) e ( ˆ f n ( x )) − e ( f ∗ ( x )) (cid:17) T Φ η ( x ) |D n / ∈ A n ( x ) (cid:21) dµ ( x ) ≤ ˜ C · (1 + C ) · (Λ(Φ) · a n ) β . (3) INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
We define sets Ω j ( D n ) for each j ∈ N and D n ∼ P n by Ω ( D n ) = (cid:26) x ∈ X : 0 < (cid:16) e ( ˆ f n ( x )) − e ( f ∗ ( x )) (cid:17) T Φ η ( x ) < Λ(Φ) · a n (cid:27) Ω j ( D n ) = (cid:26) x ∈ X : 2 j − · Λ(Φ) · a n < (cid:16) e ( ˆ f n ( x )) − e ( f ∗ ( x )) (cid:17) T Φ η ( x ) < j · Λ(Φ) · a n (cid:27) . Note that for all j ≥ , if x ∈ Ω j ( D n ) then j − · Λ(Φ) · a n < (cid:16) e ( ˆ f n ( x )) − e ( f ∗ ( x )) (cid:17) T Φ η ( x ) . On the other hand, by the construction of ˆ f n we have (cid:16) e ( ˆ f n ( x )) − e ( f ∗ ( x )) (cid:17) T Φ ˆ η n ( x ) ≤ . Thus, by Lemma B.3 for all j ≥ , if x ∈ Ω j ( D n ) then k ˆ η n ( x ) − η ( x ) k ∞ > j − a n ≥ a n .Moreover, for all j ∈ N , if x ∈ Ω j ( D n ) then M Φ ( η ( x )) < j · Λ(Φ) · a n . So if j ≥ we have Ω j ( D n ) ≤ {k ˆ η n ( x ) − η ( x ) k ∞ > j − a n } · { M Φ ( η ( x )) < j · Λ(Φ) · a n } . Thus, R E n h Ω j ( D n ) |D n / ∈ A n ( x ) i dµ ( x ) ≤ Z E n (cid:2) {k ˆ η n ( x ) − η ( x ) k ∞ > j − a n } · { M Φ ( η ( x )) < j · Λ(Φ) · a n } |D n / ∈ A n ( x ) (cid:3) dµ ( x )= Z E n (cid:2) {k ˆ η n ( x ) − η ( x ) k ∞ > j − a n } |D n / ∈ A n ( x ) (cid:3) · { M Φ ( η ( x )) < j · Λ(Φ) · a n } dµ ( x )= Z P n (cid:2) k ˆ η n ( x ) − η ( x ) k ∞ > j − a n |D n / ∈ A n ( x ) (cid:3) · { M Φ ( η ( x )) < j · Λ(Φ) · a n } dµ ( x ) ≤ Z C · exp (cid:0) − C · j − (cid:1) · { M Φ ( η ( x )) < j · Λ(Φ) · a n } dµ ( x ) ≤ C · exp (cid:0) − C · j − (cid:1) · µ (cid:0) { M Φ ( η ( x )) < j · Λ(Φ) · a n } (cid:1) ≤ C · exp (cid:0) − C · j − (cid:1) · C β · (cid:0) j · Λ(Φ) · a n (cid:1) β . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
In addition R E n (cid:2) Ω ( D n ) |D n / ∈ A n ( x ) (cid:3) dµ ( x ) ≤ C β · (Λ(Φ) · a n ) β . Thus, Z E n (cid:20)(cid:16) e ( ˆ f n ( x )) − e ( f ∗ ( x )) (cid:17) T Φ η ( x ) |D n / ∈ A n ( x ) (cid:21) dµ ( x )= Z E n ∞ X j =0 Ω j ( D n ) (cid:16) e ( ˆ f n ( x )) − e ( f ∗ ( x )) (cid:17) T Φ η ( x ) |D n / ∈ A n ( x ) dµ ( x ) ≤ (Λ(Φ) · a n ) · ∞ X j =0 j · Z E n h Ω j ( D n ) |D n / ∈ A n ( x ) i dµ ( x ) ≤ (Λ(Φ) · a n ) β · C β · C · ∞ X j =1 jβ exp (cid:0) − C · j − (cid:1) ≤ C β · ∞ X j =1 jβ exp (cid:0) − C · j − (cid:1) · (1 + C ) · (Λ(Φ) · a n ) β . Hence, (3) holds with ˜ C = C β · ∞ X j =1 jβ exp (cid:0) − C · j − (cid:1) < ∞ . This completes the proof of the proposition.
Proof [Proof of Proposition C.3] The fact that P n [ A n ( x )] ≤ exp( − k/ follows immediately fromLemma B.1 applied with p = 2 k/n and ξ = 1 / . Moreover, if we take r = max { ρ ( x, X i ) : i ∈ S ◦ k ( x, F n ) } then whenever D n / ∈ A n ( x ) we have r ≤ r k/n ( x ) , so µ ( B r ( x )) ≤ k/n . Hence, letting r = max { ρ ( x, X i ) : i ∈ S k ( x, F n ) } , since S generates ω measure-approximate nearest neigh-bours we have µ ( B r ( x )) ≤ ω · µ ( B r ( x )) ≤ ω · kn . Hence, provided D n / ∈ A n ( x ) the fact that η is measure-smooth, with constants ( λ, C λ ) implies thatfor each i ∈ S k ( x, F n ) with probability one we have k η ( X i ) − η ( x ) k ∞ ≤ C λ · µ (cid:0) B ρ ( x,X i ) ( x ) (cid:1) λ ≤ C λ · (cid:18) ωkn (cid:19) λ . Thus, for D n / ∈ A n ( x ) , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) k X i ∈ S k ( x, F n ) η ( X i ) − η ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≤ C λ · (cid:18) ωkn (cid:19) λ . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Hence, for all ξ ≥ C λ · (cid:0) ωkn (cid:1) λ we have P n (cid:20)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ η Sn,k ( x ) − η ( x ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≥ ξ |D n / ∈ A n ( x ) (cid:21) ≤ P n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ η Sn,k ( x ) − k X i ∈ S k ( x, F n ) η ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≥ ξ |D n / ∈ A n ( x ) ≤ L exp( − kξ , where the final inequality holds by Lemma B.2, combined with the fact that P n [ D n / ∈ A n ( x ) |F n ] ∈{ , } . Appendix D. Cost sensitive learning on manifolds
In this section we complete the proof of Theorem 3 by combining Theorems 4 and 5.
Theorem 3
Take d ∈ N and let ρ denote the Euclidean metric on R d . Let Φ be a cost matrix and M ⊆ R d a compact smooth submanifold with dimension γ and reach τ . Take positive constants k , r , c , ν min , ν max , ζ max , α, β, C α , C β and let Γ = h ( c , r , ν min , ν max ) , ( β, C β , ζ max ) , ( α, C α ) i .Suppose that S generates θ -approximate nearest neighbours for some θ ≥ . There exists a con-stant C > , depending upon k , γ , τ , Γ such that for all P ∈ P Φ ( V M , Γ) and n ∈ N the followingholds:(1) Given ξ ∈ (0 , and k n = k · n α α + γ · (1 + log(1 /ξ )) γ/ (2 α + γ ) with probability at least − ξ over D n ∼ P n we have P (cid:2) f Sn,k ( X ) / ∈ Y ∗ Φ ( η ( X )) (cid:3) ≤ ξ + C · (cid:16) θ α · Λ(Φ) · p log( L ) (cid:17) β · (cid:18) /ξ ) n (cid:19) βα/ (2 α + γ ) . (2) Given k n = k · n α α + γ we have E n (cid:2) R (cid:0) f Sn,k n (cid:1)(cid:3) − R ∗ ≤ C · ( θ α · Λ(Φ)) β · L · n − α (1+ β )2 α + γ . Moreover, there exists an absolute constant
K > such that whenever θ > , given any subgaus-sian random projection ϕ : R d → R h with h ≥ K · k ϕ k ψ · (cid:18) θ + 1 θ − (cid:19) · max (cid:8) γ log + ( γ/ ( r · τ )) − log + ( c · ν min ) + γ, log δ − (cid:9) , with probability at least − δ , S ( ϕ ) generates θ -approximate nearest neighbours, so both (1) and(2) hold with f ϕn,k in place of f Sn,k . We shall require the following lemmas.
Lemma D.1
Suppose that X = M ⊂ R d is a smooth complete manifold with dimension γ and reach τ . Suppose further that P consists of a marginal µ which is regular with constants ( c , r , ν min , ν max ) , along with a conditional label distribution η which is H¨older continuous withconstants ( α, C α ) . It follows that the conditional η is measure-smooth, with constants ( λ, C λ ) ,where λ = α/γ and C λ = max (cid:8) C α , ( τ / − α , ( r ) − α (cid:9) · (cid:0) c · ν min · − γ · v γ (cid:1) − λ . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Proof
We recall from Lemma A.6 that for x ∈ M and all r ≤ τ / we have V M ( B r ( x )) ≥ − γ · v γ · r γ . Moreover, since the marginal µ which is regular with constants ( c , r , ν min , ν max ) , for all r ≤ r we have, µ ( B r ( x )) = Z B r ( x ) ν ( x ) dx ≥ ν min · V M ( supp ( µ ) ∩ B r ( x )) ≥ ν min · c · V M ( B r ( x )) . Thus, for all r ≤ min { τ / , r } we have µ ( B r ( x )) ≥ (cid:0) c · ν min · − γ · v γ (cid:1) · r γ . Given x , x ∈ supp ( µ ) we must show that, k η ( x ) − η ( x ) k ∞ ≤ C λ · µ (cid:0) B ρ ( x ,x ) ( x ) (cid:1) λ . First suppose that ρ ( x , x ) ≥ min { r , τ / } . Then we have k η ( x ) − η ( x ) k ∞ ≤ ≤ C λ · (cid:0)(cid:0) c · ν min · − γ · v γ (cid:1) · (min { τ / , r } ) γ (cid:1) λ ≤ C λ · µ (cid:0) B min { r ,τ/ } ( x ) (cid:1) λ ≤ C λ · µ (cid:0) B ρ ( x ,x ) ( x ) (cid:1) λ . One the other hand, if ρ ( x , x ) < min { r , τ / } , then k η ( x ) − η ( x ) k ∞ ≤ C α · ρ ( x , x ) α ≤ C λ · (cid:0) c · ν min · − γ · v γ (cid:1) λ · ρ ( x , x ) α = C λ · (cid:0)(cid:0) c · ν min · − γ · v γ (cid:1) · ρ ( x , x ) γ (cid:1) λ ≤ C λ · µ (cid:0) B ρ ( x ,x ) ( x ) (cid:1) λ . Lemma D.2
Suppose that X = M ⊂ R d is a smooth complete manifold with dimension γ andreach τ . Suppose further that µ is a regular probability measure with constants ( c , r , ν min , ν max ) .We let ˜ C denote the constant ˜ C = (cid:0) c · ν min · − γ · v γ (cid:1) − · max (cid:8) ν max · γ · v γ , (min { τ / , r } ) − γ (cid:9) . Then, for all x ∈ X , r > and θ ≥ , we have µ ( B θ · r ( x )) ≤ ˜ C · θ γ · µ ( B r ( x )) . Proof
As noted in the proof of Lemma D.1 given any x ∈ supp ( µ ) and r ≤ min { τ / , r } we have µ ( B r ( x )) ≥ (cid:0) c · ν min · − γ · v γ (cid:1) · r γ . In addition, by Lemma A.6, for all x ∈ X and r ≤ τ / we have µ ( B r ( x )) ≤ ν max · V M ( B r ( x )) ≤ ( ν max · γ · v γ ) · r γ . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Now take x ∈ X , r > and θ ≥ . Firstly, the lemma holds trivially for x / ∈ supp ( µ ) , so we mayassume x ∈ supp ( µ ) . We consider two cases.Case 1: Assume that θ · r ≤ min { τ / , r } , so we have µ ( B θ · r ( x )) ≤ ( ν max · γ · v γ ) · ( θ · r ) γ ≤ (cid:0) ( ν max · γ · v γ ) / (cid:0) c · ν min · − γ · v γ (cid:1)(cid:1) · θ γ · µ ( B r ( x )) . Case 2: Assume that θ · r ≥ min { τ / , r } , so µ ( B r ( x )) ≥ µ (cid:0) B min { τ/ ,r } /θ ( x ) (cid:1) ≥ (cid:0) c · ν min · − γ · v γ (cid:1) · min { τ / , r } γ · θ − γ ≥ (cid:0) c · ν min · − γ · v γ (cid:1) · min { τ / , r } γ · θ − γ · µ ( B θ · r ( x )) . Lemma D.3
Suppose that X = M ⊂ R d is a smooth complete manifold with dimension γ andreach τ . Suppose further that µ is a regular probability measure with constants ( c , r , ν min , ν max ) .There exists a constant ˜ C which depends purely upon c , r , ν min , ν max , γ and τ such that givenany θ ≥ , whenever S generates θ approximate nearest neighbours for some θ . Then S generates ω measure-approximate nearest neighbours with respect to the measure µ with ω ≤ ˜ C · θ γ . Proof
Suppose that S generates θ approximate nearest neighbours. Take some n ∈ N , k ≤ n and let r = max { ρ ( x, X i ) : i ∈ S ◦ k ( x, F n ) } and r = max { ρ ( x, X i ) : i ∈ S k ( x, F n ) } . Since S generates θ approximate nearest neighbours we have r ≤ θ · r . Hence, by Lemma D.2 we have µ ( B r ( x )) ≤ µ ( B θ · r ( x )) ≤ ˜ C · θ γ · µ ( B r ( x )) . Lemma D.4
Suppose we have a metric space (˜ x, ˜ ρ ) along with a map ϕ : X → ˜ X together withconstants c − ( ϕ ) , c + ( ϕ ) > such that for µ almost every x , x ∈ X we have c − ( ϕ ) · ρ ( x , x ) ≤ ˜ ρ ( ϕ ( x ) , ϕ ( x )) ≤ c + ( ϕ ) · ρ ( x , x ) , where ˜ ρ denotes the metric for ˜ X . For each k, n ∈ N we let S k ( x, F n ) denote the indices ofthe k nearest neighbours to ϕ ( x ) in the set { ϕ ( X i ) } ni =1 with respect to ˜ ρ . Then S k generates θ -approximate nearest neighbours with θ = c + ( ϕ ) /c − ( ϕ ) . Proof
By the construction of S k we have max { ρ ( x, X i ) : i ∈ S k ( x, F n ) } ≤ c − ( ϕ ) − · max { ˜ ρ ( ϕ ( x ) , ϕ ( X i )) : i ∈ S k ( x, F n ) }≤ c − ( ϕ ) − · max { ˜ ρ ( ϕ ( x ) , ϕ ( X i )) : i ∈ S ◦ k ( x, F n ) }≤ ( c + ( ϕ ) /c − ( ϕ )) · max { ρ ( x, X i ) : i ∈ S ◦ k ( x, F n ) } . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Proof [Proof of Theorem 3] As in the statement of Theorem 3 we take a compact smooth submani-fold
M ⊆ R d with dimension γ and reach τ . Take positive constants k , r , c , ν min , ν max , ζ max , α, β, C α , C β ,and suppose that S generates θ -approximate nearest neighbours for some θ ≥ . By Lemma D.1there exists a constant C λ , depending upon γ , τ , Γ , such that η is measure-smooth with constants ( λ, C λ ) , where λ = α/γ . In addition, by Lemma D.3 we see that S generates ω -measure approx-imate nearest neighbours with ω ≤ ˜ C · θ γ , where ˜ C depends purely upon γ , τ and Γ . Hence, thefirst part of Theorem 3 follows from Theorem 5.To prove the second part of Theorem 3, we note that µ ( supp ( µ )) = Z supp ( µ ) ν ( x ) dV M ( x ) ≥ ν min · V M ( supp ( µ )) . Hence, V M ( supp ( µ )) ≤ ν − . Moreover, by assumption supp ( µ ) is a ( c , r ) regular set. Thus, byTheorem 4, provided h ≥ K · k ϕ k ψ · (cid:18) θ + 1 θ − (cid:19) · max (cid:8) γ log + ( γ/ ( r · τ )) − log + ( c · ν min ) + γ, log δ − (cid:9) , then with probability at least − δ , for all pairs x , x ∈ supp ( µ ) we have θ + 1 · k x − x k ≤ k ϕ ( x ) − ϕ ( x ) k ≤ θ θ + 1 · k x − x k . Hence, with probability at least − δ , ϕ : R d → R h is bi-Lipchitz with constants c − ( ϕ ) = p / ( θ + 1) and c + ( ϕ ) = θ · p / ( θ + 1) , so by Lemma D.4, S ( ϕ ) generates θ approximatenearest neighbours. Appendix E. Standard lemmas
Recall that r p ( x ) = inf { r > µ ( B r ( x )) ≥ p } . In this section we prove lemmas B.1 and B.2. Lemma B.1
Take x ∈ X . Suppose that p ∈ [0 , , ξ ≤ , k ≤ (1 − ξ ) np . Then P n [max { ρ ( x, X i ) : i ∈ S ◦ k ( x, F n ) } > r p ( x )] ≤ exp( − kξ / . Proof
Take r > r p ( x ) , so µ ( B r ( x )) ≥ p . Note that max { ρ ( x, X i ) : i ∈ S ◦ k ( x, F n ) } ≥ r if andonly if D n ∩ B r ( x ) < k , which is equivalent to P ni =1 { X i ∈ B r ( x ) } < k . Moreover, taking ˜ p = 1 n n X i =1 E (cid:2) { X i ∈ B r ( x ) } (cid:3) = µ ( B r ( x )) ≥ p implies k ≤ (1 − ξ ) n ˜ p ≤ n ˜ p . Thus, by the multiplicative Chernoff bound (Mitzenmacher and Upfal(2005)) we have, P n [ ρ ∞ ( x, { X i : i ∈ S ◦ k ( x, F n )) } ≥ r ] ≤ P " n X i =1 { X i ∈ B r ( x ) } < k ≤ P " n X i =1 { X i ∈ B r ( x ) } < (1 − ξ ) n ˜ p ≤ exp( − n ˜ pξ / ≤ exp( − kξ / . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Since this holds for any countable decreasing sequence of r with each r > r p ( x ) , the lemma followsby continuity. Lemma B.2
Take x ∈ X . For each δ > we have P n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ˆ η Sn,k ( x ) − k X i ∈ S k ( x, F n ) η ( X i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≥ δ |F n ≤ L exp( − kδ ) . Proof
By the construction of ˆ η Sn,k ( x ) it suffices to show that P n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i ∈ S k ( x, F n ) ( e ( Y i ) − η ( X i )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∞ ≥ k · δ |F n ≤ L exp( − kδ ) . By the union bound it suffices to show that for each l ∈ Y we have P n (cid:12)(cid:12)(cid:12)(cid:12) X i ∈ S k ( x, F n ) ( e ( Y i ) l − η ( X i ) l ) (cid:12)(cid:12)(cid:12)(cid:12) ≥ k · δ |F n ≤ − kδ ) . This in turn follows if we show that for all x , · · · , x n ∈ X we have P n (cid:12)(cid:12)(cid:12)(cid:12) X i ∈ S k ( x, F n ) ( e ( Y i ) l − η ( x i ) l ) (cid:12)(cid:12)(cid:12)(cid:12) ≥ k · δ (cid:12)(cid:12)(cid:12)(cid:12) X = x , · · · , X n = x n ≤ − kδ ) . Moreover, e ( Y i ) l = { Y i = l } and η ( x i ) l = P [ Y i = l | X i = x i ] = E (cid:2) { Y i = l } | X i = x i (cid:3) . Thus, wemust show that P n (cid:12)(cid:12)(cid:12)(cid:12) X i ∈ S k ( x, F n ) (cid:0) { Y i = l } − E (cid:2) { Y i = l } | X i = x i (cid:3)(cid:1) (cid:12)(cid:12)(cid:12)(cid:12) ≥ k · δ (cid:12)(cid:12)(cid:12)(cid:12) X = x , · · · , X n = x n does not exceed − kδ ) . This is immediate from Hoeffding’s inequality (Boucheron et al.(2013)). Appendix F. Geometric lemmas
Lemma A.6
Let
M ⊆ R d be a compact smooth submanifold with dimension γ , reach τ and Rie-mannian volume form V M . Then for all x ∈ M and r < τ / we have − γ · v γ · r γ ≤ V M ( B gr ( x )) ≤ V M ( B r ( x )) ≤ γ · v γ · r γ . Proof
Fix x ∈ M and r < τ / . By (Chazal, 2013, Corollary 1.3) we have V M ( B r ( x )) ≤ (cid:18) ττ − r (cid:19) γ · v γ · r γ ≤ γ · v γ · r γ . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
In addition by (Eftekhari and Wakin, 2015, Lemma 12) we have V M ( B r ( x )) ≥ (cid:18) − r τ (cid:19) γ · v γ · r γ ≥ v γ · − γ · r γ . (4)Now suppose z ∈ B r/ ( x ) , so k z − x k < r/ , then by (Niyogi et al., 2008, Proposition 6.3) wehave ρ g ( z, x ) ≤ τ − τ r − k z − x k τ ≤ · k z − x k < r. Hence, B r/ ( x ) ⊆ B gr ( x ) . Hence, by (4) we have V M ( B gr ( x )) ≥ V M (cid:0) B r/ ( x ) (cid:1) ≥ v γ · − γ · r γ . Lemma A.7
With the assumptions of lemma A.6, for all x, ˜ x ∈ M and ˜ r ≤ r < τ / with ρ g ( x, ˜ x ) ≤ r + ˜ r/ we have V M (cid:0) B gr ( x ) ∩ B g ˜ r (˜ x ) (cid:1) ≥ − γ · v γ · ˜ r γ . Proof
Fix x ∈ M and r > and take ˜ x ∈ B gr ( x ) and ˜ r ∈ (0 , min { r, τ / } ) . We claim thatthere exists a geodesic ball B g ˜ r/ ( y ) of radius ˜ r/ such that B g ˜ r/ ( y ) ⊆ B gr ( x ) ∩ B g ˜ r (˜ x ) . To see thisconsider two cases.Case 1: Suppose that ρ g ( x, ˜ x ) ≤ r/ ≤ r/ . Then given z ∈ B g ˜ r/ (˜ x ) we have ρ g ( x, z ) ≤ ρ g ( x, ˜ x ) + ρ g (˜ x, z ) < r/ r/ ≤ r . Hence, B g ˜ r/ (˜ x ) ⊆ B gr ( x ) ∩ B g ˜ r (˜ x ) , so the claim holds with y = ˜ x .Case 2: Suppose that r/ < ρ g ( x, ˜ x ) ≤ r . Let c : [0 , ρ g ( x, ˜ x )] → M be a unit speedgeodesic with c (0) = x and c (1) = ˜ x . Let y = c ( ρ g ( x, ˜ x ) − r/ , so ρ g ( y, ˜ x ) = 3˜ r/ and ρ g ( y, x ) = ρ g ( x, ˜ x ) − r/ ≤ r − ˜ r/ . Hence, B g ˜ r/ ( y ) ⊆ B gr ( x ) ∩ B g ˜ r (˜ x ) .Hence, the claim holds. Thus, by two applications of Lemma A.6 we have, V M (cid:0) B gr ( x ) ∩ B g ˜ r (˜ x ) (cid:1) ≥ V M (cid:16) B g ˜ r/ ( y ) (cid:17) ≥ − γ · v γ · ˜ r γ . Appendix G. Random projections theorem
Our goal in this section is to prove Theorem 4.
Theorem 4
There exists an absolute constant K such that the following holds. Given a compactsmooth submanifold M ⊆ R d with dimension γ and reach τ , suppose that A ⊂ M is ( c , r ) regular with respect to the Riemannian volume V M . Suppose that ϕ : R d → R h is a subgaussianrandom projection. Take (cid:15), δ ∈ (0 , and suppose that h ≥ K · k ϕ k ψ · (cid:15) − · max (cid:8) γ log + ( γ/ ( r · τ )) + log + ( V M ( A ) /c ) + γ, log δ − (cid:9) . Then with probability at least − δ , for all pairs x , x ∈ A we have (1 − (cid:15) ) · k x − x k ≤ k ϕ ( x ) − ϕ ( x ) k ≤ (1 + (cid:15) ) · k x − x k . INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
The result generalises Theorem 7.9 from Dirksen (2016) and the proof is very similar. We beginby recalling another important result from Dirksen (2016). We require some notation. Given ametric space ( X , ρ ) and a subset A ⊂ X we define γ tal ( A ) := inf T sup x ∈ A X q ≥ q/ · ρ ( x, T q ) , where the infimum is taken over sequences T = { T q } q ∈ N with each T q ⊂ A , T ) = 1 and for q ≥ , T q ) ≤ q . Given x , x ∈ R d we let Ch ( x , x ) denote the normalised chord,Ch ( x , x ) = x − x k x − x k . Given a set A ⊂ R d we let A nc ⊂ S d − ⊂ R d denote the set of normalised chords, A nc = { Ch ( x , x ) : x , x ∈ A } .Given x ∈ M we let P x denote the projection onto the tangent space of M at x . Given a matrix M we let k M k op denote the operator norm of M . Given a semi-metric space ( X , ρ ) , a subset A ⊂ X and r > , a subset { x , · · · , x q } ⊂ A such that for each a ∈ A we have ρ ( a, x i ) < r for some i ∈ { , · · · , q } is referred to as an r -net of A with respect to ρ . We let N ( A, ρ, r ) denote thecardinality of the smallest r -net of A with respect to ρ . Theorem 6 (Dirksen (2016))
There exists an absolute constant K such that the following holds.Suppose that A ⊂ R d and let ϕ : R d → R h . Take (cid:15), δ ∈ (0 , and suppose that h ≥ K · k ϕ k ψ · (cid:15) − · max (cid:8) γ tal ( A nc ) , log (cid:0) δ − (cid:1)(cid:9) . Then with probability at least − δ , for all pairs x , x ∈ A we have (1 − (cid:15) ) · k x − x k ≤ k ϕ ( x ) − ϕ ( x ) k ≤ (1 + (cid:15) ) · k x − x k . Given a ( c , r ) regular set A ⊂ M we shall seek to bound γ tal ( A nc ) . We shall use the followingupper bound. Lemma G.1 (Talagrand (2006))
Given a metric space ( X , ρ ) and a subset A ⊂ X we have γ tal ( A ) ≤ (log 2) − / · Z diam ( A )0 p log N ( A, ρ, r ) dr. Proof
See (Talagrand, 2006, pg. 13).We require the following lemmas from Dirksen (2016).
Lemma G.2 (Dirksen (2016))
Suppose that
M ⊂ R d is a manifold with reach τ . Then, given any x , x , y , y ∈ M we have,(a) k Ch ( x , x ) − Ch ( y , y ) k ≤ · ( k x − y k + k x − y k ) / ( k x − x k ) ,(b) k Ch ( x , x ) − P x ( Ch ( x , x )) k ≤ τ − · k x − x k , INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS (c) k P x − P x k op ≤ √ · τ − / · k x − x k / . Lemma G.3
Take any subset A ⊂ M where M ⊂ R d is a γ -dimensional manifold of reach τ .For all r > , if we let n ( r ) = N (cid:0) A, k · k , min { τ, / } · r / (cid:1) then we have, N ( A nc , k · k , r ) ≤ n ( r ) · (cid:18) n ( r ) + (cid:18) r (cid:19) γ (cid:19) . Proof
Take a, b > and let { x , · · · , x q } ⊂ A be a minimal a -net of A with respect to k · k ,and for each i = 1 , · · · , q we let { y ij } m i j =1 be a b -net for A with respect to the semi-metric ( z , z )
7→ k P x i ( z − z ) k . Note that q = N ( A, k · k , a ) and for each i , m i = N ( A, k P x i ( · ) k , b ) ≤ N ( { w ∈ R γ : k w k ≤ } , k · k , b ) ≤ (cid:18) b (cid:19) γ . Now take t > and decompose A nc into A ≥ t nc and A
Lemma G.4
Take any subset A ⊂ M where M ⊂ R d is a γ -dimensional manifold of reach τ .Suppose further that A is a ( c , r ) -regular set. Then, for all r < min { r , τ / } we have N ( A, k · k , r ) ≤ c − · V M ( A ) · ( γ + 4) γ/ · r − γ . Proof
Let { x , · · · , x q } ⊂ A be a maximal r -separated set. By (Eftekhari and Wakin, 2015, Lemma12), for each i we have V M (cid:0) B r/ ( x i ) (cid:1) ≥ (63 / γ/ · v γ · ( r/ γ = (cid:16) ((63 · π ) / γ/ / Γ (cid:16) γ (cid:17)(cid:17) · r γ ≥ (cid:18) ((63 · π ) / γ/ · (cid:16) γ (cid:17) − ( γ/ (cid:19) · r γ ≥ ( γ + 4) − ( γ/ · r γ . Since { x , · · · , x q } ⊂ A is r -separated, the ballse B r/ ( x i ) are disjoint, so using the ( c , r ) regu-larity property, V M ( A ) ≥ q X i =1 V M (cid:0) A ∩ B r/ ( x i ) (cid:1) ≥ c · q X i =1 V M (cid:0) B r/ ( x i ) (cid:1) ≥ q · c · ( γ + 4) − ( γ/ · r γ . Moreover, since { x , · · · , x q } ⊂ A is a maximal r -separated set, { x , · · · , x q } must also be an r -net for A with respect to k · k , so N ( A, k · k , r ) ≤ q . Thus, N ( A, k · k , r ) ≤ c − · V M ( A ) · ( γ + 4) γ/ · r − γ . Lemma G.5
There exists a universal constant ˜ K > such that the following holds. Take anysubset A ⊂ M where M ⊂ R d is a γ -dimensional manifold of reach τ . Suppose further that A isa ( c , r ) -regular set. Then, for all r < { r , } we have log ( N ( A nc , k · k , r )) ≤ ˜ K · (cid:0) γ log + ( γ/τ ) + log + ( V M ( A ) /c ) − γ log + ( r ) + γ (cid:1) . Proof
Combine lemmas G.3 and G.4.
Proof [Proof of Theorem 4] To complete the proof of Theorem 4, we apply Lemmas G.1 and G.5, ( γ tal ( A nc )) ≤ (log 2) − · (cid:18)Z p log N ( A nc , ρ, r ) dr (cid:19) ≤ · (log 2) − · Z log N ( A nc , ρ, r ) dr ≤ K · (cid:0) γ log + ( γ/ ( τ r )) + log + ( V M ( A ) /c ) + γ (cid:1) . Hence, Theorem 4 follows from Theorem 6. INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS
Appendix H. Notation ζ max Upper limit for ζ in the margin condition r Regularity radius for supp ( µ ) c Regularity coefficient supp ( µ ) Y ∗ Φ The set of labels which minimise the cost sensitive loss for a given conditional distribution f Sn,k
A classifier based on n training examples and k approximate nearest neighbours generatedvia S k ˆ η Sn,k
An estimator of η based on n training examples and k approximate nearest neighboursgenerated via S k k The number of nearest neighbours X The feature space Z i The i th training example from D n X i The i th feature vector in the training data Y i The i th test label in the training data Y The set of class labels Z The Cartesian product
X × YD n A data set D n = { Z , · · · , Z n } of size n , with each Z i = ( X i , Y i ) ∼ P chosenindependently n Number of training examples S ◦ k A function from a point to the indices of its k nearest neighbours S k A function from a point indices of a set of approximate k nearest neighbours θ Scale factor for approximate nearest neighbours ω Scale factor for measure-approximate nearest neighbours λ Smoothness exponent for η with respect to µα H¨older exponent for ηC α H¨older scaling constant for ηγ Dimension of the manifold M INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS Γ Set of parameters defining a set of distributions on M V M The Riemannian volume form on manifold M ν The density of µ with respect V M β Margin exponent for P C β Margin scaling constant for P C λ Smoothness scaling constant for ηx A feature vector y A class label X A random feature vector Y A random class label e ( y ) A L × one-hot-encoding of the class label y F n The ordered set { X , · · · , X n } where D n = { ( X , Y ) , · · · , ( X n , Y n ) } P Distribution over Z = X × Y µ The marginal distribution over X ie. µ ( A ) = P [ X ∈ A ] for A ⊆ X η The conditional distribution of Y given X = x , as a probability vector L Number of classes Φ A L × L cost matrix with entries φ i,j φ i,j The cost incurred by predicting class i when the true label is class jM Φ The cost-sensitive margin P n Probability over data sets D n of size n with each ( x i , y i ) sampled i.i.d from PE n Expectation over data sets D n according to P n ρ A metric on X . If X ⊂ R d then ρ denotes the Euclidean metric. ρ g The Riemannian metric on
Mk · k The Euclidean norm on R d R ( h ) The risk of a classifier hR ∗ The Bayes risk INIMAX RATES FOR COST - SENSITIVE LEARNING ON MANIFOLDS B r ( x ) The open metric ball of radius r centered at xB r ( x ) The closed metric ball of radius r centered at xB gr ( x ) The open metric ball of radius r centered at x with respect to the geodesic metric ρ g B gr ( x ) The closed metric ball of radius r centered at x with respect to the geodesic metric ρ g A The indicator function of a set Aτ Reach for a Riemannian manifold MM A compact C ∞ -smooth submanifold of R d d Dimension of the ambient Euclidean space R d R d d dimensional Euclidean spaceAsym The asymmetry of a cost matrix P Φ ( υ, Γ) A class of measures specified in Definition 2.5
Λ(Φ)
The constant
Λ(Φ) := ( L − · Asym (Φ) + 2 · k Φ k ∞∞