Local Rademacher Complexity Bounds based on Covering Numbers
Yunwen Lei*, Lixin Ding†, and Yingzhou Bi‡

* Department of Mathematics, City University of Hong Kong ([email protected]; part of the work was done at Wuhan University)
† State Key Lab of Software Engineering, School of Computer, Wuhan University ([email protected])
‡ Science Computing and Intelligent Information Processing of Guangxi Higher Education Key Laboratory, Guangxi Teachers Education University ([email protected])
Abstract
This paper provides a general result on controlling local Rademacher complexities, which relates, in an elegant form, complexities defined with a constraint on the expected norm to the corresponding ones defined with a constraint on the empirical norm. This result is convenient to apply in real applications and yields refined local Rademacher complexity bounds for function classes satisfying general entropy conditions. We demonstrate the power of these complexity bounds by applying them to derive effective generalization error bounds.
Keywords. Local Rademacher complexity; covering numbers; learning theory
1 Introduction

Machine learning refers to a process of inferring the underlying relationship among input-output variables from a previously chosen hypothesis class $\mathcal{H}$, on the basis of some scattered, noisy examples [11, 29]. Generalization analysis of learning algorithms occupies a central place in machine learning, since it is important to understand the factors influencing a model's behavior, as well as to suggest ways to improve it [2, 3, 5-7, 20]. One seminal example can be found in the multiple kernel learning (MKL) context, where Cortes et al. [7] established a framework showing how the generalization analysis in [12, 13, 25] could motivate two novel MKL algorithms.

Vapnik and Chervonenkis [30] pioneered the research on learning theory by relating generalization errors to the supremum of an empirical process: $\sup_{f\in\mathcal{F}}[Pf - P_nf]$, where $\mathcal{F}$ is the associated loss class induced from the hypothesis space, and $P$ and $P_n$ are the true probability measure and the empirical probability measure, respectively. It was then indicated that this supremum is closely connected with the "size" of the space $\mathcal{F}$ [29, 30]. For a finite class of functions, its size can be simply measured by its cardinality. Vapnik [29] provided a novel concept called the VC dimension to characterize the complexity of $\{0,1\}$-valued function classes, by noticing that the quantity of significance is the number of distinct projections of the function class onto the sample. Other quantities like covering numbers, which measure the number of balls required to cover the original class, have been introduced to capture, on a finer scale, the "size" of real-valued function classes [8, 14, 33, 34]. With the recent development in concentration inequalities and empirical process theory, it is possible to obtain a slightly tighter estimate on the "size" of $\mathcal{H}$ through the remarkable concept called Rademacher complexity [1, 2, 15, 32].

However, all the above mentioned approaches provide only global estimates on the complexity of function classes, and they do not reflect how a learning algorithm explores the function class and interacts with the examples [4, 5]. Moreover, they are bound to control the deviation of empirical errors from expected errors uniformly over the whole class, whereas learning algorithms tend to output prediction rules with small expected errors. Under a variance-expectation condition of the form $\mathrm{Var}(f) \le B(Pf)^{\alpha}$, these functions will also admit small variances. That is to say, the obtained prediction rule is likely to fall into a subclass with small variances [2]. Due to the seminal work of Koltchinskii and Panchenko [16] and Massart [22], it turns out that the notion of Rademacher complexity can be naturally modified to take this into account, yielding the so-called local Rademacher complexity [16]. Since the local Rademacher complexity is always smaller than its global counterpart, discussions based on local Rademacher complexities always yield significantly better learning rates under variance-expectation conditions.

Mendelson [23, 24] initiated the discussion of estimating local Rademacher complexities with covering numbers, and these complexity bounds are very effective in establishing fast learning rates. However, the discussions in [23, 24] are somewhat dispersed in the sense that the author did not provide a general result applicable to all function classes. Indeed, Mendelson [23, 24] derived local Rademacher complexity bounds for several function classes satisfying different entropy conditions case by case, and the involved deduction also relies on the specific entropy conditions.
Mendelson [25] also derived, for a general reproducing kernel Hilbert space (RKHS), an interesting local Rademacher complexity bound based on the eigenvalues of the associated integral operator, which was later generalized to the $\ell_p$-norm MKL context [12, 13, 21]. These results are exclusively developed for RKHSs and it remains unknown whether they could be extended to general function classes. In this paper, we refine these discussions by providing some general and sharp results on controlling local Rademacher complexities by covering numbers. A distinguishing property of our result is that it relates, in an elegant form, local Rademacher complexities to the associated empirical local Rademacher complexities, which allows us to improve the existing local Rademacher complexity bounds for function classes with different entropy conditions in a systematic manner. We also demonstrate the effectiveness of these complexity bounds by applying them to refine the existing learning rates.

The paper is organized as follows. Section 2 formulates the problem. Section 3 provides a general local Rademacher complexity bound as well as its applications to different function classes. Section 4 applies our complexity bounds to generalization analysis. All proofs are presented in Section 5. Some conclusions are presented in Section 6.

2 Preliminaries

We first introduce some notation which will be used throughout this paper. For a measure $\mu$ and a number $1 \le q < \infty$, the notation $L_q(\mu)$ means the collection of functions for which the norm $\|f\|_{L_q(\mu)} := (\int |f|^q \,d\mu)^{1/q}$ is finite. For a class $\mathcal{F}$ of functions, we use the abbreviation $a\mathcal{F} := \{af : f \in \mathcal{F}\}$, and denote by

$$\widetilde{\mathcal{F}} := \{f - g : f, g \in \mathcal{F}\} \qquad (2.1)$$

the class consisting of those elements which can be represented as the difference of two elements in $\mathcal{F}$. For a real number $a$, $\lceil a \rceil$ indicates the least integer not less than $a$, and $\log a$ represents the natural logarithm of $a$. By $c(\cdot)$ we denote any quantity of a constant multiple of the involved arguments, and its exact value may change from line to line, or even within the same line.

Definition 1 (Empirical measure). Let $\mathcal{S}$ be a set and let $s_1, s_2, \ldots, s_n$ be $n$ points in $\mathcal{S}$; then the empirical measure $P_n$ supported on $s_1, s_2, \ldots, s_n$ is defined as

$$P_n(A) := \frac{1}{n}\sum_{i=1}^{n}\chi_A(s_i), \quad \text{for any } A \subset \mathcal{S}, \qquad (2.2)$$

where $\chi_A$ is the characteristic function defined by $\chi_A(s) = 0$ if $s \notin A$ and $\chi_A(s) = 1$ if $s \in A$.

If $Q$ is a measure and $f$ is a measurable function, it is convenient [5] to use the notation $Qf = \int f\,dQ = \mathbb{E}f$. Now, for the empirical measure $P_n$ supported on $Z_1, \ldots, Z_n$, the empirical average of $f$ can be abbreviated as $P_nf = \frac{1}{n}\sum_{i=1}^{n}f(Z_i)$.

Definition 2 (Covering number [14]). Let $(\mathcal{G}, d)$ be a metric space and set $\mathcal{F} \subseteq \mathcal{G}$. For any $\epsilon > 0$, a set $\mathcal{F}^{\triangle} \subseteq \mathcal{G}$ is called an $\epsilon$-cover of $\mathcal{F}$ if for every $f \in \mathcal{F}$ we can find an element $g \in \mathcal{F}^{\triangle}$ satisfying $d(f, g) \le \epsilon$. An $\epsilon$-cover $\mathcal{F}^{\triangle}$ is called a proper $\epsilon$-cover if $\mathcal{F}^{\triangle} \subseteq \mathcal{F}$. The covering number $\mathcal{N}(\epsilon, \mathcal{F}, d)$ is the cardinality of a minimal proper $\epsilon$-cover of $\mathcal{F}$, that is,

$$\mathcal{N}(\epsilon, \mathcal{F}, d) := \min\{|\mathcal{F}^{\triangle}| : \mathcal{F}^{\triangle} \subseteq \mathcal{F} \text{ is an } \epsilon\text{-cover of } \mathcal{F}\}.$$

We also define the logarithm of the covering number as the entropy number. For brevity, when $\mathcal{G}$ is a normed space with norm $\|\cdot\|$, we also denote by $\mathcal{N}(\epsilon, \mathcal{F}, \|\cdot\|)$ the covering number of $\mathcal{F}$ with respect to the metric $d(f, g) := \|f - g\|$. Introduce the notation

$$\mathcal{N}(\epsilon, \mathcal{F}, \|\cdot\|_p) := \sup_{n}\sup_{P_n}\mathcal{N}(\epsilon, \mathcal{F}, \|\cdot\|_{L_p(P_n)}). \qquad (2.3)$$
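Since Definition 2 and Eq. (2.3) are stated abstractly, the following small sketch may help fix ideas: it upper-bounds the proper $\epsilon$-covering number of a finite function class under the empirical $L_2(P_n)$ metric by greedily selecting cover centers. The code and all of its names are our own illustrative assumptions; it presumes the class is represented by its values on the sample.

```python
import numpy as np

def empirical_covering_number(fvals, eps):
    """Greedy upper bound on the proper eps-covering number
    N(eps, F, ||.||_{L_2(P_n)}) of a finite class F.

    fvals: (m, n) array whose row j holds (f_j(X_1), ..., f_j(X_n)),
           i.e. each function is represented by its values on the sample.
    """
    m, _ = fvals.shape
    uncovered = np.ones(m, dtype=bool)
    cover_size = 0
    while uncovered.any():
        j = int(np.argmax(uncovered))  # first still-uncovered function becomes a center
        cover_size += 1
        # empirical L_2(P_n) distances from the center to all functions
        dists = np.sqrt(np.mean((fvals - fvals[j]) ** 2, axis=1))
        uncovered &= dists > eps       # everything within eps of the center is covered
    return cover_size
```

Since the centers are drawn from the class itself, the greedy cover is proper, and its size dominates the minimal covering number of Definition 2.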
Definition 3 (Rademacher complexity [1]). Let $P$ be a probability measure on $\mathcal{X}$ from which the examples $X_1, \ldots, X_n$ are independently drawn. Let $\sigma_1, \ldots, \sigma_n$ be independent Rademacher random variables, taking the values $+1$ and $-1$ with equal probability. For a class $\mathcal{F}$ of functions $f : \mathcal{X} \to \mathbb{R}$, introduce the notations

$$R_nf = \frac{1}{n}\sum_{i=1}^{n}\sigma_if(X_i), \qquad R_n\mathcal{F} = \sup_{f\in\mathcal{F}}R_nf.$$

The Rademacher complexity $\mathbb{E}R_n\mathcal{F}$ and the empirical Rademacher complexity $\mathbb{E}_{\sigma}R_n\mathcal{F}$ are defined by

$$\mathbb{E}R_n\mathcal{F} := \mathbb{E}\Big[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_if(X_i)\Big], \qquad \mathbb{E}_{\sigma}R_n\mathcal{F} := \mathbb{E}\Big[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_if(X_i)\,\Big|\,X_1,\ldots,X_n\Big].$$

In this paper we concentrate our attention on local Rademacher complexities. The word "local" means that the class over which the Rademacher process is defined is a subset of the original class. We consider here local Rademacher complexities of the following form:

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \quad \text{or} \quad \mathbb{E}_{\sigma}R_n\{f\in\mathcal{F} : P_nf^2\le r\}.$$

We refer to the former as the local Rademacher complexity and the latter as the empirical local Rademacher complexity. The parameter $r$ is used to filter out those functions with large variances [25], which are of little significance in the learning process since learning algorithms are unlikely to pick them.

3 Local Rademacher Complexity Bounds

This section is devoted to establishing a general local Rademacher complexity bound. For this purpose, we first show how to control empirical local Rademacher complexities. The empirical radii are then connected with the true radii via the contraction property of Rademacher averages (Lemma A.4). Some examples illustrating the power of our result are also presented.
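Before developing the bounds, the following sketch (ours, purely illustrative) makes the central quantity concrete: it estimates the empirical local Rademacher complexity $\mathbb{E}_{\sigma}R_n\{f\in\mathcal{F} : P_nf^2\le r\}$ of a finite class by Monte Carlo sampling of the Rademacher variables.

```python
import numpy as np

def empirical_local_rademacher(fvals, r, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma R_n { f in F : P_n f^2 <= r }.

    fvals: (m, n) array of function values f_j(X_i) on a fixed sample.
    r:     radius of the empirical L_2(P_n)-ball.
    """
    rng = np.random.default_rng(seed)
    n = fvals.shape[1]
    local = fvals[np.mean(fvals ** 2, axis=1) <= r]  # the subclass {P_n f^2 <= r}
    if local.shape[0] == 0:
        return 0.0
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher variables
    # for each draw, sup over the local class of (1/n) * sum_i sigma_i f(X_i)
    sups = (sigma @ local.T / n).max(axis=1)
    return float(sups.mean())
```

Shrinking $r$ removes rows from the local subclass and the estimate decreases, which is precisely the monotonicity in $r$ that the bounds below are expected to respect.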
3.1 A general complexity bound

Mendelson [23, 24] studied $\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\}$ by relating it to $\mathbb{E}R_n\{f\in\mathcal{F} : P_nf^2\le\hat{r}\}$, where

$$\hat{r} := \sup_{f\in\mathcal{F}:Pf^2\le r}P_nf^2, \qquad (3.1)$$

the latter of which involves an empirical radius defined w.r.t. the empirical measure $P_n$ and can be further tackled by the standard entropy integral [10], yielding a bound of the following form:

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c\,\mathbb{E}\int_0^{\sqrt{\hat{r}}}\sqrt{\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_{L_2(P_n)})}\,d\epsilon. \qquad (3.2)$$

Although the expectation $\mathbb{E}\sqrt{\hat{r}}$ can be controlled by $r$ plus the local Rademacher complexity itself [17],

$$\mathbb{E}\sqrt{\hat{r}} \le \sqrt{r + 4\sup_{f\in\mathcal{F}}\|f\|_{\infty}\,\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\}}, \qquad (3.3)$$

it is generally not trivial to control the integral in Eq. (3.2), since the random variable $\hat{r}$ appears in the upper limit of the integral (the bound (3.3) cannot be trivially used to control the r.h.s. of Eq. (3.2)). Mendelson's [23, 24] idea is, under different entropy conditions, to construct different upper bounds on the involved integral in which the random variable $\hat{r}$ appears in a relatively simple term. For example, for a function class $\mathcal{F}$ satisfying $\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_2)\le\log^p\frac{\gamma}{\epsilon}$, Mendelson [24] established the following bound on the integral:

$$\mathbb{E}\int_0^{\sqrt{\hat{r}}}\sqrt{\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_{L_2(P_n)})}\,d\epsilon \le \mathbb{E}\int_0^{\sqrt{\hat{r}}}\log^{p/2}\frac{\gamma}{\epsilon}\,d\epsilon \le \mathbb{E}\Big[\sqrt{\hat{r}}\log^{p/2}\frac{c(p,\gamma)}{\sqrt{\hat{r}}}\Big]. \qquad (3.4)$$

The term $\sqrt{\hat{r}}\log^{p/2}\frac{c(p,\gamma)}{\sqrt{\hat{r}}}$ turns out to be concave w.r.t. $\sqrt{\hat{r}}$ and so, together with Jensen's inequality, it can be controlled by applying the standard upper bound (3.3). Although these deductions are elegant, they do not allow for general bounds on local Rademacher complexities, and sometimes yield unsatisfactory results due to the looseness introduced by constructing an additional artificial upper bound for the integral in Eq. (3.2) (e.g., Eq. (3.4)).

We overcome these drawbacks by providing a general result on controlling local Rademacher complexities. The stepping stone is the following lemma, which controls the local Rademacher complexity of a sub-class involving a random radius $\hat{r}$ by the local Rademacher complexity of a sub-class involving a deterministic and adjustable parameter $\epsilon$, plus a linear function of $\sqrt{\hat{r}}$; this allows for a direct use of the standard upper bound on $\mathbb{E}\sqrt{\hat{r}}$ and excludes the necessity of constructing non-trivial bounds for the integral in Eq. (3.2). Our basic strategy, analogous to [18, 19, 28], is to approximate the original function class $\mathcal{F}$ with an $\epsilon$-cover, thus relating the local Rademacher complexity of $\mathcal{F}$ to that of two related function classes. One class is of finite cardinality and can be approached by the Massart lemma (Lemma A.1), while the other is of small magnitude and is defined by empirical radii.

Lemma 1. Let $\mathcal{F}$ be a function class and let $P_n$ be the empirical measure supported on the points $X_1, \ldots, X_n$. Then we have the following complexity bound ($r$ can be stochastic w.r.t. $X_i$; a typical choice of $r$ is the term $\hat{r}$ defined in Eq. (3.1)):

$$\mathbb{E}_{\sigma}R_n\{f\in\mathcal{F} : P_nf^2\le r\} \le \inf_{\epsilon>0}\Big[\mathbb{E}_{\sigma}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} + \sqrt{\frac{2r\log\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_{L_2(P_n)})}{n}}\Big].$$

Theorem 2 (Main theorem). Let $\mathcal{F}$ be a function class satisfying $\|f\|_{\infty}\le b$ for all $f\in\mathcal{F}$. There holds the following inequality:

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le \inf_{\epsilon>0}\Big[\mathbb{E}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} + \frac{8b\log\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_2)}{n} + \sqrt{\frac{2r\log\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_2)}{n}}\Big]. \qquad (3.5)$$

Remark 1. An advantage of Theorem 2 over the existing local Rademacher complexity bounds consists in the fact that it provides a general framework for controlling local Rademacher complexities, from which, as we will show in Section 3.2, one can trivially derive explicit local Rademacher complexity bounds when the entropy information is available. Furthermore, since Theorem 2 does not involve an artificial upper bound for the integral in Eq. (3.2) (e.g., Eq. (3.4)), it could yield sharper local Rademacher complexity bounds (see Remarks 2, 3, 4) when compared to the results in [23, 24].
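As a quick illustration of how the infimum over $\epsilon$ in Theorem 2 trades the three terms off against one another, the sketch below evaluates the right-hand side of Eq. (3.5) on a grid of $\epsilon$ values. The proxies passed in for the entropy and for the empirical term, together with all numeric values, are our own illustrative assumptions, not quantities from the paper.

```python
import numpy as np

def theorem2_bound(log_cov, emp_local, r, n, b, eps_grid):
    """Evaluate the r.h.s. of Eq. (3.5) over eps_grid and take the best value.

    log_cov(eps):   proxy for log N(eps/2, F, ||.||_2).
    emp_local(eps): proxy for E R_n { f in F~ : P_n f^2 <= eps^2 }.
    """
    return min(
        emp_local(e) + 8 * b * log_cov(e) / n + np.sqrt(2 * r * log_cov(e) / n)
        for e in eps_grid
    )

# toy usage under a logarithmic entropy condition log N(eps) <= d log^p(gamma/eps)
d, p, gamma, b, n, r = 5.0, 1.0, 2.0, 1.0, 10_000, 1e-2
best = theorem2_bound(
    log_cov=lambda e: d * np.log(2 * gamma / e) ** p,
    emp_local=lambda e: np.sqrt(d / n) * e * np.log(2 * gamma / e) ** (p / 2),
    r=r, n=n, b=b,
    eps_grid=np.geomspace(1e-4, gamma, 200),
)
print(f"best bound over the grid: {best:.4f}")
```

The two closed-form choices $\epsilon=\sqrt{r}$ and $\epsilon=n^{-1/2}$ used in Corollary 1 below correspond to two particular points of such a grid.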
3.2 Applications to specific entropy conditions

We now demonstrate the effectiveness of Theorem 2 by applying it to some interesting classes satisfying general entropy conditions. Our discussion is based on the refined entropy integral (A.2), which can be used to tackle the situation where the standard entropy integral [10] diverges.
Corollary 1. Let $\mathcal{F}$ be a function class with $\sup_{f\in\mathcal{F}}\|f\|_{\infty}\le b$. Assume that there exist three positive numbers $\gamma, d, p$ such that $\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_2)\le d\log^p(\gamma/\epsilon)$ for any $0<\epsilon\le\gamma$. Then for any $0<r\le\gamma^2$ and $n\ge\gamma^{-2}$ there holds

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c(b,p,\gamma)\min\Big[\sqrt{\frac{dr\log^p(2\gamma r^{-1/2})}{n}} + \frac{d\log^p(2\gamma r^{-1/2})}{n},\ \frac{d\log^p(2\gamma n^{1/2})}{n} + \sqrt{\frac{rd\log^p(2\gamma n^{1/2})}{n}}\Big].$$

Remark 2. For function classes $\mathcal{F}$ meeting the condition of Corollary 1, Mendelson [23, Lemma 2.3] derived the following complexity bound:

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c(b,p,\gamma)\max\Big[\frac{d}{n}\log^p\frac{2}{\sqrt{r}},\ \sqrt{\frac{dr}{n}}\log^{p/2}\frac{2}{\sqrt{r}}\Big]. \qquad (3.6)$$

It is interesting to compare the bound (3.6) with ours, and the difference can be seen in the following three aspects:

(1) Firstly, it is obvious that the r.h.s. of Eq. (3.6) is of the same order of magnitude as $\sqrt{drn^{-1}\log^p(r^{-1/2})} + dn^{-1}\log^p(r^{-1/2})$. Consequently, our bound can be no worse than Eq. (3.6).

(2) Furthermore, as we will see in Section 4, the upper bound in Eq. (3.6) is not a sub-root function, which adds some additional difficulty in applying it to the generalization analysis. As a comparison, the upper bound $dn^{-1}\log^p(n^{1/2}) + \sqrt{rdn^{-1}\log^p(n^{1/2})}$ satisfies the sub-root condition (see the definition of sub-root functions in Section 4) and thus is convenient to use in the generalization analysis.

(3) Thirdly, Eq. (3.6) is not consistent with the natural opinion of what a complexity bound should be. For example, when $r$ approaches $0$ it is expected that the term $\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\}$ should monotonically decrease to a limiting point. However, the upper bound in Eq. (3.6) diverges to $\infty$ as $r\to 0$. As a comparison, our result does not violate this consistency, since the term $dn^{-1}\log^p(n^{1/2}) + \sqrt{rdn^{-1}\log^p(n^{1/2})}$ is always an increasing function of $r$, as the numerical sketch below illustrates.
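The divergence described in point (3) is easy to see numerically. The following sketch evaluates the two bounds as $r\to 0$; the constants are our own illustrative choices (all prefactors set to one), so only the trends, not the values, are meaningful.

```python
import numpy as np

d, p, gamma, n = 1.0, 1.0, 2.0, 1e4  # illustrative values, not from the paper

def mendelson_bound(r):  # r.h.s. of Eq. (3.6), constants dropped
    L = np.log(2.0 / np.sqrt(r))
    return max(d / n * L ** p, np.sqrt(d * r / n) * L ** (p / 2))

def our_bound(r):        # second branch of Corollary 1, constants dropped
    L = np.log(2.0 * gamma * np.sqrt(n))
    return d / n * L ** p + np.sqrt(r * d / n * L ** p)

for r in [1e-2, 1e-6, 1e-12, 1e-20]:
    print(f"r = {r:.0e}   Eq.(3.6): {mendelson_bound(r):.3e}   ours: {our_bound(r):.3e}")
```

Eq. (3.6) grows without bound as $r\to 0$, while the Corollary 1 curve decreases monotonically toward its $r=0$ limit.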
Corollary 2. Let $\mathcal{F}$ be a function class with $\sup_{f\in\mathcal{F}}\|f\|_{\infty}\le b$. Assume that there exist two constants $\gamma>0$, $p>0$ such that

$$\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_2) \le \gamma\epsilon^{-p}\log\frac{2}{\epsilon}; \qquad (3.7)$$

then we have the following complexity bound:

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le \begin{cases} c\inf_{\epsilon>0}\Big[n^{-1/2}\epsilon^{1-p/2}\log^{1/2}\frac{2}{\epsilon} + \epsilon^{-p}n^{-1}\log\frac{2}{\epsilon} + \sqrt{r\epsilon^{-p}n^{-1}\log\frac{2}{\epsilon}}\Big], & 0<p<2,\\[1ex] c\big[n^{-1/2}\log^{3/2}n + \sqrt{r}\,n^{-1/2}\big], & p=2,\\[1ex] c\big[n^{-1/p}\log^{1/2}n + \sqrt{r}\,n^{-1/2}\big], & p>2, \end{cases} \qquad (3.8)$$

where $c := c(b,p,\gamma)$ is a constant dependent on $b, p$ and $\gamma$.

Remark 3. We now compare Corollary 2 with the following inequality, established in [24, Eq. (3.5)] under the entropy condition (3.7) with $0<p<2$:

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c(b,p,\gamma)\Big(n^{-\frac{2}{p+2}}\log^p\frac{2}{r} + n^{-\frac{1}{2}}r^{\frac{2-p}{4}}\log\frac{2}{r}\Big), \quad 0<p<2. \qquad (3.9)$$

The upper bound in Eq. (3.9) is not a sub-root function. Furthermore, our bound is monotonically increasing w.r.t. $r$, while the bound (3.9) diverges to $\infty$ as $r\to 0$, which violates the natural property the local Rademacher complexity should admit.
Corollary 3. Let $\mathcal{F}$ be a function class with $\sup_{f\in\mathcal{F}}\|f\|_{\infty}\le b$. Assume that there exist two constants $\gamma>0$, $p>0$ such that $\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_2)\le\gamma\epsilon^{-p}$; then we have the following complexity bound:

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le \begin{cases} c(b,p,\gamma)\inf_{\epsilon>0}\big[n^{-1/2}\epsilon^{1-p/2} + \epsilon^{-p}n^{-1} + \sqrt{r}\,\epsilon^{-p/2}n^{-1/2}\big], & 0<p<2,\\[1ex] c(b,p,\gamma)\big[n^{-1/2}\log n + \sqrt{r}\,n^{-1/2}\big], & p=2,\\[1ex] c(b,p,\gamma)\big[n^{-1/p} + \sqrt{r}\,n^{-1/2}\big], & p>2. \end{cases} \qquad (3.10)$$

Remark 4. As compared with the following inequality established in [24, Eq. (3.4)],

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c(b,p,\gamma)\big(n^{-\frac{2}{p+2}} + n^{-\frac{1}{2}}r^{\frac{2-p}{4}}\big), \quad 0<p<2, \qquad (3.11)$$

Corollary 3 generalizes Eq. (3.11) to the case $p\ge 2$ and remains competitive for $0<p<2$. For example, when $r\le n^{-\frac{2}{p+2}}$ one can take $\epsilon=n^{-\frac{1}{p+2}}$ in Eq. (3.10) to show that

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c(b,p,\gamma)\big[n^{-\frac{2}{p+2}} + \sqrt{r}\,n^{-\frac{1}{p+2}}\big],$$

which is no larger than Eq. (3.11), since $\sqrt{r}\,n^{-\frac{1}{p+2}}\le n^{-\frac{1}{2}}r^{\frac{2-p}{4}}$ for such $r$. Furthermore, for the case $r>n^{-\frac{2}{p+2}}$ one can choose $\epsilon=r^{1/2}$ in Eq. (3.10) to obtain

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c(b,p,\gamma)\big[n^{-\frac{1}{2}}r^{\frac{2-p}{4}} + r^{-\frac{p}{2}}n^{-1}\big],$$

which is again no larger than Eq. (3.11), since $r^{-\frac{p}{2}}n^{-1}\le n^{-\frac{2}{p+2}}$ in this case. Therefore, our result is competitive with Eq. (3.11) for any $r>0$.

4 Generalization Analysis

We now show how to apply the previous local Rademacher complexity bounds to study the generalization performance of learning algorithms. In the learning context, we are given an input space $\mathcal{X}$ and an output space $\mathcal{Y}$, along with a probability measure $P$ on $\mathcal{Z} := \mathcal{X}\times\mathcal{Y}$. Given a sequence of examples $Z_1=(X_1,Y_1),\ldots,Z_n=(X_n,Y_n)$ independently drawn from $P$, our goal is to find a prediction rule (model) $h : \mathcal{X}\to\mathcal{Y}$ to perform prediction as accurately as possible. The error incurred from using $h$ to do the prediction on an example $Z=(X,Y)$ can be quantified by a non-negative real-valued loss function $\ell(h(X),Y)$. The generalization performance of a model $h$ can be measured by its generalization error [9, 31]: $\mathcal{E}(h) := \int\ell(h(X),Y)\,dP$. Since the measure $P$ is often unknown to us, the empirical risk minimization principle first establishes the so-called empirical error $\mathcal{E}_z(h) := \frac{1}{n}\sum_{i=1}^{n}\ell(h(X_i),Y_i)$ to approximate $\mathcal{E}(h)$, and then searches for the prediction rule $\hat{h}_n$ minimizing $\mathcal{E}_z(h)$ over a specified class $\mathcal{H}$ called the hypothesis space. That is, $\hat{h}_n := \arg\min_{h\in\mathcal{H}}\mathcal{E}_z(h)$. Denoting by $h^* := \arg\min_{h\in\mathcal{H}}\mathcal{E}(h)$ the best prediction rule attained in $\mathcal{H}$, generalization analysis aims to relate the excess generalization error $\mathcal{E}(\hat{h}_n)-\mathcal{E}(h^*)$ to the empirical behavior of $\hat{h}_n$ over the sample.

Our generalization analysis is based on Theorem 3 in Bartlett et al. [2], which justifies the use of the Rademacher complexity associated with a small subset of the original class as a complexity term in an error bound. We call a function $\psi : [0,\infty)\to[0,\infty)$ sub-root if it is nonnegative, nondecreasing, and if $r\mapsto\psi(r)/\sqrt{r}$ is nonincreasing for $r>0$.
If $\psi$ is a sub-root function, then it can be checked [2, 3] that the equation $\psi(r)=r$ has a unique positive solution $r^*$, which is referred to as the fixed point of $\psi$.

Lemma 3 ([2]). Let $\mathcal{F}$ be a class of functions taking values in $[a,b]$ and assume that there exist some functional $T : \mathcal{F}\to\mathbb{R}^+$ and some constant $B$ such that $\mathrm{Var}(f)\le T(f)\le BPf$ for every $f\in\mathcal{F}$. Let $\psi$ be a sub-root function with fixed point $r^*$. If, for any $r\ge r^*$, $\psi$ satisfies

$$\psi(r)\ge B\,\mathbb{E}R_n\{f\in\mathcal{F} : T(f)\le r\},$$

then for any $K>1$ and any $t>0$, the following inequality holds with probability at least $1-e^{-t}$:

$$Pf \le \frac{K}{K-1}P_nf + \frac{704K}{B}r^* + \frac{t(11(b-a)+26BK)}{n}, \quad \forall f\in\mathcal{F}. \qquad (4.1)$$

Theorem 4. Let $\mathcal{H}$ be the hypothesis space and $\mathcal{F} := \{Z=(X,Y)\to\ell(h(X),Y)-\ell(h^*(X),Y) : h\in\mathcal{H}\}$ be the shifted loss class. Suppose that $\ell$ is $L$-Lipschitz, $\sup_{h\in\mathcal{H}}\|h\|_{\infty}\le b$, $\Pr\{|Y|\le b\}=1$, and there exist three positive constants $\gamma, d$ and $p$ satisfying $\log\mathcal{N}(\epsilon,\mathcal{H},\|\cdot\|_2)\le d\log^p(\gamma/\epsilon)$. Suppose the variance-expectation condition holds for functions in $\mathcal{F}$, i.e., there exists a constant $B>0$ such that $Pf^2\le BPf$ for all $f\in\mathcal{F}$. Then, for any $0<\delta<1$, $\hat{h}_n$ satisfies the following inequality with probability at least $1-\delta$:

$$\mathcal{E}(\hat{h}_n)-\mathcal{E}(h^*) \le c\Big[\frac{d\log^pn}{n} + \frac{\log(1/\delta)}{n}\Big],$$

where $c$ is a constant depending on $B, p, \gamma, b$ and $L$.

Remark 5. It is possible to derive generalization error bounds using the local Rademacher complexity bound given in [24] (Eq. (3.6)) under the same entropy condition. An obstacle in the way of applying Lemma 3 is that the r.h.s. of Eq. (3.6) is not a sub-root function. The trick for overcoming this problem is to consider the local Rademacher complexity of a slightly larger function class (the star-shaped space, or star-hull, $\mathrm{star}(\mathcal{F}) := \{\alpha f : f\in\mathcal{F}, \alpha\in[0,1]\}$ of $\mathcal{F}$), which always satisfies the sub-root property and can be related to the original class by the following inequality due to Mendelson [24, Lemma 3.9]:

$$\log\mathcal{N}(2\epsilon,\mathrm{star}(\mathcal{F}),\|\cdot\|_2) \le \log\frac{2}{\epsilon} + \log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_2).$$

With this trick, plugging Eq. (3.6) into Lemma 3 yields the following generalization bound with probability at least $1-\delta$:

$$\mathcal{E}(\hat{h}_n)-\mathcal{E}(h^*) \le c\Big[\frac{d\log^{\max(1,p)}n}{n} + \frac{\log(1/\delta)}{n}\Big],$$

which is slightly worse than the bound in Theorem 4 for $p<1$. Furthermore, notice that our upper bound on local Rademacher complexities is always a sub-root function, which is more convenient to use in Lemma 3 and does not require the trick of introducing an additional star-hull.
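Since applying Lemma 3 always comes down to computing the fixed point $r^*$ of a sub-root function, the following sketch finds $r^*$ by simple iteration, using the $\psi$ that appears in the proof of Theorem 4; all numeric constants are our own illustrative choices.

```python
import numpy as np

def fixed_point(psi, r0=1.0, tol=1e-12, max_iter=10_000):
    """Fixed point r* = psi(r*) of a sub-root function via fixed-point iteration.

    A sub-root function crosses the identity exactly once on (0, inf),
    and the iteration r <- psi(r) converges monotonically to that crossing.
    """
    r = r0
    for _ in range(max_iter):
        r_new = psi(r)
        if abs(r_new - r) <= tol * max(r, 1.0):
            return r_new
        r = r_new
    return r

# psi from the proof of Theorem 4, with all prefactors set to one (illustrative)
d, p, n = 5.0, 1.0, 10_000
A = d * np.log(2.0 * np.sqrt(n)) ** p / n
r_star = fixed_point(lambda r: A + np.sqrt(r * A))

# sanity check against the closed form: r = A + sqrt(r*A) gives
# sqrt(r*) = (1 + sqrt(5)) * sqrt(A) / 2
assert abs(r_star - A * ((1 + np.sqrt(5)) / 2) ** 2) < 1e-8
print(f"r* = {r_star:.3e}  (of order d log^p(n) / n)")
```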
Theorem 5. Under the same conditions as Theorem 4, except that the entropy condition is replaced by Eq. (3.7) with $0<p<2$, the following inequality holds with probability at least $1-\delta$:

$$\mathcal{E}(\hat{h}_n)-\mathcal{E}(h^*) \le c\Big(n^{-\frac{2}{p+2}}(\log n)^{-\frac{p}{p+2}}\log\frac{n}{\log n} + \frac{\log(1/\delta)}{n}\Big),$$

where $c$ is a constant depending on $B, p, \gamma, b$ and $L$.

Remark 6. Since the local Rademacher complexity bound given in Eq. (3.9) is not sub-root, applying it to study generalization performance also requires the star-hull trick. Indeed, with this trick one can show that the bound (3.9) yields the following generalization guarantee with probability at least $1-\delta$:

$$\mathcal{E}(\hat{h}_n)-\mathcal{E}(h^*) \le c\Big(n^{-\frac{2}{p+2}}(\log n)^{\frac{4}{p+2}} + \frac{\log(1/\delta)}{n}\Big),$$

which is slightly worse than the bound given in Theorem 5.
5 Proofs

Proof of Lemma 1. For a temporarily fixed $\epsilon>0$, let $\mathcal{F}^{\triangle}$ be a minimal proper $\epsilon$-cover of the class $\{f\in\mathcal{F} : P_nf^2\le r\}$ with respect to the metric $\|\cdot\|_{L_2(P_n)}$. According to the definition of covering numbers, we know that $\mathcal{F}^{\triangle}\subseteq\{f\in\mathcal{F} : P_nf^2\le r\}$. Furthermore, Lemma A.3 shows that $|\mathcal{F}^{\triangle}|\le\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_{L_2(P_n)})$. For any $f\in\mathcal{F}$, let $f^{\triangle}$ be an element of $\mathcal{F}^{\triangle}$ satisfying $\|f-f^{\triangle}\|_{L_2(P_n)}\le\epsilon$. Then we have

$$R_n\{f\in\mathcal{F} : P_nf^2\le r\} = \sup_{\{f\in\mathcal{F}:P_nf^2\le r\}}\Big[\frac{1}{n}\sum_{i=1}^{n}\sigma_if(X_i) - \frac{1}{n}\sum_{i=1}^{n}\sigma_if^{\triangle}(X_i) + \frac{1}{n}\sum_{i=1}^{n}\sigma_if^{\triangle}(X_i)\Big]$$
$$\le \sup_{\{f\in\mathcal{F}:P_nf^2\le r\}}\frac{1}{n}\sum_{i=1}^{n}\sigma_i[f(X_i)-f^{\triangle}(X_i)] + \sup_{\{f\in\mathcal{F}:P_nf^2\le r\}}\frac{1}{n}\sum_{i=1}^{n}\sigma_if^{\triangle}(X_i)$$
$$\le \sup_{\{f\in\mathcal{F}:P_nf^2\le r\}}\frac{1}{n}\sum_{i=1}^{n}\sigma_i[f(X_i)-f^{\triangle}(X_i)] + \sup_{\{f\in\mathcal{F}^{\triangle}:P_nf^2\le r\}}\frac{1}{n}\sum_{i=1}^{n}\sigma_if(X_i), \qquad (5.1)$$

where the last inequality is due to the inclusion $\mathcal{F}^{\triangle}\subset\{f\in\mathcal{F} : P_nf^2\le r\}$.

Taking $g=f-f^{\triangle}$, the definition of $\widetilde{\mathcal{F}}$ and the fact $f^{\triangle}\in\mathcal{F}$ guarantee that $g\in\widetilde{\mathcal{F}}$. Moreover, the construction of $f^{\triangle}$ implies that

$$P_ng^2 = \frac{1}{n}\sum_{i=1}^{n}(f-f^{\triangle})^2(X_i)\le\epsilon^2.$$

Consequently, we have

$$\sup_{\{f\in\mathcal{F}:P_nf^2\le r\}}\frac{1}{n}\sum_{i=1}^{n}\sigma_i[f(X_i)-f^{\triangle}(X_i)] \le \sup_{\{g\in\widetilde{\mathcal{F}}:P_ng^2\le\epsilon^2\}}\frac{1}{n}\sum_{i=1}^{n}\sigma_ig(X_i) = R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\}.$$

Plugging the above inequality into Eq. (5.1) gives

$$R_n\{f\in\mathcal{F} : P_nf^2\le r\} \le R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} + R_n\{f\in\mathcal{F}^{\triangle} : P_nf^2\le r\}. \qquad (5.2)$$

Taking conditional expectations on both sides of Eq. (5.2) and using Lemma A.1 to bound $\mathbb{E}_{\sigma}R_n\{f\in\mathcal{F}^{\triangle} : P_nf^2\le r\}$, we derive that

$$\mathbb{E}_{\sigma}R_n\{f\in\mathcal{F} : P_nf^2\le r\} \le \mathbb{E}_{\sigma}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} + \sqrt{\frac{2r\log\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_{L_2(P_n)})}{n}}.$$

Since the above inequality holds for any $\epsilon>0$, the desired inequality follows immediately.
Proof of Theorem 2.
For any $\epsilon>0$, denote by $P_n$ the empirical measure supported on $X_1,\ldots,X_n$. For any $f\in\mathcal{F}$ with $Pf^2\le r$, there holds

$$P_nf^2 \le \sup_{\{f\in\mathcal{F}:Pf^2\le r\}}(P_nf^2-Pf^2) + Pf^2 \le \sup_{\{f\in\mathcal{F}:Pf^2\le r\}}(P_nf^2-Pf^2) + r.$$

Consequently, the following result holds almost surely:

$$\{f\in\mathcal{F} : Pf^2\le r\} \subseteq \Big\{f\in\mathcal{F} : P_nf^2\le\sup_{\{f\in\mathcal{F}:Pf^2\le r\}}(P_nf^2-Pf^2)+r\Big\}. \qquad (5.3)$$

Using the inclusion relationship (5.3), one can control local Rademacher complexities as follows:

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} = \mathbb{E}\,\mathbb{E}_{\sigma}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le \mathbb{E}\,\mathbb{E}_{\sigma}R_n\Big\{f\in\mathcal{F} : P_nf^2\le r+\sup_{\{f\in\mathcal{F}:Pf^2\le r\}}(P_nf^2-Pf^2)\Big\}$$
$$\le \mathbb{E}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} + \sqrt{\frac{2}{n}}\,\mathbb{E}\sqrt{\Big(r+\sup_{\{f\in\mathcal{F}:Pf^2\le r\}}(P_nf^2-Pf^2)\Big)\log\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_{L_2(P_n)})}$$
$$\le \mathbb{E}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} + \sqrt{\frac{2\log\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_2)}{n}}\,\mathbb{E}\sqrt{r+\sup_{\{f\in\mathcal{F}:Pf^2\le r\}}(P_nf^2-Pf^2)}, \qquad (5.4)$$

where the second inequality is a direct corollary of Lemma 1 and the last inequality follows from Eq. (2.3).

The concavity of $\phi(x)=\sqrt{x}$, coupled with Jensen's inequality, implies that

$$\mathbb{E}\sqrt{r+\sup_{\{f\in\mathcal{F}:Pf^2\le r\}}(P_nf^2-Pf^2)} \le \sqrt{r+\mathbb{E}\sup_{\{f\in\mathcal{F}:Pf^2\le r\}}(P_nf^2-Pf^2)} \le \sqrt{r+2\mathbb{E}R_n\{f^2 : f\in\mathcal{F}, Pf^2\le r\}} \le \sqrt{r+4b\,\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\}}, \qquad (5.5)$$

where the second inequality follows from the standard symmetrization inequality for Rademacher averages [2, e.g., Lemma A.5] and the third inequality comes from a direct application of Lemma A.4 with $\phi(x)=x^2$ (with Lipschitz constant $2b$ on $[-b,b]$).

Combining Eqs. (5.4) and (5.5), it follows directly that

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le \mathbb{E}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} + \sqrt{\frac{2\log\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_2)}{n}}\sqrt{r+4b\,\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\}}.$$

Solving the above inequality (a quadratic inequality in $\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\}$) gives

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le \mathbb{E}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} + \frac{8b\log\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_2)}{n} + \sqrt{\frac{2r\log\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_2)}{n}}.$$

The proof is complete if we take an infimum over all $\epsilon>0$.

Proof of Corollary 1.
It follows directly from Theorem 2 that

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le \inf_{0<\epsilon\le\gamma}\Big[\mathbb{E}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} + \frac{8bd\log^p(2\gamma/\epsilon)}{n} + \sqrt{\frac{2rd\log^p(2\gamma/\epsilon)}{n}}\Big], \qquad (5.6)$$

where $\widetilde{\mathcal{F}}$ is defined by Eq. (2.1). Lemma A.2 and the condition on covering numbers imply that

$$\log\mathcal{N}(\epsilon,\widetilde{\mathcal{F}},\|\cdot\|_2) \le 2\log\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_2) \le 2d\log^p(2\gamma/\epsilon), \quad \text{for any } 0<\epsilon\le\gamma. \qquad (5.7)$$

Now one can resort to Lemma A.5 to address the term $\mathbb{E}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\}$, $0<\epsilon\le\gamma$. Indeed, applying Lemma A.5 with the assignment $\epsilon_k=2^{-k}\epsilon$ and using the inequality $\mathcal{N}(\epsilon_k,\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\},\|\cdot\|_{L_2(P_n)})\le\mathcal{N}(\epsilon_k/2,\widetilde{\mathcal{F}},\|\cdot\|_{L_2(P_n)})$, the following inequality holds for any $N\in\mathbb{N}^+$:

$$\mathbb{E}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} = \mathbb{E}\,\mathbb{E}_{\sigma}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} \le \mathbb{E}\Big[\sum_{k=1}^{N}\epsilon_{k-1}\sqrt{\frac{2\log\mathcal{N}(\epsilon_k/2,\widetilde{\mathcal{F}},\|\cdot\|_{L_2(P_n)})}{n}}\Big] + \epsilon_N$$
$$\le 2^{7/2}\sqrt{\frac{d}{n}}\,\epsilon\sum_{k=1}^{N}2^{-k}\log^{p/2}\big(2^{k+2}\gamma\epsilon^{-1}\big) + \epsilon_N \qquad \text{(according to Eq. (5.7))}$$
$$\le 2^{(7+p)/2}\sqrt{\frac{d}{n}}\,\epsilon\sum_{k=1}^{N}2^{-k}\Big[\big((k+1)\log 2\big)^{p/2} + \log^{p/2}(2\gamma\epsilon^{-1})\Big] + \epsilon_N \le 2^{(7+p)/2}\sqrt{\frac{d}{n}}\,\epsilon\Big[c(p)+\log^{p/2}(2\gamma/\epsilon)\Big] + \epsilon_N, \qquad (5.8)$$

where the third inequality follows from the standard result $(a+b)^{p/2}\le[2\max(a,b)]^{p/2}\le2^{p/2}(a^{p/2}+b^{p/2})$, $a,b\ge0$, and $c(p):=\sum_{k=1}^{\infty}2^{-k}\big((k+1)\log 2\big)^{p/2}<\infty$.

Letting $N\to\infty$ in Eq. (5.8) and noticing Eq. (5.6), one derives that

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le \inf_{0<\epsilon\le\gamma}\Big[2^{(9+p)/2}\sqrt{\frac{d}{n}}\,\epsilon\big(c(p)+\log^{p/2}(2\gamma/\epsilon)\big) + \frac{8bd\log^p(2\gamma/\epsilon)}{n} + \sqrt{\frac{2rd\log^p(2\gamma/\epsilon)}{n}}\Big]$$
$$\le c(b,p,\gamma)\inf_{0<\epsilon\le\gamma}\Big[\sqrt{\frac{d}{n}}\,\epsilon\log^{p/2}(2\gamma/\epsilon) + \frac{d\log^p(2\gamma/\epsilon)}{n} + \sqrt{\frac{rd\log^p(2\gamma/\epsilon)}{n}}\Big]. \qquad (5.9)$$

Taking the choice $\epsilon=\sqrt{r}$ in Eq. (5.9) (admissible since $r\le\gamma^2$), there holds

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c(b,p,\gamma)\Big[\sqrt{\frac{dr\log^p(2\gamma r^{-1/2})}{n}} + \frac{d\log^p(2\gamma r^{-1/2})}{n}\Big].$$

Taking the assignment $\epsilon=n^{-1/2}$ (admissible since $n\ge\gamma^{-2}$), we derive

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c(b,p,\gamma)\Big[\frac{d\log^p(2\gamma n^{1/2})}{n} + \sqrt{\frac{rd\log^p(2\gamma n^{1/2})}{n}}\Big].$$

Since $\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\}$ can be upper bounded in this way for any $0<\epsilon\le\gamma$, the desired inequality is immediate.

Proof of Corollary 2.
Theorem 2 can be applied here to show that

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le \inf_{\epsilon>0}\Big[\mathbb{E}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} + \frac{8b\cdot2^p\gamma\epsilon^{-p}\log\frac{4}{\epsilon}}{n} + \sqrt{\frac{2r\cdot2^p\gamma\epsilon^{-p}\log\frac{4}{\epsilon}}{n}}\Big]. \qquad (5.10)$$

Lemma A.2 gives the following entropy condition for $\widetilde{\mathcal{F}}$:

$$\log\mathcal{N}(\epsilon,\widetilde{\mathcal{F}},\|\cdot\|_2) \le 2\log\mathcal{N}(\epsilon/2,\mathcal{F},\|\cdot\|_2) \le 2^{p+1}\gamma\epsilon^{-p}\log\frac{4}{\epsilon}. \qquad (5.11)$$

Now applying Lemma A.5 with the assignment $\epsilon_k=2^{-k}\epsilon$ and arguing analogously to the proof of Corollary 1, except using the entropy condition (5.11), one derives that

$$\mathbb{E}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\} \le \mathbb{E}\Big[\sum_{k=1}^{N}\epsilon_{k-1}\sqrt{\frac{2\log\mathcal{N}(\epsilon_k/2,\widetilde{\mathcal{F}},\|\cdot\|_{L_2(P_n)})}{n}}\Big] + \epsilon_N$$
$$\le \sum_{k=1}^{N}2^{1-k}\epsilon\sqrt{\frac{2\gamma\epsilon^{-p}2^{(k+2)p+1}\log\frac{2^{k+3}}{\epsilon}}{n}} + 2^{-N}\epsilon = 2^{p+2}\sqrt{\frac{\gamma}{n}}\,\epsilon^{1-p/2}\sum_{k=1}^{N}2^{\frac{k(p-2)}{2}}\Big[\log\frac{1}{\epsilon}+(k+3)\log 2\Big]^{1/2} + 2^{-N}\epsilon. \qquad (5.12)$$

We now continue the discussion by distinguishing three cases according to the magnitude of $p$:

(a) Case $0<p<2$. In this case the series $\sum_{k=1}^{\infty}2^{\frac{k(p-2)}{2}}\big[\log\frac{1}{\epsilon}+(k+3)\log 2\big]^{1/2}$ converges, and thus one can let $N\to\infty$ in Eq. (5.12) to derive the bound $\mathbb{E}R_n\{f\in\widetilde{\mathcal{F}} : P_nf^2\le\epsilon^2\}\le cn^{-1/2}\epsilon^{1-p/2}\log^{1/2}\frac{2}{\epsilon}$. Plugging this inequality back into Eq. (5.10), one obtains

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c\inf_{\epsilon>0}\Big[n^{-1/2}\epsilon^{1-p/2}\log^{1/2}\frac{2}{\epsilon} + \epsilon^{-p}n^{-1}\log\frac{2}{\epsilon} + \sqrt{r\epsilon^{-p}n^{-1}\log\frac{2}{\epsilon}}\Big].$$

(b) Case $p=2$. For this particular $p$, Eqs. (5.10) and (5.12) imply that

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c\inf_{\epsilon>0}\inf_{N\in\mathbb{N}^+}\Big[n^{-1/2}\big(N\log^{1/2}\tfrac{1}{\epsilon}+N^{3/2}\big) + 2^{-N}\epsilon + \epsilon^{-2}n^{-1}\log\tfrac{2}{\epsilon} + \sqrt{r\epsilon^{-2}n^{-1}\log\tfrac{2}{\epsilon}}\Big] \le c\big[n^{-1/2}\log^{3/2}n + \sqrt{r}\,n^{-1/2}\big],$$

where in the last step we simply take the choices $\epsilon=1$ and $N=\lceil\frac{1}{2}\log_2n\rceil$.

(c) Case $p>2$. In this case, taking the choice $\epsilon=1$ in Eqs. (5.10) and (5.12), we have

$$\mathbb{E}R_n\{f\in\mathcal{F} : Pf^2\le r\} \le c\inf_{N\in\mathbb{N}^+}\Big[n^{-1/2}\sum_{k=1}^{N}(k+3)^{1/2}2^{\frac{k(p-2)}{2}} + 2^{-N} + n^{-1} + \sqrt{r}\,n^{-1/2}\Big] \le c\inf_{N\in\mathbb{N}^+}\Big[n^{-1/2}N^{1/2}2^{\frac{N(p-2)}{2}} + 2^{-N} + \sqrt{r}\,n^{-1/2}\Big] \le c\big[n^{-1/p}\log^{1/2}n + \sqrt{r}\,n^{-1/2}\big],$$

where we choose $N=\lceil p^{-1}\log_2n\rceil$ in the last step.

Using a similar deduction strategy, one can also prove Corollary 3 on local Rademacher complexity bounds when the entropy number grows as a polynomial of $1/\epsilon$. For simplicity we omit the proof here.

Proof of Theorem 4.
We consider the functional $T(f):=Pf^2$ here. The structural result on covering numbers implies that [24]

$$\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_2) \le \log\mathcal{N}(\epsilon/L,\mathcal{H},\|\cdot\|_2) \le d\log^p(\gamma L/\epsilon).$$

Corollary 1 implies that

$$\psi(r) := c\Big[\frac{d\log^p(2\gamma n^{1/2})}{n} + \sqrt{\frac{rd\log^p(2\gamma n^{1/2})}{n}}\Big]$$

is an appropriate choice meeting the condition of Lemma 3. Let $r^*$ be its fixed point; then we know that

$$r^* = c\Big[\frac{d\log^p(2\gamma n^{1/2})}{n} + \sqrt{\frac{r^*d\log^p(2\gamma n^{1/2})}{n}}\Big].$$

Solving this equality gives $r^*\le cdn^{-1}\log^pn$. It can be directly checked that any $f\in\mathcal{F}$ also satisfies $\|f\|_{\infty}\le 4bL$. Consequently, one can apply Lemma 3 here to show that, for the particular function $\hat{f}_n=\ell(\hat{h}_n(x),y)-\ell(h^*(x),y)$, the following inequality holds with probability at least $1-\delta$:

$$P\hat{f}_n \le \frac{K}{K-1}P_n\hat{f}_n + \frac{704Kcd\log^pn}{Bn} + \frac{\log(1/\delta)(88bL+26BK)}{n}, \quad \forall K>1.$$

Using the above inequality and the fact $P_n\hat{f}_n=\mathcal{E}_z(\hat{h}_n)-\mathcal{E}_z(h^*)\le 0$, we immediately derive the desired result.
Proof of Theorem 5.
Let $\epsilon$ be a positive number to be fixed later. The entropy assumption implies that $\log\mathcal{N}(\epsilon,\mathcal{F},\|\cdot\|_2)\le c\epsilon^{-p}\log\frac{2}{\epsilon}$, from which Corollary 2 implies that

$$\psi_{\epsilon}(r) := c\Big[n^{-1/2}\epsilon^{1-p/2}\log^{1/2}\frac{2}{\epsilon} + \epsilon^{-p}n^{-1}\log\frac{2}{\epsilon} + \sqrt{r\epsilon^{-p}n^{-1}\log\frac{2}{\epsilon}}\Big]$$

is a function meeting the condition of Lemma 3. The associated fixed point $r^*_{\epsilon}=\psi_{\epsilon}(r^*_{\epsilon})$ satisfies the constraint

$$r^*_{\epsilon} \le c\Big[n^{-1/2}\epsilon^{1-p/2}\log^{1/2}\frac{2}{\epsilon} + \epsilon^{-p}n^{-1}\log\frac{2}{\epsilon}\Big].$$

For the specific choice $\epsilon=(\log n)^{\frac{1}{p+2}}n^{-\frac{1}{p+2}}$ we get

$$r^*_{\epsilon} = c\,n^{-\frac{2}{p+2}}(\log n)^{-\frac{p}{p+2}}\log\frac{n}{\log n}.$$

Plugging this bound on $r^*_{\epsilon}$ into Lemma 3 completes the proof.

6 Conclusion

This paper provides a systematic approach to estimating local Rademacher complexities with covering numbers. Local Rademacher complexity is an effective concept in learning theory and has recently received increasing attention, since it captures the property that the prediction rule picked by a learning algorithm always lies in a subset of the original class. We provide a general local Rademacher complexity bound, which relates, in an elegant form, the complexities with a constraint on the $L_2(P)$ norm to the corresponding ones with a constraint on the $L_2(P_n)$ norm. This bound is convenient to calculate and is easily applicable to practical learning problems. We show that our general result (Theorem 2) can yield local Rademacher complexity bounds superior to those in Mendelson [23, 24] when applied to function classes satisfying general entropy conditions. We also apply the derived local Rademacher complexity bounds to generalization analysis.

Acknowledgement
The work is partially supported by the Science Computing and Intelligent Information Processing of Guangxi Higher Education Key Laboratory (Grant No. GXSCIIP201409).
A Lemmas
Lemma A.1 presents effective empirical complexity bounds for function classes of finite cardinality.
Lemma A.1 (Massart's lemma [4]). Suppose that $\mathcal{F}$ is a finite class with cardinality $N$; then the empirical local Rademacher complexity can be bounded as follows:

$$\mathbb{E}_{\sigma}R_n\{f\in\mathcal{F} : P_nf^2\le r\} \le \sqrt{\frac{2r\log N}{n}}.$$
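As a quick sanity check, Massart's bound can be compared against the Monte Carlo estimator sketched after Definition 3; the snippet below (ours, purely illustrative) draws a small random finite class and verifies the inequality numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 50, 200, 0.5
fvals = rng.uniform(-1.0, 1.0, size=(m, n))          # a random finite class
local = fvals[np.mean(fvals ** 2, axis=1) <= r]      # subclass with P_n f^2 <= r

sigma = rng.choice([-1.0, 1.0], size=(5000, n))
estimate = (sigma @ local.T / n).max(axis=1).mean()  # E_sigma R_n, Monte Carlo
massart = np.sqrt(2 * r * np.log(len(local)) / n)    # Lemma A.1 with N = |local|

assert estimate <= massart
print(f"estimate = {estimate:.4f} <= Massart bound = {massart:.4f}")
```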
Lemma A.2 ([27]). Let $\|\cdot\|$ be a norm defined on the class $\mathcal{F}$. If $\widetilde{\mathcal{F}}$ is defined by Eq. (2.1), then we have

$$\mathcal{N}(\epsilon,\widetilde{\mathcal{F}},\|\cdot\|) \le \mathcal{N}^2(\epsilon/2,\mathcal{F},\|\cdot\|).$$

Since our definition of covering numbers requires the $\epsilon$-cover to belong to the original class, the covering number of a sub-class is not necessarily smaller than that of the whole class. However, we have the following structural result for tackling covering numbers of a sub-class.

Lemma A.3 ([27]). Let $\mathcal{F}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$ and let $\mathcal{F}_1\subseteq\mathcal{F}$ be a subset. Then for any $\epsilon>0$, we have the following relationship between covering numbers:

$$\mathcal{N}(\epsilon,\mathcal{F}_1,d) \le \mathcal{N}(\epsilon/2,\mathcal{F},d).$$

The following structural result on Rademacher complexities provides us a powerful tool to tackle the complexity of a composite class via that of the base class.
Lemma A.4 (Contraction property [2]). Let $\phi$ be a Lipschitz function with constant $L$, that is, $|\phi(x)-\phi(y)|\le L|x-y|$. Then for every function class $\mathcal{F}$ there holds

$$\mathbb{E}_{\sigma}R_n\,\phi\circ\mathcal{F} \le L\,\mathbb{E}_{\sigma}R_n\mathcal{F}, \qquad (A.1)$$

where $\phi\circ\mathcal{F} := \{\phi\circ f : f\in\mathcal{F}\}$ and $\circ$ is the composition operator.

Lemma A.5 (Refined entropy integral [23]). Let $X_1,\ldots,X_n$ be a sequence of examples and let $P_n$ be the associated empirical measure. For any function class $\mathcal{F}$ and any monotone sequence $(\epsilon_k)_{k=0}^{\infty}$ decreasing to $0$ such that $\epsilon_0\ge\sup_{f\in\mathcal{F}}\sqrt{P_nf^2}$, the following inequality holds for every non-negative integer $N$:

$$\mathbb{E}_{\sigma}R_n\mathcal{F} \le \sum_{k=1}^{N}\epsilon_{k-1}\sqrt{\frac{2\log\mathcal{N}(\epsilon_k,\mathcal{F},\|\cdot\|_{L_2(P_n)})}{n}} + \epsilon_N. \qquad (A.2)$$
References

[1] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2002.
[2] P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Ann. Stat., 33(4):1497–1537, 2005.
[3] G. Blanchard, O. Bousquet, and P. Massart. Statistical performance of support vector machines. Ann. Stat., 36(2):489–531, 2008.
[4] O. Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, Paris, 2002.
[5] O. Bousquet. New approaches to statistical learning theory. Ann. Inst. Stat. Math., 55(2):371–389, 2003.
[6] H. Chen, J. Peng, Y. Zhou, L. Li, and Z. Pan. Extreme learning machine for ranking: generalization analysis and applications. Neural Networks, 53:119–126, 2014.
[7] C. Cortes, M. Kloft, and M. Mohri. Learning kernels using local Rademacher complexity. In Advances in Neural Information Processing Systems, pages 2760–2768, 2013.
[8] F. Cucker and S. Smale. On the mathematical foundations of learning. Bull. Am. Math. Soc., 39(1):1–50, 2002.
[9] F. Cucker and D.-X. Zhou. Learning theory: an approximation theory viewpoint. Cambridge Univ. Press, Cambridge, 2007.
[10] R. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Funct. Anal., 1(3):290–330, 1967.
[11] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer-Verlag, New York, 2001.
[12] M. Kloft and G. Blanchard. The local Rademacher complexity of lp-norm multiple kernel learning. In Advances in Neural Information Processing Systems, pages 2438–2446, 2011.
[13] M. Kloft and G. Blanchard. On the convergence rate of lp-norm multiple kernel learning. J. Mach. Learn. Res., 13(1):2465–2502, 2012.
[14] A. N. Kolmogorov and V. M. Tikhomirov. ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14(2):3–86, 1959.
[15] V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Trans. Inf. Theory, 47(5):1902–1914, 2001.
[16] V. Koltchinskii and D. Panchenko. Rademacher processes and bounding the risk of function learning. In E. Giné, D. Mason, and J. Wellner, editors, High Dimensional Probability II, pages 443–458, Boston, 2000. Birkhäuser.
[17] M. Ledoux and M. Talagrand. Probability in Banach Spaces: isoperimetry and processes. Springer-Verlag, Berlin, 1991.
[18] Y. Lei and L. Ding. Refined Rademacher chaos complexity bounds with applications to the multikernel learning problem. Neural Comput., 26(4):739–760, 2014.
[19] Y. Lei, L. Ding, and W. Zhang. Generalization performance of radial basis function networks. IEEE Transactions on Neural Networks and Learning Systems, 26(3):551–564, 2015.
[20] Y. Lei, Ü. Dogan, A. Binder, and M. Kloft. Multi-class SVMs: from tighter data-dependent generalization bounds to novel algorithms. Advances in Neural Information Processing Systems, to appear, 2015.
[21] S. Lv and F. Zhou. Optimal learning rates of lp-type multiple kernel learning under general conditions. Information Sciences, 294:255–268, 2015.
[22] P. Massart. Some applications of concentration inequalities to statistics. Annales de la faculté des sciences de Toulouse, 9(2):245–303, 2000.
[23] S. Mendelson. Improving the sample complexity using global data. IEEE Trans. Inf. Theory, 48(7):1977–1991, 2002.
[24] S. Mendelson. A few notes on statistical learning theory. In S. Mendelson and A. Smola, editors, Advanced Lectures on Machine Learning, Lect. Notes Comput. Sci. 2600, pages 1–40. Springer-Verlag, Berlin, 2003.
[25] S. Mendelson. On the performance of kernel classes. J. Mach. Learn. Res., 4:759–771, 2003.
[26] L. Oneto, A. Ghio, S. Ridella, and D. Anguita. Local Rademacher complexity: sharper risk bounds with and without unlabeled samples. Neural Networks, 65:115–125, 2015.
[27] D. Pollard. Convergence of stochastic processes. Springer-Verlag, New York, 1984.
[28] N. Srebro, K. Sridharan, and A. Tewari. Optimistic rates for learning with a smooth loss. arXiv preprint arXiv:1009.3896, 2010.
[29] V. Vapnik. The nature of statistical learning theory. Springer-Verlag, New York, 2000.
[30] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16(2):264–280, 1971.
[31] Q. Wu, Y. Ying, and D.-X. Zhou. Learning rates of least-square regularized regression. Foundations of Computational Mathematics, 6(2):171–192, 2006.
[32] Y. Ying and C. Campbell. Rademacher chaos complexities for learning the kernel problem. Neural Comput., 22(11):2858–2886, 2010.
[33] D.-X. Zhou. The covering number in learning theory. J. Complex., 18(3):739–767, 2002.
[34] D.-X. Zhou. Capacity of reproducing kernel spaces in learning theory. IEEE Trans. Inf. Theory, 49(7):1743–1752, 2003.