Learning rates for partially linear support vector machine in high dimensions
Yifan Xia†, Yongchao Hou†, Shaogao Lv‡,∗
† College of Statistics, Southwestern University of Finance and Economics, Chengdu, China; ‡ Department of Statistics and Mathematics, Nanjing Audit University, Nanjing, China; ∗ Corresponding author: [email protected]
Abstract
This paper analyzes a new regularized learning scheme for the high dimensional partially linear support vector machine. The proposed approach combines an empirical risk with a Lasso-type penalty for the linear part and a standard functional norm penalty for the nonlinear part. The linear kernel is used for model interpretation and feature selection, while the nonlinear kernel is adopted to enhance algorithmic flexibility. We develop a new technical analysis based on a weighted empirical process and establish sharp learning rates for the semi-parametric estimator under regularity conditions. In particular, the derived learning rates for the semi-parametric SVM depend not only on the sample size and the functional complexity, but also on the sparsity and margin parameters.
Key Words and Phrases: partially linear models, high dimension, support vector machine, weighted empirical process.
1 Introduction

The support vector machine (SVM), originally introduced by Vapnik (1995), is well known to be a popular and powerful technique, mainly due to its successful practical performance and nice theoretical foundations in machine learning. For supervised classification problems, the SVM is based on the margin-maximization principle endowed with a specified kernel, which is formulated through a nonlinear map from the input space to the feature space.

Denote by $\mathcal{X}$ and $\mathcal{Y}=\{-1,+1\}$ the input space and the corresponding output space, respectively. Let $(X,Y)\in\mathcal{X}\times\mathcal{Y}$ be a random vector drawn from an unknown joint distribution $\rho$ on $\mathcal{X}\times\mathcal{Y}$, and suppose that the observations $\{(Y_i,X_i)\}_{i=1}^{n}$ are available from $\rho$. In empirical risk minimization, the standard $L_2$-norm SVM combines the widely used hinge loss with an RKHS-norm penalty. Recall that the empirical hinge risk is defined by
$$ R_n(f)=\frac{1}{n}\sum_{i=1}^{n}\phi_h\big(Y_i f(X_i)\big), $$
where the hinge loss is $\phi_h(u)=(1-u)_+$, with $u_+$ denoting the positive part of $u\in\mathbb{R}$. The standard SVM can then be expressed as the regularization problem
$$ \min_{f\in\mathcal{H}_K}\Big\{R_n(f)+\lambda\|f\|_K^2\Big\}, $$
where $\lambda$ is the regularization parameter controlling the functional complexity of $\mathcal{H}_K$. Here $\mathcal{H}_K$ is a reproducing kernel Hilbert space (RKHS), often specified in advance; see Section 2 for more details on RKHSs. The book by Steinwart and Christmann (2008) contains a good overview of SVMs and the related learning theory.

Among the various kernel-based learning schemes, including the SVM, it remains challenging to select a suitable kernel, and no fully satisfactory answer is available so far; see the related work on kernel learning (Lanckriet et al., 2004; Micchelli and Pontil, 2005; Wu et al., 2007; Kloft et al., 2011; Micchelli et al., 2016) for instances. In this paper, we consider a semi-parametric SVM problem combining the linear kernel with a general nonlinear kernel. Indeed, partially linear models in statistics have received great attention in the last several decades, see (Muller and van de Geer, 2015; Hardle and Liang, 2007; Speckman, 1988). In such models, the linear part aims at model interpretation, while the nonlinear part is used to enhance model flexibility. As a concrete example from the stock market, the future return $Y$ of a stock may depend on several company management indexes (e.g. shareholder structure) which are homogeneous across companies, and for which a linear relation with $Y$ is allowed. The other features (e.g., from financial statements) should enter nonlinearly, in that a company has a complex curve in terms of its operation or profit pattern. In practice, load forecasting using a semi-parametric SVM yields better predictions than the conventional approach (Jacobus and Abhisek, 2009), and semi-parametric SVMs have also been successfully applied to analyze pharmacokinetic and pharmacodynamic data (Seok et al., 2011). However, to the best of our knowledge, the theoretical study of semi-parametric support vector machines is still lacking, and this paper focuses on this topic in the high dimensional setting.

The high dimensional case refers to the setting where the ambient dimension $p$ of the covariates is very large (e.g. $p\gg n$), but only a small subset of the covariates is significantly relevant to the response.
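To make the quantities above concrete, the following short Python sketch (our own illustration; the function names and the kernel-expansion representation are assumptions, not part of the paper) evaluates the empirical hinge risk and the standard kernel-SVM objective for a candidate $f=\sum_i\alpha_iK(\cdot,X_i)$.

```python
import numpy as np

def empirical_hinge_risk(f_values, y):
    """R_n(f) = (1/n) * sum_i (1 - y_i f(x_i))_+  for the hinge loss phi_h(u) = (1 - u)_+."""
    return np.mean(np.maximum(0.0, 1.0 - y * f_values))

def svm_objective(alpha, K, y, lam):
    """Regularized objective R_n(f) + lam * ||f||_K^2 for f = sum_i alpha_i K(., x_i),
    so that f(x_j) = (K @ alpha)_j and ||f||_K^2 = alpha' K alpha."""
    f_values = K @ alpha
    return empirical_hinge_risk(f_values, y) + lam * alpha @ K @ alpha
```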
High dimensional estimation and inference for various models have been investigated intensively in recent years; interested readers may refer to the two related books by Buhlmann and van de Geer (2019) and Giraud (2014). In particular, high dimensional inference for the linear (or additive) SVM has been widely studied, see (Tarigan and Geer, 2006; Zhao and Liu, 2012; Zhang et al., 2016; Peng et al., 2016). Precisely, Tarigan and Geer (2006) consider an $\ell_1$-penalized parametric estimation in high dimensions for the SVM and prove convergence rates for the excess risk under regularity conditions. Similarly, Zhao and Liu (2012) propose a group-Lasso type regularized approach for the nonparametric additive SVM, provide oracle properties of the estimator, and develop an efficient numerical algorithm to compute it. For the high dimensional linear SVM, Zhang et al. (2016) and Peng et al. (2016) explicitly investigate the statistical performance of the $\ell_1$-norm and non-convex-penalized SVM, such as variable selection consistency. However, all the aforementioned works only consider a single kernel in high dimensions. By contrast, a partially linear SVM has to take into account the mutual correlation between two kernels with different structures, as well as the mutual effects between the sparsity and the nonlinear functional complexity. Hence the non-asymptotic analysis of such semi-parametric models for the high dimensional SVM appears to be considerably more complicated than analyses based on a single kernel.

Under our partially linear setting, the whole input feature consists of two parts, $X=(Z,T)'$, where $Z\in\mathbb{R}^p$ has a linear relation to the response, while the sub-feature $T$ has a nonlinear effect on the response. Given all the observations $\{(Y_i,Z_i,T_i)\}_{i=1}^{n}$ with sample size $n$, we consider a two-fold regularized learning scheme for the high dimensional partially linear SVM, and the semi-parametric estimation pair $(\hat{\beta},\hat{g})$ is obtained by minimizing the following unconstrained optimization problem,
$$ \min_{(f=\beta'Z+g)\in\mathcal{F}}\Big\{R_n(f)+\lambda_n\|\beta\|_1+\mu_n\|g\|_K\Big\}, \tag{1.1} $$
where $(\lambda_n,\mu_n)$ are two regularization hyper-parameters controlling the sparsity of the coefficients and the functional complexity, respectively. In the partially linear setting, the adopted hypothesis space $\mathcal{F}$ for the SVM is a summation of the linear kernel and a general nonlinear kernel. More precisely,
$$ \mathcal{F}:=\big\{f(X)=\beta'Z+g(T),\ \beta\in\mathbb{R}^p,\ g\in\mathcal{H}_K\big\}. $$

To investigate the statistical performance of the proposed semi-parametric estimator (1.1), we introduce a population target function for the partially linear SVM within $\mathcal{F}$. In this paper, the target function we focus on is a global solution $f^*$ of the following population minimization over $\mathcal{F}$,
$$ \min_{f\in\mathcal{F}}R(f),\qquad\text{where } R(f):=E_{\rho}\big[\phi_h(Yf(X))\big]. \tag{1.2} $$
Under the partially linear framework, $f^*$ can be written as $f^*(X)=(\beta^*)'Z+g^*(T)$, where $g^*$ is the nonparametric component, belonging to a specific RKHS that will be defined in Section 2. For the parametric part, one often assumes that the structure of $\beta^*$ is sparse in the high dimensional setting, in the sense that the cardinality of $S=\{j:\beta^*_j\neq 0,\ j=1,2,...,p\}$ is far less than the ambient dimension $p$. Note that the target function $f^*$ is quite different from the Bayes rule, the latter being the optimal decision function taken over all measurable functions. We can treat $f^*$ as a sparse approximation to the Bayes rule within $\mathcal{F}$, particularly when the true function is not sparse.
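As a computational illustration of (1.1) (the paper itself is purely theoretical), the following hedged sketch assumes that the nonparametric part can be represented as $g=\sum_i\alpha_iK(\cdot,T_i)$ over the observed $T_i$, so that $\|g\|_K=\|K^{1/2}\alpha\|_2$, and then solves the resulting convex program with CVXPY; all function and variable names here are our own.

```python
import numpy as np
import cvxpy as cp
from scipy.linalg import sqrtm

def fit_partially_linear_svm(y, Z, K, lam, mu):
    """Sketch of (1.1): hinge risk + lam * ||beta||_1 + mu * ||g||_K,
    with g represented through the kernel matrix K over the observed T_i."""
    n, p = Z.shape
    beta, alpha = cp.Variable(p), cp.Variable(n)
    K_half = np.real(sqrtm(K))                        # ||g||_K = ||K^{1/2} alpha||_2
    margins = cp.multiply(y, Z @ beta + K @ alpha)    # y_i * f(X_i)
    risk = cp.sum(cp.pos(1 - margins)) / n            # empirical hinge risk R_n(f)
    objective = risk + lam * cp.norm1(beta) + mu * cp.norm(K_half @ alpha, 2)
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value, alpha.value
```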
As in the existing literature, we are not concerned with any approximation error induced by the sparse approximation or by kernel misspecification.

In this paper, we are primarily concerned with learning rates for the excess risk $R(\hat{f})-R(f^*)$ and with the estimation errors of the parametric and nonparametric estimators for the high dimensional SVM. Interestingly, the theoretical results reveal that the derived rate for the parametric estimator depends not only on the sample size and the sparsity parameter, but also on the functional complexity generated by the nonparametric component, and vice versa. As a byproduct, we develop a new weighted empirical process to refine our analysis; this is one of the key theoretical tools in the high dimensional literature on semi-parametric estimation.

The rest of this article is organized as follows. In Section 2, we introduce some basic notation on RKHSs, which is used to characterize the functional complexity, and we impose the regularity assumptions required to establish the convergence rates. At the end of Section 2, we state our main theoretical results in terms of the excess risk and the estimation errors. Section 3 is devoted to detailed proofs of the main theorems, and also proves some useful lemmas associated with the weighted empirical process. Section 4 concludes the paper with discussions and possible future research.

Notations. We use $[p]$ to denote the set $\{1,2,...,p\}$. For a vector $a=(a_1,a_2,...,a_p)\in\mathbb{R}^p$, the $\ell_q$-norm is defined as $\|a\|_q=\big(\sum_{i\in[p]}|a_i|^q\big)^{1/q}$. For two sequences of numbers $a_n$ and $b_n$, we write $a_n=O(b_n)$ if $a_n\le Cb_n$ for some finite positive constant $C$ and all $n$. If both $a_n=O(b_n)$ and $b_n=O(a_n)$, we write $a_n\simeq b_n$. We also write $a_n=\Omega(b_n)$ if $a_n\ge Cb_n$. For a function $f$, we denote its $L_2$-norm by $\|f\|_2=\big(\int_{\mathcal{X}}f^2(x)\,d\rho_X(x)\big)^{1/2}$ for some distribution $\rho_X$.

2 Conditions and Main Theorems
We begin with the background and notation required for the main statements of our problem. First of all, we introduce the notion of an RKHS. An RKHS can be defined by any symmetric and positive semidefinite kernel function $K:\mathcal{T}\times\mathcal{T}\to\mathbb{R}$. For each $t\in\mathcal{T}$, the function $t'\mapsto K(t',t)$ is contained in the Hilbert space $\mathcal{H}_K$; moreover, the Hilbert space is endowed with an inner product $\langle\cdot,\cdot\rangle_K$ such that $K(\cdot,t)$ acts as the representer of evaluation. In particular, the reproducing property of the RKHS plays an important role in the theoretical analysis and numerical optimization of any kernel-based method,
$$ f(t)=\langle f,K(\cdot,t)\rangle_K,\qquad\forall\,t\in\mathcal{T}. \tag{2.1} $$
This property also implies that $\|f\|_\infty\le\kappa\|f\|_K$ with $\kappa:=\max_{t\in\mathcal{T}}\sqrt{K(t,t)}<\infty$. Moreover, by Mercer's theorem, a kernel $K$ defined on a compact subset $\mathcal{T}$ admits the eigen-decomposition
$$ K(t,t')=\sum_{\ell=1}^{\infty}\mu_\ell\phi_\ell(t)\phi_\ell(t'),\qquad t,t'\in\mathcal{T}, \tag{2.2} $$
where $\mu_1\ge\mu_2\ge\cdots>0$ and $\{\phi_\ell\}_{\ell=1}^{\infty}$ is an orthonormal basis of $L_2(\rho_T)$. The decay rate of $\mu_\ell$ completely characterizes the complexity of the RKHS induced by the kernel $K$, and it generally has equivalent relationships with various entropy numbers, see Steinwart and Christmann (2008) for details. With these preparations, we define the quantity
$$ Q_n(r)=\frac{1}{\sqrt{n}}\Big[\sum_{\ell=1}^{\infty}\min\{r^2,\mu_\ell\}\Big]^{1/2},\qquad\forall\,r>0. \tag{2.3} $$
Let $\nu_n$ be the smallest positive solution to the inequality $40\nu_n^2\ge Q_n(\nu_n)$, where 40 is only a technical constant.

Then, due to the mutual effects between the high dimensional parametric component and the nonparametric one, we introduce the following quantity, which governs the convergence rates of the semi-parametric estimate,
$$ \gamma_n:=\max\Big\{\nu_n,\ \sqrt{\frac{\log p}{n}}\Big\}. \tag{2.4} $$

We now describe our main assumptions. Our first assumption deals with the tail behavior of the covariates of the linear part.

Assumption A. (i) For simplicity, we assume that $\|Z\|_\infty\le C_1<\infty$ for some positive constant $C_1$; (ii) the largest eigenvalue of $E[ZZ']$ is finite and denoted by $\Lambda_{\max}>0$.

Boundedness of the $Z$-values is a restrictive assumption, ruling out standard sub-gaussian covariates. However, one can usually approximate an unbounded distribution by its truncated version. Imposing such an assumption is only for technical simplicity, and it may be relaxed to general thin-tailed random variables. Assumption A(ii) is fairly standard in the literature to identify the coefficients associated with $Z$.

Assumption B. There exist constants $C_0>0$ and $\zeta>1$ such that, for every $f\in\mathcal{F}$, inequality (2.5) holds,
$$ R(\beta,g)-R(\beta^*,g^*)\ge C_0\|f-f^*\|_2^{\zeta}. \tag{2.5} $$

The parameter $\zeta$ is called the Bernstein parameter, introduced by Bartlett and Mendelson (2006) and Pierre et al. (2019). Fast rates are usually obtained when $\zeta=2$. This condition is essentially a quantification of the identifiability of the objective function at its minimum $f^*$. Note that the Bernstein parameter is slightly different from the classical margin parameter adopted by Tarigan and Geer (2006) and Chen et al. (2004).

To estimate the parametric and nonparametric parts separately, some conditions on the correlation between $Z$ and $T$ are required. For each $j\in[p]$, let $\Pi^{(j)}_T$ be the projection of $Z^{(j)}$ onto $\mathcal{H}_K$. To be precise, $\Pi^{(j)}_T=g^*_j(T)$ with
$$ g^*_j=\arg\min_{g\in\mathcal{H}_K}E_{Z^{(j)},T}\big[(Z^{(j)}-g(T))^2\big]. \tag{2.6} $$
Let $\Pi_{Z|T}:=(\Pi^{(1)}_T,...,\Pi^{(p)}_T)'$ and $Z_T=Z-\Pi_{Z|T}$. Each function $g^*_j$ can be viewed as the best approximation of $E[Z^{(j)}|T]$ within $\mathcal{H}_K$. In the extreme case where $Z$ is uncorrelated with $T$, $\Pi_{Z|T}=0$.
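Returning to the complexity quantities (2.3) and (2.4), the small Python sketch below (our own illustration; the bisection bracket and tolerances are assumptions) computes $Q_n$, the critical radius $\nu_n$, and $\gamma_n$ numerically from a truncated eigenvalue sequence.

```python
import numpy as np

def gamma_n(mu, n, p, c=40.0):
    """Compute Q_n(r) of (2.3) from kernel eigenvalues mu, the critical radius nu_n
    (smallest r with c*r^2 >= Q_n(r)), and gamma_n = max{nu_n, sqrt(log p / n)} of (2.4)."""
    mu = np.asarray(mu, dtype=float)
    Q = lambda r: np.sqrt(np.minimum(r ** 2, mu).sum() / n)
    lo, hi = 1e-12, 10.0                     # assume the crossing point lies in this bracket
    for _ in range(200):                     # bisection on the sign of c*r^2 - Q_n(r)
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if c * mid ** 2 >= Q(mid) else (mid, hi)
    nu_n = hi
    return max(nu_n, np.sqrt(np.log(p) / n))

# Example with polynomial eigen-decay mu_l ~ l^{-2*alpha} (cf. Corollary 2)
print(gamma_n([l ** -3.0 for l in range(1, 5001)], n=1000, p=20000))
```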
The following condition is quite common in semi-parametric estimation (Muller and van de Geer, 2015), and it ensures that there is enough information in the data to identify the parametric coefficients.

Assumption C. The smallest eigenvalue of $E[Z_TZ_T']$ is bounded below by a constant $\Lambda_{\min}>0$.

By the definition of the projection $\Pi_{Z|T}$, the following orthogonal decomposition holds with respect to the $\|\cdot\|_2$-norm,
$$ \|\beta'Z+g(T)\|_2^2=\|\beta'Z_T\|_2^2+\|\beta'\Pi_{Z|T}+g(T)\|_2^2,\qquad\forall\,f\in\mathcal{F}. \tag{2.7} $$
This equality ensures that the parametric estimation error can be separated from the total estimation error, which is very useful in our proofs.

We are now in a position to derive the learning rate of the estimator $(\hat{\beta},\hat{g})$ defined by the minimization (1.1). We allow the dimension $p$ and the number of active covariates $s:=|S|$ to increase with the sample size $n$, while $s\ll p$ and the dimension of $T$ is fixed.

Theorem 1.
Let $(\hat{\beta},\hat{g})$ be the proposed semi-parametric estimator for the SVM defined in (1.1), with the regularization parameters $\lambda_n\simeq\sqrt{\log p/n}$ and $\mu_n\simeq\gamma_n^2$. If Assumptions A, B and C hold, then with probability at least $1-p^{-A/2}-p^{-A}$ for some $A>0$,
$$ R(\hat{\beta},\hat{g})-R(\beta^*,g^*)=O\Big(\big(\gamma_n+\sqrt{s\log p/n}\big)^{\frac{\zeta}{\zeta-1}}\Big), \tag{2.8} $$
and at the same time the estimation errors satisfy
$$ \|\hat{\beta}-\beta^*\|_2=O\Big(\big(\gamma_n+\sqrt{s\log p/n}\big)^{\frac{1}{\zeta-1}}\Big),\qquad \|\hat{g}-g^*\|_2=O\Big(\big(\gamma_n+\sqrt{s\log p/n}\big)^{\frac{1}{\zeta-1}}\Big). \tag{2.9} $$

Remark that this rate may be interpreted as the sum of a subset selection term ($\sqrt{s\log p/n}$) for the linear part and a fixed dimensional nonparametric estimation term ($\nu_n$). Depending on the scaling of the triple $(n,p,s)$ and the smoothness of the RKHS $\mathcal{H}_K$, either the subset selection term or the nonparametric estimation term may dominate the estimation. In general, if $\nu_n^2=o(s\log p/n)$, the $s$-dimensional parametric term dominates the estimation, and vice versa otherwise; at the boundary, the two terms are of the same order. In the best situation ($\zeta=2$), our derived rate for the excess risk is the same as the optimal rate achieved by least squares approaches, see (Koltchinskii and Yuan, 2010; Muller and van de Geer, 2015) for details.

Note also that it is easy to check that Theorem 1 still holds if $p$ in the confidence probability is replaced by an arbitrary $\tilde{p}\ge p$ such that $\log\tilde{p}\ge\log n$. In this case, the divergence of $p$ is not needed and the probability bound in the theorem becomes $1-\tilde{p}^{-A/2}-\tilde{p}^{-A}$.

A number of corollaries of Theorem 1 can be obtained for particular choices of kernels. First, we consider finite-dimensional, rank-$m$ kernel operators, i.e., kernel functions $K$ that can be expressed in terms of $m$ eigenfunctions. Such kernels include linear functions, polynomial functions, as well as function classes based on finite dictionary expansions.
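The dominance discussion above can be checked numerically; the toy comparison below is our own illustration (all constants ignored, and a finite-rank kernel is assumed so that $\nu_n^2\asymp m/n$) for one hypothetical scaling of $(n,p,s,m)$.

```python
import numpy as np

n, p, s, m = 2000, 10_000, 10, 50            # hypothetical sample size and dimensions
subset_term = s * np.log(p) / n              # parametric (subset selection) part: s*log(p)/n
nonparametric_term = m / n                   # nu_n^2 for a rank-m kernel (constants ignored)
print(subset_term, nonparametric_term)       # the larger term dominates the rate at zeta = 2
```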
Corollary 1. Under the same conditions as in Theorem 1, consider a nonlinear kernel with finite rank $m$. Then the semi-parametric estimator for the SVM defined in (1.1), with $\lambda_n\simeq\sqrt{\log p/n}$ and $\mu_n\simeq\gamma_n^2$, satisfies
$$ R(\hat{\beta},\hat{g})-R(\beta^*,g^*)=O_p\Big(\Big(\frac{s\log p}{n}+\frac{m}{n}\Big)^{\frac{\zeta}{2(\zeta-1)}}\Big), \tag{2.10} $$
where
$$ \|\hat{\beta}-\beta^*\|_2=O_p\Big(\Big(\frac{s\log p}{n}+\frac{m}{n}\Big)^{\frac{1}{2(\zeta-1)}}\Big),\qquad \|\hat{g}-g^*\|_2=O_p\Big(\Big(\frac{s\log p}{n}+\frac{m}{n}\Big)^{\frac{1}{2(\zeta-1)}}\Big). $$
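As a sanity check on Corollary 1 (anticipating the remark that follows), the short worked computation below, under the stated finite-rank assumption, turns the definition (2.3) into the critical radius used above; this display is our addition.

```latex
% A rank-m kernel has at most m nonzero eigenvalues, so from (2.3)
Q_n(r) \;=\; \frac{1}{\sqrt{n}}\Big[\sum_{\ell=1}^{m}\min\{r^2,\mu_\ell\}\Big]^{1/2}
       \;\le\; \frac{1}{\sqrt{n}}\big[m\,r^2\big]^{1/2} \;=\; r\sqrt{\frac{m}{n}}.
% The defining inequality 40\,\nu_n^2 \ge Q_n(\nu_n) is therefore already satisfied once
% 40\,\nu_n \ge \sqrt{m/n}, i.e.
\nu_n \;\lesssim\; \sqrt{\frac{m}{n}},\qquad\text{hence}\qquad
\gamma_n^2 \;\lesssim\; \frac{m}{n}+\frac{\log p}{n},
% which, combined with Theorem 1, gives the rate displayed in (2.10).
```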
For a finite rank kernel and any $r>0$, we have $Q_n(r)\le r\sqrt{m/n}$, and the corollary then follows from Theorem 1. Corollary 1 corresponds to the linear case for the SVM when $s\simeq m$. The existing theory on the linear SVM has paid constant attention to the analysis of the generalization error and variable selection consistency. Zhang et al. (2016) consider the non-convex penalized SVM in terms of variable selection consistency and the oracle property in high dimensions; however, their results rely on a restrictive condition of the form $p\ll n^{1/2}$, so the ultra-high dimensional case ($p=O(e^{n^r})$ with $r<1$) is excluded. Peng et al. (2016) establish an error bound for the linear SVM coefficients under the $\ell_2$-norm, of order $\sqrt{s\log p/n}$, which is the same as our rate in Corollary 1 when $\zeta=2$.

Secondly, we state a result for an RKHS with countably many eigenvalues, decaying at a rate $\mu_\ell\simeq(1/\ell)^{2\alpha}$ for some smoothness parameter $\alpha>1/2$.
In fact, this type of scaling covers the Sobolev spaces, consisting of functions with $\alpha$ derivatives.

Corollary 2.
Under the same conditions as in Theorem 1, consider a kernel with eigenvalue decay $\mu_\ell\simeq(1/\ell)^{2\alpha}$ for some $\alpha>1/2$. Then the semi-parametric estimator defined in (1.1), with $\lambda_n\simeq\sqrt{\log p/n}$ and $\mu_n\simeq\gamma_n^2$, satisfies
$$ R(\hat{\beta},\hat{g})-R(\beta^*,g^*)=O_p\Big(\Big(\frac{s\log p}{n}+n^{-\frac{2\alpha}{2\alpha+1}}\Big)^{\frac{\zeta}{2(\zeta-1)}}\Big), \tag{2.11} $$
where
$$ \|\hat{\beta}-\beta^*\|_2=O_p\Big(\Big(\frac{s\log p}{n}+n^{-\frac{2\alpha}{2\alpha+1}}\Big)^{\frac{1}{2(\zeta-1)}}\Big),\qquad \|\hat{g}-g^*\|_2=O_p\Big(\Big(\frac{s\log p}{n}+n^{-\frac{2\alpha}{2\alpha+1}}\Big)^{\frac{1}{2(\zeta-1)}}\Big). $$

For this corollary, we need to compute the critical univariate rate $\nu_n$. Given the assumed polynomial eigenvalue decay, a truncation argument shows that $Q_n(r)=O\big(r^{1-\frac{1}{2\alpha}}/\sqrt{n}\big)$, which yields $\nu_n\simeq n^{-\frac{\alpha}{2\alpha+1}}$. As opposed to Corollary 1, we now discuss the special case where the functional complexity dominates the estimation, that is, the excess risk is of order $O\big(n^{-\frac{\alpha}{2\alpha+1}\cdot\frac{\zeta}{\zeta-1}}\big)$. This is a better rate compared with those in Chen et al. (2004) and Wu et al. (2007). The learning rate in Chen et al. (2004) is expressed in terms of a separation parameter $\theta$, which plays the role of our $\zeta$, and of a power $\eta$ appearing in their covering number bound, with $\eta$ determined by the smoothness $\alpha$; our rate can be shown to be better than theirs in the best case $\zeta=2$. A similar argument applies to the result in Wu et al. (2007).

3 Proofs

In this section, we provide the proof of our main theorem (Theorem 1). At a high level, the proof is based on an appropriate adaptation to the semi-parametric setting of various techniques developed for sparse linear regression or additive nonparametric estimation in high dimensions (Buhlmann and van de Geer, 2019). In contrast to the parametric or additive settings, it relies on structural tools from empirical process theory to control the error terms arising in the semi-parametric case. In particular, we make use of several concentration theorems for empirical processes (Geer, 2000), as well as results on the Rademacher complexity of kernel classes (Bartlett et al., 2005).
Proof of Theorem 1. We write the total empirical objective as
$$ L(\beta,g)=R_n(\beta,g)+\lambda_n\|\beta\|_1+\mu_n\|g\|_K. \tag{3.1} $$
The population risk for the partially linear SVM is defined by
$$ R(\beta,g)=E\big[\phi_h\big(Y[\beta'Z+g(T)]\big)\big]. \tag{3.2} $$
By the definition of $(\hat{\beta},\hat{g})$, it holds that $L(\hat{\beta},\hat{g})\le L(\beta^*,g^*)$, that is,
$$ R_n(\hat{\beta},\hat{g})+\lambda_n\|\hat{\beta}\|_1+\mu_n\|\hat{g}\|_K\le R_n(\beta^*,g^*)+\lambda_n\|\beta^*\|_1+\mu_n\|g^*\|_K. \tag{3.3} $$
The inequality (3.3) can be rewritten in the form
$$ R(\hat{\beta},\hat{g})-R(\beta^*,g^*)+\lambda_n\|\hat{\beta}-\beta^*\|_1+\frac{\mu_n}{2}\|\hat{g}-g^*\|_K\le R(\hat{\beta},\hat{g})-R_n(\hat{\beta},\hat{g})+R_n(\beta^*,g^*)-R(\beta^*,g^*)+2\lambda_n\|(\hat{\beta}-\beta^*)_S\|_1+2\mu_n\|g^*\|_K \tag{3.4} $$
(the elementary computation behind this step is recorded just before Lemma 1 below). For simplicity, we denote
$$ \nu_n(f):=\nu_n(\beta,g)=R_n(\beta,g)-R(\beta,g),\qquad\forall\,f(X)=\beta'Z+g(T). \tag{3.5} $$
In order to derive an upper bound for $|\nu_n(\hat{f})-\nu_n(f^*)|$, a new weighted empirical process is introduced in our semi-parametric high dimensional setting; the process is related to a uniform law of large numbers over a mixed function space. Lemma 1 below can be derived via the peeling device, which is often used in probability theory.
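Before stating Lemma 1, we record, for completeness, the elementary computation behind the step from (3.3) to (3.4); this display is our addition and uses only that $\beta^*$ is supported on $S$ and the triangle inequality.

```latex
% From (3.3): R_n(\hat\beta,\hat g) - R_n(\beta^*,g^*)
%   \le \lambda_n(\|\beta^*\|_1 - \|\hat\beta\|_1) + \mu_n(\|g^*\|_K - \|\hat g\|_K).
% Adding \lambda_n\|\hat\beta-\beta^*\|_1 + (\mu_n/2)\|\hat g-g^*\|_K to both sides and using
% that \beta^* is supported on S together with the triangle inequality,
\lambda_n\big(\|\beta^*\|_1-\|\hat\beta\|_1\big)+\lambda_n\|\hat\beta-\beta^*\|_1
  \;\le\; 2\lambda_n\|(\hat\beta-\beta^*)_S\|_1,
\qquad
\mu_n\big(\|g^*\|_K-\|\hat g\|_K\big)+\tfrac{\mu_n}{2}\|\hat g-g^*\|_K
  \;\le\; 2\mu_n\|g^*\|_K,
% and adding and subtracting the population risks R(\hat\beta,\hat g), R(\beta^*,g^*)
% on the left-hand side yields exactly (3.4).
```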
Lemma 1.
Let $\mathcal{E}$ be the event
$$ \mathcal{E}:=\Big\{|\nu_n(f)-\nu_n(f^*)|\le D_1\Big(\sqrt{\tfrac{\log p}{n}}\,\|\beta-\beta^*\|_1+\gamma_n\|g-g^*\|_2+\gamma_n^2\|g-g^*\|_K+e^{-p}\Big),\ \forall\,f\in\mathcal{F}\Big\}, \tag{3.6} $$
where $D_1$ is a constant specified in the proof of Lemma 1. Because $p=\Omega(\log n)$ and $p=o(e^{n})$, inequality (3.7) holds for some universal constant $A>0$,
$$ P(\mathcal{E})\ge 1-p^{-A/2}-p^{-A}. \tag{3.7} $$

We continue the proof started at (3.4) using the result established in Lemma 1. Applying the weighted empirical process bound on the event $\mathcal{E}$, we obtain from (3.4) that
$$ R(\hat{\beta},\hat{g})-R(\beta^*,g^*)+\lambda_n\|\hat{\beta}-\beta^*\|_1+\frac{\mu_n}{2}\|\hat{g}-g^*\|_K\le D_1\Big(\sqrt{\tfrac{\log p}{n}}\,\|\hat{\beta}-\beta^*\|_1+\gamma_n\|\hat{g}-g^*\|_2+\gamma_n^2\|\hat{g}-g^*\|_K+e^{-p}\Big)+2\lambda_n\|(\hat{\beta}-\beta^*)_S\|_1+2\mu_n\|g^*\|_K. \tag{3.8} $$
Therefore, when the conditions $\lambda_n\ge 2D_1\sqrt{\log p/n}$ and $\mu_n\ge 4D_1\gamma_n^2$ are both satisfied, and since $p=\Omega(\log n)$ implies $e^{-p}=O(n^{-1})=O(\gamma_n^2)$, inequality (3.8) yields
$$ R(\hat{\beta},\hat{g})-R(\beta^*,g^*)+\frac{\lambda_n}{2}\|\hat{\beta}-\beta^*\|_1+\frac{\mu_n}{4}\|\hat{g}-g^*\|_K\le D_1\big(\gamma_n\|\hat{g}-g^*\|_2+\gamma_n^2\big)+2\lambda_n\|(\hat{\beta}-\beta^*)_S\|_1+2\mu_n\|g^*\|_K, \tag{3.9} $$
where we use the basic inequality $2uv\le u^2+v^2$. With Assumptions B and C and the equality (2.7), we can derive
$$ R(\hat{\beta},\hat{g})-R(\beta^*,g^*)\ge C_0\|\hat{f}-f^*\|_2^{\zeta}\ge C_0\|(\hat{\beta}-\beta^*)'Z_T\|_2^{\zeta}\ge C_0\Lambda_{\min}^{\zeta/2}\|\hat{\beta}-\beta^*\|_2^{\zeta}. \tag{3.10} $$
Moreover, after some simple computations, Assumption A implies
$$ \|\hat{g}-g^*\|_2\le\|\hat{f}-f^*\|_2+\Lambda_{\max}^{1/2}\|\hat{\beta}-\beta^*\|_2,\qquad \lambda_n\|(\hat{\beta}-\beta^*)_S\|_1\le\lambda_n\sqrt{s}\,\|(\hat{\beta}-\beta^*)_S\|_2\le\lambda_n\sqrt{s}\,\|\hat{\beta}-\beta^*\|_2, \tag{3.11} $$
where the last inequality follows from the Cauchy–Schwarz inequality. Substituting (3.11) into (3.9), we obtain
$$ C_0\|\hat{f}-f^*\|_2^{\zeta}+\frac{\lambda_n}{2}\|\hat{\beta}-\beta^*\|_1+\frac{\mu_n}{4}\|\hat{g}-g^*\|_K\le D_1\gamma_n\|\hat{f}-f^*\|_2+D_1\gamma_n\Lambda_{\max}^{1/2}\|\hat{\beta}-\beta^*\|_2+D_1\gamma_n^2+2\sqrt{s}\,\lambda_n\|\hat{\beta}-\beta^*\|_2+2\mu_n\|g^*\|_K. \tag{3.12} $$
For any $\theta>0$, the Young inequality ($uv\le u^q/q+v^p/p$ with $1/q+1/p=1$) gives
$$ \gamma_n\|\hat{f}-f^*\|_2\le\frac{\zeta-1}{\zeta}\,\theta^{-\frac{\zeta}{\zeta-1}}\gamma_n^{\frac{\zeta}{\zeta-1}}+\frac{\theta^{\zeta}}{\zeta}\|\hat{f}-f^*\|_2^{\zeta}, \tag{3.13} $$
$$ c_n\|\hat{\beta}-\beta^*\|_2\le\frac{\zeta-1}{\zeta}\,\theta^{-\frac{\zeta}{\zeta-1}}c_n^{\frac{\zeta}{\zeta-1}}+\frac{\theta^{\zeta}}{\zeta}\|\hat{\beta}-\beta^*\|_2^{\zeta}, \tag{3.14} $$
where $c_n:=D_1\gamma_n\Lambda_{\max}^{1/2}+2\sqrt{s}\,\lambda_n$. If $\theta$ is small enough to satisfy $D_1\theta^{\zeta}\le C_0\zeta/2$, then (3.12) and (3.13) imply
$$ \frac{C_0}{2}\|\hat{f}-f^*\|_2^{\zeta}+\frac{\lambda_n}{2}\|\hat{\beta}-\beta^*\|_1+\frac{\mu_n}{4}\|\hat{g}-g^*\|_K\le D_1\frac{\zeta-1}{\zeta}\theta^{-\frac{\zeta}{\zeta-1}}\gamma_n^{\frac{\zeta}{\zeta-1}}+D_1\gamma_n^2+c_n\|\hat{\beta}-\beta^*\|_2+2\mu_n\|g^*\|_K. \tag{3.15} $$
Furthermore, combining (3.10), (3.14) and (3.15), we conclude that
$$ \frac{C_0}{2}\Lambda_{\min}^{\zeta/2}\|\hat{\beta}-\beta^*\|_2^{\zeta}+\frac{\lambda_n}{2}\|\hat{\beta}-\beta^*\|_1+\frac{\mu_n}{4}\|\hat{g}-g^*\|_K\le D_1\frac{\zeta-1}{\zeta}\theta^{-\frac{\zeta}{\zeta-1}}\gamma_n^{\frac{\zeta}{\zeta-1}}+D_1\gamma_n^2+2\mu_n\|g^*\|_K+\frac{\zeta-1}{\zeta}\theta^{-\frac{\zeta}{\zeta-1}}c_n^{\frac{\zeta}{\zeta-1}}, \tag{3.16} $$
where the additional condition $\theta^{\zeta}\le C_0\zeta\Lambda_{\min}^{\zeta/2}/2$ is required so that the term $\frac{\theta^{\zeta}}{\zeta}\|\hat{\beta}-\beta^*\|_2^{\zeta}$ can be absorbed. In this case, we can derive
$$ \|\hat{\beta}-\beta^*\|_2^{\zeta}=O_p\big((\gamma_n+\sqrt{s}\,\lambda_n)^{\frac{\zeta}{\zeta-1}}+\mu_n\big), \tag{3.17} $$
$$ \lambda_n\|\hat{\beta}-\beta^*\|_1=O_p\big((\gamma_n+\sqrt{s}\,\lambda_n)^{\frac{\zeta}{\zeta-1}}+\mu_n\big). \tag{3.18} $$
Moreover, based on (3.15) and (3.17), we obtain
$$ \|\hat{f}-f^*\|_2^{\zeta}=O_p\big((\gamma_n+\sqrt{s}\,\lambda_n)^{\frac{\zeta}{\zeta-1}}+\mu_n+\gamma_n\mu_n^{1/\zeta}\big)=O_p\big(\gamma_n^{\frac{\zeta}{\zeta-1}}+\gamma_n^{\frac{\zeta+2}{\zeta}}\big), \tag{3.19} $$
where we choose $\mu_n\simeq\gamma_n^2$ and use $\gamma_n=\Omega(\sqrt{s}\,\lambda_n)$ in the last step. Therefore, by the triangle inequality together with (3.11), it is concluded that
$$ \|\hat{g}-g^*\|_2=O_p\big((\gamma_n+\sqrt{s}\,\lambda_n)^{\frac{1}{\zeta-1}}\big). \tag{3.20} $$
Finally, plugging the derived upper bounds into (3.9), we obtain the desired upper bound for the excess risk $R(\hat{\beta},\hat{g})-R(\beta^*,g^*)$. This completes the proof. □

In order to prove Lemma 1, an auxiliary concentration result, in the form of a uniform law of large numbers, is required; it is stated as Lemma 2 below (Massart, 2000).
Lemma 2.
Let $U_1,...,U_n$ be independent and identically distributed copies of a random variable $U\in\mathcal{U}$. Let $\Gamma$ be a class of real-valued functions on $\mathcal{U}$ satisfying $\sup_u|\gamma(u)|\le D$ for all $\gamma\in\Gamma$. Define
$$ \mathcal{Z}:=\sup_{\gamma\in\Gamma}\Big|\frac{1}{n}\sum_{i=1}^{n}\big\{\gamma(U_i)-E[\gamma(U_i)]\big\}\Big|,\qquad B^2:=\sup_{\gamma\in\Gamma}\mathrm{var}\big(\gamma(U)\big). $$
Then there exists a universal constant $N_0$ such that
$$ P\Big(\mathcal{Z}\ge N_0\big[E[\mathcal{Z}]+B\sqrt{r/n}+Dr/n\big]\Big)\le\exp(-r),\qquad\forall\,r>0. $$
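To build intuition for Lemma 2, the following Monte Carlo sketch (our own illustration; the function class, the sample sizes, and the choice $N_0=1$ are assumptions made purely for demonstration) compares the empirical tail probability of the supremum with the bound $\exp(-r)$ on a toy class of centered indicator functions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, r = 500, 2000, 3.0

# Toy bounded class on U = [0, 1]: gamma_t(u) = 1{u <= t} - t (centered),
# so sup|gamma_t| <= 1 (take D = 1) and var(gamma_t(U)) = t(1 - t) <= 1/4 (take B = 1/2).
ts = np.linspace(0.05, 0.95, 19)

def sup_emp_process(u):
    # Z = sup_t | (1/n) sum_i gamma_t(U_i) |, with E[gamma_t(U)] = 0 for U ~ Unif[0, 1]
    return np.max(np.abs(np.mean((u[:, None] <= ts) - ts, axis=0)))

Z_sup = np.array([sup_emp_process(rng.uniform(size=n)) for _ in range(reps)])
EZ = Z_sup.mean()
bound = EZ + 0.5 * np.sqrt(r / n) + 1.0 * r / n   # E[Z] + B*sqrt(r/n) + D*r/n, with N_0 = 1
# N_0 = 1 is only illustrative; the lemma guarantees the bound for some universal N_0.
print(f"P(Z >= bound) ~ {np.mean(Z_sup >= bound):.4f}   vs   exp(-r) = {np.exp(-r):.4f}")
```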
Proof of Lemma 1. For any $f(x)=\beta'z+g(t)$, define, in order to apply Lemma 2,
$$ \gamma(u)=\phi_h\big(y[\beta'z+g(t)]\big)-\phi_h\big(y[(\beta^*)'z+g^*(t)]\big),\qquad \beta\in\mathbb{R}^p,\ g\in\mathcal{H}_K, \tag{3.21} $$
where $u=(y,z,t)$. Based on (3.21), a bounded set of functions is introduced,
$$ \Gamma_{\Delta}:=\Big\{\gamma:\ \sqrt{\tfrac{\log p}{n}}\,\|\beta-\beta^*\|_1\le\Delta_{\beta},\ \gamma_n\|g-g^*\|_2\le\Delta_{-},\ \gamma_n^2\|g-g^*\|_K\le\Delta_{+}\Big\}, $$
for a fixed triplet $(\Delta_{\beta},\Delta_{-},\Delta_{+})$. Since $\phi_h(\cdot)$ in (3.21) is Lipschitz with constant 1, for any $u$,
$$ |\gamma(u)|\le|(\beta-\beta^*)'z|+|g(t)-g^*(t)|\le C_1\|\beta-\beta^*\|_1+\kappa\|g-g^*\|_K\le C_1\sqrt{\tfrac{n}{\log p}}\,\Delta_{\beta}+\kappa\,\tfrac{n}{\log p}\,\Delta_{+}. \tag{3.22} $$
Inequality (3.22) implies that, in Lemma 2, we may take
$$ D:=C_1\sqrt{\tfrac{n}{\log p}}\,\Delta_{\beta}+\kappa\,\tfrac{n}{\log p}\,\Delta_{+}. \tag{3.23} $$
The Lipschitz property also gives
$$ B^2\le 2E\big[((\beta-\beta^*)'Z)^2\big]+2E\big[(g(T)-g^*(T))^2\big]\le\tfrac{2n}{\log p}\big(C_1^2\Delta_{\beta}^2+\Delta_{-}^2\big), \tag{3.24} $$
so we may take
$$ B:=\sqrt{\tfrac{2n}{\log p}\big(C_1^2\Delta_{\beta}^2+\Delta_{-}^2\big)}. \tag{3.25} $$
Plugging (3.23) and (3.25) into Lemma 2 yields
$$ P\Big(\mathcal{Z}\ge N_0\Big[E[\mathcal{Z}]+\sqrt{\tfrac{2r}{\log p}\big(C_1^2\Delta_{\beta}^2+\Delta_{-}^2\big)}+C_1\tfrac{r}{\sqrt{n\log p}}\,\Delta_{\beta}+\kappa\tfrac{r}{\log p}\,\Delta_{+}\Big]\Big)\le\exp(-r),\qquad\forall\,r\in(0,n). \tag{3.26} $$
It remains to bound $E[\mathcal{Z}]$. Let $\sigma_1,...,\sigma_n$ be a Rademacher sequence independent of $(Y_1,X_1),...,(Y_n,X_n)$. By symmetrization and the contraction inequality,
$$ E[\mathcal{Z}]\le 2E\Big(\sup_{\gamma\in\Gamma_{\Delta}}\Big|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\big(f(X_i)-f^*(X_i)\big)\Big|\Big)\le 4E\Big(\sup_{\gamma\in\Gamma_{\Delta}}\Big|\frac{1}{n}\sum_{i=1}^{n}\sigma_i(\beta-\beta^*)'Z_i\Big|\Big)+4E\Big(\sup_{\gamma\in\Gamma_{\Delta}}\Big|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\big(g(T_i)-g^*(T_i)\big)\Big|\Big). \tag{3.27} $$
By the Bernstein inequality and the union bound,
$$ E\Big(\sup_{\gamma\in\Gamma_{\Delta}}\Big|\frac{1}{n}\sum_{i=1}^{n}\sigma_i(\beta-\beta^*)'Z_i\Big|\Big)\le E\Big(\sup_{j\in[p]}\Big|\frac{1}{n}\sum_{i=1}^{n}\sigma_iZ_{ij}\Big|\Big)\,\sup_{\gamma\in\Gamma_{\Delta}}\|\beta-\beta^*\|_1\le C_2\Delta_{\beta}. \tag{3.28} $$
Moreover, applying Talagrand's concentration inequality once again, with probability at least $1-e^{-r}$,
$$ E\Big(\sup_{\gamma\in\Gamma_{\Delta}}\Big|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\big(g(T_i)-g^*(T_i)\big)\Big|\Big)\le N_0\Big(\sup_{\gamma\in\Gamma_{\Delta}}\Big|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\big(g(T_i)-g^*(T_i)\big)\Big|+\sqrt{\tfrac{r}{\log p}}\,\Delta_{-}+\kappa\tfrac{r}{\log p}\,\Delta_{+}\Big). \tag{3.29} $$
Combining this with the result of Koltchinskii and Yuan (2010) that
$$ \sup_{\gamma\in\Gamma_{\Delta}}\Big|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\big(g(T_i)-g^*(T_i)\big)\Big|\le\nu_n\|g-g^*\|_2+\nu_n^2\|g-g^*\|_K,\qquad\forall\,g\in\mathcal{H}_K, $$
which holds with probability at least $1-p^{-A/2}$ for some $A>0$, we conclude that, with probability at least $1-e^{-r}-p^{-A/2}$,
$$ E\Big(\sup_{\gamma\in\Gamma_{\Delta}}\Big|\frac{1}{n}\sum_{i=1}^{n}\sigma_i\big(g(T_i)-g^*(T_i)\big)\Big|\Big)\le N_0\Big(\Delta_{-}+\Delta_{+}+\sqrt{\tfrac{r}{\log p}}\,\Delta_{-}+\kappa\tfrac{r}{\log p}\,\Delta_{+}\Big). \tag{3.30} $$
Collecting (3.26)–(3.30), on an event of probability at least $1-e^{-r}-p^{-A/2}$,
$$ \mathcal{Z}\le L_0\Big[\Delta_{\beta}+\Delta_{-}+\Delta_{+}+\sqrt{\tfrac{r}{\log p}}\big(\Delta_{\beta}+\Delta_{-}\big)+\tfrac{r}{\log p}\,\Delta_{+}\Big],\qquad\forall\,r\in(0,n), \tag{3.31} $$
where $L_0$ is a constant depending only on $N_0$, $C_1$, $C_2$ and $\kappa$.

We now choose $r=A\log p$ so as to obtain a weighted empirical process bound that holds uniformly over the constraints
$$ e^{-p}\le\Delta_{\beta}\le e^{p},\qquad e^{-p}\le\Delta_{-}\le e^{p},\qquad e^{-p}\le\Delta_{+}\le e^{p}. \tag{3.32} $$
To this end, we apply the peeling device: set
$$ \Delta^k_{\beta}=\Delta^k_{-}=\Delta^k_{+}:=2^{-k},\qquad k=-p,-p+1,...,p-1,p, $$
and
$$ \Gamma^{k,l,h}_{\Delta}:=\Big\{\gamma:\ \tfrac{1}{2}\Delta^k_{\beta}\le\sqrt{\tfrac{\log p}{n}}\,\|\beta-\beta^*\|_1\le\Delta^k_{\beta},\ \ \tfrac{1}{2}\Delta^l_{-}\le\gamma_n\|g-g^*\|_2\le\Delta^l_{-},\ \ \tfrac{1}{2}\Delta^h_{+}\le\gamma_n^2\|g-g^*\|_K\le\Delta^h_{+}\Big\}. \tag{3.33} $$
Based on (3.31), for any triplet $(\Delta^k_{\beta},\Delta^l_{-},\Delta^h_{+})$ satisfying (3.32) and (3.33), there is an event $F(\Delta^k_{\beta},\Delta^l_{-},\Delta^h_{+})$ with $P\big(F(\Delta^k_{\beta},\Delta^l_{-},\Delta^h_{+})\big)\ge 1-p^{-A}-p^{-A/2}$ on which
$$ \mathcal{Z}\le L_0\Big[\Delta^k_{\beta}+\Delta^l_{-}+\Delta^h_{+}+\sqrt{A}\big(\Delta^k_{\beta}+\Delta^l_{-}\big)+A\Delta^h_{+}\Big]. \tag{3.34} $$
Consider the event $\mathcal{E}':=\bigcap_{k,l,h}F(\Delta^k_{\beta},\Delta^l_{-},\Delta^h_{+})$; the number of events in this intersection is bounded by $(2p+1)^3$. Therefore, the probability of $\mathcal{E}'$ can be bounded from below as
$$ P(\mathcal{E}')\ge 1-(2p+1)^3\big(e^{-A\log p}+p^{-A/2}\big)\ge 1-p^{-A/2}-p^{-A}, \tag{3.35} $$
after adjusting the constant $A$ if necessary. Thus, for any $k,l,h$, the construction of the function sets $\Gamma^{k,l,h}_{\Delta}$ together with (3.34) implies that, on the event $\mathcal{E}'$,
$$ \mathcal{Z}\le 2L_0(1+\sqrt{A})\Big(\sqrt{\tfrac{\log p}{n}}\,\|\beta-\beta^*\|_1+\gamma_n\|g-g^*\|_2\Big)+2L_0(1+A)\gamma_n^2\|g-g^*\|_K. \tag{3.36} $$
Finally, if any of the conditions $\Delta_{\beta}\le e^{-p}$, $\Delta_{-}\le e^{-p}$ or $\Delta_{+}\le e^{-p}$ holds, then, by the monotonicity of the left-hand side, with almost the same probability,
$$ \mathcal{Z}\le 2L_0(1+\sqrt{A})\Big(\sqrt{\tfrac{\log p}{n}}\,\|\beta-\beta^*\|_1+\gamma_n\|g-g^*\|_2\Big)+2L_0(1+A)\gamma_n^2\|g-g^*\|_K+3e^{-p}. $$
This completes the proof of Lemma 1. □
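The key order-of-magnitude step (3.28) can be checked by simulation. The short Python sketch below (our own illustration with hypothetical sizes) estimates the Rademacher average $E\big[\max_{j\in[p]}|n^{-1}\sum_i\sigma_iZ_{ij}|\big]$ for bounded covariates and compares it with $\sqrt{\log p/n}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 400, 2000, 500

# Monte Carlo check of step (3.28): for bounded covariates, the Rademacher average
# E[ max_j | n^{-1} sum_i sigma_i Z_ij | ] is of order sqrt(log p / n).
Z = rng.uniform(-1.0, 1.0, size=(n, p))           # |Z_ij| <= C_1 = 1, cf. Assumption A(i)
vals = []
for _ in range(reps):
    sigma = rng.choice([-1.0, 1.0], size=n)       # Rademacher signs
    vals.append(np.max(np.abs(sigma @ Z) / n))
print(np.mean(vals), np.sqrt(np.log(p) / n))      # the two numbers are of the same order
```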
4 Conclusions

In this paper, we have studied estimation in partially linear sparse models for the support vector machine, where the covariates are split into a linear component and a nonlinear component lying in a reproducing kernel Hilbert space. An important feature of our analysis is the development of a new weighted empirical process in the high dimensional semi-parametric setting, so that the derived rates are sharp in comparison with existing related results, and are even comparable to those of least squares estimation in the high dimensional setting.

Several further research directions remain. It is known that the parametric estimation error for the partially linear mean regression does not depend on the functional complexity under some additional conditions; this paper has not achieved such an improved rate for the partially linear SVM, so how to improve the parametric estimation error is an interesting open problem. Besides, a lower bound for the partially linear SVM in the high dimensional setting has not been established under the margin condition, which would be an important complement to our upper bounds.
Acknowledgement
Yifan Xia's research is partially supported by the National Social Science Fund of China (NSSFC-16BTJ013, NSSFC-16ZDA010) and the Sichuan Project of Science & Technology (2016JY0273). Shaogao Lv's research is partially supported by NSFC-11871277 and NSFC-11829101.
References
P. Buhlmann and S. van de Geer. (2019). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Heidelberg.
P. L. Bartlett and S. Mendelson. (2006). Empirical minimization. Probab. Theory Related Fields, 135, 311–334.
P. L. Bartlett, O. Bousquet and S. Mendelson. (2005). Local Rademacher complexities. Ann. Statist., 33, 1497–1537.
D. R. Chen, Q. Wu, Y. M. Ying and D. X. Zhou. (2004). Support vector machine soft margin classifiers: error analysis. J. Mach. Learn. Res., 5, 1143–1175.
C. Giraud. (2014). Introduction to High-Dimensional Statistics. Chapman and Hall/CRC.
S. van de Geer. (2000). Empirical Processes in M-Estimation. Cambridge University Press.
W. Hardle and H. Liang. (2007). Partially Linear Models. Springer, New York.
J. Jacobus and U. Abhisek. (2009). Short term load forecasting using semi-parametric method and support vector machines. IEEE Africon Conference.
M. Kloft, U. Brefeld, S. Sonnenburg and A. Zien. (2011). l_p-norm multiple kernel learning. J. Mach. Learn. Res., 12, 953–997.
V. Koltchinskii and M. Yuan. (2010). Sparsity in multiple kernel learning. Ann. Statist., 38, 3660–3695.
G. R. G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett and M. I. Jordan. (2004). Learning the kernel matrix with semi-definite programming. J. Mach. Learn. Res., 5, 27–72.
C. A. Micchelli, M. Pontil, Q. Wu and D. X. Zhou. (2016). Generalization bounds for learning the kernel. Anal. Appl., 14, 849–868.
C. A. Micchelli and M. Pontil. (2005). Learning the kernel function via regularization. J. Mach. Learn. Res., 6, 1099–1125.
P. Massart. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. Ann. Probab., 28, 863–884.
P. Muller and S. van de Geer. (2015). The partial linear model in high dimensions. Scand. J. Statist., 42, 580–608.
A. Pierre, V. Cottet and G. Lecue. (2019). Estimation bounds and sharp oracle inequalities of regularized procedures with Lipschitz loss functions. Ann. Statist., 47.
B. Peng, L. Wang and Y. Wu. (2016). An error bound for L1 support vector machine coefficients in ultra-high dimension. J. Mach. Learn. Res., 17, 1–26.
K. H. Seok, J. Shim, D. Cho, G. J. Noh and C. H. Hwang. (2011). Semiparametric mixed-effect least squares support vector machine for analyzing pharmacokinetic and pharmacodynamic data. Neurocomputing, 74, 3412–3419.
I. Steinwart and A. Christmann. (2008). Support Vector Machines. Springer.
P. Speckman. (1988). Kernel smoothing in partial linear models. J. Roy. Statist. Soc. Ser. B, 50, 413–436.
B. Tarigan and S. van de Geer. (2006). Classifiers of support vector machine type with ℓ1 complexity regularization. Bernoulli, 12, 1045–1076.
V. Vapnik. (1995). The Nature of Statistical Learning Theory. Springer Science and Business Media.
Q. Wu, Y. M. Ying and D. X. Zhou. (2007). Multi-kernel regularized classifiers. J. Complexity, 23, 108–138.
J. Zhu, S. Rosset, T. Hastie and R. Tibshirani. (2003). 1-norm support vector machines. Advances in Neural Information Processing Systems.