Estimation of scale functions to model heteroscedasticity by support vector machines
Robert Hable
University of Bayreuth, Department of Mathematics, D-95440 Bayreuth, [email protected]
Andreas Christmann
University of Bayreuth, Department of Mathematics, D-95440 Bayreuth, [email protected]
November 2, 2018
Abstract
A main goal of regression is to derive statistical conclusions on the conditional distribution of the output variable Y given the input values x. Two of the most important characteristics of a single distribution are location and scale. Support vector machines (SVMs) are well established to estimate location functions like the conditional median or the conditional mean. We investigate the estimation of scale functions by SVMs when the conditional median is unknown, too. Estimation of scale functions is important e.g. to estimate the volatility in finance. We consider the median absolute deviation (MAD) and the interquantile range (IQR) as measures of scale. Our main result shows the consistency of MAD-type SVMs.

1 Introduction

Let P be the distribution of a pair of random variables $(X,Y)$ with values in a set $\mathcal{X} \times \mathcal{Y}$, where $X$ is an input variable and $Y$ is a real-valued output variable. The goal in regression problems is to derive statistical conclusions on the conditional distribution of $Y$ given $X = x$. Generally, location and scale are considered as the two most important characteristics of a distribution, and estimating these quantities is one of the main topics in statistics.

Regularized empirical risk minimization [26, 27, 18, 8] using the kernel trick proposed by [19] and the special case of support vector machines (SVMs) [26, 6, 18, 23] are well established methods in order to estimate the location of the conditional distribution of $Y$ given $X = x$. For an i.i.d. sample $D = \big((X_1,Y_1),\dots,(X_n,Y_n)\big)$ drawn from P, the SVM-estimator is defined by

\[ f_{L,D,\lambda} = \arg\inf_{f \in H} \frac{1}{n}\sum_{i=1}^{n} L\big(Y_i, f(X_i)\big) + \lambda \|f\|_H^2, \qquad (1) \]

where $L$ is a loss function, $H$ is a so-called reproducing kernel Hilbert space (RKHS) of functions $f:\mathcal{X}\to\mathbb{R}$, and $\lambda \in (0,\infty)$ is a regularization parameter in order to prevent overfitting. We refer to [27, 18, 2, 8, 23] for the concept of an RKHS. There are a number of different quantities which describe the location of a single distribution and which can be estimated by SVMs by choosing a suitable loss function. The conditional mean function $g(x) := E_P[Y\,|\,X=x]$, $x\in\mathcal{X}$, can be estimated by using the least-squares loss $L_{LS}(y,t) = (y-t)^2$, and the $\tau$-quantile function $g(x) := f^*_{\tau,P}(x)$, $x\in\mathcal{X}$ (see (2) below), by using the $\tau$-pinball loss function

\[ L_{\tau\text{-pin}}(y,t) = \begin{cases} (1-\tau)\,(t-y) & \text{if } y-t < 0, \\ \tau\,(y-t) & \text{if } y-t \ge 0, \end{cases} \qquad (y,t) \in \mathcal{Y}\times\mathbb{R}, \]

see [15, 14, 20, 25]. The choice $\tau = 0.5$ leads to the conditional median function $f^*_{0.5,P}(x) := \mathrm{median}_P(Y\,|\,X=x)$, $x\in\mathcal{X}$.

The goal of this paper is to investigate two methods to estimate the variability of the conditional distributions of $Y$ given $X=x$ for $x\in\mathcal{X}$ via scale functions. Estimation of heteroscedasticity is interesting in many areas of applied statistics, e.g., for the estimation of volatility in finance. To fix ideas, let us illustrate what we mean by scale function estimation by considering a small data set concerning the so-called LIDAR technique. LIDAR is the abbreviation of LIght Detection And Ranging. This technique uses the reflection of laser-emitted light to detect chemical compounds in the atmosphere. We consider the logarithm of the ratio of light received from two laser sources as the output variable Y = logratio, whereas the single input variable X = range is the distance traveled before the light is reflected back to its source. We refer to [17] for more details on this data set. A scatterplot of the data set consisting of $n = 221$ observations is shown in the left subplot of Figure 1 together with the fitted quantile curves based on SVMs using the pinball loss function for $\tau \in \{0.05, 0.25, 0.50, 0.75, 0.95\}$ and the Gaussian RBF kernel $k(x,x') := \exp(-\gamma\|x-x'\|^2)$ for $x,x'\in\mathcal{X}$. By looking at the estimated median function (i.e., the black curve in the middle of the left subplot), we clearly see a nonlinear relationship between both variables which is almost constant for values of range below 550 and decreasing for higher values of range. However, there is also a considerable change of the variability of logratio given range: the variability is relatively small for small values of range, but much larger for large values of range.
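The following R sketch shows how quantile curves of this kind can be computed with the kernel quantile regression routine kqr() of the R-package kernlab [13]; the LIDAR data are available, e.g., as the data set lidar in the R-package SemiPar accompanying [17]. The hyperparameter values C and sigma below are illustrative assumptions, not the tuned values behind Figure 1; roughly, kqr's cost parameter C plays the role of $1/(2\lambda n)$ for the regularization parameter $\lambda$ in (1).

    # Quantile SVMs (pinball loss) for the LIDAR data; illustrative settings.
    library(kernlab)
    library(SemiPar)
    data(lidar)                      # n = 221 observations: range, logratio

    x    <- matrix(lidar$range, ncol = 1)
    y    <- lidar$logratio
    taus <- c(0.05, 0.25, 0.50, 0.75, 0.95)

    # One SVM per quantile level; "rbfdot" is the Gaussian RBF kernel
    # k(x, x') = exp(-sigma * ||x - x'||^2).
    fits <- lapply(taus, function(tau)
      kqr(x, y, tau = tau, C = 10, kernel = "rbfdot", kpar = list(sigma = 0.1)))

    xg <- matrix(seq(min(x), max(x), length.out = 200), ncol = 1)
    plot(x, y, pch = 20, col = "grey", xlab = "range", ylab = "logratio")
    for (f in fits) lines(xg, predict(f, xg))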
This becomes obvious by looking at the other estimated quantile curves in the left subplot or by looking at the right subplot of Figure 1, which shows the estimated width of intervals covering at least 50% of the mass of $P(Y\,|\,x)$.

[Figure 1: Illustration of the estimation of scale functions by SVMs for the LIDAR data set. Left subplot: data set and estimated quantile functions with SVMs for $\tau = 0.50$, $\tau = 0.25$ and $\tau = 0.75$, and $\tau = 0.05$ and $\tau = 0.95$ (both in red). Right subplot: estimated width of the intervals covering 50% of the mass of $P(Y\,|\,x)$: IQR-type SVM (blue) using $(\tau_1,\tau_2) = (0.25, 0.75)$ and 2 times the MAD-type SVM (green). Panels not reproduced; axes are range versus logratio (left) and range versus width (right).]

In this simple example we can just look at the 2-dimensional plot to realize this kind of heteroscedasticity of the conditional distribution of $Y$ given $X=x$. However, this is obviously no longer possible if the input space $\mathcal{X}$ is a high-dimensional Euclidean space or an abstract metric space. Hence an automatic and non-parametric method to model and to estimate such kind of variability becomes important. Therefore, this article investigates how two classical scale quantities of the conditional distribution of $Y$ given $X=x$ can be estimated by use of SVMs. Such scale functions $g:\mathcal{X}\to[0,\infty)$ are quite common in a heteroscedastic model like $Y = f(x) + g(x)\varepsilon$ given $X = x$, where $f$ denotes the location function and $\varepsilon$ denotes the stochastic error term. Note that we will not assume such a specific model. As in the case of location, there are several well established quantities which describe the scale, e.g., [10, Chap. 5]:

(i) the variance function: $g(x) = \mathrm{Var}_P(Y\,|\,X=x)$, $x\in\mathcal{X}$;

(ii) the median absolute deviation from the median (MAD) function: $g(x) := \mathrm{MAD}_P(Y\,|\,X=x) := \mathrm{median}\big(\,|Y - f^*_{0.5,P}(x)| \,\big|\, X=x\,\big)$, $x\in\mathcal{X}$;

(iii) the interquantile range (IQR) function for quantiles $\tau_1 < \tau_2$: $g(x) := \mathrm{IQR}_{\tau_1,\tau_2}(Y\,|\,X=x) := f^*_{\tau_2,P}(x) - f^*_{\tau_1,P}(x)$, $x\in\mathcal{X}$.

Note that these three quantities are not directly comparable. However, $\mathrm{IQR}_{0.25,0.75}$ and 2 times MAD are both quantities for the width of an interval covering at least 50% of the probability mass of $P(Y\,|\,x)$. There is a vast literature on the estimation of scale functions, often based on special parametric dispersion models, see, e.g., [11, 21, 12]; for a wavelet thresholding approach for univariate regression models we refer to [3].

In this article, we consider the MAD function and the IQR function and show how both can be consistently estimated in a purely nonparametric manner with SVMs. In the case of the MAD, we estimate the unknown median function $f^*_{0.5,P}$ by an SVM $f_{L_{0.5\text{-pin}},D,\lambda}$ and calculate the estimated absolute residuals $R_i := |Y_i - f_{L_{0.5\text{-pin}},D,\lambda}(X_i)|$ in a first step. In a second step, we estimate the conditional median of the absolute residuals by the SVM based on a smoothed version of the 0.5-pinball loss defined in (4) below for the pairs of random variables $(X_i, R_i)$. The resulting estimator is called MAD-type SVM, and it is shown in Subsection 2.1 that it is risk-consistent (up to any predefined $\varepsilon > 0$) even though (i) the estimation in the second step cannot be based on the true residuals but has to be based on the estimated residuals because the true median function is unknown, and (ii) the random variables $(X_i, R_i)$ are not i.i.d.
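Continuing the R sketch from above, the two-step procedure can be written in a few lines. One caveat: the theory of Subsection 2.1 requires the $\varepsilon$-smoothed loss $L_\varepsilon$ defined in (4) below in the second step; since kqr() only implements the plain pinball loss, the second step below is a simplification.

    # Step 1: estimate the conditional median and form absolute residuals.
    fit.med <- kqr(x, y, tau = 0.5, C = 10, kernel = "rbfdot",
                   kpar = list(sigma = 0.1))
    R <- abs(y - as.vector(predict(fit.med, x)))   # R_i = |Y_i - f.hat(X_i)|

    # Step 2: median regression of the absolute residuals on x
    # (plain pinball loss instead of the smoothed loss L_eps from (4)).
    fit.mad <- kqr(x, R, tau = 0.5, C = 10, kernel = "rbfdot",
                   kpar = list(sigma = 0.1))

    # Truncation at zero as in (5) below; 2 * g.hat estimates the width of
    # an interval covering 50% of the mass of P(Y|x) (right subplot, Fig. 1).
    g.hat <- pmax(as.vector(predict(fit.mad, xg)), 0)
    plot(xg, 2 * g.hat, type = "l", xlab = "range", ylab = "width")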
In the case of the IQR $f^*_{\tau_2,P} - f^*_{\tau_1,P}$, we estimate each $\tau_j$-quantile function $f^*_{\tau_j,P}$ by use of the $\tau_j$-pinball loss, so that we get $f_{L_{\tau_2\text{-pin}},D,\lambda_{2,n}} - f_{L_{\tau_1\text{-pin}},D,\lambda_{1,n}}$ as an estimate, which we call IQR-type SVM. As this is the difference of two standard SVMs, we can carry over many well-known facts on SVMs to this case in Subsection 2.2. In both cases, available software, e.g., the R-package kernlab [13] or the C++ implementation mySVM [16], can be used since we essentially have to calculate SVMs for pinball losses.

The rest of the paper has the following structure. Section 2 contains, with Theorem 2.2, our main result. Section 3 contains not only the proof of this theorem, but also gives two new consistency results in the $L_1$-sense for SVMs based on the pinball loss, see Lemma 3.1 and Theorem 3.2. Although we need these results in our proof of Theorem 2.2, we think that they are interesting in their own right, because they improve earlier consistency results for SVMs which showed the weaker kind of convergence in probability, see [23, Cor. 3.62, Thm. 9.7(i)].

2 Main results

The following assumptions and notations are used throughout the whole article.
Assumption 2.1
Let $\mathcal{X}$ be a complete separable metric space, e.g. $\mathcal{X} = \mathbb{R}^d$, and $\mathcal{Y} \subset \mathbb{R}$ be closed. For $j \in \{1,2\}$, let $k_j:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be a bounded continuous kernel with $\|k_j\|_\infty := \sup_{x\in\mathcal{X}} \big(k_j(x,x)\big)^{1/2} < \infty$. Its corresponding reproducing kernel Hilbert space (RKHS) is denoted by $H_j$, its corresponding canonical feature map is denoted by $\Phi_j$, and it is assumed that each $H_j$ is dense in $L_1(\mu)$ for every $\mu \in \mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$.

$\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ denotes the set of all Borel probability measures on $\mathcal{X}\times\mathcal{Y}$. The unknown joint distribution of $(X,Y)$ is denoted by $P \in \mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$, and $D = \big((X_1,Y_1),\dots,(X_n,Y_n)\big)$ is an i.i.d. sample drawn from P. Let $P_X$ denote the distribution of $X$, let $\mathcal{L}_0(\mathcal{X})$ denote the set of all Borel measurable functions $f:\mathcal{X}\to\mathbb{R}$, and let $L_1(P_X)$ denote the set of all $P_X$-integrable functions $f:\mathcal{X}\to\mathbb{R}$. We define the $\tau$-quantile function as the (perhaps set-valued) function $\mathcal{X}\to\mathbb{R}$,

\[ x \mapsto F^*_{\tau,P}(x) := \big\{ t\in\mathbb{R} :\; P\big((-\infty,t]\,\big|\,x\big) \ge \tau \text{ and } P\big([t,\infty)\,\big|\,x\big) \ge 1-\tau \big\}. \qquad (2) \]

We make the standard assumption that the sets $F^*_{\tau,P}(x)$ are singletons and hence write $f^*_{\tau,P}:\mathcal{X}\to\mathbb{R}$ instead, see [22, 24].

2.1 MAD-type SVMs

We would like to estimate the MAD function given by $g(x) = \mathrm{MAD}_P(Y\,|\,X=x) = \mathrm{median}_P\big(\,|Y - f^*_{0.5,P}(x)| \,\big|\, X=x\,\big)$, where $f^*_{0.5,P}$ is the median function. First, we estimate the median function $f^*_{0.5,P}$. For a random sample $D = \big((X_1,Y_1),\dots,(X_n,Y_n)\big)$ drawn from P, the SVM-estimator for $f^*_{0.5,P}$ is

\[ f_{L_{0.5\text{-pin}},D,\lambda_{1,n}} = \arg\inf_{f\in H_1} \bigg( \frac{1}{n}\sum_{i=1}^n L_{0.5\text{-pin}}\big(Y_i, f(X_i)\big) + \lambda_{1,n}\|f\|_{H_1}^2 \bigg), \]

where $\lambda_{1,n} \in (0,\infty)$ and $H_1$ is an RKHS. Then, we can estimate the conditional median of the absolute residuals $|Y - f^*_{0.5,P}(x)|$ by use of the estimated absolute residuals. Let us define

\[ \tilde g_{D,n} = \arg\inf_{g\in H_2} \bigg( \frac{1}{n}\sum_{i=1}^n L_\varepsilon\Big(\big|Y_i - f_{L_{0.5\text{-pin}},D,\lambda_{1,n}}(X_i)\big|,\; g(X_i)\Big) + \lambda_{2,n}\|g\|_{H_2}^2 \bigg), \qquad (3) \]

where, for some small predefined number $\varepsilon > 0$, the loss function $L_\varepsilon$ defined by

\[ L_\varepsilon(y,t) = \frac{y-t}{2} - \varepsilon\,\log\Big( 2\,\Lambda\Big(\frac{y-t}{\varepsilon}\Big) \Big) = L_{0.5\text{-pin}}(y,t) - \varepsilon\,\log\Big( 2\,\Lambda\Big(\frac{|y-t|}{\varepsilon}\Big) \Big) \qquad (4) \]

is an $\varepsilon$-smoothed version of the pinball loss function for $\tau = 0.5$, where $\Lambda(r) = 1/(1+e^{-r})$ for every $r\in\mathbb{R}$, $\lambda_{2,n}\in(0,\infty)$, and $H_2$ is an RKHS.
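The loss $L_\varepsilon$ is straightforward to implement; the following R sketch defines it and numerically checks the sandwich property $0 \le L_{0.5\text{-pin}} - L_\varepsilon \le \log(2)\,\varepsilon$ recorded below.

    # eps-smoothed 0.5-pinball loss from (4); Lambda(r) = 1/(1 + exp(-r)).
    Lambda <- function(r) 1 / (1 + exp(-r))
    L.pin  <- function(y, t) 0.5 * abs(y - t)
    L.eps  <- function(y, t, eps) (y - t) / 2 - eps * log(2 * Lambda((y - t) / eps))

    # Numerical check of 0 <= L.pin - L.eps <= log(2) * eps on a grid of
    # residuals r = y - t (eps = 0.1 is an arbitrary illustrative value):
    eps <- 0.1
    r   <- seq(-5, 5, by = 0.01)
    gap <- L.pin(r, 0) - L.eps(r, 0, eps)
    stopifnot(all(gap >= -1e-12), all(gap <= log(2) * eps + 1e-12))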
Since $\tilde g_{D,n}$ occasionally can have negative values, we propose the MAD-type estimator

\[ g_{D,n} = \max\{\tilde g_{D,n},\, 0\} \qquad (5) \]

instead of $\tilde g_{D,n}$. We use the smoothed version $L_\varepsilon$ of the pinball loss function $L_{0.5\text{-pin}}$ because we will need in the proof of Theorem 2.2 that the loss function has a Lipschitz continuous derivative, see (18) and (19). This is the price we pay for the unavoidable fact that our estimation cannot be based on the true residuals but on the estimated ones, because the distribution P of $(X_i,Y_i)$ is assumed to be unknown in statistical machine learning. Some easy calculations show that the smoothed pinball loss function $L_\varepsilon$ is convex, Lipschitz continuous with Lipschitz constant $|L_\varepsilon|_1 = 0.5$, has a Lipschitz continuous derivative, and fulfills

\[ 0 \le L_{0.5\text{-pin}}(y,t) - L_\varepsilon(y,t) \le \log(2)\,\varepsilon < \varepsilon \]

for every $(y,t)\in\mathcal{Y}\times\mathbb{R}$, so the risks fulfill, for all $P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$,

\[ 0 \le E_P L_{0.5\text{-pin}}(Y,f(X)) - E_P L_\varepsilon(Y,f(X)) \le E_P\big|L_{0.5\text{-pin}}(Y,f(X)) - L_\varepsilon(Y,f(X))\big| < \varepsilon. \]

The $\varepsilon$-smoothed version of the pinball loss is actually a re-parametrized logistic loss function, $L_\varepsilon(y,t) = \varepsilon L_{\text{logistic}}(y/\varepsilon, t/\varepsilon)/2$, see [23, p. 44]. Hence the SVM based on $L_\varepsilon$ can be calculated by any software which supports the logistic loss. For the illustration purposes in the introduction, we used a small value of $\varepsilon$.

For a loss function $L$ and every measurable $f,g:\mathcal{X}\to\mathbb{R}$, define the risk

\[ \mathcal{R}_{L,P}(f,g) := E_P\, L\big(|Y - f(X)|,\; g(X)\big). \qquad (6) \]

If the median function $f^*_{0.5,P}$ and the MAD function $g^*_P(x) = \mathrm{MAD}(Y\,|\,X=x)$ uniquely exist, then the MAD function $g^*_P$ minimizes $g \mapsto \mathcal{R}_{L_{0.5\text{-pin}},P}\big(f^*_{0.5,P}, g\big)$ over all measurable functions $g:\mathcal{X}\to\mathbb{R}$, i.e.

\[ \mathcal{R}_{L_{0.5\text{-pin}},P}\big(f^*_{0.5,P},\, g^*_P\big) = \inf_{g\in\mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L_{0.5\text{-pin}},P}\big(f^*_{0.5,P},\, g\big). \]

The following theorem says that $g_{D,n}$ is risk $\varepsilon$-consistent for the MAD function.

Theorem 2.2 In addition to Assumption 2.1, assume that $E_P|Y| < \infty$ and that the median function $f^*_{0.5,P}:\mathcal{X}\to\mathbb{R}$ is almost surely unique. Let $L$ denote the 0.5-pinball loss function and let $\varepsilon > 0$ be the predefined real number in the loss function $L_\varepsilon$. Then, for $n\to\infty$,

\[ \inf_{g\in\mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L,P}\big(f^*_{0.5,P},\, g\big) + \varepsilon \;\ge\; \mathcal{R}_{L,P}\big(f^*_{0.5,P},\, g_{D,n}\big) + o_P(1) \;=\; \mathcal{R}_{L,P}\big(f_{L,D,\lambda_{1,n}},\, g_{D,n}\big) + o_P(1) \]

if $\lim_{n\to\infty}\lambda_{1,n} = 0$, $\lim_{n\to\infty}\lambda_{2,n} = 0$, and $\lim_{n\to\infty}\lambda_{1,n}^2\lambda_{2,n}^2\, n = \infty$.

Remarks: (i) We assume that the true median function uniquely exists, but we do not assume that the true MAD function $g^*_P$ uniquely exists. (ii) The value $\mathcal{R}_{L,P}\big(f^*_{0.5,P}, g_{D,n}\big)$ quantifies the expected distance of the estimate $g_{D,n}$ to the absolute values of the true residuals $|Y - f^*_{0.5,P}(X)|$; the value $\mathcal{R}_{L,P}\big(f_{L,D,\lambda_{1,n}}, g_{D,n}\big)$ quantifies the expected distance of the estimate $g_{D,n}$ to the absolute values of the estimated residuals $|Y - f_{L,D,\lambda_{1,n}}(X)|$. According to Theorem 2.2, both values asymptotically achieve the infimal risk up to the predefined $\varepsilon > 0$.
(iii) The assumption $\lim_{n\to\infty}\lambda_{1,n}^2\lambda_{2,n}^2\, n = \infty$ is stronger than the standard assumption $\lim_{n\to\infty}\lambda_{j,n}^2\, n = \infty$; see [23, Thm. 9.6]. This is plausible because estimating the MAD is burdened with the estimation of a nuisance function (i.e. the unknown median function).

2.2 IQR-type SVMs

Let us now consider a linear combination of $m$ SVMs under Assumption 2.1. As the results follow by straightforward calculations using standard results on SVMs, the proofs are left out.

Let $m$ be a positive integer, $J = \{1,\dots,m\}$, $c = (c_1,\dots,c_m) \in \mathbb{R}^m\setminus\{0\}$, and let $(\xi_{j,n})_{n\in\mathbb{N}}$ be a sequence of measurable functions into some complete separable metric space $E_j$ equipped with its Borel $\sigma$-algebra, $j\in J$. Obviously, $\sum_{j\in J} c_j\xi_{j,n}$ exists and is unique if all $\xi_{j,n}$ exist and are unique. Furthermore, $\sum_{j\in J} c_j\xi_{j,n}$ converges to $\sum_{j\in J} c_j\xi_{j,0}$ in probability (or almost surely or in the $L_p$-sense) if all $\xi_{j,n}$ converge in probability (or almost surely or in the $L_p$-sense) to $\xi_{j,0}$ for $n\to\infty$.

Now, let $0 < \tau_1 < \dots < \tau_m < 1$. If we either specialize $\xi_{j,n}$ to be the support vector machine $f_{L_{\tau_j\text{-pin}},D,\lambda_{j,n}}$ and choose as $E_j$ the RKHS $H_j$, or specialize $\xi_{j,n}$ to be the corresponding risk $E_P L_{\tau_j\text{-pin}}\big(Y, f_{L_{\tau_j\text{-pin}},D,\lambda_{j,n}}(X)\big)$ and choose $E_j = \mathbb{R}$, then existence, uniqueness and consistency results for the linear combination of the SVMs or of their risks follow immediately from results valid for each individual SVM, see, e.g., [23, 22, 24] and our Theorem 3.2. Denote the subdifferential (see e.g. [7, Section 5.3]) of the pinball loss function $L_{\tau_j\text{-pin}}$ (with respect to the second argument) by $\partial L_{\tau_j\text{-pin}}$. We then obtain immediately a representer theorem for the linear combination of SVMs, because it is well known that each individual SVM has a representer theorem, i.e. it holds that

\[ \sum_{j\in J} c_j\, f_{L_{\tau_j\text{-pin}},P,\lambda_{j,n}} = -\sum_{j\in J} (2\lambda_{j,n})^{-1} c_j\, E_P\, h_{j,P}(X,Y)\,\Phi_j(X), \qquad (7) \]

where the functions $h_{j,P}$ fulfill

\[ h_{j,P}(x,y) \in \partial L_{\tau_j\text{-pin}}\big(y,\; f_{L_{\tau_j\text{-pin}},P,\lambda_{j,n}}(x)\big) \quad \forall\,(x,y)\in\mathcal{X}\times\mathcal{Y}, \qquad (8) \]

see e.g. [23, Thm. 5.8, Cor. 5.11]. In the same manner we obtain by straightforward calculations bounds for the maximal bias of SVMs and Bouligand influence functions for linear combinations of SVMs, see [4]. Note that SVMs exist even for heavy-tailed distributions which violate the classical assumption $E_P|Y| < \infty$; this can be shown by using a trick already used by [9], where instead of the original loss function a shifted loss function in the sense of $L^*_{\tau\text{-pin}}(y,t) := L_{\tau\text{-pin}}(y,t) - L_{\tau\text{-pin}}(y,0)$, $y,t\in\mathbb{R}$, is used, see [5]. This only changes the objective function to be minimized, but not the SVM itself.

Example 2.3 [Estimation of scale functions.] Let $m = 2$, $c = (-1,+1)$, $\tau_1\in(0,\frac12)$ and $\tau_2\in(\frac12,1)$, e.g. $(\tau_1,\tau_2) = (\frac14,\frac34)$ or $(\tau_1,\tau_2) = (0.05, 0.95)$. Then we obtain immediately existence, uniqueness and consistency results for the difference of the two SVMs based on the pinball loss functions $L_{\tau_1\text{-pin}}$ and $L_{\tau_2\text{-pin}}$, respectively. In other words, if we denote a $\tau_j$-quantile of the conditional distribution of $Y$ given $X=x$ by $f^*_{\tau_j,P}$, then the difference of two SVMs $f_{L_{\tau_2\text{-pin}},D,\lambda_{2,n}} - f_{L_{\tau_1\text{-pin}},D,\lambda_{1,n}}$ yields an estimator for the difference $f^*_{\tau_2,P} - f^*_{\tau_1,P}$.

Example 2.4 [Estimation of asymmetry functions.] Let $m = 3$, $c = (+1,-2,+1)$, $\tau_1\in(0,\frac12)$, $\tau_2 = \frac12$, and $\tau_3\in(\frac12,1)$, e.g. $(\tau_1,\tau_2,\tau_3) = (\frac14,\frac12,\frac34)$ or $(\tau_1,\tau_2,\tau_3) = (0.05, 0.50, 0.95)$. Then we obtain immediately existence, uniqueness and consistency results for

\[ f_{L_{\tau_1\text{-pin}},D,\lambda_{1,n}} - 2\, f_{L_{\tau_2\text{-pin}},D,\lambda_{2,n}} + f_{L_{\tau_3\text{-pin}},D,\lambda_{3,n}}, \]

which gives us an estimator for the difference

\[ \big(f^*_{\tau_3,P} - f^*_{\tau_2,P}\big) - \big(f^*_{\tau_2,P} - f^*_{\tau_1,P}\big). \qquad (9) \]

Let us now choose $\tau := \tau_1 \in (0,\frac12)$ and $\tau_3 = 1 - \tau_1 = 1 - \tau$, e.g. $\tau = 0.05$. Then the function in (9) is zero if, for all $x\in\mathcal{X}$, the upper conditional quantile $f^*_{1-\tau,P}(x)$ differs from the conditional median $f^*_{0.5,P}(x)$ by the same amount as the conditional median $f^*_{0.5,P}(x)$ differs from the lower conditional quantile $f^*_{\tau,P}(x)$. Hence the function in (9) or its supremum norm can be used as a quantity to measure the amount of asymmetry of the conditional distribution of $Y$ given $X=x$.

It is well known that the so-called crossing problem can occur in quantile regression and that this problem is not specific to SVMs, see [14, p. 55-59]. The crossing problem occurs if, for two quantile levels $\tau_1 < \tau_2$, the estimated quantile functions $\hat q_{\tau_1}, \hat q_{\tau_2}$ are in the wrong order for at least one $x\in\mathcal{X}$, i.e. $\hat q_{\tau_1}(x) > \hat q_{\tau_2}(x)$.
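In terms of the R sketch from the introduction, the estimators of Examples 2.3 and 2.4 are simple linear combinations of the already fitted quantile SVMs; the last line is a naive check for quantile crossing on the evaluation grid.

    # Predictions of the five quantile SVMs from the introduction on grid xg.
    qhat <- sapply(fits, function(f) as.vector(predict(f, xg)))
    colnames(qhat) <- c("q05", "q25", "q50", "q75", "q95")

    # Example 2.3: IQR-type SVM, (tau1, tau2) = (0.25, 0.75), c = (-1, +1).
    iqr.hat <- qhat[, "q75"] - qhat[, "q25"]

    # Example 2.4: asymmetry function (9), (0.05, 0.50, 0.95), c = (+1, -2, +1).
    asym.hat <- qhat[, "q05"] - 2 * qhat[, "q50"] + qhat[, "q95"]

    # Naive crossing check: a negative estimated width indicates that the
    # two quantile curves cross somewhere on the grid.
    any(iqr.hat < 0)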
The danger that the crossing problem occurs for a fixed data set is typically small if $\tau_1$ is close to 0 and if $\tau_2$ is close to 1. A numerical method to prevent the crossing problem in kernel based quantile regression was proposed by [25].

From our point of view, it will often depend on the application whether an MAD- or an IQR-type SVM is more appropriate. We see three advantages of MAD-type estimation. (i) One can estimate the heteroscedasticity of $P(\cdot\,|\,x)$ by estimating the conditional median of the absolute residuals $|Y - \hat f(x)|$ without estimating two conditional quantile functions. Because in most applications the conditional median (or the conditional mean) is estimated anyway, one only needs to compute one additional SVM instead of the two additional SVMs of the IQR-type approach. (ii) It can happen that the upper and the lower quantile functions are hard to approximate, e.g., they are not in the RKHSs $H_1$ and $H_2$, which can easily happen even with the classical Gaussian RBF kernel, whose RKHS contains only continuous functions, see [23, Lem. 4.28, Cor. 4.36], whereas the true quantile functions may have jumps. (iii) It can happen that the difference of two quantile functions is easy to estimate, e.g. it is constant, linear, or a polynomial of low order, although the quantile functions $f^*_{\tau_1,P}$ and $f^*_{\tau_2,P}$ are complicated.

On the other hand, we see three advantages of IQR-type estimation: (i) Greater flexibility by the choice of $(\tau_1,\tau_2)$, whereas the MAD-type approach is based on estimating one conditional quantile (here $\tau = \frac12$) of the distribution of the absolute residuals. (ii) Greater flexibility by choosing different types of kernels or kernels with different kernel parameters for estimating the upper and the lower quantile functions. (iii) The IQR-type approach allows the direct estimation of asymmetry or other quantities of interest for the distribution of $Y$ given $X=x$.

3 $L_1$-consistency of quantile function estimation by SVMs

The following lemma strengthens [23, Cor. 3.62] in the case of the pinball loss function, as convergence in probability is replaced by the stronger $L_1$-convergence.

Lemma 3.1 Let $L$ be the pinball loss with $\tau\in(0,1)$ and let $P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ be the distribution of $(X,Y)$. Assume that $E_P|Y| < \infty$ and that the conditional quantile function $f^*_{\tau,P}:\mathcal{X}\to\mathbb{R}$ is $P_X$-a.s. unique. Then, for every $f_n\in L_1(P_X)$, $n\in\mathbb{N}$, we have

\[ \lim_{n\to\infty} \mathcal{R}_{L,P}(f_n) = \inf_{f\in\mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L,P}(f) \;\Longrightarrow\; \lim_{n\to\infty} \big\| f_n - f^*_{\tau,P} \big\|_{L_1(P_X)} = 0. \]

Proof:
Define $h_n:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ by $h_n(x,y) = L\big(y, f_n(x)\big)$ and $h:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ by $h(x,y) = L\big(y, f^*_{\tau,P}(x)\big)$. Define $c := \min\{1-\tau,\, \tau\}$ and note that $L(y,t) \ge c\,|y-t|$ for every $(y,t)\in\mathcal{Y}\times\mathbb{R}$. According to [23, Cor. 3.62], it is already known that $f_n \to f^*_{\tau,P}$ in probability (w.r.t. $P_X$). Therefore, it follows from the continuity of $L$ that $h_n \to h$ in probability (w.r.t. P). Since

\[ \lim_{n\to\infty}\int |h_n|\,d\mathrm{P} = \lim_{n\to\infty}\mathcal{R}_{L,P}(f_n) = \inf_{f\in\mathcal{L}_0(\mathcal{X})}\mathcal{R}_{L,P}(f) = \mathcal{R}_{L,P}\big(f^*_{\tau,P}\big) = \int |h|\,d\mathrm{P}, \]

the sequence $(h_n)_{n\in\mathbb{N}}$ is uniformly integrable; see e.g. [1, Thm. 21.7]. Since

\[ \big|f_n(x)\big| \le \big|y - f_n(x)\big| + |y| \le c^{-1} L\big(y, f_n(x)\big) + |y| = c^{-1} h_n(x,y) + |y| \quad \forall\,(x,y,n)\in\mathcal{X}\times\mathcal{Y}\times\mathbb{N}, \]

it follows that the sequence $f_n$, $n\in\mathbb{N}$, is uniformly integrable, too. Hence convergence in probability of $f_n$, $n\in\mathbb{N}$, implies $L_1$-convergence; see e.g. [1, Thm. 21.7].

The following theorem strengthens [23, Thm. 9.7(i)], as convergence in probability is replaced by the stronger $L_1$-convergence. The proof coincides with that of [23, Thm. 9.7(i)] apart from applying Lemma 3.1 instead of [23, Cor. 3.62] and is therefore omitted.
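Before stating the theorem, here is a small Monte Carlo sketch of the kind of $L_1(P_X)$-convergence it asserts. Everything in it is an illustrative assumption: the model, the kernel parameter, and the sequence $C_n$, which mirrors $\lambda_n \to 0$, $\lambda_n^2 n \to \infty$ via the rough correspondence $C = 1/(2\lambda_n n)$ between kqr's cost parameter and the regularization parameter in (1).

    # Monte Carlo illustration (not part of the proof): heteroscedastic model
    # Y = sin(2*pi*X) + (0.2 + X) * eps, eps ~ N(0,1), X ~ U(0,1), so the true
    # conditional median is sin(2*pi*X).
    library(kernlab)
    set.seed(1)
    l1.err <- sapply(c(100, 400, 1600), function(n) {
      xs <- matrix(runif(n), ncol = 1)
      ys <- sin(2 * pi * xs) + (0.2 + xs) * rnorm(n)
      # lambda_n = n^(-1/3) fulfills lambda_n -> 0 and lambda_n^2 * n -> Inf;
      # translated to kqr's cost: C_n = 1 / (2 * lambda_n * n) = 0.5 * n^(-2/3).
      fit <- kqr(xs, ys, tau = 0.5, C = 0.5 * n^(-2/3), kernel = "rbfdot",
                 kpar = list(sigma = 10))
      xe <- matrix(runif(4000), ncol = 1)          # fresh sample from P_X
      mean(abs(as.vector(predict(fit, xe)) - sin(2 * pi * xe)))
    })
    print(l1.err)   # the empirical L1 errors are expected to shrink with n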
Theorem 3.2 Let $\mathcal{X}$ be a complete separable metric space, $\mathcal{Y}\subset\mathbb{R}$ be closed, $L$ be the pinball loss with $\tau\in(0,1)$, $H$ be a separable RKHS of a bounded kernel $k$ on $\mathcal{X}$ such that $H$ is dense in $L_1(\mu)$ for all $\mu\in\mathcal{M}_1(\mathcal{X})$, and $\lambda_n\in(0,\infty)$, $n\in\mathbb{N}$, such that $\lim_{n\to\infty}\lambda_n = 0$ and $\lim_{n\to\infty}\lambda_n^2 n = \infty$. Let $P\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$ be the distribution of $(X,Y)$, and assume that $E_P|Y| < \infty$ and that the conditional quantile function $f^*_{\tau,P}:\mathcal{X}\to\mathbb{R}$ is $P_X$-a.s. unique. Then

\[ \big\| f_{L,D,\lambda_n} - f^*_{\tau,P} \big\|_{L_1(P_X)} \to 0 \quad\text{in probability},\; n\to\infty. \]

Proof of Theorem 2.2: In order to increase the readability of the proof, a compact notation is needed. Therefore, we define $L := L_{0.5\text{-pin}}$ and, for probability measures $P_1, P_2\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y})$, we define

\[ f_{P_1;n} := f_{L,P_1,\lambda_{1,n}} = \arg\inf_{f\in H_1}\bigg( \int L\big(y, f(x)\big)\, P_1\big(d(x,y)\big) + \lambda_{1,n}\|f\|_{H_1}^2 \bigg), \]
\[ g_{P_1,P_2;n} := \arg\inf_{g\in H_2}\bigg( \int L_\varepsilon\big(|y - f_{P_1;n}(x)|,\; g(x)\big)\, P_2\big(d(x,y)\big) + \lambda_{2,n}\|g\|_{H_2}^2 \bigg). \]

In this definition, $P_1$ and $P_2$ can also be equal to the empirical measure $\mathrm{D} = \frac{1}{n}\sum_{i=1}^n \delta_{(X_i,Y_i)}$, which corresponds to the random sample $D = \big((X_1,Y_1),\dots,(X_n,Y_n)\big)$. That is, the estimate $g_{D,n}$ defined in (5) and (3) is given by $g_{D,n} = \max\{g_{\mathrm{D},\mathrm{D};n},\, 0\}$ in this notation. We obtain

\[ L'_\varepsilon(y,t) := \frac{\partial}{\partial t}\, L_\varepsilon(y,t) = \frac12 - \Lambda\Big(\frac{y-t}{\varepsilon}\Big) \quad \forall\, y,t\in\mathbb{R}. \]

Since $\big|\frac{\partial}{\partial y} L_\varepsilon(y,t)\big| \le 0.5$ and $\big|\frac{\partial}{\partial t} L_\varepsilon(y,t)\big| \le 0.5$ for all $y,t\in\mathbb{R}$, the following Lipschitz property is fulfilled:

\[ \big| L_\varepsilon(y_1,t_1) - L_\varepsilon(y_2,t_2) \big| \le 0.5\,|y_1-y_2| + 0.5\,|t_1-t_2| \quad \forall\, y_1,y_2,t_1,t_2\in\mathbb{R}. \qquad (10) \]

An easy calculation shows that (10) implies

\[ \big| \mathcal{R}_{L_\varepsilon,P}(f_1,g_1) - \mathcal{R}_{L_\varepsilon,P}(f_2,g_2) \big| \le 0.5\,\|f_1-f_2\|_{L_1(P_X)} + 0.5\,\|g_1-g_2\|_{L_1(P_X)} \qquad (11) \]

for all $f_1,f_2,g_1,g_2\in L_1(P_X)$. Note that, by construction, $0 \le L - L_\varepsilon \le \varepsilon$, which implies

\[ \mathcal{R}_{L_\varepsilon,P}(f,g) \le \mathcal{R}_{L,P}(f,g) \le \mathcal{R}_{L_\varepsilon,P}(f,g) + \varepsilon \quad \forall\, f,g\in L_1(P_X). \qquad (12) \]

It is obvious from the definition (6) of the risk $\mathcal{R}_{L,P}(f,g)$ that replacing negative values of the function $g$ by 0 reduces the risk. Hence, the definitions imply $\mathcal{R}_{L,P}\big(f^*_{0.5,P}, g_{D,n}\big) \le \mathcal{R}_{L,P}\big(f^*_{0.5,P}, g_{\mathrm{D},\mathrm{D};n}\big)$, and it follows from (12) that

\[ \mathcal{R}_{L,P}\big(f^*_{0.5,P}, g_{D,n}\big) - \inf_{g\in\mathcal{L}_0(\mathcal{X})}\mathcal{R}_{L,P}\big(f^*_{0.5,P}, g\big) \le \mathcal{R}_{L_\varepsilon,P}\big(f^*_{0.5,P}, g_{\mathrm{D},\mathrm{D};n}\big) - \inf_{g\in\mathcal{L}_0(\mathcal{X})}\mathcal{R}_{L_\varepsilon,P}\big(f^*_{0.5,P}, g\big) + \varepsilon. \qquad (13) \]

Applying the triangle inequality yields

\[ \mathcal{R}_{L_\varepsilon,P}\big(f^*_{0.5,P}, g_{\mathrm{D},\mathrm{D};n}\big) \le \Big| \mathcal{R}_{L_\varepsilon,P}\big(f^*_{0.5,P}, g_{\mathrm{D},\mathrm{D};n}\big) - \mathcal{R}_{L_\varepsilon,P}\big(f_{\mathrm{P};n}, g_{\mathrm{P},\mathrm{P};n}\big) \Big| + \mathcal{R}_{L_\varepsilon,P}\big(f_{\mathrm{P};n}, g_{\mathrm{P},\mathrm{P};n}\big). \qquad (14) \]

Next, define

\[ \Delta_1^{(n)} := \big\|g_{\mathrm{D},\mathrm{D};n} - g_{\mathrm{P},\mathrm{D};n}\big\|_{L_1(P_X)}, \qquad \Delta_2^{(n)} := \big\|g_{\mathrm{P},\mathrm{D};n} - g_{\mathrm{P},\mathrm{P};n}\big\|_{L_1(P_X)}, \]
\[ \Delta_3^{(n)} := \big\|f^*_{0.5,P} - f_{\mathrm{P};n}\big\|_{L_1(P_X)}, \qquad \Delta_4^{(n)} := \mathcal{R}_{L_\varepsilon,P}\big(f_{\mathrm{P};n}, g_{\mathrm{P},\mathrm{P};n}\big) - \inf_{g\in\mathcal{L}_0(\mathcal{X})}\mathcal{R}_{L_\varepsilon,P}\big(f^*_{0.5,P}, g\big). \]

Then, it follows from (13), (14), (11), and another application of the triangle inequality that

\[ 0 \le \mathcal{R}_{L,P}\big(f^*_{0.5,P}, g_{D,n}\big) - \inf_{g\in\mathcal{L}_0(\mathcal{X})}\mathcal{R}_{L,P}\big(f^*_{0.5,P}, g\big) \le 0.5\,\big(\Delta_1^{(n)} + \Delta_2^{(n)} + \Delta_3^{(n)}\big) + \Delta_4^{(n)} + \varepsilon. \qquad (15) \]
Each of the four summands $\Delta_1^{(n)},\dots,\Delta_4^{(n)}$ will be considered separately in the following four parts. In order to prove the theorem, it is enough to show that $\Delta_1^{(n)}$ and $\Delta_2^{(n)}$ converge to 0 in probability (Part 1 and Part 2), that $\Delta_3^{(n)}$ converges to 0 (Part 3), and that the limit superior of $\Delta_4^{(n)}$ is not larger than 0 (Part 4); the terms $\Delta_3^{(n)}$ and $\Delta_4^{(n)}$ are non-stochastic. Note that (11) and Theorem 3.2 imply, for $n\to\infty$, the convergence in probability

\[ \Big| \mathcal{R}_{L,P}\big(f^*_{0.5,P}, g_{D,n}\big) - \mathcal{R}_{L,P}\big(f_{L,D,\lambda_{1,n}}, g_{D,n}\big) \Big| \le 0.5\,\big\|f^*_{0.5,P} - f_{L,D,\lambda_{1,n}}\big\|_{L_1(P_X)} \to 0. \]

Part 1:
For $D = \big((X_1,Y_1),\dots,(X_n,Y_n)\big)$, define

\[ Q_D = \frac{1}{n}\sum_{i=1}^n \delta_{(X_i,\,|Y_i - f_{\mathrm{P};n}(X_i)|)} \qquad\text{and}\qquad \tilde Q_D = \frac{1}{n}\sum_{i=1}^n \delta_{(X_i,\,|Y_i - f_{\mathrm{D};n}(X_i)|)}. \]

For every $(x,y)\in\mathcal{X}\times\mathcal{Y}$, define $h_{D,n}(x,y) = L'_\varepsilon\big(y,\, g_{\mathrm{P},\mathrm{D};n}(x)\big)$. Then, it follows from the representer theorem [23, Cor. 5.10] that

\[ \big\| g_{\mathrm{D},\mathrm{D};n} - g_{\mathrm{P},\mathrm{D};n} \big\|_{H_2} \le \lambda_{2,n}^{-1}\, \big\| E_{\tilde Q_D}\, h_{D,n}\Phi_2 - E_{Q_D}\, h_{D,n}\Phi_2 \big\|_{H_2} \le \frac{1}{\lambda_{2,n}\, n} \sum_{i=1}^n \Big| h_{D,n}\big(X_i, |Y_i - f_{\mathrm{D};n}(X_i)|\big) - h_{D,n}\big(X_i, |Y_i - f_{\mathrm{P};n}(X_i)|\big) \Big| \cdot \big\|\Phi_2(X_i)\big\|_{H_2}. \qquad (16) \]

According to the boundedness of $k_1$ and $k_2$, we will use the well-known inequalities

\[ \|\Phi_j(x)\|_{H_j} \le \|k_j\|_\infty \;\;\forall\, x\in\mathcal{X} \qquad\text{and}\qquad \|f\|_\infty \le \|k_j\|_\infty \|f\|_{H_j} \;\;\forall\, f\in H_j \qquad (17) \]

for every $j\in\{1,2\}$; see [23, p. 124]. Then, the definition of $h_{D,n}$ and the easy-to-prove Lipschitz property $|L'_\varepsilon(y_1,t) - L'_\varepsilon(y_2,t)| \le \varepsilon^{-1}|y_1-y_2|$ for all $y_1,y_2,t\in\mathbb{R}$ imply

\[ \big\| g_{\mathrm{D},\mathrm{D};n} - g_{\mathrm{P},\mathrm{D};n} \big\|_{H_2} \overset{(16),(17)}{\le} \frac{\|k_2\|_\infty}{\lambda_{2,n}\, n} \sum_{i=1}^n \Big| h_{D,n}\big(X_i, |Y_i - f_{\mathrm{D};n}(X_i)|\big) - h_{D,n}\big(X_i, |Y_i - f_{\mathrm{P};n}(X_i)|\big) \Big| \]
\[ \le \|k_2\|_\infty\,\lambda_{2,n}^{-1}\, \sup_t\, \sup_{x,y}\, \Big| L'_\varepsilon\big(|y - f_{\mathrm{D};n}(x)|,\, t\big) - L'_\varepsilon\big(|y - f_{\mathrm{P};n}(x)|,\, t\big) \Big| \qquad (18) \]
\[ \le \|k_2\|_\infty\,\varepsilon^{-1}\lambda_{2,n}^{-1}\, \sup_{x,y}\, \Big| \big|y - f_{\mathrm{D};n}(x)\big| - \big|y - f_{\mathrm{P};n}(x)\big| \Big| \le \|k_2\|_\infty\,\varepsilon^{-1}\lambda_{2,n}^{-1}\, \big\| f_{\mathrm{D};n} - f_{\mathrm{P};n} \big\|_\infty \qquad (19) \]
\[ \overset{(17)}{\le} \|k_2\|_\infty\, \|k_1\|_\infty\, \varepsilon^{-1}\lambda_{2,n}^{-1}\, \big\| f_{\mathrm{D};n} - f_{\mathrm{P};n} \big\|_{H_1}. \]

Next, it follows from the representer theorem [23, Cor. 5.10] that there is an $h_{P,n}\in L_\infty(\mathcal{X}\times\mathcal{Y})$ with $\|h_{P,n}\|_\infty \le 0.5$ such that

\[ \big\| f_{\mathrm{D};n} - f_{\mathrm{P};n} \big\|_{H_1} \le \lambda_{1,n}^{-1} \bigg\| \frac{1}{n}\sum_{i=1}^n \big( h_{P,n}(X_i,Y_i)\Phi_1(X_i) - E_P\, h_{P,n}\Phi_1 \big) \bigg\|_{H_1}. \qquad (20) \]

Define $B := \sup_{x,y} \|h_{P,n}(x,y)\Phi_1(x)\|_{H_1} \le 0.5\,\|k_1\|_\infty$ and fix any $\eta > 0$.
Then it follows from (20) and Hoeffding's inequality [28, Chapter 3] that, for $n\to\infty$,

\[ \mathrm{P}^n\Big( \lambda_{2,n}^{-1}\, \big\| f_{\mathrm{D};n} - f_{\mathrm{P};n} \big\|_{H_1} \ge \eta \Big) \le \exp\bigg( -\,\frac{\tfrac12\,\eta^2\lambda_{1,n}^2\lambda_{2,n}^2\, n}{\eta\lambda_{1,n}\lambda_{2,n} B + 3B^2} \bigg) \longrightarrow 0, \]

because $\lim_{n\to\infty}\lambda_{1,n}^2\lambda_{2,n}^2\, n = \infty$. That is, we have shown that $\Delta_1^{(n)} \le \|k_2\|_\infty \big\| g_{\mathrm{D},\mathrm{D};n} - g_{\mathrm{P},\mathrm{D};n} \big\|_{H_2}$ converges to 0 in probability w.r.t. $\mathrm{P}^n$.

Part 2:
Define $L_{\varepsilon;n}(x,y,t) = L_\varepsilon\big(|y - f_{\mathrm{P};n}(x)|,\, t\big)$. This yields

\[ g_{\mathrm{P},P_2;n} = \arg\inf_{g\in H_2}\bigg( \int L_{\varepsilon;n}\big(x,y,g(x)\big)\, P_2\big(d(x,y)\big) + \lambda_{2,n}\|g\|_{H_2}^2 \bigg) \quad \forall\, P_2\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y}). \]

Hence, for $h_{P,n}(x,y) = L'_\varepsilon\big(|y - f_{\mathrm{P};n}(x)|,\, g_{\mathrm{P},\mathrm{P};n}(x)\big)$, the representer theorem [23, Cor. 5.10] implies that

\[ \big\| g_{\mathrm{P},\mathrm{D};n} - g_{\mathrm{P},\mathrm{P};n} \big\|_{H_2} \le \lambda_{2,n}^{-1} \bigg\| \frac{1}{n}\sum_{i=1}^n h_{P,n}(X_i,Y_i)\Phi_2(X_i) - E_P\, h_{P,n}\Phi_2 \bigg\|_{H_2}. \]

For $B := \sup_{x,y}\|h_{P,n}(x,y)\Phi_2(x)\|_{H_2} \le 0.5\,\|k_2\|_\infty$, it follows from Hoeffding's inequality [28, Chap. 3] and $\lambda_{2,n}^2\, n \to \infty$ that $\Delta_2^{(n)} \le \|k_2\|_\infty \big\| g_{\mathrm{P},\mathrm{D};n} - g_{\mathrm{P},\mathrm{P};n} \big\|_{H_2}$ converges to 0 in probability.

Part 3:
Since $\lim_{n\to\infty} \mathcal{R}_{L,P}(f_{\mathrm{P};n}) = \inf_{f\in\mathcal{L}_0(\mathcal{X})} \mathcal{R}_{L,P}(f)$, as shown in [23, p. 338], it follows from Lemma 3.1 that

\[ \lim_{n\to\infty} \Delta_3^{(n)} = \lim_{n\to\infty} \big\| f^*_{0.5,P} - f_{\mathrm{P};n} \big\|_{L_1(P_X)} = 0. \qquad (21) \]

Part 4:
For every $g\in H_2$, define the approximation error function (where we use the notation (6))

\[ A_g:\; L_1(P_X)\times\mathbb{R} \to \mathbb{R}, \quad (f,\lambda) \mapsto \mathcal{R}_{L_\varepsilon,P}(f,g) + \lambda\|g\|_{H_2}^2 - \inf_{g_0\in\mathcal{L}_0(\mathcal{X})}\mathcal{R}_{L_\varepsilon,P}\big(f^*_{0.5,P},\, g_0\big). \]

Note that the assumption $E_P|Y| < \infty$ implies that $|A_g(f,\lambda)| < \infty$, so that $A_g$ is well defined. It follows from the Lipschitz property (10) of $L_\varepsilon$ that $A_g$ is continuous for every $g\in H_2$ and, therefore, the map $(f,\lambda)\mapsto \inf_{g\in H_2} A_g(f,\lambda)$ is upper semicontinuous. Hence, (21) implies

\[ \limsup_{n\to\infty} \Delta_4^{(n)} \le \limsup_{n\to\infty}\, \inf_{g\in H_2} A_g\big(f_{\mathrm{P};n},\, \lambda_{2,n}\big) \le \inf_{g\in H_2} A_g\big(f^*_{0.5,P},\, 0\big) = 0, \]

where the last equality follows because the assumption that $H_2$ is dense in $L_1(P_X)$ guarantees $\inf_{g\in\mathcal{L}_0(\mathcal{X})}\mathcal{R}_{L_\varepsilon,P}\big(f^*_{0.5,P}, g\big) = \inf_{g\in H_2}\mathcal{R}_{L_\varepsilon,P}\big(f^*_{0.5,P}, g\big)$ according to [23, Theorem 5.31]. $\Box$

References

[1] H. Bauer. Measure and Integration Theory. Walter de Gruyter & Co., Berlin, 2001.

[2] A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer, Boston, 2004.
[3] T. T. Cai and L. Wang. Adaptive variance function estimation in heteroscedastic nonparametric regression. The Annals of Statistics, 36:2025–2054, 2008.

[4] A. Christmann and A. Van Messem. Bouligand derivatives and robustness of support vector machines for regression. J. Mach. Learn. Res., 9:915–936, 2008.

[5] A. Christmann, A. Van Messem, and I. Steinwart. On consistency and robustness properties of support vector machines for heavy-tailed distributions. Statistics and Its Interface, 2:311–327, 2009.

[6] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, 2000.

[7] Z. Denkowski, S. Migórski, and N. Papageorgiou. An Introduction to Nonlinear Analysis: Theory. Kluwer Academic Publishers, Boston, 2003.

[8] T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel methods in machine learning. The Annals of Statistics, 36:1171–1220, 2008.

[9] P. J. Huber. The behavior of maximum likelihood estimates under nonstandard conditions. Proc. 5th Berkeley Symp., 1:221–233, 1967.

[10] P. J. Huber. Robust Statistics. John Wiley & Sons, New York, 1981.

[11] B. Jørgensen. Exponential dispersion models. J. R. Statist. Soc. B, 49:127–162, 1987.

[12] B. Jørgensen. Theory of Dispersion Models. Chapman and Hall, London, 1997.

[13] A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11:1–20, 2004.

[14] R. W. Koenker. Quantile Regression. Cambridge University Press, Cambridge, 2005.

[15] R. W. Koenker and G. W. Bassett. Regression quantiles. Econometrica, 46:33–50, 1978.

[16] S. Rüping. mySVM – Manual. Department of Computer Science, University of Dortmund, 2000.

[17] D. Ruppert, M. P. Wand, and R. J. Carroll. Semiparametric Regression. Cambridge University Press, Cambridge, 2003.

[18] B. Schölkopf and A. J. Smola. Learning with Kernels. Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.

[19] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10:1299–1319, 1998.

[20] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Comput., 12:1207–1245, 2000.

[21] G. Smyth. Generalized linear models with varying dispersion. J. R. Statist. Soc. B, 51:47–60, 1989.

[22] I. Steinwart and A. Christmann. How SVMs can estimate quantiles and the median. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 305–312. MIT Press, Cambridge, MA, 2008.

[23] I. Steinwart and A. Christmann. Support Vector Machines. Springer, New York, 2008.

[24] I. Steinwart and A. Christmann. Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1):211–225, 2011.

[25] I. Takeuchi, Q. V. Le, T. D. Sears, and A. J. Smola. Nonparametric quantile estimation. J. Mach. Learn. Res., 7:1231–1264, 2006.

[26] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.

[27] G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In B. Schölkopf, C. J. C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 69–88. MIT Press, Cambridge, MA, 1999.

[28] V. Yurinsky. Sums and Gaussian Vectors, volume 1617 of Lecture Notes in Mathematics. Springer, Berlin, 1995.