Monotonicity preservation properties of kernel regression estimators
Iosif Pinelis
Department of Mathematical Sciences, Michigan Technological University, Houghton, Michigan 49931, USA. E-mail: [email protected]
Abstract
Three common classes of kernel regression estimators are considered: the Nadaraya–Watson (NW) estimator, the Priestley–Chao (PC) estimator, and the Gasser–Müller (GM) estimator. It is shown that (i) the GM estimator has a certain monotonicity preservation property for any kernel $K$, (ii) the NW estimator has this property if and only if the kernel $K$ is log concave, and (iii) the PC estimator does not have this property for any kernel $K$. Other related properties of these regression estimators are discussed.

Keywords: nonparametric estimators, kernel regression estimators, curve fitting, monotonicity preservation property
1. Introduction, summary, and discussion
We are given points $(x_1,y_1),\dots,(x_n,y_n)$ in $\mathbb R^2$. These points may be thought of as particular realizations of random pairs $(X_1,Y_1),\dots,(X_n,Y_n)$. In particular, this includes nonlinear regression models of the form $Y_i=f(X_i)+\varepsilon_i$ for $i\in[n]:=\{1,\dots,n\}$, where $f$ is a somewhat smooth unknown function from $\mathbb R$ to $\mathbb R$ and the $\varepsilon_i$'s are random variables such that $\mathsf E(\varepsilon_i\,|\,X_1,\dots,X_n)=0$ for all $i$. One then wants to obtain an estimator $\hat f$ of the unknown function $f$. A way to do that is to smooth the data $(x_1,y_1),\dots,(x_n,y_n)$ using a kernel $K$, which is understood as a probability density function (pdf) on $\mathbb R$ – that is, a nonnegative measurable function from $\mathbb R$ to $\mathbb R$ such that $\int_{\mathbb R}K(u)\,du=1$. The resulting kernel smoothers $\hat f$ of the data are called kernel regression estimators. The kernel $K$ is usually taken according to the formula

$$ K(u)=K_{\kappa,h}(u):=\frac1h\,\kappa\Big(\frac uh\Big) \qquad (1.1) $$

for all real $u$, where $\kappa$ can be thought of as a fixed kernel, and then $h$ is a positive real number referred to as the bandwidth, whose choice may depend on the model, the estimator used, the sample size $n$, and possibly on the data as well; the choice of the “mother” kernel $\kappa$ may depend on the model and the estimator. See e.g. [6].
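For readers who like to experiment, here is a minimal Python sketch of the rescaling (1.1); the standard normal mother kernel and the function names are chosen only for illustration and are not part of the results of this note.

```python
import numpy as np

def scaled_kernel(kappa, h):
    """Return the kernel K_{kappa,h}(u) = kappa(u/h)/h of (1.1)."""
    return lambda u: kappa(np.asarray(u) / h) / h

kappa_gauss = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)   # standard normal mother kernel
K = scaled_kernel(kappa_gauss, h=0.5)                              # bandwidth h = 0.5
print(K(0.0))                                                      # = kappa(0)/h, about 0.798
```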
Let $\mathbf x:=(x_1,\dots,x_n)$ and $\mathbf y:=(y_1,\dots,y_n)$. The three most common kernel regression estimators are as follows.

The Nadaraya–Watson (NW) estimator [15, 19] is defined by the formula

$$ \hat f^{\mathrm{NW}}_K(x):=\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}(x):=\frac{\sum_{i=1}^n y_i K(x-x_i)}{\sum_{i=1}^n K(x-x_i)} \qquad (1.2) $$

for all real $x$ such that the denominator $\sum_{i=1}^n K(x-x_i)$ of the ratio in (1.2) is nonzero; let us denote the set of all such $x$ by $D^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$:

$$ D^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}:=\Big\{x\in\mathbb R\colon \sum_{i=1}^n K(x-x_i)>0\Big\}. $$

For $x\notin D^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$, the value of $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}(x)$ is left undefined. So, $D^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$ is the domain (of definition) of the NW estimator $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$.

The Priestley–Chao (PC) estimator [16] is defined by the formula

$$ \hat f^{\mathrm{PC}}_K(x):=\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}(x):=\sum_{i=1}^n y_i(x_i-x_{i-1})K(x-x_i) \qquad (1.3) $$

for all real $x$. Here, it is assumed that the $x_i$'s are in the order of their indices, so that

$$ x_1\le\dots\le x_n \qquad (1.4) $$

and that $x_0$ is a real number such that $x_0\le x_1$.

The Gasser–Müller (GM) estimator [9] is defined by the formula

$$ \hat f^{\mathrm{GM}}_K(x):=\hat f^{\mathrm{GM}}_{K;\mathbf x,\mathbf y}(x):=\sum_{i=1}^n y_i\int_{s_{i-1}}^{s_i}K(x-t)\,dt \qquad (1.5) $$

for all real $x$, where $s_i:=(x_i+x_{i+1})/2$. Here, (1.4) is assumed again, with the additional assumptions $x_0:=-\infty$ and $x_{n+1}:=\infty$, so that $s_0=-\infty$ and $s_n=\infty$. Note that $x_0$ here is not the same as $x_0$ for the PC estimator.

The PC and GM estimators are defined on the entire real line $\mathbb R$, which is thus the domain of these two estimators.

The question considered in the present note is this:

Under what conditions on the kernel $K$ do the NW, PC, and GM kernel estimators preserve the monotonicity?

More specifically, assume that condition (1.4) holds, as well as the condition

$$ y_1\le\dots\le y_n, \qquad (1.6) $$

so that, if $x_i<x_j$ for some $i$ and $j$ in $[n]$, then $y_i\le y_j$. One can also say that $\mathbf x=(x_1,\dots,x_n)$ and $\mathbf y=(y_1,\dots,y_n)$ are co-monotone.

The co-monotonicity condition arises naturally e.g. in the following setting: We have two groups – labeled, say, as an $x$-group and a $y$-group, each consisting of $n$ individuals. The individuals in each group are ordered according to the values of a certain numerical characteristic, matching the individual in the $x$-group with the $i$th smallest value $x_i$ of the characteristic to the individual in the $y$-group with the $i$th smallest value $y_i$ of the characteristic, so that conditions (1.4) and (1.6) hold.

Let us say that the NW kernel estimator preserves the monotonicity for a given kernel $K$ if the function $\hat f^{\mathrm{NW}}_K=\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$ is nondecreasing (on its domain $D^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$) for any natural $n$ and any co-monotone $\mathbf x$ and $\mathbf y$ in $\mathbb R^n$. Similarly defined are the monotonicity preservation properties for the PC and GM kernel estimators, with the domain $D^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$ of course replaced by $\mathbb R$ for the latter two estimators.

The monotonicity preservation property appears to be natural and desirable for a curve estimator. This point is an instance of the general principle that it is desirable for the values of a statistical estimator to be in the set of all values of the estimated function of the unknown distribution. E.g., it is natural to want the values of an estimator of a nonnegative parameter to be nonnegative; the values of an estimator of a pdf to be pdf's; etc.
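The estimators (1.2), (1.3), and (1.5), and the monotonicity question just posed, are easy to experiment with. The following self-contained Python sketch implements the three estimators for a Gaussian kernel of the form (1.1); the toy data, bandwidth, and function names are illustrative only. The GM weights are computed via the kernel's cdf, using $\int_{s_{i-1}}^{s_i}K(x-t)\,dt=F(x-s_{i-1})-F(x-s_i)$.

```python
import numpy as np
from math import erf, sqrt

def K_gauss(u, h):
    """Gaussian kernel of the form (1.1) with standard normal mother kernel kappa."""
    return np.exp(-(u / h) ** 2 / 2) / (h * sqrt(2 * np.pi))

def F_gauss(u, h):
    """CDF of K_gauss; needed for the GM weights."""
    return 0.5 * (1.0 + erf(u / (h * sqrt(2.0))))

def nw(grid, x, y, h):
    """Nadaraya-Watson estimate (1.2) on a grid of evaluation points."""
    w = K_gauss(grid[:, None] - x[None, :], h)        # w[m, i] = K(grid[m] - x[i])
    return (w @ y) / w.sum(axis=1)                    # denominator > 0 everywhere for a Gaussian kernel

def pc(grid, x, y, h, x0):
    """Priestley-Chao estimate (1.3); x0 <= x[0] is the extra left abscissa."""
    dx = np.diff(np.concatenate(([x0], x)))           # x_i - x_{i-1}
    w = K_gauss(grid[:, None] - x[None, :], h)
    return w @ (y * dx)

def gm(grid, x, y, h):
    """Gasser-Mueller estimate (1.5), with s_0 = -inf and s_n = +inf."""
    s = np.concatenate(([-np.inf], (x[:-1] + x[1:]) / 2, [np.inf]))
    F = np.array([[F_gauss(t - sj, h) for sj in s] for t in grid])
    return (F[:, :-1] - F[:, 1:]) @ y                 # weight of y_i is F(x - s_{i-1}) - F(x - s_i)

# co-monotone toy data and a quick look at monotonicity of the three fits
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 20)
y = np.sort(np.tanh(x) + 0.1 * rng.normal(size=20))
grid = np.linspace(-4.0, 4.0, 400)
for name, f in [("NW", nw(grid, x, y, 1.0)),
                ("PC", pc(grid, x, y, 1.0, x0=x[0] - 1.0)),
                ("GM", gm(grid, x, y, 1.0))]:
    print(name, bool(np.all(np.diff(f) >= -1e-12)))   # NW and GM: True; PC: typically False
```

On such co-monotone data the NW and GM fits come out nondecreasing while the PC fit does not; this is exactly the behavior characterized by Theorems 1–3 below.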
As pointed out e.g. in [10], “Monotone estimates are of course required in many practical applications, where physical considerations suggest that a response should be monotone in the dosage or the explanatory variable.” Various methods for monotonizing kernel estimators have been proposed, including the isotonic regression methods using constrained optimization [8]; a method based on the minimization of misclassification costs [4]; constrained spline-based methods [13] and other projection techniques [12]; ones based on maximizing fidelity to the conventional empirical approach subject to monotonicity [10]; monotone smoothing by inversion [7]; a method based on the Hardy–Littlewood–Pólya monotone rearrangement [1]. The paper [5] uses a sieve (rather than kernel) estimator which imposes the monotonicity constraint.

The main results of this note, which will be proved in Section 2, are Theorems 1, 2, and 3, which characterize the kernels $K$ for which the NW, PC, and GM kernel estimators preserve the monotonicity.
Theorem 1. The NW kernel estimator preserves the monotonicity for a given kernel $K$ if and only if $K$ is log concave.

Recall here that a nonnegative function $g$ is log concave if $\ln g$ is concave, with $\ln 0:=-\infty$. An important example of a log-concave kernel is any normal pdf. Also, if $K=K_{\kappa,h}$ is as in (1.1) with $\kappa(u)=c_p e^{-|u|^p}$ for some real $p\ge1$ and all real $u$ (with $c_p:=1/\int_{-\infty}^\infty e^{-|u|^p}\,du$) or with $\kappa(u)=b\,e^{-1/(1-u^2)}\,\mathrm I\{|u|<1\}$ for all real $u$ (with $b:=1/\int_{-1}^1 e^{-1/(1-u^2)}\,du$ and $\mathrm I\{\cdot\}$ denoting the indicator), then $K$ is log concave. Also, the arbitrarily shifted and rescaled pdf's of the gamma distribution with shape parameter $\ge1$ are log concave. On the other hand, the tails of any log-concave kernel $K$ necessarily decrease at least exponentially fast. Also, clearly the kernel $K_{\kappa,h}$ defined by (1.1) is log concave for each real $h>0$ if and only if $\kappa$ is log concave.

In a somewhat more specific setting, the “if” part of Theorem 1 was essentially presented in [14, Remark 2.1], based on a monotone likelihood ratio property of a posterior distribution, with a reference to [11, Lemma 2, page 74]. However, the latter lemma does not explicitly mention a posterior distribution. Therefore, we shall give a short, direct, and self-contained proof of the “if” part of Theorem 1, which will also be used to prove the “only if” part of Theorem 1.
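Log concavity of a candidate kernel can be probed numerically through the midpoint condition $K((u+v)/2)^2\ge K(u)K(v)$, which is also what the proof below uses. The following small sketch (with illustrative kernels) checks this condition on a grid for the standard normal density and for a bimodal normal mixture, which is not log concave.

```python
import numpy as np

def midpoint_log_concave(K, grid, tol=1e-12):
    """Check K((u+v)/2)^2 >= K(u)*K(v) for all pairs u, v in the grid."""
    u, v = grid[:, None], grid[None, :]
    return bool(np.all(K((u + v) / 2) ** 2 >= K(u) * K(v) - tol))

gauss = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)           # log concave
mixture = lambda u: 0.5 * gauss(u - 3) + 0.5 * gauss(u + 3)           # bimodal, not log concave

grid = np.linspace(-6.0, 6.0, 241)
print(midpoint_log_concave(gauss, grid))     # True
print(midpoint_log_concave(mixture, grid))   # False (fails e.g. at u = 3, v = -3)
```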
Theorem 2. The PC kernel estimator does not preserve the monotonicity for any given kernel $K$. More specifically, for any kernel $K$, any natural $n$, and any co-monotone $\mathbf x$ and $\mathbf y$ in $\mathbb R^n$, the function $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$ is not nondecreasing – unless $\mathbf x$ and $\mathbf y$ are trivial in the sense that

$$ y_i(x_i-x_{i-1})=0 \quad\text{for all }i\in[n] \qquad (1.7) $$

(in which case $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$ is identically $0$).
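The phenomenon behind Theorem 2 is easy to observe numerically: the PC fit is integrable over $\mathbb R$, so for nontrivial co-monotone data it must eventually come back down. A minimal sketch, assuming a Gaussian kernel and illustrative data:

```python
import numpy as np

K = lambda u, h=1.0: np.exp(-(u / h) ** 2 / 2) / (h * np.sqrt(2 * np.pi))

# nontrivial co-monotone data: not all y_i * (x_i - x_{i-1}) vanish
x = np.linspace(0.0, 5.0, 6)
y = np.arange(1.0, 7.0)
dx = np.diff(np.concatenate(([x[0] - 1.0], x)))       # here x_0 := x_1 - 1

grid = np.linspace(-10.0, 20.0, 2001)
f_pc = K(grid[:, None] - x[None, :]) @ (y * dx)       # the PC fit (1.3)

print(bool(np.all(np.diff(f_pc) >= 0)))               # False: the fit is not nondecreasing
print(float(f_pc.max()), float(f_pc[-1]))             # it rises and then decays back towards 0
```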
Theorem 3. The GM kernel estimator preserves the monotonicity for any given kernel $K$.

Remark 1. It immediately follows from the definitions (1.2), (1.3), and (1.5) that the NW, PC, and GM kernel estimators are linear in $\mathbf y$. In particular, if $\mathbf y$ is replaced by $-\mathbf y$, then the values of these estimators change to their opposites. Therefore, the monotonicity preservation property of any one of these three estimators implies the corresponding constancy preservation property, by which we mean the following: if $y_1=\dots=y_n$, then the corresponding values of the estimators do not depend on $x$.

In fact, it is obvious that, for any kernel $K$, the NW and GM kernel estimators have the constancy preservation property; moreover, they have the constant preservation property: if $y_1=\dots=y_n=c$, then the NW and GM kernel estimators have the constant value $c$. On the other hand, in view of Theorem 2, for any kernel $K$, the function $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$ for $y_1=\dots=y_n$ is constant if and only if at least one of the following two trivial cases takes place: (i) $y_1=\dots=y_n=0$ or (ii) $x_1=\dots=x_n$.

The NW and GM kernel estimators also have the shift preservation property (which actually follows from the constant preservation property and the linearity): if $y_1,\dots,y_n$ are replaced by $y_1+c,\dots,y_n+c$ for some real $c$, then $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$ and $\hat f^{\mathrm{GM}}_{K;\mathbf x,\mathbf y}$ are replaced by $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}+c$ and $\hat f^{\mathrm{GM}}_{K;\mathbf x,\mathbf y}+c$, respectively. On the other hand, for any kernel $K$, if $y_1,\dots,y_n$ are replaced by $y_1+c,\dots,y_n+c$ for some real $c$, then $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$ is replaced by $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}+c$ if and only if at least one of the following two trivial cases takes place: (i) $c=0$ or (ii) $x_1=\dots=x_n$.

Summarizing this remark, we may say that the NW and GM kernel estimators always have the constancy and shift preservation properties, whereas the PC estimator practically never has these nice properties.

The presence – or, in the case of the PC estimator, absence – of the monotonicity and shift preservation properties is illustrated in Figure 1.
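Before turning to the figures, here is a compact numerical recheck of the shift-preservation claims above for the NW and PC estimators (the GM estimator behaves like the NW one in this respect), with an illustrative Gaussian kernel and toy data:

```python
import numpy as np

K = lambda u, h=1.0: np.exp(-(u / h) ** 2 / 2) / (h * np.sqrt(2 * np.pi))

x = np.array([0.0, 1.0, 2.5, 4.0])
y = np.array([-1.0, 0.0, 0.5, 2.0])
c = 10.0
grid = np.linspace(-2.0, 6.0, 81)
W = K(grid[:, None] - x[None, :])

nw = lambda yy: (W @ yy) / W.sum(axis=1)              # NW fit (1.2) on the grid
dx = np.diff(np.concatenate(([x[0] - 1.0], x)))       # x_0 := x_1 - 1 for the PC fit
pc = lambda yy: W @ (yy * dx)                         # PC fit (1.3) on the grid

print(bool(np.allclose(nw(y + c), nw(y) + c)))        # True: NW is shift preserving
print(bool(np.allclose(pc(y + c), pc(y) + c)))        # False: PC is not
```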
The upper row in Figure 1 shows graphs of $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$, $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$, and $\hat f^{\mathrm{GM}}_{K;\mathbf x,\mathbf y}$ for the (randomly generated) 20-tuples

x = ( −. , − , −. , −. , −. , −. , −. , −. , −. , − , −. , − , − , . , . , . , . , . , . , 10 )   (1.8)

and

y = ( −. , −. , −. , −. , −. , −. , −. , −. , . , , . , . , . , . , . , , . , . , . , . )   (1.9)

with $K$ of the form $K_h:=K_{\kappa,h}$ as in (1.1) with $\kappa$ being the standard normal density.

For each of the three graphs in the upper row of Figure 1, the bandwidth $h$ of the kernel $K_h$ is determined by cross-validation (see e.g. [18]) – that is, as an approximate minimizer (obtained numerically) of

$$ CW(h):=\sum_{j=1}^{20}\big(y_j-\hat f_{K_h;\mathbf x^{(j)},\mathbf y^{(j)}}(x_j)\big)^2 $$

over real $h>0$, where $\hat f\in\{\hat f^{\mathrm{NW}},\hat f^{\mathrm{PC}},\hat f^{\mathrm{GM}}\}$, $\mathbf x^{(j)}:=(x_1,\dots,x_{j-1},x_{j+1},\dots,x_{20})$, and $\mathbf y^{(j)}:=(y_1,\dots,y_{j-1},y_{j+1},\dots,y_{20})$.
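A rough sketch of this leave-one-out cross-validation for the NW estimator, with synthetic illustrative data and a crude grid search over $h$ in place of a numerical minimizer:

```python
import numpy as np

def K_gauss(u, h):
    return np.exp(-(u / h) ** 2 / 2) / (h * np.sqrt(2 * np.pi))

def nw_at(t, x, y, h):
    """NW estimate (1.2) at a single point t."""
    w = K_gauss(t - x, h)
    return (w @ y) / w.sum()

def cw(h, x, y):
    """Leave-one-out criterion CW(h) for the NW estimator."""
    resid = [y[j] - nw_at(x[j], np.delete(x, j), np.delete(y, j), h) for j in range(len(x))]
    return float(np.sum(np.square(resid)))

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3.0, 3.0, 20))
y = np.sort(np.tanh(x) + 0.2 * rng.normal(size=20))     # co-monotone toy data

hs = np.linspace(0.2, 3.0, 57)                           # crude grid search over the bandwidth
h_best = hs[np.argmin([cw(h, x, y) for h in hs])]
print(h_best, cw(h_best, x, y))
```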
The lower row in Figure 1 shows graphs of $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y^+}$, $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y^+}$, and $\hat f^{\mathrm{GM}}_{K;\mathbf x,\mathbf y^+}$ for $\mathbf x$ and $\mathbf y$ as in (1.8) and (1.9), with $\mathbf y$ replaced by its shifted version $\mathbf y^+:=(y_1+10,\dots,y_n+10)$. Here the kernel $K$ is of the same form as the one used for the upper row of Figure 1, with the bandwidth $h$ still determined by cross-validation.

For the graphs of $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$ and $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y^+}$ in Figure 1, it is assumed that $x_0:=x_1-h$.

The corresponding data points are also shown in Figure 1.

Figure 1 illustrates the monotonicity and shift preservation properties of the NW and GM estimators and the lack of these properties for the PC estimator.

One may also note that the GM graphs in Figure 1 look very similar to the corresponding NW graphs. However, other choices of $\mathbf x$ and $\mathbf y$ suggest that the GM graphs are usually a bit smoother than the NW ones. A possible reason for this is that the cross-validation quality $CW(h)$ for the NW estimator is usually rather flat (that is, almost constant) in a large neighborhood of a minimizer $h$ of $CW(h)$, and hence the choice of a numerical minimizer $h$ of $CW(h)$ for the NW estimator may be rather unstable, which can then affect the smoothness of the resulting NW estimator. Moreover, $CW$ is usually rather flat for the GM estimator as well. This observation is illustrated in Figure 2.

On the other hand, as is clearly seen from the definition (1.5), the GM estimator is always continuous, for any, however discontinuous, kernel $K$. Of course, this cannot be said concerning the NW and PC estimators.

Figures 1 and 2 also suggest that, at least for co-monotone $\mathbf x$ and $\mathbf y$, the NW and GM curves fit the data $(\mathbf x,\mathbf y)$ significantly better than the PC estimator does. In particular, for $\mathbf x$ and $\mathbf y$ as in (1.8) and (1.9), the smallest values of the cross-validation quality $CW$ are about 27.1, 71.7, and 20.…, respectively; the smaller values of $CW$ correspond to the higher quality of the fit.

Figure 3 is quite similar to Figure 1 except that in Figure 3 for the mother kernel $\kappa$ we use the “rectangular” uniform density on the interval $(-1/2,1/2)$. As for the smallest values of $CW$ for the “rectangular” kernel $K$, they were about 105.…, and the corresponding $CW$ values were about 99.….
Thus, for such a discontinuous “rectangular” kernel $K$, the GM estimator appears to perform much better than the NW and PC ones. The performance of the PC estimator was especially poor in the second instance.

Figure 4 is quite similar to Figures 1 and 3 except that in Figure 4 the mother kernel $\kappa$ is the infinitely smooth pdf defined by the formula $\kappa(u)=b\,e^{-1/(1-u^2)}\,\mathrm I\{|u|<1\}$ for all real $u$, with $b:=1/\int_{-1}^1 e^{-1/(1-u^2)}\,du$. We see that the graphs of the GM estimator in Figure 4 look again smoother than the corresponding graphs of the NW estimator; however, in this case all the curves are infinitely smooth.
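For completeness, the normalizing constant $b$ of this mother kernel is easy to obtain numerically; a minimal sketch (the numerical values in the comments are approximate):

```python
import math
from scipy.integrate import quad

def bump_unnormalized(u):
    """exp(-1/(1-u^2)) for |u| < 1, and 0 otherwise."""
    return math.exp(-1.0 / (1.0 - u * u)) if abs(u) < 1 else 0.0

Z, _ = quad(bump_unnormalized, -1.0, 1.0)     # integral over (-1, 1), roughly 0.444
b = 1.0 / Z                                   # normalizing constant, roughly 2.25
kappa = lambda u: b * bump_unnormalized(u)    # infinitely smooth, compactly supported mother kernel
print(b, kappa(0.0))                          # kappa(0) = b * exp(-1), roughly 0.83
```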
Figure 1: Upper row: graphs of $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$ (left), $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$ (middle), and $\hat f^{\mathrm{GM}}_{K;\mathbf x,\mathbf y}$ (right) for $\mathbf x$ and $\mathbf y$ as in (1.8) and (1.9). Lower row: graphs of $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y^+}$ (left), $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y^+}$ (middle), and $\hat f^{\mathrm{GM}}_{K;\mathbf x,\mathbf y^+}$ (right) for $\mathbf x$ and $\mathbf y$ as in (1.8) and (1.9). Here $K$ is of the form (1.1) with $\kappa$ being the standard normal density.

Figure 2: Graphs of $CW$ for the NW estimator (left), the PC estimator (middle), and the GM estimator (right) for $\mathbf x$ and $\mathbf y$ as in (1.8) and (1.9). Here $K$ is of the form (1.1) with $\kappa$ being the standard normal density.

Figure 3: Upper row: graphs of $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$ (left), $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$ (middle), and $\hat f^{\mathrm{GM}}_{K;\mathbf x,\mathbf y}$ (right) for $\mathbf x$ and $\mathbf y$ as in (1.8) and (1.9). Lower row: graphs of $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y^+}$ (left), $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y^+}$ (middle), and $\hat f^{\mathrm{GM}}_{K;\mathbf x,\mathbf y^+}$ (right) for $\mathbf x$ and $\mathbf y$ as in (1.8) and (1.9). Here $K$ is of the form (1.1) with $\kappa$ being the uniform density on $(-1/2,1/2)$.

Figure 4: Upper row: graphs of $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$ (left), $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$ (middle), and $\hat f^{\mathrm{GM}}_{K;\mathbf x,\mathbf y}$ (right) for $\mathbf x$ and $\mathbf y$ as in (1.8) and (1.9). Lower row: graphs of $\hat f^{\mathrm{NW}}_{K;\mathbf x,\mathbf y^+}$ (left), $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y^+}$ (middle), and $\hat f^{\mathrm{GM}}_{K;\mathbf x,\mathbf y^+}$ (right) for $\mathbf x$ and $\mathbf y$ as in (1.8) and (1.9). Here $K$ is of the form (1.1) with $\kappa(u)=b\,e^{-1/(1-u^2)}\,\mathrm I\{|u|<1\}$ for all real $u$, where $b=1/\int_{-1}^1 e^{-1/(1-u^2)}\,du$.

2. Proofs

The proofs of Theorems 1, 2, and 3 given below are each based on quite different ideas.
Proof of Theorem 1.
Consider first the “if” part. Here we suppose that $K$ is log concave. Take any $x$ and $z$ in $D^{\mathrm{NW}}_{K;\mathbf x,\mathbf y}$ such that $x<z$. We have to show that $\hat f^{\mathrm{NW}}_K(z)\ge\hat f^{\mathrm{NW}}_K(x)$.

Letting for brevity $k_i:=K(x-x_i)$ and $l_i:=K(z-x_i)$, $\sum_i:=\sum_{i\in[n]}$, and $\sum_{i,j}:=\sum_{i\in[n],\,j\in[n]}$, we see that

$$
\begin{aligned}
2\big(\hat f^{\mathrm{NW}}_K(z)-\hat f^{\mathrm{NW}}_K(x)\big)\sum_i k_i\sum_j l_j
&=2\Big(\frac{\sum_j y_j l_j}{\sum_j l_j}-\frac{\sum_i y_i k_i}{\sum_i k_i}\Big)\sum_i k_i\sum_j l_j \\
&=2\sum_j y_j l_j\sum_i k_i-2\sum_i y_i k_i\sum_j l_j \\
&=\sum_j y_j l_j\sum_i k_i+\sum_i y_i l_i\sum_j k_j-\sum_i y_i k_i\sum_j l_j-\sum_j y_j k_j\sum_i l_i \\
&=\sum_{i,j}\big(y_j l_j k_i+y_i l_i k_j-y_i k_i l_j-y_j k_j l_i\big) \\
&=\sum_{i,j}(y_j-y_i)(l_j k_i-k_j l_i). \qquad (2.1)
\end{aligned}
$$
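The algebraic identity (2.1) can be verified numerically for arbitrary data; a small sketch with an illustrative Gaussian kernel:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 7
xs = np.sort(rng.normal(size=n))          # abscissas x_1 <= ... <= x_n
ys = rng.normal(size=n)                   # arbitrary ordinates: (2.1) needs no monotonicity
h = 0.8
K = lambda u: np.exp(-(u / h) ** 2 / 2) / (h * np.sqrt(2 * np.pi))

x, z = -0.3, 1.1                          # two evaluation points, x < z
k, l = K(x - xs), K(z - xs)               # k_i = K(x - x_i), l_i = K(z - x_i)
fx, fz = (k @ ys) / k.sum(), (l @ ys) / l.sum()

lhs = 2.0 * (fz - fx) * k.sum() * l.sum()
rhs = sum((ys[j] - ys[i]) * (l[j] * k[i] - k[j] * l[i]) for i in range(n) for j in range(n))
print(bool(np.isclose(lhs, rhs)))         # True: both sides of (2.1) agree
```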
For any $i$ and $j$ in $[n]$ such that $i\le j$ we have $x_i\le x_j$ and hence, by the log-concavity of $K$, $k_i\ge l_i^{1-t}k_j^t$ and $l_j\ge l_i^t k_j^{1-t}$, where $t:=t_{i,j}:=(z-x)/(x_j-x_i+z-x)\in(0,1]$ and $0^0:=0$, so that $l_jk_i\ge k_jl_i$ and hence $(y_j-y_i)(l_jk_i-k_jl_i)\ge0$. By symmetry, the latter inequality also holds for any $i$ and $j$ in $[n]$ such that $i\ge j$. So, by (2.1), we have $\hat f^{\mathrm{NW}}_K(z)\ge\hat f^{\mathrm{NW}}_K(x)$, which completes the proof of the “if” part of Theorem 1.

Consider now the “only if” part of Theorem 1. Here we are assuming that the NW kernel estimator preserves the monotonicity for $K$, and we have to show that $K$ is then log concave. It is enough to show that $K$ is log concave on the set $s(K):=\{x\in\mathbb R\colon K(x)>0\}$.

Take $n=2$, $y_1=0$, $y_2=1$, $x_1=0$, $x_2=(v-u)/2$, $x=(v+u)/2$, and $z=v$ for any $u$ and $v$ in $s(K)$ such that $u<v$.
Then, by (2.1), $l_2k_1\ge k_2l_1$, that is, $K((v+u)/2)^2\ge K(u)K(v)$, which means that $\ln K$ is midpoint concave. Also, $\ln K$ is Lebesgue measurable, since $K$ is a pdf. By Sierpiński's theorem [17], any Lebesgue measurable midpoint concave function is concave. So, $\ln K$ is concave and thus $K$ is log concave. Now the “only if” part of Theorem 1 is proved as well.

Proof of Theorem 2. Take any kernel $K$, any natural $n$, and any co-monotone $\mathbf x$ and $\mathbf y$ in $\mathbb R^n$ such that the function $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$ is nondecreasing. We have to show that then $\mathbf x$ and $\mathbf y$ are trivial in the sense that (1.7) holds.

Since $K$ is a pdf, (1.3) implies

$$ \int_{\mathbb R}\hat f^{\mathrm{PC}}_K(x)\,dx=\sum_{i=1}^n y_i(x_i-x_{i-1})\in\mathbb R, $$

so that $\hat f^{\mathrm{PC}}_K\in L^1(\mathbb R)$.

However, the only nondecreasing function $f\in L^1(\mathbb R)$ is the zero function. Indeed, if $f(a)>0$ for some $a\in\mathbb R$, then $f\ge f(a)>0$ on $[a,\infty)$ and hence $\int_{[a,\infty)}f(x)\,dx=\infty$, which contradicts the assumption $f\in L^1(\mathbb R)$. Similarly, if $f(a)<0$ for some $a\in\mathbb R$, then $f\le f(a)<0$ on $(-\infty,a]$ and hence $\int_{(-\infty,a]}f(x)\,dx=-\infty$, which again contradicts the assumption $f\in L^1(\mathbb R)$.

Therefore and because the function $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$ was assumed to be nondecreasing, we conclude that $\hat f^{\mathrm{PC}}_{K;\mathbf x,\mathbf y}$ must be the zero function. Recalling (1.3) again and applying the Fourier transform, we see that

$$ \sum_{j=1}^n y_j(x_j-x_{j-1})e^{itx_j}\hat K(t)=0 \qquad (2.2) $$

for all real $t$, where $\hat K$ is the Fourier transform/characteristic function of the pdf $K$ given by the formula

$$ \hat K(t):=\int_{\mathbb R}e^{itx}K(x)\,dx $$

and $i$ is the imaginary unit. Since $\hat K(0)=1$ and the function $\hat K$ is continuous, there exists some real $t_0>0$ such that for all $t\in(-t_0,t_0)$ we have $\hat K(t)\ne0$ and hence, by (2.2), $\sum_{j=1}^n y_j(x_j-x_{j-1})e^{itx_j}=0$ or, equivalently,

$$ \sum_{j\in J}y_j(x_j-x_{j-1})e^{itx_j}=0, $$

where $J:=\{j\in\{1,\dots,n\}\colon x_j-x_{j-1}\ne0\}$. Note that the $x_j$'s for $j\in J$ are pairwise distinct – because for any $j$ and $k$ in $J$ such that $j<k$ we have $x_j\le x_{k-1}<x_k$. Using now the textbook fact that exponential functions are linearly independent on any nonempty open interval (cf. e.g. Lemma 3.2 on page 92 in [3] or a more general version for group characters [2, Theorem 12, page 38]), we conclude that $y_j(x_j-x_{j-1})=0$ for all $j\in J$ and hence for all $j\in\{1,\dots,n\}$, which completes the proof of Theorem 2.

Proof of Theorem 3. Let $F$ be the cdf corresponding to the pdf $K$, so that

$$ F(x)=\int_{-\infty}^x K(u)\,du $$

for $x\in[-\infty,\infty]$. Also introduce

$$ (\Delta_iF)(x):=F(x-s_{i-1})-F(x-s_i) $$

for all $x\in[-\infty,\infty]$ and all $i=1,\dots,n$, and

$$ \Delta y_i:=y_i-y_{i-1} $$

for all $i=2,\dots,n$, so that $y_i=y_1+\sum_{j=2}^i\Delta y_j$ for all $i=1,\dots,n$; as usual, $\sum_{j=2}^1\cdots:=0$. Then, recalling the definition (1.5) of the GM kernel estimator, for all real $x$ we have

$$
\begin{aligned}
\hat f^{\mathrm{GM}}_K(x)&=\sum_{i=1}^n y_i(\Delta_iF)(x)\\
&=\sum_{i=1}^n\Big(y_1+\sum_{j=2}^i\Delta y_j\Big)(\Delta_iF)(x)\\
&=y_1\sum_{i=1}^n(\Delta_iF)(x)+\sum_{i=1}^n\sum_{j=2}^i\Delta y_j(\Delta_iF)(x)\\
&=y_1+\sum_{j=2}^n\Delta y_j\sum_{i=j}^n(\Delta_iF)(x)\\
&=y_1+\sum_{j=2}^n\Delta y_j\,F(x-s_{j-1}),
\end{aligned}
$$

taking into account that $s_0=-\infty$ and $s_n=\infty$, whereas $F(-\infty)=0$ and $F(\infty)=1$. Also, by (1.6), $\Delta y_j\ge0$ for $j=2,\dots,n$. Now it is obvious that the function $\hat f^{\mathrm{GM}}_K$ is nondecreasing. This completes the proof of Theorem 3.
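As a sanity check on this computation, the following sketch (with an illustrative Gaussian kernel and toy data) compares the defining formula (1.5) with the telescoped representation $y_1+\sum_{j=2}^n\Delta y_j\,F(x-s_{j-1})$ and confirms that the resulting curve is nondecreasing:

```python
import numpy as np
from math import erf, sqrt, inf

h = 0.7
F = lambda u: 0.5 * (1.0 + erf(u / (h * sqrt(2.0))))        # CDF of the Gaussian kernel

x = np.array([0.0, 1.0, 1.5, 3.0, 4.0])
y = np.array([-1.0, 0.0, 0.0, 2.0, 5.0])                    # co-monotone with x
s = np.concatenate(([-inf], (x[:-1] + x[1:]) / 2, [inf]))   # s_0, ..., s_n

def gm_def(t):
    """GM estimate at t via the definition (1.5)."""
    return sum(y[i] * (F(t - s[i]) - F(t - s[i + 1])) for i in range(len(x)))

def gm_telescoped(t):
    """GM estimate at t via y_1 + sum_{j>=2} (y_j - y_{j-1}) F(t - s_{j-1})."""
    return y[0] + sum((y[j] - y[j - 1]) * F(t - s[j]) for j in range(1, len(x)))

grid = np.linspace(-3.0, 7.0, 201)
vals = np.array([gm_def(t) for t in grid])
print(bool(np.allclose(vals, [gm_telescoped(t) for t in grid])))   # True: the formulas agree
print(bool(np.all(np.diff(vals) >= -1e-12)))                       # True: nondecreasing, as claimed
```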
References

[1] Dragi Anevski and Anne-Laure Fougères, Limit properties of the monotone rearrangement for density and regression function estimation, Bernoulli (2019), no. 1, 549–583. MR 3892329

[2] Emil Artin, Galois theory, Edited and supplemented with a section on applications by Arthur N. Milgram. Second edition, with additions and revisions. Fifth reprinting. Notre Dame Mathematical Lectures, No. 2, University of Notre Dame Press, South Bend, Ind., 1959. MR 0265324

[3] Viorel Barbu, Differential equations, Springer, Cham, 2016. MR 3585801

[4] Daniel A. Bloch and Bernard W. Silverman, Monotone discriminant functions and their applications in rheumatology, Journal of the American Statistical Association (1997), no. 437, 144–153.

[5] Denis Chetverikov and Daniel Wilhelm, Nonparametric instrumental variable estimation under monotonicity, Econometrica (2017), no. 4, 1303–1320. MR 3681772

[6] C.-K. Chu and J. S. Marron, Choosing a kernel regression estimator, Statist. Sci. (1991), no. 4, 404–436, With comments and a rejoinder by the authors. MR 1146907

[7] Holger Dette, Natalie Neumeyer, and Kay F. Pilz, A simple nonparametric estimator of a strictly monotone regression function, Bernoulli (2006), no. 3, 469–490. MR 2232727

[8] Jerome Friedman and Robert Tibshirani, The monotone smoothing of scatterplots, Technometrics (1984), no. 3, 243–250.

[9] Theo Gasser and Hans-Georg Müller, Kernel estimation of regression functions, Smoothing techniques for curve estimation (Proc. Workshop, Heidelberg, 1979), Lecture Notes in Math., vol. 757, Springer, Berlin, 1979, pp. 23–68. MR 564251

[10] Peter Hall and Li-Shan Huang, Nonparametric kernel regression subject to monotonicity constraints, Ann. Statist. (2001), no. 3, 624–647. MR 1865334

[11] E. L. Lehmann, Testing statistical hypotheses, John Wiley & Sons, Inc., New York; Chapman & Hall, Ltd., London, 1959. MR 0107933

[12] E. Mammen, J. S. Marron, B. A. Turlach, and M. P. Wand, A general projection framework for constrained smoothing, Statistical Science (2001), no. 3, 232–248.

[13] E. Mammen and C. Thomas-Agnan, Smoothing splines and shape restrictions, Scandinavian Journal of Statistics (1999), no. 2, 239–252.

[14] Hari Mukerjee, Monotone nonparametric regression, Ann. Statist. (1988), no. 2, 741–750. MR 947574

[15] E. A. Nadaraya, On nonparametric estimates of density functions and regression curves, Theory Probab. Appl. (1965), 186–190.

[16] M. B. Priestley and M. T. Chao, Non-parametric function fitting, J. Roy. Statist. Soc. Ser. B (1972), 385–392. MR 331616

[17] Wacław Sierpiński, Sur les fonctions convexes mesurables, Fundamenta Mathematicae (1920), no. 1, 125–128.

[18] M. Stone, Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society. Series B (Methodological) (1974), no. 2, 111–147.

[19] Geoffrey S. Watson, Smooth regression analysis, Sankhyā Ser. A 26