Lévy Adaptive B-spline Regression via Overcomplete Systems
Sewon Park, Hee-Seok Oh, and Jaeyong Lee
Department of Statistics, Seoul National University
February 2, 2021
Abstract
The estimation of functions with varying degrees of smoothness is a challenging problem in nonparametric function estimation. In this paper, we propose the LABS (Lévy Adaptive B-Spline regression) model, an extension of the LARK models, for the estimation of functions with varying degrees of smoothness. The LABS model is a LARK model with B-spline bases as generating kernels. The B-spline basis consists of piecewise degree $k$ polynomials with $k-1$ continuous derivatives, so varying the degrees of the B-spline bases lets the model adapt systematically to jumps, sharp peaks, and smooth regions of the target function.

1 Introduction
Suppose we observe $n$ pairs of observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where
\[
y_i = \eta(x_i) + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \qquad i = 1, 2, \ldots, n, \tag{1}
\]
and $\eta$ is an unknown real-valued function mapping $\mathbb{R}$ to $\mathbb{R}$. We wish to estimate the mean function $\eta$, which may have varying degrees of smoothness, including discontinuities. In nonparametric function estimation, we often face curves that are smooth except for a finite number of jump discontinuities and sharp peaks, which are common in many climate and economic datasets. Heavy rainfall causes a sudden rise in the water level of a river. The COVID-19 epidemic brought about a sharp rise in unemployment rates. Policymakers' decisions can also give rise to abrupt changes. For instance, in 1984 the United States Congress passed the National Minimum Drinking Age Act, which had been debated for several decades in the United States, establishing 21 as the minimum legal alcohol purchase age. This act caused a sudden rise in mortality for young Americans around age 21. Such abrupt changes can carry meaningful information about these issues, and it is important to capture them.

There has been much research into the estimation of functions with locally varying smoothness. The first approach is to minimize a penalized sum of squares based on a locally varying smoothing parameter or penalty function across the whole domain. Pintore et al. (2006), Liu and Guo (2010), and Wang et al. (2013) modeled the smoothing parameter of the smoothing spline to vary over the domain. Ruppert and Carroll (2000), Crainiceanu et al. (2007), Krivobokova et al. (2008), and Yang and Hong (2017) suggested penalized splines based on a local penalty that adapts to spatial heterogeneity in the regression function. The second approach is adaptive free-knot splines, which choose the number and locations of the knots from the data. Friedman (1991) and Luo and Wahba (1997) determined a set of knots using stepwise forward/backward knot selection procedures. Zhou and Shen (2001) avoided the problems of stepwise schemes and proposed optimal knot selection schemes introducing a knot relocation step. Smith and Kohn (1996), Denison et al. (1998b), Denison et al. (1998a), and DiMatteo et al. (2001) studied Bayesian estimation of free-knot splines using MCMC techniques. The third approach is to use wavelet shrinkage estimators, including VisuShrink based on the universal threshold (Donoho and Johnstone; 1994), SureShrink based on Stein's unbiased risk estimator (SURE) (Donoho and Johnstone; 1995), Bayesian thresholding rules utilizing a mixture prior (Abramovich et al.; 1998), and empirical Bayes methods for level-dependent threshold selection (Johnstone and Silverman; 2005). The fourth approach is to detect jump discontinuities in the regression curve. Koo (1997), Lee (2002), and Yang and Song (2014) dealt with the estimation of discontinuous functions using B-spline basis functions. Qiu and Yandell (1998), Qiu (2003), Gijbels et al. (2007), and Xia and Qiu (2015) identified jumps based on local polynomial kernel estimation.

In this paper, we consider a function estimation method using overcomplete systems. A subset $\{\phi_j\}_{j \in J}$ of a Banach space $\mathcal{F}$ is called a complete system if for every $\eta \in \mathcal{F}$ and every $\epsilon > 0$ there exist coefficients $\beta_j \in \mathbb{R}$ such that
\[
\Big\| \eta - \sum_{j \in J} \beta_j \phi_j \Big\| < \epsilon,
\]
where the index set $J$ may be finite or countably infinite. Such a complete system is overcomplete if removal of a vector $\phi_j$ from the system does not alter the completeness.
In other words, an overcomplete system is constructed by adding basis functions to a complete basis (Lewicki and Sejnowski; 2000). The coefficients $\beta_j$ in the expansion of $\eta$ with an overcomplete system are not unique, owing to the redundancy intrinsic in the overcomplete system. This non-uniqueness can provide more parsimonious representations than those with a complete system (Simoncelli et al.; 1992).

The Lévy Adaptive Regression Kernels (LARK) model, first proposed by Tu (2006), is a Bayesian regression model utilizing overcomplete systems with Lévy process priors. Tu (2006) showed that the LARK model yields sparse representations of $\eta$ from an overcomplete system and improvements in nonparametric function estimation. Pillai et al. (2007) found the relationship between the LARK model and a reproducing kernel Hilbert space (RKHS), and Pillai (2008) proved the posterior consistency of the LARK model. Ouyang et al. (2008) extended the LARK method to classification problems. Chu et al. (2009) used continuous wavelets as the elements of an overcomplete system. Wolpert et al. (2011) obtained sufficient conditions for LARK models to lie in certain Besov or Sobolev spaces. Lee et al. (2020) devised an extended LARK model with multiple kernels instead of a single type of kernel.

In this paper, we develop a fully Bayesian approach with B-spline basis functions as the elements of an overcomplete system and call it the Lévy Adaptive B-Spline regression (LABS). The main contributions of this work can be summarized as follows.

1. The LABS model can systematically represent the smoothness of functions varying locally by changing the degrees of the B-spline bases. The shape of a B-spline basis depends on the locations of its knots and can be symmetric or asymmetric. The varying degrees of the B-spline bases enable the LABS model to adapt to the smoothness of functions.

2. We investigate two theoretical properties of the LABS model. First, the mean function of the LABS model lies in certain Besov spaces determined by the degrees of the B-spline bases. Second, the prior of the LABS model has full support on those Besov spaces. Thus, the proposed LABS model extends the range of smoothness classes of the mean function.

3. We provide empirical results demonstrating that our model performs well for spatially inhomogeneous functions, such as functions with jump discontinuities, sharp peaks, and smooth parts. The LABS model achieved the best results in almost every experiment compared with popular nonparametric function estimation methods. In particular, the LABS model showed remarkable performance in estimating functions with jump discontinuities and outperformed the other competing models.

The rest of the paper is organized as follows. In Section 2, we introduce the Lévy Adaptive Regression Kernels model and discuss its properties. In Section 3, we propose the LABS model and present an equivalent model with latent variables that makes the posterior computation tractable. We present three theorems showing that the function spaces for the proposed model depend on the degrees of the B-spline bases and that the prior has large support on certain function spaces. We describe the detailed algorithm for posterior sampling using reversible jump Markov chain Monte Carlo in Section 4. In Section 5, the LABS model is compared with other methods in two simulation studies, and in Section 6 three real-world data sets are analysed using the LABS model. In the last section, we discuss some improvements and possible extensions of the proposed model.

2 Lévy Adaptive Regression Kernels
In this section, we give a brief introduction to the LARK model. Let $\Omega$ be a complete separable metric space, and let $\nu$ be a positive measure on $\mathbb{R} \times \Omega$ with $\nu(\{0\} \times \Omega) = 0$ satisfying the integrability condition
\[
\int\!\!\int_{\mathbb{R} \times A} (1 \wedge |\beta|) \, \nu(d\beta, d\omega) < \infty \tag{2}
\]
for each compact set $A \subset \Omega$. The Lévy random measure $L$ with Lévy measure $\nu$ is defined as
\[
L(d\omega) = \int_{\mathbb{R}} \beta \, N(d\beta, d\omega),
\]
where $N$ is a Poisson random measure with intensity measure $\nu$. We denote $L \sim$ Lévy$(\nu)$. For any $t \in \mathbb{R}$, the characteristic function of $L(A)$ is
\[
E\big[e^{itL(A)}\big] = \exp\bigg\{ \int\!\!\int_{\mathbb{R} \times A} \big(e^{it\beta} - 1\big) \, \nu(d\beta, d\omega) \bigg\}, \quad \text{for all } A \subset \Omega. \tag{3}
\]
Let $g(x, \omega)$ be a real-valued function defined on $\mathcal{X} \times \Omega$, where $\mathcal{X}$ is another set.
By integrating $g$ with respect to a Lévy random measure $L$, we define a real-valued function on $\mathcal{X}$:
\[
\eta(x) \equiv L[g(x, \cdot)] = \int_{\Omega} g(x, \omega) \, L(d\omega) = \int_{\Omega}\!\int_{\mathbb{R}} g(x, \omega) \, \beta \, N(d\beta, d\omega), \quad \forall x \in \mathcal{X}. \tag{4}
\]
We call $g$ a generating function of $\eta$.

When $\nu(\mathbb{R} \times \Omega) = M$ is finite, the Lévy random measure can be represented as $L(d\omega) = \sum_{j \le J} \beta_j \delta_{\omega_j}(d\omega)$, where $J$ has a Poisson distribution with mean $M > 0$ and $\{(\beta_j, \omega_j)\} \overset{iid}{\sim} \pi(d\beta_j, d\omega_j) := \nu / M$, $j = 1, 2, \ldots, J$. In this case, equation (4) can be expressed as
\[
\eta(x) = \sum_{j=1}^{J} g(x, \omega_j) \, \beta_j, \tag{5}
\]
where $\{(\beta_j, \omega_j)\}$ is the random set of finite support points of the Poisson random measure. If $g$ is bounded, the integrability condition (2) implies the existence of (4) for all $x$; see Lee et al. (2020).

If a Lévy measure satisfying (2) is infinite, the number of support points of $N(\mathbb{R}, \Omega)$ is infinite almost surely. Tu (2006) proved that the truncated Lévy random field $L_\epsilon[g]$ converges in distribution to $L[g]$ as $\epsilon \to 0$, where
\[
L_\epsilon[g] = \int\!\!\int_{[-\epsilon, \epsilon]^c \times \Omega} g(x, \omega) \, \beta \, N(d\beta, d\omega) = \int\!\!\int_{\mathbb{R} \times \Omega} g(x, \omega) \, \beta \, N_\epsilon(d\beta, d\omega),
\]
and $N_\epsilon$ is a Poisson random measure with mean measure $\nu_\epsilon(d\beta, d\omega) := \nu(d\beta, d\omega) \, I_{|\beta| > \epsilon}$. This truncation is often used as an approximation of the posterior. For posterior computation methods for the Poisson random measure without truncation, see Lee (2007) and Lee and Kim (2004).

Together with the data generating mechanism (1), the LARK model is defined as follows:
\[
\begin{aligned}
E[Y \mid L, \theta] &= \eta(x) \equiv \int_{\Omega} g(x, \omega) \, L(d\omega) \\
L \mid \theta &\sim \text{Lévy}(\nu) \\
\theta &\sim \pi_\theta(d\theta),
\end{aligned}
\]
where Lévy$(\nu)$ denotes the Lévy process with characteristic function (3) and $\nu$ is a Lévy measure satisfying (2). Tu (2006) used gamma, symmetric gamma, and symmetric $\alpha$-stable (S$\alpha$S) ($0 < \alpha < 2$) Lévy random fields.
The conditional distribution of $Y$ has a hyperparameter $\theta$, and $\pi_\theta$ denotes the prior distribution of $\theta$. The generating function $g(x, \omega)$ provides the elements of an overcomplete system. Tu (2006) and Lee et al. (2020) used the Gaussian kernel, the Laplacian kernel, and the Haar wavelet as generating functions:

• Haar kernel: $g(x, \omega) := I\big(|(x - \chi)/\lambda| \le 1\big)$

• Gaussian kernel: $g(x, \omega) = \exp\big\{-\big((x - \chi)/\lambda\big)^2\big\}$

• Laplacian kernel: $g(x, \omega) = \exp\big\{-|x - \chi|/\lambda\big\}$

with $\omega := (\chi, \lambda) \in \mathbb{R} \times \mathbb{R}^+ := \Omega$. All of the above generating functions are bounded. The LARK model can be represented in a hierarchical structure as follows:
\[
\begin{aligned}
Y_i \mid \eta(x_i) &\overset{ind}{\sim} N(\eta(x_i), \sigma^2) \\
\eta(x_i) &= \sum_{j=1}^{J} g(x_i, \omega_j) \, \beta_j \\
J \mid \epsilon &\sim \text{Pois}\big(\nu_\epsilon(\mathbb{R}, \Omega)\big) \\
(\beta_j, \omega_j) \mid J, \epsilon &\overset{iid}{\sim} \pi(d\beta_j, d\omega_j) := \frac{\nu_\epsilon(d\beta_j, d\omega_j)}{\nu_\epsilon(\mathbb{R}, \Omega)}
\end{aligned}
\]
for $j = 1, \ldots, J$. Here $J$ is a random number determined stochastically by the Lévy random measure, $(\beta_1, \ldots, \beta_J)$ are the unknown coefficients of the mean function, and $(\omega_1, \ldots, \omega_J)$ are the parameters of the generating functions. To obtain samples from the posterior distribution under the LARK model, the reversible jump Markov chain Monte Carlo (RJMCMC) of Green (1995) is used, because some parameters have varying dimensions.

The LARK model stochastically extracts features and finds a compact representation of $\eta(\cdot)$ based on an overcomplete system. That is, it enables functions to be represented by a small number of elements from an overcomplete system. However, the LARK model, like most function estimation methods, uses only one type of kernel or basis and can capture only a restricted range of smoothness of the target function. Such models cannot capture all parts of a function with varying degrees of smoothness. As an example, consider a noisy modified Heavisine function sampled at $n = 512$ equally spaced points on $[0, 1]$ in Figure 1.
The data contains both smooth and non-smooth regions, such as peaks and jumps. As shown in panel (a) of Figure 1, it is difficult for the LARK model with a finite Lévy measure and a Gaussian kernel to estimate the jump parts of the data. We therefore propose a new model that can adapt to the smoothness of a function systematically by using a variety of B-spline bases as the generating elements of an overcomplete system.

Figure 1: Comparison of curve fitting with (a) the LARK and (b) the LABS model for the modified Heavisine dataset. The solid lines are the estimated functions and the dashed line represents the true function.
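To make the finite-Lévy-measure case concrete, the following minimal sketch draws one mean function from a LARK-type prior via the compound Poisson representation (5), using the kernels listed above. The distributions for $(\chi, \lambda)$, the value of $M$, and all function names are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Minimal sketch of a draw from a LARK prior with a *finite* Levy measure,
# following equation (5): eta(x) = sum_j beta_j g(x, omega_j) with
# J ~ Poisson(M) and (beta_j, omega_j) drawn iid from nu/M.

rng = np.random.default_rng(0)

def gauss_kernel(x, chi, lam):
    # Gaussian generating function g(x, (chi, lam))
    return np.exp(-((x - chi) / lam) ** 2)

def laplace_kernel(x, chi, lam):
    return np.exp(-np.abs(x - chi) / lam)

def haar_kernel(x, chi, lam):
    return (np.abs((x - chi) / lam) <= 1.0).astype(float)

def draw_lark_prior(x, M=20.0, kernel=gauss_kernel):
    """One draw of eta(x) = sum_j beta_j g(x, omega_j), omega_j = (chi_j, lam_j)."""
    J = rng.poisson(M)                       # number of support points
    if J == 0:
        return np.zeros_like(x)
    chi = rng.uniform(x.min(), x.max(), J)   # kernel locations
    lam = rng.uniform(0.01, 0.2, J)          # kernel scales
    beta = rng.normal(0.0, 1.0, J)           # jump heights
    return sum(b * kernel(x, c, l) for b, c, l in zip(beta, chi, lam))

x = np.linspace(0.0, 1.0, 512)
eta = draw_lark_prior(x)
```

Drawing such prior realizations with different kernels makes the limitation visible: a Gaussian-kernel LARK prior produces only smooth sample paths, whereas the Haar kernel produces only piecewise constant ones.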
3 Lévy Adaptive B-Spline Regression

We consider a general type of basis function as the generating elements of an overcomplete system, instead of specific kernel functions such as the Haar, Laplacian, and Gaussian kernels. The LABS model uses B-spline basis functions, which can systematically express jumps, sharp peaks, and smooth parts of a function.

3.1 B-spline basis
The B-spline basis function consists of piecewise degree $k$ polynomials with $k - 1$ continuous derivatives. The B-spline basis of degree $k$ can be derived using the Cox-de Boor recursion formula:
\[
\begin{aligned}
B^*_{0,i}(x) &:= \begin{cases} 1, & t_i \le x < t_{i+1}, \\ 0, & \text{otherwise}, \end{cases} \\
B^*_{k,i}(x) &:= \frac{x - t_i}{t_{i+k} - t_i} B^*_{k-1,i}(x) + \frac{t_{i+k+1} - x}{t_{i+k+1} - t_{i+1}} B^*_{k-1,i+1}(x), \tag{6}
\end{aligned}
\]
where the $t_i$ are called knots and must be in non-descending order, $t_i \le t_{i+1}$ (De Boor; 1972), (Cox; 1972). The B-spline basis of degree $k$, $B^*_{k,i}(x)$, thus needs $(k+2)$ knots, $(t_i, \ldots, t_{i+k+1})$. For convenience of notation, we redefine the B-spline basis of degree $k$ with a knot sequence $\boldsymbol{\xi}_k := (\xi_{k,1}, \ldots, \xi_{k,(k+2)})$ as follows:
\[
\begin{aligned}
B_0(x; \boldsymbol{\xi}_0) &:= \begin{cases} 1, & \xi_{0,1} \le x < \xi_{0,2}, \\ 0, & \text{otherwise}, \end{cases} \\
B_k(x; \boldsymbol{\xi}_k) &:= \frac{x - \xi_{k,1}}{\xi_{k,(k+1)} - \xi_{k,1}} B_{k-1}(x; \boldsymbol{\xi}^\star_k) + \frac{\xi_{k,(k+2)} - x}{\xi_{k,(k+2)} - \xi_{k,2}} B_{k-1}(x; \boldsymbol{\xi}^{\star\star}_k), \tag{7}
\end{aligned}
\]
where $\boldsymbol{\xi}^\star_k := (\xi_{k,1}, \xi_{k,2}, \ldots, \xi_{k,(k+1)})$ and $\boldsymbol{\xi}^{\star\star}_k := (\xi_{k,2}, \xi_{k,3}, \ldots, \xi_{k,(k+2)})$.

The B-spline basis functions can take a variety of shapes and degrees of smoothness, determined by the knot locations and the degree. For example, a B-spline basis function can be a piecewise constant ($k = 0$), linear ($k = 1$), quadratic ($k = 2$), or cubic ($k = 3$) function. Furthermore, a B-spline basis with equally spaced knots has a symmetric form; such bases are called uniform B-splines. Examples of B-spline basis functions of different degrees with equally spaced knots are shown in Figure 2.

Figure 2: Different shapes of the B-spline basis function as the degree $k$ increases.
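The recursion (7), and the random-sum mean function used by LABS (equation (8) in the next subsection), can be seen in code with a short sketch. `bspline_basis` is a direct transcription of the Cox-de Boor recursion; the hyperparameter values and the final loop drawing a random mean function are illustrative assumptions, not the paper's implementation (production code would typically use `scipy.interpolate.BSpline`).

```python
import numpy as np

def bspline_basis(x, xi, k):
    """Evaluate B_k(x; xi) via the Cox-de Boor recursion (7); xi has k+2 knots."""
    xi = np.asarray(xi, dtype=float)
    assert len(xi) == k + 2, "a degree-k basis needs k+2 knots"
    if k == 0:
        # piecewise constant: indicator of [xi_1, xi_2)
        return ((xi[0] <= x) & (x < xi[1])).astype(float)
    # first term uses the first k+1 knots, second term the last k+1 knots
    left = (x - xi[0]) / (xi[k] - xi[0]) * bspline_basis(x, xi[:-1], k - 1)
    right = (xi[k + 1] - x) / (xi[k + 1] - xi[1]) * bspline_basis(x, xi[1:], k - 1)
    return left + right

# Random mean function in the spirit of (8): a Poisson number of bases per
# degree k in S, each with uniformly drawn (then sorted) knots and a
# Gaussian coefficient.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 512)
S, M_k, phi_k = [0, 1, 2], 10.0, 1.0
eta = np.zeros_like(x)
for k in S:
    for _ in range(rng.poisson(M_k)):
        knots = np.sort(rng.uniform(0.0, 1.0, k + 2))   # ordered knot set
        eta += rng.normal(0.0, phi_k) * bspline_basis(x, knots, k)
```

Running this for $S = \{0\}$, $\{1\}$, and $\{2\}$ separately shows how the degree controls the realized smoothness: pure jumps, piecewise linear peaks, and smooth bumps, respectively.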
3.2 Model Specification

The LARK model with a single type of kernel cannot estimate well functions with both continuous and discontinuous parts. To improve on this, we consider various B-spline basis functions simultaneously for estimating all parts of the unknown function. The new model uses B-spline bases to systematically generate an overcomplete system with varying degrees of smoothness. For example, the B-spline basis functions of degree 0, degree 1, and degree 2 or more are suited to jumps, sharp peaks, and smooth parts of the function, respectively.

We consider a mean function expressed as a random finite sum:
\[
\eta(x) = \sum_{k \in S} \sum_{1 \le l \le J_k} B_k(x; \boldsymbol{\xi}_{k,l}) \, \beta_{k,l}, \tag{8}
\]
where $S$ denotes the set of degree numbers of the B-spline bases and $B_k(x; \boldsymbol{\xi}_k)$ is a B-spline basis of degree $k$ with knots $\boldsymbol{\xi}_k \in \mathcal{X}^{(k+2)} := \Omega$. The generating functions of the LARK model are replaced by the B-spline basis functions. $J_k$ has a Poisson distribution with mean $M_k > 0$, and $\{(\beta_{k,l}, \boldsymbol{\xi}_{k,l})\} \overset{iid}{\sim} \pi_k(d\beta_k, d\boldsymbol{\xi}_k) := \nu_k(d\beta_k, d\boldsymbol{\xi}_k) / \nu_k(\mathbb{R} \times \Omega)$. In this paper, we assume $\pi_k(d\beta_k, d\boldsymbol{\xi}_k) = N(\beta_k; 0, \phi_k^2) \, d\beta_k \cdot U(\boldsymbol{\xi}_k; \mathcal{X}^{(k+2)}) \, d\boldsymbol{\xi}_k$. The mean function can also be defined as
\[
\eta(x) \equiv \sum_{k \in S} \int_{\Omega} B_k(x; \boldsymbol{\xi}_k) \, L_k(d\boldsymbol{\xi}_k). \tag{9}
\]
The stochastic integral representation of the mean function is determined by
\[
L_k \sim \text{Lévy}\big(\nu_k(d\beta_k, d\boldsymbol{\xi}_k)\big), \quad \forall k \in S,
\]
where $\nu_k(d\beta_k, d\boldsymbol{\xi}_k)$ is a finite Lévy measure satisfying $M_k \equiv \nu_k(\mathbb{R} \times \Omega) < \infty$. Although a Lévy measure $\nu_k$ satisfying (2) may be infinite, the Poisson integrals and sums above are well defined for all bounded, measurable, compactly supported $B_k(\cdot, \cdot)$ for which, for all $k \in S$,
\[
\int\!\!\int_{\mathbb{R} \times \Omega} \big(1 \wedge |\beta_k B_k(\cdot; \boldsymbol{\xi}_k)|\big) \, \nu_k(d\beta_k, d\boldsymbol{\xi}_k) < \infty. \tag{10}
\]
In this paper, we consider only finite Lévy measures in the proposed model; in other words, we restrict our attention to the Lévy measure of a compound Poisson process.

The new model is more complex than the LARK model with one kernel and is expected to give a more accurate estimate of the regression function; it can estimate a mean function having both smooth parts and peaks. The proposed model can be written in hierarchical form as
\[
\begin{aligned}
Y_i \mid x_i &\overset{ind}{\sim} N(\eta(x_i), \sigma^2), \quad i = 1, 2, \cdots, n, \\
\eta(x) &= \beta_0 + \sum_{k \in S} \sum_{1 \le l \le J_k} B_k(x; \boldsymbol{\xi}_{k,l}) \, \beta_{k,l}, \\
\sigma^2 &\sim \mathrm{IG}\Big(\frac{r}{2}, \frac{rR}{2}\Big), \\
J_k &\sim \mathrm{Poi}(M_k), \\
M_k &\sim \mathrm{Ga}(a_{\gamma_k}, b_{\gamma_k}), \\
\beta_{k,l} &\overset{iid}{\sim} N(0, \phi_k^2), \quad l = 1, 2, \cdots, J_k, \\
\boldsymbol{\xi}_{k,l} &\overset{iid}{\sim} U(\mathcal{X}^{(k+2)}), \quad l = 1, 2, \cdots, J_k, \tag{11}
\end{aligned}
\]
for $k \in S$. We set $\beta_0 = \bar{Y}$ and $\phi_k$ proportional to the range of the responses, $\max_i\{Y_i\} - \min_i\{Y_i\}$.

3.3 Support of the LABS model

In this section, we present three theorems on the support of the LABS model. We first define the modulus of smoothness and Besov spaces.
Definition 1
Let $0 < p \le \infty$ and $t > 0$. For $f \in L_p(\mathcal{X})$, the $r$th order modulus of smoothness of $f$ is defined by
\[
\omega_r(f, t)_p := \sup_{0 < h \le t} \|\Delta_h^r f\|_p,
\]
where $\Delta_h^r f(x) = \sum_{k=0}^{r} \binom{r}{k} (-1)^{r-k} f(x + kh)$ for $x \in \mathcal{X}$ and $x + kh \in \mathcal{X}$.

If $r = 1$, $\omega_1(f, t)_p$ is the modulus of continuity. There exist several equivalent definitions of Besov spaces; we follow DeVore and Lorentz (1993)[2.10, page 54].

Definition 2
Let $\alpha > 0$ be given and let $r$ be the smallest integer such that $r > \alpha$. For $0 < p, q < \infty$, the Besov space $B^\alpha_{p,q}$ is the collection of all functions $f \in L_p(\mathcal{X})$ such that
\[
|f|_{B^\alpha_{p,q}} = \bigg( \int_0^\infty \big[ t^{-\alpha} \omega_r(f, t)_p \big]^q \, \frac{dt}{t} \bigg)^{1/q}
\]
is finite. The norm on $B^\alpha_{p,q}$ is defined as $\|f\|_{B^\alpha_{p,q}} = \|f\|_p + |f|_{B^\alpha_{p,q}}$.

Besov spaces form a broad scale of smoothness classes within $L_p(\mathcal{X})$ and, in particular, can accommodate spatially inhomogeneous functions, including spikes and jumps. A Besov space has three parameters, $\alpha$, $p$, and $q$, where $\alpha$ is the degree of smoothness, $p$ indicates that $L_p(\mathcal{X})$ is the space in which smoothness is measured, and $q$ is a parameter for finer tuning of the degree of smoothness.
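To fix ideas, here is a worked computation for the simplest discontinuous case; it previews the Case 1/Case 2 calculation in Appendix A.1. For the indicator $f = \mathbf{1}_{[a,b)}$ and $0 < h < b - a$, the difference $f(x+h) - f(x)$ equals $\pm 1$ on two intervals of total length $2h$ and is zero elsewhere, so
\[
\|\Delta_h^1 f\|_p = (2h)^{1/p},
\qquad
\omega_1(f, t)_p = \big( 2 \min(t, \, b - a) \big)^{1/p}.
\]
Hence $t^{-\alpha} \omega_1(f, t)_p \asymp t^{1/p - \alpha}$ as $t \to 0$, and the integral defining $|f|_{B^\alpha_{p,q}}$ converges exactly when $\alpha < 1/p$: a jump function has Besov smoothness just below $1/p$, as Theorem 1 below asserts for $k = 0$.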
Theorem 1

For fixed $k \in S$ and $\boldsymbol{\xi}_k \in \mathcal{X}^{(k+2)}$, the B-spline basis $B_k(x; \boldsymbol{\xi}_k)$ falls in $B^\alpha_{p,q}(\mathcal{X})$ for all $1 \le p, q < \infty$ and $\alpha < k + 1/p$.

The proof is given in Appendix A. For instance, the B-spline basis of degree 0 satisfies $B_0(\cdot; \boldsymbol{\xi}_0) \in B^\alpha_{p,q}$ for $\alpha < 1/p$, the B-spline basis of degree 1 is in $B^\alpha_{p,q}$ for $\alpha < 1 + 1/p$, and the B-spline basis of degree 2 falls in $B^\alpha_{p,q}$ for $\alpha < 2 + 1/p$.

The following theorem states that the mean function $\eta$ of the LABS model lies in a Besov space whose smoothness corresponds to the degrees of the B-spline bases used in the model. The proof closely follows that of Wolpert et al. (2011) and is given in Appendix A.

Theorem 2
Suppose $\mathcal{X}$ is a compact subset of $\mathbb{R}$. Let $\nu_k$ be a Lévy measure on $\mathbb{R} \times \mathcal{X}^{(k+2)}$ that satisfies the integrability condition
\[
\int\!\!\int_{\mathbb{R} \times \mathcal{X}^{(k+2)}} (1 \wedge |\beta_k|) \, \nu_k(d\beta_k, d\boldsymbol{\xi}_k) < \infty, \tag{12}
\]
and let $L_k \sim$ Lévy$(\nu_k)$ for all $k \in S$. Define the mean function of the LABS model, $\eta(\cdot) = \sum_{k \in S} \int_{\mathcal{X}^{(k+2)}} B_k(x; \boldsymbol{\xi}_k) \, L_k(d\boldsymbol{\xi}_k)$ on $\mathcal{X}$, where $B_k(x; \boldsymbol{\xi}_k)$ satisfies (12) for each fixed $x \in \mathcal{X}$. Then $\eta$ has the convergent series
\[
\eta(x) = \sum_{k \in S} \sum_{l} B_k(x; \boldsymbol{\xi}_{k,l}) \, \beta_{k,l}, \tag{13}
\]
where $S$ is a finite set of degree numbers of B-spline bases. Furthermore, $\eta$ lies in the Besov space $B^\alpha_{p,q}(\mathcal{X})$ with $\alpha < \min(S) + 1/p$ almost surely.

For example, if the element 0 is included in $S$, then the mean function $\eta$ of the LABS model falls in $B^\alpha_{p,q}$ with $\alpha < 1/p$ almost surely, a space containing functions that are no longer continuous. If $S = \{0, 1, 2\}$, then $\eta$ likewise belongs to $B^\alpha_{p,q}$ with $\alpha < 1/p$ almost surely. Moreover, the function spaces for the LABS model can be larger than those of the LARK model using one type of kernel function. Specifically, the mean function for the LABS model with $S = \{0, 1\}$ falls in $B^\alpha_{p,p}$ with $\alpha < 1/p$ almost surely. If that of the LARK model using only the Laplacian kernel falls in $B^\alpha_{p,p}$ with $\alpha < 1 + 1/p$, then the function spaces of the LABS model are larger than those of the LARK model over the range of smoothness parameters $1/p \le \alpha < 1 + 1/p$, by the embedding properties of Besov spaces.

The next theorem shows that the prior distribution of our LABS model has sufficiently large support on the Besov space $B^\alpha_{p,q}$ with $1 \le p, q < \infty$ and $\alpha > 0$.
For $\eta_0 \in B^\alpha_{p,q}(\mathcal{X})$, denote by $\bar{b}_\delta(\eta_0) = \{\eta : \|\eta - \eta_0\|_p < \delta\}$ the ball of radius $\delta$ around $\eta_0$, where $\|\cdot\|_p$ is the $L_p$ norm. The proof of Theorem 3 is given in Appendix A.

Theorem 3
Let $\mathcal{X}$ be a bounded domain in $\mathbb{R}$. Let $\nu_k$ be a finite measure on $\mathbb{R} \times \mathcal{X}^{(k+2)}$ and $L_k \sim$ Lévy$(\nu_k)$ for all $k \in S$. Suppose $\eta$ has the prior $\Pi$ of the LABS model (11). Then $\Pi(\bar{b}_\delta(\eta_0)) > 0$ for every $\eta_0 \in B^\alpha_{p,q}(\mathcal{X})$ and all $\delta > 0$.

4 Posterior Computation

Based on the prior specifications and the likelihood function, the joint posterior distribution of the LABS model (11) is
\[
\begin{aligned}
[\boldsymbol{\beta}, \boldsymbol{\xi}, \boldsymbol{J}, \boldsymbol{M}, \sigma^2 \mid \boldsymbol{Y}]
&\propto [\boldsymbol{Y} \mid \eta, \sigma^2] \times [\boldsymbol{\beta}, \boldsymbol{\xi} \mid \boldsymbol{J}] \times [\boldsymbol{J} \mid \boldsymbol{M}] \times [\boldsymbol{M}] \times [\sigma^2] \\
&\propto \bigg[ (\sigma^2)^{-n/2} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \Big( Y_i - \beta_0 - \sum_{k \in S} \sum_{l=1}^{J_k} B_k(x_i; \boldsymbol{\xi}_{k,l}) \beta_{k,l} \Big)^{2} \Big\} \bigg]
\times \prod_{k \in S} \bigg[ \exp\Big\{ -\frac{1}{2\phi_k^2} \sum_{l=1}^{J_k} \beta_{k,l}^2 \Big\} \bigg] \\
&\quad \times \prod_{k \in S} \bigg[ |\mathcal{X}^{(k+2)}|^{-J_k} \prod_{l=1}^{J_k} I\big(\boldsymbol{\xi}_{k,l} \in \mathcal{X}^{(k+2)}\big) \bigg]
\times \prod_{k \in S} \bigg[ \frac{M_k^{J_k}}{J_k!} \exp\{-M_k\} \bigg]
\times \prod_{k \in S} \Big[ M_k^{a_{\gamma_k} - 1} \exp\{-b_{\gamma_k} M_k\} \Big]
\times \bigg[ (\sigma^2)^{-\frac{r}{2}-1} \exp\Big\{ -\frac{rR}{2\sigma^2} \Big\} \bigg]. \tag{14}
\end{aligned}
\]
The parameters $\boldsymbol{\beta}$ and $\boldsymbol{\xi}$ of the LABS model have varying dimensions, as the $J_k$ are random variables. We use the reversible jump Markov chain Monte Carlo (RJMCMC) algorithm (Green; 1995) for the posterior computation.

We consider three transitions in the generation of posterior samples: (a) the addition of basis functions and coefficients; (b) the deletion of basis functions and coefficients; (c) the relocation of knots, which affects the shapes of basis functions and the coefficients. Note that in step (c) the numbers of basis functions and coefficients do not change. We call these move types the birth step, death step, and relocation step, respectively. A move type is chosen with probabilities $p_b$, $p_d$, and $p_w$, with $p_b + p_d + p_w = 1$, where $p_b$, $p_d$, and $p_w$ are the probabilities of choosing the birth, death, and relocation steps, respectively.

Let $\theta_{k,l} = (\beta_{k,l}, \boldsymbol{\xi}_{k,l})$ denote an element of $\boldsymbol{\theta}_k = \{\theta_{k,1}, \theta_{k,2}, \ldots, \theta_{k,j}, \ldots, \theta_{k,J_k}\}$, where each $\boldsymbol{\xi}_{k,l}$ has $(k+2)$ dimensions. In general, the acceptance ratio of the RJMCMC can be expressed as
\[
A = \min\big[1, \ \text{(likelihood ratio)} \times \text{(prior ratio)} \times \text{(proposal ratio)} \times \text{(Jacobian)}\big].
\]
In our problem the acceptance ratio for each move type is given by
\[
A = \min\bigg[1, \ \frac{L(\boldsymbol{Y} \mid \boldsymbol{\theta}'_k, J'_k) \, \Pi(\boldsymbol{\theta}'_k \mid J'_k) \, \Pi(J'_k) \, q(\boldsymbol{\theta}_k \mid \boldsymbol{\theta}'_k)}{L(\boldsymbol{Y} \mid \boldsymbol{\theta}_k, J_k) \, \Pi(\boldsymbol{\theta}_k \mid J_k) \, \Pi(J_k) \, q(\boldsymbol{\theta}'_k \mid \boldsymbol{\theta}_k)} \bigg], \tag{15}
\]
where $\boldsymbol{\theta}_k$ and $J_k$ refer to the current model parameters and the number of basis functions in the current state, and $\boldsymbol{\theta}'_k$ and $J'_k$ denote the proposed model parameters and the number of basis functions in the new state. Here the Jacobian is 1 for all move types, and $q(\boldsymbol{\theta}'_k \mid \boldsymbol{\theta}_k)$ is the jump proposal distribution that proposes a new state $\boldsymbol{\theta}'_k$ given a current state $\boldsymbol{\theta}_k$. Specifically, we choose the following jump proposal densities, proposed by Lee et al. (2020), for the respective move steps:
\[
\begin{aligned}
q_b(\boldsymbol{\theta}'_k \mid \boldsymbol{\theta}_k) &= p_b \times b(\theta_{k,J_k+1}) \times \frac{1}{J_k + 1}, \\
q_d(\boldsymbol{\theta}'_k \mid \boldsymbol{\theta}_k) &= p_d \times \frac{1}{J_k}, \\
q_w(\boldsymbol{\theta}'_k \mid \boldsymbol{\theta}_k) &= p_w \times q(\theta'_{k,r} \mid \theta_{k,r}),
\end{aligned}
\]
where $b(\cdot)$ is a candidate distribution that proposes a new element. For the death and relocation steps, a randomly chosen $r$th element of $\boldsymbol{\theta}_k$ is deleted or moved, respectively. The details of the updating schemes for each move step are as follows.

(a) [Birth step] Assume that the current model is composed of $J_k$ basis functions.
If the birth step is selected, a new basis function $B_{k,J_k+1}$ and parameter $\theta_{k,J_k+1}$ are accepted with acceptance ratio
\[
\min\bigg[1, \ \frac{L(\boldsymbol{Y} \mid \boldsymbol{\theta}'_k, J'_k)}{L(\boldsymbol{Y} \mid \boldsymbol{\theta}_k, J_k)} \times \frac{\pi(\theta_{k,J_k+1}) \, M_k}{J_k + 1} \times \frac{p_d / (J_k + 1)}{\big(p_b \times b(\theta_{k,J_k+1})\big) / (J_k + 1)} \bigg].
\]
Here $\beta_{k,J_k+1}$ and an ordered knot set $\boldsymbol{\xi}_{k,J_k+1}$ are drawn from their generating distributions and appended to $(\beta_{k,1}, \ldots, \beta_{k,J_k})$ and $(\boldsymbol{\xi}_{k,1}, \ldots, \boldsymbol{\xi}_{k,J_k})$. When $J_k = 0$, the birth step must be selected exceptionally until $J_k$ becomes one.

(b) [Death step] If the death step is selected, a uniformly chosen $r$th element $\theta_{k,r}$ is removed from the existing set of basis functions, coefficients, and ordered knot sets. The acceptance ratio for the death step is found similarly and is given by
\[
\min\bigg[1, \ \frac{L(\boldsymbol{Y} \mid \boldsymbol{\theta}'_k, J'_k)}{L(\boldsymbol{Y} \mid \boldsymbol{\theta}_k, J_k)} \times \frac{J_k}{\pi(\theta_{k,r}) \, M_k} \times \frac{\big(b(\theta_{k,r}) \times p_b\big) / J_k}{p_d / J_k} \bigg].
\]

(c) [Relocation step] Unlike the other steps, the relocation step keeps the numbers of basis functions, coefficients, and ordered knot sets fixed; the updating scheme for this step is therefore a Metropolis-Hastings within Gibbs sampler. If the relocation step is selected, a current element $\theta_{k,r}$ is moved to a new value $\theta'_{k,r}$ generated by a proposal distribution, with acceptance ratio (16). In particular, since the knots of a basis function must be in non-descending order, $\xi_{k,r,i}$, the $i$th element of an ordered knot set, is sequentially replaced with a new knot location $\xi'_{k,r,i}$ generated from $U(\xi_{k,r,i-1}, \xi_{k,r,i+1})$, $i = 1, \ldots, (k+2)$, where the endpoints for the first and last knots are taken to be the boundary points of $\mathcal{X}$. That is, each element of a given knot set $\boldsymbol{\xi}_{k,r} = (\xi_{k,r,1}, \ldots, \xi_{k,r,(k+2)})$ is moved in turn to a new knot location, yielding $\boldsymbol{\xi}'_{k,r} = (\xi'_{k,r,1}, \ldots, \xi'_{k,r,(k+2)})$. The corresponding acceptance ratio is given by
\[
\min\bigg[1, \ \frac{L(\boldsymbol{Y} \mid \boldsymbol{\theta}'_k, J'_k)}{L(\boldsymbol{Y} \mid \boldsymbol{\theta}_k, J_k)} \times \frac{\pi(\theta'_{k,r})}{\pi(\theta_{k,r})} \times \frac{q_w(\theta_{k,r} \mid \theta'_{k,r})}{q_w(\theta'_{k,r} \mid \theta_{k,r})} \bigg]. \tag{16}
\]
When using an independent proposal distribution (i.e., $q_w(\theta'_{k,r} \mid \theta_{k,r}) = \pi(\theta'_{k,r})$), the acceptance ratio reduces to
\[
\min\bigg[1, \ \frac{L(\boldsymbol{Y} \mid \boldsymbol{\theta}'_k, J'_k)}{L(\boldsymbol{Y} \mid \boldsymbol{\theta}_k, J_k)} \bigg].
\]
Finally, $\beta'_{k,r}$ is sampled from its conditional posterior distribution by Gibbs sampling.

The posterior samples of $\sigma^2$ and $M_k$ can be generated from their conditional posterior distributions; see Appendix C. The pseudo-code for the proposed strategy is given in Algorithm 1.

Algorithm 1 A reversible jump MCMC algorithm for LABS
  procedure LABS(S)                                     ▷ S: set of degree numbers
    Initialize parameters J, β, ξ, M, σ² from the prior distributions.
    for iteration i = 1 to N do
      for k = 1 to |S| do                               ▷ J := {J_1, ..., J_k, ..., J_|S|}
        Update (J_k, β_k, ξ_k) through a reversible jump MCMC step.
        Sample M_k from the full conditional π(M_k | others).        ▷ Gibbs step
      end for
      Sample σ² from the full conditional π(σ² | β, ξ, J, M, y).     ▷ Gibbs step
      Store the i-th MCMC samples.
    end for
  end procedure
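The scheme above can be condensed into a short sketch. The code below implements the birth/death/relocation moves for a single degree-0 component set, with $\sigma^2$ and $\beta_0$ held fixed for brevity (the paper additionally samples $\sigma^2$ and $M_k$ by Gibbs steps). Proposing new elements from the prior makes the birth ratio collapse to $M p_d / ((J+1) p_b)$, mirroring the displayed acceptance ratios; all names and settings are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

def loglik(y, fit, sigma2):
    # Gaussian log-likelihood up to an additive constant
    return -0.5 * np.sum((y - fit) ** 2) / sigma2

def rjmcmc_labs_degree0(x, y, n_iter=5000, M=10.0, phi2=1.0, sigma2=0.1,
                        p_b=0.4, p_d=0.4):
    """Birth/death/relocation sampler for degree-0 (piecewise constant) bases."""
    beta0 = y.mean()                                    # beta_0 = Ybar, as in (11)
    comps = []                                          # list of (beta, (xi1, xi2))
    def basis(knots):
        return ((knots[0] <= x) & (x < knots[1])).astype(float)
    def curve():
        return beta0 + sum((b * basis(k) for b, k in comps), np.zeros_like(x))
    draws = []
    for _ in range(n_iter):
        J, ll = len(comps), loglik(y, curve(), sigma2)
        u = rng.uniform()
        if u < p_b or J == 0:                           # birth (forced when J == 0,
            theta = (rng.normal(0.0, np.sqrt(phi2)),    #  a simplification there)
                     np.sort(rng.uniform(0, 1, 2)))
            comps.append(theta)
            # prior proposal: ratio reduces to M * p_d / ((J + 1) * p_b)
            log_a = (loglik(y, curve(), sigma2) - ll
                     + np.log(M * p_d) - np.log((J + 1) * p_b))
            if np.log(rng.uniform()) >= log_a:
                comps.pop()                             # reject
        elif u < p_b + p_d:                             # death of a uniform pick
            r = int(rng.integers(J))
            theta = comps.pop(r)
            log_a = (loglik(y, curve(), sigma2) - ll
                     + np.log(J * p_b) - np.log(M * p_d))
            if np.log(rng.uniform()) >= log_a:
                comps.insert(r, theta)                  # reject
        else:                                           # relocation + Gibbs for beta
            r = int(rng.integers(J))
            b_old, k_old = comps[r]
            comps[r] = (b_old, np.sort(rng.uniform(0, 1, 2)))  # prior proposal
            if np.log(rng.uniform()) >= loglik(y, curve(), sigma2) - ll:
                comps[r] = (b_old, k_old)               # reject: ratio is just the LR
            b_cur, k_cur = comps[r]
            B = basis(k_cur)
            resid = y - (curve() - b_cur * B)           # data minus all other terms
            v = 1.0 / (1.0 / phi2 + B @ B / sigma2)     # conjugate normal update
            comps[r] = (rng.normal(v * (B @ resid) / sigma2, np.sqrt(v)), k_cur)
        draws.append(curve())
    return np.mean(draws[len(draws) // 2:], axis=0)     # posterior mean curve
```

With knots proposed independently from the prior, the relocation acceptance reduces to the bare likelihood ratio, exactly as noted after (16); the full LABS sampler simply runs one such update per degree $k \in S$ inside each sweep of Algorithm 1.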
5 Simulation Studies

In this section, we evaluate the performance of the LABS model (11) and competing methods on simulated data sets. First, we apply the proposed method to four standard examples: the Bumps, Blocks, Doppler, and Heavisine test functions introduced by Donoho and Johnstone (1994). Second, we consider three functions with jumps and peaks that we constructed ourselves to assess the practical performance of the proposed model.

The simulated data sets are generated from equally spaced $x$'s on $\mathcal{X} = [0, 1]$ with sample sizes $n = 128$ and $512$.
Independent normally distributed noise $N(0, \sigma^2)$ is added to the true function $\eta(\cdot)$. The root signal-to-noise ratio (RSNR) is defined as
\[
\mathrm{RSNR} := \frac{ \sqrt{ \frac{1}{|\mathcal{X}|} \int_{\mathcal{X}} \big( f(x) - \bar{f} \big)^2 \, dx } }{ \sigma },
\qquad \bar{f} := \frac{1}{|\mathcal{X}|} \int_{\mathcal{X}} f(x) \, dx,
\]
and is set to 3, 5, and 10. We run the LABS model for 200,000 iterations, discard the first 100,000 iterations as burn-in, and retain every 10th sample. For comparison between the methods, we compute the mean squared error of each method over 100 replicate data sets for each test function,
\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big( \eta(x_i) - \hat{\eta}(x_i) \big)^2.
\]
The average of the posterior curves is used as the estimate of the test function.
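A minimal sketch of this simulation protocol: calibrate $\sigma$ to a target RSNR, then score a fit by MSE. The discrete standard deviation over the grid stands in for the integral definition above, and `eta` is any test function; the names are ours.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_dataset(eta, n=512, rsnr=3.0):
    """Simulate y = eta(x) + noise with noise scaled to a target RSNR."""
    x = np.linspace(0.0, 1.0, n)
    f = eta(x)
    sigma = f.std() / rsnr          # discrete version of sd(f) / RSNR
    y = f + rng.normal(0.0, sigma, n)
    return x, y, sigma

def mse(eta_true, eta_hat):
    return np.mean((eta_true - eta_hat) ** 2)

# example usage with a smooth stand-in for a test function
x, y, sigma = make_dataset(lambda t: np.sin(4 * np.pi * t), n=512, rsnr=3.0)
```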
5.1 Simulation 1: DJ test functions

We carry out a simulation study using the benchmark test functions suggested by Donoho and Johnstone (1994), which are often used in the fields of wavelets and nonparametric function estimation. The Donoho and Johnstone test functions consist of four functions called Bumps, Blocks, Doppler, and Heavisine. These test functions exhibit various features, such as sharp peaks (Bumps), jump discontinuities (Blocks), oscillating behavior (Doppler), and jumps/peaks in a smooth function (Heavisine); see Figure 3.

Figure 3: The Donoho and Johnstone test functions: (a) Bumps, (b) Blocks, (c) Doppler and (d) Heavisine.

The hyperparameters and types of basis functions displayed in Table 1 were used in (11). For Bumps and Doppler, the parameter $r$ of the prior distribution for $\sigma^2$ was set to 100 to speed up convergence. We also chose the combinations of B-spline bases according to the shapes of the test functions.

[Table 1: Sets of degrees $S$ and hyperparameters $r$, $R$, $a_{\gamma_k}$, $b_{\gamma_k}$ for each DJ test function. Most entries were lost in extraction; the recoverable values include $r = 100$, $R = 0.01$, $a_{\gamma_k} = b_{\gamma_k} = 1$ for Bumps and Blocks.]

We compared our model with a variety of methods: a B-spline curve of degree 2 with 50 knots (denoted BSP-2); local polynomial regression with automatic smoothing parameter selection (LOESS); a smoothing spline with smoothing parameter selected by cross-validation (SS); Nadaraya-Watson kernel regression using the Gaussian kernel with the bandwidth $h$ that minimizes the CV error (NWK); the empirical Bayes approach to wavelet shrinkage using a Laplace prior with Daubechies "least asymmetric" (la8) wavelets, except for the Blocks example, where it uses the Haar wavelet (Johnstone and Silverman; 2005) (EBW); Gaussian process regression with the radial basis or Laplacian kernel (GP-R or GP-L); Bayesian curve fitting using piecewise polynomials (BPP-10 and BPP-21); Bayesian adaptive spline surfaces (BASS-1 and BASS-2); and the LARK model with multiple kernels (LARMuK).

Figure 4 displays boxplots of the MSEs for $n = 128$. The LABS model has the smallest average mean squared errors for all test functions except the Heavisine example with RSNR = 3. Similarly, for sample size $n = 512$, the LABS model gives the best performance in Figure 5, except for the Doppler function, where it is competitive. Our model removes high frequencies in the interval $[0, 0.1]$ and produces a smooth curve within the corresponding domain.
On the contrary, due to the small number of data points in the Doppler example with $n = 128$, most models yield similarly smooth curves on $[0, 0.1]$.

Figure 4: Boxplots of MSEs from the simulation study with $n = 128$ and RSNR = (a) 3, (b) 5 and (c) 10.

Figure 5: Boxplots of MSEs from the simulation study with $n = 512$ and RSNR = (a) 3, (b) 5 and (c) 10.

The LABS model also attains a small standard deviation of the mean squared errors in all scenarios. This suggests that our model has an excellent ability to find jump points. Furthermore, LABS consistently performs better than B-spline regression using only one type of basis function on the four simulated data sets, since its overcomplete systems can be constructed from various combinations of B-spline basis functions; see Appendix B.

5.2 Simulation 2

Our main interest lies in estimating smooth functions with discontinuities such as jumps, sharp peaks, or both. We design three test functions to assess the practical performance of the proposed method for these cases. The first and second examples are modified versions of the Blocks and Bumps DJ test functions, obtained by adding some smooth parts to the originals. Each test function is given below.
[The displayed formulas for the first two test functions, $\eta_1(x)$ and $\eta_2(x)$, are garbled in extraction. From the surviving fragments: $\eta_1$ (modified Blocks) is a scaled sum of step functions $\mathrm{ssgn}(x - t_j)$, with jump locations including $t = 0.13$, $0.44$, and $0.81$, plus a low-frequency sinusoid; $\eta_2$ (modified Bumps) is a sum of bump kernels $K_w(x - t_j)$, with peak locations including $t = 0.1$, $0.25$, $0.4$, and $0.78$, plus a sinusoid.] Here
\[
\mathrm{sgn}(x) = I_{(0,\infty)}(x) - I_{(-\infty,0)}(x), \qquad
\mathrm{ssgn}(x) = \frac{1 + \mathrm{sgn}(x)}{2}, \qquad
K_w(x) := \big(1 + |x/w|\big)^{-4}.
\]
Finally, we create a sum of jumps, peaks, and some smoothness. [The formula for this final test function is likewise garbled; its surviving fragments give $\eta_3(x) = 6\sin(4\pi x) + 7\big(1 + \mathrm{sgn}(x - \cdot)\big)/2 - \cdots + 17 K_{\cdot}(x - 0.37) + \cdots + 10 K_{\cdot}(x - 0.72) + \cdots$.] The three functions are displayed in Figure 6; we call them, in turn, modified Blocks, modified Bumps, and modified Heavisine.

Figure 6: The three test functions used in the second simulation: Modified Blocks (left), Modified Bumps (center) and Modified Heavisine (right).

In these experiments, we use two or more types of B-spline bases as the elements of the overcomplete systems, since the three functions have different shapes, unlike in the previous simulation studies. The hyperparameters are similar to the previous ones; all hyperparameters for the prior distributions are summarized in Table 2. This time, we compare our model only with the BPP, BASS, EBW, and LARMuK models, which showed relatively good performance on some of the test functions of Simulation 1.

[Table 2: Sets of degrees $S$ and hyperparameters $r$, $R$, $a_{\gamma_k}$, $b_{\gamma_k}$ for the three modified test functions. Most entries were lost in extraction; the recoverable values include $r = 50$, $R = 0.01$, $a_{\gamma_k} = b_{\gamma_k} = 1$.]
Tables 3 and 4 report the average MSEs over 100 replications. For $n = 512$, we find from both Table 4 and Figure 7 that the LABS model performs well in most cases, with either the lowest or the second lowest average MSE values across the 100 replicates. In particular, LABS outperforms its competitors on modified Blocks, irrespective of the sample size and noise level, as expected. Among all models, the worst performing method is BASS-2, since it cannot estimate well the many jump and peak points of the given test functions. Figure 8 supports that the LABS model is able to overcome the noise and to adapt to smooth functions with discontinuities such as jumps, sharp peaks, or both.

Figure 7: Boxplots of MSEs from the second simulation with $n = 512$ and RSNR = (a) 3, (b) 5 and (c) 10.

Figure 8: Comparison of the estimates for a data set generated from the modified Blocks with $n = 128$ and RSNR = 3 using (a) LABS, (b) BASS-1, (c) BPP-10, and (d) EBW. Dashed lines represent true curves; solid lines represent estimated curves.

Model     Modified Blocks (RSNR=3 / 5 / 10)                Modified Bumps (RSNR=3 / 5 / 10)                 Modified Heavisine (RSNR=3 / 5 / 10)
EBW       3.781(0.7407) / 1.538(0.3854) / 0.401(0.0684)    3.921(1.0117) / 1.557(0.3191) / 0.445(0.1043)    2.548(0.4738) / 1.326(0.2793) / 0.402(0.095)
BPP-10    2.238(0.6436) / 0.951(0.2439) / 0.367(0.098)     2.949(0.749) / 1.351(0.3244) / 0.631(0.2488)     2.06(0.645) / 0.824(0.2125) / 0.287(0.0907)
BPP-21    2.589(0.4787) / 1.336(0.2305) / 0.985(0.1714)    3.777(0.9094) / 2.586(0.6268) / 2.39(0.6003)     2.168(0.3821) / 1.228(0.2969) / 0.825(0.3458)
BASS-1    2.283(0.5013) / 0.76(0.2194) / 0.172(0.0424)     2.199(0.5625) / 0.858(0.1622) / 0.276(0.0483)    2.013(0.5502) / 0.737(0.198) / 0.178(0.0418)
BASS-2    4.038(0.6519) / 2.232(0.371) / 1.368(0.168)      9.881(1.1796) / 7.944(0.7727) / 6.999(0.5772)    3.275(0.3734) / 2.276(0.3877) / 1.378(0.2871)
LARMuK    2.158(0.5735) / 0.97(0.2352) / 0.298(0.0929)     2.029(0.7688) / 0.822(0.2446) / 0.271(0.0944)    1.721(0.4757) / 0.713(0.1912) / 0.219(0.075)
LABS      [row lost in extraction]

Table 3: Average of MSEs over 100 replications for the three functions of Simulation 2 with $n = 128$. Estimated standard errors of the MSEs are shown in parentheses.

Model     Modified Blocks (RSNR=3 / 5 / 10)                Modified Bumps (RSNR=3 / 5 / 10)
EBW       1.525(0.228) / 0.644(0.1025) / 0.171(0.0245)     1.487(0.2056) / 0.595(0.0847) / [remaining entries lost in extraction]
[remaining rows lost in extraction]
Table 4: Average of MSEs over 100 replications for the three functions of Simulation 2 with $n = 512$. Estimated standard errors of the MSEs are shown in parentheses.

6 Real Data Applications

We now apply the LABS model (11) to real-world datasets: the effect of the minimum legal drinking age (MLDA) on mortality rates, the closing bitcoin price index, and the daily maximum concentration of fine particulate matter (PM2.5) in Seoul. All of the real data examples exhibit wildly varying patterns that may contain jumps or peaks. These fluctuating patterns further illustrate the features of the LABS model.

In the following applications we set the hyperparameter values of the proposed model to $a_J = 5$, $b_J = 1$, $r = 0.01$, and $R = 0.01$.
In these analyses, we practically choose $S = \{0, 1, 2\}$, because the true curve underlying real data is unknown and may have varying smoothness. We run the sampler 200,000 times with a burn-in of 100,000 and thin by 10 to achieve convergence of the MCMC algorithm. We also conduct performance comparisons of our model against the methods that performed relatively well in the simulation studies.

6.1 Example 1: Minimum legal drinking age

The minimum legal drinking age (MLDA) heavily affects youth alcohol consumption, which has been a sensitive issue worldwide for policymakers. Over the past three decades, there have been many studies on the effect of the legal access age for alcohol on death rates. The MLDA dataset, collected from Angrist and Pischke (2014), contains death rates, a measure of the total number of deaths per 100,000 young Americans per year.

This data has been widely used to estimate the causal effect of minimum legal drinking age policies in the area of regression discontinuity design (RDD). Figure 9 (a) highlights that the MLDA data might represent a piecewise smooth function with a single jump discontinuity at the minimum drinking age of 21, referred to as the cutoff in the RDD. Specifically, each observation (point) in Figure 9 corresponds to the death rate from all causes in a monthly interval, and the total number of observations is 48.

Using the MLDA data, we estimate the unknown function of death rates via LABS and several competing models, including BPP-21, BASS-1, LARMuK, and GP-R.

Figure 9: (a) A piecewise curve fit and (b) comparison of the fitted posterior means using BASS-1, GP-R and LABS for the minimum legal drinking age (MLDA) dataset.

Figure 9 (b) shows three posterior mean estimates of the unknown function. The solid curve denotes LABS, the dotted-dashed curve indicates GP-R, and the dashed curve represents BASS-1. In Figure 9 (b), both LABS and BASS-1 provide posterior curves similar to the piecewise polynomial regression of Figure 9 (a), and their estimated curves have a jump point at 21. While the estimated function of GP-R is smooth, the mean function of the LABS model has both smooth and jump parts. We calculated the mean squared error with 10-fold cross-validation for comparison between the methods. The means and standard deviations of the cross-validation prediction errors are given in Table 5. The smaller CV error of LABS implies that it estimates a smooth function with discontinuous points better than the others.

[Table 5: Means and standard deviations of 10-fold cross-validation errors on the MLDA data for LABS, BASS-1, BPP-21, LARMuK, and GP-R; the numerical entries were lost in extraction.]

6.2 Example 2: Bitcoin prices

Bitcoin is the best-known cryptocurrency, based on blockchain technology. Demand for bitcoin has increased globally because it offers high yields and easy access. The primary characteristic of bitcoin is that it enables financial transactions from user to user on a peer-to-peer network, without a central bank. Unrestricted trading and a market smaller than the stock market lead to high volatility in the bitcoin price. We collected the daily bitcoin exchange rate (BTC vs. USD) on Bitstamp from January 1, 2017, to December 31, 2018.
The Bitcoin data (sourced from Bitstamp) has 730 observations and 8 variables: date, open price (in USD), high price, low price, closing price, volume in bitcoin, volume in currency, and weighted bitcoin price.

We also added a LOESS (locally estimated scatterplot smoothing) regression line to a scatter plot of the daily closing price in Figure 10. The dataset shows locally strong upward and downward movements.

Figure 10: Daily bitcoin closing price with a smoothing line.

We apply LABS and the other models to estimate the curve of the daily bitcoin closing price. Figure 11 illustrates the predicted curves of the LABS and competing models for approximating the unknown function of the daily bitcoin closing price. There are no significant differences between the estimated posterior curves. We therefore calculate cross-validation errors to assess the model performances; the values are given in Table 6. Both Table 6 and Figure 12 demonstrate that the LABS model provides more accurate function estimation and consistent performance, having both the minimum mean and a relatively low standard deviation of the cross-validation errors. They also indicate that the Gaussian process is not appropriate in cases with locally varying smoothness. We found that LABS gives more reliable estimates of functions that may have both discontinuous and smooth parts than the other methods.

[Table 6: Means and standard deviations of cross-validation errors on the Bitcoin data for LABS, BASS-1, BPP-21, LARMuK, and GP-R; the numerical entries were lost in extraction.]

Figure 11: Posterior mean of the mean function $\eta$ on the Bitcoin dataset using four models: (a) LABS, (b) BASS-1, (c) BPP-21 and (d) LARMuK.

Figure 12: Boxplot of the cross-validation test error rates for the Bitcoin data.

6.3 Example 3: Fine particulate matter in Seoul

Fine dust has become a national issue in Korea, and its forecasting has received great attention from the media. Much research on fine particulate matter (PM2.5) has been carried out as it gained social attention. According to these studies, Korea's fine dust particles originate both within the country and from external sources in China. Many factors cause PM2.5 concentrations to rise or fall rapidly and make their behavior difficult to predict accurately.

We estimate the unknown function of the daily maximum concentration of PM2.5 in Seoul. The PM2.5 dataset, collected from AIRKOREA, includes 1261 daily maximum values of PM2.5 concentration from January 1, 2015, to June 30, 2018. We removed all observations with missing values.

Figure 13: Daily maximum concentrations of PM2.5 in Seoul with a smoothing line.

Figure 13 displays daily fluctuations and seasonality: PM2.5 concentrations are higher in winter and spring than in summer and fall. The LOESS smoothed line added to the figure does not reflect these features well. We take advantage of the combination of basis functions $S = \{0, 1, 2\}$ to capture the characteristics of the PM2.5 data, with its multiple jumps and peak points. As shown in Figure 14, the four methods produce different estimates of the unknown mean function and pick up features of the data in their own ways. Interestingly, LABS, BASS-1, and BPP-10 react in different ways when detecting the peaks, jumps, and smooth parts of the PM2.5 data.

Figure 14: Posterior mean of the mean function on the PM2.5 dataset using four models: (a) LABS, (b) BASS-1, (c) BPP-10 and (d) GP-R.

We also compute the averages and standard deviations of the cross-validated errors of LABS, BPP-10, BASS-1, LARMuK, and GP-R, which are given in Table 7. The LABS model has the lowest cross-validation error among all methods.
Moreover, the comparably low standard deviation of LABS supports that it performs stably when estimating functions of any shape, owing to its use of all three types of B-spline bases.

[Table 7: Means and standard deviations of cross-validation errors on the PM2.5 data for LABS, BASS-1, BPP-10, LARMuK, and GP-R; the numerical entries were lost in extraction.]
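The cross-validation protocol used throughout this section can be sketched as follows. `fit_and_predict` is a placeholder for any of the compared fitting routines (LABS, BASS, GP, ...); the function names and the fold-splitting details are our assumptions, not the paper's implementation.

```python
import numpy as np

def cv_error(x, y, fit_and_predict, n_folds=10, seed=0):
    """Mean and spread of k-fold cross-validation MSEs for one model."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, n_folds)
    errs = []
    for test in folds:
        train = np.setdiff1d(idx, test)
        pred = fit_and_predict(x[train], y[train], x[test])
        errs.append(np.mean((y[test] - pred) ** 2))
    return np.mean(errs), np.std(errs)
```

Reporting both the mean and the standard deviation of the fold errors, as in Tables 5-7, rewards methods that are not only accurate on average but also stable across folds.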
7 Conclusions

We have suggested a general function estimation methodology using B-spline basis functions as the elements of an overcomplete system. The B-spline basis can systematically represent functions with varying smoothness, since it has attractive properties such as local support and controllable differentiability. The overcomplete system and the Lévy random measure enable the model to capture all features of an unknown regression function that has both continuous and discontinuous parts. Simulation studies and real data analyses show that the proposed model performs better than the competing models. We also showed that the prior has full support on certain Besov spaces. The main limitation of the LABS model is the slow mixing of its MCMC algorithm. Future work will develop a more efficient algorithm for the LABS model and extend it to multivariate analysis.
References
Abramovich, F., Sapatinas, T. and Silverman, B. W. (1998). Wavelet thresholding via a Bayesian approach, Journal of the Royal Statistical Society: Series B (Statistical Methodology) (4): 725–749.
Angrist, J. D. and Pischke, J.-S. (2014). Mastering 'Metrics: The Path from Cause to Effect, Princeton University Press.
Chu, J.-H., Clyde, M. A. and Liang, F. (2009). Bayesian function estimation using continuous wavelet dictionaries, Statistica Sinica pp. 1419–1438.
Cohen, A. (2003). Numerical Analysis of Wavelet Methods, Elsevier.
Cox, M. G. (1972). The numerical evaluation of B-splines, IMA Journal of Applied Mathematics (2): 134–149.
Crainiceanu, C. M., Ruppert, D., Carroll, R. J., Joshi, A. and Goodner, B. (2007). Spatially adaptive Bayesian penalized splines with heteroscedastic errors, Journal of Computational and Graphical Statistics (2): 265–288.
De Boor, C. (1972). On calculating with B-splines, Journal of Approximation Theory (1): 50–62.
Denison, D. G., Mallick, B. K. and Smith, A. F. (1998a). Bayesian MARS, Statistics and Computing (4): 337–346.
Denison, D., Mallick, B. and Smith, A. (1998b). Automatic Bayesian curve fitting, Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2): 333–350.
DeVore, R. A. and Lorentz, G. G. (1993). Constructive Approximation, Vol. 303, Springer Science & Business Media.
DiMatteo, I., Genovese, C. R. and Kass, R. E. (2001). Bayesian curve-fitting with free-knot splines, Biometrika (4): 1055–1071.
Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage, Journal of the American Statistical Association (432): 1200–1224.
Donoho, D. L. and Johnstone, J. M. (1994). Ideal spatial adaptation by wavelet shrinkage, Biometrika (3): 425–455.
Feng, D. (2013). miscF: Miscellaneous functions. URL: https://cran.r-project.org/web/packages/miscF
Francom, D. and Sansó, B. (2016). BASS: Bayesian adaptive spline surfaces. URL: https://cran.r-project.org/web/packages/BASS/
Francom, D., Sansó, B., Kupresanin, A. and Johannesson, G. (2018). Sensitivity analysis and emulation for functional data using Bayesian adaptive splines, Statistica Sinica: 791–816.
Friedman, J. H. (1991). Multivariate adaptive regression splines, The Annals of Statistics pp. 1–67.
Gijbels, I., Lambert, A. and Qiu, P. (2007). Jump-preserving regression and smoothing using local linear fitting: a compromise, Annals of the Institute of Statistical Mathematics (2): 235–272.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika (4): 711–732.
Johnstone, I. M. and Silverman, B. W. (2005). Empirical Bayes selection of wavelet thresholds, Annals of Statistics pp. 1700–1752.
Karatzoglou, A., Smola, A., Hornik, K., Maniscalco, M. A., Teo, C. H. and National ICT Australia (NICTA) (2004). kernlab: Kernel-based machine learning lab. URL: https://cran.r-project.org/web/packages/kernlab/
Koo, J.-Y. (1997). Spline estimation of discontinuous regression functions, Journal of Computational and Graphical Statistics (3): 266–284.
Krivobokova, T., Crainiceanu, C. M. and Kauermann, G. (2008). Fast adaptive penalized splines, Journal of Computational and Graphical Statistics (1): 1–20.
Lee, J. (2007). Sampling methods of neutral to the right processes, Journal of Computational and Graphical Statistics (3): 656–671.
Lee, J. and Kim, Y. (2004). A new algorithm to generate beta processes, Computational Statistics & Data Analysis (3): 441–453.
Lee, T. C. (2002). Automatic smoothing for discontinuous regression functions, Statistica Sinica pp. 823–842.
Lee, Y., Mano, S. and Lee, J. (2020). Bayesian curve fitting for discontinuous functions using an overcomplete system with multiple kernels, Journal of the Korean Statistical Society pp. 1–21.
Lewicki, M. S. and Sejnowski, T. J. (2000). Learning overcomplete representations, Neural Computation (2): 337–365.
Liu, Z. and Guo, W. (2010). Data driven adaptive spline smoothing, Statistica Sinica pp. 1143–1163.
Luo, Z. and Wahba, G. (1997). Hybrid adaptive splines, Journal of the American Statistical Association (437): 107–116.
Ouyang, Z., Clyde, M. A. and Wolpert, R. L. (2008). Bayesian additive regression kernels, Duke University.
Petrushev, P. P. (1988). Direct and converse theorems for spline and rational approximation and Besov spaces, Function Spaces and Applications, Springer, pp. 363–377.
Pillai, N. S. (2008). Lévy random measures: Posterior consistency and applications, PhD dissertation, Duke University.
Pillai, N. S., Wu, Q., Liang, F., Mukherjee, S. and Wolpert, R. L. (2007). Characterizing the function space for Bayesian kernel models, Journal of Machine Learning Research (Aug): 1769–1797.
Pintore, A., Speckman, P. and Holmes, C. C. (2006). Spatially adaptive smoothing splines, Biometrika (1): 113–125.
Qiu, P. (2003). A jump-preserving curve fitting procedure based on local piecewise-linear kernel estimation, Journal of Nonparametric Statistics (4-5): 437–453.
Qiu, P. and Yandell, B. (1998). Local polynomial jump-detection algorithm in nonparametric regression, Technometrics (2): 141–152.
R Core Team (2020). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.
Ruppert, D. and Carroll, R. J. (2000). Theory & methods: Spatially-adaptive penalties for spline fitting, Australian & New Zealand Journal of Statistics (2): 205–223.
Silverman, B. W., Evers, L., Xu, K., Carbonetto, P. and Stephens, M. (2005). EbayesThresh: Empirical Bayes thresholding and related methods. URL: https://cran.r-project.org/web/packages/EbayesThresh/
Simoncelli, E. P., Freeman, W. T., Adelson, E. H. and Heeger, D. J. (1992). Shiftable multiscale transforms, IEEE Transactions on Information Theory (2): 587–607.
Smith, M. and Kohn, R. (1996). Nonparametric regression using Bayesian variable selection, Journal of Econometrics (2): 317–343.
Tu, C. (2006). Bayesian nonparametric modeling using Lévy process priors with applications for function estimation, time series modeling and spatio-temporal modeling, PhD dissertation, Duke University.
Wang, X., Du, P. and Shen, J. (2013). Smoothing splines with varying smoothing parameter, Biometrika (4): 955–970.
Wang, X.-F. (2010). fANCOVA: Nonparametric analysis of covariance. URL: https://cran.r-project.org/web/packages/fANCOVA/
Wolpert, R. L., Clyde, M. A. and Tu, C. (2011). Stochastic expansions using continuous dictionaries: Lévy adaptive regression kernels, The Annals of Statistics (4): 1916–1962.
Xia, Z. and Qiu, P. (2015). Jump information criterion for statistical inference in estimating discontinuous curves, Biometrika (2): 397–408.
Yang, L. and Hong, Y. (2017). Adaptive penalized splines for data smoothing, Computational Statistics & Data Analysis: 70–83.
Yang, Y. and Song, Q. (2014). Jump detection in time series nonparametric regression models: a polynomial spline approach, Annals of the Institute of Statistical Mathematics (2): 325–344.
Zhou, S. and Shen, X. (2001). Spatially adaptive regression splines and accurate knot selection schemes, Journal of the American Statistical Association (453): 247–259.

A Proof of Theorems 1-3
A.1 Proof of Theorem 1
For simplicity, we assume that $\mathcal{X} = [0, 1]$. Since each B-spline basis function is bounded with compact support, $\|B_k(x; \boldsymbol{\xi}_k)\|_p$ is finite for all $k \ge 0$. It is therefore enough to show that the Besov semi-norm $|B_k(x; \boldsymbol{\xi}_k)|_{B^\alpha_{p,q}}$ is finite for all $k \ge 0$. The properties of the modulus of smoothness,
\[
\omega_k(f, t)_p \le 2^r \cdot \omega_{k-r}(f, t)_p, \qquad 0 \le r \le k, \quad f \in L_p(\mathcal{X}),
\]
imply that
\[
\omega_r(B_k(x; \boldsymbol{\xi}_k), t)_p \le 2^{r-1} \cdot \omega_1(B_k(x; \boldsymbol{\xi}_k), t)_p.
\]
Let $k$ be zero. Then the B-spline basis is piecewise constant with 2 knots, $\boldsymbol{\xi}_0 := (\xi_{0,1}, \xi_{0,2})$. By the definition of the B-spline basis (6), we divide the computation of the modulus of continuity into two cases.

Case 1. Assume $\xi_{0,1} + h < \xi_{0,2}$, $h > 0$. Then $\|B_0(x + h; \boldsymbol{\xi}_0) - B_0(x; \boldsymbol{\xi}_0)\|_p \le 2^{1/p} \cdot h^{1/p}$.

Case 2. Assume $\xi_{0,1} + h \ge \xi_{0,2}$, $h > 0$. Then $\|B_0(x + h; \boldsymbol{\xi}_0) - B_0(x; \boldsymbol{\xi}_0)\|_p \le 2^{1/p} \cdot h^{1/p}$.

Therefore, in all cases,
\[
\omega_r(B_0(x; \boldsymbol{\xi}_0), t)_p \le 2^r \cdot t^{1/p}. \tag{17}
\]
By definition, $|B_0(x; \boldsymbol{\xi}_0)|_{B^\alpha_{p,q}} = \big( \int_0^\infty (t^{-\alpha} \omega_r(B_0(x; \boldsymbol{\xi}_0), t)_p)^q \, \frac{dt}{t} \big)^{1/q}$, so
\[
|B_0(x; \boldsymbol{\xi}_0)|_{B^\alpha_{p,q}}
\le \bigg[ \int_0^1 t^{-\alpha q - 1} \cdot 2^{rq} \cdot t^{q/p} \, dt + \int_1^\infty t^{-\alpha q - 1} \cdot 2^{rq} \, dt \bigg]^{1/q}
= 2^r \cdot \bigg[ \int_0^1 t^{-q(\alpha - 1/p) - 1} \, dt + \int_1^\infty t^{-\alpha q - 1} \, dt \bigg]^{1/q}.
\]
The upper bound of $|B_0(x; \boldsymbol{\xi}_0)|_{B^\alpha_{p,q}}$ is finite if and only if $\alpha < 1/p$ and $q < \infty$.

Let $k \ge 1$. Since the B-spline basis of degree $k$ is a piecewise polynomial with $(k-1)$ continuous derivatives, it belongs to the Sobolev space $W^k_p(\mathcal{X})$, the vector space of functions with weak derivatives up to order $k$; see the definition of the Sobolev space described in Chapter 2.5 of DeVore and Lorentz (1993). We use the following property of the modulus of smoothness:
\[
\omega_{r+k}(f, t)_p \le t^r \cdot \omega_k(f^{(r)}, t)_p, \qquad t > 0,
\]
where $f^{(r)}$ is the weak $r$th derivative of $f$. For $k \ge 1$, the Besov semi-norm of $B_k(x; \boldsymbol{\xi}_k)$ is bounded by
\[
\begin{aligned}
|B_k(x; \boldsymbol{\xi}_k)|_{B^\alpha_{p,q}}
&= \bigg( \int_0^\infty \big( t^{-\alpha} \omega_r(B_k(x; \boldsymbol{\xi}_k), t)_p \big)^q \, \frac{dt}{t} \bigg)^{1/q} \\
&\le \bigg( \int_0^1 \big( t^{-\alpha} \cdot t^k \cdot \omega_{r-k}(B^{(k)}_k(x; \boldsymbol{\xi}_k), t)_p \big)^q \, \frac{dt}{t} + \int_1^\infty 2^{rq} \cdot t^{-\alpha q - 1} \, dt \bigg)^{1/q} \\
&\le \bigg( \int_0^1 t^{-\alpha q - 1} \cdot \big( t^k \cdot 2^{r-k-1} \cdot \omega_1(B^{(k)}_k(x; \boldsymbol{\xi}_k), t)_p \big)^q \, dt + 2^{rq} \cdot \int_1^\infty t^{-\alpha q - 1} \, dt \bigg)^{1/q}. \tag{18}
\end{aligned}
\]
Since $B^{(k)}_k(x; \boldsymbol{\xi}_k)$ is a piecewise constant function, (17) implies that
\[
\omega_1(B^{(k)}_k(x; \boldsymbol{\xi}_k), t)_p \le C \cdot t^{1/p}, \quad \text{for some constant } C > 0. \tag{19}
\]
Using (18) and (19), it follows that
\[
|B_k(x; \boldsymbol{\xi}_k)|_{B^\alpha_{p,q}} \le \bigg( C' \cdot \int_0^1 t^{-\alpha q + kq + q/p - 1} \, dt + 2^{rq} \cdot \int_1^\infty t^{-\alpha q - 1} \, dt \bigg)^{1/q}, \quad \text{for some constant } C' > 0.
\]
For all $k \ge 1$, $|B_k(x; \boldsymbol{\xi}_k)|_{B^\alpha_{p,q}}$ is finite if and only if $\alpha < k + 1/p$ and $q < \infty$, so the proof is complete.

A.2 Proof of Theorem 2
By Theorem 3 of Wolpert et al. (2011), the $L_p$ norm and Besov semi-norm of $\eta$ satisfy the following upper bounds:
\[
\|\eta\|_p \le \sum_{k \in S} \sum_{l} \|B_k(x; \boldsymbol{\xi}_{k,l})\|_p \, |\beta_{k,l}|,
\qquad
|\eta|_{B^\alpha_{p,q}} \le \sum_{k \in S} \sum_{l} |\beta_{k,l}| \cdot |B_k|_{B^\alpha_{p,q}}.
\]
Since the integrability condition (12) is satisfied and the B-spline bases are bounded and locally supported, $\|\eta\|_p$ is almost surely finite. To obtain finite Besov semi-norms for all $k \in S$, the smoothness parameter $\alpha$ should satisfy $\alpha < \min(S) + 1/p$ by Theorem 1. Therefore, $\eta$ belongs to $B^\alpha_{p,q}$ with $\alpha < \min(S) + 1/p$ almost surely.

A.3 Proof of Theorem 3

For the sake of simplicity we assume $\mathcal{X} = [0, 1]$. Fix $\delta > 0$ and $\eta_0 \in B^\alpha_{p,q}([0, 1])$ with $\alpha > 0$ and $1 \le p, q < \infty$. If $1 \le p' \le p < \infty$, then $\eta_0$ also belongs to $B^\alpha_{p',q}([0, 1])$. By spline approximation results for Besov spaces (Petrushev; 1988), there exists a spline $s \in \mathcal{S}(n^\star, q)$ such that
\[
\|\eta_0 - s\|_p \le C \, \|\eta_0\|_{B^\alpha_{p',q}} \, (n^\star)^{-\alpha} < \frac{\delta}{2},
\]
where $\mathcal{S}(n^\star, q)$ denotes the set of all splines of degree $(q - 1)$
with a sufficiently large number $n^\star$ of knots and a constant $C = C(\alpha, p, q)$. Since any spline of a given degree can be represented as a linear combination of B-spline basis functions of the same degree, we can write the spline $s(x)$ as
\[
s(x) = \sum_{j=1}^{n^\star} \beta^*_j B^*_{(q-1),j}(x), \tag{20}
\]
where $B^*_{(q-1),j}(x)$ is the B-spline basis of degree $(q - 1)$ with a sequence of knots $\boldsymbol{\xi}^*$ as in (6). Set $n^\star := \sum_{k \in S} J^\delta_k$, $A := \sum_{k \in S} \sum_{l=1}^{J^\delta_k} |\beta^*_{k,l}| < \infty$, $\rho := \sup \|B_k(x, \boldsymbol{\xi}_k)\|_p < \infty$, and $\epsilon := \delta / \big(2(A + 2\rho)\big)$. We denote the range of a sequence of knots $\boldsymbol{\xi}_{k,l}$ by $r(\boldsymbol{\xi}_{k,l})$, i.e., $r(\boldsymbol{\xi}_{k,l}) = \xi_{k,l,(k+2)} - \xi_{k,l,1}$. For convenience, we reindex the coefficients and knots of the spline $s(x)$ in (20) as $\beta^*_{k,l}$ and $\boldsymbol{\xi}^*_{k,l}$ for $l = 1, \ldots, J^\delta_k$, $k \in S$. Then the spline $s(x)$ can be expressed as
\[
s(x) = \sum_{k \in S} \sum_{l=1}^{J^\delta_k} \beta^*_{k,l} B^*_{(q-1),l}(x; \boldsymbol{\xi}^*_{k,l}),
\]
where $\boldsymbol{\xi}^*_{k,l} := (\xi^*_l, \ldots, \xi^*_{l+(q-1)})$ is a subsequence of the given knots $\boldsymbol{\xi}^*$. Using the definitions of the B-spline basis in (6) and (7), we can find a $\zeta > 0$ such that
\[
\max\big(r(\boldsymbol{\xi}_{k,l}), r(\boldsymbol{\xi}^*_{k,l})\big) < \zeta
\;\Longrightarrow\;
\|B_k(x; \boldsymbol{\xi}_{k,l}) - B^*_{(q-1),l}(x; \boldsymbol{\xi}^*_{k,l})\|_p < \epsilon, \quad \forall l, \forall k.
\]
Let us define the set
\[
\bar{b}'_\delta(\eta_0) := \bigg\{ \eta : \eta(x) = \sum_{k \in S} \sum_{l=1}^{J^\delta_k} \beta_{k,l} B_k(x; \boldsymbol{\xi}_{k,l}), \;
\sum_{k \in S} \sum_{l=1}^{J^\delta_k} |\beta_{k,l} - \beta^*_{k,l}| < \epsilon, \;
\max\big(r(\boldsymbol{\xi}_{k,l}), r(\boldsymbol{\xi}^*_{k,l})\big) < \zeta, \; \forall l, \forall k \bigg\}.
\]

Lemma 4. $\bar{b}'_\delta(\eta_0) \subset \bar{b}_\delta(\eta_0)$.

Proof. It suffices to show that $\eta \in \bar{b}'_\delta(\eta_0) \Rightarrow \eta \in \bar{b}_\delta(\eta_0)$. For any $\eta \in \bar{b}'_\delta(\eta_0)$,
\[
\begin{aligned}
\|\eta - s\|_p
&\le \sum_{k \in S} \sum_{l=1}^{J^\delta_k} \big\| \beta_{k,l} B_k(x; \boldsymbol{\xi}_{k,l}) - \beta^*_{k,l} B^*_{(q-1),l}(x; \boldsymbol{\xi}^*_{k,l}) \big\|_p \\
&\le \sum_{k \in S} \sum_{l=1}^{J^\delta_k} |\beta_{k,l}| \cdot \big\| B_k(x; \boldsymbol{\xi}_{k,l}) - B^*_{(q-1),l}(x; \boldsymbol{\xi}^*_{k,l}) \big\|_p
 + \sum_{k \in S} \sum_{l=1}^{J^\delta_k} |\beta_{k,l} - \beta^*_{k,l}| \cdot \big\| B_k(x; \boldsymbol{\xi}_{k,l}) \big\|_p \\
&\le \epsilon \cdot \sum_{k \in S} \sum_{l=1}^{J^\delta_k} |\beta_{k,l}| + \rho \cdot \sum_{k \in S} \sum_{l=1}^{J^\delta_k} |\beta_{k,l} - \beta^*_{k,l}|
\le \epsilon \cdot A + 2\rho \cdot \epsilon = (A + 2\rho) \cdot \epsilon = \frac{\delta}{2}.
\end{aligned}
\]
By the triangle inequality,
\[
\|\eta - \eta_0\|_p \le \|\eta - s\|_p + \|s - \eta_0\|_p < \frac{\delta}{2} + \frac{\delta}{2} = \delta.
\]
Thus $\eta \in \bar{b}_\delta(\eta_0)$, which finishes the proof of the lemma.

To complete the proof of the theorem, we have to show that $\Pi\big(\eta \in \bar{b}'_\delta(\eta_0)\big) > 0$. Let $J^\star := \max_{k \in S} J^\delta_k$. Then
\[
\begin{aligned}
\Pi\big(\eta \in \bar{b}'_\delta(\eta_0)\big)
&= \Pi\bigg( \sum_{k \in S} \int\!\!\int_{\mathbb{R} \times \mathcal{X}^{(k+2)}} \beta_k B_k(x; \boldsymbol{\xi}_k) \, N_k(d\beta_k, d\boldsymbol{\xi}_k) \in \bar{b}'_\delta(\eta_0) \bigg)
= \Pi\bigg( \sum_{k \in S} \sum_{l=1}^{J_k} B_k(x; \boldsymbol{\xi}_{k,l}) \beta_{k,l} \in \bar{b}'_\delta(\eta_0) \bigg) \\
&= P\bigg( \sum_{k \in S} \sum_{l=1}^{J^\delta_k} |\beta_{k,l} - \beta^*_{k,l}| < \epsilon, \ \max\big(r(\boldsymbol{\xi}_{k,l}), r(\boldsymbol{\xi}^*_{k,l})\big) < \zeta, \ J_k = J^\delta_k, \ \forall k \in S \bigg) \\
&> \prod_{k \in S} \bigg\{ P\bigg[ |\beta_{k,l} - \beta^*_{k,l}| < \frac{\epsilon}{|S| J^\star}, \ \max\big(r(\boldsymbol{\xi}_{k,l}), r(\boldsymbol{\xi}^*_{k,l})\big) < \zeta, \ l = 1, 2, \ldots, J^\delta_k \bigg] \bigg\}
\times \prod_{k \in S} \bigg[ \frac{\nu_k(\mathbb{R} \times \mathcal{X}^{(k+2)})^{J^\delta_k} \exp\big(-\nu_k(\mathbb{R} \times \mathcal{X}^{(k+2)})\big)}{J^\delta_k!} \bigg] \\
&= \prod_{k \in S} \prod_{l=1}^{J^\delta_k} \bigg[ \frac{\nu_k\big( |\beta_{k,l} - \beta^*_{k,l}| < \frac{\epsilon}{|S| J^\star}, \ \max(r(\boldsymbol{\xi}_{k,l}), r(\boldsymbol{\xi}^*_{k,l})) < \zeta \big)}{\nu_k(\mathbb{R} \times \mathcal{X}^{(k+2)})} \bigg]
\times \prod_{k \in S} \bigg[ \frac{\nu_k(\mathbb{R} \times \mathcal{X}^{(k+2)})^{J^\delta_k} \exp\big(-\nu_k(\mathbb{R} \times \mathcal{X}^{(k+2)})\big)}{J^\delta_k!} \bigg] \\
&= \prod_{k \in S} \prod_{l=1}^{J^\delta_k} \bigg[ \int_{|\beta_{k} - \beta^*_{k,l}| < \frac{\epsilon}{|S| J^\star}} \pi(\beta_k) \, d\beta_k \int_{\max(r(\boldsymbol{\xi}_{k}), r(\boldsymbol{\xi}^*_{k,l})) < \zeta} \pi(\boldsymbol{\xi}_k) \, d\boldsymbol{\xi}_k \bigg]
\times \prod_{k \in S} \frac{M_k^{J^\delta_k} \exp(-M_k)}{J^\delta_k!}.
\end{aligned}
\]
Since we assume a finite Lévy measure, with $\pi(\beta_k) = N(\beta_k; 0, \phi_k^2)$ and $\pi(\boldsymbol{\xi}_k) = U(\mathcal{X}^{(k+2)})$ in the LABS model, each factor is positive, so $\Pi\big(\eta \in \bar{b}'_\delta(\eta_0)\big) > 0$. Hence, by applying the lemma, we get
\[
\Pi\big(\eta \in \bar{b}_\delta(\eta_0)\big) \ge \Pi\big(\eta \in \bar{b}'_\delta(\eta_0)\big) > 0.
\]

B Full simulation results for Simulation 1
Full simulation results for Simulation 1

This appendix contains the full simulation results for the four DJ test functions. We simulated two scenarios, (a) small sample size (n = 128) and (b) large sample size (n = 512), with different noise levels (root signal-to-noise ratio, RSNR = 3, 5, and 10).

Model    | Bumps RSNR=3   | Bumps RSNR=5   | Bumps RSNR=10  | Blocks RSNR=3  | Blocks RSNR=5  | Blocks RSNR=10
BSP-2    | 26.904(0.4461) | 25.495(0.1606) | 24.9(0.0401)   | 5.96(0.4429)   | 4.53(0.1594)   | 3.927(0.0399)
LOESS    | 47.266(0.1618) | 47.163(0.0812) | 47.119(0.0366) | 17.924(0.7218) | 17.503(0.3846) | 17.332(0.2312)
SS       | 43.552(4.6764) | 43.377(4.7159) | 43.984(4.6074) | 4.895(0.5145)  | 3.396(0.241)   | 2.699(0.1107)
NWK      | 39.892(1.8831) | 39.365(1.3862) | 39.033(0.7393) | 4.285(0.7336)  | 1.936(0.36)    | 0.472(0.0567)
EBW      | 4.986(1.1761)  | 1.936(0.601)   | 0.447(0.0981)  | 3.243(1.0747)  | 0.859(0.2237)  | 0.21(0.0484)
GSP-L    | 22.99(4.8847)  | 22.03(5.3556)  | 20.112(4.4213) | 7.409(1.3261)  | 6.637(0.9807)  | 6.546(1.1115)
GSP-R    | 41.626(2.9041) | 40.819(2.7234) | 40.955(3.102)  | 15.654(2.0255) | 15.323(1.8902) | 15.44(2.3391)
BPP-10   | 4.571(2.3604)  | 3.668(2.7142)  | 3.304(2.9789)  | 2.156(0.789)   | 0.908(0.2537)  | 0.465(0.2403)
BPP-21   | 15.115(5.9376) | 14.674(6.3396) | 14.363(6.7395) | 3.918(0.589)   | 2.682(0.5399)  | 2.311(0.451)
BASS-1   | 2.968(0.4322)  | 1.206(0.4907)  | 0.252(0.0421)  | 2.498(0.6331)  | 0.696(0.2226)  | 0.122(0.0381)
BASS-2   | 47.977(7.4411) | 45.988(9.3435) | 45.021(9.2185) | 7.533(1.7616)  | 4.253(0.8586)  | 2.852(0.3863)
LARMuK   | 2.852(0.426)   | 1.182(0.678)   | 0.319(0.0663)  | 1.799(0.5873)  | 0.682(0.2436)  | 0.193(0.08)
LABS     |                |                |                |                |                |

Table 8: Average mean squared errors with estimated standard errors in parentheses from 100 replications for the Bumps and Blocks examples with n = 128.
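For readers who want to reproduce this setup, here is a hedged sketch of the data-generating protocol. It assumes the usual convention that the noise level is set from the root signal-to-noise ratio as sigma = sd(f)/RSNR, and uses the standard Blocks test function of Donoho and Johnstone (1994); the placeholder estimator, the seed, and any scaling of the reported MSEs are our assumptions, not the paper's code. The averages and standard errors in the tables come from 100 such replications.

```python
import numpy as np

def blocks(x):
    """The 'Blocks' test function of Donoho and Johnstone (1994)."""
    t = np.array([0.1, 0.13, 0.15, 0.23, 0.25, 0.4, 0.44, 0.65, 0.76, 0.78, 0.81])
    h = np.array([4, -5, 3, -4, 5, -4.2, 2.1, 4.3, -3.1, 2.1, -4.2])
    return np.sum(h * (1 + np.sign(x[:, None] - t)) / 2, axis=1)

def simulate(n, rsnr, rng):
    """One replication: equally spaced design, noise level set by the RSNR."""
    x = np.arange(1, n + 1) / n
    f = blocks(x)
    y = f + rng.normal(0.0, f.std() / rsnr, size=n)   # sigma = sd(f) / RSNR
    return x, y, f

rng = np.random.default_rng(0)
mses = []
for _ in range(100):                                  # 100 replications
    x, y, f = simulate(n=128, rsnr=3, rng=rng)
    f_hat = y                                         # placeholder; plug in an estimator
    mses.append(np.mean((f_hat - f) ** 2))
se = np.std(mses, ddof=1) / np.sqrt(len(mses))
print(f"average MSE {np.mean(mses):.3f} (SE {se:.4f})")
```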
Model    | Doppler RSNR=3 | Doppler RSNR=5 | Doppler RSNR=10 | Heavisine RSNR=3 | Heavisine RSNR=5 | Heavisine RSNR=10
BSP-2    | 3.896(0.4928)  | 2.447(0.1774)  | 1.836(0.0444)   | 2.399(0.4208)    | 0.926(0.1515)    | 0.305(0.0379)
LOESS    | 8.891(1.506)   | 6.533(1.0853)  | 5.344(0.6362)   | 0.895(0.2248)    | 0.548(0.0976)    | 0.35(0.0436)
SS       | 3.644(0.587)   | 2.025(0.24)    | 1.251(0.0812)   | 0.875(0.2725)    | 0.484(0.1021)    | 0.235(0.0299)
NWK      | 4.045(1.102)   | 1.864(0.2683)  | 0.477(0.0624)   | 1.022(0.352)     | 0.521(0.1321)    | 0.228(0.0472)
EBW      | 2.979(0.6397)  | 1.319(0.3142)  | 0.341(0.0855)   | 1.29(0.3473)     | 0.586(0.1365)    | 0.185(0.0456)
GSP-L    | 5.845(1.3505)  | 5.313(1.3217)  | 4.843(0.9023)   | 2.601(2.0255)    | 2.217(2.1273)    | 1.743(1.7561)
GSP-R    | 12.164(2.9995) | 11.588(3.1093) | 12.402(3.5243)  | 1.02(0.2869)     | 0.756(0.1707)    | 0.646(0.1152)
BPP-10   | 2.911(0.6396)  | 1.275(0.2471)  | 0.559(0.2008)   | 1.503(0.4239)    | 0.674(0.1576)    | 0.217(0.0616)
BPP-21   | 2.575(0.5217)  | 1.463(0.3399)  | 1.166(0.3932)   | 0.941(0.2702)    | 0.444(0.105)     | 0.147(0.032)
BASS-1   | 2.865(0.6162)  | 1.167(0.2607)  | 0.353(0.0605)   | 1.022(0.2434)    | 0.499(0.1047)    | 0.135(0.033)
BASS-2   | 2.841(0.5796)  | 1.753(0.2702)  | 1.344(0.1205)   |                  |                  |
Table 9: Average mean squared errors with estimated standard errors in parentheses from 100 replications for the Doppler and Heavisine examples with n = 128.

Model    | Bumps RSNR=3   | Bumps RSNR=5   | Bumps RSNR=10  | Blocks RSNR=3 | Blocks RSNR=5 | Blocks RSNR=10
BSP-2    | 29.359(0.1163) | 29.011(0.0419) | 28.864(0.0105) | 5.097(0.118)  | 4.74(0.0425)  | 4.59(0.0106)
LOESS    | 43.468(5.4554) | 40.385(7.3012) | 36.495(7.2734) | 3.755(0.2529) | 3.095(0.1343) | 2.922(0.0252)
SS       | 16.211(0.255)  | 15.406(0.123)  | 15.06(0.0537)  | 2.837(0.1609) | 2.197(0.0713) | 1.883(0.0256)
NWK      | 4.885(0.3559)  | 1.796(0.1194)  | 0.482(0.0319)  | 2.34(0.1831)  | 1.349(0.1149) | 0.581(0.1147)
EBW      | 2.42(0.3291)   | 0.914(0.0992)  |                |               |               |

Table 10: Average mean squared errors with estimated standard errors in parentheses from 100 replications for the Bumps and Blocks examples with n = 512.

Model    | Doppler RSNR=3 | Doppler RSNR=5 | Doppler RSNR=10 | Heavisine RSNR=3 | Heavisine RSNR=5 | Heavisine RSNR=10
BSP-2    | 2.875(0.1038)  | 2.534(0.0374)  | 2.39(0.0093)    | 0.635(0.1112)    | 0.294(0.04)      | 0.15(0.01)
LOESS    | 2.756(0.3006)  | 2.044(0.0474)  | 1.893(0.0202)   | 0.464(0.067)     | 0.306(0.0523)    | 0.172(0.0531)
SS       | 1.993(0.1234)  | 1.385(0.0531)  | 1.086(0.0174)   | 0.384(0.0699)    | 0.225(0.0299)    | 0.113(0.0102)
NWK      | 1.649(0.1233)  | 0.853(0.105)   | 0.431(0.0289)   | 0.398(0.0789)    | 0.223(0.0327)    | 0.106(0.0113)
EBW      | 1.263(0.1618)  | 0.592(0.0809)  |                 |                  |                  |
Table 11: Average mean squared errors with estimated standard errors in parentheses from 100 replications for the Doppler and Heavisine examples with n = 512.

Derivation of the full conditionals for LABS
In this appendix, we derive the full conditional distributions of the parameters required for Gibbs sampling. The full conditional posterior of each parameter can be easily obtained via conjugacy properties. Let us first find the full conditional posterior for $\beta_{p,q}$.

• Full conditional posterior for $\beta_{p,q}$. For each $q = 1, \ldots, J_p$,
$$\begin{aligned} [\beta_{p,q} \mid \beta_{p,-q}, \text{others}, Y] &\propto \Bigg[ \exp\Bigg\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{k \in S} \sum_{l=1}^{J_k} B_{k,l}(x_i; \xi_{k,l}) \beta_{k,l} \Big)^2 \Bigg\} \Bigg] \times \Bigg[ \exp\Bigg\{ -\sum_{k \in S} \sum_{l=1}^{J_k} \frac{\beta_{k,l}^2}{2\sigma_k^2} \Bigg\} \Bigg] \\ &\propto \exp\Bigg\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{k \in S \setminus \{p\}} \sum_{l=1}^{J_k} B_{k,l}(x_i; \xi_{k,l}) \beta_{k,l} - \sum_{l=1}^{J_p} B_{p,l}(x_i; \xi_{p,l}) \beta_{p,l} \Big)^2 - \frac{1}{2\sigma_p^2} \sum_{l=1}^{J_p} \beta_{p,l}^2 \Bigg\}. \end{aligned}$$
For convenience, we set $c_i = \beta_0 + \sum_{k \in S \setminus \{p\}} \sum_{l=1}^{J_k} B_{k,l}(x_i; \xi_{k,l}) \beta_{k,l}$ as a constant term. Then,
$$\begin{aligned} &\propto \exp\Bigg\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n \Big( y_i - c_i - \sum_{l=1}^{J_p} B_{p,l}(x_i; \xi_{p,l}) \beta_{p,l} \Big)^2 - \frac{1}{2\sigma_p^2} \sum_{l=1}^{J_p} \beta_{p,l}^2 \Bigg\} \\ &\propto \exp\Bigg\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n \Big( y_i - c_i - \sum_{l \ne q}^{J_p} B_{p,l}(x_i; \xi_{p,l}) \beta_{p,l} - B_{p,q}(x_i; \xi_{p,q}) \beta_{p,q} \Big)^2 - \frac{\beta_{p,q}^2}{2\sigma_p^2} \Bigg\} \\ &\propto \exp\Bigg\{ -\frac{1}{2\sigma^2} \Bigg[ \beta_{p,q}^2 \sum_{i=1}^n \big( B_{p,q}(x_i; \xi_{p,q}) \big)^2 - 2\beta_{p,q} \sum_{i=1}^n \Big( y_i - c_i - \sum_{l \ne q}^{J_p} B_{p,l}(x_i; \xi_{p,l}) \beta_{p,l} \Big) B_{p,q}(x_i; \xi_{p,q}) \Bigg] - \frac{\beta_{p,q}^2}{2\sigma_p^2} \Bigg\} \\ &= \exp\Bigg\{ -\frac{1}{2} \Bigg[ \Bigg( \frac{\sum_{i=1}^n \big( B_{p,q}(x_i; \xi_{p,q}) \big)^2}{\sigma^2} + \frac{1}{\sigma_p^2} \Bigg) \beta_{p,q}^2 - 2 \frac{\sum_{i=1}^n \Big( y_i - c_i - \sum_{l \ne q}^{J_p} B_{p,l}(x_i; \xi_{p,l}) \beta_{p,l} \Big) B_{p,q}(x_i; \xi_{p,q})}{\sigma^2} \beta_{p,q} \Bigg] \Bigg\}. \end{aligned}$$
Thus, the full conditional distribution for $\beta_{p,q}$ is
$$ [\beta_{p,q} \mid \beta_{p,-q}, \text{others}, Y] \sim N(\mu_p, \tilde{\sigma}_p^2) $$
with
$$ \tilde{\sigma}_p^2 = \Bigg( \frac{\sum_{i=1}^n \big( B_{p,q}(x_i; \xi_{p,q}) \big)^2}{\sigma^2} + \frac{1}{\sigma_p^2} \Bigg)^{-1}, \qquad \mu_p = \tilde{\sigma}_p^2 \times \frac{\sum_{i=1}^n \Big( y_i - c_i - \sum_{l \ne q}^{J_p} B_{p,l}(x_i; \xi_{p,l}) \beta_{p,l} \Big) B_{p,q}(x_i; \xi_{p,q})}{\sigma^2}. $$

• Full conditional posterior of $M_k$. For each $k \in S$,
$$ [M_k \mid \text{others}] \propto M_k^{J_k} \exp\{-M_k\} \times M_k^{a_{\gamma_k} - 1} \exp\{-b_{\gamma_k} M_k\} = M_k^{J_k + a_{\gamma_k} - 1} \exp\{-(1 + b_{\gamma_k}) M_k\}. $$
The full conditional distribution for $M_k$ is given by
$$ [M_k \mid \text{others}] \sim \mathrm{Ga}(a_k, b_k), \qquad a_k = a_{\gamma_k} + J_k, \quad b_k = b_{\gamma_k} + 1. $$

• Full conditional posterior of $\sigma^2$.
$$\begin{aligned} [\sigma^2 \mid \text{others}, Y] &\propto \Bigg[ (\sigma^2)^{-\frac{n}{2}} \exp\Bigg\{ -\frac{1}{2\sigma^2} \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{k \in S} \sum_{l=1}^{J_k} B_k(x_i; \xi_{k,l}) \beta_{k,l} \Big)^2 \Bigg\} \Bigg] \times \Bigg[ (\sigma^2)^{-\left( \frac{r}{2} + 1 \right)} \exp\Big\{ -\frac{rR}{2\sigma^2} \Big\} \Bigg] \\ &\propto (\sigma^2)^{-\left( \frac{n + r}{2} + 1 \right)} \exp\Bigg\{ -\frac{1}{2\sigma^2} \Bigg[ \sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{k \in S} \sum_{l=1}^{J_k} B_k(x_i; \xi_{k,l}) \beta_{k,l} \Big)^2 + rR \Bigg] \Bigg\}. \end{aligned}$$
The full conditional distribution for $\sigma^2$ is
$$ [\sigma^2 \mid \text{others}, Y] \sim \mathrm{IG}\Big( \frac{r'}{2}, \frac{r' R'}{2} \Big) $$
with
$$ r' = r + n, \qquad R' = \frac{\sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{k \in S} \sum_{l=1}^{J_k} B_k(x_i; \xi_{k,l}) \beta_{k,l} \Big)^2 + rR}{r'}. $$
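The three conjugate updates above can be sketched directly in Python. The sketch below assumes a fixed number of basis functions $J_k$ and fixed knots; the trans-dimensional birth/death moves of the actual LABS sampler are omitted, and the variable names (B, c, c_full, sigma2_p, a_gamma_k, b_gamma_k, r0, R0) are illustrative assumptions rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)

def update_beta_pq(q, B, y, c, beta, sigma2, sigma2_p):
    """Draw beta[q] from its normal full conditional.  B is the n x J_p design
    matrix with columns B_{p,l}(x_i; xi_{p,l}); c is the vector of constants
    c_i (intercept plus the contribution of all bases with degree k != p)."""
    resid = y - c - B @ beta + B[:, q] * beta[q]      # y_i - c_i - sum_{l != q}
    var = 1.0 / (np.sum(B[:, q] ** 2) / sigma2 + 1.0 / sigma2_p)
    mean = var * np.sum(resid * B[:, q]) / sigma2
    return rng.normal(mean, np.sqrt(var))

def update_M_k(J_k, a_gamma_k, b_gamma_k):
    """Draw M_k ~ Ga(a_gamma_k + J_k, b_gamma_k + 1); numpy's gamma is
    parameterized by scale = 1 / rate."""
    return rng.gamma(shape=a_gamma_k + J_k, scale=1.0 / (b_gamma_k + 1.0))

def update_sigma2(y, c_full, r0, R0):
    """Draw sigma^2 ~ IG(r'/2, r'R'/2), sampled as the reciprocal of a Gamma
    variate; c_full is the fitted mean beta_0 + sum_k sum_l B_{k,l} beta_{k,l}."""
    n = len(y)
    ss = np.sum((y - c_full) ** 2)
    r1 = r0 + n
    R1 = (ss + r0 * R0) / r1
    return 1.0 / rng.gamma(shape=r1 / 2.0, scale=2.0 / (r1 * R1))
```

Cycling through `update_beta_pq` for every active coefficient and then calling `update_M_k` and `update_sigma2` once per sweep reproduces the within-model portion of the Gibbs scan implied by these conditionals.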