Model selection criteria for regression models with splines and the automatic localization of knots
Alex Rodrigo dos S. Sousa, Magno T.F. Severino, Florencia G. Leonardi
MModel selection criteria for regression models with splines and theautomatic localization of knots
Alex Rodrigo dos S. Sousa, Magno T.F. Severino, and Florencia G. LeonardiDepartment of Statistics, University of Sao Paulo, BrazilJune 5, 2020
Abstract
In this paper we propose a model selection approach to fit a regression model using splines with avariable number of knots. We introduce a penalized criterion to estimate the number and the position ofthe knots where to anchor the splines basis. The method is evaluated on simulated data and applied tocovid-19 daily reported cases for short-term prediction.
Spline regression is one of the most used tools for nonparametric and semiparametric regression and hasbeen extensively developed along the last decades. The main ideia is to select some knots on the functiondomain to define a partition and then fit a fixed degree polynomial on each segment in such a way that thesepolynomials smoothly match at the internal knots through continuity conditions imposed on their derivatives.The result is typically a smooth curve fitting, which accurately estimates a large variety of smooth functions.Further, spline regression offers a great flexibility in curve estimation problems according to convenientchoices of degree and internal knots and, as generally occurs in nonparametric regression by basis expansion,the infinite estimation problem becomes a finite parametric estimation problem, once the spline function isdetermined by the estimation of its coefficients. For more details about spline based methods, their theoryand applications in Statistics, see Wahba (1990), Ramsay (2004), Friedman et al. (2001), and James et al.(2013).One of the main issues in spline regression is the selection of the number and positions of the knots. In somedata analyses, wrong specification of the knots can cause a high impact on the spline regression performance.Excessive number of knots may lead the spline regression to overfit the dataset while a not enough numberof knots can lead to underfitting. Moreover, the positions of the knots should be well determined, since high [email protected] a r X i v : . [ s t a t . M E ] J un oncentration of knots on a specific interval of the domain overfits the data on this region and underfits thedata outside it. Then, a good procedure for knots selection should provide optimal number and locations ofknots, in order to have a well performed spline curve fitting.In practice, the selection of knots is done by hand and few studies in the literature provide some systematicmethod to overcome this difficulty. Friedman and Silverman (1989), Friedman (1991) and Stone et al. (1997)studied stepwise based methods for a set of candidates of knots. Osborne et al. (1998) proposed an algorithmbased on the LASSO estimator to estimate the locations of knots. Ruppert (2002) presented two algorithmsfor selection of a fixed number of knots on penalized spline regression, called P-splines. The first one is basedon the minimization of the generalized cross validation statistic and the second one is based on the Demmler-Reinsch type diagonalization and ¨Ulker and Arslan (2009) proposed an artificial immune system to performknots selection in curve and surface B-splines estimation by considering the knots locations candidates asantibodies. Bayesian methods are proposed by Denison et al. (1998) who proposed a joint prior distributionfor the number and locations of knots, and Biller (2000), who gave an adaptive bayesian approach for knotselection based on reversible jump Markov chain Monte Carlo in semiparametric generalized linear models.Although the methods described above have been well succeeded, they work with a fixed number of knotsand/or require some prior information regarding possible regions where the knots should be set, which canbe seen as a limitation in some cases. In this work we propose an automatic method to estimate both thenumber of knots and their locations, based on the minimization of a penalized least squares criterion. Thepenalty plays the role of avoiding a high number of knots and consequently to overfit the data. We test theperformance of the method on simulated data and we also apply it to daily reported cases of covid-19 onseveral countries to obtain a fit on the epidemiological curve.This paper is organized as follows: Section 2 is addressed to the description of the spline model and theproposed method of automatic knots selection. Simulation studies about the performance of the methodare shown in Section 3. Applications to real datasets involving daily reported cases of Covid-19 are done inSection 4. Section 5 provides conclusion and suggestions of future works. We start with a nonparametric regression problem of the form y i = f ( x i ) + (cid:15) i , i = 1 , ..., n, where x i are scalars, f is an unknown smooth function with domain [ a, b ] such that f ( x i ) = E ( y i | x i ) and { (cid:15) i } ni =1 are zero mean independent random variables with unknown variance σ . To estimate f , we definea sequence of knots a = t < t < ... < t K < t K +1 = b that defines a partition of [ a, b ] in intervals[ a, t ) , [ t , t ) , ..., [ t K , b ] and use a m − th order spline regression model, m = p + 1, defined by f ( x ) = β + p (cid:88) j =1 β j x j + K (cid:88) k =1 β p + k ( x − t k ) p + , (2.1)2here p, K ∈ Z + , β = [ β , ..., β p + K ] (cid:48) is the vector of coefficients, ( u ) p + = u p I ( u ≥
0) and t < ... < t K arefixed internal knots which, from the now on, will be referred to as just knots. In general, a spline regressionmodel is a p degree polynomial on each interval of two consecutive knots and has p − | t k +1 − t k | > δ for some δ > k = 0 , ..., K . Thus, the problem of estimating f becomes a finite parametric estimation problem, whose parameters are the vector of coefficients β , thenumber of interna knots K and their locations t k , k = 1 , ..., K . Let us denote by θ the vector of parametersto be estimated, i.e, θ = [ β , K, t , ..., t K ] (cid:48) . We assume a fixed value for p , the degree of the spline function.The estimation of the parameters should consider its performance regarding a loss function, such as in ourcase, the the residual sum of squares and also some degree of smoothness for f . We propose to control thesmoothness with a roughness penalty defined in terms of the number of intervals established by the knots,i.e, the K knots define K + 1 intervals on the domain of f . Thus, we define the penalized sum of squares P SS λ as P SS λ ( θ ; y ) = n (cid:88) i =1 [ y i − f ( x i )] + λ ( K + 1) , (2.2)where y = [ y , ..., y n ] (cid:48) , λ > f is the spline function (2.1). Thus,the estimator ˆ θ is obtained by ˆ θ = arg min θ P SS λ ( θ ; y ) . (2.3)Considering the definition above, we search for a curve estimate that has good fitting properties with lowresidual sum of squares and which is also smooth enough according to a reasonable number of knots thatis controlled by the penalty on the number of segments of the domain. The tuning parameter λ attributesa weight on the roughness penalty term. High λ values increase the importance of the penalization, i.e, apreference for smoothness is given against accuracy. On the other hand, low λ value favors models withhigh number of parameters, which can lead to overfitting. The specification of a good value for λ is an issueinherent to any regularized estimator and, as in other similar approaches, could be chosen by a standardcross validation procedure. Our proposed method for automatic knots selection can be applied to other spline basis structures. Awell known spline basis is the B-splines, proposed by De Boor et al. (1978), which are more adapted forcomputational implementation due to its compact support. The i -th B-spline of order m and knots a = t Figure 4.2: Linear splines curve fitting of mean daily new registered cases of Covid-19 in logarithmic scale foreight countries considering data until May 30th. In each graph, the gray bars indicate mean daily recordednew cases, the solid black line is the fitted curve, dotted black line is the prediction for the following 7 days,the shaded area indicates the interval of estimates, vertical dashed lines represent the knots locations. In this paper we introduced a method to estimate the number and position of the knots of a spline regressionfunction. The method is based on the minimization of the sum of squared residuals plus a penalty thatdepends on the number of knots. We evaluate the performance of the criterion on simulated data, andwe showed that our proposed method has a great performance in the simulation studies, even with lowSNR values. We also applied the method to perform a descriptive data analysis on covid-19 daily reportedcases in several countries. In this analysis we show that the penalized least squares estimation guaranteedquite smooth curve fittings that estimated accurately underlying smooth functions automatically, i.e, without14equiring to specify the number of internal knots and their locations, which is necessary in most spline basedcurve fitting methods available in the literature.The penalizing constant λ used in our data analyses was set to a fixed value. However as in many othersimilar approaches, it could be chosen by a cross validation procedure. From a theoretical point of view, oneopen question that remains is what is the rate for the penalizing constant λ in order to obtain consistencyof the estimator ˆ θ . Some results on this direction for related penalized models is presented in Castro et al.(2018), where a consistency result was proved for a penalizing constant of order n − / . Whether this alsoholds on this setting is an issue that we would like to address in future work. References Biller, C. (2000). Adaptive bayesian regression splines in semiparametric generalized linear models. Journalof Computational and Graphical Statistics , 9(1):122–140.Castro, B. M., Lemes, R. B., Cesar, J., H¨unemeier, T., and Leonardi, F. (2018). A model selection approachfor multiple sequence segmentation and dimensionality reduction. Journal of Multivariate Analysis , 167:319– 330.De Boor, C. (2001). Calculation of the smoothing spline with weighted roughness measure. MathematicalModels and Methods in Applied Sciences , 11(01):33–41.De Boor, C., De Boor, C., Math´ematicien, E.-U., De Boor, C., and De Boor, C. (1978). A practical guide tosplines , volume 27. springer-verlag New York.Denison, D., Mallick, B., and Smith, A. (1998). Automatic bayesian curve fitting. Journal of the RoyalStatistical Society: Series B (Statistical Methodology) , 60(2):333–350.Friedman, J., Hastie, T., and Tibshirani, R. (2001). The elements of statistical learning . Springer Series inStatistics New York.Friedman, J. H. (1991). Multivariate adaptive regression splines. The annals of statistics , pages 1–67.Friedman, J. H. and Silverman, B. W. (1989). Flexible parsimonious smoothing and additive modeling. Technometrics , 31(1):3–21.James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical learning withapplications in R . Springer New York.Osborne, M. R., Presnell, B., and Turlach, B. A. (1998). Knot selection for regression splines via the lasso. Computing Science and Statistics , pages 44–49.Ramsay, J. O. (2004). Functional data analysis. Encyclopedia of Statistical Sciences , 4.15uppert, D. (2002). Selecting the number of knots for penalized splines. Journal of computational andgraphical statistics , 11(4):735–757.Stone, C. J., Hansen, M. H., Kooperberg, C., Truong, Y. K., et al. (1997). Polynomial splines and their tensorproducts in extended linear modeling: 1994 wald memorial lecture. The Annals of Statistics , 25(4):1371–1470.¨Ulker, E. and Arslan, A. (2009). Automatic knot adjustment using an artificial immune system for b-splinecurve approximation. Information Sciences , 179(10):1483–1494.Wahba, G. (1990).