
Annals of Statistics | 2004

Least angle regression

Bradley Efron; Trevor Hastie; Iain M. Johnstone; Robert Tibshirani; Hemant Ishwaran; Keith Knight; Jean-Michel Loubes; Pascal Massart; David Madigan; Greg Ridgeway; Saharon Rosset; J. Zhu; Robert A. Stine; Berwin A. Turlach; Sanford Weisberg

DISCUSSION OF “LEAST ANGLE REGRESSION” BY EFRON ET AL. By Jean-Michel Loubes and Pascal Massart, Université Paris-Sud. The issue of model selection has drawn the attention of both applied and theoretical statisticians for a long time. Indeed, there has been an enormous range of contribution in model selection proposals, including work by Akaike (1973), Mallows (1973), Foster and George (1994), Birgé and Massart (2001a) and Abramovich, Benjamini, Donoho and Johnstone (2000). Over the last decade, modern computer-driven methods have been developed such as All Subsets, Forward Selection, Forward Stagewise or Lasso. Such methods are useful in the setting of the standard linear model, where we observe noisy data and wish to predict the response variable using only a few covariates, since they provide automatically linear models that fit the data. The procedure described in this paper is, on the one hand, numerically very efficient and, on the other hand, very general, since, with slight modifications, it enables us to recover the estimates given by the Lasso and Stagewise. 1. Estimation procedure. The “LARS” method is based on a recursive procedure selecting, at each step, the covariates having largest absolute correlation with the response y. In the case of an orthogonal design, the estimates can then be viewed as an l
DISCUSSION OF “LEAST ANGLE REGRESSION” BY EFRON ET AL. By Berwin A. Turlach, University of Western Australia. I would like to begin by congratulating the authors (referred to below as EHJT) for their interesting paper in which they propose a new variable selection method (LARS) for building linear models and show how their new method relates to other methods that have been proposed recently. I found the paper to be very stimulating and found the additional insight that it provides about the Lasso technique to be of particular interest. My comments center around the question of how we can select linear models that conform with the marginality principle [Nelder (1977, 1994) and McCullagh and Nelder (1989)]; that is, the response surface is invariant under scaling and translation of the explanatory variables in the model. Recently one of my interests was to explore whether the Lasso technique or the nonnegative garrote [Breiman (1995)] could be modified such that it incorporates the marginality principle. However, it does not seem to be a trivial matter to change the criteria that these techniques minimize in such a way that the marginality principle is incorporated in a satisfactory manner. On the other hand, it seems to be straightforward to modify the LARS technique to incorporate this principle. In their paper, EHJT address this issue somewhat in passing when they suggest toward the end of Section 3 that one first fit main effects only and interactions in a second step to control the order in which variables are allowed to enter the model. However, such a two-step procedure may have a somewhat less than optimal behavior as the following, admittedly artificial, example shows. Assume we have a vector of explanatory variables X = (X
The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.
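
As a rough illustration of the Forward Stagewise idea mentioned in the abstract, the sketch below implements the plain incremental version of the procedure: at each iteration it finds the covariate most correlated with the current residual and moves that coefficient by a small fixed step. This is not the LARS algorithm or its Stagewise modification from the paper; the function name `forward_stagewise`, the step size, the iteration count and the simulated data are all assumptions made for the example.

```python
import numpy as np

def forward_stagewise(X, y, step=0.01, n_iter=5000):
    """Incremental forward stagewise regression (illustrative sketch).

    Assumes the columns of X are centered with unit Euclidean norm
    and that y is centered.
    """
    beta = np.zeros(X.shape[1])
    residual = y.astype(float).copy()
    for _ in range(n_iter):
        corr = X.T @ residual              # correlation of each covariate with the residual
        j = int(np.argmax(np.abs(corr)))   # most correlated covariate
        if abs(corr[j]) < 1e-8:            # nothing left to explain
            break
        delta = step * np.sign(corr[j])
        beta[j] += delta                   # small move in the direction of that covariate
        residual -= delta * X[:, j]
    return beta

# Hypothetical simulated data, used only to exercise the sketch.
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)          # centered, unit-norm columns
true_beta = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
y = X @ true_beta + 0.1 * rng.standard_normal(n)
y = y - y.mean()

beta_hat = forward_stagewise(X, y)
print(np.round(beta_hat, 2))
```

Run long enough, the coefficients settle within a few step sizes of the ordinary least squares fit; stopping earlier gives sparser, more heavily shrunken estimates, which is where the similarity with the Lasso noted in the abstract comes from.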

Archive | 2007

Concentration inequalities and model selection

Pascal Massart; Jean Picard; École d'été de probabilités de Saint-Flour

Exponential and Information Inequalities.- Gaussian Processes.- Gaussian Model Selection.- Concentration Inequalities.- Maximal Inequalities.- Density Estimation via Model Selection.- Statistical Learning.

Archive | 1997

From Model Selection to Adaptive Estimation

Lucien Birgé; Pascal Massart

Many different model selection information criteria can be found in the literature in various contexts including regression and density estimation. There is a huge amount of literature concerning this subject and we shall, in this paper, content ourselves to cite only a few typical references in order to illustrate our presentation. Let us just mention AIC, Cp, or CL, BIC and MDL criteria proposed by Akaike (1973), Mallows (1973), Schwarz (1978), and Rissanen (1978) respectively. These methods propose to select among a given collection of parametric models that model which minimizes an empirical loss (typically squared error or minus log-likelihood) plus some penalty term which is proportional to the dimension of the model. From one criterion to another the penalty functions differ by factors of log n, where n represents the number of observations.
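
The selection rule described in this abstract (empirical loss plus a penalty proportional to the model dimension) can be sketched for nested least squares models as follows. The function name `select_dimension`, the simulated data and the assumption that the noise variance is known are all illustrative choices, not taken from the paper. The two penalty constants are the usual Cp-type value 2σ²/n and BIC-type value σ² log(n)/n, which differ by a factor of order log n.

```python
import numpy as np

def select_dimension(X, y, penalty_per_dim):
    """Pick d minimizing mean squared residual of the first d columns + penalty_per_dim * d."""
    n, p = X.shape
    best_dim = 0
    best_crit = float(np.mean(y ** 2))     # empty model: predict zero
    for d in range(1, p + 1):
        coef, *_ = np.linalg.lstsq(X[:, :d], y, rcond=None)
        loss = float(np.mean((y - X[:, :d] @ coef) ** 2))
        crit = loss + penalty_per_dim * d  # penalty proportional to the dimension
        if crit < best_crit:
            best_dim, best_crit = d, crit
    return best_dim

# Hypothetical data: only the first three covariates matter.
rng = np.random.default_rng(1)
n, p, sigma = 200, 10, 1.0
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + sigma * rng.standard_normal(n)

cp_dim = select_dimension(X, y, 2.0 * sigma**2 / n)          # Cp-type penalty
bic_dim = select_dimension(X, y, sigma**2 * np.log(n) / n)   # BIC-type penalty
print(cp_dim, bic_dim)
```

Because the models are nested and the BIC-type penalty is the larger of the two here, the dimension it selects is never larger than the one selected with the Cp-type penalty.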

Random Structures and Algorithms | 2000

A sharp concentration inequality with application

Stéphane Boucheron; Gábor Lugosi; Pascal Massart

We present a new general concentration-of-measure inequality and illustrate its power by applications in random combinatorics. The results find direct applications in some problems of learning theory.

Probability Theory and Related Fields | 1993

Rates of convergence for minimum contrast estimators

Lucien Birgé; Pascal Massart

Summary. We shall present here a general study of minimum contrast estimators in a nonparametric setting (although our results are also valid in the classical parametric case) for independent observations. These estimators include many of the most popular estimators in various situations such as maximum likelihood estimators, least squares and other estimators of the regression function, estimators for mixture models or deconvolution... The main theorem relates the rate of convergence of those estimators to the entropy structure of the space of parameters. Optimal rates depending on entropy conditions are already known, at least for some of the models involved, and they agree with what we get for minimum contrast estimators as long as the entropy counts are not too large. But, under some circumstances (“large” entropies or changes in the entropy structure due to local perturbations), the resulting rates are only suboptimal. Counterexamples are constructed which show that the phenomenon is real for non-parametric maximum likelihood or regression. This proves that, under purely metric assumptions, our theorem is optimal and that minimum contrast estimators happen to be suboptimal.

Annales de l'Institut Henri Poincaré, Probabilités et Statistiques | 1999

A Dvoretzky-Kiefer-Wolfowitz type inequality for the Kaplan-Meier estimator

D. Bitouzé; Béatrice Laurent; Pascal Massart

We prove a new exponential inequality for the Kaplan–Meier estimator of a distribution function in a right censored data model. This inequality is of the same type as the Dvoretzky–Kiefer–Wolfowitz inequality for the empirical distribution function in the non-censored case. Our approach is based on the Duhamel equation, which allows the use of empirical process theory.
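
For reference, the classical Dvoretzky–Kiefer–Wolfowitz inequality that this abstract compares against can be written as below, for the empirical distribution function F_n of n independent observations with distribution function F. The notation is ours, and the constants of the Kaplan–Meier version proved in the paper are not reproduced here.

```latex
\mathbb{P}\Bigl( \sup_{x \in \mathbb{R}} \bigl| F_n(x) - F(x) \bigr| > \varepsilon \Bigr)
  \;\le\; 2\, e^{-2 n \varepsilon^{2}},
  \qquad \varepsilon > 0 .
```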

arXiv: Statistics Theory | 2017

Estimator selection: a new method with applications to kernel density estimation

Claire Lacour; Pascal Massart; Vincent Rivoirard

Estimator selection has become a crucial issue in nonparametric estimation. Two widely used methods are penalized empirical risk minimization (such as penalized log-likelihood estimation) and pairwise comparison (such as Lepski's method). Our aim in this paper is twofold. First we explain some general ideas about the calibration issue of estimator selection methods. We review some known results, putting the emphasis on the concept of minimal penalty, which is helpful to design data-driven selection criteria. Secondly we present a new method for bandwidth selection within the framework of kernel density estimation which is in some sense intermediate between the two main methods mentioned above. We provide some theoretical results which lead to a fully data-driven selection strategy.

Algorithmic Learning Theory | 2012

Some rates of convergence for the selected lasso estimator

Pascal Massart

We consider the estimation of a function in some ordered finite or infinite dictionary. We focus on the selected Lasso estimator introduced by Massart and Meynet (2011) as an adaptation of the Lasso suited to deal with infinite dictionaries. We use the oracle inequality established by Massart and Meynet (2011) to derive rates of convergence of this estimator on a wide range of function classes described by interpolation spaces such as in Barron et al. (2008). The results highlight that the selected Lasso estimator is adaptive to the smoothness of the function to be estimated, contrary to the classical Lasso or the greedy algorithm considered by Barron et al. (2008). Moreover, we prove that the rates of convergence of this estimator are optimal in the orthonormal case.

arXiv: Statistics Theory | 2004


Jean-Michel Loubes; Pascal Massart

DISCUSSION OF “LEAST ANGLE REGRESSION” BY EFRON ET AL. By Jean-Michel Loubes and Pascal Massart, Université Paris-Sud. The issue of model selection has drawn the attention of both applied and theoretical statisticians for a long time. Indeed, there has been an enormous range of contribution in model selection proposals, including work by Akaike (1973), Mallows (1973), Foster and George (1994), Birgé and Massart (2001a) and Abramovich, Benjamini, Donoho and Johnstone (2000). Over the last decade, modern computer-driven methods have been developed such as All Subsets, Forward Selection, Forward Stagewise or Lasso. Such methods are useful in the setting of the standard linear model, where we observe noisy data and wish to predict the response variable using only a few covariates, since they provide automatically linear models that fit the data. The procedure described in this paper is, on the one hand, numerically very efficient and, on the other hand, very general, since, with slight modifications, it enables us to recover the estimates given by the Lasso and Stagewise. 1. Estimation procedure. The “LARS” method is based on a recursive procedure selecting, at each step, the covariates having largest absolute correlation with the response y. In the case of an orthogonal design, the estimates can then be viewed as an l

Probability Theory and Related Fields | 1999

Risk bounds for model selection via penalization

Andrew R. Barron; Lucien Birgé; Pascal Massart
