Fitting very flexible models: Linear regression with large numbers of parameters
David W. Hogg
Flatiron Institute, a division of the Simons Foundation
Center for Cosmology and Particle Physics, Department of Physics, New York University
Center for Data Science, New York University
Max-Planck-Institut für Astronomie, Heidelberg
Soledad Villar
Department of Applied Mathematics & Statistics, Johns Hopkins University
Mathematical Institute for Data Science, Johns Hopkins University
Abstract:
There are many uses for linear fitting; the context here is interpolation and denoising of data, as when you have calibration data and you want to fit a smooth, flexible function to those data. Or you want to fit a flexible function to de-trend a time series or normalize a spectrum. In these contexts, investigators often choose a polynomial basis, or a Fourier basis, or wavelets, or something equally general. They also choose an order, or number of basis functions to fit, and (often) some kind of regularization. We discuss how this basis-function fitting is done, with ordinary least squares and extensions thereof. We emphasize that it is often valuable to choose far more parameters than data points, despite folk rules to the contrary: Suitably regularized models with enormous numbers of parameters generalize well and make good predictions for held-out data; over-fitting is not (mainly) a problem of having too many parameters. It is even possible to take the limit of infinite parameters, at which, if the basis and regularization are chosen correctly, the least-squares fit becomes the mean of a Gaussian process. We recommend cross-validation as a good empirical method for model selection (for example, setting the number of parameters and the form of the regularization), and jackknife resampling as a good empirical method for estimating the uncertainties of the predictions made by the model. We also give advice for building stable computational implementations.
In contexts in which we want to fit a flexible function to data, for interpolation or denoising, we often perform linear fitting in a generic basis, such as polynomials, Fourier modes, wavelets, or spherical harmonics. This kind of linear fitting arises in astronomy when, for example, we want to calibrate the relationship between wavelength and position on the detector in a spectrograph, and in many similar calibration, de-trending, and normalization problems. In these settings the investigator has to make several choices: a basis of functions, an order or number of basis functions to fit, and (often) some kind of regularization.

Acknowledgments: It is a pleasure to thank Jed Brown (CU Boulder), Dan Foreman-Mackey (Flatiron), Alessandro Gentilini, Teresa Huang (JHU), Sam Roweis (deceased), Adrian Price-Whelan (Flatiron), Bernhard Schölkopf (MPI-IS), Kate Storey-Fisher (NYU), Rachel Ward (UT Austin), and Lily Zhao (Yale) for valuable conversations and input. SV is partially funded by NSF DMS 2044349, EOARD FA9550-18-1-7007, and the NSF–Simons Research Collaboration on the Mathematical and Scientific Foundations of Deep Learning (MoDL) (NSF DMS-2031985).

About the second kind of choice, the number of basis functions, there is a widely repeated folk rule that a fit must have fewer parameters than data points. Here we are going to show that this folk rule is not valid; you can go to extremely large numbers of coefficients without trouble. But like most folk rules, it has a strong basis in reality: There are extremely bad choices possible for the number of coefficients, especially when the number of parameters is close or comparable to the number of data points. As we will discuss below, these choices ought to be made with care. In many cases the third kind of choice—about regularization—is made implicitly, not explicitly. Here we are going to emphasize this choice, its importance, and its value in improving your results.

Alternatively, if you are unhappy with the three choices of basis, order, and regularization, you might decide to avoid such decisions and go fully non-parametric: Instead of fitting a basis expansion, you can use a strict interpolator (like a cubic spline), or you can fit a Gaussian process to your data. Here we will show that the choice of any Gaussian process kernel function is equivalent to choosing a basis and a regularization and letting the number of fit components go to infinity. That is, going non-parametric doesn't really get you out of making these choices. It just makes these choices more implicit. And it is a pleasure to note that any time you have gone non-parametric you have implicitly chosen to use a basis with way more fit parameters than data points! The fact that non-parametrics work so well is strong evidence against the folk rule about the number of parameters needing to be less than the number of data points.

An important assumption or setting for the problems we are addressing in this Note will be that you care about predicting new data or interpolating the data, but you explicitly don't care about the parameters of the fit or the weights of the basis functions per se. In this setting there are no important meanings to the components of the model. That is—for us—only the data exist. The details of the model are just choices that permit high-quality interpolations and predictions in the space of the data.

In what follows, we will look at applications that look like interpolation. (A related phenomenon appears in deep learning, where the number of parameters is often enormous and the deep network is capable of generating, effectively, far more, internally.) In some contexts you have strong beliefs about the noise affecting your measurements. In other cases you don't. In some cases you have strong reasons to use a particular basis. In other cases you don't.
The differences in these beliefs and the differences in your objectives will change what methods you choose and how you use and analyze them. We'll try to be useful to you no matter where you're at. This document does not deliver new research results on the mathematics or statistics of regression. It is novel only in that it makes very specific the connection between regularized linear regression and Gaussian processes.

Ordinary least squares is reviewed in Section 2. The extensions of weighted least squares and ridge regression are shown in Section 3. The over-parameterized case (more parameters than data) is discussed in Section 4. The concept of feature weighting for controlling regularization in over-parameterized fits is introduced in Section 5. Cross-validation is explained and used to choose the number of parameters in Section 6, and the double-descent phenomenon is shown. The Gaussian process appears as the limit of infinite parameters in Section 7. Jackknife resampling is explained and used to estimate uncertainties in Section 8. Numerical implementation considerations are discussed in Section 9 and some final remarks are made in Section 10.
Our setup will be that there are n scalar data points y_i. Each of these data points has an associated coordinate or location t_i. In the machine-learning lexicon, these will be our "training data". The location t_i could be thought of as a time at which the data point was taken, or a position, or it can be a higher-dimensional vector or blob of housekeeping data associated with the data point. Critically, we are going to imagine that the t_i are known very well (not very uncertain or noisy), while the y_i are possibly very uncertain or noisy measurements. We'll return to these assumptions at the end.

We are going to fit these data y_i with a linear sum of p basis functions g_j(t). These basis functions are functions of the coordinates t. That is, our implicit model is

    y_i = Σ_{j=1}^{p} β_j g_j(t_i) + noise ,   (1)

where the p values β_j are parameters or coefficients of the linear fit. We can assemble the evaluations of the p functions g_j(t) at the n data coordinates t_i into an n × p design matrix or feature matrix X such that

    [X]_{ij} = g_j(t_i) .   (2)

The transformation of the n locations t_i into an n × p matrix X is the opposite of a dimensionality reduction. It is called variously a feature map or a feature embedding. We like "embedding" because it (almost always) raises the dimensionality of the locations t into the p-dimensional space of the rows of X.

For a concrete example, one common choice is to make the feature-embedding functions g_j(t) terms in a Fourier series

    g_j(t) = cos(ω_j t) for j odd;  sin(ω_j t) for j even ,   (3)
    ω_j = (2π / T) ⌊j/2⌋ ,   (4)

where T is a (large) length-scale in the coordinate space (t space) and ⌊j/2⌋ indicates the floor of j/2. A few examples of functions g_j(t) from this basis are shown in Figure 1. Alternative common choices would be to make the embedding functions g_j(t) polynomials or other kinds of ordered basis functions, such as wavelets or spherical harmonics (the latter if, say, the t_i are positions on the sphere). Another choice that isn't common in the natural sciences, but studied in machine learning (for example, Rahimi & Recht 2007), is to choose the features randomly from a distribution (rather than on a regular grid in frequency, as we do here). That is beyond our scope, but we come back to it in Section 10.

Figure 1: Examples of basis functions g_j(t) from the basis given in equation (3). This basis was constructed with length-scale parameter T = 3. The wavelength in the location space decreases, and the frequency increases, with index j.

The idea of least-squares fitting is that the "best" values of the parameters β_j are the values that minimize the sum of squares of the differences between the data and the linear combination of features:

    β̂ = arg min_β ‖Y − X β‖² ,   (5)

where β̂ is the p-vector (column vector) of the p best-fit values β̂_j of the parameters β_j, and ‖q‖² denotes the squared L2-norm, or the sum of squares of the components of a vector q:

    ‖q‖² ≡ q^T q .   (6)

(Note that ∇_β ‖Y − X β‖² = −2 X^T (Y − X β), so critical points are such that X^T Y = X^T X β. Since the objective is convex, all these critical points are global minima. Further analysis is given in Appendix A.)
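As a concrete illustration of the feature embedding in equations (2)–(4), here is a minimal sketch in Python/NumPy. The function name and the array handling are ours, not taken from the paper's code release.

```python
import numpy as np

def fourier_design_matrix(t, p, T=3.0):
    """Build the n x p feature matrix X of equation (2) from the Fourier
    basis of equations (3)-(4): column j is cos(omega_j t) for odd j and
    sin(omega_j t) for even j (1-indexed, as in the text)."""
    t = np.asarray(t, dtype=float)
    X = np.empty((len(t), p))
    for j in range(1, p + 1):
        omega_j = 2.0 * np.pi * (j // 2) / T   # equation (4)
        X[:, j - 1] = np.cos(omega_j * t) if j % 2 == 1 else np.sin(omega_j * t)
    return X
```

With this convention the j = 1 feature is the constant function, and each non-zero frequency contributes a cosine and a sine, which is what makes the fits stationary in the sense discussed at the end of the Note.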
This optimization objective (5) is convex, and the optimization problem has a solution in closed form as long as the number of parameters (and features) p is less than the number of training points n (and the matrix X^T X is invertible):

    β̂ = (X^T X)^{-1} X^T Y .   (7)

We will treat the case in which the matrix is not invertible below. When the investigator knows uncertainties on the training data y_i, this expression will change a bit; we begin to discuss that in the next Section.

But recall our setting: We are using the linear fit to interpolate the data, or de-noise the data, or predict new data. In these contexts, we don't care about the parameter vector β̂ itself. We care only about the predictions at a set of new "test" locations t*, which will usually be different from the training locations t_i. From the new test times t* we create the test feature matrix X* (by the same feature-embedding functions g_j(t)). The prediction Ŷ* for the y values at the test locations t* becomes

    Ŷ* = X* (X^T X)^{-1} X^T Y .   (8)

Examples of OLS applied to some toy data are shown in Figure 2. (The algorithm by which the toy data were made—and indeed all of the code used to make the figures for this Note—is available online at https://github.com/davidwhogg/FlexibleLinearModels.)
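A minimal numerical sketch of equations (7)–(8), assuming the fourier_design_matrix() helper defined above. Following the numerical advice of Section 9, it uses a least-squares solver rather than forming (X^T X)^{-1} explicitly.

```python
import numpy as np

def ols_predict(t_train, y_train, t_test, p, T=3.0):
    """OLS prediction (equation 8) at the test locations, valid for p < n."""
    X = fourier_design_matrix(t_train, p, T)       # n x p training features
    X_star = fourier_design_matrix(t_test, p, T)   # m x p test features
    # beta_hat minimizes ||Y - X beta||^2; lstsq is more stable than inv(X^T X)
    beta_hat, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    return X_star @ beta_hat                        # predicted Y at the test locations
```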
Figure 2: Ordinary least-squares (OLS) fits (continuous lines) to a set of example data points (black dots). Fits are shown for different values of the number of basis functions p (p = 3, 7, and 21); there are n = 23 data points. Here (and in all the Figures to follow) we are using the Fourier basis functions in (3) and shown in Figure 1. The data y_i were generated using a function that does not reside in the function space spanned by the basis. Here the X* matrices used in the predictions Ŷ* were generated from a fine grid of locations in the location coordinate t; the plots of the fine grid of predictions are the continuous lines. The fits with larger p have more flexibility to fit the data than the fit with p = 3, but the fit at the highest p shows evidence of over-fitting.

This form (8) of the prediction of new test data is called ordinary least squares (OLS). It has many good properties, some of which are encoded in the Gauss–Markov theorem. In particular, if the noise contributions in the model given in equation (1) are uncorrelated, have zero mean, and equal variances for i = 1, ..., n, then the Gauss–Markov theorem states that the OLS estimator has the lowest variance within the class of unbiased estimators that are linear in Y (see, for example, Hastie et al.).

The prediction (8) when p < n (the under-parameterized or traditional regime) is affine invariant in that p-dimensional rotations or rescalings of the rectangular feature matrix X do not affect predictions. That is, if R is an invertible p × p matrix, the prediction using X' ← X R will be identical to the prediction using the original X. This affine invariance will be modified in the over-parameterized regime, below.

Although the OLS prediction (8) is affine invariant with respect to p-dimensional transformations, it is not affine invariant with respect to n-dimensional transformations, such as a re-weighting of the input data. Indeed, it is common to introduce an n × n weight matrix C^{-1} (written this way to emphasize that re-weighting is usually inverse-variance weighting) and write

    Ŷ* = X* (X^T C^{-1} X)^{-1} X^T C^{-1} Y ,   (9)

where the weight matrix C^{-1} is n × n (and often diagonal in standard applications). The weight matrix C^{-1} is sometimes called the information tensor (or information matrix) and its inverse C is often called the covariance matrix or the noise variance tensor.

This form (9) of least squares is called weighted least squares (WLS) because of the data weighting (not to be confused with feature weighting, to appear below). It is also called chi-squared fitting because it optimizes the scalar objective commonly called chi-squared:

    χ² = (Y − X β)^T C^{-1} (Y − X β) .   (10)

(If that isn't obviously chi-squared to you, recall that C^{-1} is often diagonal and has the inverses of the squares of the data uncertainties on that diagonal.) In Figure 3 a comparison of OLS and WLS is shown, for a case of non-trivial data weights, where the data weights are set to be the inverse squares of individual data-point uncertainties.

We will say more about noisy data and the propagation of uncertainties below in Section 8. It might be crossing your mind that there are uncertainties not just in the data points y_i, but also often in the locations t_i of the data as well. It turns out that taking the latter into account is a much harder problem; we will discuss this briefly in Section 8.
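A sketch of the weighted least-squares prediction (9), again assuming the helper above. The inverse-variance weights are applied by rescaling rows rather than by constructing the n × n matrix C^{-1} explicitly (see Section 9); the function name and arguments are ours.

```python
import numpy as np

def wls_predict(t_train, y_train, sigma_y, t_test, p, T=3.0):
    """WLS / chi-squared prediction (equation 9) with diagonal C = diag(sigma_y**2)."""
    X = fourier_design_matrix(t_train, p, T)
    X_star = fourier_design_matrix(t_test, p, T)
    w = 1.0 / np.asarray(sigma_y)              # square roots of the inverse variances
    # Rescaling the rows by 1/sigma turns the WLS problem into an OLS problem.
    beta_hat, *_ = np.linalg.lstsq(X * w[:, None], y_train * w, rcond=None)
    return X_star @ beta_hat
```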
It is common to include a regularization that discourages the fit from making use of large amplitudes β_j. There are many options, but the simplest is ridge regression (or Tikhonov regularization or L2 regularization), which (in the form that doesn't have data weights) looks like

    β̂ = arg min_β ‖Y − X β‖² + λ ‖β‖² ,   (11)

where λ > 0 is a regularization parameter that penalizes large amplitudes for the components β_j of the parameter vector. This optimization is also convex. The ridge-regularized prediction for new data looks like

    Ŷ* = X* (X^T X + λ I)^{-1} X^T Y ,   (12)

where I is the p × p identity (see Appendix A). In the language of Bayesian inference, λ can be seen as the inverse of a prior variance for the parameters β_j. The salutary effect of the ridge regularization is shown in Figure 4.

The ridge brings with it a choice: How to set the hyper-parameter λ? We generally recommend cross-validation, to be discussed below in Section 6. Ridge regression is not affine invariant in the sense that the standard OLS prediction (8) is; that is, the effect of the regularization depends on the amplitudes and linear combinations of features placed in the feature matrix. This will become important later when we consider feature weights below.
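A sketch of the ridge prediction (12). Rather than forming (X^T X + λ I)^{-1} explicitly, it augments the least-squares problem with √λ rows, which is algebraically equivalent and numerically friendlier; again the function name is ours.

```python
import numpy as np

def ridge_predict(t_train, y_train, t_test, p, lam, T=3.0):
    """Ridge-regularized prediction (equation 12) with regularization strength lam."""
    X = fourier_design_matrix(t_train, p, T)
    X_star = fourier_design_matrix(t_test, p, T)
    # Appending sqrt(lam) * I to X and zeros to Y solves
    # min ||Y - X beta||^2 + lam ||beta||^2 as a plain least-squares problem.
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y_train, np.zeros(p)])
    beta_hat, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return X_star @ beta_hat
```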
Figure 3: Comparison of ordinary least-squares (OLS) and weighted least-squares (WLS) fits (continuous lines) to the example data (black dots), at p = 3. In order to illustrate the differences, we assigned non-trivial error bars to the data. The error bars are ignored in the OLS fit, but in the WLS fit, the weight matrix C^{-1} is diagonal with the diagonal entries set to the inverses of the squares of those error bars. The WLS fit "pays less attention to" the points on the left with the largest error bars.

It is also the case that the regularization need not be proportional to the identity matrix: In principle any positive definite matrix Λ could be used in place of λ I; this makes sense to consider when you have detailed prior beliefs about all the parameters β_j or if those parameters are measured with different units (say). You can combine both the point weighting from WLS and the generalized ridge regression into a weighted ridge that looks like

    Ŷ* = X* (X^T C^{-1} X + Λ)^{-1} X^T C^{-1} Y .   (13)

This form has good properties for many real-world physics applications, where data-point error bars are often known, and functions are often expected to be smooth. We'll discuss this form (13) more below in Section 5, but briefly we can say here that it appears in Bayesian-inference contexts where the data points y_i are treated as having Gaussian noise associated with them (with known variance tensor or covariance matrix C) and there is a Gaussian prior on the parameter vector β (with known variance Λ^{-1}). We have discussed that model elsewhere (Hogg et al. 2020; note that the notation differs—what's called Λ^{-1} here is called Λ there). It is also useful sometimes to think of the "units" or dimensions of the quantities in (13): In the Bayesian setting, the units of C would be the square of the units of Y, and the units of Λ^{-1} would be the square of the units of the ratio Y/X (the square of the units of β).

Figure 4: Comparison of ordinary least-squares (OLS) and ridge-regression fits (continuous lines) to the example data (black dots). In order to illustrate the differences, we chose the p = 21-parameter fit, which shows evidence of over-fitting at the edges of the fit range. The regularized fit looks more sensible, though it fits the individual data points less precisely. The choice of regularization parameter λ matters; a particular fixed value of λ was used here.
We are taught folklore, at a young age, that we can never fit for more parameters than we have data. That is, we can never work at p > n. This isn't true! Not only is it the case that we can work at p > n, in many cases we should work at p > n, and many real-world regressions do. But it is true that the p > n regime is indeed strange! In the over-parameterized case, there are typically many settings of the parameters β_j that will literally zero out the differences between the data Y and the linear prediction X β. The OLS solution, in this case, is defined (somewhat arbitrarily) to be the minimum-norm parameter vector β that interpolates the data:

    β̂ = arg min_β ‖β‖²  subject to  Y = X β .   (14)

Technically this formulation depends on an additional assumption that the feature matrix X is full rank or that the data Y lie in the subspace spanned by X. This is true almost always when p > n. This optimization is again convex and has a unique solution, although that solution will depend on how the features are scaled or weighted (more on this below). When the investigator knows uncertainties on the training data y_i, this expression will change a bit; we return to that in the next Section. The under-parameterized and over-parameterized optimization statements (5) and (14) can be unified into one form by considering the limit of light L2 regularization:

    β̂ = lim_{λ→0+} [ arg min_β ‖Y − X β‖² + λ ‖β‖² ] ;   (15)

in the limit, this delivers the OLS solution in either case and doesn't require the constraint in (14) to be satisfied.

In the over-parameterized case (p > n), the prediction looks like

    Ŷ* = X* X^T (X X^T)^{-1} Y ,   (16)

provided that X X^T is invertible (which will usually be the case).
Like (8), this prediction is also called ordinary least squares (OLS), or sometimes "min-norm least squares" to emphasize the point that it is making the minimum-norm choice of β among many degenerate solutions. Examples of OLS fits in the over-parameterized regime are shown in Figure 5. Note that the fits with different numbers of parameters p lead to very different predictions, but they all go through the data exactly.

It is slightly off-topic to notice that in Figure 5, as p gets very large, the OLS solution approaches y = 0 everywhere that it can. This behavior is a direct consequence of Plancherel's theorem, which states that the Fourier transform is unitary (see for instance Folland 1994). This means that the integral over frequency of the square of the Fourier transform is equal to the integral over location of the square of the original function. Thus the min-norm solution delivered by OLS, which chooses the interpolating function that minimizes the squares of the component amplitudes β in the Fourier basis used to make Figure 5, will choose the interpolating function y(t) that minimizes the mean square of the value of the function in the location space. It will try to stay as close to y = 0 as possible. That's probably not desirable in most applications! We will fix that problem in the next Section.

The two equations for OLS—(8) and (16)—can be unified into one equation (and also generalized to handle non-invertible matrices X^T X and X X^T) if we define the pseudo-inverse X^†:

    Ŷ* = X* X^† Y .   (17)

The pseudo-inverse of a diagonal matrix is defined as the diagonal matrix made by inverting the non-zero diagonal entries. And for any non-diagonal (or any non-square) matrix X, the pseudo-inverse is defined by taking the singular-value decomposition (SVD) of X, X = U S V, with S diagonal and U, V orthogonal; then X^† ≡ V^T S^† U^T. This pseudo-inverse form (17) of OLS is extremely general: It works for both the p ≤ n and p > n cases, and it works when the X X^T or X^T X matrices are not invertible. There can be significant numerical issues with implementing the pseudo-inverse; we comment on those in Section 9.

Figure 5: Ordinary least-squares (OLS) or min-norm least-squares fits (continuous lines) to a set of example data points (black dots), but now for a few over-parameterized cases (p = 30, 73, and 2049). There are n = 23 data points. As in Figure 2, the data points y_i were generated using a function that does not precisely reside in the function space spanned by the basis. The three fits are very different, but they all go through all the data points exactly. As p gets large, the fit function approaches y = 0 almost everywhere; this is a consequence of Plancherel's theorem (see text).
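The pseudo-inverse form (17) translates directly into code: numpy's lstsq computes X^† Y without forming the pseudo-inverse explicitly, and returns the minimum-norm solution when the system is under-determined. A small sketch, with toy data of our own invention (not the paper's), illustrating that the p > n fit interpolates the training points:

```python
import numpy as np

rng = np.random.default_rng(17)                    # illustrative toy data
t_train = np.sort(rng.uniform(0.0, 1.5, size=23))
y_train = np.sin(4.0 * t_train) + 0.1 * rng.normal(size=23)

p = 73                                             # more parameters than the n = 23 points
X = fourier_design_matrix(t_train, p)
beta_hat, *_ = np.linalg.lstsq(X, y_train, rcond=None)   # min-norm beta, i.e. pinv(X) @ y

print(np.max(np.abs(X @ beta_hat - y_train)))      # ~ machine precision: the fit interpolates
```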
The OLS prediction for p > n is not affine invariant with respect to p-dimensional rotations or rescalings. That is, rotations and scalings in the feature space will affect predictions. It behooves us to re-scale the features (the p n-vectors of the feature matrix X) in a sensible way, for instance to encourage the fit to use low-frequency features more than high-frequency features (Xie et al.). One way to do this is to introduce a p × p diagonal weight matrix Λ^{-1}, and the prediction becomes

    Ŷ* = X* Λ^{-1} X^T (X Λ^{-1} X^T)^{-1} Y .   (18)

This can be seen as the prediction for new data resulting from the following optimization (again, assuming the data can be interpolated):

    β̂ = arg min_β ‖Λ^{1/2} β‖²  subject to  Y = X β ,   (19)

which—in analogy to the optimizations (14) and (15)—can also be written as

    β̂ = lim_{λ→0+} [ arg min_β ‖Y − X β‖² + λ ‖Λ^{1/2} β‖² ] .   (20)

This doesn't require the constraint in (19) to be satisfied. This optimization penalizes more strongly the parameters β_j corresponding to features g_j(t) with larger values of [Λ^{1/2}]_{jj}. (If you are a physicist and you don't like the mathematical ‖·‖ notation—if you prefer to look at quantities that are obvious scalar forms—then recall that ‖Λ^{1/2} β‖² = β^T Λ β. That right-hand-side object is a gauge-invariant object so long as β and Λ are also gauge-invariant objects themselves. Yes, data analysis can have this kind of geometric structure!)

In the Fourier case this re-weighting can be very straightforward: Each of the p embedding functions g_j(t) has an associated frequency ω_j; we can control the fit by weighting the j features by a function f(ω). For demonstration purposes, we can choose

    f(ω) = 1 / (s² ω² + 1) ,   (21)

where s is a hyper-parameter controlling the (inverse) width of the weighting function in frequency space, and the frequency input will be ω_j for each feature j. That is,

    [Λ^{-1}]_{jj} = [f(ω_j)]² .   (22)

We have chosen this form (21) for f(ω) for specific reasons that will become obvious below. The introduction of the feature weights dramatically changes the predictions; this is shown in Figure 6. High frequencies are suppressed and the prediction becomes smooth.

Figure 6: Comparison of ordinary least squares (OLS) to OLS with a specific feature weighting, at p = 1024. The feature-weighting function is given in the text in equations (21) and (22); in this and the following Figures, we (somewhat arbitrarily) set the width parameter to s = 0.05. Because this weighting function penalizes more strongly the higher-frequency features, the min-norm solution gives them smaller amplitudes, making the fit smoother.
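A sketch of the feature-weighted min-norm prediction (18) with the weighting of equations (21)–(22), assuming the design-matrix helper above. The diagonal of Λ^{-1} is applied by broadcasting rather than by building a p × p matrix (see Section 9); variable names are ours.

```python
import numpy as np

def feature_weighted_ols_predict(t_train, y_train, t_test, p, s=0.05, T=3.0):
    """Feature-weighted min-norm OLS (equation 18) with weights from (21)-(22)."""
    X = fourier_design_matrix(t_train, p, T)
    X_star = fourier_design_matrix(t_test, p, T)
    omega = 2.0 * np.pi * (np.arange(1, p + 1) // 2) / T   # omega_j of equation (4)
    f = 1.0 / (s**2 * omega**2 + 1.0)                      # equation (21)
    lam_inv = f**2                                         # diagonal of Lambda^{-1}, equation (22)
    XLi = X * lam_inv                                      # X Lambda^{-1}, no diagonal matrix built
    # Solve (X Lambda^{-1} X^T) a = Y, then Y_hat* = X* Lambda^{-1} X^T a   (equation 18)
    a, *_ = np.linalg.lstsq(XLi @ X.T, y_train, rcond=None)
    return (X_star * lam_inv) @ (X.T @ a)
```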
When the features are rescaled, the components β_j of the parameter vector β have to change in response, which in turn changes its norm ‖β‖². That is, the details of the min-norm data-fitting parameter vector depend on the details of how the features are normalized or weighted. In feature-weighted OLS, this property can be exploited to make the fits smooth, or meet other desiderata.

If we think of the OLS choice β̂—the min-norm vector among all vectors β that thread the data (satisfy X β = Y)—as the result of a kind of light regularization, then the feature weighting is an adjustment of the form of that regularization: It asks the optimization to pull some components of β towards zero harder than others. In our view, feature weighting should be considered part of the investigator's choice of regularization. The feature weighting can be thought of as altering the details of that choice, or it can be thought of as making the standard min-norm choice but in a carefully chosen, rescaled basis.

In the most general case, you have not just a set of feature weights Λ^{-1}, you also have a set of data-point weights C^{-1} (which, in standard settings, would be the inverses of the variances of the noise affecting the data points y_i). When you put these all together, the feature-weighted, data-weighted least-squares predictions are given by either of these two equivalent expressions:

    Ŷ* = X* (X^T C^{-1} X + Λ)^{-1} X^T C^{-1} Y   (23)
    Ŷ* = X* Λ^{-1} X^T (X Λ^{-1} X^T + C)^{-1} Y .   (24)

The equivalence of these is due to the Woodbury matrix identity (Henderson & Searle 1981). These expressions appear in Bayesian-inference contexts (see, for example, Hogg et al. 2020), where Λ^{-1} is the variance of the prior on the parameter vector β and C is the variance of the noise on the data vector Y. They are solutions to the optimization

    β̂ = arg min_β ‖C^{-1/2} (Y − X β)‖² + ‖Λ^{1/2} β‖² .   (25)
The units of C must be the units of Y squared, and the units of Λ^{-1} must be the square of the ratio of the units of Y to the units of X (the square of the units of β). These weighted-feature, weighted-data least-squares forms (23) and (24) are good because (if the weightings are chosen appropriately) they lead to smooth solutions that are not required to pass precisely through every data point. This is appropriate in common, real situations in which data are noisy and the world is smooth. Which form you choose, between (23) and (24), depends on a few things, but primarily p/n. If p < n it's both faster and more stable to use (23); if p > n it's faster and more stable to use (24). We show a toy example of a fit with both feature weights Λ^{-1} and data weights C^{-1} in Figure 7.
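A sketch implementing equations (23) and (24), switching between them based on p versus n as recommended above. The diagonal matrices C and Λ are applied by broadcasting and by adding to matrix diagonals rather than being constructed (Section 9); names and structure are ours.

```python
import numpy as np

def weighted_feature_predict(X, X_star, y, sigma_y, lam_inv):
    """Feature- and data-weighted prediction; uses (23) when p < n, else (24).

    X, X_star : training and test feature matrices (n x p and m x p)
    sigma_y   : per-point uncertainties, so C = diag(sigma_y**2)
    lam_inv   : length-p array, the diagonal of Lambda^{-1}
    """
    sigma_y = np.asarray(sigma_y, dtype=float)
    n, p = X.shape
    if p < n:
        # equation (23): solve the p x p system (X^T C^-1 X + Lambda) beta = X^T C^-1 Y
        c_inv = 1.0 / sigma_y**2
        A = X.T @ (X * c_inv[:, None])
        A[np.diag_indices_from(A)] += 1.0 / lam_inv        # add Lambda to the diagonal
        beta_hat = np.linalg.solve(A, X.T @ (c_inv * y))
        return X_star @ beta_hat
    # equation (24): solve the n x n system (X Lambda^-1 X^T + C) a = Y
    XLi = X * lam_inv
    A = XLi @ X.T
    A[np.diag_indices_from(A)] += sigma_y**2               # add C to the diagonal
    a = np.linalg.solve(A, y)
    return (X_star * lam_inv) @ (X.T @ a)
```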
Figure 7: Comparison of the feature-weighted OLS shown in Figure 6 to the same but also including data weights C^{-1}, as in equations (23) and (24), at p = 1024. Similarly to Figure 3, the data weights on the diagonal of C^{-1} are the inverses of the squares of the uncertainties shown as vertical error bars. The feature weights encourage the prediction to be smooth; the data weights permit it to be even smoother because they permit the prediction to miss the training data. Prediction results now depend on the specific amplitude or prefactor multiplying the feature weights; a particular numerical prefactor on [f(ω)]² was chosen for the diagonal of the feature-weight matrix Λ^{-1} for this demonstration.

There is a lot of literature analyzing the performance of linear regressions as a function of the sizes n and p, regularization strengths and forms, and so on (for example, Bartlett et al., among others). But this literature typically assumes that you know the generating process from which the data (X and Y in our nomenclature) were drawn. In the Real World (tm), you don't get this luxury. The data are given to you without documentation! Indeed, understanding the generating process of the world is the goal of investigations in the natural sciences (such as astronomy); the investigator does not know the generating process at the outset.

Given a data set (location–data pairs t_i, y_i), what is the best way to empirically estimate, from those data, the out-of-sample prediction error? That is, how do you estimate how well you are likely to predict new data? A reasonable answer to this is cross-validation: In cross-validation, we leave out a part of the data (in the most extreme form, leave out one single data point at a time), train the model using all but the left-out part, and predict the left-out part. Then this process is iterated over all choices of what part (or point) to leave out. This process is illustrated in Figure 8, where we show 23 fits, each of which is trained to the 22 points remaining when we leave one of the n = 23 points out. Also shown is the prediction, in each leave-one-out fit, for the left-out point.

Figure 8: A demonstration of leave-one-out cross-validation for one particular model (here at p = 15). The 23 lines are the 23 fits, for each of which one data point was held out. Also plotted on each of the 23 lines is the prediction made for that fit's held-out data point. Some of the predictions don't appear within the plot window, because the predictions obtain large amplitudes at the edges.

The prediction for the mean-squared error (MSE) for new data is the mean of the square of the differences between the leave-one-out predictions and the left-out data. This leave-one-out cross-validation MSE (CVMSE) will be different for different choices for the basis, size (p), and regularization (including feature-weight functions) of the fits you do. It is a fairly reliable and well-studied method for assessing predictive accuracy (see, for example, Stone 1974), although it does depend on some assumptions (for example, that data points are not duplicated or strongly correlated, and that the predictive information in the data is distributed among multiple individual data points). An example of the CVMSE for two models (the feature-weighted OLS and the ridge regression) is shown for our toy data, as a function of the number of features p, in Figure 9.
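A sketch of the leave-one-out CVMSE described above, written around any predictor with the call signature of the helpers defined earlier; the loop structure is ours, and (unlike Figure 9) it does not exclude the endpoints.

```python
import numpy as np

def loo_cvmse(t, y, predict, **kwargs):
    """Leave-one-out cross-validation mean squared error.

    predict(t_train, y_train, t_test, **kwargs) must return predictions at t_test.
    """
    t, y = np.asarray(t), np.asarray(y)
    n = len(t)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                          # drop point i from the training set
        y_pred = predict(t[keep], y[keep], t[[i]], **kwargs)
        errs[i] = (y_pred[0] - y[i])**2
    return errs.mean()

# e.g. scan p and keep the minimum-CVMSE model:
# scores = {p: loo_cvmse(t, y, feature_weighted_ols_predict, p=p) for p in (3, 7, 21, 73)}
```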
The predictive accuracy is indeed a very strong function of p.

Figure 9: Leave-one-out cross-validation estimates of mean squared prediction error for the feature-weighted OLS fits, as a function of the number of features p. Also shown are the same for ridge regression with λ = 0.1 and for a Gaussian process, as a function of p. In detail, the first (lowest-t) and last (highest-t) data points were not used in computing the mean squared error; that choice is debatable, but we are imagining an assessment of the quality of the interpolations, not extrapolations, of these models. Note that the OLS predictions are much, much worse at p ≈ n (check out all the orders of magnitude on the vertical axis) than they are at very low or very high p, but that ridge regression doesn't show this behavior.
In particular, the CVMSE for the feature-weighted OLS fits is bad when p ≈ n, and much better at p ≪ n and p ≫ n. This phenomenology is not particular to this problem. It is extremely general. There is an effect known in many kinds of regression called "double descent" or the "peaking phenomenon" or "jamming" in which predictive accuracy becomes very poor when the number of free parameters p comes close to the number of data points n (Jain & Chandrasekaran 1982, Spigler et al., among others). The poor predictive accuracy around p = n (Hastie et al.) can be understood in terms of the matrix inverses (X^T X)^{-1} and (X X^T)^{-1} appearing in (8) and (16) respectively, which are very badly conditioned around p ≈ n. This translates to a large variance for the estimator β̂, which implies a large risk. One way to think about this is that when the condition number of the matrix X^T X or X X^T is large, some directions in the data space—some linear combinations of the elements of the data vector Y—are very strongly amplified. This makes the regression unreasonably sensitive to noise in the data. (The condition number comes back up below in Section 9. It turns out that when the condition number is large, not only do the output predictions become very sensitive to the input training data, but the numerical, computational stability of the linear algebra can also be badly affected. We emphasize here, though, that the risk goes bad when the condition number is large even if it is possible to perform the linear algebra correctly at high precision.)

The high-risk behavior at p ≈ n disappears with certain kinds of regularization. These include the ridge regularization shown in Figure 9 (also known as Tikhonov regression), but also early stopping (Hastie et al.) and others. The point here is that models with very large numbers p of parameters can make good predictions for held-out data, and that regularization can also protect a model from over-fitting, even when p ≈ n. We don't mean to imply that over-fitting is not a problem in regression; it can be! Our recommendation is just that the problem of over-fitting be analyzed empirically through cross-validation, not through intuitive ideas about the number of parameters.

The CVMSE gets better and worse at different values of p. That's not surprising; there are places where the fit basis and size and feature weighting are all more appropriate for your specific problem. These good-CVMSE places are the best places to work, if you can afford to search for them and find them. In general, if you care about predictive accuracy, it is worth doing a search for the values of p (and other hyper-parameters, like regularization strengths and feature-weighting function parameters) that minimize the CVMSE. The minimum-CVMSE choices for p and the hyper-parameters will generally be very close to the choices that lead to the best predictions for new data.

That said, there is one more point to make about over-fitting, which is that, when your data set is small, it is possible to over-fit even your cross-validation. This problem is beyond the scope of this Note; all we will say here is that it is not possible to make precise settings of many hyper-parameters through cross-validation. Because in cross-validation you are making use of your data to set the properties of your regression, there are dangers of over-adapting your method to your data.
It is safer to have a fully independent validation data set, but this is rarely practical (and it doesn't completely protect you either; see Recht et al.). One more concern about leave-one-out cross-validation is that the training data may be irregularly spaced (in t space) to begin with, and become more irregularly spaced when you leave one out. How much does this matter? It probably does matter in detail, but you don't have much choice in problems like this. All empirical methods for estimating predictive accuracy involve compromises of this kind.
The Gaussian process (GP; see Rasmussen & Williams 2005 for a complete introduction) is a non-parametric regression that takes training data y_i at coordinates t_i, plus a kernel function, and makes predictions ŷ* for new data at new positions t*. The Gaussian process mean prediction looks like

    Ŷ* = K* K^{-1} Y ,   (26)

where K is a square n × n kernel matrix between training locations and themselves,

    [K]_{i,i'} = k(t_i, t_{i'}) ,   (27)

and K* is the same except it is the rectangular kernel matrix between test locations t* and training locations t_i; the function k(·,·) is a positive semi-definite kernel function. Stated this simply, the GP looks like magic; our goal here is to connect this to the feature-weighted OLS from Section 5.

First, what does it mean for a kernel function k(·,·) to be positive semi-definite? It means that no matter what set of n locations t_i you choose, the n × n kernel matrix K made from the function according to (27) has only non-negative, real eigenvalues. The kernel function must be positive semi-definite because, among other things, it is describing the variance of a process, and variances are always non-negative. It turns out that you can guarantee that a kernel function is positive semi-definite if you can show that it is, itself, the Fourier transform of a non-negative function (Bochner's theorem; see for example Folland 1994). (It is not required that the function k(·,·) itself be non-negative; positive semi-definiteness is a different condition entirely from non-negativity.)

Although the kernel matrix K is, by construction, always positive semi-definite, it can have an extremely bad or even infinite condition number. That is, it can lead your code or implementation into very unstable linear-algebra operations. We discuss how to handle these issues below in Section 9.

Above, we called the GP prediction the "mean". This is because the Gaussian process is a model for a mean and variance in function space. In addition to the mean Ŷ* predicted in (26), the GP also predicts a variance in the Y* space around that mean. In this Note, we are going to treat the GP as producing only a mean prediction, where the shape of the kernel function matters, but the amplitude of the kernel function does not (the prediction is a ratio of kernel matrices, so the kernel amplitudes cancel). (There is an additional point worthy of mention here, which is that the expression (26) is implicitly for a GP with a zero prior mean. There is a more general expression that subtracts a prior mean function from the Y values and adds the prior mean back in to the Ŷ* values. See Rasmussen & Williams 2005 for all the math.) However, in Bayesian-inference contexts the amplitude of the GP kernel is important, and the predicted variances are important. When we consider only the mean prediction of the GP, as we do here, the GP is a kind of linear filter that operates on data Y and produces predicted data Ŷ*. Classically, this linear filter is sometimes called a Wiener filter (more on this in a moment) or kriging (possibly because of Krige 1951).
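A sketch of the GP mean prediction (26). For concreteness it uses the Matérn 3/2 kernel that re-appears in equation (34) below, with the overall amplitude omitted because it cancels in (26); any positive semi-definite kernel would do, and the function names are ours.

```python
import numpy as np

def matern32_kernel(t1, t2, s=0.05):
    """Matérn 3/2 kernel k(t, t') as a function of |t - t'| (cf. equation 34),
    with the constant amplitude prefactor dropped."""
    r = np.abs(np.subtract.outer(np.asarray(t1), np.asarray(t2))) / s
    return (1.0 + r) * np.exp(-r)

def gp_mean_predict(t_train, y_train, t_test, s=0.05):
    """GP mean prediction (equation 26): Y_hat* = K* K^{-1} Y."""
    K = matern32_kernel(t_train, t_train, s)
    K_star = matern32_kernel(t_test, t_train, s)
    return K_star @ np.linalg.solve(K, y_train)    # solve rather than invert (Section 9)
```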
Importantly for us, there is a strong connection between the feature-weighted OLS and the GP. In particular, when we take the limit of infinite features (p → ∞; provided the limit exists) we get kernel matrices K in place of the X Λ^{-1} X^T matrix products:

    lim_{p→∞} X Λ^{-1} X^T = K   (28)
    lim_{p→∞} X* Λ^{-1} X^T = K* ,   (29)

where Λ^{-1} is the diagonal matrix of weights, and element i, i' of K is obtained by evaluating a kernel function k(t_i, t_{i'}). Equivalently, the limit is

    lim_{p→∞} Σ_{j=1}^{p} [Λ^{-1}]_{jj} g_j(t) g_j(t') = k(t, t') ,   (30)

where we have used the diagonality of Λ to make a single sum over j. The specific form of the kernel function k(·,·) depends on the basis (the features) we choose, and the weighting of the basis functions in the OLS.

The connection between the infinite basis chosen (the form and weighting of the features) and the kernel function is governed by Mercer's theorem (see Minh et al.). In the Fourier-basis case considered here, when the spacing of the frequencies (Δω = ω_{j+2} − ω_j) is small enough, the kernel approximates the Fourier transform of the square of the weighting function f(ω) we use to weight the features.

That is, in the case of the specific example of weight function f(ω) given in equation (21), we can connect our feature-weighted OLS to an equivalent GP if we know the Fourier transform of the square of f(ω). We chose that specific form for f(ω) because it has a square that is a member of a Fourier-transform pair:

    F[F(t)] = [f(ω)]²   (31)
    F(t) = √(π/2) (1 + |t|/s) exp(−|t|/s) .   (32)

This latter function is also known as the Matérn 3/2 kernel. Thus, in the limit Δω → 0, the p → ∞ feature-weighted OLS given here becomes a GP with kernel function F(·) (under the mild assumptions that guarantee that the discrete Fourier transform converges to the continuous Fourier transform; see Epstein 2005):

    k(t_i, t_{i'}) → F(t_i − t_{i'})   (33)
                   = √(π/2) (1 + |t_i − t_{i'}|/s) exp(−|t_i − t_{i'}|/s) .   (34)
Figure 10: Comparison of feature-weighted ordinary least squares (at p = 1024) with a Gaussian process. The figure shows the OLS fit, the feature-weighted version with the particular feature weighting given in (21), and the GP fit using the kernel (32) that is the Fourier transform of the square of the feature weighting. The feature-weighted fit and the GP are essentially identical, as expected.

Technically, the kernel function k(·,·) converges to the Fourier transform of the square of the weighting function f(·) in the limit Δω → 0 (the kernel k(·,·) already assumes p → ∞). However, provided that the spacing of the Fourier modes in frequency space is Δω ≪ 1/Δt_max and the maximum frequency is ⌊p/2⌋ Δω ≫ 1/min(s, Δt_min), where

    Δt_min ≡ min_{i≠j} |t_i − t_j|   (35)
    Δt_max ≡ max_{i,j} |t_i − t_j| ,   (36)

it will be true that the OLS with feature weights f(ω) will closely approximate the mean of a GP with kernel k(Δt) = F(Δt), the Fourier transform of [f(ω)]² (Epstein 2005). The comparison of the OLS and GP is shown in Figure 10.

In our toy examples, we use the feature weighting that generates the Matérn 3/2 kernel. It is important to emphasize that there are literally infinite alternative choices that can be made here, and hundreds even if you restrict to kernels with known closed forms. Indeed, any function F(t − t') that has a finite, all-positive Fourier transform can be substituted for the Matérn kernel and associated frequency-weighting function we use here.

The particular kernel we obtain in equation (34) is a stationary kernel, meaning it depends only on absolute values of time differences |t − t'|. Not all p → ∞ kernels will be stationary. The limit p → ∞ of X Λ^{-1} X^T in (28) leads to a stationary kernel k(·,·) in this case because the feature embedding in X is a set of sines and cosines. Sines and cosines form a basis for translation-invariant function spaces. If we had made a different choice for our basis, such as polynomials or wavelets, we would have obtained a non-stationary kernel in the p → ∞ limit.

At the end of Section 5, we discussed including not just feature weights in a matrix Λ^{-1} but also data weights in a matrix C^{-1}. The GP also permits this, and it is often a very good idea. The generalization of equations (23) and (24) to the GP case is

    Ŷ* = K* (K + C)^{-1} Y .   (37)

As before, in contexts where you have independent uncertainties on each element y_i of your training data Y, C^{-1} would naturally be set to the diagonal matrix containing the inverses of the variances of those uncertainties (so C would contain the variances). This form (37) is used a lot in astronomy and cosmology (for example, Zaroubi et al., among others). In signal-processing contexts, the prediction (37) is called a Wiener filter when the kernel function generating K is the Fourier transform of the "signal power" and the kernel function generating C is the Fourier transform of the "noise power". In this form, the amplitude of the kernel function matters, because it is competing, in some sense, the variance K of the GP against the variance C of the noise. It is not sufficient to get the shape of the kernel function right to make a stable prediction of the mean, when using form (37).
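Equation (37) is a small modification of the GP sketch above: add the noise variances to the diagonal of K before solving. Because K now competes with C, a kernel amplitude parameter is included; the names are ours.

```python
import numpy as np

def gp_noisy_predict(t_train, y_train, sigma_y, t_test, s=0.05, amp=1.0):
    """GP mean prediction with data weights (equation 37): Y_hat* = K* (K + C)^{-1} Y."""
    K = amp * matern32_kernel(t_train, t_train, s)
    K[np.diag_indices_from(K)] += np.asarray(sigma_y)**2   # add C = diag(sigma_y**2)
    K_star = amp * matern32_kernel(t_test, t_train, s)
    return K_star @ np.linalg.solve(K, y_train)
```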
There arefour-ish sources of uncertainty in the predictions you are making in problemslike these: (1) The data points y i you have are individually noisy. (2) Thereare finitely many of those data points (there are n of them) and there aregaps between them in the location space t . (3) The data points might haveuncertain locations or location measurements t i . (4) And the predictions youmake depend on hyper-parameter choices, such as the form of the basis, thenumber of parameters p , and any regularization or feature weighting. Thesefour different sources of uncertainty propagate differently and are differently“simple” to deal with. In particular it turns out that the easiest sources ofuncertainty to understand and propagate are those coming from (1) the noisein the y i and (2) the number n and locations t i of the data. The uncertaintiescoming from (3) uncertainties in the locations t i , and (4) model choices, areboth much harder to model and propagate.In the best-case scenario, you might have accurate estimates of the variances[ C ] ii = σ i of the noise contributions affecting your training data values y i , thenoise might be Gaussian with zero mean, you might have locations t i that are itting very flexible models / Hogg & Villar − ] jj on your parameters β j . In this case, you can assume that the linearmodel with p parameters is a good model for your data, and write down alikelihood function and a prior. In this case, you can turn the Bayesian crankand deliver a posterior density for the predicted values Y ∗ at the test locations t ∗ . The second derivative of the natural logarithm of the posterior density canthen be processed into a standard error on the predictions ˆ Y ∗ (or you candeliver full posterior densities somehow). The details of this are way beyondthe scope of this Note , but this (or a closely related) problem is discussed insome detail by us elsewhere (Hogg et al. k subsamples of the data,in each of which we have dropped (or held out) a unique fraction 1 /k of thedata. The variance of the results (in this case the prediction ˆ Y ∗ ) across thejackknife samples can be transformed into an estimate of the variance of theestimator acting on the full data set. In the most extreme form—leave-one-outjackknife—we set k = n . In this case, the jackknife estimate ˆΣ ∗ of the varianceof the best-fit predictions ˆ Y ∗ is given byˆΣ ∗ = n − n n (cid:88) i =1 ( ˆ Y ∗ i − ˆ Y ∗ ) ( ˆ Y ∗ i − ˆ Y ∗ ) (cid:62) , (38)where ˆ Y ∗ i is the estimate of Y ∗ made after leaving out the one training point t i , y i , and ˆ Y ∗ is the estimate made using all the data. The pre-factor ( n − /n in (38) is not anything like the 1 / ( n −
The pre-factor (n − 1)/n in (38) is not anything like the 1/(n − 1) that you use when you estimate a standard variance; this difference comes from the fact that the jackknife subsamples are extremely correlated; the pre-factor is computed to amplify the jackknife variance into an unbiased estimate of the variance of the estimator Ŷ*. The square root of the diagonal of this uncertainty variance Σ̂* (that is, what you might call the jackknife estimate of the standard error) is shown in Figure 11. Note the similarity between jackknife and cross-validation; it is relevant and important: These empirical estimates of uncertainty and prediction error are closely related.

Figure 11: The jackknife-estimated uncertainty on an ordinary least-squares fit (here at p = 5). What is shown is the diagonal of the empirical jackknife estimate Σ̂* of the variance tensor on the prediction Ŷ* of an OLS fit. In other words, the off-diagonal entries of Σ̂* are being ignored here; there are important covariances that are hard to visualize (and beyond our scope). The one-sigma region is larger than the span of the jackknife trials; this is because the jackknife trials are strongly correlated; the jackknife formula (38) compensates for those correlations.

These methods—full likelihood analysis and jackknife—take into account (1) the noise in the data y_i and (2) the number and spacing of the training data points. They do not, by themselves, take into account (3) the uncertainties on the locations t_i. There are no simple methods for uncertainty propagation from the locations t_i into the predictions Ŷ*. One option is to resample the t_i according to your uncertainty estimates, and re-do the fits. That's potentially expensive, and overly conservative (because it ignores the information about the t_i coming from the y_i). Another is to linearize: take the first derivative of the best-fit curve y(t) and propagate uncertainty using those derivatives. This is not conservative, because it only works if the location uncertainties are small relative to substantial changes in the slope of the predictions with t. The most extreme option would be to simultaneously fit for the prediction Ŷ* and all of the noisily measured t_i. That's a great idea, but that fit would be extremely computationally expensive, and no reasonable fit objective would be convex. All of these ideas are out of scope here.

Finally, in order to take account of (4) the uncertainty coming from your choices of basis, number p, and regularization, you might have to know (or learn) the distribution over these things. Since these choices are hyper-parameters, any inference that propagates uncertainties coming from these choices would have to be hierarchical in structure. That is way, way out of scope.
All of the code used to make the figures for this Note is available publicly (https://github.com/davidwhogg/FlexibleLinearModels). Although the examples are toys, the implementation of everything has to confront real numerical issues. The most important of these involves the condition number of the matrix in question. The condition number, for our purposes, is the ratio of the largest eigenvalue (for non-negative definite matrices) to the smallest non-zero eigenvalue. For rectangular feature matrices it is the ratio of the largest singular value (that is, what you get out of a singular value decomposition) to the smallest non-zero singular value.

When the condition number is large, different linear-algebra approaches will differ. For example, the function call solve(A, b) should give you the best estimate it can for the product A^{-1} b, whereas the function call dot(inv(A), b), which is the same on paper, will give you the dot product between b and the best estimate that inv() can find for the inverse of A. When the condition number of A is large, these two values can be very different, and the solve() will be better. Use solve(A, b) and never dot(inv(A), b) unless there are compelling code-structure reasons to use the latter, such as repeated calls (but even then, a Cholesky decomposition followed by repeated Cholesky solves is probably a more stable solution). (This claim contradicts what is stated in the abstract of Druinsky & Toledo 2012, but it is based on our real numerical experiments on real matrices, so we stand by it. What is uncontroversial is that solve(A, b) will always perform non-worse than dot(inv(A), b); sometimes it will perform far better.)

If your condition number gets very large, even solve() can give bad results, because very small eigenvalues of the matrix are being poorly estimated and then inverted. That's unstable. It is better to zero out those small eigenvalues—it is better to destroy bad information than to use it—and perform a pseudo-inverse in those cases. So if you really want to be safe (and you do, here) you should really use lstsq(A, b, rcond=tiny) instead of solve(A, b). The lstsq() function requires an rcond input, which says at what (dimensionless) precision to zero out low eigenvalues. It usually makes sense to set this dimensionless rcond input to something like machine precision, which is about 10^{-16} for our current hardware–software setups.
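A small NumPy rendering of the advice above (np.linalg.solve, np.linalg.lstsq, and np.linalg.inv are the relevant calls); the toy matrix here is only illustrative.

```python
import numpy as np

# Build a badly conditioned normal-equations matrix X^T X (p close to n).
rng = np.random.default_rng(42)
t = np.sort(rng.uniform(0.0, 1.5, size=23))
X = fourier_design_matrix(t, p=21)
A = X.T @ X
b = X.T @ rng.normal(size=23)

print(np.linalg.cond(A))                          # condition number: typically huge here

x_bad = np.linalg.inv(A) @ b                      # discouraged: explicit inverse
x_ok = np.linalg.solve(A, b)                      # better: direct solve
x_safe, *_ = np.linalg.lstsq(A, b, rcond=1e-15)   # safest: zero out tiny eigenvalues
```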
50. And that’sfor n = 23; things generally get worse as n gets larger.Related to this, if you have a choice between formulations that involvean inverse of a square, like ( X (cid:62) X ) − X (cid:62) Y , and formulations that involve apseudo-inverse, like X † Y , you should do the latter, because X (cid:62) X has the square of the condition number of X . When you want to execute X † Y , aswe do in (17), you should again use lstsq() , which was literally designedfor these applications; use lstsq(X, Y, rcond=tiny) . This function returnsthe best estimate it can for X † Y , so it should be better than computing apseudo-inverse pinv(X) and then matrix-multiplying pinv(X) with Y . Again This claim contradicts what is stated in the abstract of Druinsky & Toledo (2012),but this claim is based on our real numerical experiments on real matrices, so we standby it. What is uncontroversial is that solve(A, b) will always perform non-worse than dot(inv(A), b) ; sometimes it will perform far better. itting very flexible models / Hogg & Villar rcond input to something like machineprecision.Once p is large enough, if you are using a feature weighting with Λ − withdiagonal entries [Λ − ] jj that decrease to zero with j , at some point, at ma-chine precision, the additional columns you are adding to X are effectively allzeros. They will literally underflow the linear-algebra representation. That’snot a problem if you implement your linear algebra well (that is, use lstsq() appropriately), but it does mean that your predictions and cross-validationswill saturate at some p (as we see them do in Figure 9).In multiple places, but especially (23) and (24), you are performing opera-tions on matrices that are diagonal ( C and Λ and their inverses are all diago-nal). You should avoid ever constructing diagonal matrices. You can multiply X or X (cid:62) by a diagonal matrix by just multiplying the rows (or columns) bythe diagonal entries. And you can invert by just inverting the diagonal entries.Avoid constructing and operating on operators that are almost entirely zeros,unless you have a very efficient sparse linear algebra implementation and youare a power user.And finally, it is most numerically stable to operate on the smallest matricesyou can. For example, when you have the choice between the formulation inequation (23) and that in equation (24), you should choose the former when p < n and the latter when p > n . That way you are always doing the heavylinear algebra ( solve() and lstsq() function calls) at a size min( p, n ), whichis both faster and more stable than linear algebra at max( p, n ). And it is muchfaster when you make these choices correctly: Linear algebra scales naively asthe dimension cubed . (In practice—with excellent packages—it actually scaleswith a power more like 2.6 than 3, but still!)
10 Discussion
This Note was about linear fitting with very flexible models, for interpolation, prediction, and de-noising of data. We encouraged you to consider using very big models, but being intentional about regularization. These settings (over-parameterized, but carefully regularized) are adaptive and useful, and they connect, as we showed, to Gaussian processes in the limit, which are well-studied workhorses of machine learning and data science.

All the rage these days is instead nonlinear fitting, with deep learning and the like (LeCun et al.).
Note carry over: All of these tools work well when the function space is flexible but also carefully regularized. In these nonlinear settings, regularization takes many additional new forms, like early stopping (Yao et al. 2007) and dropout (Srivastava et al. 2014).

Throughout, we have assumed the setting in which you care about the predictions $Y^\ast$ and never the setting in which you care about the internal parameters $\beta$ of the linear fit. However, if you are, say, measuring the power spectrum of your data (a context in which the values of $\beta$ matter), you also have to get serious about the noise model on the data. That is, the beliefs encoded in the data noise variance tensor $C$ also must represent your true beliefs if you don't want your answers distorted in bad ways. All of that is out of scope here, but we say things about it all elsewhere (for example, Hogg et al. 2010).

Many interpolation and smoothing methods also produce, in the end, a linear operator (a function of the locations $t_i$ of the data) acting on the input data vector $Y$, just like our linear fits produce linear operators. This means that in many cases, these methods can be translated into versions of the methods we have presented in this Note. A full translation is beyond our scope, though. And our view is that making explicit investigator choices about the basis and the regularization makes the linear-fitting approach better in general.

Regularization was perhaps the biggest theme of this
Note. But we only really considered variants of L2 regularization (or ridge, or Tikhonov). There are other regularizations, even other convex regularizations. A valuable and useful option is L1 or the lasso (Tibshirani 1996), which encourages sparsity: it encourages parameters $\beta_j$ to take the value $\beta_j = 0$ exactly, where possible. This kind of regularization makes sense when we have prior beliefs that the functional forms we seek will in fact be sparse in our chosen basis. Usually this kind of consideration is less important in interpolation, prediction, and de-noising settings, but it is not unheard of. A nice recent result is that the feature weighting that we employ here can be used with L1 regularization to obtain smoothness and sparsity simultaneously (Rauhut & Ward 2016).

We discussed deterministic, ordered expansions like Fourier series and polynomials, but there is another class of random-features methods (see, for example, Rahimi & Recht 2007) that we did not discuss. Briefly, the idea with random features is that instead of weighting features that are regularly spaced in frequency space with a weighting function $f(\omega)$, we could have generated unweighted features randomly from a probability distribution proportional to $f(\omega)$. The same Gaussian-process limit appears as $p \to \infty$ (provided we choose those features with appropriately random phases too). These random-feature approaches have rarely been used in the natural sciences, but they are potentially of interest in many applications.

In the toy examples used throughout this Note, we considered data with locations $t_i$ in a one-dimensional location space or ambient space. This was a choice for simplicity of visualization; in principle the locations $t_i$ could be in a space of any dimensionality, so long as the basis functions $g_j(t)$ sensibly take the locations $t_i$ as input. In astronomy and cosmology contexts, it is common for the locations $t_i$ to be positions on the sphere, and it is common for the natural basis functions to be spherical harmonics, for example.

The methods in this Note are all discriminative, in the sense that we took the locations $t_i$ to be prior or primary; the goal was to find a function of the locations $t_i$ that predicts the data $y_i$. This asymmetry between the locations and the data led to concerning statements in Section 8 when we considered the possibility that the locations $t_i$ themselves might be noisy or uncertain. An alternative formulation to these discriminative models is the generative model, in which the model attempts not just to predict the $y_i$ from the $t_i$ but instead predicts both the $t_i$ and the $y_i$. A model that generates both can be used to make predictions $y^\ast$ at new locations $t^\ast$ by executing an inference or inverse problem on the generative model. Generative models are generally non-convex, and harder to execute, if you want the relationship between the $t_i$ and the $y_i$ to be nonlinear (for an example of this in astronomy, see Ness et al. 2015). A generative formulation is appropriate when the locations $t_i$ and the data $y_i$ are comparably noisy measurements.

In our toy examples, we used a Fourier series. That is just one choice among many. However, it is often a great choice. When you choose the Fourier basis (and, importantly, for every cosine term you include the corresponding sine term), the matrix $X \Lambda^{-1} X^\top$ (and the limiting kernel matrix $K$ at $p \to \infty$) has the property that every matrix element $[X \Lambda^{-1} X^\top]_{ii'}$ depends on (or can be calculated from) just the absolute difference $|t_i - t_{i'}|$. That is, in this basis, all fitting methods become technically stationary.
This all relates to the translation-independence properties of the Fourier basis. So although the Fourier basis is just one choice among many, it is the right choice when you think your problem has (or might have) certain kinds of translation invariances.
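As a quick numerical check of this stationarity property, the following sketch builds $X \Lambda^{-1} X^\top$ for a paired sine-and-cosine Fourier basis and verifies that it is unchanged when all the locations are rigidly translated. The helper name fourier_kernel and the period, number of frequencies, and weighting are arbitrary illustrative choices, not the settings used in the toy examples:

    import numpy as np

    def fourier_kernel(t, nfreq=30, period=10.0):
        # Build X Lambda^{-1} X^T for a Fourier basis in which the cosine and
        # sine at each frequency omega_k share the same weight [Lambda^{-1}]_kk.
        t = np.asarray(t, dtype=float)
        k = np.arange(1, nfreq + 1)
        omegas = 2.0 * np.pi * k / period
        w = 1.0 / (1.0 + k.astype(float) ** 2)         # weights decreasing with frequency
        X = np.concatenate([np.cos(np.outer(t, omegas)),
                            np.sin(np.outer(t, omegas))], axis=1)
        Linv = np.concatenate([w, w])                   # same weight for the cos and sin pair
        return (X * Linv[None, :]) @ X.T                # never build diag(Linv) explicitly

    rng = np.random.default_rng(4)
    t = np.sort(rng.uniform(0.0, 3.0, size=12))
    K1 = fourier_kernel(t)
    K2 = fourier_kernel(t + 0.7)                        # rigidly translate every location
    print(np.allclose(K1, K2))                          # True: entries depend only on t_i - t_i'

The check works because each entry is a sum over frequencies of $w_k [\cos(\omega_k t_i)\cos(\omega_k t_{i'}) + \sin(\omega_k t_i)\sin(\omega_k t_{i'})] = w_k \cos(\omega_k (t_i - t_{i'}))$, so only the difference of the locations enters.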
References

Agresti, Alan. 2015. Foundations of Linear and Generalized Linear Models. John Wiley & Sons.
Aigrain, S., Parviainen, H., & Pope, B. J. S. 2016. K2SC: flexible systematics correction and detrending of K2 light curves using Gaussian process regression. Monthly Notices of the Royal Astronomical Society, (3), 2408–2419.
Bah, Bubacarr, & Ward, Rachel. 2016. The sample complexity of weighted sparse approximation. IEEE Transactions on Signal Processing, (12), 3145–3155.
arXiv:1906.11300.
Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.
Druinsky, Alex, & Toledo, Sivan. 2012. How Accurate is inv(A)*b? arXiv:1201.6035.
Efron, B. 1979. Bootstrap Methods: Another Look at the Jackknife. Annals of Statistics, (1), 1–26.
Epstein, Charles L. 2005. How well does the finite Fourier transform approximate the Fourier transform? Communications on Pure and Applied Mathematics, (10), 1421–1435.
Folland, Gerald B. 1994. A course in abstract harmonic analysis. CRC Press.
Foreman-Mackey, Daniel, Agol, Eric, Ambikasaran, Sivaram, & Angus, Ruth. 2017. Fast and Scalable Gaussian Process Modeling with Applications to Astronomical Time Series. The Astronomical Journal, (6), 220.
Geiger, Mario, Spigler, Stefano, d'Ascoli, Stéphane, Sagun, Levent, Baity-Jesi, Marco, Biroli, Giulio, & Wyart, Matthieu. 2019. Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E, (1), 012115.
Gelman, Andrew, Hill, Jennifer, & Vehtari, Aki. 2020. Regression and Other Stories. Cambridge University Press.
Hastie, Trevor, Tibshirani, Robert, & Friedman, Jerome. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer Science & Business Media.
Hastie, Trevor, Montanari, Andrea, Rosset, Saharon, & Tibshirani, Ryan J. 2019. Surprises in high-dimensional ridgeless least squares interpolation. arXiv:1903.08560.
Henderson, Harold V., & Searle, Shayle R. 1981. On deriving the inverse of a sum of matrices. SIAM Review, (1), 53–60.
Hogg, David W., Bovy, Jo, & Lang, Dustin. 2010. Data analysis recipes: Fitting a model to data. arXiv:1008.4686.
Hogg, David W., Price-Whelan, Adrian M., & Leistedt, Boris. 2020. Data Analysis Recipes: Products of multivariate Gaussians in Bayesian inferences. arXiv:2005.14199.
Huang, Ningyuan, Hogg, David W., & Villar, Soledad. 2020. Dimensionality reduction, regularization, and generalization in overparameterized regressions. arXiv:2011.11477.
Jain, Anil K., & Chandrasekaran, Balakrishnan. 1982. Dimensionality and sample size considerations in pattern recognition practice. Handbook of Statistics, 835–855.
Krige, Daniel G. 1951. A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Southern African Institute of Mining and Metallurgy, (6), 119–139.
LeCun, Yann, Bengio, Yoshua, & Hinton, Geoffrey. 2015. Deep learning. Nature, (7553), 436–444.
Computers & Mathematics with Applications, (1-2), 119–129.
Minh, Ha Quang, Niyogi, Partha, & Yao, Yuan. 2006. Mercer's theorem, feature maps, and smoothing. Pages 154–168 of: International Conference on Computational Learning Theory. Springer.
Ness, M., Hogg, David W., Rix, H. W., Ho, Anna Y. Q., & Zasowski, G. 2015. The Cannon: A data-driven approach to Stellar Label Determination. The Astrophysical Journal, (1), 16.
Nocedal, Jorge, & Wright, Stephen. 2006. Numerical optimization. Springer Science & Business Media.
Rahimi, Ali, & Recht, Benjamin. 2007. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 1177–1184.
Rasmussen, Carl Edward, & Williams, Christopher K. I. 2005. Gaussian Processes for Machine Learning. The MIT Press.
Rauhut, Holger, & Ward, Rachel. 2016. Interpolation via weighted ℓ1 minimization. Applied and Computational Harmonic Analysis, (2), 321–351.
Recht, Benjamin, Roelofs, Rebecca, Schmidt, Ludwig, & Shankar, Vaishaal. 2018. Do CIFAR-10 classifiers generalize to CIFAR-10? arXiv:1806.00451.
Spigler, Stefano, Geiger, Mario, d'Ascoli, Stéphane, Sagun, Levent, Biroli, Giulio, & Wyart, Matthieu. 2018. A jamming transition from under- to over-parametrization affects loss landscape and generalization. arXiv:1810.09665.
Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, & Salakhutdinov, Ruslan. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, (1), 1929–1958.
Stone, M. 1974. Cross-Validatory Choice and Assessment of Statistical Predictions. Journal of the Royal Statistical Society: Series B (Methodological), (2), 111–133.
Tibshirani, Robert. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), (1), 267–288.
Xie, Yuege, Ward, Rachel, Rauhut, Holger, & Chou, Hung-Hsu. 2020. Weighted optimization: better generalization by smoother interpolation. arXiv:2006.08495.
Yao, Yuan, Rosasco, Lorenzo, & Caponnetto, Andrea. 2007. On early stopping in gradient descent learning. Constructive Approximation, (2), 289–315.
Zaroubi, S., Hoffman, Y., Fisher, K. B., & Lahav, O. 1995. Wiener Reconstruction of the Large-Scale Structure. The Astrophysical Journal, 446.

A Optimization arguments