Using reference models in variable selection
Federico Pavone∗†, Juho Piironen∗, Paul-Christian Bürkner∗ and Aki Vehtari∗

April 29, 2020

∗ Department of Computer Science, Aalto University, Finland.
† Department of Decision Sciences, Bocconi University, Italy.
Abstract
Variable selection, or more generally, model reduction is an important aspect of the statistical workflow aiming to provide insights from data. In this paper, we discuss and demonstrate the benefits of using a reference model in variable selection. A reference model acts as a noise-filter on the target variable by modeling its data generating mechanism. As a result, using the reference model predictions in the model selection procedure reduces the variability and improves stability, leading to improved model selection performance. Assuming that a Bayesian reference model describes the true distribution of future data well, the theoretically preferred usage of the reference model is to project its predictive distribution to a reduced model, leading to the projection predictive variable selection approach. Alternatively, reference models may also be used in an ad-hoc manner in combination with common variable selection methods. In several numerical experiments, we investigate the performance of the projective prediction approach as well as alternative variable selection methods with and without reference models. Our results indicate that the use of reference models generally translates into better and more stable variable selection. Additionally, we demonstrate that the projection predictive approach shows superior performance as compared to alternative variable selection methods independently of whether or not they use reference models.
1 Introduction

In statistical applications, one of the main steps in the modelling workflow is variable selection, which is a special case of model reduction. Variable selection (also known as feature or covariate selection) may have multiple goals. First, if the variables themselves are of interest, we can use variable selection to infer which variables contain predictive information about the target. Second, as simpler models come with the advantages of reduced measurement costs and improved interpretability, we may be interested in finding the minimal subset of variables which still provides good predictive performance (or a good balance between simplicity and predictive performance). When the predictive capability is guiding the selection, the true data generation mechanism of future data can be approximated either by using the observed data directly or alternatively by using predictions from a reference model (Vehtari and Ojanen, 2012).

In data-based approaches, such as Lasso selection (Tibshirani, 1996) or stepwise backward/forward regression (Venables and Ripley, 2013; Harrell, 2015), the observed empirical data distribution is utilised as a proxy of future data, usually in combination with cross-validation or information criteria to provide estimates of out-of-sample predictive performance. In contrast, reference model based methods approximate the future data generation mechanism using the predictive distribution of a reference model, which can be, for example, a full-encompassing model including all variables. The main assumption under any reference model approach is that we operate in an M-complete framework (Bernardo and Smith, 2009; Vehtari and Ojanen, 2012), that is, we have constructed a model which reflects our beliefs about the future data in the best possible way and which has passed model checking and criticism (see, e.g., Gelman et al., 2013; Gabry et al., 2019). The reference model approach has been used in Bayesian statistics at least since the seminal work of Lindley (1968). For more historical references, see Vehtari and Ojanen (2012) and Piironen and Vehtari (2017a), and for the most recent methodological developments see Piironen et al. (2020).

Reference models have also been used in non-Bayesian contexts, in which Harrell (2015) describes them as full models that can be thought of as a "gold standard" (for a given application). For example, Faraggi et al. (2001) deal with the necessity of identifying interpretable risk groups in the context of survival data using neural networks, which typically perform very well in terms of prediction, but whose variables are difficult to be understood in terms of relevance. Paul et al. (2008), using the term preconditioning, explore approximating models by fitting Lasso or stepwise regression against consistent estimates ŷ of a reference model instead of the observed responses y. Whatever the terminology or applied statistical framework, reference models offer a powerful approach to improving variable selection, as we will demonstrate in the present paper.

This paper is structured as follows. In Section 2, we review the concept of the reference model, its benefits with examples, and how it can be used as a filter on data in a simple way. In Section 3 and Section 4, we show the benefits of a reference model approach for minimal and complete variable selection, respectively, before we end with a conclusion in Section 5.
2 Reference models

In this section, we provide an initial motivation and intuition for the use of reference models to improve variable selection methods. We start with a case study that is used repeatedly throughout the paper to illustrate the benefits of reference models, before we dive deeper into the theoretical reasons why reference models help in variable selection.
2.1 Example: Body fat data

To motivate the further discussion and experiments, we start with a simple variable selection example using the body fat data of Johnson (1996). We compare the projective prediction approach (projpred, Piironen et al. (2020)), which uses a reference model, and classic stepwise backward regression (steplm). The experiments are implemented in R (R Core Team, 2018).

The target variable of interest is the amount of body fat, which is obtained by a complex and expensive procedure consisting of immersing a person in a water tank and carrying out different measurements and computations. Additionally, we have information about 13 variables which are anthropometric measurements (e.g., height, weight and the circumference of different body parts). The variables are highly correlated, which causes an additional challenge in the variable selection. In total, we have 251 observations. The goal is to find the model which is able to predict the amount of body fat well while requiring the least amount of measurements for a new person.

Heinze et al. (2018) report results using steplm with a significance level of 0.157, corresponding to AIC selection (Akaike, 1974), fixing abdomen and height to always be included in the model. For better comparison, we do not fix any of the variables. The steplm approach is carried out combining the step and lm functions in R.

For the selection via projpred, the Bayesian reference model includes all the variables, using a regularised horseshoe prior (Piironen and Vehtari, 2017b) on the variable coefficients. Submodels are explored using forward search (the results are not sensitive to whether forward or backward search is used), and the predictive utility is the expected log-predictive density (elpd) estimated using approximate leave-one-out cross-validation via Pareto-smoothed importance sampling (PSIS-LOO-CV; Vehtari et al., 2017). We select the smallest submodel with an elpd score similar to the reference model when taking into account the uncertainty in estimating the predictive model performance. See Appendix A for a brief review of the projection predictive approach, and the papers by Piironen and Vehtari (2017a) and Piironen et al. (2020) for more details. The complete projpred approach is implemented in the projpred R package (Piironen et al., 2019). An illustrative sketch of the two selection approaches in R is given after Figure 1.

The inclusion frequencies of each variable in the final model given 100 bootstrap samples are shown in Figure 1. In the case of projpred, there are two variables, 'abdomen' and 'weight', which have inclusion frequencies above 50% ('abdomen' is the only one always included), the third most frequently included is 'wrist' at 44%, and the fourth one is 'height' at 35%. The steplm approach has seven variables with inclusion frequencies above 50%. Such higher variability and lower stability of steplm can also be observed in the bootstrap model selection frequencies reported in Table 1. For example, the first five selected models have a cumulative frequency of 76% with projpred, but only of 14% with steplm. In addition, the sizes of the selected models with projpred are much smaller than the ones selected with steplm.

Figure 1: Body fat example: Bootstrap inclusion frequencies calculated from 100 bootstrap samples. The projpred approach has less variability on which variables are selected.
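The following R sketch outlines both selection approaches as used in this example. It is illustrative rather than a verbatim reproduction of our experiment setup: the data frame bodyfat and the target column siri are assumed names, while the projpred calls follow the package's documented interface.

    library(rstanarm)   # Bayesian reference model
    library(projpred)   # projection predictive variable selection

    # steplm: classical backward selection guided by AIC
    full_lm <- lm(siri ~ ., data = bodyfat)
    sel_lm  <- step(full_lm, direction = "backward", trace = 0)

    # projpred: reference model with a regularised horseshoe prior, then
    # forward search with PSIS-LOO-CV estimates of the predictive utility
    ref_fit <- stan_glm(siri ~ ., data = bodyfat, prior = hs())
    cvs     <- cv_varsel(ref_fit)
    nsel    <- suggest_size(cvs)  # smallest submodel with elpd close to the reference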
M | projpred | Freq % | steplm | Freq %
1 | abdom., weight | 39 | abdom., age, forearm, height, hip, neck, thigh, wrist | 4
2 | abdom., wrist | 10 | abdom., age, chest, forearm, height, neck, thigh, wrist | 4
3 | abdom., height | 10 | abdom., forearm, height, neck, wrist | 2
4 | abdom., height, wrist | 9 | abdom., forearm, neck, weight, wrist | 2
5 | abdom., weight, wrist | 8 | abdom., age, height, hip, thigh, wrist | 2
6 | abdom., chest, height, wrist | 2 | abdom., age, height, hip, neck, thigh, wrist | 2
7 | abdom., biceps, weight, wrist | 2 | abdom., age, ankle, forearm, height, hip, neck, thigh, wrist | 2
8 | abdom., height, weight, wrist | 2 | abdom., age, biceps, chest, height, neck, wrist | 2
9 | abdom., age, wrist | 2 | abdom., age, biceps, chest, forearm, height, neck, thigh, wrist | 2
10 | abdom., age, height, neck, thigh, wrist | 2 | abdom., age, ankle, biceps, weight, wrist | 2
Table 1: Body fat example: Bootstrap model selection frequencies from 100 bootstrap samples. The projpred approach has less variability on which variable combinations are selected.
The first two rows of Table 2 show the predictive performances, in terms of cross-validated root mean square error (RMSE), of the full model and of the models selected using projpred or steplm. There is no significant difference in the predictive performance of the models selected by the different approaches, even if there is a clear difference in the number of selected variables. This can be explained by the high correlation between the variables: different combinations can provide similar predictive accuracy.

We repeat the experiment with a modified data set obtained by adding 87 unrelated noisy variables, resulting in 100 variables in total. The last two rows of Table 2 show the cross-validated RMSE, the size of the selected model, and the number of selected noisy variables using projpred or steplm. The results show that projpred has similar predictive performance and the same number of selected variables as with the original data, whereas the stepwise regression has worse predictive performance and the number of selected variables is much higher and includes a large number of irrelevant variables.

Both projpred and steplm compare a large number of models using either forward or backward search, which can lead to selection induced overfitting, but even with 100 variables, projpred is able to select a submodel with similar performance as the full model. In this example, the two compared methods also differ in aspects other than the usage of a reference model, such as projpred using Bayesian inference while steplm uses maximum likelihood estimation. However, as we show in the present paper, one of the primary differences with respect to the quality of the variable selection is indeed whether or not a reference model is applied.
Table 2: Body fat example: Predictive performances with the original data (first two rows) and with extra noisy variables (last two rows), estimated with 10-fold cross-validation. Columns: Data; Method; RMSE Full; RMSE Sel. Abbreviations: RMSE = root mean squared error; Full = full model; Sel = selected submodel.

2.2 Benefits of reference models

A properly designed reference model is able to filter out part of the noise present in the data, and hence to provide an improved and more stable selection process. This holds even if the reference model does not perfectly resemble the true data generating process. Moreover, our analyses indicate that the substantial reduction of variance attributable to noise is usually more important than the small potential bias due to model misspecification.
We argue that, regardless of how the reference model is set up and used in the inference procedure, it can always be seen as acting as a filter on the observed data. Furthermore, regardless of what specific model selection method is used, a reference model can be used instead of the raw data during the selection process to improve stability and selection performance. Our results indicate that the core reason why the reference model based methods perform well is the reference model itself, rather than the specific way of using it. In general, the less data we have and the more complex the estimated models are, the higher is the benefit of using a reference model as the basis for variable selection.

If one of the models to be compared is the full model, which can be used as a reference model, there is no additional cost of using a reference model, as it was estimated as part of the analysis anyway. Sometimes, including all the available variables in an elaborate model can be computationally demanding. In such a case, even simpler screening or dimensionality reduction techniques, for example supervised principal components (Bair et al., 2006; Piironen and Vehtari, 2018), can produce useful reference models.
2.3 Reference models as noise filters

A good predictive model is able to filter out part of the noise present in the data. The noise is the main source of instability in the selection and tends to obscure the relevance of the variables in relation to the target variable of interest. We demonstrate this with the following simple explanatory example taken from Piironen et al. (2020). The data generation mechanism is

    f ∼ N(0, 1),
    Y | f ∼ N(f, 1),    (1)
    X_j | f ∼ N(√ρ f, 1 − ρ) independently for j = 1, . . . , k,
    X_j ∼ N(0, 1) independently for j = k + 1, . . . , p,

where f is the latent variable of interest, of which Y is a noisy observation. The first k variables are strongly related to the target variable Y and correlated among themselves. Precisely, ρ is the correlation between any pair of the first k variables, whereas √ρ and √(ρ/2) are the correlations of each of these variables with f and Y, respectively. If we had an infinite amount of observations, the sample correlation would be equal to the true correlation between X_j and Y. However, even in this ideal asymptotic regime, this correlation would still remain a biased indicator of the true relevance of each variable (represented by the correlation between X_j and f) due to the intrinsically noisy nature of Y. A short simulation sketch of this mechanism in R is given after Figure 2.

When using a reference model, we first obtain predictions for f using all the variables {X_j}_{j=1}^p, taking into account that we have only observed the noisy representation Y of f. If our model is good, we are able to describe f better than Y itself can, which improves the accuracy in the estimation of the relevance of the variables. Figure 2 illustrates this process in the form of a scatter plot of (absolute) correlations of the variables with Y against the corresponding correlations with the predictions of a reference model (in this case the posterior predictive means of model (7); see Section 4.3). Looking at the marginal distributions, we see that, using a reference model to filter out noise in the data, the two groups of variables (relevant and non-relevant) can be distinguished much better than when the correlation is computed using the observed noisy data directly.

Figure 2: Sample correlation plot of each variable (relevant in red, non-relevant in blue) with the target variable y and the reference model predictions ŷ, respectively, on the x and y axes. Simulated data are generated according to (1).
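The following R sketch simulates data from mechanism (1); the parameter values at the bottom are placeholders rather than the exact settings used for Figure 2.

    # Simulate mechanism (1): latent f, noisy target y, k relevant and
    # p - k irrelevant standardised variables.
    simulate_mechanism <- function(n, p, k, rho) {
      f <- rnorm(n)                            # f ~ N(0, 1)
      y <- rnorm(n, mean = f, sd = 1)          # Y | f ~ N(f, 1)
      x <- matrix(rnorm(n * p), nrow = n)      # X_j ~ N(0, 1) for j > k
      x[, 1:k] <- sqrt(rho) * f + sqrt(1 - rho) * x[, 1:k]  # X_j | f ~ N(sqrt(rho) f, 1 - rho)
      list(x = x, y = y, f = f)
    }

    dat <- simulate_mechanism(n = 100, p = 70, k = 20, rho = 0.5)  # placeholder values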
3 Minimal subset variable selection

In the body fat example above, our two simultaneous goals were to obtain good predictive performance and to select a small number of variables. When the goal is to select a minimal subset of variables which has similar predictive performance as the full model, we call it minimal subset variable selection. This minimal subset might exclude variables which have some predictive information about the target but which, given the minimal subset, are not able to provide additional information that would improve the predictive performance in a substantial manner. The usual reason for this is that the relevant variables which are not in the minimal subset are highly correlated with variables already in the minimal subset. We will return to the problem of finding all the variables with some predictive power in Section 4.
3.1 Simulation study 1

Using the data generating mechanism (1), we simulate data sets of different sizes with a relatively large number of variables, p = 70, with k = 20 of them being predictive. We compare the minimal subset variable selection performance of the projection predictive approach (which uses a reference model and is referred to as projpred), Bayesian stepwise forward selection with and without a reference model, and maximum likelihood stepwise forward selection with and without a reference model (steplm). The following is a summary of the implementation of the compared methods (a sketch of the Bayesian p-value used by the stepwise methods follows the list):

• projpred: the projective prediction approach is used. The reference model is a Bayesian linear regression model using the first five supervised principal components (Piironen and Vehtari, 2018) as predictors, and the full posterior predictive distribution is the basis of the projection. The search heuristic is forward search, and the predictive performance is estimated via 10-fold cross-validation. The selection continues until the predictive performance is close to the predictive performance of the reference model. See Appendix A for more details.

• Bayesian stepwise selection (without a reference model): at each step, the fitted model is a Bayesian linear regression using the regularised horseshoe prior, and the variable excluded is the one with the highest Bayesian p-value, defined as min{P(θ ≤ 0 | D), P(θ > 0 | D)}, where D stands for the observed data. The selection continues if the reduced model has an elpd score higher (i.e., better) than the current model.

• Bayesian stepwise selection (with a reference model): the reference model is the same as for projpred, but only point predictions (posterior predictive means) ŷ are used to replace the target variable y. The same Bayesian p-value selection strategy as in the data-based Bayesian stepwise selection is used.

• steplm (without a reference model): stepwise selection using AIC, as in the body fat example.

• steplm (with a reference model): the reference model is the same as for projpred, but only point predictions (posterior predictive means) ŷ are used to replace the target variable y. The same AIC selection strategy as in the data-based steplm is used.
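The Bayesian p-value above can be computed directly from posterior draws; the following sketch assumes the draws of a single coefficient have been extracted, e.g., with as.matrix() from an rstanarm fit.

    # Bayesian p-value for one coefficient: the posterior probability mass on
    # the less likely side of zero, min{P(theta <= 0 | D), P(theta > 0 | D)}.
    bayes_p <- function(theta_draws) {
      min(mean(theta_draws <= 0), mean(theta_draws > 0))
    }

    # In the backward step, the variable with the highest Bayesian p-value
    # (the coefficient least clearly separated from zero) is dropped first.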
Figure 3: Simulation study 1: Root mean square error (RMSE) against false discovery rate in the minimal subset variable selection, with one standard deviation error bars. The projpred approach has the smallest RMSE and false discovery rate.
Figure 4: Simulation study 1: Entropy score in the minimal subset variable selection. The projpred approach has a much smaller entropy score than the other approaches.

In Figure 3, the predictive performance is shown for different values of n and ρ in terms of the RMSE and the false discovery rate (FDR, the ratio of the number of non-relevant selected variables over the number of selected variables) of the selected submodel, averaged over 100 data simulations. We see that projpred has superior FDR and the smallest RMSE. Using a reference model improves steplm significantly, but the minimal subset variable selection stays unreliable. Stepwise Bayesian linear regression is better than steplm when comparing both methods with and without a reference model, respectively. However, the minimal subset variable selection of the Bayesian linear regression is still less reliable than projpred.

When the variable selection is repeated with different simulated data sets, there is some variability in the selected variables. We measure the stability of the variable selection by computing the entropy of the observed distribution of the included variables over different models (a sketch of this computation is given below). The smallest entropy would be obtained if the approach always selected the same set of variables, and the largest entropy would be observed if the approach always selected different sets of variables. Therefore, lower entropy corresponds to a more stable selection. Highly correlated predictive variables may happen to be selected alternately, which makes stability estimation of the selection a non-trivial task. Entropy cannot distinguish interchangeability due to correlation from instability; thus, such a measure should be considered a relative, and not an absolute, measure of stability. Figure 4 shows the entropy scores for the different compared methods. The use of a reference model improves the stability of steplm in variable selection slightly, while it makes little difference for the Bayesian linear regression. The projpred approach turns out to be far more stable than all the other methods. This is likely due to projpred being based on a better decision theoretical formulation which 1) takes into account the full predictive distribution and not just a point estimate, and 2) projects the reference model posterior onto the submodel instead of using a simple refit of the submodels.
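The entropy computation is not spelled out in detail above; the following sketch shows one plausible operationalisation over repeated selections.

    # 'selections' is a list with one character vector per simulated data set,
    # giving the names of the variables selected in that run.
    selection_entropy <- function(selections) {
      counts <- table(unlist(selections))
      p <- counts / sum(counts)   # empirical distribution of included variables
      -sum(p * log(p))            # lower entropy = more stable selection
    }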
Figure 5: Body fat example: Stepwise backward selection with and without using a reference model. The x-axis denotes the number of selected irrelevant variables on the left and the out-of-sample RMSE on the right-hand side, based on 100 bootstrap samples. The reference approach reduces the number of noisy variables selected and the out-of-sample RMSE.
Method | RMSE | Selected noisy
Stepwise selection | 6.9 (0.7) | 48 (7)
Reference model + Stepwise selection | 4.7 (0.5) | 25 (7)
Table 3: Body fat example: Means and standard deviations (in brackets) of the results shown in Figure 5.
3.2 Example: Body fat data with noisy variables

Here we repeat the selection of Section 2.1 via stepwise backward regression. In this case, the overall number of variables (original plus noisy) is 100, as in the last part of Section 2.1. We compare results with and without using, together with steplm, the simple reference model approach outlined in (7). Figure 5 shows the number of irrelevant variables included in the final model and the out-of-sample root mean square error (RMSE). Results are based on 100 bootstrap samples of the whole dataset, and the predictive performance is tested on the observations excluded from each bootstrap sample. We observe that the reference model reduces the number of irrelevant variables included in the final model. This leads to less overfitting and thus to improved out-of-sample predictive performance in terms of RMSE. The reference model approach applied to stepwise backward regression achieves remarkable improvements considering its simplicity, yet it does not reach the performance of the much more sophisticated projective prediction approach (see the results of Section 2.1). A sketch of this simple filtering approach is given below.
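The filtering step amounts to replacing the observed target by the reference model's mean predictions before running the standard stepwise search. A simplified sketch, with rstanarm's default priors standing in for the hierarchical prior of model (7), and U denoting the matrix of the first five supervised principal components:

    # Fit a simple reference model on the supervised principal components,
    # then run AIC-based backward selection against its fitted values.
    ref_fit <- rstanarm::stan_glm(d$y ~ U)   # U: n x 5 matrix of supervised PCs
    y_hat <- fitted(ref_fit)                 # posterior predictive means
    d_filtered <- d
    d_filtered$y <- y_hat                    # replace observed target by y_hat
    sel <- step(lm(y ~ ., data = d_filtered), direction = "backward", trace = 0)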
4 Complete variable selection

An alternative to minimal subset variable selection is complete variable selection, in which the goal is to find all relevant variables that have some predictive information about the target. In complete variable selection, it is possible that there are theoretically relevant variables but, given finite noisy data, we are not able to infer their relevance. The projection predictive approach was originally designed for minimal subset variable selection, but we will test a simple iterative variant for the complete variable selection case in this section. In addition, we analyse the benefits of the reference model approach in combination with three other methods which have been specifically designed for complete variable selection. As the criteria of selection performance, we evaluate the average false discovery rate and the average sensitivity (i.e., the ratio of the number of relevant selected variables over the total number of relevant variables); both criteria are sketched in code below. We also provide a comparison of the stability of the selection by means of a stability measure proposed by Nogueira et al. (2017), which ranges from 0 to 1, where a higher value means a more stable selection.
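For concreteness, the two performance criteria can be written as small helper functions; selected and relevant are assumed to be character vectors of variable names.

    # False discovery rate: share of selected variables that are not relevant.
    fdr <- function(selected, relevant) {
      if (length(selected) == 0) return(0)
      mean(!selected %in% relevant)
    }

    # Sensitivity: share of relevant variables that were selected.
    sensitivity <- function(selected, relevant) {
      mean(relevant %in% selected)
    }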
4.1 Iterative projection

We modify the projection predictive approach for complete variable selection by using it iteratively. The straightforward implementation of projpred selects a minimal subset of variables which yields a model with predictive performance comparable to the full model's predictive performance. The iterative projection repeats the projpred selection over several iterations, each time excluding the variables selected in the previous iterations from the search. At each iteration, the selected submodel size corresponds to the one having a predictive performance close enough to the baseline model, which in this iterative version is the submodel with the highest predictive score explored at the current iteration. This translates into the following stopping rule at each iteration:

    min{ i ∈ {0, . . . , p} : P(elpd_i − elpd_best > 0) ≥ α },    (2)

where i indexes the submodel size and "best" stands for the best predictive submodel explored at the considered iteration. The algorithm terminates when the empty model (intercept only) satisfies the stopping rule (see Algorithm 1). The choice of the hyperparameter α is non-trivial, and we have observed sensitivity of the selection to this choice, mainly when using cross-validation with a small number of observations or with not very predictive variables. In our experiments we chose the default value used in the projpred R package, that is, α = 0.32.

Algorithm 1: Automated iterative projections
    Result: R := {selected variables}
    F := {set of all variables}; R := ∅
    Fit reference model
    while F ≠ ∅ do
        projection.HeuristicSearch()
        projection.elpdEstimate()
        S := min{ sub : P(elpd_sub − elpd_base > 0) ≥ α }
        if S = ∅ then
            break
        else
            R := R ∪ S; F := F \ S
        end
    end

A conceptual code sketch of this loop is given at the end of this subsection. In the experiments shown in the next sections, we include an additional iterative method which we refer to as 'iterative lasso'. It consists of the same iterative algorithm as iterative projpred, except that it does not use any reference model but the lasso method for variable selection instead. That is, it uses the observed target values instead of the predictions of the reference model. The comparison with iterative lasso helps to disentangle the effects of the iterative procedure and of the usage of a reference model in complete feature selection.
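The following R sketch captures the structure of Algorithm 1. It is conceptual: select_minimal_subset() is a hypothetical helper that runs one projpred-style selection restricted to the remaining candidate variables and returns the smallest subset satisfying rule (2), or an empty vector when the empty model already satisfies it.

    iterative_projection <- function(ref_fit, variables, alpha = 0.32) {
      remaining <- variables        # F: candidate variables still in the search
      selected  <- character(0)     # R: accumulated selection
      while (length(remaining) > 0) {
        S <- select_minimal_subset(ref_fit, candidates = remaining, alpha = alpha)
        if (length(S) == 0) break   # empty model satisfies the stopping rule
        selected  <- union(selected, S)
        remaining <- setdiff(remaining, S)
      }
      selected
    }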
4.2 Alternative complete selection methods

We consider three alternative complete selection methods: the control of the local false discovery rate (Efron, 2008, 2012), the empirical Bayes median (Johnstone and Silverman, 2004), and selection by posterior credible intervals.

The control of the local false discovery rate consists of testing the z-values {z_j}_{j=1}^p of a normal means problem (explained in Section 4.3) on whether they belong to the theoretical null distribution f_0 (i.e., the null hypothesis H_0, meaning no relevance) against the alternative hypothesis distribution f_1. In our case, f_0 corresponds to the standard normal distribution (see expression (6)). The quantity of interest is the local false discovery rate (loc.fdr), defined as

    loc.fdr(z) = P(H_0 | z) = π_0 f_0(z) / f(z),    (3)

where π_0 is the prior probability of H_0 and f(z) = π_0 f_0(z) + (1 − π_0) f_1(z) is the marginal distribution of the z-values. The latter is estimated using splines with 7 degrees of freedom. We select variables with a local false discovery rate below 0.2, which is suggested by Efron (2012) as it corresponds to a Bayes factor larger than 36 (assuming π_0 ≥ 0.90). To estimate π_0 from the data, we use the default setting provided by the R package locfdr (Efron et al., 2015). A sketch of this selection is given below.
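A minimal sketch using the locfdr package, assuming z is the vector of z-values obtained as in Section 4.3:

    library(locfdr)

    res <- locfdr(z)                   # estimates f(z), pi_0 and the local fdr
    selected <- which(res$fdr < 0.2)   # Efron's suggested threshold of 0.2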
The empirical Bayes median approach consists of fitting a Bayesian model with a prior composed of a mixture of a delta spike at zero and a heavy-tailed distribution. We use the implementation in the R package EbayesThresh (Silverman et al., 2017). As suggested by Johnstone and Silverman (2004), we use a Laplace distribution, resulting in a thresholding property: there exists a threshold value such that all the data under that threshold have posterior median equal to zero. Therefore, the selection is done by selecting only those parameters whose posterior median is different from zero. The hyperparameter of the Laplace distribution and the mixing weight of the prior are estimated by marginal maximum likelihood.

The selection by 90% posterior credible intervals is done using the regularised horseshoe prior (Piironen and Vehtari, 2017b) and selecting those variables whose posterior distribution does not include zero in the interval between the 5% and the 95% quantiles.

All of these methods provide a complete selection approach, and we compare their performance with and without using a reference model. That is, in the data condition, we apply the method to the original data y, while, in the reference model condition, we replace y by the mean predictions ŷ based on the reference model.

4.3 The normal means problem

The iterative projection applies straightforwardly to data, whereas to investigate the performance of the three alternative complete selection approaches, we use simulations based on the normal means problem. The normal means problem consists of estimating the (usually sparse) vector of means of a vector of normally distributed observations. The dimensionality of the vector of means is denoted by p, and {z_j}_{j=1}^p is the vector of observations of the random variables {Z_j}_{j=1}^p. The task is to estimate the latent variables {θ_j}_{j=1}^p of the following model:

    Z_j | θ_j, σ ∼ N(θ_j, σ²) independently, j = 1, . . . , p.    (4)

This is equivalent to a linear regression where the design matrix is the identity matrix, with the number of variables being equal to the number of observations. This formulation can be found in practice, for example, in the analysis of microarray data, where a large set of genes is tested in two groups of patients labelled as positive or negative for some disease (Efron, 2008, 2012). The objective of the analysis is to select the subset of genes statistically relevant to the disease. One common way to proceed is to compute the two-sample t-statistic for every gene separately. After normalising these statistics, they become the data Z in the normal means problem (4). For further details, see the examples by Efron (2008, 2012).

In our experiments, we obtain the normal means problem from the sample correlations between the target and the variables using the Fisher z-transformation (Hawkins, 1989). Suppose we have a continuous target random variable Y and a set of p continuous variables {X_j}_{j=1}^p, and denote ρ_j = Cor(Y, X_j). Further, suppose we have observed n statistical units, and define r_j as the sample correlation between the observations of the target variable {y_i}_{i=1}^n and the j-th variable {x_ij}_{i=1}^n. Finally, we refer to the Fisher z-transformation function tanh⁻¹(·) as T_F(·).
Assuming each pair (Y, X_j) to be bivariate normally distributed, the corresponding transformed correlations are approximately normally distributed with known variance:

    T_F(r_j) ∼ N( T_F(ρ_j), 1/(n − 3) ) independently, j = 1, . . . , p.    (5)

Therefore, rescaling the quantities T_F(r_j) by √(n − 3) to obtain z_j = √(n − 3) T_F(r_j), we have the formulation (4) of the normal means problem, this time with unit variance:

    Z_j | θ_j ∼ N(θ_j, 1) independently, j = 1, . . . , p.    (6)

In this case, the quantities of interest θ_j are equal to √(n − 3) T_F(ρ_j). A code sketch of this transformation is given below.

In our simulations, we use two different levels of correlation ρ and numbers of observations n ∈ {50, 70, 100}. The total number of variables p and the number of relevant variables k are kept fixed. The smaller ρ and n are, the more challenging the variable selection is.
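The mapping from sample correlations to the unit-variance normal means problem in (5)-(6) is short in R; here x is the n × p matrix of variables and y the target vector.

    # Fisher z-transformation T_F = atanh; rescaling by sqrt(n - 3) gives
    # z_j ~ N(theta_j, 1) approximately, with theta_j = sqrt(n - 3) * atanh(rho_j).
    n <- nrow(x)
    z <- sqrt(n - 3) * atanh(cor(x, y))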
Figure 6: Simulation study 2: Complete variable selection sensitivity against false discovery rate, based on 100 data simulations, with one standard deviation error bars. The reference approach improves sensitivity and reduces the false discovery rate for all methods.
Figure 7: Simulation study 2: Complete variable selection stability estimates with 95% intervals, based on 100 data simulations. The reference approach improves stability for all methods.

For this example, Piironen et al. (2020) proposed to use a reference model which is a Bayesian linear regression using the first five supervised principal components (SPC) as variables, imposing a hierarchical prior on their coefficients:

    Y_i | β, σ, u_i ∼ N(u_i^T β, σ²) independently, i = 1, . . . , n,
    β_j | τ ∼ N(0, τ²) independently, j = 1, . . . , 5,    (7)
    τ ∼ t⁺(0, s²_max), σ ∼ t⁺(0, ·).

In the above, u_ij represents the j-th SPC evaluated at observation i, t⁺ denotes a half-Student-t distribution, and s_max denotes the sample standard deviation of the largest SPC. The SPCs are computed using the R package dimreduce (https://github.com/jpiironen/dimreduce), with the screening threshold parameter set to a fixed value. In our experiments, the results are not sensitive to the specific choice of the screening threshold, yet a more principled approach would be to use cross-validation to select the threshold, as done by Piironen et al. (2020). A simplified code sketch of such an SPC reference model is given below.

Figure 6 shows the average sensitivity on the vertical axis and the average false discovery rate on the horizontal axis, based on 100 data simulations, for the different combinations of n and ρ. The best selection performance is in the top-left corner of each plot, as it implies the lowest false discovery rate and the highest sensitivity. We see that, regardless of the applied selection method, the use of a reference model improves the selection performance, as it reduces the false discovery rate (shifting to the left) or increases the sensitivity (shifting upwards). In accordance with what can be expected, the larger the data set size (n) and the higher the true correlations (ρ), the easier the selection is. Thus, for easier selection scenarios, the benefits of the reference model are smaller, since the raw data already provide enough information to identify the relevant variables. The iterative projpred has a good false discovery rate in all cases, and its sensitivity is good except when the number of observations and the correlation level are small. It performs better than the iterative lasso selection in all of the simulated scenarios. Tuning α might lead to improved selection, yet the main source of poor performance could well be the miscalibration of the uncertainty of the predictive utility score for misspecified models with cross-validation (Bengio and Grandvalet, 2004).

Figure 7 shows the estimates of the stability measure proposed by Nogueira et al. (2017), with 95% confidence intervals based on 100 simulations. This measure takes into account the variability of the subset of selected variables at each simulation (originally at each bootstrap sample), modelling the selection of each variable as a Bernoulli process. Further details are available in Nogueira et al. (2017). The reference model helps in improving the stability of the selection: again, the benefits are bigger when the problem is more difficult (small n and ρ). In addition, we observe less uncertainty in the stability estimates for the reference approach (i.e., smaller widths of the 95% intervals), which can also be connected to the overall stability of the procedure. As in Figure 6, the iterative projection does not perform well in the hardest scenarios.
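A simplified stand-in for the SPC reference model (7): supervised principal components are approximated here by screening on univariate correlations followed by ordinary PCA (Bair et al., 2006), and rstanarm's default priors replace the hierarchical prior of (7).

    # Screen variables by absolute correlation with y, then take the first
    # five principal components of the screened block as predictors.
    gamma <- 0.2                                   # placeholder screening threshold
    screened <- abs(drop(cor(x, y))) >= gamma
    u <- prcomp(x[, screened], scale. = TRUE)$x[, 1:5]

    ref_fit <- rstanarm::stan_glm(y ~ u)           # simplified stand-in for (7)
    y_hat   <- fitted(ref_fit)                     # posterior predictive means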
Figure 8: Body fat example with noisy variables: Complete variable selection sensitivity against false discovery rate, based on 100 bootstrap samples, with one standard deviation error bars. The improvement from using the reference approach is small (except that projpred is much better than lasso).
Figure 9: Body fat example with noisy variables: Complete variable selection stability estimates with 95% confidence intervals, based on 100 bootstrap samples. The improvement from using the reference approach is small (except that projpred is much better than lasso).

4.4 Example: Body fat data with noisy variables

We conclude our complete selection experiments using the body fat dataset one more time. As earlier, we add noisy uncorrelated variables to the original data to get a total of 100 variables. Since we do not have a ground truth available with regard to the original variables of the data, we assume it is reasonable to consider all of them relevant, at least to some degree. The artificially added variables are naturally irrelevant by construction. We compute correlations between each variable and the target variable, that is, the amount of body fat, and transform them by the Fisher z-transformation. The original assumption in order for (5) to hold is that the variables are jointly normally distributed. In our experience, the normal approximation in (5) is still reasonable after rescaling by √(n − 3). We consider bootstrapped subsamples of increasing size, from n =
50 up to n = 251 (i.e., the full size of the data). For each condition, results are averaged over 100 bootstrap samples of the respective size.

Figure 8 shows the sensitivity against the false discovery rate. In almost all of the bootstrapped subsamples, the reference model improves the selection both in terms of sensitivity and false discovery rate.

5 Conclusion

In this paper, we demonstrated the benefits of using a reference model to improve variable selection or, more generally, model reduction. We have motivated and explained the general benefits of a reference model regardless of the method it is applied in combination with. Specifically, we have seen how the reference model acts as an approximation of the data generation mechanism through its predictive distribution. Such an approximation is generally less noisy than the sample estimate available purely from the observed data, which leads to the main benefits of the reference model approach. In our comparisons, we have analysed the effect of a reference model in the form of a filter on the observed target values on top of different widely used variable selection methods. Overall, using a reference model leads to more accurate and stable selection results independently of the specific selection method. These benefits apply to a large family of different methods, all involving a reference model in one way or the other. Some of these approaches have been present in the literature for some time (e.g., see references in Vehtari and Ojanen, 2012; Piironen et al., 2020), but often without a clear explanation of why they are actually favourable and how they connect to other related approaches. We hope that the present paper can fill some of these gaps by providing a unifying framework and understanding of reference models.

We argue that, whenever it is possible to construct a reasonable reference model, it should be employed on top of the preferred selection procedure or as an integral part of more complex methods, for example, the projective prediction approach (Piironen et al., 2020). Note that one of the main challenges in many real world applications will consist in devising a sensible reference model itself and assessing its predictive performance. To build good predictive reference models, which are specifically tuned to the data and problem at hand, we recommend developing them using a robust Bayesian modelling workflow, for instance, as outlined by Gelman et al. (2013) and Gabry et al. (2019).

Another main result of this paper is that the projective prediction approach shows superior performance in minimal subset variable selection compared to alternative methods, whether or not these methods make use of a reference model. That is, while the reference model is certainly one important aspect of the projective prediction approach, it is not the only reason for its superior performance. Rather, by incorporating the full uncertainty of the posterior predictive distribution into the variable selection procedure (instead of just using point estimates) and using a principled cross-validation method, projective prediction combines several desirable properties into a single procedure (Piironen et al., 2020). In summary, we would strongly recommend using projective predictions for minimal subset variable selection if possible and feasible. However, if this is not an option in a given situation, we would in any case recommend using a reference model on top of the chosen variable selection method.

The projective prediction approach was not designed for complete variable selection.
We tested a simple iterative version of projpred with mixed results: the methods specifically designed for complete variable selection (especially loc.fdr) performed better in our experiments. It is left for future research to develop a better projective prediction approach for complete variable selection problems.

All Bayesian models in this paper have been implemented in the probabilistic programming language
Stan (Carpenter et al., 2017) and fit via dynamic Hamiltonian Monte Carlo (Hoffman and Gelman, 2014; Betancourt, 2017), through the R packages rstan (Stan Development Team, 2019) and rstanarm (Goodrich et al., 2019). Graphics have been produced using ggplot2 (Wickham, 2016) and the tidyverse framework (Wickham et al., 2019). The code to run all the experiments is available on GitHub (https://github.com/fpavone/ref-approach-paper).

Acknowledgments
We thank Alejandro Catalina Feliu for help with the experiments, and the Academy of Finland (grants 298742 and 313122), the Finnish Center for Artificial Intelligence, and the Technology Industries of Finland Centennial Foundation (grant 70007503; Artificial Intelligence for Research and Development) for partial support of this research. We also acknowledge the computational resources provided by the Aalto Science-IT project.
References
Hirotugu Akaike. A new look at the statistical model identification. In Selected Papers of Hirotugu Akaike, pages 215–222. Springer, 1974.

Eric Bair, Trevor Hastie, Debashis Paul, and Robert Tibshirani. Prediction by supervised principal components. Journal of the American Statistical Association, 101(473):119–137, 2006.

Yoshua Bengio and Yves Grandvalet. No unbiased estimator of the variance of k-fold cross-validation. Journal of Machine Learning Research, 5(Sep):1089–1105, 2004.

José M Bernardo and Adrian FM Smith. Bayesian Theory. John Wiley & Sons, 2009.

Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.

Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Jérome A Dupuis and Christian P Robert. Variable selection in qualitative models via an entropic explanatory power. Journal of Statistical Planning and Inference, 111(1-2):77–94, 2003.

Bradley Efron. Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23(1):1–22, 2008.

Bradley Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press, 2012.

Bradley Efron, Brit Turnbull, and Balasubramanian Narasimhan. locfdr: Computes Local False Discovery Rates, 2015. URL https://CRAN.R-project.org/package=locfdr. R package version 1.1-8.

David Faraggi, Michael LeBlanc, and John Crowley. Understanding neural networks using regression trees: an application to multiple myeloma survival data. Statistics in Medicine, 20(19):2965–2976, 2001.

Jonah Gabry, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. Visualization in Bayesian workflow. Journal of the Royal Statistical Society: Series A (Statistics in Society), 182(2):389–402, 2019.

Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2013.

Ben Goodrich, Jonah Gabry, Imad Ali, and Sam Brilleman. rstanarm: Bayesian applied regression modeling via Stan, 2019. URL https://mc-stan.org/rstanarm. R package version 2.19.3.

Frank E Harrell. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer, 2015.

DL Hawkins. Using U statistics to derive the asymptotic distribution of Fisher's Z statistic. The American Statistician, 43(4):235–237, 1989.

Georg Heinze, Christine Wallisch, and Daniela Dunkler. Variable selection – a review and recommendations for the practicing statistician. Biometrical Journal, 60(3):431–449, 2018.

Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

Roger W Johnson. Fitting percentage of body fat to simple body measurements. Journal of Statistics Education, 4(1), 1996.

Iain M Johnstone and Bernard W Silverman. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4):1594–1649, 2004.

Dennis V Lindley. The choice of variables in multiple regression. Journal of the Royal Statistical Society: Series B (Methodological), 30(1):31–53, 1968.

Sarah Nogueira, Konstantinos Sechidis, and Gavin Brown. On the stability of feature selection algorithms. The Journal of Machine Learning Research, 18(1):6345–6398, 2017.

Debashis Paul, Eric Bair, Trevor Hastie, and Robert Tibshirani. "Preconditioning" for feature selection and regression in high-dimensional problems. The Annals of Statistics, 36(4):1595–1618, 2008.

Juho Piironen and Aki Vehtari. Comparison of Bayesian predictive methods for model selection. Statistics and Computing, 27(3):711–735, 2017a.

Juho Piironen and Aki Vehtari. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018–5051, 2017b.

Juho Piironen and Aki Vehtari. Iterative supervised principal components. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the 21st International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 106–114, 2018.

Juho Piironen, Markus Paasiniemi, and Aki Vehtari. projpred: Projection Predictive Feature Selection, 2019. URL http://mc-stan.org/projpred, http://discourse.mc-stan.org/.

Juho Piironen, Markus Paasiniemi, and Aki Vehtari. Projective inference in high-dimensional problems: prediction and feature selection. Electronic Journal of Statistics, 2020. Accepted for publication.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018.

Bernard W Silverman et al. EbayesThresh: Empirical Bayes Thresholding and Related Methods, 2017. URL https://CRAN.R-project.org/package=EbayesThresh. R package version 1.4-12.

Stan Development Team. RStan: the R interface to Stan, 2019. URL http://mc-stan.org/. R package version 2.19.2.

Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Aki Vehtari and Janne Ojanen. A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6:142–228, 2012.

Aki Vehtari, Andrew Gelman, and Jonah Gabry. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5):1413–1432, 2017.

William N Venables and Brian D Ripley. Modern Applied Statistics with S-PLUS. Springer Science & Business Media, 2013.

Hadley Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016. ISBN 978-3-319-24277-4. URL https://ggplot2.tidyverse.org.

Hadley Wickham, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D'Agostino McGowan, Romain François, Garrett Grolemund, Alex Hayes, Lionel Henry, Jim Hester, Max Kuhn, Thomas Lin Pedersen, Evan Miller, Stephan Milton Bache, Kirill Müller, Jeroen Ooms, David Robinson, Dana Paige Seidel, Vitalie Spinu, Kohske Takahashi, Davis Vaughan, Claus Wilke, Kara Woo, and Hiroaki Yutani. Welcome to the tidyverse. Journal of Open Source Software, 4(43):1686, 2019. doi: 10.21105/joss.01686.
Appendix
Appendix A: Projective Predictions
The projective prediction (projpred) approach was developed and is thoroughly described by Piironen et al. (2020). In this appendix, we provide a high-level description of the method, so that readers do not need to study the paper by Piironen et al. (2020) in detail to understand the main ideas behind projpred.

The parameter distribution of a given candidate submodel is denoted by π and the induced predictive distribution by q_π(ỹ). We would like to choose π so that q_π(ỹ) maximises some predictive performance utility, for example the expected log-predictive density (elpd), defined as

    elpd[q_π] = ∫ log q_π(ỹ) p_t(ỹ) dỹ,    (8)

where p_t(ỹ) denotes the (usually unknown) true generating mechanism of future data ỹ. If we refer to the posterior predictive distribution of a reference model as p(ỹ | D), where D stands for the data we conditioned on, we can approximate (8) using p(ỹ | D) instead of the true data generation mechanism p_t(ỹ). The maximisation of the elpd using the reference model's predictive distribution is equivalent to the minimisation of the Kullback-Leibler (KL) divergence from the reference model's predictive distribution to the submodel's predictive distribution:

    max_π ∫ log q_π(ỹ) p(ỹ | D) dỹ  ⇔  min_π KL[ p(ỹ | D) || q_π(ỹ) ].    (9)

The term on the right-hand side of Equation (9) describes what is referred to as the projection of the predictive distribution, which is the general idea behind the projection predictive approach (see Piironen et al., 2020). We now summarise the workflow of the projection predictive approach in the particular case of the draw-by-draw projection (original formulation by Dupuis and Robert, 2003), following Piironen et al. (2020). Suppose we have observed n statistical units with target values {y_i}_{i=1}^n and a set of observed variables for which we want to obtain a minimally relevant subset. Then, the main steps are the following:

1. Devise and fit a reference model. Let {θ*^s}_{s=1}^S be the set of S draws from the reference model's posterior.

2. Rank the variables according to their relevance using some heuristic, and consider as candidate submodels only those which preserve this order, starting from the one including only the highest ranked variable. The submodels are then naturally identified by their size. This step is not strictly necessary, but it reduces the number of submodels to be considered in the following steps and thus reduces the computation time.

3. For each submodel π selected in Step 2, project each of the reference model's posterior draws θ*^s as follows:

    θ⊥^s = argmin_{θ^s ∈ Θ} (1/n) Σ_{i=1}^n KL[ p(ỹ_i | θ*^s) || q_π(ỹ_i | θ^s) ],    (10)

where p(ỹ_i | θ*^s) stands for the predictive distribution of the reference model with parameters fixed at θ*^s, conditioning on all the variable values of statistical unit i, whereas q_π(ỹ_i | θ^s) is the predictive distribution of the submodel. The projected draws {θ⊥^s}_{s=1}^S then form a sample from the projected posterior of the submodel.
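In practice, these steps are automated by the projpred package; the following sketch (with an assumed data frame df and target y) shows a typical call sequence.

    library(rstanarm)
    library(projpred)

    # Step 1: reference model with a sparsifying prior
    ref_fit <- stan_glm(y ~ ., data = df, prior = hs())

    # Steps 2-3: search over submodels, projecting the posterior draws, with
    # cross-validated elpd estimates for each submodel size
    cvs <- cv_varsel(ref_fit)

    # Smallest submodel whose elpd is close to that of the reference model
    suggest_size(cvs)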