SLOPE - Adaptive variable selection via convex optimization
Małgorzata Bogdan, Ewout van den Berg, Chiara Sabatti, Weijie Su, Emmanuel J. Candès
The Annals of Applied Statistics, 2015, Vol. 9, No. 3, 1103–1140. Institute of Mathematical Statistics, 2015.
By Małgorzata Bogdan, Ewout van den Berg, Chiara Sabatti, Weijie Su and Emmanuel J. Candès

Wrocław University of Technology, IBM T.J. Watson Research Center, and Stanford University

We introduce a new estimator for the vector of coefficients β in the linear model y = Xβ + z, where X has dimensions n × p with p possibly larger than n. SLOPE, short for Sorted L-One Penalized Estimation, is the solution to

    min_{b ∈ R^p} (1/2)‖y − Xb‖²_{ℓ2} + λ_1|b|_(1) + λ_2|b|_(2) + · · · + λ_p|b|_(p),

where λ_1 ≥ λ_2 ≥ · · · ≥ λ_p ≥ 0 and |b|_(1) ≥ |b|_(2) ≥ · · · ≥ |b|_(p) are the decreasing absolute values of the entries of b. This is a convex program and we demonstrate a solution algorithm whose computational complexity is roughly comparable to that of classical ℓ1 procedures such as the Lasso. Here, the regularizer is a sorted ℓ1 norm, which penalizes the regression coefficients according to their rank: the higher the rank (that is, the stronger the signal), the larger the penalty. This is similar to the Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289–300] procedure (BH), which compares more significant p-values with more stringent thresholds. One notable choice of the sequence {λ_i} is given by the BH critical values λ_BH(i) = z(1 − i · q/(2p)), where q ∈ (0,
1) and z(α) is the quantile of a standard normal distribution. SLOPE aims to provide finite sample guarantees on the selected model; of special interest is the false discovery rate (FDR),
defined as the expected proportion of irrelevant regressors among all selected predictors. Under orthogonal designs, SLOPE with λ_BH provably controls FDR at level q. Moreover, it also appears to have appreciable inferential properties under more general designs X while having substantial power, as demonstrated in a series of experiments running on both simulated and real data.

Received May 2014; revised February 2015.

Supported in part by a Fulbright Scholarship, NSF Grant DMS-10-43204 and the European Union's 7th Framework Programme for research, technological development and demonstration under Grant Agreement no. 602552. Supported in part by NSF Grant DMS-09-06812 (American Reinvestment and Recovery Act). Supported in part by NIH Grants HG006695 and MH101782. Supported in part by a General Wang Yaowu Stanford Graduate Fellowship. Supported in part by AFOSR under Grant FA9550-09-1-0643, by ONR under Grant N00014-09-1-0258 and by a gift from the Broadcom Foundation.

Key words and phrases. Sparse regression, variable selection, false discovery rate, Lasso, sorted ℓ1 penalized estimation (SLOPE).

Introduction.
Analyzing and extracting information from data sets where the number of observations n is smaller than the number of variables p is one of the challenges of the present "big-data" world. In response, the statistics literature of the past two decades documents the development of a variety of methodological approaches to address this challenge. A frequently discussed problem is that of linking, through a linear model, a response variable y to a set of predictors {X_j} taken from a very large family of possible explanatory variables. In this context, the Lasso [Tibshirani (1996)] and the Dantzig selector [Candes and Tao (2007)], for example, are computationally attractive procedures offering some theoretical guarantees, and with consequent widespread application. In spite of this, there are some scientific problems where the outcome of these procedures is not entirely satisfying, as they do not come with a machinery allowing us to make inferential statements on the validity of selected models in finite samples. To illustrate this, we resort to an example.

Consider a study where a geneticist has collected information about n individuals by having identified and measured all p possible genetic variants in a genomic region. The geneticist wishes to discover which variants cause a certain biological phenomenon, such as an increase in blood cholesterol level. Measuring cholesterol levels in a new individual is cheaper and faster than scoring his or her genetic variants, so that predicting y in future samples given the value of the relevant covariates is not an important goal. Instead, correctly identifying functional variants is relevant. A genetic polymorphism correctly implicated in the determination of cholesterol levels points to a specific gene and to a biological pathway that might not be previously known to be related to blood lipid levels and, therefore, promotes an increase in our understanding of biological mechanisms, as well as providing targets for drug development. On the other hand, the erroneous discovery of an association between a genetic variant and cholesterol levels will translate to a considerable waste of time and money, which will be spent in trying to verify this association with direct manipulation experiments. It is worth emphasizing that some of the genetic variants in the study have a biological effect while others do not: there is a ground truth that statisticians can aim to discover. To be able to share the results with the scientific community in a convincing manner, the researcher needs to be able to attach some finite sample confidence statements to his/her findings. In a more abstract language, our geneticist would need a tool that privileges correct model selection over minimization of prediction error, and would allow for inferential statements to be made on the validity of his/her selections. This paper presents a new methodology that attempts to address some of these needs.

We imagine that the n-dimensional response vector y is truly generated by a linear model of the form

    y = Xβ + z,

with X an n × p design matrix, β a p-dimensional vector of regression coefficients, and z an n-dimensional vector of errors; only a few relevant variables (those with β_i ≠ 0) are measured in addition to a large number of irrelevant ones. As any statistician knows, these assumptions are quite restrictive, but they are a widely accepted starting point.
To formalize our goal, namely, the selection of important variables accompanied by a finite sample confidence statement, we seek a procedure that controls the expected proportion of irrelevant variables among the selected. In a scientific context where selecting a variable corresponds to making a discovery, we aim at controlling the False Discovery Rate (FDR). The FDR is of course a well-recognized measure of global error in multiple testing and effective procedures to control it are available: indeed, the Benjamini and Hochberg (1995) procedure (BH) inspired the present proposal. The connection between multiple testing and model selection has been made before [see, e.g., Bauer, Pötscher and Hackl (1988), Foster and George (1994), Abramovich and Benjamini (1995), Abramovich et al. (2006) and Bogdan, Ghosh and Żak-Szatkowska (2008)] and others in recent literature have tackled the challenges encountered by our geneticists: we will discuss the differences between our approach and others in later sections as appropriate. The procedure we introduce in this paper is, however, entirely new. Variable selection is achieved by solving a convex problem not previously considered in the statistical literature, and which marries the advantages of ℓ1 penalization with the adaptivity inherent in strategies like BH.

Section 1 of this paper introduces SLOPE, our novel penalization strategy, motivates its construction in the context of orthogonal designs, and places it in the context of current knowledge of effective model selection strategies. Section 2 describes the algorithm we developed and implemented to find SLOPE estimates. Section 3 showcases the application of our novel procedure in a variety of settings: we illustrate how it effectively solves a multiple testing problem with positively correlated test statistics; we discuss how regularizing parameters should be chosen in nonorthogonal designs; we investigate the robustness of SLOPE to some violations of model assumptions and we apply it to a genetic data set, not unlike our idealized example. Section 4 concludes the paper with a discussion comparing our methodology to other recently introduced proposals as well as outlining open problems.
1. Sorted L-One Penalized Estimation (SLOPE).
1.1. Adaptive penalization and multiple testing in orthogonal designs.
To build intuition behind SLOPE, which encompasses our proposal for model selection in situations where p > n, we begin by considering the case of orthogonal designs and i.i.d. Gaussian errors with known standard deviation, as this makes the connection between model selection and multiple testing natural. Since the design is orthogonal, X′X = I_p, and the regression y = Xβ + z with z ∼ N(0, σ²I_n) can be recast as

    ỹ = X′y = X′Xβ + X′z = β + X′z ∼ N(β, σ²I_p).    (1.1)

In some sense, the problem of selecting the correct model reduces to the problem of testing the p hypotheses H_{0,j}: β_j = 0 versus two-sided alternatives H_{1,j}: β_j ≠ 0. When p is large, a multiple comparison correction strategy is called for and we consider two popular procedures:

• Bonferroni's method.
To control the familywise error rate (FWER) at level α ∈ [0, 1], reject H_{0,j} if |ỹ_j|/σ > Φ⁻¹(1 − α/(2p)), where Φ⁻¹(α) is the αth quantile of the standard normal distribution. Hence, Bonferroni's method defines a comparison threshold that depends only on the number of covariates, p, and the noise level.

• Benjamini–Hochberg step-up procedure.
To control the FDR at level q ∈ [0, 1], order the entries of ỹ in decreasing order of magnitude,

    |ỹ|_(1) ≥ |ỹ|_(2) ≥ · · · ≥ |ỹ|_(p),

which yields corresponding ordered hypotheses H_(1), . . . , H_(p). [Note that here, as in the rest of the paper, (1) indicates the largest element of a set, instead of the smallest. This breaking with common convention allows us to keep (1) as the index for the most "interesting" hypothesis.] Then BH rejects all hypotheses H_(i) for which i ≤ i_BH, where i_BH is defined by

    i_BH = max{i : |ỹ|_(i)/σ ≥ Φ⁻¹(1 − q_i)},  q_i = i · q/(2p)    (1.2)

(with the convention that i_BH = 0 if the set above is empty). Letting V (resp., R) be the total number of false rejections (resp., total number of rejections), Benjamini and Hochberg (1995) showed that for BH

    FDR = E[V/(R ∨ 1)] = q · p_0/p,    (1.3)

where p_0 is the number of true null hypotheses, p_0 := |{i : β_i = 0}| = p − ‖β‖_{ℓ0}. Recall that the FWER is the probability of at least one false rejection.

In contrast to Bonferroni's method, BH is an adaptive procedure in the sense that the threshold for rejection |ỹ|_(i_BH) is defined in a data-dependent fashion, and is sensitive to the sparsity and magnitude of the true signals. In a setting where there are many large β_j's, the last selected variable needs to pass a far less stringent threshold than it would in a situation where no β_j is truly different from 0. It has been shown in a variety of papers [see, e.g., Abramovich et al. (2006), Bogdan et al. (2011), Wu and Zhou (2013), Frommlet and Bogdan (2013)] that this behavior allows BH to adapt to the unknown signal sparsity, resulting in some important asymptotic optimality properties.

We now consider how the Lasso would behave in this setting. The solution to

    min_{b ∈ R^p} (1/2)‖y − Xb‖²_{ℓ2} + λ‖b‖_{ℓ1}    (1.4)

in the case of orthogonal designs is given by soft thresholding. In particular, the Lasso estimate β̂_j is not zero if and only if |ỹ_j| > λ. That is, variables are selected using a nonadaptive threshold λ. Mindful of the costs associated with the selection of irrelevant variables, we can control the FWER by setting λ_Bonf = σ · Φ⁻¹(1 − α/(2p)) ≈ σ · √(2 log p). [For large t, we have 1 − Φ(t) = t⁻¹φ(t)(1 + o(1)), where φ(·) denotes the density of N(0, 1); the approximation applies to α/(2p) for a fixed value of α, say, α = 0.05, and a large value of p.] This choice, however, is likely to result in a loss of power, and may not strike the right balance between errors of type I and missed discoveries. Choosing a value of λ substantially smaller than λ_Bonf in a nondata dependent fashion would lead to a loss not only of FWER control, but also of FDR control since FDR and FWER are identical measures under the global null in which all our variables are irrelevant. Another strategy is to use cross-validation. However, this data-dependent approach for selecting the regularization parameter λ targets the minimization of prediction error, and does not offer guarantees with respect to model selection (see Section 1.3.3). Our idea to achieve adaptivity, thereby increasing power while controlling some form of type-one error, is to break the monolithic penalty λ‖β‖_{ℓ1}, which treats every variable in the same manner. Set

    λ_BH(i) := Φ⁻¹(1 − q_i),  q_i = i · q/(2p),

and consider the following program:

    min_{b ∈ R^p} (1/2)‖y − Xb‖²_{ℓ2} + σ · Σ_{i=1}^p λ_BH(i)|b|_(i),    (1.5)

where |b|_(1) ≥ |b|_(2) ≥ · · · ≥ |b|_(p) are the order statistics of the absolute values of the coordinates of b: in (1.5) different variables receive different levels of penalization depending on their relative importance.
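To make the recipe concrete, here is a minimal R sketch of the λ_BH sequence and of the step-up rule (1.2) under the orthogonal model (1.1); the function and variable names are ours.

    # lambda_BH(i) = Phi^{-1}(1 - i*q/(2p)) for i = 1, ..., p
    lambda_bh <- function(p, q) qnorm(1 - (1:p) * q / (2 * p))

    # BH step-up on ytilde ~ N(beta, sigma^2 I): indices of rejected H_{0,j}
    bh_stepup <- function(ytilde, sigma, q) {
      p <- length(ytilde)
      ord <- order(abs(ytilde), decreasing = TRUE)   # |y|_(1) >= |y|_(2) >= ...
      passed <- which(abs(ytilde)[ord] / sigma >= lambda_bh(p, q))
      i_bh <- if (length(passed) == 0) 0 else max(passed)   # i_BH of (1.2)
      ord[seq_len(i_bh)]
    }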
Fig. 1. FDR of (1.5) in an orthogonal setting in which n = p = 5000. Straight lines correspond to q · p_0/p, marked points indicate the average False Discovery Proportion (FDP) across 500 replicates, and bars correspond to ±2 SE.

While the similarities of (1.5) with BH are evident, the solution to (1.5) is not a series of scalar-thresholding operations: the procedures are not, even in this case of orthogonal variables, exactly equivalent. Nevertheless, an upper bound on FDR proved in the supplementary appendix [Bogdan et al. (2015)] can still be assured.

Theorem 1.1.
In the linear model with orthogonal design X and z ∼ N(0, σ²I_n), the procedure (1.5) rejecting hypotheses for which β̂_j ≠ 0 has an FDR obeying

    FDR = E[V/(R ∨ 1)] ≤ q · p_0/p.    (1.6)

Figure 1 illustrates the FDR achieved by (1.5) in simulations using a 5000 × 5000 orthogonal design matrix X and nonzero regression coefficients equal to 5√(2 log p).

We conclude this section with several remarks describing the properties of our procedure under orthogonal designs:

1. While the λ_BH(i)'s are chosen with reference to BH, (1.5) is neither equivalent to the step-up procedure described above nor to the step-down version. The step-down version rejects H_(1), . . . , H_(i₀ − 1), where i₀ is the first time at which |ỹ|_(i)/σ ≤ Φ⁻¹(1 − q_i).
2. The proposal (1.5) is sandwiched between the step-down and step-up procedures in the sense that it rejects at most as many hypotheses as the step-up procedure and at least as many as the step-down cousin, also known to control the FDR [Sarkar (2002)].

3. The fact that (1.5) controls FDR is not a trivial consequence of this sandwiching.

The observations above reinforce the fact that (1.5) is different from the procedure known as
FDR thresholding developed by Abramovich and Benjamini (1995) in the context of wavelet estimation and later analyzed in Abramovich et al. (2006). With t_FDR = |ỹ|_(i_BH), FDR thresholding sets

    β̂_i = ỹ_i if |ỹ_i| ≥ t_FDR, and β̂_i = 0 if |ỹ_i| < t_FDR.    (1.7)

This is a hard-thresholding estimate but with a data-dependent threshold: the threshold decreases as more components are judged to be statistically significant. It has been shown that this simple estimate is asymptotically minimax throughout a range of sparsity classes [Abramovich et al. (2006)]. Our method is similar in the sense that it also chooses an adaptive threshold reflecting the BH procedure. However, it does not produce a hard-thresholding estimate. Rather, owing to the nature of the sorted ℓ1 norm, it outputs a sort of soft-thresholding estimate. A substantial difference is that FDR thresholding (1.7) is designed specifically for orthogonal designs, whereas the formulation (1.5) can be employed for arbitrary design matrices, leading to efficient algorithms. Aside from algorithmic issues, the choice of the λ sequence is, however, generally challenging.

1.2. SLOPE.
While orthogonal designs have helped us define the program (1.5), this penalized estimation strategy is clearly applicable in more general settings. To make this explicit, it is useful to introduce the sorted ℓ1 norm: letting λ ≠ 0 be a nonincreasing sequence of nonnegative scalars,

    λ_1 ≥ λ_2 ≥ · · · ≥ λ_p ≥ 0,    (1.8)

we define the sorted-ℓ1 norm of a vector b ∈ R^p as

    J_λ(b) = λ_1|b|_(1) + λ_2|b|_(2) + · · · + λ_p|b|_(p).    (1.9)

Proposition 1.2.
The functional (1.9) is a norm provided (1.8) holds. Observe that when all the λ_i's take on an identical positive value, the sorted ℓ1 norm reduces to the usual ℓ1 norm (up to a multiplicative factor). Also, when λ_1 > 0 and λ_2 = · · · = λ_p = 0, the sorted ℓ1 norm reduces to the ℓ∞ norm (again, up to a multiplicative factor).
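In code, the sorted ℓ1 norm is one line; a small R sketch (names ours), with the two special cases above as sanity checks:

    # J_lambda(b) of (1.9); lambda nonincreasing and nonnegative as in (1.8)
    sorted_l1_norm <- function(b, lambda) {
      sum(lambda * sort(abs(b), decreasing = TRUE))
    }

    b <- c(-3, 1, 2)
    sorted_l1_norm(b, rep(1, 3))    # 6 = l1 norm of b
    sorted_l1_norm(b, c(1, 0, 0))   # 3 = l-infinity norm of b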
The proof of Proposition 1.2 is provided in the supplementary appendix [Bogdan et al. (2015)]. Now define SLOPE as the solution to

    minimize (1/2)‖y − Xb‖²_{ℓ2} + Σ_{i=1}^p λ_i|b|_(i).    (1.10)

As a convex program, SLOPE is tractable: as a matter of fact, we shall see in Section 2 that its computational cost is roughly the same as that of the Lasso. Just as the sorted ℓ1 norm is an extension of the ℓ1 norm, SLOPE can also be viewed as an extension of the Lasso. SLOPE's general formulation, however, allows us to achieve the adaptivity we discussed earlier. The case of orthogonal regressors suggests one particular choice of a λ sequence and we will discuss others in later sections.

1.3. Relationship to other model selection strategies.
Our purpose is to bring the program (1.10) to the attention of the statistical community: this is a computationally tractable proposal for which we provide robust algorithms; it is very similar to BH when the design is orthogonal, and has promising properties in terms of FDR control for general designs. We now compare it with two other commonly used approaches to model selection: methods based on the minimization of ℓ0 penalties and the adaptive Lasso. We discuss these here because they allow us to emphasize the motivation and characteristics of the SLOPE algorithm. We also note that the last few years have witnessed a substantive push toward the development of an inferential framework after selection [see, e.g., Benjamini and Yekutieli (2005), Berk et al. (2013), Bühlmann (2013), Efron (2011), Javanmard and Montanari (2014a, 2014b), Lockhart et al. (2014), Meinshausen and Bühlmann (2010), Meinshausen, Meier and Bühlmann (2009), van de Geer et al. (2014), Wasserman and Roeder (2009), Zhang and Zhang (2014)], with the exploration of quite different viewpoints. We will comment on the relationships between SLOPE and some of these methods, developed while editing this work, in the discussion section.

1.3.1. Methods based on ℓ0 penalties. Canonical model selection procedures find estimates β̂ by solving

    min_{b ∈ R^p} ‖y − Xb‖²_{ℓ2} + λ‖b‖_{ℓ0},    (1.11)

where ‖b‖_{ℓ0} is the number of nonzero components in b. The idea behind such procedures is to achieve the best possible trade-off between the goodness of fit and the number of variables included in the model. Popular selection procedures such as AIC [Akaike (1974)] and C_p [Mallows (1973)] are of this form: when the errors are i.i.d. N(0, σ²), AIC and C_p take λ = 2σ². In the high-dimensional regime, such a choice typically leads to including very many irrelevant variables, yielding rather poor predictive properties when the true vector of regression coefficients is sparse. In part to remedy this problem, Foster and George (1994) developed the risk inflation criterion (RIC): they proposed using a larger value of λ, effectively proportional to 2σ² log p, where p is the total number of variables in the study. Under orthogonal designs, if we associate nonzero fitted coefficients with rejections, this yields FWER control. Unfortunately, RIC is also rather conservative and, therefore, it may not have much power in detecting variables with nonvanishing regression coefficients unless they are very large.

The above dichotomy has been recognized for some time now and several researchers have proposed more adaptive strategies. One frequently discussed idea in the literature is to let the parameter λ in (1.11) decrease as the number of included variables increases. For instance, when minimizing

    ‖y − Xb‖²_{ℓ2} + p(‖b‖_{ℓ0}),

penalties with appealing information- and decision-theoretic properties are roughly of the form

    p(k) = 2σ²·k·log(p/k) or p(k) = 2σ²·Σ_{1≤j≤k} log(p/j).    (1.12)

Among others, we refer the interested reader to Foster and Stine (1999), Birgé and Massart (2001) and to Tibshirani and Knight (1999) for related approaches.

Interestingly, for large p and small k these penalties are close to the FDR-related penalty

    p(k) = σ²·Σ_{1≤j≤k} λ²_BH(j),    (1.13)

proposed in Abramovich et al.
(2006) in the context of the estimation of the vector of normal means, or regression under the orthogonal design (see the preceding section), and further explored in Benjamini and Gavrilov (2009). Due to an implicit control of the number of false discoveries, similar model selection criteria are appealing in gene mapping studies [see, e.g., Frommlet et al. (2012)].

The problem with these selection strategies is that, in general, they are computationally intractable. Solving (1.12) would involve a brute-force search essentially requiring to fit least-squares estimates for all possible subsets of variables; this is not practical for even moderate values of p. The sorted ℓ1 penalty underlying SLOPE parallels the ℓ0 penalties specified in (1.12), in which the "cost per variable included" decreases as more variables get selected, but unlike them SLOPE is computationally tractable and can be easily evaluated even for large-dimensional problems.
1.3.2. Adaptive Lasso.
Perhaps the most popular alternative to the computationally intractable ℓ0 penalization methods is the Lasso. We have already discussed some of the limitations of this approach with respect to FDR control and now wish to explore further the connections between SLOPE and variants of this procedure. It is well known that the Lasso estimates of the regression coefficients are biased due to the shrinkage imposed by the ℓ1 penalty. To increase the accuracy of the estimation of large signals and eliminate some false discoveries, the adaptive or reweighted versions of Lasso were introduced [see, e.g., Zou (2006) or Candès, Wakin and Boyd (2008)]. In these procedures the smoothing parameters λ_1, . . . , λ_p are adjusted to the unknown signal magnitudes based on some estimates of regression coefficients, perhaps obtained through previous iterations of Lasso. The idea is then to consider a weighted penalty Σ_i w_i|b_i|, where w_i is inversely proportional to the estimated magnitudes so that large regression coefficients are shrunk less than smaller ones. In some circumstances, such adaptive versions of Lasso outperform its regular version [Zou (2006)].

The idea behind SLOPE is entirely different. In the adaptive Lasso, the penalty tends to decrease as the magnitude of coefficients increases. In our approach, the exact opposite happens. This comes from the fact that we seek to adapt to the unknown signal sparsity and control FDR. As shown in Abramovich et al. (2006), FDR controlling properties can have interesting consequences for estimation. In practice, since the SLOPE sequence λ_1 ≥ · · · ≥ λ_p leading to FDR control is typically rather large, we do not recommend using SLOPE directly for the estimation of regression coefficients. Instead we propose the following two-stage procedure: in the first step, SLOPE is used to identify significant predictors; in the second step, the corresponding regression coefficients are estimated using the least-squares method within the identified sparse regression model. Such a two-step procedure, previously proposed in the context of Lasso [see, e.g., Meinshausen (2007)], can be thought of as an extreme case of reweighting, where the selected variables are not penalized while those that are not selected receive an infinite penalty. As shown below, these estimates have very good properties when the coefficient sequence β is sparse.

1.3.3. A first illustrative simulation.
To concretely illustrate the specific behavior of SLOPE compared to more traditional penalized approaches, we rely on the simulation of a relatively simple data structure. We set n = p = 5000 and generate a design matrix with i.i.d. N(0, 1/n) entries. The number of true signals k varies between 0 and 50 and their magnitudes are set to β_i = √(2 log p) ≈ 4.1, while the variance of the error term is assumed known and equal to 1. Since the expected value of the maximum of p independent standard normal variables is approximately equal to √(2 log p) and the whole distribution of the maximum concentrates around this value, this choice of model parameters makes the sparse signal barely distinguishable from the noise, because the nonzero means are at the level of the largest null statistics. We refer to, for example, Ingster (1998) for a precise discussion of the limits of detectability in sparse mixtures.

We fit these observations with three procedures: (1) Lasso with parameter λ_Bonf = σ · Φ⁻¹(1 − α/(2p)), which controls FWER weakly; (2) Lasso with the smoothing parameter λ_CV chosen with 10-fold cross-validation; (3) SLOPE with a sequence λ_1, . . . , λ_p defined in Section 3.2.2, expression (3.8). The level α for λ_Bonf and q for FDR control in SLOPE are both set to 0.1. To compensate for the fact that Lasso with λ_Bonf and SLOPE tend to apply a much more stringent penalization than Lasso with λ_CV (which aims to minimize prediction error), we have "de-biased" their resulting β̂, using ordinary least squares to estimate the coefficients of the variables selected by Lasso–λ_Bonf and SLOPE [see Meinshausen (2007)].

We compare the procedures on the basis of three criteria: (a) FDR, (b) power, and (c) relative squared error ‖Xβ̂ − Xβ‖²_{ℓ2}/‖Xβ‖²_{ℓ2}. Note that only the first of these measures is meaningful for the case where k = 0, and in such a case FDR = FWER.

Fig. 2. Properties of different procedures as a function of the true number of nonzero regression coefficients: (a) FDR, (b) power, and (c) relative MSE defined as the average of 100 · ‖μ̂ − μ‖²_{ℓ2}/‖μ‖²_{ℓ2}, with μ = Xβ, μ̂ = Xβ̂. The design matrix entries are i.i.d. N(0, 1/n), n = p = 5000, β_i = √(2 log p) ≈ 4.1, and σ = 1. Each point in the figures corresponds to the average of 500 replicates.

Figure 2 reports the results of 500 independent replicates. The three approaches exhibit quite dramatically different properties with respect to model selection. SLOPE controls FDR at the desired level 0.1 for the explored range of k; as k increases, its power goes from 45% to 70%. Lasso–λ_Bonf has FDR = 0.1 at k = 0, and a much lower one for the remaining values of k. This results in a loss of power with respect to SLOPE: irrespective of k, the power is less than 45%. Cross-validation chooses a λ that minimizes an estimate of prediction error, and in our experiments, λ_CV is quite a bit smaller than a penalization parameter chosen with FDR control in mind. This results in greater power than SLOPE, but with a much larger FDR (80% on average).

Figure 2(c) illustrates the relative mean-square error, which serves as a measure of prediction accuracy. It is remarkable how, despite the fact that Lasso–λ_CV has higher power, SLOPE builds a better predictive model since it has a lower prediction error percentage for all the sparsity levels considered.
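The three criteria are simple to compute; a minimal R sketch for a generic selected model (names ours):

    # (a) FDP: fraction of selected variables that are irrelevant (0 if none)
    fdp <- function(selected, support) {
      if (length(selected) == 0) 0 else mean(!(selected %in% support))
    }
    # (b) power: fraction of true signals that are selected
    tpp <- function(selected, support) mean(support %in% selected)
    # (c) relative squared error on the mean vector
    rel_mse <- function(X, beta_hat, beta) {
      mu <- X %*% beta
      sum((X %*% beta_hat - mu)^2) / sum(mu^2)
    }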
2. Algorithms.
In this section we present effective algorithms for computing the solution to SLOPE (1.10), which rely on the numerical evaluation of the proximity operator (prox) of the sorted ℓ1 norm.

2.1. Proximal gradient algorithms.
SLOPE is a convex optimization problem of the form

    minimize f(b) = g(b) + h(b),    (2.1)

where g is smooth and convex, and h is convex but not smooth. In SLOPE, g is the residual sum of squares and, therefore, quadratic, while h is the sorted ℓ1 norm. A general class of algorithms for solving problems of this kind are known as proximal gradient methods; see Nesterov (2007), Parikh and Boyd (2013) and references therein. These are iterative algorithms operating as follows: at each iteration, we hold a guess b of the solution and compute a local approximation to the smooth term g of the form

    g(b) + ⟨∇g(b), x − b⟩ + (1/(2t))‖x − b‖²_{ℓ2}.

This is interpreted as the sum of a Taylor approximation of g and of a proximity term; as we shall see, this term is responsible for searching an update reasonably close to the current guess b, and t can be thought of as a step size. Then the next guess b⁺ is the unique solution to

    b⁺ = arg min_x { g(b) + ⟨∇g(b), x − b⟩ + (1/(2t))‖x − b‖²_{ℓ2} + h(x) }
       = arg min_x { (1/(2t))‖(b − t∇g(b)) − x‖²_{ℓ2} + h(x) }

(unicity follows from strong convexity). In the literature, the mapping

    x(y) = arg min_x { (1/(2t))‖y − x‖²_{ℓ2} + h(x) }

is called the proximal mapping or prox for short, and denoted by x = prox_{th}(y).

The prox of the ℓ1 norm is given by entry-wise soft thresholding [Parikh and Boyd (2013), page 150] so that a proximal gradient method to solve the Lasso would take the following form: starting with b⁰ ∈ R^p, inductively define

    b^{k+1} = η(b^k − t_k X′(Xb^k − y); t_k λ),

where η(y; λ) = sign(y)·(|y| − λ)₊ is applied componentwise and {t_k} is a sequence of step sizes. Hence, we can solve the Lasso by iterative soft thresholding.

It turns out that one can compute the prox to the sorted ℓ1 norm in nearly the same amount of time as it takes to apply soft thresholding. In particular, assuming that the entries are sorted (an order p log p operation), we shall demonstrate a linear-time algorithm. Hence, we may consider a proximal gradient method for SLOPE as in Algorithm 1. It is well known that the algorithm converges [in the sense that f(b^k), where f is the objective functional, converges to the optimal value] under some conditions on the sequence of step sizes {t_k}. Valid choices include step sizes obeying t_k < 2/‖X‖², where ‖X‖ is the operator norm of X.
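In R, iterative soft thresholding for the Lasso is a few lines; a minimal sketch of the iteration displayed above, under the stated step-size condition (names ours):

    soft_threshold <- function(v, lam) sign(v) * pmax(abs(v) - lam, 0)

    lasso_ista <- function(X, y, lambda, iters = 1000) {
      t <- 1 / norm(X, "2")^2          # a valid fixed step size
      b <- numeric(ncol(X))
      for (k in seq_len(iters)) {
        b <- drop(soft_threshold(b - t * crossprod(X, X %*% b - y), t * lambda))
      }
      b
    }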
2.2. Fast prox algorithm. Given y ∈ R^p and λ_1 ≥ λ_2 ≥ · · · ≥ λ_p ≥ 0, the prox to the sorted ℓ1 norm is the unique solution to

    prox(y; λ) := arg min_{x ∈ R^p} (1/2)‖y − x‖²_{ℓ2} + Σ_{i=1}^p λ_i|x|_(i).    (2.2)
Algorithm 1 Proximal gradient algorithm for SLOPE (1.10)
Require: b⁰ ∈ R^p
for k = 0, 1, . . . do
    b^{k+1} = prox_{t_k J_λ}(b^k − t_k X′(Xb^k − y))
end for
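Algorithm 1 translates directly into R; this sketch assumes a routine prox_sorted_l1_full() implementing (2.2) for arbitrary vectors (one is sketched after Algorithm 4 below):

    slope_prox_grad <- function(X, y, lambda, iters = 1000) {
      t <- 1 / norm(X, "2")^2          # step size t_k < 2 / ||X||^2
      b <- numeric(ncol(X))
      for (k in seq_len(iters)) {
        # gradient step on g, then prox of t * J_lambda
        b <- prox_sorted_l1_full(drop(b - t * crossprod(X, X %*% b - y)),
                                 t * lambda)
      }
      b
    }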
Algorithm 2 Accelerated proximal gradient algorithm for SLOPE (1.10)
Require: b⁰ ∈ R^p, and set a⁰ = b⁰ and θ_0 = 1
for k = 0, 1, . . . do
    b^{k+1} = prox_{t_k J_λ}(a^k − t_k X′(Xa^k − y))
    θ_{k+1}⁻¹ = (1 + √(1 + 4/θ_k²))/2
    a^{k+1} = b^{k+1} + θ_{k+1}(θ_k⁻¹ − 1)(b^{k+1} − b^k)
end for

A simple observation is this: at the solution to (2.2), the sign of each x_i ≠ 0 will match that of y_i. It therefore suffices to solve the problem for |y| and restore the signs in a post-processing step, if needed. Likewise, note that applying any permutation P to y results in a solution Px. We can thus choose a permutation that sorts the entries in y and apply its inverse to obtain the desired solution. Therefore, without loss of generality, we can make the following assumption:

Assumption 2.1.
The vector y obeys y_1 ≥ y_2 ≥ · · · ≥ y_p ≥ 0.

Proposition 2.2.
Under Assumption 2.1, we can reformulate (2.2) as

    minimize (1/2)‖y − x‖²_{ℓ2} + Σ_{i=1}^p λ_i x_i
    subject to x_1 ≥ x_2 ≥ · · · ≥ x_p ≥ 0.    (2.3)

We do not suggest performing the prox calculation by calling a standard QP solver applied to (2.3). Rather, we introduce the FastProxSL1 algorithm for computing the prox: for ease of exposition, we introduce Algorithm 3 in its simplest form before presenting a stack implementation (Algorithm 4) running in O(p) flops, after an O(p log p) sorting step.

Algorithm 3, which terminates in at most p steps, is simple to understand: we simply keep on averaging until the monotonicity property holds, at which point the solution is known in closed form. The key point establishing the correctness of the algorithm is that the update does not change the value of the prox. This is formalized below.
Algorithm 3 FastProxSL1
input: Nonnegative and nonincreasing sequences y and λ.
while y − λ is not nonincreasing do
    Identify nondecreasing and nonconstant subsequences, that is, segments i : j such that

        y_i − λ_i ≤ y_{i+1} − λ_{i+1} ≤ · · · ≤ y_j − λ_j and y_i − λ_i < y_j − λ_j.    (2.4)

    Replace the values of y and λ over such segments by their average values: for k ∈ {i, i + 1, . . . , j},

        y_k ← (1/(j − i + 1)) Σ_{i≤l≤j} y_l,  λ_k ← (1/(j − i + 1)) Σ_{i≤l≤j} λ_l.

end while
output: x = (y − λ)₊.

Lemma 2.3.
The solution does not change after each update; formally, letting (y⁺, λ⁺) be the updated value of (y, λ) after one pass in Algorithm 3,

    prox(y; λ) = prox(y⁺; λ⁺).
Algorithm 4 Stack-based algorithm for FastProxSL1
input: Nonnegative and nonincreasing sequences y and λ.
# find optimal group levels
t ← 0
for k = 1 to p do
    t ← t + 1
    (i, j, s, w)_t ← (k, k, y_k − λ_k, (y_k − λ_k)₊)
    while (t > 1) and (w_{t−1} ≤ w_t) do
        (i, j, s, w)_{t−1} ← (i_{t−1}, j_t, s_{t−1} + s_t, ((s_{t−1} + s_t)/(j_t − i_{t−1} + 1))₊)
        delete (i, j, s, w)_t, t ← t − 1
    end while
end for
# set entries in x for each block
for l = 1 to t do
    for k = i_l to j_l do
        x_k ← w_l
    end for
end for
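An R transcription of the stack-based prox, together with the sign-and-permutation wrapper described before Assumption 2.1 (a sketch; names ours):

    # Algorithm 4: y and lambda nonnegative and nonincreasing.
    # Blocks are tuples (i, j, s, w): range [i, j], running sum s of
    # y - lambda over the block, and block value w = (block mean)_+.
    prox_sorted_l1 <- function(y, lambda) {
      p <- length(y)
      i <- j <- s <- w <- numeric(p)
      t <- 0
      for (k in 1:p) {
        t <- t + 1
        i[t] <- k; j[t] <- k
        s[t] <- y[k] - lambda[k]
        w[t] <- max(s[t], 0)
        while (t > 1 && w[t - 1] <= w[t]) {    # merge adjacent violators
          j[t - 1] <- j[t]
          s[t - 1] <- s[t - 1] + s[t]
          w[t - 1] <- max(s[t - 1] / (j[t - 1] - i[t - 1] + 1), 0)
          t <- t - 1
        }
      }
      x <- numeric(p)
      for (l in 1:t) x[i[l]:j[l]] <- w[l]
      x
    }

    # Reduction to the sorted, nonnegative case: sort |v|, solve, undo.
    prox_sorted_l1_full <- function(v, lambda) {
      o <- order(abs(v), decreasing = TRUE)
      x <- prox_sorted_l1(abs(v)[o], lambda)
      out <- numeric(length(v))
      out[o] <- x
      sign(v) * out
    }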
Next, if (y − λ)₊ is nonincreasing, then it is the solution to (2.2), that is, prox(y; λ) = (y − λ)₊.

This lemma, whose proof is in the supplementary appendix [Bogdan et al. (2015)], guarantees that the FastProxSL1 algorithm finds the solution to (2.2) in a finite number of steps.

As stated earlier, it is possible to obtain a careful O(p) implementation of FastProxSL1. Above we present a stack-based approach. We use tuple notation (a, b)_i = (c, d) to denote a_i = c, b_i = d. For the complexity of the algorithm, note that we create a total of p new tuples. Each of these tuples is merged into a previous tuple at most once. Since the merge takes a constant amount of time, the algorithm has the desired O(p) complexity.

With this paper, we are making available C, Matlab, and R implementations of the stack-based algorithm. The algorithm is also implemented in the R package SLOPE, available on CRAN, and included in the current version of the TFOCS package. Table 1 reports the average runtimes of the algorithm (MacBook Pro, 2.66 GHz, Intel Core i7) when applied to vectors of varying length.

2.3. Related algorithms.
Brad Efron informed us about the connection between the FastProxSL1 algorithm for SLOPE and a simple iterative algorithm for solving isotonic problems called the pool adjacent violators algorithm (PAVA) [Kruskal (1964), Barlow et al. (1972)]. A simple instance of an isotonic regression problem involves fitting data in a least-squares sense in such a way that the fitted values are monotone:

    minimize (1/2)‖y − x‖²_{ℓ2}
    subject to x_1 ≥ x_2 ≥ · · · ≥ x_p.    (2.5)

Here, y is a vector of observations and x is the vector of fitted values, which are here constrained to be nonincreasing. We have chosen this formulation to emphasize the connection with (2.3). Indeed, our QP (2.3) is equivalent
to

    minimize (1/2) Σ_{i=1}^p (y_i − λ_i − x_i)²
    subject to x_1 ≥ x_2 ≥ · · · ≥ x_p ≥ 0,

so that we see we are really solving an isotonic regression problem with data y_i − λ_i. Algorithm 3 is then a version of PAVA as described in Barlow et al. (1972); see Best and Chakravarti (1990), Grotzinger and Witzgall (1984) for related work and connections with active set methods. Also, an elegant R package for isotone regression has been contributed by de Leeuw, Hornik and Mair (2009) and can be used to compute the prox to the sorted ℓ1 norm. Similar algorithms were also proposed in Zhong and Kwok (2012) to solve the OSCAR optimization problem, defined as

    minimize (1/2)‖y − Xb‖²_{ℓ2} + λ_1‖b‖_{ℓ1} + λ_2 Σ_{i<j} max{|b_i|, |b_j|},

whose penalty is a sorted ℓ1 norm with λ_i = λ_1 + λ_2(p − i).

Table 1
Average runtimes of the stack-based prox implementation with normalization steps (sorting and sign changes) included, respectively, excluded

                                        p = 10⁴      p = 10⁵      p = 10⁶
Total prox time (s)                    9.82e–03     1.11e–01     1.20e+00
Prox time after normalization (s)      6.57e–05     4.96e–05     5.21e–05

3. Results. We now illustrate the performance of our SLOPE proposal in three different ways. First, we describe a multiple-testing situation where reducing the problem to a model selection setting and applying SLOPE assures FDR control, and results in a testing procedure with appreciable properties. Second, we discuss guiding principles to choose the sequence of λ_i's in general settings, and illustrate the efficacy of the proposals with simulations. Third, we apply SLOPE to a data set collected in genetics investigations.

3.1. An application to multiple testing. In this section we show how SLOPE can be used as an effective multiple comparison controlling procedure in a testing problem with a specific correlation structure. Consider the following situation. Scientists perform p = 1000 experiments in each of 5 randomly selected laboratories, resulting in observations that can be modeled as

    y_{i,j} = μ_i + τ_j + z_{i,j},  1 ≤ i ≤ 1000, 1 ≤ j ≤ 5,    (3.1)

where the laboratory effects τ_j are i.i.d. N(0, σ²_τ) random variables and the errors z_{i,j} are i.i.d. N(0, σ²_z), with the τ and z sequences independent of each other. It is of interest to test H_i: μ_i = 0 versus a two-sided alternative. Averaging the scores over all five labs results in

    ȳ_i = μ_i + τ̄ + z̄_i,  1 ≤ i ≤ 1000,

with ȳ ∼ N(μ, Σ), Σ_{i,i} = (σ²_τ + σ²_z)/5 = σ², and Σ_{i,j} = σ²_τ/5 = ρ for i ≠ j. The problem has then been reduced to testing if the marginal means of a multivariate Gaussian vector with equicorrelated entries do not vanish. One possible approach is to use marginal tests based on the ȳ_i's and rely on the Benjamini–Hochberg procedure to control FDR. That is, we can order |ȳ|_(1) ≥ |ȳ|_(2) ≥ · · · ≥ |ȳ|_(p) and apply the step-up procedure with critical values equal to σ · Φ⁻¹(1 − iq/(2p)).

Another possible approach is to "whiten the noise" and express our multiple testing problem in the form of a regression equation

    ỹ = Σ^{−1/2}ȳ = Σ^{−1/2}μ + ε,    (3.2)

where ε ∼ N(0, I_p). Treating Σ^{−1/2} as the regression design matrix, our problem is equivalent to classical model selection: identify the nonzero components of the vector μ of regression coefficients. [To be explicit, (3.2) is the basic regression model with X = Σ^{−1/2} and β = μ.] Note that while the matrix Σ is far from being diagonal, Σ^{−1/2} is diagonally dominant. For example, when σ² = 1 and ρ = 0.5, then Σ^{−1/2}_{i,i} ≈ 1.41 and Σ^{−1/2}_{i,j} ≈ −0.0014 for i ≠ j. Thus, every low-dimensional submodel obtained by selecting few columns of the design matrix Σ^{−1/2} will be very close to orthogonal.
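A quick numerical check of this near-orthogonality claim, using the two distinct eigenvalues of the equicorrelated Σ (a sketch with σ² = 1, ρ = 0.5, p = 1000; names ours):

    p <- 1000; rho <- 0.5
    # Sigma = (1 - rho) I + rho J has eigenvalues 1 - rho and 1 + (p - 1) rho,
    # so Sigma^{-1/2} = a (I - J/p) + b J/p with:
    a <- 1 / sqrt(1 - rho)              # action orthogonal to the ones vector
    b <- 1 / sqrt(1 + (p - 1) * rho)    # action on the ones vector
    diag_entry    <- a + (b - a) / p    # approx  1.41
    offdiag_entry <- (b - a) / p        # approx -0.0014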
In summary, the transformation (3.2) reduces the multiple-testing problem with strongly positively correlated test statistics to a problem of model selection under approximately orthogonal design, which is well suited for the application of SLOPE with the λ_BH values.

To compare the performances of these two approaches, we simulate data according to the model (3.1) with variance components σ²_τ = σ²_z = 2.5, which yield σ² = 1 and ρ = 0.5. We consider a sequence of sparse settings, where the number k of nonzero μ_i's varies between 0 and 80. To obtain moderate power, the nonzero means are set to √(2 log p)/c ≈ 2.63, where c is the Euclidean norm of each of the columns of Σ^{−1/2}. We compare the performance of SLOPE and BH on marginal tests under two scenarios: (1) assuming the variance components σ²_τ = σ²_z = 2.5 are known; (2) estimating them with the unweighted means method based on equating the ANOVA mean squares to their expectations:

    σ̂²_z = MSE,  σ̂²_τ = (MS_τ − MSE)/1000;

using the standard notation from ANOVA analysis, MSE is the mean square due to the error in the model (3.1) and MS_τ is the mean square due to the random factor τ. To use SLOPE, we center the vector ỹ by subtracting its mean, and center and standardize the columns of Σ̂^{−1/2}, so they have zero means and unit ℓ2 norms.

Fig. 3. Simulation results for testing multiple means from correlated statistics. (a)–(b) Mean FDP ±2 SE as a function of k. (c) Mean FDP ±2 SE. (d) Power plot.

Figure 3 reports the results of these simulations, averaged over 500 independent replicates. In our setting, the estimation procedure has no influence on SLOPE. Under both scenarios (variance components known and unknown) SLOPE keeps FDR at the nominal level as long as k ≤ 40. Then its FDR slowly increases, but for k ≤ 80 it is still very close to the nominal level, as shown in Figure 3(c). In contrast, the performance of BH differs significantly: when σ is known, BH on the marginal tests is too conservative, with an average FDP below the nominal level; see Figure 3(a) and (b). When σ is estimated, the average FDP of this procedure increases and, for q = 0.05, it significantly exceeds the nominal level.

Fig. 4. Testing example with q = 0.1 and k = 50. The top row refers to marginal tests, and the bottom row to SLOPE. Both procedures use the estimated variance components. Histograms of false discovery proportions are in the first column and of true positive proportions in the second.

Under both scenarios (known and unknown σ) the power of BH is substantially smaller than the power provided by SLOPE [Figure 3(d)]. Moreover, the False Discovery Proportion (FDP) in the marginal tests with BH correction appears more variable across replicates than that of SLOPE [Figure 3(a), (b) and (c)]. Figure 4 presents the results in greater detail for q = 0.1 and k = 50: in approximately 65% of the cases the observed FDP for BH is equal to 0, while in the remaining 35% it takes values which are distributed over the whole interval (0, 1): BH keeps the FDR below q even if, when there are discoveries, a large number of them are false. Indeed, in approximately 26% of all cases BH on the marginal tests did not make any rejections (i.e., R = 0); and conditional on R > 0, the mean of FDP is equal to 0.16 with a standard deviation of 0.28, which clearly shows that the observed FDP is typically far away from the nominal value of q = 0.1.
In other words, while BH is close to controlling the FDR, the scientists would either make no discoveries or have very little confidence in those actually made. In contrast, SLOPE results in a more predictable FDP and a substantially larger and more predictable True Positive Proportion (TPP, fraction of correctly identified true signals); see Figure 4.

Fig. 5. Observed (a) FWER for Lasso with λ_Bonf and (b) FDR for SLOPE with λ_BH under Gaussian design and n = 5000.

3.2. Choosing λ in general settings. In the previous sections we observed that, for orthogonal designs, Lasso with λ_Bonf = σ · Φ⁻¹(1 − α/(2p)) controls FWER at the level α, while SLOPE with the sequence λ = λ_BH controls FDR at the level q. We are interested, however, in applying these procedures in more general settings, specifically when p > n and there is some correlation among the explanatory variables, and when the value of σ is not known.

We start tackling the first situation. Correlation among regressors notoriously introduces a series of complications in the statistical analysis of linear models, ranging from the increased computational costs that motivated the early popularity of orthogonal designs, to the conceptual difficulties of distinguishing causal variables among correlated ones. Indeed, recent results on the consistency of ℓ1 penalization methods typically require some form of partial orthogonality. SLOPE and Lasso aim at finite sample properties, but it would not be surprising if departures from orthogonality were to have a serious effect. To explore this, we study the performance of Lasso and SLOPE in the case where the entries of the design matrix are generated independently from the N(0, 1/n) distribution. Specifically, we consider two Gaussian designs with n = 5000: one with p = 2n = 10,000 and one with p = n/2 = 2500. The nonzero regression coefficients are set to 5√(2 log p) and we consider situations where the number of important variables ranges between 0 and 100. Figure 5 illustrates that under such Gaussian designs both Lasso–λ_Bonf and SLOPE lose the control over their targeted error rates (FWER and FDR) as the number k of nonzero coefficients increases, with a departure that is more severe when the ratio p/n is larger.

3.2.1. The effect of shrinkage. What is behind this fairly strong effect, and is it possible to choose a λ sequence to compensate for it? Some useful insights come from studying the solution of the Lasso. Assume that the columns of X have unit norm and that z ∼ N(0, I_n). Any Lasso solution β̂ obeys

    β̂ = η_λ(β̂ − X′(Xβ̂ − y)) = η_λ(β̂ − X′(Xβ̂ − Xβ − z)) = η_λ(β̂ − X′X(β̂ − β) + X′z),    (3.3)

where η_λ is the soft-thresholding operator, η_λ(t) = sgn(t)(|t| − λ)₊, applied componentwise. Defining v_i = ⟨X_i, Σ_{j≠i} X_j(β_j − β̂_j)⟩, we can write

    β̂_i = η_λ(β_i + X′_i z + v_i),    (3.4)

which expresses the relation between the estimated value β̂_i and its true value β_i. If the variables are orthogonal, the v_i's are identically equal to 0, leading to β̂_i = η_λ(β_i + X′_i z). Conditionally on X, X′_i z ∼ N(0, 1) and, by using Bonferroni's method, one can choose λ such that P(max_i |X′_i z| > λ) ≤ α. When X is not orthogonal, however, v_i ≠ 0 and its size increases with the estimation error of β_j (for j ≠ i), which depends on the magnitude of the shrinkage parameter λ. Therefore, even in the perfect situation where all the k relevant variables, and those alone, have been selected, and when all columns of the design matrix are realizations of independent random variables, v_i will not be zero.
Rather, the squared magnitude of v_i will be on the order of λ² · k/n. In other words, the variance that would determine the correct Bonferroni threshold is on the order of 1 + λ² · k/n. In reality, the true k is not known a priori, and the selected k depends on the value of the smoothing parameter λ, so that it is not trivial to implement this correction in the Lasso. SLOPE, however, uses a decreasing sequence λ_1 ≥ λ_2 ≥ · · ·, analogous to a step-down procedure, and this extra noise due to the shrinkage of relevant variables can be incorporated by progressively modifying the λ sequence. In evocative, if not exact terms, λ_1 is used to select the first variable to enter the model: at this stage we are not aware of any variable whose shrunk coefficient is "effectively increasing" the noise level, and we can keep λ_1 = λ_BH(1). The value of λ_2 determines the second variable to enter the model and, hence, we know that there is already one important variable whose coefficient has been shrunk by roughly λ_BH(1); we can use this information to redefine λ_2. Similarly, when using λ_3 to identify the third variable, we know of two relevant regressors whose coefficients have been shrunk by amounts determined by λ_1 and λ_2, and so on. What follows is an attempt to make this intuition more precise, accounting for the fact that the sequence λ needs to be determined a priori, and we need to make a prediction on the values of the cross products X′_i X_j appearing in the definition of v_i. Before we turn to this, we want to underscore how this explanation for the loss of FDR control is consistent with patterns evident from Figure 5: the problem is more serious as k increases (and, hence, the effect of shrinkage is felt on a larger number of variables) and as the ratio p/n increases (which for Gaussian designs results in larger empirical correlation |X′_i X_j|). Our loose analysis suggests that when k is really small, SLOPE with λ_BH yields an FDR that is close to the nominal level, as empirically observed.

3.2.2. Adjusting the regularizing sequence for SLOPE. In light of (3.4), we would like an expression for X′_i X_S(β_S − β̂_S), where with S, X_S and β_S we indicate the support of β, the subset of variables associated with β_i ≠ 0, and the value of their coefficients, respectively.

Again, to obtain a very rough evaluation of the SLOPE solution, we can start from the Lasso. Let us assume that the size of β_S and the value of λ are such that the support and the signs of the regression coefficients are correctly recovered in the solution. That is, we assume that sign(β_j) = sign(β̂_j) for all j, with the convention that sign(0) = 0. Without loss of generality, we further assume that β_j ≥ 0. Now, the Karush–Kuhn–Tucker (KKT) optimality conditions for the Lasso yield

    X′_S(y − Xβ̂) = λ · 1,    (3.5)

where 1 is a vector of ones, implying

    β̂_S = (X′_S X_S)⁻¹(X′_S y − λ · 1).

In the case of SLOPE, rather than one λ, we have a sequence λ_1, . . . , λ_p. Assuming again that this is chosen so that we recover exactly the support S, the estimates of the nonzero components are very roughly equal to

    β̂_S = (X′_S X_S)⁻¹(X′_S y − λ_S) = β̂_OLS − (X′_S X_S)⁻¹λ_S,

where λ_S = (λ_1, . . . , λ_{|S|})′ and β̂_OLS is the least-squares estimator of β_S. This leads to

    E(β_S − β̂_S) ≈ (X′_S X_S)⁻¹λ_S

and

    E X′_i X_S(β_S − β̂_S) ≈ E X′_i X_S(X′_S X_S)⁻¹λ_S,

an expression that tells us the typical size of v_i in (3.4). For the case of Gaussian designs, where the entries of X are i.i.d.
N(0, 1/n), for i ∉ S,

    E⟨X_i, X_S(X′_S X_S)⁻¹λ_S⟩² = (1/n)·λ′_S E(X′_S X_S)⁻¹λ_S = w(|S|)·‖λ_S‖²_{ℓ2},  w(k) = 1/(n − k − 1).    (3.6)

This uses the fact that the expected value of an inverse k × k Wishart matrix with n degrees of freedom is equal to I_k/(n − k − 1). This suggests replacing the λ_BH values with the adjusted values described below, denoted by λ_G since the adjustment is motivated by Gaussian designs. We start with λ_G(1) = λ_BH(1). At the next stage, however, we need to account for the slight increase in variance, so that we do not want to use λ_BH(2) but rather λ_G(2) = λ_BH(2)·√(1 + w(1)·λ_G(1)²). Continuing, this gives

    λ_G(i) = λ_BH(i)·√(1 + w(i − 1)·Σ_{j<i} λ_G(j)²).    (3.7)

Fig. 6. Graphical representation of sequences {λ_i} for p = 5000. The solid line is λ_BH; the dashed (resp., dotted) line is λ_G given by (3.7) for n = p/2 (resp., n = 2p).

The sequence λ_G(i) is not necessarily decreasing: as illustrated in Figure 6, it initially decreases and then starts to increase, the more so the larger the ratio p/n. We therefore truncate it at the point k⋆ where it attains its first local minimum, and work with

    λ_G⋆(i) = λ_G(i) for i ≤ k⋆ and λ_G⋆(i) = λ_G(k⋆) for i > k⋆, with λ_G(i) as in (3.7).    (3.8)

An immediate validation (if the intuition that we have stretched this far has any bearing in reality) is the performance of λ_G⋆ in the setup of Figure 5. In Figure 7 we illustrate the performance of SLOPE for large signals β_i = 5√(2 log p) as in Figure 5, as well as for rather weak signals with β_i = √(2 log p).

Fig. 7. Mean FDP ±2 SE for SLOPE with λ_G⋆. Strong signals have nonzero regression coefficients set to 5√(2 log p), while this value is set to √(2 log p) for weak signals. (a) p = 2n = 10,000. (b) p = n/2 = 2500.

The correction works very well, rectifying the loss of FDR control documented in Figure 5. For p = 2n = 10,000, the values of the critical point k⋆ are 51 for q = 0.05 and 68 for q = 0.1. For p = n/2 = 2500, the FDR of SLOPE with λ_G⋆ is very close to the nominal level when k ≤ k⋆.
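The recursion (3.7) with the truncation (3.8) is straightforward to code; a minimal R sketch (names ours):

    lambda_g_star <- function(p, n, q) {
      lam_bh <- qnorm(1 - (1:p) * q / (2 * p))
      w <- function(k) 1 / (n - k - 1)
      lam <- lam_bh
      for (i in 2:p) {
        lam[i] <- lam_bh[i] * sqrt(1 + w(i - 1) * sum(lam[1:(i - 1)]^2))
      }
      k_star <- which(diff(lam) > 0)[1]    # first local minimum of lambda_G
      if (!is.na(k_star)) lam[(k_star + 1):p] <- lam[k_star]
      lam
    }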
In situations where one cannot assume that the design is Gaussian or that columns are independent, we suggest replacing w(i − 1)·Σ_{j<i} λ_G(j)² in (3.7) with a Monte Carlo estimate of the variance term E⟨X_i, X_S(X′_S X_S)⁻¹λ_S⟩², obtained by repeatedly drawing subsets S of columns of the actual design matrix (this is the approach we take in Section 3.2.4). We note that the computational cost of this procedure is relatively low. Two elements contribute to this. First, the complexity of the procedure is reduced by the fact that the sequence of λ's does not need to be estimated entirely, but only up to the point k⋆ where it starts increasing (or simply flattens) and only for a number of entries on the order of the expected number of nonzero coefficients. Second, the smoothness of λ assures that it is enough to estimate λ on a grid of points between 1 and k⋆, making the problem tractable also for very large p. In Bogdan et al. (2013) we applied a similar procedure for the estimation of the regularizing sequence with p = 2048² = 4,194,304 and n = p/2.

3.2.3. Unknown σ. According to formulas (1.5) and (1.10), the penalty in SLOPE depends on the standard deviation σ of the error term. In many applications σ is not known and needs to be estimated. When n is larger than p, this can easily be done by means of classical unbiased estimators. When p ≥ n, some solutions for simultaneous estimation of σ and regression coefficients using ℓ1 optimization schemes were proposed; see, for example, Städler, Bühlmann and van de Geer (2010) and Sun and Zhang (2012). Specifically, Sun and Zhang (2012) introduced a simple iterative version of the Lasso called the scaled Lasso. The idea of this algorithm can be applied to SLOPE, with some modifications. For one, our simulation results show that, under very sparse scenarios, it is better to de-bias the estimates of regression parameters by using classical least-squares estimates within the selected model to obtain an estimate of σ.

We present our algorithm as Algorithm 5. There, λ^S is the sequence of SLOPE parameters designed to work with σ = 1, obtained using the methods from Section 3.2.2. The procedure starts by using a conservative estimate of the standard deviation of the error term, σ̂(0) = Std(y), and a related conservative version of SLOPE with λ(0) = σ̂(0) · λ^S. Then, in consecutive runs, σ̂(k) is computed using residuals from the regression model which includes the variables identified by SLOPE with sequence σ̂(k−1) · λ^S. The procedure is repeated until convergence, that is, until the next iteration results in exactly the same model as the current one.

Algorithm 5 Iterative SLOPE fitting when σ is unknown
input: y, X and initial sequence λ^S (computed for σ = 1)
initialize: S⁺ = ∅
repeat
    S = S⁺
    compute the RSS obtained by regressing y onto the variables in S
    set σ̂² = RSS/(n − |S| − 1)
    compute the solution β̂ to SLOPE with parameter sequence σ̂ · λ^S
    set S⁺ = supp(β̂)
until S⁺ = S
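A sketch of Algorithm 5 in R, assuming a solver slope_solve(X, y, lambda) that returns the SLOPE coefficient vector (a hypothetical name; the CRAN package SLOPE provides an equivalent solver):

    scaled_slope <- function(X, y, lambda_s) {
      n <- nrow(X)
      S_new <- integer(0)
      repeat {
        S <- S_new
        if (length(S) == 0) {
          rss <- sum((y - mean(y))^2)    # gives sigma_hat(0) = Std(y)
        } else {
          rss <- sum(resid(lm(y ~ X[, S, drop = FALSE]))^2)
        }
        sigma_hat <- sqrt(rss / (n - length(S) - 1))
        beta_hat  <- slope_solve(X, y, sigma_hat * lambda_s)
        S_new     <- which(beta_hat != 0)
        if (identical(S_new, S)) break   # converged: same model twice
      }
      list(support = S_new, sigma = sigma_hat, beta = beta_hat)
    }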
3.2.4. Simulations with idealized GWAS data. We illustrate the performance of the "scaled" version of SLOPE and of our algorithm for the estimation of the parameters λ_i with simulations designed to mimic an idealized version of Genome Wide Association Studies (GWAS). We set n = p = 5000, and simulate 5000 genotypes of p independent Single Nucleotide Polymorphisms (SNPs). For each of these SNPs the minor allele frequency (MAF) is sampled from the uniform distribution on the interval (0.1, 0.5), and the genotypes are coded as

    x̃_ij = −1 for aa, 0 for aA, 1 for AA,    (3.9)

where a and A denote the minor and reference alleles at the jth SNP for the ith individual. Then the matrix X̃ is centered and standardized, so the columns of the final design matrix X have zero mean and unit norm. The trait values are simulated according to the model

    y = Xβ + z,    (3.10)

where z ∼ N(0, I), that is, we assume only additive effects and no interaction between loci (epistasis). We vary the number of nonzero regression coefficients k between 0 and 50 and we set their size to 1.2√(2 log p) ≈ 4.95. For each k, 500 replicates are performed, in each selecting randomly among the columns of X the k with nonzero coefficients. Since our design matrix is centered and does not contain an intercept, we also center the vector of responses and let SLOPE work with ỹ = y − ȳ, where ȳ is the mean of y.
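A sketch of the simulated design (3.9) and trait model (3.10) in R; drawing genotypes from Hardy–Weinberg proportions is our assumption (the text only specifies the marginal coding):

    n <- 5000; p <- 5000; k <- 30
    maf <- runif(p, 0.1, 0.5)
    X_raw <- sapply(maf, function(f) {
      # genotype coded as in (3.9): aa = -1, aA = 0, AA = 1, drawn with
      # (assumed) Hardy-Weinberg probabilities f^2, 2f(1-f), (1-f)^2
      sample(c(-1, 0, 1), n, replace = TRUE,
             prob = c(f^2, 2 * f * (1 - f), (1 - f)^2))
    })
    X <- scale(X_raw, center = TRUE, scale = FALSE)    # zero-mean columns
    X <- sweep(X, 2, sqrt(colSums(X^2)), "/")          # unit-norm columns

    beta <- numeric(p)
    beta[sample(p, k)] <- 1.2 * sqrt(2 * log(p))       # k random signals
    y <- X %*% beta + rnorm(n)                         # model (3.10)
    y_centered <- y - mean(y)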
We use this idealized GWAS setting to also explore the effect of some model misspecification. First, we consider a trait y on which the genotypes have effects that are not simply additive. We formalize this via the matrix ˜Z collecting the "dominant" effects,

    ˜z_ij = −1 for aa and AA, 1 for aA.    (3.11)

The final design matrix [X, Z] has the columns of [˜X, ˜Z] centered and standardized. Now the trait values are simulated according to the model

    y = [X, Z][β_X′, β_Z′]′ + ε,

where ε ∼ N(0, I), the number of "causal" SNPs k varies between 0 and 50, and each causal SNP has an additive effect (nonzero components of β_X) equal to 1.2√(2 log p) ≈ 4.95 and a dominant effect (nonzero components of β_Z) randomly sampled from N(0, σ = 2√(2 log p)). The data is analyzed using model (3.10), that is, assuming a linear effect of the alleles even when this is not true.

Second, to explore the sensitivity to violations of the assumption of normality of the error terms, we considered (1) error terms z_i with a Laplace distribution, with the scale parameter adjusted so that the variance equals one, and (2) error terms contaminated with 50 outliers ∼ N(0, σ = 5), representing 1% of all observations.
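These perturbations can be sketched as follows, continuing the snippet above (it reuses rng, geno, X, n, p and k). The dominance coding follows (3.11), the Laplace scale is set so that the variance is one, and the scale of the dominant effects mirrors our reading of a garbled value in the reprint, so treat it as an assumption.

# Dominance coding (3.11): heterozygotes aA -> 1, homozygotes aa/AA -> -1.
Zt = np.where(geno == 0.0, 1.0, -1.0)
Z = Zt - Zt.mean(axis=0)
Z /= np.linalg.norm(Z, axis=0)

idx = rng.choice(p, size=k, replace=False)              # causal SNPs
beta_X = np.zeros(p); beta_X[idx] = 1.2 * np.sqrt(2 * np.log(p))
beta_Z = np.zeros(p)
beta_Z[idx] = rng.normal(0.0, 2 * np.sqrt(2 * np.log(p)), size=k)  # assumed scale
y_dom = X @ beta_X + Z @ beta_Z + rng.standard_normal(n)

# Non-Gaussian errors for the additive model (3.10): Laplace with unit
# variance (Var = 2*scale^2, so scale = 1/sqrt(2)), and 1% gross outliers.
eps_laplace = rng.laplace(scale=1 / np.sqrt(2), size=n)
eps_outlier = rng.standard_normal(n)
rows = rng.choice(n, size=n // 100, replace=False)      # 50 rows when n = 5000
eps_outlier[rows] = rng.normal(0.0, 5.0, size=rows.size)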
Figure 9 summarizes the performance of SLOPE and of the marginal tests (adjusted for multiplicity via BH), which we include for reference purposes. Violation of the model assumptions appears to affect power rather than FDR in the case of SLOPE. Specifically, in all three examples the FDR is kept very close to the nominal level, while the power is somewhat diminished with respect to Figure 8. The smallest difference is observed in the case of Laplace errors, where the results of SLOPE are almost the same as in the case of normal errors. This is also the case where the difference in performance due to model misspecification is negligible for marginal tests. In all other cases, this approach appears much more sensitive than SLOPE to model misspecification.

Fig. 9. FDR and power of "scaled" SLOPE based on the "gaussian" sequence λ_G⋆ (left panel) and of BH-corrected single-marker tests (right panel) for different deviations from the assumed regression model. Error bars for FDR correspond to mean FDP ± 2 SE.

3.3. A real data example from genetics. In this section we illustrate the application of SLOPE to a current problem in genetics. In Service et al. (2014), the authors investigate the role of genetic variants in 17 regions in the genome, selected on the basis of previously reported associations with traits related to cardiovascular health. Polymorphisms are identified via exome resequencing in approximately 6000 individuals of Finnish descent: this provides a comprehensive survey of the genetic diversity in the coding portions of these regions and affords the opportunity to investigate which of these variants have an effect on the traits of interest. While the original study has a broader scope, we here tackle the problem of identifying which genetic variants in these regions impact fasting blood HDL levels. Previous literature reported associations between 9 of the 17 regions and HDL, but the resolution of these earlier studies was unable to pinpoint specific variants in these regions or to distinguish whether one or multiple variants within a region impact HDL. The resequencing study was designed to address this problem.

The analysis in Service et al. (2014) relies substantially on "marginal" tests: the effect of each variant on HDL is examined via a linear regression that has cholesterol level as the outcome and the genotype of the variant as the explanatory variable, together with covariates that capture possible population stratification. Such marginal tests are common in genetics and represent the standard approach in genome-wide association studies (GWAS). Among their advantages, it is worth mentioning that they allow the use of all available observations for each variant without requiring imputation of missing data; their computational cost is minimal; and they result in a p-value for each variant that can be used to clearly communicate to the scientific community the strength of the evidence in favor of its impact on a particular trait. Marginal tests, however, cannot distinguish whether the association between a variant and a phenotype is "direct" or due to correlation between the variant in question and another one, truly linked to the phenotype. Since most of the correlation between genetic variants is due to their location along the genome (with nearby variants often correlated), this confounding is often considered not too serious a limitation in GWAS: multiple polymorphisms associated with a phenotype at one locus simply indicate that there is at least one genetic variant (most likely not measured in the study) with impact on the phenotype at the locus. The situation is quite different in the resequencing study we want to analyze, where establishing whether one or more variants in the same region influence HDL is one of the goals. To address this, the authors of Service et al. (2014) resort to regressions that include two variables at a time: one of these is the variant with the previously documented strongest marginal signal in the region; the other is each of the variants that passed an FDR-controlling threshold in the single-variant analysis. Model selection strategies were only cursorily explored, with a step-wise search routine that targets BIC. Such a limited foray into model selection is motivated by the fact that one major concern in genetics is to control some global measure of type I error, and currently available model selection strategies do not offer finite sample guarantees in this regard. This goal is in line with that of SLOPE, and so it is interesting for us to apply this new procedure to this problem.

The data set in Service et al. (2014) comprises 1878 variants on 6121 subjects. Before analyzing it with SLOPE, or other model selection tools, we performed the following filtering. We eliminated from consideration variants observed only once (a total of 486), since it would not be possible to make inference on their effect without strong assumptions. We examined the correlation between variants and selected for analysis a set of variants with pairwise correlation smaller than 0.3. Larger values would make it quite challenging to interpret the outcomes; they make it difficult to compare results across procedures, since these might select different variables from a group of correlated ones; and large correlations are likely to adversely impact the efficacy of any model selection procedure. This reduction was carried out in an iterative fashion, selecting representatives from groups of correlated variables, starting from the stronger levels of correlation and moving on to lower ones. Among correlated variables, we selected those that had stronger univariate association with HDL and larger minor allele frequency (diversity), and, among very rare variants, we privileged those whose annotation was more indicative of possible functional effects. Once the variables were identified, we eliminated subjects missing values for more than 10 variants, as well as those missing HDL. The remaining missing values were imputed using the average allele count per variant. This resulted in a design with 5375 subjects and 777 variants.
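The filtering just described can be sketched as follows. This is a simplified, greedy rendering of the iterative procedure: we assume the criteria used to pick representatives (univariate association, MAF, annotation) have been collapsed into a single priority score, whereas the actual reduction worked down from stronger to weaker correlation levels.

import numpy as np

def prune_variants(G, priority, threshold=0.3):
    # G: n x p genotype matrix, np.nan for missing calls.
    # priority: length-p score, higher = preferred representative.
    # Returns (indices of retained columns, imputed matrix).
    col_mean = np.nanmean(G, axis=0)            # average allele count per variant
    G_imp = np.where(np.isnan(G), col_mean, G)  # mean imputation
    corr = np.corrcoef(G_imp, rowvar=False)
    keep = []
    for j in np.argsort(-priority):             # consider best candidates first
        if all(abs(corr[j, i]) < threshold for i in keep):
            keep.append(j)
    return np.sort(np.asarray(keep, dtype=int)), G_imp

Subjects with more than 10 missing variants (or missing HDL) would be dropped before the imputation step.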
The minor allele frequency of the variants included ranges from 2 × 10⁻⁴ to 0.5, with a median of 0.001 and a mean of 0.028: the data set still includes a number of rare variants, with minor allele frequency smaller than 0.01.

In Service et al. (2014), the association between HDL and polymorphisms was analyzed only for variants in regions previously identified as having an influence on HDL: ABCA1, APOA1, CETP, FADS1, GALNT2, LIPC, LPL, MADD, and MVK (regions are identified with the name of one of the genes they contain). Moreover, only variants with minor allele frequencies larger than 0.01 were individually investigated, while nonsynonymous rare variants were analyzed with "burden tests." These restrictions were motivated, at least in part, by the desire to restrict the tests to the most well-powered ones, so that controlling for multiple comparisons would not translate into an excessive decrease of power. Our analysis is based on all variants that survive the described filtering in all regions, including those not directly sequenced in the experiment in Service et al. (2014) but included in the study as landmarks of previously documented associations (array SNPs in the terminology of the paper). We compare the following approaches: (1) the marginal tests described above in conjunction with BH and q = 0.05; (2) BH with q = 0.05 applied to the p-values from the full model regression; (3) Lasso with λ_Bonf and α = 0.05; (4) Lasso with λ_CV (in these last two cases we use the routines implemented in glmnet in R); (5) the R routine Step.AIC in the forward direction with BIC as the optimality criterion; (6) the R routine Step.AIC in the backward direction with BIC as the optimality criterion; (7) SLOPE with λ_G⋆ and q = 0.05; (8) SLOPE with the sequence λ obtained via Monte Carlo starting from our design matrix. Defining the λ for Lasso–λ_Bonf and for SLOPE requires knowledge of the noise level σ; we estimated this from the residuals of the full model. When estimating λ via the Monte Carlo approach, for each i we used 5000 independent random draws of X_S and X_j. Figure 10(a) illustrates that the Monte Carlo sequence λ_MC is only slightly larger than λ_G⋆: the difference increases with the index i and becomes substantial only for ranges of i that are unlikely to be relevant in the scientific problem at hand.

Fig. 10. (a) Graphical representation of the sequences λ_MC and λ_G⋆ for the variants design matrix. Mean FDP ± 2 SE for SLOPE with (b) λ_G⋆ and (c) λ_MC for the variants design matrix and β_1 = · · · = β_k = √(2 log p), σ = 1.
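Approach (2) is the easiest to reproduce; a sketch using statsmodels is below. This is our rendering, not the authors' code, and it presumes n > p, as holds here with 5375 subjects and 777 variants.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.multitest import multipletests

def full_model_bh(X, y, q=0.05):
    # BH at level q on the p-values of the full linear model y ~ 1 + X.
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    pvals = np.asarray(fit.pvalues)[1:]          # drop the intercept
    reject, _, _, _ = multipletests(pvals, alpha=q, method="fdr_bh")
    return np.flatnonzero(reject)                # indices of selected variants

The residual standard error of the same full-model fit, np.sqrt(fit.mse_resid), can serve as the estimate of σ required by Lasso–λ_Bonf and SLOPE, as described above.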
Tables 1 and 2 in Service et al. (2014) describe a total of 14 variants as having an effect on HDL; two of these are in the regions FADS1 and MVK, and the strength of the evidence in this specific data set is quite weak (a marginal p-value of the order of 10−). Multiple effects are identified in the regions ABCA1, CETP, LPL and LIPC. The results of the various "model selection" strategies we explored are in Figure 11, which reports the estimated values of the coefficients. The effects of the shrinkage induced by Lasso and SLOPE are evident. To properly compare effect sizes across methods, it would be useful to resort to the two-step procedure that we used for the simulation described in Figure 2. Since our interest here is purely model selection, we report the coefficients directly as estimated by the ℓ1-penalized procedures; this has the welcome side effect of increasing the spread of points in Figure 11, improving visibility.

Fig. 11. Estimated effects on HDL for variants in the 17 regions. Each panel corresponds to a region and is identified by the name of a gene in the region, following the convention in Service et al. (2014). Regions with (without) previously reported association to HDL are on a green (red) background. On the x-axis, the variants' positions in base pairs along their respective chromosomes; on the y-axis, the estimated effect according to the different methodologies. With the exception of marginal tests (which we use to convey information on the number of variables, and indicate with light gray squares), we report only the values of nonzero coefficients. The rest of the plotting symbols and color conventions are as follows: dark gray bullet: BH on p-values from the full model; magenta cross: forward BIC; purple cross: backward BIC; red triangle: Lasso–λ_Bonf; orange triangle: Lasso–λ_CV; cyan star: SLOPE–λ_G⋆; black circle: SLOPE with λ defined via the Monte Carlo strategy.

Of the 14 variants described in Service et al. (2014), 8 are selected by all methods. The remaining 6 are each selected by at least some of the 8 methods we compared. There are an additional 5 variants that are selected by all methods but are not in the main list of findings in the original paper: four of these are rare variants, and one is an array SNP for a trait other than HDL. While none of these, therefore, was singularly analyzed for association in Service et al. (2014), they are in highlighted regions: one is in MADD, and the others are in ABCA1 and CETP, where the paper documents a plurality of signals.

Besides this core of common selections, which corresponds well to the original findings, there are notable differences among the 8 approaches we considered. The total number of selected variables ranges from 15, with BH on the p-values of the full model, to 119, with the cross-validated Lasso. It is not surprising that these methods produce the extreme solutions. On the one hand, the p-values from the full model reflect the contribution of one variable given all the others, which are, however, not necessarily included in the models selected by the other approaches; on the other hand, we have seen how the cross-validated Lasso tends to select a much larger number of variables and offers no control of the FDR. In our case, the cross-validated Lasso estimates nonzero coefficients for 90 variables that are not selected by any other method. Note that the number of variables selected by the cross-validated Lasso changes in different runs of the procedure, as implemented in glmnet with default parameters. It is quite reasonable to assume that a large number of these are false positives: the regions G6PC2, PANK1, CRY2 and MTNR1B, where the Lasso–λ_CV selects some variants, have no documented association with lipid levels, and the regions CELSR2, GCKR, ABCG8 and NCAN have previously been associated with total cholesterol and LDL, but not with HDL. The other procedures that select some variants in any of these regions are the forward and backward greedy searches trying to optimize BIC, which have hits in CELSR2 and ABCG8, and BH on the univariate p-values, which has one hit in ABCG8. SLOPE does not select any variant in regions not known to be associated with HDL. This is true also of the Lasso–λ_Bonf and of BH on the p-values from the full model, but these miss,
respectively, 2 and 6 of the variants described in the original paper, while SLOPE–λ_G⋆ misses only one of them.

Figure 12 focuses on the set of variants for which there is some disagreement between the 8 procedures we considered, after eliminating the 90 variants selected only by the Lasso–λ_CV. In addition to recovering all except one of the variants identified in Service et al. (2014), and the core of variants selected by all methods, SLOPE–λ_G⋆ selects 3 rare variants and 3 common variants. While the rare variants were not singularly analyzed in the original study, they are in the two regions where aggregate tests highlighted the role of this type of variation: one is in ABCA1 and the other two are in CETP, and both of the latter are nonsynonymous. Two of the three additional common variants are in CETP and one is in MADD; in addition to SLOPE, these are selected by Lasso–λ_CV and by the marginal tests. One of the common variants and one rare variant in CETP are mentioned as a result of the limited foray into model selection in Service et al. (2014). SLOPE–λ_MC selects two fewer of these variants.

Fig. 12. Each row corresponds to a variant in the set differently selected by the compared procedures, indicated by columns. Orange is used to represent rare variants and blue common ones. Squares indicate synonymous (or noncoding) variants and circles nonsynonymous ones. Variants are ordered according to the frequency with which they are selected. Variants with names in green are mentioned in Service et al. (2014) as having an effect on HDL, while variants with names in red are not [if a variant was not in dbSNP build 137, we named it by indicating chromosome and position, following the convention in Service et al. (2014)].

In order to get a handle on the effective FDR control of SLOPE in this setting, we resorted to simulations. We consider a number k of relevant variants ranging from 0 to 100, while concentrating on the lower values. At each level, k columns of the design matrix were selected at random and assigned an effect of √(2 log p), against a noise level σ set to 1. While analyzing the data with λ_MC and λ_G⋆, we estimated σ from the full model in each run. Figure 10(b)-(c) reports the average FDP across the 500 replicates and its standard error: the FDR of both λ_MC and λ_G⋆ stays close to the nominal level throughout the sparser settings.

4. Discussion. The ease with which data are presently acquired has effectively created a new scientific paradigm. In addition to carefully designing experiments to test specific hypotheses, researchers often collect data first, leaving question formulation to a later stage. In this context, linear regression has increasingly been used to identify connections between one response and a large number p of possible explanatory variables. When p ≫ n, approaches based on convex optimization have been particularly effective. An easily computable solution has the advantage of definitiveness and of reproducibility: another researcher, working on the same data set, would obtain the same answer. Reproducibility of a scientific finding, that is, of the association between the outcome and the set of explanatory variables selected among many, is, however, harder to achieve. Traditional tools such as p-values are often unhelpful in this context because of the difficulties of accounting for the effect of selection. In response, a great number of proposals [see, e.g., Benjamini and Yekutieli (2005), Berk et al. (2013), Bühlmann (2013), Efron (2011), Javanmard and Montanari (2014a, 2014b), Lockhart et al. (2014), Meinshausen and Bühlmann (2010), Wasserman and Roeder (2009), Meinshausen, Meier and Bühlmann (2009), van de Geer et al.
(2014), Zhang and Zhang (2014)] present different approaches for controlling some measure of type I error in the context of variable selection. We here chose as a useful paradigm that of controlling the expected proportion of irrelevant variables among the selected ones. A similar goal of FDR control is pursued in Foygel-Barber and Candès (2014) and Grazier G'Sell, Hastie and Tibshirani (2013). While Foygel-Barber and Candès (2014) achieve exact FDR control in finite samples irrespective of the structure of the design matrix, this method, at least in its current implementation, is really best tailored to cases where n > p. The work in Grazier G'Sell, Hastie and Tibshirani (2013) relies on p-values evaluated as in Lockhart et al. (2014), and is limited to the contexts where the assumptions in Lockhart et al. (2014) are met, including the assumption that all true regressors appear before the false regressors along the Lasso path. SLOPE controls the FDR under orthogonal designs, and simulation studies also show that SLOPE can keep the FDR close to the nominal level when p > n and the true model is sparse, while offering large power and accurate prediction. This is, of course, only a starting point, and many open problems remain.

First, while our heuristic for the choice of the λ sequence allows us to keep the FDR under control for Gaussian designs and other random design matrices [more examples are provided in Bogdan et al. (2013)], it is by no means a definite solution. Further theoretical research is needed to identify sequences λ that provably control the FDR for these and other typical design matrices.

Second, just as in the BH procedure, where the test statistics are compared with fixed critical values, we have only considered in this paper fixed values of the regularizing sequence {λ_i}. It would be interesting to know whether it is possible to select such parameters in a data-driven fashion so as to achieve desirable statistical properties. For the simpler Lasso problem, for instance, an important question is whether it is possible to select λ on the Lasso path so as to control the FDR. In the case where n ≥ p, a method achieving this goal was recently proposed in Foygel-Barber and Candès (2014). It would be of great interest to know whether similar positive theoretical results can be obtained for SLOPE, in perhaps restricted sparse settings.

Third, our research points out the limits of the signal sparsity that can be handled by SLOPE. Such limitations are inherent to ℓ1 convex optimization methods and also pertain to the Lasso. Some discussion of the minimal FDR that can be obtained with the Lasso under Gaussian designs is provided in Bogdan et al. (2013), while new evocative results on adaptive versions of the Lasso are on the way.

Fourth, we illustrated the potential of SLOPE for multiple testing with positively correlated test statistics. In our simple ANOVA model, SLOPE controls the FDR even when the unknown variance components are replaced with their estimates. It remains an open problem to theoretically describe a possibly larger class of unknown covariance matrices for which SLOPE can be used effectively.

In conclusion, we hope that the work presented so far will convince the reader that SLOPE is an interesting convex program with promising applications in statistics, and that it motivates further research.

Acknowledgments. We would like to thank the Editor, Professor Karen Kafadar, the Associate Editor and two reviewers for many constructive suggestions, which led to a substantial improvement of this article.
We thank the authors of Service et al. (2014) for letting us use their data during the completion of the dbGaP release. Emmanuel J. Candès would like to thank Stephen Becker for all his help in integrating the sorted ℓ1 norm software into TFOCS. Małgorzata Bogdan would like to thank David Donoho and David Siegmund for encouragement and Hatef Monajemi for helpful discussions. We are very grateful to Lucas Janson for suggesting the acronym SLOPE, and to Rina Foygel Barber and Julie Josse for useful comments on an early version of the manuscript.

SUPPLEMENTARY MATERIAL

Supplement to "SLOPE—Adaptive variable selection via convex optimization" (DOI: 10.1214/15-AOAS842SUPP; .pdf). The online Appendix contains proofs of some technical results discussed in the text.

REFERENCES

Abramovich, F. and Benjamini, Y. (1995). Thresholding of wavelet coefficients as multiple hypotheses testing procedure. In Wavelets and Statistics. Lecture Notes in Statistics.
Abramovich, F., Benjamini, Y., Donoho, D. L. and Johnstone, I. M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control AC-19.
Barlow, R. E., Bartholomew, D. J., Bremner, J. M. and Brunk, H. D. (1972). Statistical Inference Under Order Restrictions. The Theory and Application of Isotonic Regression. Wiley, New York. MR0326887
Bauer, P., Pötscher, B. M. and Hackl, P. (1988). Model selection by multiple test procedures. Statistics.
Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci.
Becker, S. R., Candès, E. J. and Grant, M. C. (2011). Templates for convex cone problems with applications to sparse signal recovery. Math. Program. Comput.
Benjamini, Y. and Gavrilov, Y. (2009). A simple forward selection procedure based on false discovery rate control. Ann. Appl. Stat.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B.
Benjamini, Y. and Yekutieli, D. (2005). False discovery rate-adjusted multiple confidence intervals for selected parameters. J. Amer. Statist. Assoc.
Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013). Valid post-selection inference. Ann. Statist.
Best, M. J. and Chakravarti, N. (1990). Active set algorithms for isotonic regression; a unifying framework. Math. Program.
Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. (JEMS).
Bogdan, M., Chakrabarti, A., Frommlet, F. and Ghosh, J. K. (2011). Asymptotic Bayes-optimality under sparsity of some multiple testing procedures. Ann. Statist.
Bogdan, M., Ghosh, J. K. and Żak-Szatkowska, M. (2008). Selecting explanatory variables with the modified version of the Bayesian information criterion. Qual. Reliab. Eng. Int.
Bogdan, M., van den Berg, E., Sabatti, C., Su, W. and Candès, E. J. (2015). Supplement to "SLOPE—Adaptive variable selection via convex optimization." DOI:10.1214/15-AOAS842SUPP.
Bogdan, M., van den Berg, E., Su, W. and Candès, E. J. (2013). Statistical estimation and testing via the ordered ℓ1 norm. Preprint. Available at arXiv:1310.1969v2.
Bondell, H. D. and Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics.
Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli.
Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist.
Candès, E. J., Wakin, M. B. and Boyd, S. P. (2008). Enhancing sparsity by reweighted ℓ1 minimization. J. Fourier Anal. Appl.
de Leeuw, J., Hornik, K. and Mair, P. (2009). Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods. J. Stat. Softw.
Efron, B. (2011). Tweedie's formula and selection bias. J. Amer. Statist. Assoc.
Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist.
Foster, D. P. and Stine, R. A. (1999). Local asymptotic coding and the minimum description length. IEEE Trans. Inform. Theory.
Foygel-Barber, R. and Candès, E. J. (2014). Controlling the false discovery rate via knockoffs. Ann. Statist. To appear. Available at arXiv:1404.5609.
Frommlet, F. and Bogdan, M. (2013). Some optimality properties of FDR controlling rules under sparsity. Electron. J. Stat.
Frommlet, F., Ruhaltinger, F., Twaróg, P. and Bogdan, M. (2012). Modified versions of the Bayesian information criterion for genome-wide association studies. Comput. Statist. Data Anal.
Grazier G'Sell, M., Hastie, T. and Tibshirani, R. (2013). False variable selection rates in regression. Preprint. Available at arXiv:1302.2303.
Grotzinger, S. J. and Witzgall, C. (1984). Projections onto order simplexes. Appl. Math. Optim.
Ingster, Yu. I. (1998). Minimax detection of a signal for ℓn-balls. Math. Methods Statist.
Javanmard, A. and Montanari, A. (2014a). Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res.
Javanmard, A. and Montanari, A. (2014b). Hypothesis testing in high-dimensional regression under the Gaussian random design model: Asymptotic theory. IEEE Trans. Inform. Theory.
Kruskal, J. B. (1964). Nonmetric multidimensional scaling: A numerical method. Psychometrika.
Lockhart, R., Taylor, J., Tibshirani, R. J. and Tibshirani, R. (2014). A significance test for the Lasso. Ann. Statist.
Mallows, C. L. (1973). Some comments on C_p. Technometrics.
Meinshausen, N. (2007). Relaxed Lasso. Comput. Statist. Data Anal.
Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol.
Meinshausen, N., Meier, L. and Bühlmann, P. (2009). p-values for high-dimensional regression. J. Amer. Statist. Assoc.
Nesterov, Y. (2004). Introductory Lectures on Convex Optimization. A Basic Course. Kluwer Academic, Boston, MA. MR2142598
Nesterov, Y.
Parikh, N. and Boyd, S. (2013). Proximal algorithms. Foundations and Trends in Optimization.
Sarkar, S. K. (2002). Some results on false discovery rate in stepwise multiple testing procedures. Ann. Statist.
Service, S. K., Teslovich, T. M., Fuchsberger, C., Ramensky, V., Yajnik, P., Koboldt, D. C., Larson, D. E., Zhang, Q., Lin, L., Welch, R., Ding, L., McLellan, M. D., O'Laughlin, M., Fronick, C., Fulton, L. L., Magrini, V., Swift, A., Elliott, P., Jarvelin, M. R., Kaakinen, M., McCarthy, M. I., Peltonen, L., Pouta, A., Bonnycastle, L. L., Collins, F. S., Narisu, N., Stringham, H. M., Tuomilehto, J., Ripatti, S., Fulton, R. S., Sabatti, C., Wilson, R. K., Boehnke, M. and Freimer, N. B. (2014). Re-sequencing expands our understanding of the phenotypic impact of variants at GWAS loci. PLoS Genet. e1004147.
Städler, N., Bühlmann, P. and van de Geer, S. (2010). ℓ1-penalization for mixture regression models. TEST.
Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika.
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B.
Tibshirani, R. and Knight, K. (1999). The covariance inflation criterion for adaptive model selection. J. R. Stat. Soc. Ser. B. Stat. Methodol.
van de Geer, S., Bühlmann, P., Ritov, Y. and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist.
Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection. Ann. Statist.
Wu, Z. and Zhou, H. H. (2013). Model selection and sharp asymptotic minimaxity. Probab. Theory Related Fields.
Zeng, X. and Figueiredo, M. (2014). Decreasing weighted sorted ℓ1 regularization. IEEE Signal Process. Lett.
Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol.
Zhong, L. and Kwok, J. (2012). Efficient sparse modeling with automatic feature grouping. IEEE Trans. Neural Netw. Learn. Syst.
Zou, H. (2006). The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc.