Tuning parameter calibration for ℓ1-regularized logistic regression
Wei Li
School of Mathematical Sciences, Peking University, Beijing, China
Johannes Lederer∗
Departments of Statistics and Biostatistics, University of Washington, Seattle, WA, USA
Abstract
Feature selection is a standard approach to understanding and modeling high-dimensional classification data, but the corresponding statistical methods hinge on tuning parameters that are difficult to calibrate. In particular, existing calibration schemes in the logistic regression framework lack any finite sample guarantees. In this paper, we introduce a novel calibration scheme for ℓ1-penalized logistic regression. It is based on simple tests along the tuning parameter path and is equipped with optimal guarantees for feature selection. It is also amenable to easy and efficient implementations, and it rivals or outmatches existing methods in simulations and real data applications.
Keywords: Feature selection; Penalized logistic regression; Tuning parameter calibration
1. Introduction
The advent of high-throughput technology has created a large demand for feature selection with high-dimensional classification data. In gene expression analysis or genome-wide association studies, for example, investigators attempt to select from a large set of potential risk factors the predictors that are most useful in discriminating two or more conditions of interest. The standard approaches for such tasks are penalized likelihood methods [1, 2, 3, 4, 5, 6]. However, the performance of these methods hinges on the calibration of tuning parameters that balance model fit and model complexity.

The focus of this paper is the calibration of the ℓ1-penalized likelihood for feature selection in logistic regression. The most widely used schemes for this calibration are based on Cross-Validation (cv) [7] or information criteria, including the Akaike information criterion (aic) [8], the Bayesian information criterion (bic) [9], the extended Bayesian information criterion (ebic) [10], and the Generalized Information Criterion (gic) [11]. cv- and aic-based procedures are designed for prediction and are thus typically not suited for feature selection [12]. In contrast, bic-, ebic-, and gic-type procedures, see also the recent versions in [13, 10], are designed primarily for feature selection, and some consistency results for model selection have been derived [14, 15]. Yet another approach based on a permutation idea has been introduced in [16]. However, all these methods share the same limitation in that they lack finite sample guarantees. This means that theoretical backup for applications, where sample sizes are always finite, is not available.

In this paper, we introduce a novel calibration scheme based on a testing procedure and sharp ℓ∞-bounds. It is easy to implement and computationally efficient, and in contrast to all previous approaches, it is equipped with finite sample guarantees. Our proposal is thus of immediate practical and theoretical relevance.

The remainder of this paper is organized as follows. Section 2 contains our main proposal and the theoretical results. Section 3 and Section 4 demonstrate that our method is also a contender in simulations and real data applications. Section 5 contains a brief discussion. The proofs and further simulations are deferred to the Appendix.

∗ Corresponding author. Email address: [email protected] (Johannes Lederer). Published version: https://doi.org/10.1016/j.jspi.2019.01.006, J. Statist. Plann. Inference; March 1, 2019.

Notation.
The index sets are denoted by [k] = {1, . . . , k} for k ∈ {1, 2, . . . }, and the cardinality of sets is denoted by | · |. For a given vector β ∈ R^p, the support set of β is written as supp(β) = {j ∈ [p] : β_j ≠ 0}, and for q ∈ [1, ∞], the ℓq-norm of β is denoted by ‖β‖_q. The ℓq-induced matrix-operator norms are denoted by |||·|||_q. Two examples are the spectral norm |||·|||_2, which denotes the maximal singular value of a matrix, and the ℓ∞-matrix norm |||X|||_∞ = max_{i=1,...,n} Σ_{j=1}^p |X_ij|. The minimal and the maximal eigenvalue of a square matrix are denoted by Ω_min(·) and Ω_max(·), respectively. For a given subset A of [p], the vectors β_A ∈ R^{|A|} and β_{A^c} ∈ R^{|A^c|} denote the components of β in A and in its complement A^c, respectively, and given a matrix X ∈ R^{n×p}, the matrix X_A denotes the sub-matrix of X with column indexes restricted to A. The diagonal matrix with diagonal elements a_1, . . . , a_n is denoted by diag{a_1, . . . , a_n}. The function w is finally defined as w(u, v) = exp(u⊤v)/(1 + exp(u⊤v))² for vectors u, v of the same length.
2. Methodology
In this section, we formulate the general setting and introduce the assumptions required for the theoretical analysis. We consider data in the form of a real-valued n × p design matrix X and a binary response vector Y = (y_1, . . . , y_n)⊤. Our framework allows for high-dimensional data, where p rivals or outmatches n. We denote the rows of X (i.e., the samples) by x_1, . . . , x_n ∈ R^p and the columns of X (i.e., the predictors) by x_1, . . . , x_p ∈ R^n. The matrix X can be deterministic or a realization of a random matrix, but in any case, we assume that X has been normalized, i.e., ‖x_j‖_2 = 1 for j = 1, . . . , p.

The design matrix X and the response vector Y are linked by the standard logistic regression model

Pr(y_i = 1 | x_i) = exp(x_i⊤β*)/(1 + exp(x_i⊤β*))   (i = 1, . . . , n),   (1)

where β* ∈ R^p is the unknown regression vector. Our goal is feature selection (also called support recovery), that is, estimation of the support set S = supp(β*). The starting point for approaching this task is the well-known family of estimators

β̂_λ ∈ argmin_{β ∈ R^p} { L(β) + λ‖β‖_1 }   (λ > 0),   (2)

indexed by the tuning parameter λ. The first term L(β) = (1/n) Σ_{i=1}^n [log(1 + exp(x_i⊤β)) − y_i x_i⊤β] is the negative log-likelihood function, and the second term is a regularization that exploits that s = |S| ≪ n, p in many applications. We estimate S by supp(β̂_λ) for a data-driven tuning parameter λ ≡ λ(Y, X).

We take (2) as our starting point because it has become the most popular family of estimators in our context. The main reason for this popularity is that ℓ1-regularization is equipped with fast algorithms and some theoretical understanding. One might argue that different starting points might suit some tasks better; for example, one drawback of ℓ1-regularization is that it leads to complex dependencies on the design matrix and can require potentially unrealistic assumptions [17, 18, 19, 20, 21]. But, importantly, our calibration approach is agnostic to what estimator is used: it only requires a suitable oracle inequality (while deriving such oracle inequalities might be difficult, of course). In this sense, the following results are merely an indication of the full potential of our calibration scheme.

Support recovery with (2) is feasible only if the correlations in the design matrix X are sufficiently small. In the following, we state corresponding assumptions that virtually coincide with those used by [4] in the context of Ising models. The assumptions are formulated in terms of X⊤WX/n, the Hessian of the log-likelihood function evaluated at the true regression parameter β*, where W = diag{w(x_1, β*), . . . , w(x_n, β*)}. We first require that the submatrix of the Hessian matrix corresponding to the relevant covariates has eigenvalues bounded away from zero.

Assumption 1 (Minimal eigenvalue condition). It holds that c_min = Ω_min(X_S⊤WX_S/n) > 0.

Note that if this assumption were violated, the relevant covariates would be linearly dependent, and the true support set S would not be well-defined. Thus, this condition ensures that the relevant covariates do not become highly dependent. We then impose an irrepresentability condition [22].

Assumption 2 (Irrepresentability condition). It holds that γ = 1 − |||(X_S⊤WX_S)^{-1} X_S⊤WX_{S^c}|||_∞ > 0.

This assumption is a modified version of the irrepresentability condition commonly used in the theory for linear regression with the Lasso [22]. More generally, irrepresentability conditions prevent the relevant covariates from being strongly correlated with the irrelevant covariates. This ensures that the true support set can be identified with finitely many samples.

Assumptions 1 and 2 also make assumptions on s = |S| implicit. For illustration, suppose that the two assumptions hold for the population matrices E_β*(X_S⊤WX_S) and {E_β*(X_S⊤WX_S)}^{-1} E_β*(X_S⊤WX_{S^c}). Then, according to Lemmas 5 and 6 in [4], these two assumptions hold with large probability if s³ log(p) = o(n). In this spirit, larger s makes the conditions more restrictive on other aspects of the model.

Importantly, however, the above assumptions on the design are not needed in the analysis of the proposed scheme itself. Instead, the assumptions are needed to ensure that there is a viable estimator in the family (2) at all. We discuss this in the following section.
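For readers who want to compute the path of estimators (2) in practice, the following R sketch does so with the glmnet package that is also used in Sections 3 and 4. It is a minimal illustration rather than the authors' code; the objects X, y, and lambda_grid are assumed to exist, and the options intercept = FALSE and standardize = FALSE are chosen here only to match model (1) and the column normalization described above.

## Minimal sketch (assumed inputs X, y): the l1-penalized logistic path (2) via glmnet.
library(glmnet)

lambda_grid <- exp(seq(log(1e-4), log(1), length.out = 100))   # example grid Lambda

fit <- glmnet(X, y, family = "binomial",
              lambda = rev(lambda_grid),   # glmnet expects a decreasing sequence
              intercept = FALSE,           # model (1) has no intercept term
              standardize = FALSE)         # columns assumed pre-normalized

## Estimate and support for one tuning parameter on the grid:
beta_hat <- as.numeric(coef(fit, s = lambda_grid[50]))[-1]   # drop the intercept slot
S_hat    <- which(beta_hat != 0)                             # estimate of S = supp(beta*)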
ℓ∞-estimation and support recovery

ℓ∞-estimation and support recovery are two closely related aspects of high-dimensional logistic regression. In this section, we thus establish oracle inequalities for both these tasks.

To state the result, we define the vector of residuals as ε = (ε_1, . . . , ε_n)⊤ with entries ε_i = y_i − Pr(y_i = 1 | x_i) for i ∈ [n]. The vector ε is random noise with mean zero. As is standard in the theory of high-dimensional statistics, our results are based on an event

T_λ = { (4(2 − γ)/(nγ)) ‖X⊤ε‖_∞ ≤ λ }   (λ > 0).

Terms such as 4(2 − γ)‖X⊤ε‖_∞/(nγ) are sometimes called the "effective noise" [23] and can often be bounded by standard empirical process theory [1]. Essentially, the event states that heavy noise (ε large) requires strong penalization (λ large). For the technical proofs, we also assume λ ≤ γc_min²/(100(2 − γ)s c_max) in the remainder, and for ease of presentation, we set a = |||(X_S⊤WX_S)^{-1}|||_∞ / |||(X_S⊤WX_S)^{-1}|||_2. Then, we find the following result. (On a high level, γc_min²/(100(2 − γ)s c_max) is of much larger order than 4(2 − γ)‖X⊤ε‖_∞/(nγ) ∼ 1/√n, so that γc_min²/(100(2 − γ)s c_max) ≫ 4(2 − γ)‖X⊤ε‖_∞/(nγ). Since the right-hand side of this relation is basically the optimal tuning parameter targeted in our study, see the next section, the much larger upper bound on λ has no impact on our analysis. For details, in particular on the constants, we refer to the proofs section.)

Theorem 1 (ℓ∞-bound and support recovery). Under Assumptions 1 and 2, the following properties hold on the event T_λ:
(a) ℓ∞-bound: ‖β̂_λ − β*‖_∞ ≤ 1.5aλ/c_min;
(b) support recovery: supp(β̂_λ) ⊂ S, and supp(β̂_λ) = S if min_{j∈S} |β*_j| > 1.5aλ/c_min.

Oracle inequalities are the standard way to state finite sample bounds in high-dimensional statistics [1]. Similar results for ℓ1-penalized logistic/linear regression have also been derived elsewhere [3, 24, 25, 4], but the above formulation is particularly useful for our purposes. Part (a) implies that for a suitable tuning parameter λ, the estimator β̂_λ is uniformly close to the regression vector β*. Part (b) implies that for a suitable tuning parameter, the estimator supp(β̂_λ) provides exact support recovery if the non-zero parameters are sufficiently large. As long as the design assumptions are met, Theorem 1 thus ensures that the family (2) contains a viable estimator.

Theorem 1 ensures that the family (2) contains an accurate estimator. This leaves us with two tasks: (i) we have to formulate a notion of optimality within the family (2), in other words, we have to define what an optimal tuning parameter is; and (ii) we have to formulate a scheme to find an optimal tuning parameter from data.

To address these two tasks, we develop an approach that relates to the AV-testing idea introduced by [24]. The AV-tests have been developed for linear regression, which differs from logistic regression both in theory and implementations. For example, ℓ∞-bounds in linear regression can be established by "standard" proof techniques, while the bounds needed here are based on the more recent Primal-Dual Witness technique. However, a more interesting, and quite striking, insight here is that the high-level arguments transfer from the linear to the non-linear setting, and even beyond.
The following discussion can thus be read as a general blueprint for feature selection calibration, while the parts specific to logistic regression, namely the proof techniques and details, are deferred to the Appendix.

Let us first define the concept of oracle tuning parameters. Since one can handle only finitely many values in practice, we consider a fixed but arbitrary sequence 0 < λ_1 < · · · < λ_N of tuning parameters and denote the corresponding set by Λ = {λ_1, . . . , λ_N}. In view of Theorem 1, an optimal tuning parameter satisfies two requirements. On the one hand, the bounds hold only on the event T_λ. Thus, an optimal tuning parameter needs to ensure that the event T_λ holds with high probability. On the other hand, the bounds are linear in λ. Thus, an optimal tuning parameter should be as small as possible. We formalize this notion as follows:

Definition 1 (Oracle tuning parameter). Given δ ∈ (0, 1), the oracle tuning parameter is

λ*_δ = arg min{λ ∈ Λ : Pr(T_λ) ≥ 1 − δ}.   (3)

Since the set Λ is finite, the oracle tuning parameter is always well-defined. It is also small in large samples: in particular, since the residuals ε in T_λ are bounded, standard concentration results ensure that λ*_δ → 0 for n → ∞ and δ fixed, as long as log p/n → 0 for n → ∞.

We call the optimal tuning parameter "oracle tuning parameter" to signify that it is a purely theoretical quantity that cannot be used in applications. First, λ*_δ depends on γ, which is unknown in practice. Second, even if γ were known, a precise evaluation of λ*_δ would be computationally intensive. Finally, it is unclear how to choose δ. We thus aim at finding a data-driven selection rule that mimics the performance of the optimal tuning parameter. The following tests provide this.
Definition 2 (Testing-based calibration). Given a constant C ≥ 1.5a/c_min, we select the tuning parameter

λ̂ = min{λ ∈ Λ : ‖β̂_λ' − β̂_λ''‖_∞ ≤ Cλ' + Cλ'' for all λ', λ'' ≥ λ}   (4)

and set

Ŝ = {j ∈ [p] : |(β̂_λ̂)_j| ≥ Cλ̂}.   (5)

An intuition goes as follows: suppose that λ is large enough to control the random noise, that is, 4(2 − γ)‖X⊤ε‖_∞/(nγ) ≤ λ. Then, also λ', λ'' ≥ λ are large enough, and Theorem 1 ensures that ‖β̂_λ' − β*‖_∞ ≤ Cλ' and ‖β̂_λ'' − β*‖_∞ ≤ Cλ''. Combining these two inequalities with the help of the triangle inequality shows that ‖β̂_λ' − β̂_λ''‖_∞ ≤ C(λ' + λ'') is a necessary condition for λ being large enough. What we now want is the smallest one among such "large enough" tuning parameters. This motivates us to select λ̂ as the smallest λ that satisfies ‖β̂_λ' − β̂_λ''‖_∞ ≤ C(λ' + λ'') for all λ', λ'' ≥ λ. This selection is then indeed "conservative" (λ̂ ≤ λ*_δ), but it is also optimal in the sense of Statement (5).

Two features of our testing-based scheme are apparent immediately: First, the method is computationally efficient, because it requires at most one pass of the tuning parameter path. This path can be computed by standard algorithms such as glmnet [26]. Since the structure of the tests allows for early stopping, the computation of a part of the tuning parameter path is actually sufficient. Second, the method is easy to implement because it consists of simple ℓ∞-tests along the tuning parameter path. The tests also highlight the close connections between ℓ∞-estimation and our final goal, support recovery.

The third feature of our scheme is that it is equipped with optimal finite sample theoretical guarantees. We establish this in the following result.

Theorem 2 (Optimality of the testing-based calibration). Under Assumptions 1 and 2, for any δ ∈ (0, 1) and C ≥ 1.5a/c_min, the tuning parameter λ̂ from (4) satisfies, with probability at least 1 − δ,

λ̂ ≤ λ*_δ and ‖β̂_λ̂ − β*‖_∞ ≤ 3Cλ*_δ,

and, if min_{j∈S} |β*_j| > 4Cλ*_δ, Ŝ ⊃ S.
Let us highlight some aspects of this result: First, all results are stated for fixed n, p, and all constants are specified. The bounds are thus finite sample bounds that can provide, as opposed to asymptotic bounds, concrete insights into the practical performance of the method. Next, the guarantees hold for any γ and δ, but these quantities do not need to be specified in the method. Similarly, the results hold irrespective of the set Λ, in particular, irrespective of the number of tuning parameters N. The set Λ enters the results only through λ*_δ: the finer the grid Λ, the more precise the optimal tuning parameter λ*_δ, and thus, the sharper the guarantees. Furthermore, the ℓ∞-bounds demonstrate the optimality of the method. Indeed, the estimator with the optimal, in practice unknown tuning parameter satisfies ‖β̂_{λ*_δ} − β*‖_∞ ≤ 1.5aλ*_δ/c_min, see Theorem 1. The bound for the estimator with the data-driven tuning parameter λ̂ equals this bound up to a constant factor 3. Finally, since Definition (5) contains a threshold, which is based on the guarantee λ̂ ≤ λ*_δ, the number of false positives is typically small. Yet, the second part of the theorem ensures that Ŝ contains all sufficiently large predictors, which means that also the number of false negatives is typically small. Theorem 2 thus provides accurate feature selection guarantees for the testing scheme. We are not aware of any comparable feature selection (or ℓ∞-) guarantees for standard calibration schemes.

Remark (The constant C in practice). The optimal value of C in view of the theoretical bounds is C = 1.5a/c_min. As described above, support recovery with (2) is not possible in highly correlated settings, and it has been pointed out that large β* can be problematic for ℓ1-penalized methods more generally [17]. Assuming near-orthogonal design and small parameter values in the sense of ‖β*‖ ≈ 0, so that a ≈ 1 and c_min ≈ 1/4, we find 1.5a/c_min ≈ 6. This suggests that an appropriate choice is C = 6, and we adopt this choice throughout this paper. The assumption of near-orthogonal design and small parameter values might be unrealistic in practice, but the empirical studies in Section 3 and the Appendix demonstrate good performance of C = 6 even when the model deviates substantially from this assumption. In any case, the choice of C remains a subject for further study. For example, the limitations of the heuristics for C might simply be an artifact of using the ℓ1-regularizer: there might be estimators different from (2) for which the application of our calibration scheme leads to constants C that can be theorized more globally.

To summarize, the proposed testing-based method accurately mimics the performance of the optimal tuning parameter, and yet, it is computationally efficient and does not depend on the quantities γ and δ. Moreover, the parameter C can be set to a universal constant; in particular, C does not require calibration. The simulation results below indicate that indeed no further calibration is required. The testing-based scheme is thus a practical scheme with a sound theoretical foundation.
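To make the selection rule of Definition 2 concrete, the following R sketch scans a pre-computed tuning parameter path from the largest to the smallest value and stops at the first failed pairwise test; it is a minimal illustration rather than a reference implementation. The inputs Lambda (an increasing grid) and Beta (a p × N matrix of estimates β̂_λ) are assumptions of this sketch, and C = 6 is the default suggested in the Remark above.

## Minimal sketch of the testing-based calibration (Definition 2).
select_lambda <- function(Lambda, Beta, C = 6) {
  N <- length(Lambda)
  k_hat <- N                            # start at the largest tuning parameter
  for (k in rev(seq_len(N - 1))) {      # walk down the path (allows early stopping)
    ok <- TRUE
    for (m in (k + 1):N) {              # test lambda_k against all larger lambda_m
      if (max(abs(Beta[, k] - Beta[, m])) > C * (Lambda[k] + Lambda[m])) {
        ok <- FALSE
        break
      }
    }
    if (!ok) break                      # first failure: keep the previously accepted index
    k_hat <- k
  }
  lambda_hat <- Lambda[k_hat]
  S_hat <- which(abs(Beta[, k_hat]) >= C * lambda_hat)   # thresholding as in (5)
  list(lambda = lambda_hat, support = S_hat)
}

With a glmnet fit as in the sketch of Section 2, Beta could for instance be taken as as.matrix(fit$beta), with its columns reordered so that they match the increasing grid Lambda. Because the tests for a given λ only involve estimates at larger tuning parameters, the walk down the path can stop at the first failure, which is the early-stopping property mentioned above.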
3. Simulation studies

In this section, we show the practical performance of the proposed scheme in a simulation study. We simulate data from the logistic regression model (1) with n = 200 samples and p ∈ {200, 500} predictors. The row vectors x_i of the design matrix X are i.i.d. Gaussian with mean zero and covariance Σ = (1 − κ)I + κJ, where I is the identity matrix, J is the matrix of all ones, and κ ∈ {0, 0.25, 0.5, 0.75} is the level of correlations among the predictors. The settings with larger correlations are particularly interesting, because they violate the strict design assumptions. The coordinates of the regression vector β* are set to zero except for s entries, chosen uniformly at random and set to 1 or −1, where s ranges over five sparsity levels. We use N = 500 tuning parameters that are equally spaced on [λ_1, λ_N], where λ_1 is a small fraction of λ_N and λ_N = 10 log(p)/n; this ensures a large spread of outcomes. For each of the total 40 settings, we report the means over 200 replications.

The methods under consideration are the testing-based method defined in (4) and (5), bic, ebic, 10-fold cv, and aic. The ebic is the classical bic with an additional penalty term 2θ log(p) with a positive θ. Here, we choose θ ∈ {0.25, 0.5, 1}. Note that the bic is a special form of ebic with θ = 0. No thresholding is applied for the standard methods, since there is no guidance on the choice of such a threshold. All computations are conducted with the software R [27] and the glmnet package.
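For concreteness, the following R sketch generates one data set from the design just described; the particular values of κ and s below are illustrative only (the full grids are given in the text), and the column normalization of X is omitted for brevity.

## Minimal sketch of the simulation design of Section 3 (illustrative kappa and s).
set.seed(1)
n <- 200; p <- 500; kappa <- 0.25; s <- 5

Sigma <- (1 - kappa) * diag(p) + kappa                 # Sigma = (1 - kappa) I + kappa J
X     <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)    # rows x_i ~ N(0, Sigma)

beta_star    <- numeric(p)
S            <- sample(p, s)                           # support chosen uniformly at random
beta_star[S] <- sample(c(-1, 1), s, replace = TRUE)    # active coefficients set to +/-1

prob <- as.vector(1 / (1 + exp(-X %*% beta_star)))     # success probabilities from model (1)
y    <- rbinom(n, 1, prob)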
Since our goal is support recovery, we compare the methods in terms of Hamming distance, which is the sum of the number of false positives and false negatives. Figure 1 contains the results for κ ∈ {0.25, 0.5} and the three smaller sparsity levels s. In each plot of Figure 1, we use ebic, ebic05, and ebic025 to denote the ebic method with θ = 1, 0.5, and 0.25, respectively. First, bic and ebic consistently outperform cv and aic. This is no surprise, given that bic and ebic are specifically designed for feature selection. Second, our testing-based scheme rivals bic and ebic across all settings.

bic, ebic, and aic require one complete pass of the tuning parameter path. 10-fold cv requires one complete pass of 10 tuning parameter paths and thus requires about 10 times more computational power (or parallelization). The testing-based scheme is the most efficient approach: it requires at most one complete pass of the tuning parameter path, and typically even less, because it stops as soon as the tuning parameter is selected. For illustration, Figure 2 summarizes the run times for six settings with κ = 0.5; additional results with larger correlations and more dense signals are provided in the Appendix.

Figure 1: Variable selection errors of ℓ1-regularized logistic regression with seven different calibration schemes for the tuning parameter. The 12 simulation settings differ in the number of variables p, correlation κ, and sparsity level s; panels: (a) p = 200, κ = 0.25; (b) p = 200, κ = 0.5; (c) p = 500, κ = 0.25; (d) p = 500, κ = 0.5.

Figure 2: Run times of ℓ1-regularized logistic regression with seven different calibration schemes for the tuning parameter. Depicted are the results for p ∈ {200, 500} and κ = 0.5; panels: (a) p = 200; (b) p = 500.
4. Real data analysis
In this section, we apply the proposed scheme to biological data. We consider three data sets:

a) Gene expression data from a leukemia microarray study [28]. The data comprise n = 72 patients: 25 patients with acute myeloid leukemia and 47 patients with acute lymphoblastic leukemia. The predictors are the expression levels of p = 7129 genes. The data are summarized in the R package golubEsets. The goal is to select the genes whose expression levels discriminate between the two types of leukemia.

b) The above data with the additional preprocessing and filtering described in [29]. This reduces the number of genes to p = 3571. The data are summarized in the R package cancerclass.

c) Proteomics data from a melanoma study [30]. The data comprise n = 205 patients: 101 patients with stage I (moderately severe) melanoma and 104 patients with stage IV (very severe) melanoma. The raw data contain the intensities of 18,856 mass-charge (m/z) values measured in the patients' serum samples. We apply the preprocessing described in [31], which results in p = 500 m/z values, and we subsequently normalize the data. The goal is to select the m/z values whose intensities discriminate between the two melanoma stages.

The objective of our method is feature selection. However, since there are no ground truths available for the above applications, we cannot measure feature selection accuracy directly. Instead, we need to infer the method's performance from the number of selected predictors and the prediction accuracy. We generally seek methods that yield a model with a small number of predictors (easy to interpret) and small prediction errors (good fit of the data). Moreover, an increase in prediction accuracy through refitting indicates well-estimated supports, while a deterioration in prediction accuracy through refitting indicates false negatives or false positives. We thus report the model sizes and the prediction errors of Leave-One-Out Cross-Validation without (LOOCV) and with refitting (LOOCV-refit). Typically, no method is simultaneously dominating in all measures, so that one needs to weight the two aspects according to the objective. For example, the model size is sometimes considered secondary when the goal is prediction, but it is a crucial factor for support recovery.

We apply the four different methods as described in the previous section. The results are summarized in Table 1. We observe that the methods form two clusters: on the one hand, cv and aic provide the most accurate predictions; on the other hand, bic and the testing-based approach select considerably smaller models. This is in line with cv and aic being designed for prediction, and bic and testing being designed for feature selection. For an "in-cluster" comparison, we focus on the testing-based method and bic. In the first two data examples, the testing-based method dominates bic, because it provides more accurate prediction with smaller models. In the third example, bic is more accurate in prediction, but the testing-based approach provides reasonable prediction (compare especially with cv and aic after refitting) with only one variable.

Table 1: Means and standard deviations of the model sizes and of the misclassification rates for Leave-One-Out Cross-Validation without (LOOCV) and with refitting (LOOCV-refit).

Method     Model size     LOOCV           LOOCV-refit
a) Gene expression data with p = 7129 genes
Testing    4.35 (1.36)    0.153 (0.362)   0.111 (0.316)
bic
cv
aic
b) Gene expression data with p = 3571 genes
Testing    4.42 (1.39)    0.167 (0.375)   0.125 (0.333)
bic
cv
aic
c) Proteomics data with p = 500 m/z values
Testing    1.00 (0.00)    0.205 (0.405)   0.205 (0.405)
bic
cv
aic
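The evaluation protocol behind Table 1 can be sketched as follows; the helper fit_and_select is a placeholder for whichever calibration scheme is being assessed (for instance the testing-based rule sketched in Section 2), and the refit without an intercept is an assumption made here to match model (1). This is an illustrative outline, not the authors' evaluation code.

## Minimal sketch of LOOCV with and without refitting; fit_and_select is assumed to
## return list(beta = <coefficient vector>, support = <selected indices>).
loocv_errors <- function(X, y, fit_and_select) {
  n <- nrow(X)
  err <- err_refit <- logical(n)
  for (i in seq_len(n)) {
    sel <- fit_and_select(X[-i, , drop = FALSE], y[-i])
    ## LOOCV: predict the left-out sample with the penalized estimate itself
    p_hat  <- 1 / (1 + exp(-sum(X[i, ] * sel$beta)))
    err[i] <- (p_hat > 0.5) != (y[i] == 1)
    ## LOOCV-refit: unpenalized logistic refit on the selected variables only
    if (length(sel$support) > 0) {
      refit <- glm(y[-i] ~ X[-i, sel$support, drop = FALSE] - 1, family = binomial)
      p_ref <- 1 / (1 + exp(-sum(X[i, sel$support] * coef(refit))))
    } else {
      p_ref <- mean(y[-i])          # empty model: predict the base rate
    }
    err_refit[i] <- (p_ref > 0.5) != (y[i] == 1)
  }
  c(LOOCV = mean(err), LOOCV_refit = mean(err_refit))
}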
5. Discussion
We have introduced a scheme for the calibration of ℓ1-penalized likelihood for feature selection in logistic regression. A distinctive feature of the approach is its theoretical guarantees. Indeed, the new method satisfies optimal finite sample bounds, while for existing methods, the available theory is limited to asymptotic results, or there is no theory at all. Given that in applications, sample sizes are always finite, only finite sample theory can provide concrete guidance for practitioners. In addition to the theory, the scheme is easy to implement, computationally efficient, and competitive in simulations and real data applications.

Besides being of direct theoretical and practical value for logistic regression, our contribution also provides new insights into the testing ideas that have been developed in [24]. In particular, we think that its successful use in non-linear regression demonstrates the general potential of testing for tuning parameter calibration and is expected to spark further studies of testing ideas in other modeling frameworks.

A topic for further research is the design assumptions. The focus of this paper is feature selection, where strict assumptions on the correlations in X cannot be avoided. However, it would be interesting to extend our approach to tasks that are less sensitive to correlations, such as ℓ2-estimation and prediction.

Another question is whether ℓ1-penalization is the right starting point. In this paper, we introduce an approach to calibrate, in a sense optimally, the ℓ1-penalized likelihood, accepting all benefits just as well as all limitations of ℓ1-regularization. It could well be that some theoretical limitations, such as the strict conditions on the design, could be alleviated by starting with a different family of estimators, or different families could improve computational speed or practical performance. For example, one could consider estimators with non-convex regularizers, such as SCAD [32] and MCP [33], or adaptive lasso-type regularizers [34], all of which have been shown to reduce bias if (arguably very stringent) conditions on the design matrix are met. Comparing among different types of estimators is beyond the scope of our paper, but we stress that our scheme principally applies to any family of estimators: all it needs is a suitable oracle inequality. So while this paper only discusses the calibration of ℓ1-regularized likelihood, extensions to other estimators (that might suit a given application much better) appear to be within close reach.

This question is closely related to the choice of C. We found that C = 6 leads to excellent empirical results across a wide range of settings, while our theoretical justification for this choice requires strong assumptions on the design and the parameter vector. It would be of interest to broaden the scope of the theoretical justification as well as to theorize the choice of C for the application of our calibration method to estimators beyond the ℓ1-regularized likelihood.

Acknowledgements
We thank the editor, associate editor, and anonymous referees for their comments and suggestions, which helped us considerably in revising the paper. Wei Li acknowledges support from the China Scholarship Council. We thank Haiyun Jin for his valuable contributions to the implementation of our method, and we thank Michaël Chichignoud and Xiao-Hua Zhou for the inspiring discussions and the insightful remarks.
References

[1] P. Bühlmann, S. van de Geer, Statistics for High-dimensional Data: Methods, Theory and Applications, Springer, 2011.
[2] T. Hastie, R. Tibshirani, M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations, CRC Press, 2015.
[3] F. Bunea, Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1 + ℓ2 penalization, Electronic Journal of Statistics 2 (2008) 1153–1194.
[4] P. Ravikumar, M. Wainwright, J. Lafferty, High-dimensional Ising model selection using ℓ1-regularized logistic regression, Annals of Statistics 38 (2010) 1287–1319.

Appendix A.
In this appendix, we provide proofs for our theoretical claims, and we presentadditional simulation results.
Appendix A.1. Proofs for the theoretical claims
We provide here proofs for Theorems 1 and 2. Throughout this section, we write ρ(u, v) = exp(u⊤v)/(1 + exp(u⊤v)) for vectors u, v of the same length. For ease of notation, we suppress the subscript λ at most instances.

The key quantities in the proofs are two vectors α̂ = (α̂_S⊤, α̂_{S^c}⊤)⊤ ∈ R^p and ν̂ = (ν̂_S⊤, ν̂_{S^c}⊤)⊤ ∈ R^p constructed as follows:

1. define the primal subvector α̂_S ∈ R^s such that
   α̂_S ∈ argmin_{θ ∈ R^s} { (1/n) Σ_{i=1}^n [log(1 + exp(x_{i,S}⊤θ)) − y_i x_{i,S}⊤θ] + λ‖θ‖_1 };
2. set α̂_{S^c} = 0 ∈ R^{p−s};
3. define the dual vector ν̂ ∈ R^p via its elements
   ν̂_j = Σ_{i=1}^n X_ij (y_i − ρ(x_i, α̂)) / (nλ)   (j ∈ [p]).

The proofs of the theorems are based on three auxiliary lemmas. Figure A.3 depicts the dependencies.
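The construction of (α̂, ν̂) can also be carried out numerically, which may help to fix ideas before the lemmas; the following R sketch is only an illustration under the assumption that X, y, beta_star, and a value of lambda are available and that |S| ≥ 2 (it is not part of the proofs).

## Minimal sketch of the primal-dual witness construction (steps 1-3 above).
library(glmnet)
S <- which(beta_star != 0)

## Step 1: primal subvector alpha_S from the l1-penalized fit restricted to X_S
lam_seq <- lambda * c(10, 5, 2, 1)                 # decreasing path ending at lambda
fit_S   <- glmnet(X[, S, drop = FALSE], y, family = "binomial",
                  lambda = lam_seq, intercept = FALSE, standardize = FALSE)
alpha_S <- as.numeric(coef(fit_S, s = lambda))[-1]

## Step 2: pad with zeros on the complement of S
alpha_hat    <- numeric(ncol(X))
alpha_hat[S] <- alpha_S

## Step 3: dual vector nu_hat_j = sum_i X_ij (y_i - rho(x_i, alpha_hat)) / (n * lambda)
rho_hat <- as.vector(1 / (1 + exp(-X %*% alpha_hat)))
nu_hat  <- as.vector(crossprod(X, y - rho_hat)) / (nrow(X) * lambda)

## Strict dual feasibility off the support is what the proofs verify below:
max(abs(nu_hat[-S]))   # should be < 1 on the event T_lambda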
Lemma 1 (ℓ2-bound for the primal subvector). If λ ≤ c_min²/(10 s c_max) and ‖X⊤ε/n‖_∞ ≤ λ/4, then ‖α̂_S − β*_S‖_2 ≤ 5λ√s/c_min.
The proof of this lemma follows along the same lines as the proof of Lemma 3in [4].Lemma 1 implies that the primal subvector ˆ α S is (cid:96) -consistent, which enablesus to develop a Taylor series expansion of ρ ( x i,S , ˆ α S ) at β ∗ S according to ρ ( x i,S , ˆ α S ) − ρ ( x i,S , β ∗ S ) = w ( x i,S , β ∗ S ) x (cid:62) i,S (ˆ α S − β ∗ S ) + r i (A.1)with remainder term r i = (ˆ α S − β ∗ S ) (cid:62) (cid:90) ( ∇ ρ ( x i,S , ˆ α S + t (ˆ α S − β ∗ S )))(1 − t )d t (ˆ α S − β ∗ S ) . The derivative reads explicitly ∇ ρ ( x i,S , ˆ α S + t (ˆ α S − β ∗ S )) = ξ i ( t ) x i,S x (cid:62) i,S , (A.2)where ξ i ( t ) = exp( η i ( t ))(1 − exp( η i ( t ))) / (1 + exp( η i ( t ))) and η i ( t ) = x (cid:62) i,S (ˆ α S + t (ˆ α S − β ∗ S )) for t ∈ [0 ,
1] and i ∈ [ n ]. Summarizing r , . . . , r n from Equation (A.1)in the vector r = ( r , . . . , r n ) (cid:62) , we can now state the following result. Lemma 2 ( (cid:96) ∞ -bound for the remainder term) . If λ ≤ γc / (100(2 − γ ) sc max ) and (cid:107) X (cid:62) ε/n (cid:107) ∞ ≤ λ/ , then (cid:107) X (cid:62) r/n (cid:107) ∞ ≤ λγ − γ ) . Proof.
The proof follows readily from Lemma 1. To see this, note that because |X_ij| ≤ 1 for all i ∈ [n] and j ∈ [p], it holds that

|r⊤x_j/n| = |Σ_{i=1}^n X_ij r_i/n| ≤ Σ_{i=1}^n |X_ij||r_i|/n ≤ Σ_{i=1}^n |r_i|/n

for all j ∈ [p]. By the closed form of r_i in Equation (A.2), it holds that |ξ_i(t)| ≤ 1 for all t ∈ [0, 1]. Since α̂_{S^c} = β*_{S^c} = 0, we then get

|r⊤x_j/n| ≤ (α̂_S − β*_S)⊤(Σ_{i=1}^n x_{i,S} x_{i,S}⊤/n)(α̂_S − β*_S)
          = (α̂_S − β*_S)⊤(X_S⊤X_S/n)(α̂_S − β*_S)
          = (α̂ − β*)⊤(X⊤X/n)(α̂ − β*)
          ≤ c_max ‖α̂_S − β*_S‖_2².   (A.3)

Moreover, because λ ≤ γc_min²/(100(2 − γ)s c_max) ≤ c_min²/(10 s c_max) and ‖X⊤ε/n‖_∞ ≤ λ/4, the assumptions of Lemma 1 are satisfied. Combining this lemma with Equation (A.3) yields

|r⊤x_j/n| ≤ 25λ²s c_max/c_min² ≤ λγ/(4(2 − γ))

for each j ∈ [p]. Thus, ‖X⊤r/n‖_∞ ≤ λγ/(4(2 − γ)) as desired.

Lemma 3 (Primal-dual witness construction). The pair (α̂, ν̂) defined above satisfies the following three properties:
(i) it holds that ν̂_S ∈ ∂‖α̂_S‖_1;
(ii) if ‖ν̂_{S^c}‖_∞ < 1, then any solution β̂ to the problem (2) satisfies supp(β̂) ⊂ S;
(iii) under Assumption 1 and ‖ν̂_{S^c}‖_∞ < 1, the solution β̂ is unique, and β̂ = α̂ = (α̂_S⊤, α̂_{S^c}⊤)⊤ ∈ R^p.

Proof. We conduct the proof in three steps in correspondence with the three claims.
Step 1:
We show that if ‖ν̂_{S^c}‖_∞ ≤ 1, the pair (α̂, ν̂) satisfies the KKT conditions, that is, ν̂ ∈ ∂‖α̂‖_1 and

−Σ_{i=1}^n X_ij(y_i − ρ(x_i, α̂))/n + λν̂_j = 0

for j ∈ [p]. By 1. in the construction at the beginning of this section, there is a κ ∈ ∂‖α̂_S‖_1 ⊂ R^s such that

−Σ_{i=1}^n X_ij(y_i − ρ(x_{i,S}, α̂_S))/n + λκ_j = 0

for j ∈ S. Hence, with α̂_{S^c} = 0 in 2. and the definition of ν̂_j in 3.,

ν̂_j = Σ_{i=1}^n X_ij(y_i − ρ(x_{i,S}, α̂_S))/(nλ) = κ_j

for j ∈ S, that is, ν̂_S ∈ ∂‖α̂_S‖_1 as desired.

Step 2: We now show that supp(β̂) ⊂ S for all β̂ ∈ R^p that satisfy

β̂ ∈ argmin_{β ∈ R^p} { L(β) + λ‖β‖_1 }.

In view of the condition ‖ν̂_{S^c}‖_∞ < 1, the pair (α̂, ν̂) satisfies the KKT conditions for the above problem and thus, α̂ is a minimizer of the objective function. Consequently,

L(β̂) + λ‖β̂‖_1 = L(α̂) + λ‖α̂‖_1.

Since ν̂_S ∈ ∂‖α̂_S‖_1 by Step 1, it holds that ‖α̂‖_1 = ⟨ν̂, α̂⟩. Plugging this into the previous display yields

L(β̂) + λ‖β̂‖_1 = L(α̂) + λ⟨ν̂, α̂⟩.

We can now subtract λ⟨ν̂, β̂⟩ on both sides to obtain

L(β̂) + λ‖β̂‖_1 − λ⟨ν̂, β̂⟩ = L(α̂) + λ⟨ν̂, α̂ − β̂⟩.

By 2. and 3. in the above construction, it holds that λν̂ = −L'(α̂), where L'(·) denotes the derivative of L(·). Thus, we can further deduce

λ‖β̂‖_1 − λ⟨ν̂, β̂⟩ = L(α̂) − ⟨L'(α̂), α̂ − β̂⟩ − L(β̂).

Because the Hessian of L(β) is positive semi-definite, L(·) is a convex function. It holds that

L(β̂) ≥ L(α̂) + ⟨L'(α̂), β̂ − α̂⟩.

Combining the two displays yields λ‖β̂‖_1 − λ⟨ν̂, β̂⟩ ≤ 0, and dividing by the tuning parameter yields further ‖β̂‖_1 ≤ ⟨ν̂, β̂⟩. However, by Hölder's inequality and ‖ν̂‖_∞ ≤ 1, it holds that ‖β̂‖_1 ≥ ⟨ν̂, β̂⟩. Consequently, ‖β̂‖_1 = ⟨ν̂, β̂⟩. In view of the condition ‖ν̂_{S^c}‖_∞ < 1, this can only be true if β̂_j = 0 for all j ∈ S^c. This completes the proof of Step 2.

Step 3: We now show that β̂ = α̂. From Step 2, we deduce that β̂ = (β̂_S⊤, 0⊤)⊤ with

β̂_S ∈ argmin_{θ ∈ R^s} { (1/n) Σ_{i=1}^n [log(1 + exp(x_{i,S}⊤θ)) − y_i x_{i,S}⊤θ] + λ‖θ‖_1 }.

Moreover, since the minimal eigenvalue of X_S⊤WX_S/n is larger than zero by Assumption 1, this problem has a unique solution. Combining this with 1. in the construction at the beginning of this section yields β̂_S = α̂_S, that is, β̂ = α̂.

Proof of Theorem 1.
We conduct the proof in two steps. The first step is to show that supp(β̂) ⊂ S and that β̂ is the unique solution of the problem (2). The second step is to show the ℓ∞-bound and the result on support recovery.

Step 1: We first show that supp(β̂) ⊂ S and that β̂ is the unique solution. This result holds true if the primal-dual pair (α̂, ν̂) ∈ R^p × R^p constructed as in Lemma 3 satisfies ‖ν̂_{S^c}‖_∞ < 1. To show the latter inequality, we use the definition of ε and Equation (A.1) to rewrite 3. in the construction above as

Σ_{i=1}^n X_ij w(x_{i,S}, β*_S) x_{i,S}⊤(α̂_S − β*_S)/n − Σ_{i=1}^n X_ij(ε_i − r_i)/n + λν̂_j = 0   (j ∈ [p]).

Because w(x_{i,S}, β*_S) = w(x_i, β*) for each i ∈ [n], we can put the above display in the matrix form

(X⊤WX/n) [ β*_S − α̂_S ; 0 ] + X⊤(ε − r)/n − λ [ ν̂_S ; ν̂_{S^c} ] = 0,

and then in the block matrix form

n^{-1} [ X_S⊤WX_S  X_S⊤WX_{S^c} ; X_{S^c}⊤WX_S  X_{S^c}⊤WX_{S^c} ] [ β*_S − α̂_S ; 0 ] + n^{-1} [ X_S⊤(ε − r) ; X_{S^c}⊤(ε − r) ] − λ [ ν̂_S ; ν̂_{S^c} ] = 0.

We now solve this equation for λν̂_{S^c} and find

λν̂_{S^c} = X_{S^c}⊤WX_S(β*_S − α̂_S)/n + X_{S^c}⊤(ε − r)/n.

Since the matrix X_S⊤WX_S is invertible by Assumption 1, we can solve the block matrix equation also for (β*_S − α̂_S)/n and find

(β*_S − α̂_S)/n = −(X_S⊤WX_S)^{-1}X_S⊤(ε − r)/n + λ(X_S⊤WX_S)^{-1}ν̂_S.   (A.4)

Combining the two displays yields

λν̂_{S^c} = −X_{S^c}⊤WX_S(X_S⊤WX_S)^{-1}X_S⊤(ε − r)/n + X_{S^c}⊤(ε − r)/n + λX_{S^c}⊤WX_S(X_S⊤WX_S)^{-1}ν̂_S.   (A.5)

Taking ℓ∞-norms on both sides of Equation (A.5) and using the triangle inequality, we find

‖λν̂_{S^c}‖_∞ ≤ ‖X_{S^c}⊤WX_S(X_S⊤WX_S)^{-1}X_S⊤(ε − r)/n‖_∞ + ‖X_{S^c}⊤(ε − r)/n‖_∞ + λ‖X_{S^c}⊤WX_S(X_S⊤WX_S)^{-1}ν̂_S‖_∞.

Invoking properties of the induced matrix norms and the ℓ∞-norm and the condition ‖ν̂_S‖_∞ ≤ 1, we obtain

‖λν̂_{S^c}‖_∞ ≤ |||X_{S^c}⊤WX_S(X_S⊤WX_S)^{-1}|||_∞ (‖X⊤(ε − r)/n‖_∞ + λ) + ‖X⊤(ε − r)/n‖_∞.

Next, we divide by λ on both sides, apply Assumption 2, use the triangle inequality, and rearrange the terms again to find

‖ν̂_{S^c}‖_∞ ≤ (1 − γ) + ((2 − γ)/λ)(‖X⊤ε/n‖_∞ + ‖X⊤r/n‖_∞).

By the definition of T_λ, it holds that ‖X⊤ε/n‖_∞ ≤ λγ/(4(2 − γ)), which is equivalent to ((2 − γ)/λ)‖X⊤ε/n‖_∞ ≤ γ/4. Since γ ∈ (0, 1], it also holds that ‖X⊤ε/n‖_∞ ≤ λ/4 on T_λ. Combining this with the assumption λ ≤ γc_min²/(100(2 − γ)s c_max) implies that ‖X⊤r/n‖_∞ ≤ λγ/(4(2 − γ)), see Lemma 2. Thus,

‖ν̂_{S^c}‖_∞ ≤ (1 − γ) + γ/4 + ((2 − γ)/λ)‖X⊤r/n‖_∞ ≤ (1 − γ) + γ/4 + γ/4 < 1.

We finally invoke Lemma 3 to conclude that supp(β̂) ⊂ S and that β̂ = α̂ = (α̂_S⊤, 0⊤)⊤ ∈ R^p is the unique solution of the problem (2), as desired.

Step 2: To show the ℓ∞-bound, we use Equation (A.4) and β̂_S = α̂_S from Step 1 and find

β*_S − β̂_S = −(X_S⊤WX_S)^{-1}X_S⊤(ε − r) + λn(X_S⊤WX_S)^{-1}ν̂_S
           = −(X_S⊤WX_S/n)^{-1}(X_S⊤(ε − r)/n) + λ(X_S⊤WX_S/n)^{-1}ν̂_S.

We then find, similarly as before,

‖β*_S − β̂_S‖_∞ ≤ |||(X_S⊤WX_S/n)^{-1}|||_∞ (λ + ‖X⊤(ε − r)/n‖_∞).

By the definition of a, we have |||(X_S⊤WX_S/n)^{-1}|||_∞ ≤ a|||(X_S⊤WX_S/n)^{-1}|||_2 ≤ a/c_min. Combining this with the bounds on ‖X⊤ε/n‖_∞ and ‖X⊤r/n‖_∞ deduced in Step 1 yields

‖β*_S − β̂_S‖_∞ ≤ (a/c_min)(λ + λγ/(4(2 − γ)) + λγ/(4(2 − γ))) ≤ 1.5aλ/c_min.

Since supp(β̂) ⊂ S by Step 1, the above display implies that ‖β* − β̂‖_∞ ≤ 1.5aλ/c_min. Consequently, supp(β̂) = S as long as min_{j∈S}|β*_j| > 1.5aλ/c_min. This concludes the proof. □

Proof of Theorem 2.
The proof is conducted in three steps. The first step is to show the bound on λ̂, the second step is to show the bound on the sup-norm error, and the last step is to show that Ŝ ⊃ S. To begin with, we define the event

T*_δ = { (4(2 − γ)/(nγ)) ‖X⊤ε‖_∞ ≤ λ*_δ }.

By our definition of the oracle tuning parameter in (3), we have that Pr(T*_δ) ≥ 1 − δ. Thus, it suffices to show that the results hold conditioned on the event T*_δ.

Step 1:
To show that λ̂ ≤ λ*_δ, we proceed by contradiction. If λ̂ > λ*_δ, then the definition of our testing-based calibration implies that there must exist two tuning parameters λ', λ'' ≥ λ*_δ such that

‖β̂_λ' − β̂_λ''‖_∞ > C(λ' + λ'').   (A.6)

However, because both T_λ' and T_λ'' include T*_δ, and because C ≥ 1.5a/c_min, Theorem 1 implies that ‖β̂_λ' − β*‖_∞ ≤ Cλ' and ‖β̂_λ'' − β*‖_∞ ≤ Cλ''. By applying the triangle inequality, we have

‖β̂_λ' − β̂_λ''‖_∞ ≤ ‖β̂_λ' − β*‖_∞ + ‖β̂_λ'' − β*‖_∞ ≤ C(λ' + λ'').

This upper bound contradicts our earlier conclusion (A.6) and, therefore, yields the desired bound on the tuning parameter.
Step 2:
On the event T*_δ, we have λ̂ ≤ λ*_δ, and so the testing-based method implies that

‖β̂_λ̂ − β̂_{λ*_δ}‖_∞ ≤ C(λ̂ + λ*_δ) ≤ 2Cλ*_δ.

By applying the triangle inequality, we find that

‖β̂_λ̂ − β*‖_∞ ≤ ‖β̂_λ̂ − β̂_{λ*_δ}‖_∞ + ‖β̂_{λ*_δ} − β*‖_∞ ≤ 2Cλ*_δ + ‖β̂_{λ*_δ} − β*‖_∞.

Theorem 1 implies that ‖β̂_{λ*_δ} − β*‖_∞ ≤ 1.5aλ*_δ/c_min ≤ Cλ*_δ, and combining the pieces yields the desired sup-norm bound.

Step 3:
Let us finally show that Ŝ ⊃ S. Suppose j ∈ S; then, by the bound on the sup-norm error deduced in Step 2, we have |(β̂_λ̂)_j| ≥ |β*_j| − 3Cλ*_δ. In view of the condition min_{j∈S}|β*_j| > 4Cλ*_δ, the bound λ̂ ≤ λ*_δ from Step 1, and the definition Ŝ = {j ∈ [p] : |(β̂_λ̂)_j| ≥ Cλ̂}, we conclude that j ∈ Ŝ, that is, Ŝ ⊃ S. This completes the proof. □
In this section, we first present additional simulation results for the settings described in Section 3. Figure A.4 shows the Hamming distances for κ ∈ {0, 0.75} and the three smaller sparsity levels s, and Figure A.5 shows the corresponding run times for κ = 0.75. The results for the more dense signals (the two larger sparsity levels s), κ ∈ {0, 0.25, 0.5, 0.75}, and p ∈ {200, 500} are summarized in Figures A.6 and A.7.

Next, we expand our simulation studies by varying the covariance structures for the covariates and manipulating the values of the β*'s. As pointed out in Section 3, we generate the row vectors x_i of X independently from a Gaussian distribution with mean zero and covariance Σ. Here, we consider Σ = (σ_ij) with entries σ_ij that decay geometrically in |i − j| (an autoregressive-type correlation structure). We then generate each component of β* in the support set S from a normal distribution with mean µ_β*, where µ_β* is used to control the signal level, and we consider two values of µ_β*. The remaining parameters are n = 200, p ∈ {200, 500}, and the five sparsity levels s of Section 3. All the results are averaged over 200 replications. Figures A.8 and A.9 show the Hamming distances and run times for the three smaller sparsity levels, respectively. Figures A.10 and A.11 show the corresponding results for the two larger sparsity levels.

We finally test the proposed scheme in ultra-high dimensional settings. We consider p = 3000 and leave all other parameters as in Section 3, that is, n = 200, κ ∈ {0, 0.25, 0.5, 0.75}, and the five sparsity levels s. Figure A.12 shows the Hamming distances for the three smaller sparsity levels, and Figure A.13 shows the corresponding run times. Figures A.14 and A.15 show the same for the more dense cases.

Figures A.4–A.15 indicate that our testing-based method rivals or outmatches all alternatives across a very wide range of settings, both in computational time and in statistical accuracy. Hence, beyond its theoretical guarantees (which none of the alternatives has), the proposed scheme is also a competitor in practice.
Figure A.4: Variable selection errors of ℓ1-regularized logistic regression with seven different tuning parameter calibration schemes for the settings described in Section 3; the 12 settings differ in p ∈ {200, 500}, κ ∈ {0, 0.75}, and the three smaller sparsity levels s.
Figure A.5: Run times for p ∈ {200, 500}, κ = 0.75, and the three smaller sparsity levels s.
Figure A.6: Variable selection errors for the 24 settings with p ∈ {200, 500}, κ ∈ {0, 0.25, 0.5, 0.75}, and the two larger sparsity levels s.
Figure A.7: Run times for p ∈ {200, 500}, κ = 0.75, and the two larger sparsity levels s.
Figure A.8: Variable selection errors for the settings described in Appendix A.2; the 12 settings differ in p ∈ {200, 500}, the signal level µ_β*, and the three smaller sparsity levels s.
Figure A.9: Run times (in seconds) for p = 500, the two signal levels µ_β*, and the three smaller sparsity levels s.
Figure A.10: Variable selection errors for the Appendix A.2 settings with p ∈ {200, 500}, the two signal levels µ_β*, and the two larger sparsity levels s.
Figure A.11: Run times (in seconds) for p = 500, the two signal levels µ_β*, and the two larger sparsity levels s.
Figure A.12: Variable selection errors for the ultra-high dimensional settings sharing p = 3000, with κ ∈ {0, 0.25, 0.5, 0.75} and the three smaller sparsity levels s.
Figure A.13: Run times for p = 3000, κ ∈ {0, 0.25, 0.5, 0.75}, and the three smaller sparsity levels s.
Figure A.14: Variable selection errors for p = 3000, κ ∈ {0, 0.25, 0.5, 0.75}, and the two larger sparsity levels s.
Figure A.15: Run times for p = 3000, κ ∈ {0, 0.25, 0.5, 0.75}, and the two larger sparsity levels s.