Splitting strategies for post-selection inference
Daniel G. Rasines and G. Alastair Young

February 4, 2021
Abstract
We consider the problem of providing valid inference for a selected parameter in a sparse regression setting. It is well known that classical regression tools can be unreliable in this context due to the bias generated in the selection step. Many approaches have been proposed in recent years to ensure inferential validity. Here, we consider a simple alternative to data splitting based on randomising the response vector, which allows for higher selection and inferential power than the former and is applicable with an arbitrary selection rule. We provide a theoretical and empirical comparison of both methods and extend the randomisation approach to non-normal settings. Our investigations show that the gain in power can be substantial.
Keywords:
Data splitting; Randomisation; Post-selection inference; Regression; Variable selection.
∗ Corresponding author: [email protected]

Suppose we have data Y ∼ N(µ, σ²I_n), where µ ∈ R^n, σ > 0, and I_n is the n × n identity matrix. We assume that the components of µ are modelled as a function of p covariates, µ_i = g(x_{i1}, . . . , x_{ip}) for some unknown g, and denote by X = (x_{ij}) ∈ R^{n×p} the known, fixed design matrix. In many situations, it is suspected that only a few covariates are truly active, and a preliminary variable-selection step is performed to identify these. Having screened a set of potentially relevant variables, we may want to provide inference for the regression coefficients of the best linear approximation of µ in the selected model, or some other parameter depending on the output of the selection step.

If the same data that was used for selection is also used to provide inference for the selected parameter, standard inferential procedures are unreliable, typically leading to overoptimistic results; see e.g. Hong et al. (2018). Data splitting techniques, whereby a portion of the data is reserved for uncertainty quantification, offer a simple yet effective way to circumvent this problem. Unfortunately, data splitting often leads to procedures with little power, both for identifying the active covariates and for providing inference for the selected parameters. In this paper, we study an alternative to data splitting, motivated by the work of Tian and Taylor (2018), which provides a more efficient way of splitting the sample information, resulting in more powerful procedures, and which is easy to apply with a general variable-selection method. Our main objective here is to provide a theoretical and empirical comparison between the two splitting strategies.

Data splitting is a popular tool in prediction problems, where the hold-out observations are used to assess the accuracy of a predictive model.
When the goal is inference rather than prediction, a frequent criticism of data splitting is that different splits can produce different selected models, and therefore two people analysing the same data may end up answering different questions. While this is a valid concern, we stress that selection based on the full data is not free from some level of arbitrariness, as the selection process always involves subjective decisions, including the choice of the selection rule itself, and in many cases a random input independent of the data; see e.g. Wasserman and Roeder (2009); Meinshausen and Bühlmann (2010); and Candès et al. (2018). Nevertheless, it is important to keep the effect of the random components as low as possible, as failure to do so results in high uncertainty about the relevance of the selected variables.

The splitting technique considered in this paper can be viewed as a variant of data splitting which produces datasets that are more similar to the full sample than those resulting from the latter method, and is therefore potentially less affected by randomness. Furthermore, it entails no extra computational cost with respect to data splitting. The method operates by applying the variable-selection algorithm to a randomised version of the data, and then basing inference on the conditional distribution of the data given its randomised form, thereby avoiding any selection bias. The general idea of basing selection on an artificial perturbation of the data in this context was proposed by Tian and Taylor (2018) as a way of deriving uniformly consistent and powerful inferential procedures, and has become a popular device in the literature. In the original paper, the authors compare the resulting inferential power of randomisation and data splitting in circumstances where it is possible to provide inference conditionally on the selection event, showing the superiority of the former.
Here we compare the methods in circumstances where inference discards all the information of the selection split, and is therefore unaffected by the complexity of the selection rule. This is explained in more detail in §3, but the key point is that the methods discussed here may be applied in circumstances where the conditional sampling distribution of the data given selection is intractable or unknown.

A large number of methods have been proposed in recent years to deal with selection bias. They can be broadly divided into two categories: those which assume that the selection algorithm is of a specific form, and those which provide guarantees for an arbitrary selection rule. An important class of methods in the first group is formed by conditional procedures, as considered above, which are constructed by analysing the conditional distribution of the data given the specified selection event, when this can be found explicitly. This line of work was started by Lockhart et al. (2014) and has subsequently been extended in multiple works such as Lee and Taylor (2014); Loftus and Taylor (2014); Lee et al. (2016); Fithian et al. (2017); Tibshirani et al. (2018); and Panigrahi et al. (2020). The second group of methods includes the POSI approach of Berk et al. (2013) and extensions of it (Bachoc et al., 2017, 2020), which achieve uniformly valid inference by maximising over all possible model selection procedures, and are very conservative as a result, as well as the data splitting approach of Rinaldo et al. (2019), which provides model-free procedures with asymptotically valid guarantees in a random-design setting. Cox (1975) analysed data splitting in a simple inferential problem involving many normal means and found it to be competitive against a natural alternative. Methods based on data splitting have also been considered in more recent works such as Rubin et al. (2006), Wasserman and Roeder (2009), Ignatiadis et al. (2016), and DiCiccio et al. (2020). Fithian et al.
(2017) observed that inference after data splitting based only on the hold-out observations is inadmissible, being always dominated by data-carving rules, which consider the sampling distribution of the full data, conditional on the selection event. Such data carving may, however, be very complicated or intractable to implement in many situations due to the complexity of the conditional distribution.
Suppose that, for a given data vector Y, a variable-selection algorithm selects a subset s ⊆ {1, . . . , p} of the covariates. In general, determining an appropriate inferential objective post-selection is not straightforward. Under a linearity assumption, µ = Xβ for some β = (β_1, . . . , β_p)^T ∈ R^p, a natural possibility is to provide inference for the components of β associated with the selected variables, {β_i : i ∈ s}. However, this is a difficult problem when p > n, as the model is not identifiable. A popular alternative target of inference is the projection parameter, proposed originally by Berk et al. (2013). The projection parameter is the regression parameter of µ projected onto the subspace spanned by the selected columns of X:

β_s(X) = argmin_{z ∈ R^{|s|}} E(‖Y − X(s)z‖²) = {X(s)^T X(s)}^{-1} X(s)^T µ,

where X(s) is the submatrix of X that contains the selected columns and |s| denotes the number of selected covariates. In other words, it is the best linear predictor of µ in the selected model. When the model happens to be linear in the selected covariates, µ = X(s)β_s for some β_s ∈ R^{|s|}, β_s(X) is simply β_s. Otherwise, the interpretation of the projection parameter is less transparent: the j-th component of β_s(X) may be viewed as the average change of the response when the j-th selected covariate increases by one unit, approximated in the selected model. An alternative interpretation is given later in the paper. Another possible target is the full-model projection parameter

β^F(X) = (X^T X)^{-1} X^T µ

(assuming X has full column rank). Some authors refer to this parameter as the full target, and to the previous one as the partial target. In this case, post-selection inference may be provided for the components of β^F(X) associated with the selected variables, {β^F(X)_i : i ∈ s}. Another option, presumably more common in practice, is to proceed under the assumption that µ = X(s)β_s and carry out inference on β_s.
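As a concrete illustration, the projection parameter can be computed directly from its closed form. The sketch below is ours, with an arbitrary nonlinear mean function and a hypothetical selected set; it also checks that the parameter reduces to the true coefficients when µ is exactly linear in the selected columns.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))

# A nonlinear mean: here mu depends on covariates 0 and 1 only.
mu = X[:, 0] + 0.5 * X[:, 1] ** 2

s = [0, 1, 3]                 # a hypothetical selected set
Xs = X[:, s]

# Projection parameter: best linear approximation of mu in the selected
# model, beta_s(X) = {X(s)' X(s)}^{-1} X(s)' mu.
beta_s = np.linalg.solve(Xs.T @ Xs, Xs.T @ mu)

# Sanity check: if mu is exactly linear in the selected columns, the
# projection parameter recovers the coefficients themselves.
mu_lin = Xs @ np.array([1.0, -2.0, 0.5])
beta_lin = np.linalg.solve(Xs.T @ Xs, Xs.T @ mu_lin)
```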
In the random-design case, Rinaldo et al. (2019) develop inferential methods for other choices of the selected parameter which depend on the distribution of the covariates.

Let us denote a generic selected parameter, possibly depending on the design matrix as well as on the selected set, by h_s(X). Adopting the conditional approach (Fithian et al., 2017), we deem an inferential statement about h_s(X) valid if its error guarantees hold under the conditional distribution of the data given the event that h_s(X) was selected. For example, a 1 − α confidence set T for h_s(X) is valid if it satisfies

pr(h_s(X) ∈ T | S = s) ≥ 1 − α,

where S is the random set of selected covariates. For simplicity, we will assume that the choice of the interest parameters depends only on s, so that inference on a given h_s(X) is required if and only if S = s.

For some popular variable-selection algorithms, such as the lasso or stepwise procedures with fixed tuning parameters, the conditioning event {S = s} can be studied analytically (often, it can be written as a union of affine sets; see e.g. Lee and Taylor (2014); Loftus and Taylor (2014); and Lee et al. (2016)). In most cases, however, this event is too complicated to be explored analytically. Furthermore, even when they can be implemented, conditional methods tend to be very conservative (Kivaranovic and Leeb, 2020). In such cases, data splitting offers an analytically simple and computationally light solution to the inference problem.

The most common form of data splitting is simple data splitting. Here, a fraction f = n_1/n, 1 ≤ n_1 < n, is specified, and a set of indices r is chosen uniformly at random from the subsets of {1, . . . , n} of size n_1. Then, the sets of observations (Y_i : i ∈ r) and (Y_i : i ∈ r^c) are respectively used for selection and for inference, where r^c denotes the complement of r.
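A minimal sketch of simple data splitting (our illustration; the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n, f = 10, 0.5
Y = rng.normal(size=n)

n1 = int(f * n)                                 # size of the selection set
r = rng.choice(n, size=n1, replace=False)       # selection indices
r_c = np.setdiff1d(np.arange(n), r)             # inference indices

Y_sel, Y_inf = Y[r], Y[r_c]                     # disjoint halves of the data
```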
Since all the observations are independent, the conditional distribution of the inference set given the output of the selection step is the same as the unconditional one, so classical procedures can be used to provide valid inference for a selected parameter. More elaborate splitting rules can be found in the prediction literature; see e.g. Reitermanová (2010). These rules allocate the samples to the selection and inferential sets according to the observed values of the covariates, usually trying to divide them as evenly as possible, to ensure that the analyst has access to similar regions of the design space in both stages. For example, when the observations come from different groups, stratified data splitting rules allocate the samples so that both the inference and the selection sets contain observations from all the observed groups. Quite generally, then, a data splitting strategy can be formalised as a random variable R, possibly depending on X, taking values in the power set of {1, . . . , n}.

In the current context, data splitting techniques can be viewed as particular cases of a more flexible method. Suppose that W is a random quantity, possibly depending on X, and that in the selection step we only allow ourselves to observe the value of a function U ≡ u(Y, W). Since selection depends on the data only through U, inference based on the conditional distribution of the data given its observed value, Y | U, is free of selection bias, and does not require knowledge of the selection mechanism. Note that, by contrast, a conditional approach would base inference on Y | {S = s}. Abstractly, we may think of the data under this conditional distribution as the portion reserved for inference. Data splitting techniques can be written this way by letting W = R and U be the set of observations allocated to selection.
A natural question to examine, then, is whether there exist data decompositions of this form that are better than data splitting.

Tian and Taylor (2018) proposed randomisation schemes of the form U = Y + γW, where W is n-dimensional artificial noise and γ > 0 is a tuning parameter: small values of γ assign most of the sample information for selection, while large values allocate most of it for inference. One clear advantage of this approach over data splitting is that it gives access to all the observed values of the covariates both at the selection and at the inferential stages, while in data splitting we only have access to a subset of them at each stage. To ensure high inferential power, Tian and Taylor (2018) recommend that the distribution of W has tails at least as heavy as the normal distribution. If the observation variance σ² is known, the choice W ∼ N(0_n, σ²I_n), where 0_n is an n-dimensional vector of zeroes, allows for a remarkably simple analysis. With this choice, U ∼ N(µ, (1 + γ²)σ²I_n) is of the same parametric form as the data. Furthermore, basing inference for the selected parameter on Y | U amounts to basing it on the marginal distribution of V = Y − γ^{-1}W ∼ N(µ, (1 + γ^{-2})σ²I_n). This follows because U and V are independent, as they are uncorrelated and normal, and jointly sufficient. Henceforth, we refer to this split as the (U, V) decomposition. Tian (2020) used such a decomposition to derive an unbiased estimator of the prediction risk in a related context.

If the error variance is unknown but can be estimated with reasonable precision, an approximate decomposition can be achieved by plugging the variance estimate into the variance of W. In section 5 we consider the asymptotic validity of this approach. In the linear case, if p is small relative to n, σ² can be estimated in the classical way. Otherwise we have to resort to high-dimensional alternatives. In our simulation studies we used the estimator implemented in the selectiveInference R package of Tibshirani et al.
(2019), which estimates σ² using the residual sum of squares from a lasso fit, with the penalty parameter tuned by cross-validation. The good performance of this estimator in sparse models was demonstrated in Reid et al. (2016). Many other methods are available; e.g. Fan et al. (2012) and Bayati et al. (2013).

In the low-dimensional linear case, µ = Xβ, the optimality of a design X is commonly measured through summary statistics of the inverse Fisher information, σ²(X^T X)^{-1}. Popular criteria in the literature include the D-optimality criterion, which minimises the determinant of the inverse information, the A-optimality criterion, which minimises the trace, and a few others. When a data split r is used to partition the sample, the inverse information matrices of the selection and inference sets are σ²(X_r^T X_r)^{-1} and σ²(X_{r^c}^T X_{r^c})^{-1}, where X_r and X_{r^c} are the design matrices that contain only the observations in splits r and r^c, respectively. Correspondingly, an additive randomisation scheme U = Y + γW, W ∼ N(0_n, σ²I_n), produces splits of the inverse Fisher information (1 + γ²)σ²(X^T X)^{-1} and (1 + γ^{-2})σ²(X^T X)^{-1}.

Proposition 1, which we state for a generic parametric model, shows that the latter split is more efficient than a class of data splitting strategies when the quality of the inverse Fisher information is measured in a particular way. For this analysis, we let Y_1, . . . , Y_n be independent (not necessarily identically distributed) observations from a generic regular parametric model with p-dimensional parameter β, and write I_i(β) for the Fisher information provided by the i-th observation, and I_Y(β) = Σ_{i=1}^n I_i(β) for the full-data Fisher information. A generic data split r distributes the information between the selection and inferential tasks as

I_Y(β) = Σ_{i ∈ r} I_i(β) + Σ_{i ∈ r^c} I_i(β) ≡ I_r(β) + I_{r^c}(β).
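In the normal linear model these splits are easy to check by simulation. The sketch below is ours, with illustrative sizes and σ assumed known: it verifies that U and V are uncorrelated, that their variances are (1 + γ²)σ² and (1 + γ^{-2})σ², and that the selection stage receives exactly the fraction 1/(1 + γ²) of the Fisher information, deterministically.

```python
import numpy as np

rng = np.random.default_rng(8)
n, sigma = 4, 1.0
f = 0.5
gamma = (1 / f - 1) ** 0.5              # balanced tuning; here gamma = 1
mu = np.array([0.5, -0.5, 1.0, 0.0])

reps = 200_000
Y = mu + sigma * rng.normal(size=(reps, n))
W = sigma * rng.normal(size=(reps, n))  # artificial noise, known sigma
U = Y + gamma * W                       # selection part
V = Y - W / gamma                       # inference part

var_U = U[:, 0].var()                   # approx (1 + gamma^2) sigma^2
var_V = V[:, 0].var()                   # approx (1 + gamma^-2) sigma^2
cov_UV = np.mean((U[:, 0] - mu[0]) * (V[:, 0] - mu[0]))  # approx 0

# Fisher information accounting in the linear model: I_Y = X'X / sigma^2.
# The randomised split hands 1/(1 + gamma^2) = f of it to selection,
# whatever the realised noise.
X = rng.normal(size=(60, 2))
I_full = X.T @ X / sigma**2
I_sel = I_full / (1 + gamma**2)         # equals f * I_full
I_inf = I_full / (1 + gamma**-2)        # equals (1 - f) * I_full
```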
The result concerns the behaviour of I_R(β) and I_{R^c}(β) under repeated sampling from a random data splitting strategy R, such as simple data splitting. In what follows, all expectations are taken with respect to the distribution of R, and, for two positive definite matrices A and B, we use the notation A < B (A ≤ B) to indicate that A − B is negative (semi)definite. Furthermore, we say that a real-valued matrix function ϕ is convex if it satisfies ϕ{tA + (1 − t)B} ≤ tϕ(A) + (1 − t)ϕ(B) for all matrices A and B in its domain and all t ∈ (0, 1), and monotone if A < B implies ϕ(A) < ϕ(B).

Proposition 1.
Let R be any splitting strategy satisfying pr(i ∈ R) = f for some 0 < f < 1 and all i = 1, . . . , n, and let ϕ be convex. Then E{I_R(β)} = f I_Y(β) and E{I_{R^c}(β)} = (1 − f)I_Y(β), so that, by Jensen's inequality, E[ϕ{I_R(β)}] ≥ ϕ{f I_Y(β)} and E[ϕ{I_{R^c}(β)}] ≥ ϕ{(1 − f)I_Y(β)}. In the normal linear model, the right-hand sides are attained exactly, and deterministically, by the (U, V) decomposition with γ = (f^{-1} − 1)^{1/2}.

Proposition 2. Let p be fixed, n → ∞, and consider a sequence of selection sets s_1, s_2, . . . ⊆ {1, . . . , p} such that the sequence of selection probabilities, {pr(S = s_i) : i ≥ 1}, is bounded away from zero. Assume that all the selection events {S = s_i} can be written in the form {X^T U ∈ E}, where U = Y + γW for a fixed γ ∈ (0, ∞) and E is a finite union of convex sets. Furthermore, assume that X^T X is invertible for all n, λ_max{(X^T X)^{-1}} = O(n^{-1}), max(X) = o(n^{1/2}), and that the contrasts η satisfy max(η)‖η‖^{-1} = O(n^{-1/2}). If the observation errors ε_1, . . . , ε_n are independent and identically distributed with mean zero, variance σ², and E(|ε_1|³) < ∞, and W = σ̂Z, where σ̂ is a weakly consistent estimator of σ and Z ∼ N(0_n, I_n) is independent of the data, then

(1 + γ^{-2})^{-1/2} σ^{-1} ‖η‖^{-1} (η^T V − η^T µ) | {S = s_n} →_d N(0, 1), n → ∞,

where V = Y − γ^{-1}W.

The requirement that selection depends on the data only through X^T U allows us to reduce the dimension of the conditioning event from n to p, which stays bounded. It is satisfied by all penalised-likelihood methods, which estimate the active set of covariates by the support of

β̂ = argmin_{β ∈ R^p} ‖U − Xβ‖² + l(β)

for an appropriate loss function l. This follows because the first term can be written, up to an additive constant independent of β, as ‖X(X^T X)^{-1}X^T U − Xβ‖².

Regarding the convexity assumption, suppose that, for a given s, the selection event in terms of the full vector U is E′ ⊆ R^n. Under the previous condition, the selection event in terms of X^T U is E = X^T E′ = {X^T u : u ∈ E′}. Therefore, if E′ = E′_1 ∪ . . . ∪ E′_m, with E′_1, . . . , E′_m convex, E = X^T E′_1 ∪ . . .
∪ X^T E′_m is also a union of convex sets, so the requirement can be imposed on E′ directly. Verifying that E′ is a union of convex sets can still be a difficult problem in general, but it has been shown to be true for some widely used selection rules such as forward stepwise regression, the lasso, and least angle regression, all with fixed tuning parameters (Tibshirani et al., 2018). We note, however, that this requirement is mainly convenient from a mathematical perspective and can probably be weakened.

The asymptotic condition on η is very natural. It ensures that the asymptotic support of η/‖η‖ is unbounded. When inference is sought for a selected coefficient or for a projection parameter, the contrasts are of the form X(s){X(s)^T X(s)}^{-1}e_j for some s ⊆ {1, . . . , p} and 1 ≤ j ≤ |s|, where e_j ∈ R^{|s|} is the j-th vector of the canonical basis. These vectors have

‖η‖² = e_j^T {X(s)^T X(s)}^{-1} e_j ≥ λ_min[{X(s)^T X(s)}^{-1}] = λ_max{X(s)^T X(s)}^{-1} ≥ λ_max(X^T X)^{-1}

by the min-max theorem and the fact that λ_max{X(s)^T X(s)} ≤ λ_max(X^T X). Furthermore, by similar arguments and the Cauchy–Schwarz inequality, it follows that

max(η) = max_{1≤i≤n} e_i^T X(s){X(s)^T X(s)}^{-1} e_j ≤ max_{1≤i≤n} ‖e_i^T X(s)‖ ‖{X(s)^T X(s)}^{-1} e_j‖ ≤ p^{1/2} max(X) [e_j^T {X(s)^T X(s)}^{-2} e_j]^{1/2} ≤ p^{1/2} max(X) λ_max[{X(s)^T X(s)}^{-1}] = max(X) O(n^{-1})

under the assumptions of the theorem. If we further assume that the entries of X are bounded as n grows, max(X) = O(1), putting all together it follows that the asymptotic condition on the contrast is satisfied if λ_max(X^T X) = O(n).

We compared data splitting with the (U, V) decomposition in the context of the normal linear model.
The data generating process was Y = Xβ + ε, ε ∼ N(0_n, I_n), where β ∈ R^p is an unknown sparse vector of coefficients. At each replication of the simulations, the rows of the design matrix were generated as independent samples from the distribution N(0_p, Γ), where Γ ∈ R^{p×p} is a Toeplitz matrix with (i, j) entry ρ^{|i−j|} for some ρ ≥ 0. The error variance was estimated in the classical way when p < n/4, and using the high-dimensional alternative otherwise, as per §3. We compared the splitting strategies in terms of selection power, selection stability, and inferential power in low- and high-dimensional settings. All the simulation results can be found in the Appendix.

The data splitting strategy considered was the DUPLEX (Snee, 1977), both in the even case (f = 1/2) and in the uneven case (f ≠ 1/2); its performance in each case is compared with that of the (U, V) decomposition, as we shall see. Following the discussion of section 4, for a given fraction f, we compared the performance of the data splitting algorithm with splitting fraction f with the performance of the randomised procedure with tuning parameter γ = (f^{-1} − 1)^{1/2}. Specifically, for a given f and a given dataset (Y, X), we considered two procedures. The first bases selection on (Y^{(1)}, X^{(1)}) and inference on (Y^{(2)}, X^{(2)}), where Y^{(1)} = (Y_i : i ∈ r)^T, X^{(1)} = (x_{ij} : i ∈ r, j = 1, . . . , p), Y^{(2)} = (Y_i : i ∈ r^c)^T, and X^{(2)} = (x_{ij} : i ∈ r^c, j = 1, . . . , p). The second one bases selection on (U, X) = (Y + γW, X) and inference on (V, X) = (Y − γ^{-1}W, X), with W = σ̂Z, where Z ∼ N(0_n, I_n) is artificially generated noise independent of the data and σ̂² is the estimate of the model variance.

Regarding selection, we considered two recently proposed algorithms: the fixed-X knockoff algorithm of Barber and Candès (2015) and the stability selector paired with the lasso of Meinshausen and Bühlmann (2010), as implemented in the R packages knockoff (Barber et al., 2020) and stabs (Hofner and Hothorn, 2017).
These are selection rules for which conditional inference is analytically intractable and for which computational approaches are very demanding (Markovic et al., 2019), so they constitute an example where splitting approaches would likely be preferred in practice. These algorithms aim at identifying the set of active coefficients while keeping the number of false discoveries under control. The knockoff provides a guarantee on the false discovery rate, while stability selection controls the expected number of false discoveries. In all the simulations we set the false discovery rate of the knockoff algorithm at 0.3 and, for the stability algorithm, the expected number of false discoveries at 3, with a fixed cutoff threshold. Since the fixed-X knockoff is only applicable in settings with fewer covariates than observations, it was only considered in the lower-dimensional cases.

Selection power

In this simulation, we generated 10 triplets (β, Y, X) independently for each combination of the following parameters: n = 200; ρ = 0, 0.5; f = 1/2, 3/4 (correspondingly, γ = 1, 3^{-1/2}); p = 30, 50 for the knockoff algorithm, and p = 200, 400 for stability selection. Each β was generated by sampling 10 non-zero positions uniformly at random and filling them with independent random variables distributed uniformly in the set {−1, −0.9, −0.8, . . . , −0.1, 0.1, . . . , 0.9, 1}. Then, for each β, a pair (Y, X) was generated, and the corresponding selection algorithm was applied to (Y^{(1)}, X^{(1)}) in the case of data splitting, and to (U, X) in the case of the randomised procedure.

The selection ability of the splitting methods was compared according to two criteria: true positive rate and power. The true positive rate is the average number of correct discoveries divided by the total number of active covariates, and the power is defined, for each possible value of |β_i| = 0.1, . . . , 1, as the average number of times a coefficient with absolute value |β_i| is selected, averaged over all the generated coefficients of all β's.
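The two criteria can be made concrete with a small helper. The selector outputs below are invented for illustration; in the study they would come from the knockoff or stability selection runs.

```python
import numpy as np

def true_positive_rate(selected, active):
    """Average number of correct discoveries divided by the number of
    active covariates."""
    active = set(active)
    hits = [len(active & set(s)) for s in selected]
    return np.mean(hits) / len(active)

def power_by_effect(selected, betas):
    """For each absolute effect size, the fraction of times a coefficient
    of that size was selected, pooled over all replications."""
    counts, totals = {}, {}
    for s, beta in zip(selected, betas):
        for i, b in enumerate(beta):
            if b == 0:
                continue
            key = abs(b)
            totals[key] = totals.get(key, 0) + 1
            counts[key] = counts.get(key, 0) + (i in s)
    return {k: counts[k] / totals[k] for k in totals}

# Toy illustration with two replications of a 4-coefficient problem.
betas = [np.array([1.0, 0.5, 0.0, 0.0]), np.array([1.0, 0.5, 0.0, 0.0])]
selected = [{0, 1}, {0, 3}]          # hypothetical selector output
tpr = true_positive_rate(selected, active=[0, 1])
pw = power_by_effect(selected, betas)
```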
We also computed, for comparison, the results corresponding to applying the selection algorithms to the full dataset, (Y, X). Table 1 shows the observed true positive rates of data splitting and randomisation divided by the observed true positive rates of full-data selection, and Figs. 1 and 2 show the empirical power functions of the three methods. The results clearly favour the randomised split procedure over data splitting, particularly in the cases with larger values of p. Quite remarkably, despite the fact that the choices of f and γ were balanced (in the sense described in section 4), the randomised procedure was substantially more powerful than data splitting.

Selection stability

In addition to enjoying high power, it is important that the selection method does not depend strongly on ancillary components of the analysis: in the case of data splitting, on the choice of the selection and inference sets, and, in the case of randomisation, on the observed value of the artificial noise W. To measure the stability of the selection strategies with respect to these elements, a simulation was conducted in which the selection algorithms were applied to multiple splits of the same dataset. Since the DUPLEX is a deterministic allocation rule, in this section we considered a simple data splitting scheme instead.

We set β = (1, 0.9, . . . , 0.1, 0, . . . , 0)^T, ρ = 0.5, f = 1/2, 3/4, p = 50 for the knockoff, and p = 400 for stability selection. For each (f, p), we generated 100 pairs (Y, X) and, for each pair, we sampled 50 selection sets uniformly at random and 50 realisations of the randomisation noise W, and applied the selection algorithms to each data split and perturbed instance of the data. For each (Y, X), we recorded the average number of times each active covariate was selected across the different splits, an estimate of pr(i ∈ S | Y, X), i = 1, . . . , 10. The empirical means and standard deviations of the 100 estimated averages can be found in Table 2.
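The resampling estimate of pr(i ∈ S | Y, X) can be sketched as follows. Since the knockoff and stability selection live in R packages, the sketch substitutes a simple marginal-correlation screen as a stand-in selector; everything else (repeated splits, repeated noise draws, selection frequencies) follows the design described above, with σ = 1 assumed known.

```python
import numpy as np

def select(y, X, k=10):
    """Stand-in selector: keep the k covariates with the largest absolute
    marginal association with the response (NOT the paper's selectors,
    which are the knockoff filter and stability selection)."""
    scores = np.abs(X.T @ y)
    return set(np.argsort(scores)[-k:])

rng = np.random.default_rng(6)
n, p, f = 200, 50, 0.5
beta = np.concatenate([np.linspace(1, 0.1, 10), np.zeros(p - 10)])
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)                # sigma = 1, assumed known

n1, reps = int(f * n), 50
gamma = (1 / f - 1) ** 0.5
freq_split = np.zeros(p)
freq_rand = np.zeros(p)
for _ in range(reps):
    r = rng.choice(n, size=n1, replace=False)    # a fresh data split
    for i in select(y[r], X[r]):
        freq_split[i] += 1
    U = y + gamma * rng.normal(size=n)           # a fresh randomisation
    for i in select(U, X):
        freq_rand[i] += 1

# freq_*[i] / reps estimates pr(i in S | Y, X) under each splitting scheme;
# comparing the two arrays over the active covariates measures stability.
sel_prob_split = freq_split / reps
sel_prob_rand = freq_rand / reps
```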
Randomisation was more stable than data splitting, because the corresponding selection probabilities of the active covariates were more concentrated around higher values. For example, for the knockoff with f = 1/2, for most values of (Y, X) we had more than a 90% probability of selecting β_1 under repeated sampling from W, as opposed to the ∼77% of data splitting.

Inferential power

In this simulation, we consider the problem of constructing confidence intervals for the selected coefficients {β_i : i ∈ s}. Firstly, we show that a face-value approach, which reports the standard confidence intervals ignoring selection, undercovers coefficients with small effect size |β_i|, while the splitting techniques provide valid intervals. This gives an illustration of the need to take selection into account in the inferential stage. Secondly, we compare the lengths of the intervals derived from the randomised procedure with respect to those obtained by data splitting. To avoid complications related to confidence interval construction in non-identifiable settings, we shall only consider here low-dimensional cases.

Assume that X has full rank and let β̂_i = e_i^T(X^T X)^{-1}X^T Y be the ordinary least squares estimator of β_i. In a classical, non-selective setting, where the inferential goals are determined prior to the data collection, marginal inference for β_i is based on the pivot T_i = (β̂_i − β_i)/σ̂, which follows a scaled t distribution with n − p degrees of freedom, where σ̂² is the classical variance estimator. If, however, inference on a given β_i is only provided for some data samples, T_i is no longer pivotal, and the resulting confidence intervals can be miscalibrated. To exemplify this, we ran the following simulation. We fixed n = 200, p = 30, true parameter β = (1, −1, 0.5, −0.5, 0.2, −0.2, 0, 0, . . . , 0)^T, ρ = 0, 0.5, f = 1/2, 3/4, and for each combination of (ρ, f), we generated 5 × 10³ pairs (Y, X).
We then applied the selection algorithm to each dataset as in the previous sections and constructed classical equal-tailed confidence intervals based on T_i for each selected coefficient, with a nominal coverage of 90%.

Tables 3 and 4 contain the observed coverages of these intervals in the rows indicated by FV (face value), averaged across coefficients with equal absolute effect. Clearly, the face-value intervals are unreliable when |β_i| = 0 or 0.2, with actual coverages significantly lower than the nominal one in most cases. Note that here the face-value approach is being implemented after selection based on a fraction of the sample information, so the selection effect is not as large as if all the data had been used for selection.

The rows labelled with HD contain the coverages of the intervals constructed using only the hold-out split of the data. In the case of data splitting, the intervals were derived from the t-pivots of the hold-out observations, based on β̂_i^{DS} = e_i^T(X^{(2)T} X^{(2)})^{-1} X^{(2)T} Y^{(2)}. With the current simulation parameters, the number of remaining observations is always larger than the number of covariates, so X^{(2)} has full rank almost surely. For the randomised procedure, we used the approximation β̂_i^R = e_i^T(X^T X)^{-1} X^T V, which is approximately N(β_i, (1 + γ^{-2})σ² e_i^T(X^T X)^{-1} e_i). The intervals were thus constructed by studentisation of β̂_i^R. Despite the distributional approximation, we see that the coverages of the resulting confidence intervals were very close to the nominal one.

For the hold-out methods we also recorded the average lengths of the intervals, which can be found in Tables 5 and 6. The intervals provided by the randomised procedure were always shorter, on average, than those provided by the data splitting procedure. In the cases where the amount of information reserved for inference was small (f = 3/4), the hold-out design matrices X^{(2)} were of dimension 50 × 30, and the resulting intervals were very wide as a consequence.
The randomised procedure performed significantly better in these cases, giving intervals roughly 75% shorter than data splitting. The maximum observed standard deviations of the figures shown in the tables were, respectively, 0. , 0. , 0. .

We now consider settings with a number of covariates exceeding the sample size. In these situations the full model is not identifiable, and the methods used in the previous section are not applicable. Instead, we consider the problem of constructing confidence intervals for the coefficients of a projection parameter. As noted by Berk et al. (2013), the coefficients of a projection parameter for a selected set s can be interpreted as a measure of the correlation between the selected covariates and the mean µ, accounting for all the other selected covariates. This follows because β_s(X)_i can be written as the projection of µ onto the residual vector of the i-th selected covariate regressed against the other selected covariates. Indeed, if X(s)_i is the i-th column of X(s) and X(s)_{−i} is X(s) with the i-th column removed, a simple calculation shows that

β_s(X)_i = (1/‖r_i‖²) r_i^T µ, where r_i = {I_n − X(s)_{−i} (X(s)_{−i}^T X(s)_{−i})^{-1} X(s)_{−i}^T} X(s)_i.

Thus, β_s(X)_i provides a measure of the relevance of the i-th covariate relative to the selected set, with a null value indicating that the variable is redundant and can be safely removed from s.

If s had been fixed in advance and we knew the value of σ², inference on the coefficients of β_s(X) would be based on the components of β̂_s(X) = {X(s)^T X(s)}^{-1} X(s)^T Y ∼ N(β_s(X), σ²{X(s)^T X(s)}^{-1}).
So, for a nominal coverage of 1 − α, the confidence intervals would be given by [β̂_s(X)_i ∓ q_{1−α/2} σ [e_i^T{X(s)^T X(s)}^{-1} e_i]^{1/2}], where q_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution. When σ is unknown, we can plug in the high-dimensional estimate used for the (U, V) decomposition, σ̂_HD say, and report [β̂_s(X)_i ∓ q_{1−α/2} σ̂_HD [e_i^T{X(s)^T X(s)}^{-1} e_i]^{1/2}].

When s is data-dependent, hold-out inference can be provided similarly, with the difference that, for data splitting, the inferential target is not the full-projection parameter β_s(X), but the projection parameter based on the hold-out observations. Inference is thus provided for β_s(X^(2)) = {X^(2)(s)^T X^(2)(s)}^{-1} X^(2)(s)^T µ^(2), where µ^(2) = E(Y^(2)), and is based on β̂^DS_s(X^(2)) = {X^(2)(s)^T X^(2)(s)}^{-1} X^(2)(s)^T Y^(2) ∼ N(β_s(X^(2)), σ²{X^(2)(s)^T X^(2)(s)}^{-1}). In the case of randomisation, the target parameter is still β_s(X), and, as before, inference is based on the normal approximation to the distribution of

β̂^R_s(X) = {X(s)^T X(s)}^{-1} X(s)^T V ∼̇ N(β_s(X), (1 + γ^{-2}) σ² {X(s)^T X(s)}^{-1}).

In both cases, σ was approximated by σ̂_HD.

The simulation parameters were set as in the previous section, now with p = 400. Here the knockoff was not considered, as it is not applicable with p > n. The results can be found in Tables 7 and 8. Note that in this case the columns indicate the absolute value of the full-model coefficients β_i, not the coefficients of the projection parameters, which in general depend on s. The coverage results were similar to those in the lower-dimensional case, with the difference that the effect of selection was more pronounced here. In one of the cases considered, the coefficients of the projection parameters associated with the null covariates in the full model were almost guaranteed to be missed by the face-value intervals. The hold-out intervals, on the other hand, remained well-calibrated across the different coefficients.
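To see why the randomised intervals tend to be shorter when the hold-out design is small, one can compare the two standard errors directly: data splitting inverts X^(2)T X^(2) built from only (1 − f)n rows, while randomisation inverts X^T X built from all n rows and pays a (1 + γ^{-2}) variance inflation instead. The sketch below is illustrative only — the dimensions, the choice γ² = (1 − f)/f (so that the randomisation splits the information in the same proportion f as data splitting), and the Gaussian design are assumptions, not the paper's exact simulation settings.

```python
import numpy as np

# Average standard error of one coefficient under the two splitting schemes.
# All settings below are illustrative assumptions.
rng = np.random.default_rng(2)
n, p, f, sigma, reps = 160, 30, 3 / 4, 1.0, 200
gamma = ((1 - f) / f) ** 0.5            # information split matching fraction f

se_ds = se_r = 0.0
for _ in range(reps):
    X = rng.standard_normal((n, p))
    X2 = X[int(f * n):]                 # hold-out block: 40 rows for 30 covariates
    se_ds += sigma * np.sqrt(np.linalg.inv(X2.T @ X2)[0, 0])
    se_r += sigma * np.sqrt((1 + gamma**-2) * np.linalg.inv(X.T @ X)[0, 0])

print(round(se_r / se_ds, 2))  # noticeably below 1: randomised intervals are shorter
```

The gap widens as the hold-out dimension (1 − f)n approaches p, because the inverse of the small Gram matrix X^(2)T X^(2) becomes unstable while the full Gram matrix remains well conditioned.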
Regarding interval lengths, in this case we computed the average of all the intervals produced for a given selection set s, and averaged the results over selection sets of equal size in order to establish a more equitable comparison. We see that the average lengths were very similar in this case. Recall, however, that the selection power of randomisation is substantially higher in high-dimensional settings, so in conjunction randomisation dominates data splitting. The maximum standard deviations of the coverage figures were, respectively, 1. , 0. , 0. and 0.4 for |β_i| = 0, 0.2, 0.5 and 1. The maximum standard deviation of the length figures was 0.04. The entries with a dash indicate that no selection set of the corresponding size was selected in the simulation.

Randomisation offers an alternative to data splitting which divides the sample information more efficiently. We have compared randomisation and data splitting with respect to their selection stability and their selection and inferential power, showing that the dominance of the former can be substantial in settings with a limited amount of information.
As an example, consider a linear model with n = 2p + 2 and splitting fraction f = 1/2.

Appendix: Simulation results

Table 1: True positive rate of the selection algorithms applied after data splitting (DS) and randomisation (R), normalised by the true positive rate of selection applied to the full dataset

Knockoff
 f    ρ    p    DS     R
 1/2  0    30   0.903  0.942
 1/2  0    50   0.738  0.923
 1/2  0.5  30   0.820  0.914
 1/2  0.5  50   0.648  0.890
 3/4  0    30   0.972  0.978
 3/4  0    50   0.935  0.971
 3/4  0.5  30   0.940  0.964
 3/4  0.5  50   0.890  0.970

Stability
 f    ρ    p     DS     R
 1/2  0    200   0.694  0.867
 1/2  0    1000  0.486  0.821
 1/2  0.5  200   0.691  0.858
 1/2  0.5  1000  0.478  0.820
 3/4  0    200   0.895  0.948
 3/4  0    1000  0.826  0.933
 3/4  0.5  200   0.886  0.948
 3/4  0.5  1000  0.826  0.934

Table 2: Means and standard deviations of the estimated selection probabilities for fixed values of (Y, X)

               β_i = 1  0.9     0.8     0.7     0.6     0.5     0.4     0.3     0.2     0.1
Knockoff
 f = 1/2  DS    0.81    0.76    0.77    0.74    0.71    0.67    0.62    0.53    0.38    0.15
               (0.08)  (0.09)  (0.09)  (0.09)  (0.09)  (0.11)  (0.11)  (0.16)  (0.16)  (0.13)
          R     0.98    0.92    0.98    0.95    0.95    0.92    0.89    0.77    0.57    0.25
               (0.06)  (0.18)  (0.06)  (0.10)  (0.11)  (0.13)  (0.13)  (0.23)  (0.26)  (0.22)
 f = 3/4  DS    0.98    0.95    0.97    0.97    0.96    0.88    0.89    0.80    0.62    0.25
               (0.14)  (0.22)  (0.17)  (0.17)  (0.20)  (0.33)  (0.31)  (0.40)  (0.49)  (0.44)
          R     0.99    0.93    0.99    0.97    0.98    0.96    0.96    0.86    0.69    0.34
               (0.04)  (0.19)  (0.04)  (0.10)  (0.08)  (0.10)  (0.11)  (0.22)  (0.29)  (0.32)
Stability
 f = 1/2  DS    1.00    1.00    1.00    1.00    0.99    0.95    0.87    0.58    0.15    0.01
               (0.00)  (0.00)  (0.00)  (0.01)  (0.02)  (0.10)  (0.17)  (0.28)  (0.18)  (0.02)
          R     1.00    1.00    1.00    1.00    1.00    0.98    0.95    0.76    0.30    0.04
               (0.00)  (0.00)  (0.00)  (0.00)  (0.01)  (0.05)  (0.10)  (0.25)  (0.27)  (0.07)
 f = 3/4  DS    1.00    1.00    1.00    1.00    1.00    1.00    0.98    0.88    0.31    0.08
               (0.00)  (0.00)  (0.00)  (0.00)  (0.00)  (0.00)  (0.14)  (0.33)  (0.47)  (0.27)
          R     1.00    1.00    1.00    1.00    1.00    1.00    1.00    0.90    0.44    0.06
               (0.00)  (0.00)  (0.00)  (0.00)  (0.00)  (0.01)  (0.02)  (0.20)  (0.37)  (0.14)

Figure 1: Power of the knockoff applied to the complete dataset (black), a data split (red), and a randomised version of the data (blue). The titles indicate (f, ρ, p).

Figure 2: Power of stability selection applied to the complete dataset (black), a data split (red), and a randomised version of the data (blue). The titles indicate (f, ρ, p).

Table 3: Coverages of confidence intervals for the selected coefficients; knockoff selection

                        |β_i| = 0   0.2    0.5    1
 f = 1/2, ρ = 0    DS FV   68.7   90.1   90.8   89.8
                   R  FV   67.3   89.9   90.6   89.8
                   DS HD   89.8   91.3   90.0   90.0
                   R  HD   89.6   89.9   89.9   90.3
 f = 1/2, ρ = 0.5  DS FV   73.9   85.2   91.0   89.9
                   R  FV   72.0   84.0   91.2   89.7
                   DS HD   90.0   90.8   89.9   90.2
                   R  HD   89.8   89.2   90.0   90.4
 f = 3/4, ρ = 0    DS FV   59.2   91.3   90.5   89.8
                   R  FV   57.4   91.8   90.4   89.8
                   DS HD   90.3   89.7   89.7   89.6
                   R  HD   89.8   90.2   90.0   89.6
 f = 3/4, ρ = 0.5  DS FV   66.4   85.8   91.5   90.0
                   R  FV   65.0   85.2   91.4   89.9
                   DS HD   90.6   90.1   90.4   89.8
                   R  HD   89.2   90.4   89.7   89.6

Table 4: Coverages of confidence intervals for the selected coefficients; stability selection

                        |β_i| = 0   0.2    0.5    1
 f = 1/2, ρ = 0    DS FV   39.1   80.2   91.1   89.8
                   R  FV   34.9   82.6   91.1   89.8
                   DS HD   90.3   90.6   90.0   90.0
                   R  HD   90.1   89.4   89.8   89.7
 f = 1/2, ρ = 0.5  DS FV   61.7   81.6   89.5   90.0
                   R  FV   55.8   80.1   89.5   89.9
                   DS HD   90.6   92.5   89.9   90.1
                   R  HD   90.8   89.6   90.5   89.9
 f = 3/4, ρ = 0    DS FV   21.9   83.7   90.7   89.8
                   R  FV   14.2   85.0   90.6   89.8
                   DS HD   90.8   90.2   89.6   89.6
                   R  HD   88.5   90.4   89.9   90.3
 f = 3/4, ρ = 0.5  DS FV   46.9   79.9   90.3   89.9
                   R  FV   40.0   80.8   90.6   89.9
                   DS HD   88.9   89.7   90.3   89.8
                   R  HD   90.8   91.0   89.6   90.3

Table 5: Average length of confidence intervals for the selected coefficients; knockoff selection

                        |β_i| = 0    0.2     0.5     1
 f = 1/2, ρ = 0    DS HD   0.391   0.390   0.391   0.391
                   R  HD   0.356   0.355   0.358   0.358
 f = 1/2, ρ = 0.5  DS HD   0.508   0.507   0.510   0.482
                   R  HD   0.457   0.455   0.459   0.436
 f = 3/4, ρ = 0    DS HD   0.653   0.651   0.650   0.651
                   R  HD   0.503   0.502   0.506   0.506
 f = 3/4, ρ = 0.5  DS HD   0.899   0.904   0.903   0.846
                   R  HD   0.644   0.645   0.650   0.617

Table 6: Average length of confidence intervals for the selected coefficients; stability selection

                        |β_i| = 0    0.2     0.5     1
 f = 1/2, ρ = 0    DS HD   0.389   0.391   0.391   0.392
                   R  HD   0.360   0.354   0.358   0.358
 f = 1/2, ρ = 0.5  DS HD   0.506   0.506   0.509   0.483
                   R  HD   0.459   0.458   0.457   0.438
 f = 3/4, ρ = 0    DS HD   0.657   0.648   0.651   0.651
                   R  HD   0.507   0.500   0.507   0.507
 f = 3/4, ρ = 0.5  DS HD   0.890   0.904   0.902   0.847
                   R  HD   0.648   0.648   0.646   0.620

Table 7: Coverages of confidence intervals for the coefficients of the projection parameters; stability selection

                        |β_i| = 0   0.2    0.5    1
 f = 1/2, ρ = 0    DS FV   38.6   70.9   91.2   89.4
                   R  FV   25.0   71.6   91.1   89.2
                   DS HD   89.7   86.1   89.4   90.0
                   R  HD   87.9   89.5   89.8   89.9
 f = 1/2, ρ = 0.5  DS FV   41.7   75.0   83.9   90.6
                   R  FV   32.5   75.3   87.5   90.8
                   DS HD   91.2   88.0   88.0   90.4
                   R  HD   88.4   88.9   89.2   90.2
 f = 3/4, ρ = 0    DS FV   11.7   71.2   90.9   89.1
                   R  FV    5.5   72.8   90.3   89.1
                   DS HD   88.2   89.8   90.0   90.0
                   R  HD   89.4   90.3   90.0   89.9
 f = 3/4, ρ = 0.5  DS FV   18.3   71.3   88.5   90.5
                   R  FV   12.9   70.1   89.9   90.1
                   DS HD   90.0   92.3   91.4   90.5
                   R  HD   89.0   91.3   90.4   90.3

Table 8: Average length of confidence intervals for the coefficients of the projection parameters; stability selection

                        |s| = 1   2     3     4     5     6     7     8
 f = 1/2, ρ = 0    DS HD   -     0.34  0.34  0.34  0.33  0.33  0.34  -
                   R  HD   -     0.35  0.34  0.33  0.33  0.33  0.33  0.30
 f = 1/2, ρ = 0.5  DS HD   0.34  0.38  0.37  0.37  0.36  -     -     -
                   R  HD   0.35  0.39  0.37  0.36  0.36  0.35  0.34  -
 f = 3/4, ρ = 0    DS HD   -     0.48  0.48  0.48  0.48  0.47  0.47  0.48
                   R  HD   -     0.52  0.49  0.47  0.47  0.46  0.46  0.45
 f = 3/4, ρ = 0.5  DS HD   0.48  0.55  0.54  0.52  0.52  0.53  -     -
                   R  HD   0.49  0.56  0.54  0.52  0.51  0.48  0.49  -

Proof of Proposition 1

Proof. Let ℛ be the support of R, write p_r = pr(R = r) for r ∈ ℛ, and let 1(·) denote the indicator function. By the harmonic–arithmetic inequality for Hermitian matrices (Bhagwat and Subramanian, 1978), it follows that

f^{-1} I_Y(β)^{-1} = {Σ_{i=1}^n pr(i ∈ R) I_i(β)}^{-1}
                   = {Σ_{i=1}^n Σ_{r∈ℛ} p_r 1(i ∈ r) I_i(β)}^{-1}
                   = {Σ_{r∈ℛ} p_r Σ_{i=1}^n 1(i ∈ r) I_i(β)}^{-1}
                   = {Σ_{r∈ℛ} p_r I_r(β)}^{-1}
                   < Σ_{r∈ℛ} p_r I_r(β)^{-1} = E{I_R(β)^{-1}}.
The inequality is strict because the I_r(β)'s are not all equal. By the monotonicity of φ, this implies that

φ{f^{-1} I_Y(β)^{-1}} < φ[E{I_R(β)^{-1}}].

Furthermore, since φ is convex, Jensen's inequality gives

φ[E{I_R(β)^{-1}}] ≤ E[φ{I_R(β)^{-1}}],

which shows the first inequality. The second one can be shown with an analogous argument.

Proof of Theorem 1

Proof. Assume without loss of generality that the maximum absolute entry of η is 1 for all n; otherwise we can divide η by its maximum absolute entry. Furthermore, assume for the moment that σ̂ = σ and that E is convex for all n.

In the first part of the proof we show that, under these conditions, the standardised joint distribution of X^T U and η^T V is asymptotically normal. Define the 2n-vector T and the 2n × (p+1) matrix A, stacked block-wise, by

T = (ε; W),   A = (X, η; γX, −γ^{-1}η).

Note that A has full column rank and that

A^T A = ((1+γ²) X^T X, 0_p; 0_p^T, (1+γ^{-2}) ||η||²),   A^T T = (X^T U; η^T V) − (X^T µ; η^T µ).

Define the (p+1)-dimensional vectors P_i = σ^{-1}(A^T A)^{-1/2} A^T e_i ε_i for i = 1, ..., n, and P_i = σ^{-1}(A^T A)^{-1/2} A^T e_i W_{i−n} for i = n+1, ..., 2n, where e_i is the i-th canonical vector of R^{2n}. Clearly, E(P_i) = 0_{p+1} for all i. Furthermore,

Σ_{i=1}^{2n} P_i = σ^{-1}(A^T A)^{-1/2} A^T T ≡ Q,   Σ_{i=1}^{2n} var(P_i) = I_{p+1}.

Consider events of the form {X^T U ∈ E, η^T V ≤ c} for c ∈ R. These events can be equivalently written as {Q ∈ B_c}, where

B_c = {x ∈ R^{p+1} : σ(A^T A)^{1/2} x + (X^T µ; η^T µ) ∈ E × (−∞, c]}.

Clearly, since E is convex, B_c is also convex.
Thus, by a version of the multivariate Berry–Esseen theorem for convex sets (Raič, 2019, Theorem 1.1), it follows that

|pr{Q ∈ B_c} − pr{N(0_{p+1}, I_{p+1}) ∈ B_c}|
  ≤ α σ^{-3} {E(|ε₁|³) Σ_{i=1}^{n} (e_i^T A(A^T A)^{-1} A^T e_i)^{3/2} + E(|W₁|³) Σ_{i=n+1}^{2n} (e_i^T A(A^T A)^{-1} A^T e_i)^{3/2}}
  = α̃ Σ_{i=1}^{2n} (e_i^T A(A^T A)^{-1} A^T e_i)^{3/2},

where α and α̃ are constants independent of n. By the min-max theorem, we can bound the terms of the last sum as follows:

e_i^T A(A^T A)^{-1} A^T e_i ≤ ||e_i^T A||² λ_max{(A^T A)^{-1}} ≤ (p+1) max(A)² λ_max{(A^T A)^{-1}}.

Hence,

Σ_{i=1}^{2n} (e_i^T A(A^T A)^{-1} A^T e_i)^{3/2} ≤ 2n (p+1)^{3/2} max(A)³ λ_max{(A^T A)^{-1}}^{3/2} = o(1)

by the various asymptotic conditions on X and η. Recall that we are assuming that the entries of η are bounded, and note that the eigenvalues of A^T A are those of X^T X multiplied by 1+γ², together with (1+γ^{-2})||η||². The bound given by the theorem is uniform over convex sets, so we have shown that

|pr{Q ∈ B_c} − pr{N(0_{p+1}, I_{p+1}) ∈ B_c}| = o(1)

uniformly in c. Now, using that A^T A is block-diagonal, the asymptotic probability can be written as

pr{N(0_{p+1}, I_{p+1}) ∈ B_c} = pr{N((X^T µ; η^T µ), σ² A^T A) ∈ E × (−∞, c]}
  = pr{N(X^T µ, (1+γ²) σ² X^T X) ∈ E} pr{N(η^T µ, (1+γ^{-2}) σ² ||η||²) ≤ c}.

To conclude this part, let c = kx + η^T µ for an arbitrary x independent of n, where k = (1+γ^{-2})^{1/2} σ ||η||. We have that

pr{k^{-1}(η^T V − η^T µ) ≤ x | X^T U ∈ E} = pr(X^T U ∈ E, η^T V ≤ c) / pr(X^T U ∈ E, η^T V < ∞).

Applying the normal approximation to the numerator and the denominator, we obtain

pr{k^{-1}(η^T V − η^T µ) ≤ x | X^T U ∈ E} → pr{N(0, 1) ≤ x}.

This establishes convergence in distribution under a known variance assumption and a convex selection set.
Extending the result to the conditions of the theorem is straightforward. First, if σ̂ ≠ σ, write V = Ṽ + γ^{-1}(σ − σ̂)Z, with Ṽ = Y − γ^{-1}σZ. Then,

k^{-1}(η^T V − η^T µ) = k^{-1}(η^T Ṽ − η^T µ) + k^{-1} γ^{-1} (σ − σ̂) η^T Z.

We have shown that, conditionally on selection, the first term converges in distribution to a standard normal. The second one is bounded by a constant times (σ − σ̂) ||η||^{-1} η^T Z. Note that, unconditionally on selection, σ − σ̂ = o_p(1) by assumption, and ||η||^{-1} η^T Z = O_p(1), as it has zero mean and unit variance. Hence, the last term is o_p(1) unconditionally on selection. However, since the selection probability is bounded away from zero, it follows that the last term converges to zero also conditionally on selection, as

pr(|(σ − σ̂) ||η||^{-1} η^T Z| > δ | S = s_n) ≤ pr(|(σ − σ̂) ||η||^{-1} η^T Z| > δ) / pr(S = s_n) → 0, n → ∞,

for any positive δ. The result follows from Slutsky's theorem.

Finally, suppose that E = E_1 ∪ E_2, with E_1 and E_2 convex. We can write

pr{k^{-1}(η^T V − η^T µ) ≤ x | X^T U ∈ E}
  = {pr(X^T U ∈ E_1, η^T V ≤ c) + pr(X^T U ∈ E_2, η^T V ≤ c) − pr(X^T U ∈ E_1 ∩ E_2, η^T V ≤ c)} / {pr(X^T U ∈ E_1) + pr(X^T U ∈ E_2) − pr(X^T U ∈ E_1 ∩ E_2)}

and use the normal approximation as before, since all the events involved are convex. When the union involves more than two sets we can proceed recursively.

References

Bachoc, F., Leeb, H. and Pötscher, B. M. (2017), 'Valid confidence intervals for post-model-selection predictors'. arXiv:1412.4605v3.
Bachoc, F., Preinerstorfer, D. and Steinberger, L. (2020), 'Uniformly valid confidence intervals post-model-selection', Ann. Stat. (1), 440–463.
Barber, R. F. and Candès, E. (2015), 'Controlling the false discovery rate via knockoffs', Ann. Stat. (5), 2055–2085.
Barber, R. F., Candès, E., Janson, L., Patterson, E. and Sesia, M. (2020), The knockoff filter for controlled variable selection.
R package version 0.3.3. URL: https://CRAN.R-project.org/package=knockoff
Bayati, M., Erdogdu, M. A. and Montanari, A. (2013), 'Estimating LASSO risk and noise level', in C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani and K. Q. Weinberger, eds, 'Advances in Neural Information Processing Systems 26 (NIPS 2013)'.
Berk, R., Brown, L., Buja, A., Zhang, K. and Zhao, L. (2013), 'Valid post-selection inference', Ann. Stat. (2), 802–837.
Bhagwat, K. V. and Subramanian, R. (1978), 'Inequalities between means of positive operators', Math. Proc. Camb. Philos. Soc. (3), 393–401.
Candès, E., Fan, Y., Janson, L. and Lv, J. (2018), 'Panning for gold: 'model-X' knockoffs for high dimensional controlled variable selection', J. R. Statist. Soc. B (3), 551–577.
Cox, D. R. (1975), 'A note on data-splitting for the evaluation of significance levels', Biometrika (2), 441–444.
DiCiccio, C. J., DiCiccio, T. J. and Romano, J. P. (2020), 'Exact tests via multiple data splitting', Stat. Probab. Lett.
Fan, J., Guo, S. and Hao, N. (2012), 'Variance estimation using refitted cross-validation in ultrahigh dimensional regression', J. R. Statist. Soc. B (1), 37–65.
Fithian, W., Sun, D. L. and Taylor, J. E. (2017), 'Optimal inference after model selection'. arXiv:1410.2597v4.
Hofner, B. and Hothorn, T. (2017), Stability selection with error control. R package version 0.6-3. URL: https://CRAN.R-project.org/package=stabs
Hong, L., Kuffner, T. A. and Martin, R. (2018), 'On overfitting and post-selection uncertainty assessments', Biometrika (1), 221–224.
Ignatiadis, N., Klaus, B., Zaugg, J. and Huber, W. (2016), 'Data-driven hypothesis weighting increases detection power in genome-scale multiple testing', Nat. Methods (7), 577–580.
Kivaranovic, D. and Leeb, H. (2020), 'On the length of post-model-selection confidence intervals conditional on polyhedral constraints', JASA (0), 1–13.
Lee, J. D., Sun, D. L., Sun, Y. and Taylor, J. E. (2016), 'Exact post-selection inference, with application to the lasso', Ann.
Stat. (3), 907–927.
Lee, J. D. and Taylor, J. E. (2014), 'Exact post model selection inference for marginal screening', Adv. Neural Inf. Process Syst., 136–144.
Lockhart, R., Taylor, J. E., Tibshirani, R. and Tibshirani, R. (2014), 'A significance test for the lasso', Ann. Stat. (2), 413–468.
Loftus, J. R. and Taylor, J. E. (2014), 'A significance test for forward stepwise model selection'. arXiv:1405.3920v1.
Markovic, J., Taylor, J. and Taylor, J. (2019), 'Inference after black box selection'. arXiv:1901.09973v1.
Meinshausen, N. and Bühlmann, P. (2010), 'Stability selection', J. R. Statist. Soc. B (4), 417–473.
Panigrahi, S., Taylor, J. and Weinstein, A. (2020), 'Integrative methods for post-selection inference under convex constraints'. arXiv:1605.08824v7.
Raič, M. (2019), 'A multivariate Berry–Esseen theorem with explicit constants', Bernoulli (4A), 2824–2853.
Reid, S., Tibshirani, R. and Friedman, J. (2016), 'A study of error variance estimation in lasso regression', Stat. Sin., 35–67.
Reitermanová, Z. (2010), 'Data splitting', WDS'10 Proceedings of Contributed Papers, 31–36.
Rinaldo, A., Wasserman, L. and G'Sell, M. (2019), 'Bootstrapping and sample splitting for high-dimensional, assumption-lean inference', Ann. Stat. (6), 3438–3469.
Rubin, D., Dudoit, S. and van der Laan, M. (2006), 'A method to increase the power of multiple testing procedures through sample splitting', Stat. Appl. Genet. Mol. Biol. (19).
Snee, R. D. (1977), 'Validation of regression models: methods and examples', Technometrics, 415–428.
Tian, X. (2020), 'Prediction error after model search', Ann. Stat. (2), 763–784.
Tian, X. and Taylor, J. E. (2018), 'Selective inference with a randomized response', Ann. Stat. (2), 679–710.
Tibshirani, R., Rinaldo, A., Tibshirani, R. and Wasserman, L. (2018), 'Uniform asymptotic inference and the bootstrap after model selection', Ann. Stat. (3), 1255–1287.
Tibshirani, R., Tibshirani, R., Taylor, J., Loftus, J., Reid, S. and Markovic, J.
(2019), selectiveInference: Tools for post-selection inference. R package version 1.2.5. URL: https://CRAN.R-project.org/package=selectiveInference
Wasserman, L. and Roeder, K. (2009), 'High-dimensional variable selection', Ann. Stat. 37