Communication-Efficient False Discovery Rate Control via Knockoff Aggregation
Weijie Su Junyang Qian Linxi Liu
Department of Statistics, Stanford University, Stanford, CA 94305, USA

November 2015
Abstract
The false discovery rate (FDR), the expected fraction of spurious discoveries among all the discoveries, provides a popular statistical assessment of the reproducibility of scientific studies in various disciplines. In this work, we introduce a new method for controlling the FDR in meta-analysis of many decentralized linear models. Our method targets the scenario where many research groups, possibly a random number of them, are independently testing a common set of hypotheses and then sending summary statistics to a coordinating center in an online manner. Built on the knockoffs framework introduced by Barber and Candès (2015), our procedure starts by applying the knockoff filter to each linear model and then aggregates the summary statistics via one-shot communication in a novel way. This method gives exact FDR control non-asymptotically, without any knowledge of the noise variances and without making any assumption about sparsity of the signal. In certain settings, it has a communication complexity that is optimal up to a logarithmic factor.
1 Introduction

Modern scientific discoveries are commonly supported by statistical significance summarized from exploring datasets. In our present world of Big Data, there are a number of difficulties with this scenario: an increasing number of hypotheses tested simultaneously, extensive use of sophisticated techniques, and a large number of tuning parameters. In this pipeline, spurious discoveries arise naturally by random chance alone across nearly all disciplines, including health care [19, 2, 15], machine learning [9, 7], and neuroscience [22].

To address this challenge, the statistical community has developed a variety of approaches over the past two decades. A landmark work [3] proposed the false discovery rate (FDR) as a new measure of type-I error for claiming discoveries, along with the elegant Benjamini-Hochberg procedure (BHq) for controlling the FDR in the case of independent test statistics. Roughly speaking, the FDR is the expected fraction of erroneous discoveries among all the claimed discoveries. Today, this concept has been widely accepted as a criterion for providing evidence about the reproducibility of discoveries claimed in one experiment.

Our motivation for this work is further enhanced by the observation that scientific experiments are inherently decentralized in nature, where a given set of hypotheses is probed by several groups working in parallel. An individual group has very limited access to the datasets collected by the others. Challenges then arise in how to statistically and efficiently perform meta-analysis of the results from all groups so as to control the FDR while maintaining a higher power (the fraction of correctly identified true discoveries) than any individual group. As a running example, imagine an initiative intended to study the genetic causes of autism across many research institutes. Due to the privacy and confidentiality of the datasets held by different institutes, it would be difficult to share full datasets. In contrast, aggregating small-volume summary statistics from each institute is a practical solution. A similar issue is observed among high-tech companies that hold background and behavioral information on thousands of millions of individuals, but are reluctant to share data for common research topics, in part due to huge communication costs.
To formalize the problem considered throughout the paper, suppose we observe a sequence of linear models

    $y_i = X_i \beta_i + z_i$,

where the design $X_i \in \mathbb{R}^{n_i \times p}$ and the response $y_i \in \mathbb{R}^{n_i}$ are collected by the $i$th group, the error term $z_i$ has i.i.d. $\mathcal{N}(0, \sigma_i^2)$ entries, and the signal $\beta_i \in \mathbb{R}^p$ may vary across different groups. Keeping in mind that a (sufficiently) strong signal in one of the groups is adequate to declare significance in the meta-analysis finding, we are interested in any feature $j$ that obeys $\beta_{ij} \neq 0$ for at least one $i$; for any model selection procedure returning a set of discoveries $\widehat{S} \subset \{1, \ldots, p\}$, the false discovery proportion (FDP) is defined as

    $\mathrm{FDP} = \frac{\#\{1 \le j \le p : j \in \widehat{S} \text{ and } \beta_{ij} = 0 \text{ for all } i\}}{\max\{|\widehat{S}|, 1\}}$,    (1)

and the FDR is the expectation of the FDP. We assume that each group has access only to its own data, that is, $(y_i, X_i)$, and reports summary statistics encoded in about $O(p)$ bits to a coordinating center. In this protocol, we aim to achieve exact FDR control by only making use of the information received at the center.

Our approach, referred to as knockoff aggregation, is built on top of the knockoffs framework introduced by Barber and Candès [1]. The knockoff filter remarkably achieves exact FDR control in the finite sample setting for a single linear model whenever the number of variables is no more than the number of observations. The validity of the method does not depend on the amplitude or sparsity of the unknown signal, or on any knowledge of the noise variance. In sharp contrast, the BHq is only known to control the FDR for sequence models under very restricted correlation structures [4], apart from the independent case [3].

Some appealing features of knockoff aggregation are as follows. Inherited from the knockoffs, our method controls the FDR exactly for general design matrices in a non-asymptotic manner and does not require any knowledge of the noise variances of the linear models. Apart from these inheritances, knockoff aggregation provides more refined information on the significance of each hypothesis by aggregating many independent copies of the summary statistics, resembling the multiple knockoffs briefly mentioned in [1]. This property not only improves power by amplifying the signal, but also allows control of a generalized FDR which incorporates randomized decision rules. Due to its one-shot nature, this method only costs $O(p \cdot \mathrm{poly}(\log p))$ bits of communication per linear model.
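To make the criterion in (1) concrete, the following minimal Python sketch (ours, not the authors' code; the helper name fdp is hypothetical) evaluates the FDP of a selected set given the matrix of signals.

    import numpy as np

    def fdp(S_hat, betas):
        """FDP as in Eq. (1). betas has shape (m, p); feature j is a true null
        when beta_ij = 0 for all i, and non-null otherwise."""
        S_hat = np.asarray(S_hat, dtype=int)
        if S_hat.size == 0:
            return 0.0
        true_null = ~np.any(betas != 0, axis=0)   # j with beta_ij = 0 for all i
        return true_null[S_hat].sum() / max(S_hat.size, 1)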
2 Preliminaries

In this section, we give a concise exposition of the knockoff filter [1]. Consider

    $y = X\beta + z$,

where the design $X$ is $n \times p$ and the noise term $z$ consists of $n$ i.i.d. $\mathcal{N}(0, \sigma^2)$ entries. The knockoffs framework assumes that the number of variables $p$ is no more than the number of measurements $n$ and that the design matrix $X$ has full rank. This is used to ensure model identifiability, since otherwise there would exist a non-trivial linear combination of the $p$ features $X_j$ summing to zero. Moreover, we normalize each column: $\|X_j\|_2 = 1$ for all $1 \le j \le p$.

The method starts by constructing knockoff features $\widetilde{X} \in \mathbb{R}^{n \times p}$ that satisfy

    $\widetilde{X}^\top \widetilde{X} = X^\top X, \qquad X^\top \widetilde{X} = X^\top X - \mathrm{diag}(s)$,    (2)

where $s \in \mathbb{R}^p$ has nonnegative entries. To understand the constraints, observe that the first equality forces $\widetilde{X}$ to mimic the correlation structure of $X$. The second equality further requires any original-knockoff pair $X_j, \widetilde{X}_j$ to have the same correlation with all the other $2p - 2$ features.

The next step is to generate statistics for every original-knockoff pair. Denote by $X^{\mathrm{KO}} = [X, \widetilde{X}] \in \mathbb{R}^{n \times 2p}$ the augmented design matrix. The reference paper suggests choosing the Lasso on the augmented design as a pilot estimator:

    $\widehat{\beta}(\lambda) = \operatorname*{argmin}_{b \in \mathbb{R}^{2p}} \tfrac{1}{2}\|y - X^{\mathrm{KO}} b\|_2^2 + \lambda \|b\|_1$.

Then, let $Z_j = \sup\{\lambda : \widehat{\beta}_j(\lambda) \neq 0\}$, the largest penalty at which feature $j$ is active on the Lasso path. Similarly, define $\widetilde{Z}_j$ for the knockoff variable $\widetilde{X}_j$. A recommended choice of the knockoff statistics is (different notation is used for ease of exposition)

    $W_j = \max\{Z_j, \widetilde{Z}_j\}, \qquad \chi_j = \mathrm{sgn}(Z_j - \widetilde{Z}_j)$,

where $\mathrm{sgn}(x)$ equals $-1$, $0$, or $1$ according as $x < 0$, $x = 0$, or $x > 0$. More generally, $W_j$ can take the form of any symmetric function of $Z_j$ and $\widetilde{Z}_j$. Furthermore, the pilot estimator is not necessarily confined to the Lasso; alternatives include least-squares, least angle regression [12], and any likelihood estimation procedure with a symmetric penalty (see e.g. [13, 25, 5]).
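For concreteness, here is a minimal Python sketch of this construction and of the pair statistics. It is our illustration rather than the authors' code: it assumes $n \ge 2p$ and a full-rank design, uses the equicorrelated choice $s_j = \min\{2\lambda_{\min}(X^\top X), 1\}$ from [1], and the helper names are hypothetical.

    import numpy as np
    from sklearn.linear_model import lasso_path

    def construct_knockoffs(X, seed=0):
        """Knockoffs satisfying Eq. (2), for an n x p design with n >= 2p
        and unit-norm columns."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        Sigma = X.T @ X
        Sigma_inv = np.linalg.inv(Sigma)
        s = min(2 * np.linalg.eigvalsh(Sigma).min(), 1.0) * np.ones(p)
        # C'C = 2 diag(s) - diag(s) Sigma^{-1} diag(s), PSD by the choice of s
        A = 2 * np.diag(s) - np.diag(s) @ Sigma_inv @ np.diag(s)
        C = np.linalg.cholesky(A + 1e-10 * np.eye(p)).T
        # U: n x p with orthonormal columns orthogonal to the span of X
        Q, _ = np.linalg.qr(np.hstack([X, rng.standard_normal((n, p))]))
        U = Q[:, p:2 * p]
        return X @ (np.eye(p) - Sigma_inv @ np.diag(s)) + U @ C

    def knockoff_stats(X, y):
        """Entry times on the Lasso path; returns (W_j, chi_j), j = 1..p.
        (sklearn's penalty is scaled by 1/n, which does not affect ordering.)"""
        p = X.shape[1]
        X_ko = np.hstack([X, construct_knockoffs(X)])
        alphas, coefs, _ = lasso_path(X_ko, y, n_alphas=500, eps=1e-4)
        active = np.abs(coefs) > 1e-12                # shape (2p, n_alphas)
        Z = np.array([alphas[a].max() if a.any() else 0.0 for a in active])
        W = np.maximum(Z[:p], Z[p:])                  # symmetric in (Z, Z~)
        chi = np.sign(Z[:p] - Z[p:]).astype(int)      # one-bit evidence
        return W, chi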
The following lemma, due to [1], is essential for the proof of FDR control of our knockoff aggregation. As is clear from (1), we call $j$ a true null when $\beta_j = 0$ and a false null otherwise.

Lemma 2.1. Conditional on all false null $\chi_j$ and all $W_j$, the true null $\chi_j$ are jointly independent and uniformly distributed on $\{-1, 1\}$.

This simple lemma follows from the delicate symmetry between $X_j$ and its knockoff $\widetilde{X}_j$, which is guaranteed by the construction (2). The result implies that each $\chi_j$ can be interpreted as a one-bit $p$-value, in the sense that it takes $1$ or $-1$ with equal probability when $\beta_j = 0$. In the case of large $|\beta_j|$, we expect $\chi_j$ to be more likely to take $1$, since the original feature $X_j$ has better odds of entering the Lasso path earlier than $\widetilde{X}_j$. This lemma also suggests ordering the hypotheses by the magnitude of $W_j$, so that hypotheses that are more likely to be rejected are tested earlier. A good ordering of hypotheses is a key element for improving power in sequential hypothesis testing (see e.g. [14, 21]).

3 Knockoff Aggregation

We recap the problem: we observe a sequence of decentralized linear regression models with the same set of hypotheses of interest,

    $y_i = X_i \beta_i + z_i$    (3)

for $1 \le i \le m$. The design matrix $X_i$ is $n_i \times p$ and $z_i$ consists of i.i.d. centered normals. The error terms $z_i$ are jointly independent and are allowed to have different variance levels. The number of observed models $m$ is not necessarily deterministic, but must be independent of the randomness of all $z_i$. We are interested in simultaneously testing

    $H_{0,j}: \beta_{1j} = \beta_{2j} = \cdots = \beta_{mj} = 0$

(that is, feature $j$ has no effect in any study) versus

    $H_{a,j}: \text{at least one } \beta_{ij} \neq 0$

for $1 \le j \le p$. To fully utilize the knockoffs framework, we assume $n_i \ge p$ for all $i \ge 1$. Each group runs the knockoff filter on its own data and obtains the statistics $W_{i1}, \ldots, W_{ip}$ and the one-bit $p$-values $\chi_{i1}, \ldots, \chi_{ip}$. Then, for each $1 \le j \le p$, we aggregate $W_{1j}, \ldots, W_{mj}$ to produce $\overline{W}_j$, which measures the rank of the $j$th hypothesis: the larger $\overline{W}_j$ is, the earlier the hypothesis $H_{0,j}$ is tested. Let this summary statistic take the form $\overline{W}_j = \Gamma(W_{1j}, \ldots, W_{mj})$ for some nonnegative measurable function $\Gamma$ defined on $\mathbb{R}_+^m$. Recognizing that a large $W_{ij}$ provides evidence of a significant rank of the corresponding hypothesis, we are particularly interested in summary functions $\Gamma$ that are non-decreasing in each coordinate. No further conditions on $\Gamma$ are required. Examples include $\Gamma(x_1, \ldots, x_m) = \max\{x_1, \ldots, x_m\}$ and $\Gamma(x_1, \ldots, x_m) = $ the sum (or product) of the $r$ largest of $x_1, \ldots, x_m$ for some $1 \le r \le m$.

Algorithm 1 Knockoff aggregation
Require: $X_1, \ldots, X_m$, $y_1, \ldots, y_m$, and a summary function $\Gamma$.
1: Run the knockoff filter for each model $i$ and get $\chi_{i1}, \ldots, \chi_{ip}$ and $W_{i1}, \ldots, W_{ip}$.
2: (One-shot communication) Let $\overline{\chi}_j \leftarrow \frac{1}{2}\left(m + \sum_{i=1}^m \chi_{ij}\right)$ and $\overline{W}_j \leftarrow \Gamma(W_{1j}, \ldots, W_{mj})$ for $1 \le j \le p$.
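The one-shot aggregation step admits a very small sketch (ours; names hypothetical). Each model $i$ contributes its vectors of $W_{ij}$ and $\chi_{ij}$, and $\Gamma$ is passed in as a callable:

    import numpy as np

    def aggregate(W_list, chi_list, Gamma=np.max):
        """One-shot aggregation of Algorithm 1.
        W_list, chi_list: length-m lists of p-vectors from the m models.
        Returns (chi_bar, W_bar), each a p-vector."""
        W = np.vstack(W_list)        # shape (m, p)
        chi = np.vstack(chi_list)    # entries in {-1, 0, +1}
        m = W.shape[0]
        chi_bar = (m + chi.sum(axis=0)) / 2       # number of +1 signs
        W_bar = np.apply_along_axis(Gamma, 0, W)  # summary Gamma over models
        return chi_bar, W_bar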
Having defined the aggregated knockoff statistics, we now turn to controlling a generalized FDR, which we call the weighted false discovery rate (wFDR) and which includes the original FDR as a special case. In lieu of the accept-or-reject decision rule, we introduce a randomized decision rule that assigns each hypothesis $H_{0,j}$ a number $\omega_j$ between 0 and 1. The closer $\omega_j$ is to 1, the more confidence we have in rejecting $H_{0,j}$. The definition of the weighted FDR is given as follows.

Definition 3.1. Given a randomized decision rule $\omega \in [0, 1]^p$, define the weighted false discovery proportion as

    $\mathrm{wFDP} = \frac{\sum_{j=1}^p \omega_j \mathbf{1}_{\mathrm{null}_j}}{\sum_{j=1}^p \omega_j}$ if $\sum_{j=1}^p \omega_j > 0$,

and otherwise $\mathrm{wFDP} = 0$, where $\mathbf{1}_{\mathrm{null}_j} = 1$ if the hypothesis $H_{0,j}$ is a true null and $0$ otherwise. The wFDR is the expectation of the wFDP.

If the weights $\omega_j$ take only the values 0 and 1, then the wFDR reduces to the vanilla FDR. In general, $\omega_j$ can be interpreted as the probability, or confidence, of randomly rejecting $H_{0,j}$. As a special case, rejecting a hypothesis with confidence 0 is equivalent to accepting it. The motivation for this generalized FDR is simple: the accept-or-reject rule behaves like a hard rule that may make completely different decisions for very close $p$-values, whereas randomization smooths out this undesirable artifact. From a practical point of view, this generalized FDR has the potential to find applications in large-scale Internet experiments, where randomized decisions occur frequently. Randomization of testing also arises naturally from an empirical Bayesian framework (see e.g. [11]).

As mentioned earlier, the particular form of the refined $\overline{\chi}$-statistics is motivated by the fact that, under the null hypotheses (i.e. $\beta_{1j} = \cdots = \beta_{mj} = 0$), the $\overline{\chi}_j$ are simply i.i.d. binomial random variables, no matter how complicated the joint distribution of the $\overline{W}_j$ is. The following lemma, whose proof is just a stone's throw from Lemma 2.1, formalizes this point.

Lemma 3.2. Conditional on all false null $\overline{\chi}_j$ and all $\overline{W}_j$, the true null $\overline{\chi}_j$ are jointly independent and have the binomial distribution $B(m, 1/2)$.

Therefore, a refined $p$-value for testing $H_{0,j}$ is naturally given by

    $P_j = \frac{1}{2^m} \sum_{i = \overline{\chi}_j}^{m} \binom{m}{i}$,

which, by definition, is stochastically larger than the uniform distribution on $[0, 1]$ under the null. We call a non-increasing function $\Omega: [0, 1] \to [0, 1]$ a confidence function if it obeys

    $\lim_{x \to 0^+} \Omega(x) = \Omega(0) = 1, \qquad \lim_{x \to 1^-} \Omega(x) = \Omega(1) = 0$.

This function is used to provide the weights $\omega$ for rejecting the hypotheses. In the special case of $\Omega(x) = \mathbf{1}_{x \le c}$ for some $0 < c < 1$, the algorithm below reduces to the Selective SeqStep+ of [1]. Interested readers are referred to [23] for generalizations in a different direction. Below, let $U$ be uniformly distributed on $[0, 1]$. Our main result states that exact control of the wFDR in the finite sample setting holds for any summary function $\Gamma$ and confidence function $\Omega$.

Theorem 3.3. Combining Algorithm 1 and Algorithm 2 gives $\mathrm{wFDR} \le q$.

Algorithm 2 Knockoff filter with aggregated statistics from Algorithm 1
Require: $\overline{\chi}_1, \ldots, \overline{\chi}_p$ and $\overline{W}_1, \ldots, \overline{W}_p$ from Algorithm 1, a confidence function $\Omega$, and a nominal level $q \in (0, 1)$.
1: Compute $P_j = \frac{1}{2^m} \sum_{i=\overline{\chi}_j}^m \binom{m}{i}$.
2: Order the hypotheses according to the magnitude of the $\overline{W}$-statistics: $\overline{W}_{\rho(1)} \ge \overline{W}_{\rho(2)} \ge \cdots \ge \overline{W}_{\rho(p)}$, where $\rho(\cdot)$ is a permutation of $1, \ldots, p$.
3: Let

    $\widehat{k} = \max\left\{ k : \frac{1 + \sum_{j=1}^k \left(1 - \Omega(P_{\rho(j)})\right)}{\sum_{j=1}^k \Omega(P_{\rho(j)})} \le \frac{q}{\mathbb{E}\,\Omega(U)} - q \right\}$,

with the convention $\max \emptyset = -\infty$.
4: Reject each hypothesis $H_{0,\rho(j)}$ for $j \le \widehat{k}$ with weight (confidence) $\omega_{\rho(j)} = \Omega(P_{\rho(j)})$.
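The following sketch renders Algorithm 2 in Python; it is our illustration (function names are hypothetical), specialized to the indicator choice $\Omega(x) = \mathbf{1}_{x \le c}$, for which $\mathbb{E}\,\Omega(U) = c$:

    import numpy as np
    from scipy.stats import binom

    def knockoff_filter_aggregated(chi_bar, W_bar, m, q=0.2, c=0.5):
        """Algorithm 2 with Omega(x) = 1{x <= c} (Selective SeqStep+ of [1])."""
        # Step 1: P_j = 2^{-m} * sum_{i >= chi_bar_j} C(m, i)
        P = binom.sf(chi_bar - 1, m, 0.5)
        # Step 2: order by decreasing aggregated statistic W_bar
        rho = np.argsort(-W_bar)
        omega = (P[rho] <= c).astype(float)       # Omega(P_rho(j))
        # Step 3: largest k with (1 + sum(1 - omega)) / sum(omega) <= q/c - q
        num = 1 + np.cumsum(1 - omega)
        den = np.maximum(np.cumsum(omega), 1e-12)
        ok = np.flatnonzero(num / den <= q / c - q)
        if ok.size == 0:
            return np.array([], dtype=int), np.zeros_like(P)
        k_hat = ok[-1] + 1
        # Step 4: reject with confidence omega (weight 0 means accept)
        weights = np.zeros_like(P)
        weights[rho[:k_hat]] = omega[:k_hat]
        return rho[:k_hat], weights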
The proof of the theorem relies on the two lemmas stated below, which parallel Lemma 4 of [1]. We defer their proofs to the Appendix.

Lemma 3.4. Let

    $V^+(k) = \sum_{j=1}^k \Omega(P_j)\,\mathbf{1}_{\mathrm{null}_j}, \qquad V^-(k) = \sum_{j=1}^k \left(1 - \Omega(P_j)\right)\mathbf{1}_{\mathrm{null}_j}$.

Then

    $M(k) = \frac{V^+(k)}{1 + V^-(k)}$

is a super-martingale running backward in $k$ with respect to the filtration $\mathcal{F}_k$ that knows all the false null $P_j$ as well as $V^+(k), V^+(k+1), \ldots, V^+(p)$.

Lemma 3.5. For any integer $N \ge 1$, let $U, U_1, \ldots, U_N$ be i.i.d. uniform random variables on $[0, 1]$. Then

    $\mathbb{E}\left[\frac{\sum_{j=1}^N \Omega(U_j)}{1 + \sum_{j=1}^N (1 - \Omega(U_j))}\right] \le \frac{\mathbb{E}\,\Omega(U)}{1 - \mathbb{E}\,\Omega(U)}$.    (4)

Now we turn to the proof of Theorem 3.3. The idea is similar to that of Theorem 2 in [1].

Proof of Theorem 3.3. Recall that the weights are $\omega_j = \Omega(P_j)$ if $H_{0,j}$ is ranked among the first $\widehat{k}$ hypotheses processed by Algorithm 2, and $\omega_j = 0$ otherwise. Due to the independence between $m$ and all $z_i$, a conditioning argument reduces the theorem to the case of deterministic $m$. By Lemma 3.2, without loss of generality we may assume $\overline{W}_1 \ge \cdots \ge \overline{W}_p$, so that $\rho$ is the identity. Then

    $\mathrm{wFDP} \cdot \mathbf{1}_{\widehat{k} \ge 1} = \frac{\sum_{j=1}^{\widehat{k}} \Omega(P_j)\,\mathbf{1}_{\mathrm{null}_j}}{\sum_{j=1}^{\widehat{k}} \Omega(P_j)} = \frac{1 + \sum_{j=1}^{\widehat{k}} (1 - \Omega(P_j))\,\mathbf{1}_{\mathrm{null}_j}}{\sum_{j=1}^{\widehat{k}} \Omega(P_j)} \cdot \frac{\sum_{j=1}^{\widehat{k}} \Omega(P_j)\,\mathbf{1}_{\mathrm{null}_j}}{1 + \sum_{j=1}^{\widehat{k}} (1 - \Omega(P_j))\,\mathbf{1}_{\mathrm{null}_j}}$
    $\le \frac{1 + \sum_{j=1}^{\widehat{k}} (1 - \Omega(P_j))}{\sum_{j=1}^{\widehat{k}} \Omega(P_j)} \cdot \frac{\sum_{j=1}^{\widehat{k}} \Omega(P_j)\,\mathbf{1}_{\mathrm{null}_j}}{1 + \sum_{j=1}^{\widehat{k}} (1 - \Omega(P_j))\,\mathbf{1}_{\mathrm{null}_j}} \le \frac{q\,(1 - \mathbb{E}\,\Omega(U))}{\mathbb{E}\,\Omega(U)} \cdot M(\widehat{k})$.

Recognizing that $\widehat{k}$ is a stopping time with respect to the backward filtration $\mathcal{F}_k$, we apply Doob's optional stopping theorem to the super-martingale $M(k)$ of Lemma 3.4:

    $\mathrm{wFDR} \le \frac{q (1 - \mathbb{E}\,\Omega(U))}{\mathbb{E}\,\Omega(U)} \cdot \mathbb{E}\, M(\widehat{k}) \le \frac{q (1 - \mathbb{E}\,\Omega(U))}{\mathbb{E}\,\Omega(U)} \cdot \mathbb{E}\, M(p) \le \frac{q (1 - \mathbb{E}\,\Omega(U))}{\mathbb{E}\,\Omega(U)} \cdot \frac{\mathbb{E}\,\Omega(U)}{1 - \mathbb{E}\,\Omega(U)} = q$,

where the last inequality follows from Lemma 3.5 and the observation that $\Omega(P_j)$ is stochastically dominated by $\Omega(U_j)$.

While it is outside the scope of the present paper, we briefly point out that knockoff aggregation can be applied to control other type-I error rates, including the $k$-familywise error rate ($k$-FWER) [18], the $\gamma$-FDP [16], and the per-family error rate (PFER) [17]. These error rates have different interpretations from the FDR and are more favorable in certain applications. Interested readers are referred to a recent work [20], where some attractive features of the knockoffs framework are translated into a novel procedure for provably controlling the $k$-FWER and PFER. With more refined information, knockoff aggregation has the potential to improve power while still controlling these error rates.

4 Communication Complexity

Knockoff aggregation is communication-efficient due to its one-shot nature. For each decentralized linear model, the message sent to the coordinating center consists merely of the sign information $\chi_{ij}$ and the ordering information $W_{ij}$. This information can be encoded in $O(p\,\mathrm{poly}(\log p))$ bits, where the polylogarithmic factor comes from quantizing each $W_{ij}$ to the required accuracy. Hence, the total communication is $\widetilde{O}(mp)$ bits. We can further remove the logarithmic factor by forcing $W_{ij}$ to take only the values 0 or 1, depending on whether the original $W_{ij}$ is below the median of $W_{i1}, \ldots, W_{ip}$ or not.

It would be interesting to obtain a lower bound on the total communication cost required to control the FDR while maintaining decent power. A trivial bound is $\Omega(p)$, since it takes $p$ bits to fully characterize the support set of $\beta_i$ (in this section, $\Omega(\cdot)$ denotes the Big Omega notation of complexity theory, not the confidence function). In general, this bound is unachievable, since the summary statistics from each decentralized model are obtained using only local information. To shed light on this, we provide a simple but illuminating example of (3) where $\Theta(mp)$ is the optimal communication cost, up to a polylogarithmic factor, for achieving asymptotically vanishing FDP and full power.

To start, fix the noise levels $\sigma_i = 1$ and let all $\beta_i$ equal a common $\beta$. Let the design matrix $X_i$ of each decentralized model be a $(2p) \times p$ matrix with orthonormal columns, and let each $\beta_j$ independently take the value $\mu := \sqrt{(\log p)/m}$ with probability one half and $0$ otherwise. We further assume that $\beta$ is independent of the $z_i$, and that both $p, m \to \infty$ without differing too extremely from each other, in the sense that $p = O(\mathrm{e}^{m^{0.99}})$ and $m = O(\mathrm{poly}(p))$. This condition allows $m \asymp \log p$ or $m \asymp p^\alpha$ for arbitrary $\alpha > 0$.
The summary statistics in this example are defined as follows. We regress $y_i$ on the augmented design $[X_i, \widetilde{X}_i]$ (here $\widetilde{X}_i$ is an orthonormal complement of $X_i$), obtain the least-squares estimates $\widehat{\beta}_{ij}, \widetilde{\beta}_{ij}$ for each $1 \le j \le p$, and then take $\chi_{ij} = \mathrm{sgn}(\widehat{\beta}_{ij} - \widetilde{\beta}_{ij})$ and $W_{ij} = 1$ or $0$, depending on whether $|\widehat{\beta}_{ij} - \widetilde{\beta}_{ij}|$ is above the median of $|\widehat{\beta}_{i1} - \widetilde{\beta}_{i1}|, \ldots, |\widehat{\beta}_{ip} - \widetilde{\beta}_{ip}|$ or not.

Under the preceding assumptions, knockoff aggregation almost perfectly recovers the support set of the signal $\beta$ with a total communication cost of $O(mp)$ bits. Let $V \in \{0, 1\}^p$ be defined by $V_j = 1$ if $\beta_j \neq 0$ and $V_j = 0$ otherwise (so $V$ is uniformly distributed on the cube $\{0, 1\}^p$). Similarly, the output of knockoff aggregation, denoted $\widehat{V}^{\mathrm{KO}}$, takes the form $\widehat{V}^{\mathrm{KO}}_j = 1$ if $H_{0,j}$ is rejected and $\widehat{V}^{\mathrm{KO}}_j = 0$ if $H_{0,j}$ is accepted. Last, denote by $\mathrm{Hamm}(\cdot, \cdot)$ the Hamming distance.

Proposition 4.1. Let $\epsilon$ be any positive constant. With probability tending to one, knockoff aggregation with a slowly vanishing nominal level $q$ obeys

    $\mathrm{Hamm}(\widehat{V}^{\mathrm{KO}}, V) \le \epsilon p$.

Hence, knockoff aggregation is capable of distinguishing almost all the signal features from the noise features, resulting in $\mathrm{FDP} \to 0$ and power $\to 1$. The nominal level $q$ is spelled out in the proof.

Next, we give the information-theoretic lower bound. Denote by $M_i$ the message sent by the $i$th model, which depends only on the local information $y_i, X_i$. The coordinating center then makes decisions $\widehat{V} \in \{0, 1\}^p$, rejecting or accepting each of the $p$ hypotheses solely based on the $m$ messages $M_1, \ldots, M_m$; in other words, the protocol is non-interactive. Let $L_i$ be the minimal length of $M_i$ in bits, with a preassigned budget constraint $\mathbb{E}(L_1 + \cdots + L_m) \le B$. The proof of the result below uses tools from [8].

Proposition 4.2. Let $\epsilon$ and $C$ be arbitrary positive constants. If the total communication budget obeys $B = Cmp/\log^{2.99} p$, then for any non-interactive protocol,

    $\mathrm{Hamm}(\widehat{V}, V) \ge (0.5 - \epsilon)\,p$

holds with probability tending to one.

Incidentally, the exponent 2.99 is not essential (a general version is given in the proof), and a Hamming distance of about $p/2$ is what one achieves by simply guessing $\widehat{V}$ at random, with no information about $V$ at all. Hence, it is hopeless to draw any statistically valid conclusion based on $O(mp/\log^{2.99} p)$ bits of information in this distributed setting. In a nutshell, for our example a communication budget of $O(mp)$, up to a logarithmic factor, is both sufficient and necessary for recovering the true signal.
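As an illustration of the 2-bits-per-feature message described at the beginning of this section, here is a small sketch (ours; the encode/decode names are hypothetical): each model ships the median-thresholded $W_{ij}$ together with the sign bit of $\chi_{ij}$.

    import numpy as np

    def encode_message(W_i, chi_i):
        """Pack one model's summary into ceil(2p/8) bytes: a median-thresholded
        bit for each W_ij and a sign bit for each chi_ij (ties map to -1)."""
        w_bit = (W_i > np.median(W_i)).astype(np.uint8)
        c_bit = (chi_i > 0).astype(np.uint8)
        return np.packbits(np.concatenate([w_bit, c_bit]))

    def decode_message(msg, p):
        """Recover the quantized (W_i, chi_i) at the coordinating center."""
        bits = np.unpackbits(msg)[:2 * p]
        return bits[:p].astype(float), bits[p:].astype(int) * 2 - 1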
5 Simulations

In this section, we test the performance of knockoff aggregation under a range of designs with different sparsity levels and signal strengths. Recall that we have $m$ decentralized linear models $y_i = X_i \beta_i + z_i$ for $i = 1, \ldots, m$, where $X_i \in \mathbb{R}^{n_i \times p}$. Our setup is similar to [1]. First, the rows of $X_i$ are drawn independently from $\mathcal{N}(0, \Sigma)$, independently across sub-models. The columns of each $X_i$ are then normalized to have unit length. Second, given a sparsity level $k$, we randomly sample $k$ signal locations and set $\beta_{ij} = A$ for each selected index $j$ and all $i$, where $A$ is a fixed magnitude. Last, we fix the designs and repeat the experiment by drawing $y_i \overset{\text{iid}}{\sim} \mathcal{N}(X_i \beta, I)$. The nominal FDR level is set to $q = 0.2$.

Our first experiment tests the performance across different $m$ as the sparsity level or the signal strength varies. To save space, we only show results for models with independent features, i.e. diagonal $\Sigma$; similar patterns hold under correlated designs. We take $p = 1000$ and $n_i = 3000$ for each $i$, and fix $\Sigma = I$. In this scenario, we take $A = 1.2\sqrt{(2 \log p)/5} \approx 1.99$, where $\sqrt{(2 \log p)/m}$ is approximately the critical signal strength for $m = 5$ decentralized models and $1.2$ is a compensation factor for the information loss in our communication-efficient aggregation. Each experiment is repeated 30 times.

In knockoff aggregation, we are free to choose the summary function $\Gamma$ (in Algorithm 1) and the confidence function $\Omega$ (in Algorithm 2). The choice can be made adaptively to $m, n, p$ and the design structure. In the following simulations, we take $\Gamma(W_{1j}, \ldots, W_{mj}) = \sum_{i=1}^m n_i W_{ij}$ and $\Omega(x) = \mathbf{1}_{x \le c}$ for a fixed cutoff $0 < c < 1$.

Figure 1 shows the FDR and power achieved by knockoff aggregation with fixed signal strength $A = 1.99$ and varying sparsity levels $k = 10, 20, \ldots, 100$, as well as with fixed sparsity $k = 30$ and varying strengths $A = c_0\sqrt{(2 \log p)/5}$, with $c_0$ ranging from $1.0$ to $2.0$. Recall that the power is the fraction of identified true discoveries among all $k$ potential true discoveries. We see that our procedure effectively controls the FDR for different $m$ in both cases. Meanwhile, as $m$ increases, the power gains are significant even though we only need $\widetilde{O}(mp)$ bits of communication.

[Figure 1: Top row: mean FDP and power versus sparsity level $k$ with fixed strength $A = 1.99$. Bottom row: mean FDP and power versus amplitude level $A$ with fixed sparsity $k = 30$. Each panel compares $m = 1, 3, 5$. We use the i.i.d. design with $p = 1000$, $n_i = 3000$, and nominal level $q = 0.2$.]

We compare knockoff aggregation with other methods, namely least-squares (OLS) and the Lasso. For the OLS, we consider the following procedure. For each $i$, we have the OLS estimator $\widehat{\beta}_i$ based on $(X_i, y_i)$, which obeys $\widehat{\beta}_i \sim \mathcal{N}(\beta, \Theta_i)$ with $\Theta_i = ((X_i)^\top X_i)^{-1}$. Then $\widehat{\beta}_i$ and the corresponding marginal variances $(\Theta_{i,jj})_{1 \le j \le p}$ are aggregated from the $m$ nodes as follows. Let $\overline{\beta}_j = \sum_{i=1}^m \widehat{\beta}_{ij}/m$ be the averaged estimator of $\beta_j$ and $\overline{\Theta}_j = (1/m^2)\sum_{i=1}^m \Theta_{i,jj}$ be its variance. Set the $z$-score $Z_j = \overline{\beta}_j/\sqrt{\overline{\Theta}_j}$ for testing $H_{0,j}$; note that $Z_j \sim \mathcal{N}(0, 1)$ marginally when $\beta_{1j} = \cdots = \beta_{mj} = 0$. Ignoring the correlations, we apply the BHq procedure directly to the $p$-values derived from these $z$-scores. For convenience, we simply call this procedure OLS hereafter.

For the Lasso, we take the following approach. Given data $(X_i, y_i)$, we compute the Lasso estimates in parallel,

    $\widehat{\beta}_i^{\mathrm{Lasso}} = \operatorname*{argmin}_{b \in \mathbb{R}^p} \tfrac{1}{2}\|y_i - X_i b\|_2^2 + \lambda_i \|b\|_1$,

where $\lambda_i$ is the largest penalty such that the cross-validation error is within one standard error of the minimum. For each $i$, the support set of $\widehat{\beta}_i^{\mathrm{Lasso}}$ is sent to the center, and a majority vote determines whether to accept $H_{0,j}$ or not.

To compare these methods, we consider a correlated design as an illustration. Let $p = 500$, $n_i = 1500$, and $m = 5$. To generate the rows of $X_i$, we set $\Sigma_{jj} = 1$ and $\Sigma_{jj'} = -a/(a(p - 2) + 1)$ for $j \neq j'$, where $a$ is a small positive constant. Fix the sparsity level $k = 100$ and the signal strength $A = 5\sqrt{p}$. In this setting, the powers of all three procedures are essentially 1, but their FDP behavior, shown in Table 1, is very distinct.

                          Mean of FDP    SD of FDP
    Knockoff Aggregation  0.1774         0.0493
    OLS                   0.1433         0.1260
    Lasso                 0.3458         0.0405

    Table 1: FDP of knockoff aggregation, OLS, and Lasso.
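For reference, the OLS baseline described above admits a short sketch (ours, not the authors' code; names hypothetical), aggregating per-model estimates and applying BHq to the resulting two-sided $p$-values:

    import numpy as np
    from scipy.stats import norm

    def ols_meta_bhq(beta_hats, Theta_diags, q=0.2):
        """beta_hats, Theta_diags: (m, p) arrays of per-model OLS estimates
        and marginal variances. Returns the indices rejected by BHq."""
        m, p = beta_hats.shape
        beta_bar = beta_hats.mean(axis=0)
        var_bar = Theta_diags.sum(axis=0) / m**2      # variance of the average
        pvals = 2 * norm.sf(np.abs(beta_bar / np.sqrt(var_bar)))
        # Benjamini-Hochberg step-up at level q
        order = np.argsort(pvals)
        passed = np.flatnonzero(pvals[order] <= q * np.arange(1, p + 1) / p)
        if passed.size == 0:
            return np.array([], dtype=int)
        return order[:passed[-1] + 1]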
The Lasso with a cross-validated penalty lacks any guarantee of FDR control (see e.g. [24]). In the case of correlated designs, its empirical FDR of 0.3458 significantly exceeds the nominal level $q = 0.20$. Even though the one-standard-error rule selects a sparser model in cross-validation, the Lasso still tends to select more variables than necessary; in terms of controlling false discoveries, it does not give a satisfactory solution. In contrast, the mean FDP of knockoff aggregation and that of the OLS are both under the nominal level, though the former is slightly higher than the latter. More importantly, as shown in Figure 2, the FDPs of knockoff aggregation are tightly concentrated around the nominal level, while those of the OLS are widely spread: sometimes the proportion of false discoveries reaches as high as 70% for the OLS. Indeed, the estimated standard deviation of the knockoff FDP is 0.0493, while, in stark contrast, that of the OLS FDP is 0.1260, almost three times higher. Such high variability is undesirable in practice; researchers would not want to risk having 70% false discoveries in a study.

[Figure 2: Histograms of the FDPs of knockoff aggregation, OLS, and Lasso over 200 replicates, with $m = 5$, $p = 500$, and $n_i = 1500$; sparsity level $k = 100$ and signal strength $A = 5\sqrt{p}$.]

6 Discussion

We have introduced a communication-efficient method for aggregating the knockoff filter run on many decentralized linear models. This knockoff aggregation enjoys exact FDR control and desirable properties inherited from the knockoffs framework, and the simulation results provide evidence that the proposed method behaves well in a range of examples.

Many challenging problems remain, and we mention a few of them. An outstanding open problem is to generalize the knockoffs framework to the high-dimensional setting $p > n$, which would also allow knockoff aggregation to cover a broader range of applications. In addition, the flexibility in the choice of the link functions $\Omega$ and $\Gamma$ leaves room for further investigation: for example, does there exist an optimal $\Omega$ or $\Gamma$? Can these functions be chosen in a data-driven fashion? Last, it would be interesting to incorporate differential privacy (see e.g. [10]) in the aggregation stage, which could lead to much stronger protection of confidentiality.

Acknowledgments

W. S. was supported in part by a General Wang Yaowu Stanford Graduate Fellowship. W. S. would like to thank Emmanuel Candès for encouragement. We would like to thank John Duchi for helpful comments, and Becky Richardson for discussions about an early version of the manuscript.

References

[1] R. F. Barber and E. J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
[2] C. G. Begley and L. M. Ellis. Drug development: Raise standards for preclinical cancer research. Nature, 483(7391):531–533, 2012.
[3] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57(1):289–300, 1995.
[4] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188, 2001.
[5] M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E. J. Candès. SLOPE – adaptive variable selection via convex optimization. The Annals of Applied Statistics, 9(3):1103–1140, 2015.
[6] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
[7] J. Demšar.
Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
[8] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and Y. Zhang. Optimality guarantees for distributed statistical estimation. arXiv preprint arXiv:1405.0782, 2014.
[9] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. Roth. The reusable holdout: Preserving validity in adaptive data analysis. Science, 349(6248):636–638, 2015.
[10] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
[11] B. Efron. Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association, 99(465):96–104, 2004.
[12] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.
[13] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[14] D. P. Foster and R. A. Stine. Alpha-investing: A procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 70(2):429–444, 2008.
[15] L. P. Freedman, I. M. Cockburn, and T. S. Simcoe. The economics of reproducibility in preclinical research. PLoS Biology, 13(6):e1002165, 2015.
[16] C. Genovese and L. Wasserman. A stochastic process approach to false discovery control. The Annals of Statistics, 32(3):1035–1061, 2004.
[17] A. Gordon, G. Glazko, X. Qiu, and A. Yakovlev. Control of the mean number of false discoveries, Bonferroni and stability of multiple testing. The Annals of Applied Statistics, 1(1):179–190, 2007.
[18] G. Hommel and T. Hoffmann. Controlled uncertainty. In P. Bauer, G. Hommel, and E. Sonnemann, editors, Multiple Hypothesenprüfung / Multiple Hypotheses Testing, volume 70 of Medizinische Informatik und Statistik, pages 154–161. Springer Berlin Heidelberg, 1988.
[19] J. P. Ioannidis. Why most published research findings are false. PLoS Medicine, 2(8):e124, 2005.
[20] L. Janson and W. Su. Familywise error rate control via knockoffs. arXiv preprint arXiv:1505.06549, 2015.
[21] A. Javanmard and A. Montanari. On online control of false discovery rate. arXiv preprint arXiv:1502.06197, 2015.
[22] N. Kriegeskorte, W. K. Simmons, P. S. F. Bellgowan, and C. I. Baker. Circular analysis in systems neuroscience: The dangers of double dipping. Nature Neuroscience, 12(5):535–540, 2009.
[23] A. Li and R. F. Barber. Accumulation tests for FDR control in ordered hypothesis testing. arXiv preprint arXiv:1505.07352, 2015.
[24] W. Su, M. Bogdan, and E. J. Candès. False discoveries occur early on the Lasso path. arXiv preprint arXiv:1511.01957, 2015.
[25] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 67(2):301–320, 2005.

Technical Proofs

Proof of Lemma 3.2. Denote by $S$ the set of indices of all false null hypotheses, that is, $S = \{1 \le j \le p : \text{at least one } \beta_{ij} \neq 0\}$; similarly, $S^c$ corresponds to the true null hypotheses. For each $i$, conditional on $W_i := (W_{i1}, \ldots, W_{ip})$ and $\chi_{iS}$, it follows from Lemma 2.1 that $\chi_{iS^c}$ has i.i.d. components uniformly distributed on $\{-1, 1\}$. Since the $m$ linear models are independently generated, the concatenation of $\chi_{iS^c}$, $1 \le i \le m$, is uniformly distributed on $\{-1, 1\}^{m|S^c|}$ conditional on all $W_i$ and all $\chi_{iS}$.
The claim now follows by recognizing that $\overline{W}_j$ and $\overline{\chi}_j$ depend only on $W_{ij}, 1 \le i \le m$, and $\chi_{ij}, 1 \le i \le m$, respectively. The binomial distribution follows from the fact that $\overline{\chi}_j$ is simply the number of $1$'s among $\chi_{ij}, 1 \le i \le m$.

Proof of Lemma 3.4. We use some ideas from the proof of Lemma 4 in [1]. Given the filtration $\mathcal{F}_k$, we know all of $V^-(k), \ldots, V^-(p)$, since it always holds that $V^-(k') + V^+(k') = \#\{j \text{ null} : j \le k'\}$ for all $k' \ge k$. Without loss of generality, we assume all the hypotheses are true nulls, owing to the observation that $M(j)$ agrees with $M(j-1)$ whenever the $j$th hypothesis is a false null. Write $V^+(k) = \sum_{j=1}^k \Omega(P_j) = A$. By the exchangeability of $\Omega(P_1), \ldots, \Omega(P_k)$, we get

    $\mathbb{E}(V^+(k-1) \mid \mathcal{F}_k) = \mathbb{E}(V^+(k-1) \mid V^+(k) = A) = \frac{k-1}{k}\,A$.    (5)

Since all hypotheses are true nulls, $M(k-1) = V^+(k-1)/(k - V^+(k-1))$, and hence

    $\mathbb{E}(M(k-1) \mid \mathcal{F}_k) = \mathbb{E}(M(k-1) \mid V^+(k) = A) = \mathbb{E}\left[\frac{\sum_{j=1}^{k-1} \Omega(P_j)}{k - \sum_{j=1}^{k-1} \Omega(P_j)} \,\Big|\, V^+(k) = A\right]$.

To proceed, note that $x/(k - x)$ is convex for $x < k$. Since $\sum_{j=1}^{k-1} \Omega(P_j) \in [A - 1, A]$ almost surely, the inverse Jensen inequality (bounding a convex function on an interval by the chord through its endpoints) gives

    $\mathbb{E}\left[\frac{\sum_{j=1}^{k-1} \Omega(P_j)}{k - \sum_{j=1}^{k-1} \Omega(P_j)} \,\Big|\, V^+(k) = A\right] \le \frac{A - 1}{k - (A - 1)}\,\eta + \frac{A}{k - A}\,(1 - \eta)$,

where $\eta = A/k$ is determined by matching the conditional mean (5), i.e. $\eta(A - 1) + (1 - \eta)A = \frac{k-1}{k}A$. A simple calculation reveals that

    $\frac{A - 1}{k - (A - 1)} \cdot \frac{A}{k} + \frac{A}{k - A}\left(1 - \frac{A}{k}\right) = \frac{A}{1 + k - A} = M(k)$,

as desired.

Proof of Lemma 3.5. Write $\alpha = \mathbb{E}\,\Omega(U) \in (0, 1)$. Each $\Omega(U_j)$ obeys $0 \le \Omega(U_j) \le 1$ and $\mathbb{E}\,\Omega(U_j) = \alpha$. We assert that the left-hand side of (4) attains its maximum if each $\Omega(U_j)$ is replaced by an i.i.d. Bernoulli random variable $B(\alpha)$. (Note that $B(\alpha)$ assumes only the values 0 and 1, with $\mathbb{E}\,B(\alpha) = \alpha$.) To see this, we examine which distribution of $\Omega(U_1)$ gives the maximum conditional expectation while $\sum_{j=2}^N \Omega(U_j)$ is held fixed. Write

    $\frac{\sum_{j=1}^N \Omega(U_j)}{1 + \sum_{j=1}^N (1 - \Omega(U_j))} = \frac{\Omega(U_1) + \sum_{j=2}^N \Omega(U_j)}{N + 1 - \sum_{j=2}^N \Omega(U_j) - \Omega(U_1)}$.    (6)

For any constants $c_1 \ge 0$ and $c_2 > 1$, the function $(x + c_1)/(c_2 - x)$ is convex on $[0, 1]$, so

    $\mathbb{E}\left(\frac{\Omega(U_1) + c_1}{c_2 - \Omega(U_1)}\right) \le \mathbb{E}\left(\frac{B(\alpha) + c_1}{c_2 - B(\alpha)}\right)$.

Applying this to (6) coordinate by coordinate, we see that the left-hand side of (4) never decreases if each $\Omega(U_j)$ is replaced by an i.i.d. $B(\alpha)$. As a consequence, it remains only to show

    $\mathbb{E}\left(\frac{B(N, \alpha)}{1 + N - B(N, \alpha)}\right) \le \frac{\alpha}{1 - \alpha}$,

where $B(N, \alpha)$ denotes a binomial random variable; this has been established in the proof of Lemma 4 in [1]. This finishes the proof.
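As a quick numerical sanity check of (4) (ours, not from the paper), one can Monte Carlo the left-hand side for a concrete confidence function, say the hypothetical choice $\Omega(x) = 1 - x$, for which $\mathbb{E}\,\Omega(U) = 1/2$ and the bound equals 1:

    import numpy as np

    rng = np.random.default_rng(0)
    N, reps = 50, 200_000
    U = rng.random((reps, N))
    Om = 1.0 - U                                   # Omega(x) = 1 - x, mean 0.5
    lhs = (Om.sum(axis=1) / (1 + (1 - Om).sum(axis=1))).mean()
    print(f"LHS of (4): {lhs:.4f}  <=  bound: {0.5 / (1 - 0.5):.4f}")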
Proof of Proposition 4.1. We start with the observation that

    $\mathbb{P}(\chi_{ij} = 1 \mid \beta_j = \mu) = \frac{1}{2} + (1 + o(1))\sqrt{\log p/m}$.

Hence, $\overline{\chi}_j = m/2 + (1 + o_P(1))\sqrt{m \log p}$ if $\beta_j = \mu$, for any fixed $j$, while by the central limit theorem $\overline{\chi}_j = m/2 + O_P(\sqrt{m})$ if $\beta_j = 0$. Set the confidence function $\Omega(x) = \mathbf{1}_{x \le c_p}$ for some slowly vanishing sequence $c_p$. Then, since $\log p \to \infty$ as $p \to \infty$, we have

    $\frac{\#\{j : \beta_j = \mu,\ \Omega(P_j) = 1\}}{p} \to \frac{1}{2}, \qquad \frac{\#\{j : \beta_j = 0,\ \Omega(P_j) = 0\}}{p} \to \frac{1}{2}$,

and consequently

    $\frac{\sum_{j=1}^p \left(1 - \Omega(P_{\rho(j)})\right)}{\sum_{j=1}^p \Omega(P_{\rho(j)})} \to 1$

for any permutation $\rho(\cdot)$. Hence the rejection rule $\widehat{k}$ given by Algorithm 2 takes the value $p$ with probability approaching one, since the targeted upper bound obeys $q/\mathbb{E}\,\Omega(U) - q = q/c_p - q > 1$ as long as $c_p \ll q = q_p \to 0$. Recognizing that for almost all $j$ the weight obeys $\Omega(P_j) = 1$ if and only if $\beta_j \neq 0$, we conclude that $\widehat{V}^{\mathrm{KO}}$ disagrees with $V$ on only a vanishing fraction of indices.

Proof of Proposition 4.2. For the sake of generality, we replace the exponent $2.99$ by $3 - \epsilon_1$ and $\epsilon$ by $\epsilon_2$, for arbitrary constants $0 < \epsilon_1 < 1$ and $\epsilon_2 > 0$. The proof makes extensive use of Lemmas 2, 4, and 6 of [8]. Take $\delta = 1/2$ and $\sigma = 1/\mu$ in Lemma 6 of [8], and set $a = \sqrt{m}\,\log^{1 - \epsilon_1} p$. It suffices to show

    $I(V; M) = o(p)$.    (7)

Recognizing that $n_i \equiv 2p$, the pair $(a, \delta)$ satisfies Eqn. (18) of [8] for sufficiently large $m$. Then, combined with Lemma 4 there, Eqn. (19b) gives

    $I(V; M) \le \sum_{i=1}^m I(V; M_i) \le \frac{\delta^2 a^2}{\sigma^2 m} \sum_{i=1}^m H(M_i) + pm\,h(p^\star) + pm\,p^\star \le \underbrace{\frac{32\,B \log^{3 - 2\epsilon_1} p}{m}}_{I_1} + \underbrace{pm\,h(q^\star)}_{I_2} + \underbrace{pm\,q^\star}_{I_3}$,

where the last inequality follows from Shannon's source coding theorem [6], which yields $\sum_i H(M_i) \le B$, and from bounding $p^\star$ by $q^\star$ defined below. This inequality shows that (7) is a simple consequence of

    $I_l = o(p)$    (8)

for all $l = 1, 2, 3$. Now we prove (8). For $l = 1$, we have

    $I_1 = \frac{32\,B \log^{3 - 2\epsilon_1} p}{m} = O\!\left(\frac{mp}{\log^{3 - \epsilon_1} p} \cdot \frac{\log^{3 - 2\epsilon_1} p}{m}\right) = O\!\left(\frac{p}{\log^{\epsilon_1} p}\right) = o(p)$.

For $l = 2, 3$, note that

    $q^\star = \min\left\{2\,\mathrm{e}^{-(a - \sigma)^2/(2\sigma^2)},\ 1\right\}$.

Since $a \gg \sigma$, the exponent obeys $(a - \sigma)^2/(2\sigma^2) \asymp a^2\mu^2 \asymp \log^{3 - 2\epsilon_1} p$. As a consequence,

    $q^\star = \exp\!\left(-\Theta\!\left(\log^{3 - 2\epsilon_1} p\right)\right) = o(1)$.

Thus $I_3 = o(I_2)$, and $I_2$ further obeys

    $I_2 \asymp -pm\,q^\star \log q^\star \asymp pm \log^{3 - 2\epsilon_1} p\,\exp\!\left(-\Theta\!\left(\log^{3 - 2\epsilon_1} p\right)\right) = o(p)$,

which makes use of $m = O(\mathrm{poly}(p))$ and $3 - 2\epsilon_1 > 1$. This proves (7).

Next, we finish the proof by resorting to Lemma 2 of [8]. To this end, set $t = (0.5 - \epsilon_2)p$ (if $\epsilon_2 \ge 0.5$ there is nothing to prove). Then

    $\mathbb{P}\!\left(\mathrm{Hamm}(\widehat{V}, V) > (0.5 - \epsilon_2)p\right) \ge 1 - \frac{I(V; M) + \log 2}{\log \frac{2^p}{N_t}}$,    (9)

where, in our setting, $N_t = \#\{v \in \{0, 1\}^p : \|v\|_1 \le (0.5 - \epsilon_2)p\}$. By large deviations theory, we get

    $\log \frac{2^p}{N_t} \sim \left(\log 2 + (0.5 - \epsilon_2)\log(0.5 - \epsilon_2) + (0.5 + \epsilon_2)\log(0.5 + \epsilon_2)\right) p \asymp p$.    (10)

Substituting (7) and (10) into (9), we obtain

    $\mathbb{P}\!\left(\mathrm{Hamm}(\widehat{V}, V) > (0.5 - \epsilon_2)p\right) \ge 1 - o(1)$,

which completes the proof.