A stable and adaptive polygenic signal detection method based on repeated sample splitting
aa r X i v : . [ s t a t . M E ] A ug A stable and adaptive polygenic signal detection methodbased on repeated sample splitting
Yanyan Zhao and Lei Sun Department of Statistical Sciences, University of Toronto, 100 St. GeorgeStreet, Toronto, Ontario M5S 3G3, Canada,[email protected] Division of Biostatistics, Dalla Lana School of Public Health, University ofToronto, 155 College Street, Toronto, Ontario M5T 3M7, Canada,[email protected] 7, 2020 bstract Using polygenic risk score for trait association analyses and disease prediction are paramount forgenetic studies of complex traits. Valid inference relies on sample splitting, or more recently ex-ternal data, to obtain a set of potentially associated genetic variants, along with their weights, forpolygenic risk score construction. The use of external data has been popular, but recent work in-creasingly calls its use into question due to adverse effects of potential data heterogeneity betweendifferent samples. Our study here adheres to the original sampling-splitting principle but does so,repeatedly, to increase stability of our inference. To accommodate different polygenic structures,we develop an adaptive test for generalized linear models. We provide the asymptotic null dis-tributions of the proposed test for both fixed and diverging number of variants. We also showthe asymptotic properties of the proposed test under local alternatives, providing insights on whypower gain attributed to variable selection and weighting can compensate for efficiency loss due tosample splitting. We support our analytical findings through extensive simulation studies and anapplication.
Keywords : Adaptive; Polygenic risk score; Sample splitting; Signal detection; Stability.2
Introduction
The polygenic risk score concept has recently been successfully applied to study many complexand heritable traits, by aggregating weak genetic effects across a large number of variants thatdo not, individually, achieve statistical significance (Purcell et al., 2009; Fritsche et al., 2018). Apolygenic study has two stages, where stage one applies a variable selection procedure to a trainingsample to obtain a set of potentially associated variants and their corresponding weights. Usingan independent testing sample, stage two first constructs a polygenic risk score for each individualby calculating the weighted sum of the risk alleles across the selected variants, and stage two thenstudies the aggregated score for its association with the trait of interest. A significant associationimplies that genetic signals are present among the selected variants, and the corresponding poly-genic score can be used for disease prediction. The use of external data for variable selection andweight estimation has been popular, but recent work has shown that this approach can be extremelysensitive to sample mis-match beyond population stratification (Mostafavi et al., 2020). Thus, ourwork here focuses on methods that do not rely on the use of external data.Focusing on association studies, the original polygenic method of Purcell et al. (2009) hasbeen improved (Dudbridge, 2013; Zhou et al., 2013; Vilhjalmsson et al., 2015; Wu et al., 2019).For example, Wu et al. (2019) proposed an adaptive method, where the test statistic is based ondifferent functions of variant-specific score statistics, derived from the whole sample across all variants, where different functions target different alternatives. However, accurate inference of themethod of Wu et al. (2019) requires simulation or bootstrap, and the lack of variable selection canadversely affect power despite its use of the whole sample for association testing.Sample splitting allows variable selection while preserving validity of an inference, and thisprinciple has been used in many other study settings. However, variability inherent in one-timeonly sample splitting has been noted, including in variable selection (Wasserman and Roedern,2009; Meinshausen et al., 2009), estimation (Kravitz et al., 2018), and more recently selectiveinference (Rinaldo et al., 2019; Romano and DiCiccio, 2019; Barber and Candes, 2019). Repeatedsample splitting is a natural remedy, but it is not obvious how to aggregate information acrossmultiple correlated sample splits to derive a valid and efficient test.In this paper, we develop an adaptive and robust polygenic association test for testing highdimensional regression coefficients for generalized linear models. In section 2, we first review theclassical polygenic association test, based on weighted sum of the risk alleles, and note its equiva-lency with weighted sum of variant-specific score statistics. We then consider different weightingfactors for the score vector, where different weights are tailored to different alternatives while stillleveraging information from the training sample. To aggregate information across different weight-3ng factors, we use the recent Cauchy method of Liu and Xie (2020), and we discuss the connectionof our adaptive method with that of Wu et al. (2019). To improve stability of our inference, we thenconsider repeated sample splitting and utilize the Cauchy method again to aggregate informationacross multiple sample splits without explicitly modelling the inherent correlation. Finally, wederive the asymptotic null distributions of the proposed test for both fixed and diverging numberof variants, and we study its asymptotic properties under local alternatives. In section 3 we presentextensive simulation results for method evaluation and comparison, and in section 4 we provide anapplication. We conclude with discussion in section 5.
Let Y ∈ R n × be the outcome variable of interest, G ∈ R n × J the genotype matrix, and X ∈ R n × q the covariate matrix for a sample of size n with J genetic variants and q covariates. For clarity,let y i be the response for individual i , g i j the genotype for individual i and variant j , and x i j ′ the covariate value for individual i and covariate j ′ , i = , . . . , n , j = , . . . , J , and j ′ = , . . . , q .Further, let G i ∈ R J × be the genotype vector for individual i , G j ∈ R n × the genotype vector forvariant j , X i ∈ R q × the covariate vector for individual i , and X j ′ ∈ R n × the vector for covariate j ′ . We assume that conditional on ( G i , X i ) , y i follows a distribution with density function, f ( y i ) = exp { ( y i θ i − b ( θ i )) / a i ( φ ) + c ( y i , φ ) } for some specific functions a ( · ) , b ( · ) and c ( · ) , where θ isthe canonical parameter, φ the dispersion parameter, var ( y i | G i , X i ) = a i ( φ ) ν ( µ i ) , and ν ( µ i ) thevariance function (MacCullagh and Nelder, 1989). We consider the generalized linear model thatmodels µ i = b ′ ( θ i ) = E ( y i | G i , X i ) for different types of response variables in the exponential familyby a monotone and differentiable link function G ( · ) , G ( µ i ) = G Ti β + X Ti β x , where β and β x are, respectively, J - and q -dimensional vectors of regression coefficients; q isfixed but J may vary depending on the study setting. Among the J genetic variables, we use M ∗ and | M ∗ | to denote, respectively, the set and number of truly associated ones. For simplicity butwithout loss of generality, we also assume that G i and X i have been mean centred at zero andstandardized to have variance one. 4 .2 The classical polygenic risk score for association tests Suppose we have 2 n independent observations, the classical polygenic risk score-based associationanalysis (Purcell et al., 2009) first randomly splits the sample into two equal subsets, D n , and D n , ;the corresponding data and parameter estimates such as y , X , G and ˆ β will carry superscripts ( ) and ( ) , respectively for the two subsets, unless specified otherwise. A variable selection procedureis then applied to the training sample D n , to select a subset of candidate genetic variables, M ,where we define J = | M | . In the testing sample D n , , a polygenic risk score G ∗ i is constructed byaggregating the J selected variables using G ( ) i , but weighted by the effect estimates ˆ β ( ) j obtainedfrom D n , , j ∈ M . That is, G ∗ i = ∑ J j = ˆ β ( ) j g ( ) i j ( i = , . . . , n ) . The inference is then based on thegeneralized linear regression model applied to D n , , G { E ( y ( ) i | G ∗ i , X ( ) i ) } = G ∗ i β ∗ + X ( ) Ti β x , (1)and testing H : β ∗ = H : β ∗ = . (2)The corresponding score statistic is, T = ∑ ni = ( y ( ) i − ˆ µ ( ) i ) G ∗ i , where ˆ µ ( ) i = G − ( X ( ) Ti ˆ β x ) andˆ β x is the maximum likelihood estimate of β x under H . The distribution of standardized T can beapproximated by χ , and the p-value of a test based on T will be denoted as p .This classical polygenic association testing has since been improved on several fronts, includ-ing modelling dependency structure (i.e. linkage disequilibrium) between genetic variables (Vilh-jalmsson et al., 2015) and better estimation of β ( ) j (Shi et al., 2016), among others (Lloyd-Joneset al., 2019). However, additional work are needed. To facilitate our discussion, it is instructive tore-formulate T as the following, T = n ∑ i = ( y ( ) i − ˆ µ ( ) i ) G ∗ i = n ∑ i = ( y ( ) i − ˆ µ ( ) i ) J ∑ j = ˆ β ( ) j g ( ) i j = J ∑ j = ˆ β ( ) j n ∑ i = ( y ( ) i − ˆ µ ( ) i ) g ( ) i j = n J ∑ j = ˆ β ( ) j S j , where S j = n − ∑ ni = ( y ( ) i − ˆ µ ( ) i ) g ( ) i j . Thus, T constructed based on the aggregated risk score G ∗ i is analytically equivalent to a linear weighted average of the score statistics, S j ’s, across the J selected genetic variants.Tests based on T are sub-optimal when signs of ˆ β ( ) j and S j differ. For example, recent workin association tests for rare variants have shown that T type of tests are only powerful when a5arge proportion of the variants being tested are causal and their genetic effects are in the samedirection (Derkach et al., 2014). Further, the direct use of ˆ β ( ) j ’s as weights may not be robust todifferent alternatives. Finally, when the signal-to-noise ratio is low as often the case in practice, theone-time only sample splitting approach may not be reliable (Meinshausen et al., 2009). Figure 1is an illustration of the p-value lottery phenomenon associated with T , when it is applied to a realdataset with 2 n = J = p−values of T and T dc F r equen cy p−values of T dc F r equen cy Figure 1: Histograms of p-values of T (white) and the proposed T dc (blue) based on randomlysplitting a real dataset to training and testing samples, independently 100 times. The right figure isa zoom-in plot for the proposed T dc . Here we develop a robust method that is adaptive to different alternatives. We first propose newtests by considering different weighting schemes, given a particular sample split. We then improvethe stability of our inference through repeated sample splitting.Recall that testing (2) in (1) can be reformulated as testing H : β = H : β = , (3)in G { E ( y ( ) i | G ( ) i , X ( ) i ) } = G ( ) Ti β + X ( ) Ti β x , (4)6here β = ( β , . . . , β J ) T . The proposed new test statistics have the following form, T γ = n J ∑ j = w γ − j S j , γ ∈ Γ = { , , , . . . } , (5)where w j depends on ˆ β ( ) j obtained from D n , , and γ is an even integer to avoid signal cancellationbetween variants.Let S = ( S , . . . , S J ) T and R = diag { r j } = diag { w γ − j } , j = , . . . , J , we have T γ = nS T RS . (6)We can easily modify R to include off-diagonal elements to reflect potential linkage disequilibriumbetween genetic variables, and we will study the asymptotic null distribution of T γ in Theorems1–3 for both fixed and diverging J .The different γ values in (5) adapt to different signal sparsities. To obtain an accurate yetcomputationally efficient adaptive test, we propose to aggregate p γ ’s, the p-values of T γ , γ ∈ Γ = { , , , . . . } , and p , the p-value of T , using the Cauchy combination method recently proposed byLiu and Xie (2020). The Cauchy method can accommodate complex dependency structure amongp-values without explicitly modelling it. In our setting, the proposed test statistics is T c = ( | Γ | + ) − ∑ γ ∈ Γ ∪ tan { ( . − p γ ) π } . (7)The tail of the null distribution of T c can be well approximated by the standard Cauchy distribution,as long as the individual p γ ’s are accurate, which we study in sections 2.4 and 3. The final p-valueof T c is p c = / − ( arctan t c ) / π , where t c is the observed value for T c .Here we acknowledge that T γ is related to SPU type of test statistics proposed by Xu et al.(2016, 2017); Wu et al. (2019). For an integer γ ≥ T spu ( γ ) = ∑ Jj = S γ j , where S j is obtained fromthe whole sample. If we omit the sample splitting step in our approach, J = J , and let w j = S j , wehave T γ ∝ T spu ( γ ) for all γ >
1. The authors of T spu ( γ ) have noted that for an even integer γ → ∞ , T spu ( γ ) ∝ ( ∑ j | S j | γ ) / γ → max j | S j | , defined as T spu ( ∞ ) . This suggests that larger γ is more powerfulfor sparse alternatives and smaller γ is the opposite. To make the SPU test robust to differentalternatives, the authors then proposed an adaptive SPU test based on T aSPU = min γ ∈ Γ aSPU { p spu ( γ ) } , where the recommended Γ aSPU = { , , , , , , ∞ } , and p spu ( γ ) is the p-value associated with each T spu ( γ ) . The asymptotic p spu ( γ ) for γ = S j ’s and their correlation structure (Xu et al., 2017; Wu et al., 2019). However, the7symptotic p spu ( γ ) for γ > p spu ( γ ) , and subsequently p aSPU , based on bootstrap or simulation, which iscomputational expensive.The distinction between T aSPU and the proposed T c is four-fold. Firstly, the building block of T c is T γ , where γ is an even integer, which facilitates studies of asymptotic properties of the proposedtests; see Theorems 1–4 for details. Secondly, tests using different γ values are correlated witheach other. Thus, even if the individual p-value estimation is accurate, the minimum-p approach of T aSPU makes the inference more difficult than that of T c , which is based on the easy-to-implementCauchy method. Thirdly, although T aSPU uses the whole sample for association testing, it aggre-gates information across all J genetic variants, many of which may be from the null leading toreduced power as compared to T c , which benefits from variable selection. Lastly, the flexible struc-ture of w j in T γ allows incorporation of additional information available for each genetic variant,such as its functional importance (Ionita-Laza et al., 2016).To further robustify T c against sampling variation inherent in the one-time only sample split-ting approach, we then consider repeated sample splitting of m times. For the s th sample split, s = , . . . , m , we obtain T c , s and its corresponding p-value, p c , s . To combine the p c , s ’s while notexplicitly modelling the correlation, we again utilize the Cauchy method of Liu and Xie (2020).The proposed double Cauchy combination test statistic is T dc = m − m ∑ s = tan { ( . − p c , s ) π } . (8)Similar to inference based on T c , the tail of the null distribution of T dc can be well approximatedby the standard Cauchy distribution, as long as the individual p-values to be combined are accuratewhich we study next. T γ To make the dependency of T γ on n and J explicit, we use T n , J , γ to denote T γ in this section.We study the asymptotic properties of T n , J , γ for both fixed and diverging J , under the null orlocal alternatives. For notation simplicity, we now omit superscript ( ) from Y ∈ R n × , G ∈ R n × J and X ∈ R n × q , representing, respectively, the outcome, genotype and covariate data in the testingsample D n , , where J is the number of SNPs to be tested. Recall that T n , J , γ = nS T RS , where S = ( S , . . . , S J ) T is the score vector, R = diag { r j } , and γ is an even integer. The covariancematrix of n / S is Σ s = E { a i ( φ ) ν ( µ i ) G i G Ti } , where G i ∈ R J × , the genotype vector for individual i , ε = ( ε , . . . , ε n ) T = Y − G − ( G β + X β x ) , and ε = ( ε , . . . , ε n ) T = Y − G − ( X β x ) .8he following theorem gives the asymptotic null distribution of T n , J , γ , provided that the sameregularity conditions, required for the convergence of S to a multivariate normal random variable,hold (Goeman et al., 2011; Vandekar et al., 2019). In addition, we ignore the nuisance parameters a i ( φ ) and β x for now and discuss how to include them in section 5. We provide all proofs in theSupplementary Material. Theorem 1.
Under the null hypothesis H in (3), for any fixed finite J and γ ,T n , J , γ → T J , γ in distribution as n → ∞ , where T J , γ and ∑ J j = λ J , j χ j are equivalent in distribution, χ j ’s areindependent variables with central chi-square distribution with 1 degrees of freedom, χ , λ J , ≥ . . . , ≥ λ J , J are the eigenvalues of C Ts RC s , and Σ s = C s C Ts . When Y is normally distributed, T n , J , γ and T J , γ equivalent in distribution always holds for any n (and finite J ); when both n and J are diverging, additional assumptions are required. Assumption 1.
Assume G i = C g Z i , ∀ i, where C g is a J × J matrix and C g C Tg = Σ g , and Z i =( z i , . . . , z iJ ) T with E ( Z i ) = and cov ( Z i ) = I J . Assume z i j has finite eighth moment and E ( z i j ) = + ∆ < ∞ , ∀ j, where ∆ is a constant and ∆ > − , and E ( Π j z ν j i j ) = Π j E ( z ν j i j ) , where ∑ j ν j ≤ and all ν j ’s are non-negative integers. Assumption 2.
Let f g be the probability density of G and D ( f g ) be its support. Assume E ( ε | G ) = and E ( ε | G ) = , and there are positive constants K and K such that E ( ε | G ) > K andE ( ε | G ) < K almost everywhere for g ∈ D ( f g ) . Assumption 3.
There exist real numbers ρ ∞ , j ’s such that lim J → ∞ ρ J , j = ρ ∞ , j uniformly ∀ j, and lim J → ∞ ∑ J j = ρ J , j = ∑ ∞ j = ρ ∞ , j < ∞ , where ρ J , j = λ J , j / p tr ( R Σ s ) , j = , . . . , J , which are theeigenvalues of C Ts RC s / p tr ( R Σ s ) in descending order. Assumption 4. n { tr ( R Σ g ) / tr ( R Σ g ) } → ∞ as n and J → ∞ . Assumptions 1–3 are standard in studying high-dimensional testing (Guo and Chen, 2016;Zhang et al., 2019). Assumption 4 specifies a relationship between n and J . Because tr ( R Σ g ) = ∑ J j = λ J , j and tr ( R Σ g ) = ∑ J j = λ J , j , we have n { tr ( R Σ g ) / tr ( R Σ g ) } = n { ∑ J j = λ J , j / ( ∑ J j = λ J , j ) } .Thus, Assumption 4 holds for any diverging n and J if λ J , j ’s are dominated by first few largerones. When all λ J , j ’s are similar in magnitude, Assumption 4 is equivalent to requiring samplesize n grows to infinity at a rate faster than J . The following theorem generalizes Theorem 1 fromfinite to infinite J . 9 heorem 2. Under the null hypothesis H in (3) and assume Assumptions 1–4 hold, σ − n , { T n , J , γ − tr ( R Σ s ) } → ζ and { tr ( R Σ s ) } − / { T J , γ − tr ( R Σ s ) } → ζ in distribution as n and J → ∞ , where ζ and ∑ ∞ j = ρ ∞ , j ( χ j − ) / √ are equivalent in distribu-tion, σ n , = tr ( R Σ s ) + δ , and δ = n − n ∑ J j = ∑ J k = r j r k E ( g i j g ik ε i ) − tr ( R Σ s ) − tr ( R Σ s ) o = o { tr ( R Σ s ) } . Therefore, as n and J → ∞ ,sup x | pr ( T n , J , γ ≤ x ) − pr ( T J , γ ≤ x ) | → . Theorems 1 and 2 show that we can use ∑ J j = λ J , j χ j to approximate the asymptotic nulldistribution of T n , J , γ for both fixed and diverging J . The corresponding p-value can be calculatedusing the method of Imhof (1961) or Davies (1980); we use the method of Davies (1980) in ourimplementation below for both simulation and application studies.To show the asymptotic normality of T n , J , γ under the null, we need to impose the followingassumption, which substitutes for specifying an explicit relationship between J and n . Assumption 5. tr ( R Σ g ) / tr ( R Σ g ) → ∞ and tr ( R Σ g ) → ∞ as n and J → ∞ . Theorem 3.
Under the null hypothesis H in (3) and assume Assumptions 1–5 hold, σ − n , { T n , J , γ − tr ( R Σ s ) } → N ( , ) , in distribution as n and J → ∞ . We now study the adverse effect of sample splitting on power, while considering the beneficialeffect of variable selection afforded by sample splitting, under the local alternative L β , L β = n ∆ T β R Σ g R ∆ β = o { n − tr ( R Σ g ) } and { G − ( G Ti β ) } = O ( ) o , where ∆ β = E { G − ( G Ti β ) G i } . Theorem 4.
Under the local alternative L β and assume Assumptions 1–5 hold, σ − n , { T n , J , γ − tr ( R Σ s ) − µ n , β } → N ( , ) , in distribution as n and J → ∞ , where µ n , β = tr ( R Ξ β ) + ( n − ) ∆ T β R ∆ β , σ n , = { tr ( R Σ s + R Ξ β ) }{ + o ( ) } , and Ξ β = E (cid:2) { G − ( G Ti β ) } G i G Ti (cid:3) . T n , J , γ under L β is determined by SNR n ( β ) = µ n , β / σ n , ,where SNR n ( β ) can be interpreted as signal-to-noise ratio following Guo and Chen (2016). Asdetailed in the Supplementary Material, µ n , β = J ∑ j = r j E { G − ( G Ti β ) g i j } + ( n − ) J ∑ j = r j E { G − ( G Ti β ) g i j } , σ n , = { σ n , + tr ( R Ξ β ) + tr ( R Σ s R Ξ β ) } { + o ( ) } , where σ n , = ∑ J j = ∑ J k = r j r k E ( g i j g ik ε i ) { + o ( ) } , tr ( R Ξ β ) = ∑ J j = ∑ J k = r j r k E (cid:2) g i j g ik { G − ( G Ti β ) } (cid:3) ,and tr ( R Σ R Ξ β ) = ∑ J j = ∑ J k = r j r k E ( g i j g ik ε i ) E (cid:2) g i j g ik { G − ( G Ti β ) } (cid:3) .Now define T n , J = n ∑ Jj = S j as the test statistic calculated based on the whole sample of size2 n but without variable selection and assuming R = I . In this case, G j ∈ R n × and β ∈ R J × forcalculating S j . The signal-to-noise ratio corresponding to T n , J is SNR n ( β ) = µ n , β / σ n , . µ n , β = J ∑ j = E { G − ( G Ti β ) g i j } + ( n − ) J ∑ j = E { G − ( G Ti β ) g i j } , σ n , = { σ n , + tr ( Ξ β , J ) + tr ( Σ s , J Ξ β , J ) } { + o ( ) } , where σ n , = ∑ Jj = ∑ Jk = E ( g i j g ik ε i ) { + o ( ) } , tr ( Ξ β , J ) = ∑ Jj = ∑ Jk = E (cid:2) g i j g ik { G − ( G Ti β ) } (cid:3) ,and tr ( Σ s , J Ξ β , J ) = ∑ Jj = ∑ Jk = E ( g i j g ik ε i ) E (cid:2) g i j g ik { G − ( G Ti β ) } (cid:3) .To provide additional insights, assume pr ( M ⊃ M ∗ ) → n → ∞ ; this assumption can befulfilled by various variable selection algorithms (Fan and Lv, 2008; Li et al., 2012; Jin et al.,2014; Zhang, 2017). Comparing µ n , β with µ n , β , it is not surprising that sample size reductionis the primary cause of power loss for a sample splitting-based method. On the other hand, thecharacterizations of σ n , and σ n , indicate noise-filtering in the training sample D n , can reducevariance of the test statistic calculated in the testing sample D n , , leading to increased power forthe sample splitting-based method. The use of weights derived from D n , can further compensateefficient loss due to reduced sample size in D n , . Thus, SNR n ( β ) can be larger than SNR n ( β ) , andtests based on T n , J , γ can be more powerful than T n , J . Simulation studies in the next section showthat, even if sure-screening fails in D n , , the sample splitting approach can have comparable powerwith the methods of Wu et al. (2019); Guo and Chen (2016) applied to the full-sample withoutvariable selection. 11 Simulation studies
To evaluate the performance of T dc and compare it to tests proposed by Guo and Chen (2016) andWu et al. (2019), we consider two simulation designs. Design one simulates G , where we canexplicitly specify and study effects of different model parameters, while design two builds upona set of real genotypes. Design one considers sample size of 2 n =
200 or 1500 and dimension J ∈ { , , , , , } . It generates G based on a multivariate normal distribution withmean vector 0 and correlation matrix Σ g = { ρ | i − j | } J × J , where ρ = . , . , .
8, and i and j = , . . . , J . That is, we use the autoregressive model for correlation between the J variants. Fordesign two, G comes from an application sample of 2 n = J = T dc , we let r j = ˆ β γ − j ( j = , . . . , J ) and Γ = { , , , } to first obtain T c of (7),we then use m =
10 or 50 to derive the more stable T dc of (8). For completeness, we also studythe performance of the individual T γ ’s but present the results in the Supplementary Material. Thenumbers of simulation replicates are 10 for evaluating type I error control and 500 for power, andadditional simulation design details are provided below when appropriate. Methods applied to binary outcomes often have worse performance than normally distributed traits.Thus, we generate Y based on a logistic regression with β =
0, and without loss of generality,intercept equals to one and with no other covariates. For evaluation of type I error control, thevariable selection procedure is not critical as long as it is applied to a sample that is independentof the testing sample as the case here, and the size of J is also not critical. Thus, we choose J = J and regress the simulated Y on each of the J simulated variants in D n , , and we obtain theircorresponding ˆ β j values for the polygenic association testing in D n , .Table 1 shows the empirical test sizes of T dc for 2 n = m =
10, and nominal α values of0.05, 0.01, and 10 − , and Supplementary Material provides results for more stringent α values and m =
50; results for 2 n = T γ ’s areapproximated by the weighted linear combination of independent χ distributions as specified inTheorem 2, and the distributions of T c and T dc by the standard Cauchy distribution. It is clear thattype I error of T dc is controlled at or below the nominal levels, across the settings.12able 1: Empirical test sizes of T dc . Autoregressive model AR(1, ρ ) for correlation between the J variants, sample size 2 n = m =
10 for repeated sample splitting J
10 50 200 400 1000 ρ α
5% 1% 0.1% 5% 1% 0.1% 5% 1% 0.1% 5% 1% 0.1% 5% 1% 0.1%0.2 5.285 0.981 0.105 5.330 0.938 0.071 5.341 0.928 0.072 5.403 0.943 0.074 5.434 0.941 0.0810.5 5.265 0.961 0.095 5.340 0.936 0.069 5.372 0.941 0.071 5.369 0.938 0.072 5.409 0.941 0.0760.8 5.347 0.983 0.084 5.330 0.942 0.070 5.343 0.937 0.068 5.338 0.940 0.073 5.326 0.923 0.076
Results in the Supplementary Material show that the normal approximation given in Theorem 3,however, is not adequate for T γ when J ≤ ∑ J j = λ J , j χ j approximation in practice. Consistent with previous reports, the empirical type I error rates of themethods of Wu et al. (2019) and Guo and Chen (2016), based on asymptotic approximations, areinflated (Supplementary Material). Thus, we use permutation with 10 replicates to evaluate powerof these methods, as recommended by the authors. Similar to the type I error evaluation above, here we also focus on the more difficult case of analyz-ing binary outcomes than normally distributed traits. We generate Y based on logistic models withdifferent proportions of nonzero regression coefficients, varying from 0 . J variants. We assume the indices of the nonzero β j ’s to be uniformly distributed in { , . . . , J } ,and we also randomly specify the nonzero β j ’s to be half positive and half negative. Results be-low focus on power comparison between the proposed T dc test and the methods of Guo and Chen(2016) and Wu et al. (2019), which are applied to the whole sample and without variable selection.Results of the original polygenic risk score test, T , are shown in the Supplementary Material.To better delineate the factors influencing power, we consider three study scenarios. In allthree scenarios, the weights inferred from the training sample D n , are used to construct the T dc test statistic using the testing sample D n , .(I), Oracle: M = M ∗ . This is the ‘best’ case scenario for T dc , where the selection step appliedto D n , identifies all and only truly associated variants; the estimated weights however may not beoptimal. This study is to show power gain of T dc , despite the reduction of sample size, as comparedto the methods without variable selection.(II), J = J : M = { G , . . . , G J } . This is the ‘worst’ case scenario for T dc , where the selectionstep fails completely at filtering out non-signals; the estimated weights however may be infor-13ative. This scenario is tailored for studying power loss of T dc due to sample size reduction ascompared with methods without sample splitting, while also demonstrating the benefits of lever-aging the weights inferred from D n , to perform associate analysis using D n , .(III), Variable Selection: M is estimated based on the variable selection method proposed inLi et al. (2012). This study investigates the impact of accuracy of variable selection on power of T dc as compared to the methods without variable selection.Figure 2 shows the results for 2 n = J = ρ = . J variants, and m =
10 forrepeated 50%-50% sample splitting to construct T dc . Supplementary Material provides additionalsimulation results.For scenario (I), Fig. 2 (the first column) shows that the proposed T dc test has substantial powergain, attributed to noise filtering despite the reduction in sample size for associate testing, as com-pared to the methods of Guo and Chen (2016) and Wu et al. (2019).For scenario (II), Fig. 2 (the second column) shows that the anticipated power loss of T dc dueto sample splitting can be compensated by leveraging the weights inferred from D n , , as comparedwith the methods of Guo and Chen (2016) and Wu et al. (2019), which use the full sample. Forthe sparse alternative case, scenario (II) 4 signals, T dc displays comparable power with the methodof Wu et al. (2019), while both are substantially more powerful than the method of Guo and Chen(2016). For the other alternatives considered in this scenario, all three methods have comparablepower with the method of Guo and Chen (2016) having slightly higher power. Overall, the pro-posed T dc test is most robust to the different alternatives considered here, and it is computationallymost efficient.For the more realistic scenario (III), interestingly the results are similar to those of scenario(II). This suggests that while the variable selection step filters out noise, it also filters out some(weak) signals. The implementation of the method of Li et al. (2012) requires the specification of J . The results in Fig. 2 are for J = T dc is not extremely sensitive to theproportion. However, the 33%-67% sample splitting has slightly increased power for the scenar-ios considered here. This is consistent with the literature (Barber and Candes, 2019; Dudbridge,2013), where it has been noted that uneven sample splitting, with more subjects assigned to the14 .000.250.500.751.00 0.05 0.10 0.15 0.20 0.25 0.30 | b | P o w e r (I) Oracle, | b | P o w e r (II) J = J, | b | P o w e r (III) Variable
Selection, | b | P o w e r (I) Oracle,
40 Signals | b | P o w e r (II) J = J,
40 Signals | b | P o w e r (III) Variable
Selection,
40 Signals | b | P o w e r (I) Oracle,
200 Signals | b | P o w e r (II) J = J,
200 Signals | b | P o w e r (III) Variable
Selection,
200 Signals | b | P o w e r (I) Oracle,
400 Signals | b | P o w e r (II) J = J,
400 Signals | b | P o w e r (III) Variable
Selection,
400 Signals
Figure 2: Power comparison of the proposed test T dc (red triangle), the method of Guo and Chen(2016) (blue square), and the method of Wu et al. (2019) (black circle), for the three study scenarios(I), (II) and (III). Sample size 2 n = J = We apply the proposed T dc test and the methods of Guo and Chen (2016) and Wu et al. (2019),as well as T , the original polygenic risk score test, to the cystic fibrosis data introduced in Soaveet al. (2015). This dataset consists of 2 n = J = T dc test, we first randomly divide the 1409 individuals into twosubsets, D n and D n , where n =
409 and n = n =
705 and n = n = n = r j = ˆ β γ − j ( j = , . . . , J ) and Γ = { , , , } ,and we apply the variable selection method of Li et al. (2012) and let J = n , the sample sizeof the testing sample. Because the approximation of the asymptotic distribution of T dc requires apositive definite matrix estimate of Σ g in D n , we use the algorithm proposed by Rothman (2012),with the tuning parameter selected by 5-fold cross-validations.Using this application dataset, we first re-evaluate type I error control of T dc by simulating Y independently of the observed G for the n individuals in the D n testing sample, where y i = + ε i ( i = , . . . , n ) and ε i follows the standard normal distribution. The variable selection and estimationof weights, however, are based on the real data, both Y and G , of the D n training sample. Resultsin Table S5 in the Supplementary Material show that the empirical type I error rates of T dc are atthe nominal level when m =
10 or 50 but are slightly inflated when m = m =
10 or 50 for constructing T dc .In the absence of knowledge of true association, application results focus on stability of testsbased on sample splitting, and p-values of all methods. The empirical p-values are 0.0985 and0.0727, respectively, for the methods of Guo and Chen (2016) and Wu et al. (2019), based on 10 bootstrap samples applied to the whole sample. For T , we randomly split the whole sample to the D n and D n subsets, independently 100 times, to obtain 100 different p ’s, the T -based p-values.The histogram of p ’s for n =
409 and n = T dc , we also randomlysplit the whole sample to two subsets, but independently 100 ×
50 times, and use a sequence of m =
50 repeated sample splits to obtain 100 p dc ’s, the T dc -based p-values. The histogram of p dc ’sfor n =
409 and n = T dc applied to the cystic fibrosis application databased on different n - n sample splits, with m =
50 for constructing T dc and repeatedly 100 times.The total sample size 2 n = J = J = n , the sample ofsize of the testing sample n n Minimum 1st Quantile Median Mean 3rd Quantile Maximum409 1000 0.003 0.030 0.045 0.046 0.061 0.101705 704 0.020 0.067 0.079 0.086 0.105 0.1951000 409 0.003 0.045 0.059 0.057 0.067 0.104sample splitting approach: p ranges from 0.0019 to 0.9446, while in contrast p dc ranges from0.003 to 0.101, with a mode of around 0.05.For completeness, Table 2 provides the summary statistics of the 100 p dc ’s associated withdifferent sample splitting proportions. Results show that although the proposed T dc demonstratesmuch improved stability of inference as compared to T , variation remains in this application,suggesting the signals are too weak or sample size is not sufficient. In the theoretical study, we did not consider the impact of estimating nuisance parameters β x and φ , as we expect that the results would be similar when we impose stringent conditions on thedesign matrix X and the relationship between n and q to ensure estimation accuracy of the nuisanceparameters (Guo and Chen, 2016). In practice, we can estimate the nuisance parameters in thetraining sample and treat the estimates as known quantities to construct T dc in the testing sample.This approach has been recommended by Chernozhukov et al. (2018) for another study settingwhere the sample splitting strategy is used. Acknowledgement
We thank Dr. Lisa J. Strug and her lab for providing the cystic fibrosis data for application andhelpful discussion. YZ is a trainee of the STAGE training program at the University of Toronto.This research is funded by the Natural Sciences and Engineering Research Council of Canada, theCanadian Institutes of Health Research, and the University of Toronto McLaughlin Centre.17 eferences
Barber, R., and E. Candes, 2019: A knockoff filter for high-dimensional selective inference.
Ann.Statist. , , 2504–2537.Chernozhukov, V., D. Chetverikov, M. Demirer, C. H. Esther Duflo, W. Newey, and J. Robins,2018: Double/debiased machine learning for treatment and structural parameters. Econom. J. , , C1–C68.Davies, R., 1980: Algorithm as 155: The distribution of a linear combination of χ random vari-ables. J. Royal Stat. Soc. C, , 323–333.Derkach, A., J. F. Lawless, and L. Sun, 2014: Pooled association tests for rare genetic variants: areview and some new results. Stat. Sci. , , 302–321.Dudbridge, F., 2013: Power and predictive accuracy of polygenic risk scores. PLoS. Genet. , ,e1003 348.Fan, J., and J. Lv, 2008: Sure independence screening for ultrahigh dimensional feature space. J.R. Statist. Soc. B, , 849–911.Fritsche, L., S. Gruber, Z. Wu, ..., and B. Mukherjee, 2018: Association of polygenic risk scoresfor multiple cancers in a phenome-wide study: Results from the michigan genomics initiative. Am. J. Hum. Genet. , , 1048–1061.Goeman, J. J., H. C. Van Houwelingen, and L. Finos, 2011: Testing against a high-dimensionalalternative in the generalized linear model: asymptotic type i error control. Biometrika , 381–390.Guo, B., and S. Chen, 2016: Tests for high dimensional generalized linear models.
J. R. Statist.Soc. B, , 1079–1102.Imhof, J., 1961: Computing the distribution of quadratic forms in normal variables. Biometrika , , 419–426.Ionita-Laza, I., K. McCallum, B. Xu, and J. Buxhaum, 2016: A spectral approach integratingfunctional genomic annotations for coding and noncoding variants. Nat. Genet. , , 214–220.Jin, J., C. Zhang, and Q. Zhang, 2014: Optimality of graphlet screening in high dimensionalvariable selection. J. Mach. Learn. Res. , , 2723–2772.18ravitz, E., R. Carroll, and D. Ruppert, 2018: Sample splitting as an m-estimator with applicationto physical activity scoring. Am. J. Hum. Genet. , , 1048–1061.Li, R., W. Zhong, and L. Zhu, 2012: Feature screening via distance correlation learning. J. Am.Stat. Assoc. , , 1129–1139.Liu, Y. W., and J. Xie, 2020: Cauchy combination test: A powerful test with analytic p-valuecalculation under arbitrary dependency structures. J. Am. Stat. Assoc. , , 393–402.Lloyd-Jones, L. R., J. Zeng, J. Sidorenko, ..., and A. Metspalu, 2019: Improved polygenic predic-tion by bayesian multiple regression on summary statistics. Nat. Commun. , , 1–11.MacCullagh, P., and J. A. Nelder, 1989: Generalized Linear Models . London: CRC Press.Meinshausen, N., L. Meier, and P. Buhlmann, 2009: p-values for high-dimensional regression.
J.Am. Stat. Assoc. , , 1671–1681.Mostafavi, H., A. Harpak, D. Conley, J. Pritchard, and M. Przeworski, 2020: Variable predictionaccuracy of polygenic scores within an ancestry group. Elife , , e48 376.Purcell, S. M., N. R. Wray, J. L. Stone, ..., and I. S. Consortium., 2009: Common polygenicvariation contributes to risk of schizophrenia and bipolar disorder. Nature ,
460 (7256) , 748–752.Rinaldo, A., L. Wasserman, and M. G’Sell, 2019: Bootstrapping and sample splitting for high-dimensional, assumption-lean inference.
Ann. Statist. , , 3438–3469.Romano, J., and C. DiCiccio, 2019: Multiple data splitting for testing. Technical Report .Rothman, A. J., 2012: Positive definite estimators of large covariance matrices.
Biometrika ,
99 (3) ,733–740.Shi, J., J. Park, J. Duan, ..., and M. Garcia-Closas, 2016: Winner’s curse correction and variablethresholding improve performance of polygenic risk modeling based on genome-wide associa-tion study summary-level data.
PLoS. Genet. , , e1006 493.Soave, D., and Coauthors, 2015: A joint location-scale test improves power to detect associatedsnps, gene sets, and pathways. Am. J. Hum. Genet. , , 125–138.Vandekar, S., P. Reiss, and R. Shinohara, 2019: Interpretable high-dimensional inference via scoreprojection with an application in neuroimaging. J. Am. Stat. Assoc. , , 820–830.19ilhjalmsson, B., J. Yang, H. Finucane, ..., and T. Hayeck, 2015: Modeling linkage disequilibriumincreases accuracy of polygenic risk scores. Am. J. Hum. Genet. , , 576–592.Wasserman, L., and K. Roedern, 2009: High-dimensional variable selection. Ann. Statist. , ,2178–2201.Wu, C., G. Xu, and W. Pan, 2019: An adaptive test on high-dimensional parameters in generalizedlinear models. Statistica Sinica , , 2163–2186.Xu, G. J., L. F. Lin, P. Wei, and W. Pan, 2016: An adaptive two-sample test for high-dimensional. Biometrika , , 609–624.Xu, Z., C. Wu, P. Wei, and W. Pan, 2017: A powerful framework for integrating eqtl and gwassummary data. Genetics , , 893–902.Zhang, J., J. Guo, B. Zhou, and M. Y. Cheng, 2019: A simple two-sample test in high dimensionsbased on l2-norm. J. Am. Statist. Assoc. , , 1011–1027.Zhang, Y., 2017: Recovery of weak signal in high dimensional linear regression by data perturba-tion. Electron. J. Stat. , , 3226–3250.Zhou, X., P. Carbonetto, and M. Stephens, 2013: Polygenic modeling with bayesian sparse linearmixed models. PLoS. Genet. , , e1003 264.Zou, H., and T. Hastie, 2005: Regularization and variable selection via the elastic net. J. R. Statist.Soc. B,67