[PDF] Accurate Genomic Prediction Of Human Height

Abstract

We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, \sim40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate \sim0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the "missing heritability" problem -- i.e., the gap between prediction R-squared and SNP heritability. The \sim20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.

Full PDF

aa r X i v : . [ q - b i o . GN ] S e p Accurate Genomic Prediction Of Human Height

Louis Lello , Steven G. Avery , Laurent Tellier , Ana I. Vazquez , Gustavo de losCampos , and Stephen D.H. Hsu Michigan State UniversityDepartment of Physics and AstronomyEast Lansing, MI 48824 Michigan State UniversityDepartment of Epidemiology and BiostatisticsEast Lansing, MI 48824 Cognitive Genomics Laboratory, BGI Michigan State UniversityDepartment of Statistics and ProbabilityEast Lansing, MI 48824 University of CopenhagenDepartment of Biology, Functional GeneticsCopenhagen, DK

Abstract

We construct genomic predictors for heritable and extremely complex human quan-titative traits (height, heel bone density, and educational attainment) using modernmethods in high dimensional statistics (i.e., machine learning). Replication tests showthat these predictors capture, respectively, ∼

40, 20, and 9 percent of total variance forthe three traits. For example, predicted heights correlate ∼ ∼

20k activated SNPs in our height predictor reveal thegenetic architecture of human height, at least for common SNPs. Our primary datasetis the UK Biobank cohort, comprised of almost 500k individual genotypes with mul-tiple phenotypes. We also use other datasets and SNPs found in earlier GWAS forout-of-sample validation of our results.

Introduction

Recent estimates [1] suggest that common SNPs account for signiﬁcant heritability of complex traitssuch as height, heel bone density, and educational attainment (EA). Large GWAS studies of these traitshave identiﬁed many associated SNPs at genome-wide signiﬁcance ( p < × − ) [2–6]. However,the total variance accounted for by these SNPs is still a small fraction of the trait heritability and of theproportion of variance that could be captured by regression on common SNPs as suggested by SNPheritability estimates [7].The simplest hypothesis explaining this (so far) “missing heritability” is that previous studies havenot had enough statistical power to identify most of the relevant SNPs, due to their small eﬀect size, lowminor-allele frequency (MAF), or both. In this letter, we provide evidence in support of this hypothesisby constructing genomic predictors capturing much of the estimated SNP heritability. We make useof a newly available large data set (the UK Biobank 500k genomes release) and new computationalmethods.Association studies (GWAS) focus on reliable (high conﬁdence) identiﬁcation of associated SNPs.In contrast, genomic prediction based on whole genome regression methods [8] seeks to constructthe most accurate predictor of phenotype, and tolerates possible inclusion of a small fraction of false-positive SNPs in the predictor set. The SNP heritability of the molecular markers used to build thepredictor can be interpreted as an upper bound to the variance that could be captured by the predictor.While identiﬁcation of GWAS SNPs is accomplished by single SNP regression, construction of abest predictor is a global optimization problem in the high dimensional space of possible eﬀect sizes ofall SNPs. In this letter we use L -penalized regression (LASSO or Compressed Sensing) to obtain ourpredictors. This method is particularly eﬀective in cases where only a small subset of variables havenon-zero eﬀect on the predicted quantity (i.e., the eﬀects vector is sparse, or approximately sparse).In earlier work [9] it was shown that matrices of human genomes are good compressed sensors, andthat they are in the universality class of Gaussian random matrices. The L algorithm exhibits phasetransition behavior as the sample size and penalization parameter are varied; this behavior can be usedto optimize the penalization as a function of sample size. Technical details are provided in the Methodssection below.Beyond the theoretical considerations given above, the practical outcome of our work is to signif-icantly improve accuracy in genomic prediction of complex phenotypes. Using these predictors, onecan, for example, reliably identify outliers in the population based on DNA alone. The activated SNPsin the predictors (i.e., those that have been assigned non-zero eﬀect size by the LASSO algorithm)are likely to be associated with the phenotype, although they may not reach genome-wide signiﬁcancein ordinary regression analysis. While there may be some contamination of false-positives amongthese SNPs, one can nevertheless infer properties of the overall genetic architecture of the trait (e.g.,distribution of eﬀect sizes with MAF). Our main dataset is the July 2017 release of nearly 500k UK Biobank genotypes and associatedphenotypes [10, 11]. (See Supplement for more detailed description of data, quality control, algorithms,and computations.)We compute an estimator ® β ∗ for the vector of linear eﬀects, ® β ∈ R p , using L -penalized regression(LASSO) [12]. This corresponds to minimizing the objective function below (phenotypes ® y are age2nd gender adjusted; both ® y and genotype values X are standardized). ® β ∗ = argmin ® β ∈ R p O λ (® y , X ; ® β ) , O λ (® y , X ; ® β ) = (cid:13)(cid:13) ® y − X ® β (cid:13)(cid:13) + n λ k ® β k , (1)where λ is a penalty (hyper-)parameter and the L norm is deﬁned to be the sum of the absolute valuesof the coeﬃcients k ® β k = p Õ j = | β j | . The resulting eﬀects vector ® β ∗ deﬁnes a linear predictive model which captures a large portion of theheritable genetic variance.In our procedure, a ﬁrst screening based on standard single marker regression is performed on thetraining set to reduce the set of candidate SNPs from 645,589 SNPs that passed QC (Supplement) tothe top p =

50k and 100k by statistical signiﬁcance.

Figure (1) displays results from a typical LASSO run for height. 5 non-overlapping sets of 5kindividuals each were held back from LASSO training using the top 100k candidate SNPs. For eachvalue of the L penalization λ the resulting predictor ® β ∗ is applied to the genomes of the holdbacksets and the correlation between predicted and actual height is computed. A phase transition (regionof rapid variation in results) is expected and occurs at roughly 10 < − ln ( λ ) <

12. The penalizationis reduced until the correlation is maximized. In Figure (1), the correlation is shown as a functionof number of SNPs assigned non-zero eﬀect sizes (i.e., activated) by LASSO. In the phase transitionregime, where correlation rapidly increases, the number of activated SNPs grows rapidly from aboutzero to 7k. Each of the 5 colored curves in the ﬁgure corresponds to a training run on 453k individuals,with a diﬀerent 5k held back (and slightly diﬀerent training set) for each run. The phase transition isshown in terms of the penalization − ln ( λ ) in Figure (2).Figure (3) shows the correlation between predicted and actual phenotypes in a validation set of 5000individuals not used in the training optimization described in above - this is shown both for height andheel bone mineral density. The horizontal axis shows the number of individuals used in the training setand the error bars reﬂect 1 SD uncertainty estimated from ﬁve replications. The correlation obtainedindicates convergence to an asymptotic value of somewhat less than 0.7 (corresponding to roughly 50percent of total variance) for height, and perhaps 0.45 for heel bone mineral density. Figure (4) showsa scatterplot (each point is an individual) of predicted and actual height for 2000 individuals (roughlyequal numbers of males and females) not used in the training. The actual heights of most individualsare within about 3 cm of the predicted value.The corresponding result for Educational Attainment does not indicate any approach to a limitingvalue. Using all the data in the sample, we obtain maximum correlation of ∼ .

3, activating about 10kSNPs. Presumably, signiﬁcantly more or higher quality data will be required to capture most of theSNP heritability of this trait.The number of activated SNPs in the optimal predictors for height and bone density is roughly 20k.Increasing the number of candidate SNPs used from p =

50k to p = h = . ± . ( ) and heel bone density h = . ± . ( ) . However, there has been debate in the literature over the statistical properties of3 . . . . . . . . Number of Hits P r ed i c t i v e C o rr e l a t i on HeightTop 100k SNPs Max Correlation0.6399 +/− 0.0172 - ln ( l ) Figure 1: Correlation between actual and predicted heights as a function of the number of SNP hitsactivated in the predictor. While diﬃcult to visually separate, each line represents the training of apredictor using 453k individuals. Correlation is computed on 5k individuals not used in training. Thephase transition region (roughly, 10 < − ln ( λ ) <

12) corresponds to rapid growth in correlation on thisgraph, with number of hits growing from near 0 to over 5000.GREML estimates of SNP heritability and it is not clear that standard estimation methods yieldreasonably unbiased estimates even with large sample size [7, 13–16]. Therefore, we suggest thatGCTA estimates of SNP heritability should only be used as a rough guide. Perhaps the only way todetermine the heritability of a trait over a speciﬁc set of genomic variants is to build the best possiblepredictor [17] (i.e., with, in principle, unlimited sample size n ) to determine how much variance canbe accounted for.For height we tested out-of-sample validity by building a predictor model using SNPs whose stateis available for both UKBB individuals (via imputation) and on Atherosclerosis Risk in CommunitiesStudy (ARIC) [18] individuals (the latter is a US sample). This SNP set diﬀers from the one usedabove, and is somewhat more restricted due to the diﬀerent genotyping arrays used by UKBB andARIC. Training was done on UKBB data and out-of-sample validity tested on ARIC data. A ∼ ∼ ∼ ∼ ∼

7% diﬀerencein predictor performance. An ARIC scatterplot analogous to Figure (4) is shown in the Supplement.Most ARIC individuals have actual height within 4 cm or less of predicted height.We also checked (see Supplement) that familial relationships in UKBB do not have an importantimpact on our results. LASSO training was done both on the full set of data and on a smaller dataset where all ﬁrst degree cousin or stronger relations were removed (kinship > 0.10). After ﬁltering4 . . . . . . . . - ln ( l ) P r ed i c t i v e C o rr e l a t i on HeightTop 100k SNPs Max Correlation0.6399 +/− 0.0172 - ln ( l ) Figure 2: Correlation between actual and predicted heights as a function of L penalization λ . Eachline represents the training of a predictor using 453k individuals. Correlation is computed on 5kindividuals not used in training.

100 150 200 250 300 350 400 450 . . . . . . Sample Size 1000 C o rr ( y , y ^ ) HeightHeel Bone Mineral Density

Figure 3: Correlation between predicted and actual height as number of individuals n in training set isvaried. p =

50k candidate SNPs used in optimization. Fit lines of the form Corr ∼ nn + b are includedto aid visualization. 5

50 160 170 180 190 200

Predicted Height (cm) A c t ua l H e i gh t ( c m ) Figure 4: Actual height (cm) versus predicted height (cm) using 2000 randomly selected individualsheld back from predictor optimization. (Roughly equal numbers of males and females; no correctionsof actual height for age or gender). Error bars indicate ± ∼ . − . − . . . . . Position (base pairs) b i Figure 5: Eﬀect size (minor allele) for each activated SNP in a predictor model. The horizontal axisrepresents the SNP position in the genome, if chromosomes (1-22) were connected end to end to forma continuous linear region. Activated SNPs are distributed roughly uniformly throughout the genome.

Until recently most work with large genomic datasets has focused on ﬁnding associations betweenmarkers (e.g., SNPs) and phenotype [17]. In contrast, we focused on optimal prediction of phenotypefrom available data. We show that much of the expected heritability from common SNPs can becaptured, even for complex traits aﬀected by thousands of variants. Recent studies using data fromthe interim release of the UKBB reported prediction correlations of about 0.5 for human height usingroughly 100K individuals in the training [19]. These studies forecast further improvement of predictionaccuracy with increased sample size, which have been conﬁrmed here.We are optimistic that, given enough data and high quality phenotypes, results similar to thosefor height might be obtained for other quantitative traits, such as cognitive ability or speciﬁc diseaserisk. There are numerous disease conditions with heritability in the 0.5 range, such as Alzheimer’s,Type I Diabetes, Obesity, Ovarian Cancer, Schizophrenia, etc [20]. Even if the heritable risk forthese conditions is controlled by thousands of genetic variants, our work suggests that eﬀectivepredictors might be obtainable (i.e., comparable to the height predictor in Figure (4)). This wouldallow identiﬁcation of individuals at high risk from genotypes alone. The public health beneﬁts arepotentially enormous.We can roughly estimate the amount of case-control data required to capture most of the variancein disease risk. For a quantitative trait (e.g., height) with h ∼ .

5, our simulations [9] predict thatthe phase transition in LASSO performance occurs at n ∼ s where n is the number of individuals inthe sample and s is the sparsity of the trait (i.e., number of variants with non-zero eﬀect sizes). Forcase-control data, we ﬁnd n ∼ s (where n means number of cases with equal number controls)is suﬃcient. Thus, using our methods, analysis of ∼ Acknowledgments

LL, SA, and SH acknowledge support from the Oﬃce of the Vice-President forResearch at MSU. The authors are grateful for useful correspondence and discussion with AlexanderGrueneberg and Hwasoon Kim. We also acknowledge support from the NIH Grants R01GM099992and R01GM101219, and NSF Grant IOS-1444543, subaward UFDSP00010707.8

SUPPLEMENT: Methods

A.1 UKBB Dataset QC

In July 2017, the UK Biobank [10, 11] released a set of 488,377 genotyped individuals which weregenotyped using two Aﬀymetrix platforms—approximately 50,000 samples on the UK BiLEVE Axiomarray and the remainder on the UK Biobank Axiom array. The initial genotype information wascollected for 488,377 individuals for 805,426 SNPs and then subsequently imputed. Quality Controlwas done on the un-imputed data by removing SNPs which had missing call rates over 3%, removingindividuals which had missing call rates over 10% and, so as not to deal with very rare variants,removing SNPs which had minor frequencies below 0.1%. The resulting genetic data contained645,589 SNPs and 488,371 individuals. This set was then further ﬁltered for self-reported caucasiansfor whom the necessary phenotype measurements were available: for height, the number of remainingindividuals was 457,484; for heel bone mineral density there were 413,444 individuals; and foreducational attainment, there were 455,637 individuals.The imputed data set was generated using the set of 805,426 raw markers using the HaplotypeReference Consortium and UK10K haplotype resources. After imputation and initial QC, there werea total of 92,060,613 SNPs and 487,411 individuals. From this imputed data, further quality controlwas performed using Plink version 1.9 by excluding SNPs and samples which had missing call ratesexceeding 3% and also removing snps with minor allele frequency below 0.1%. For out-of-samplevalidation of height, we extracted SNPs which survived the prior quality control measures, and arealso present in a second dataset from the Atherosclerosis Risk in Communities Study (ARIC) [18].This resulted in a total of 632,155 SNPs and 464,192 samples. All quality control steps, except thoseperformed by the UK Biobank involving the imputation, were performed using version 1.9 of the Plinksoftware [21].

A.2 Confounding variables: age, sex and family structure

All traits for self-identiﬁed Caucasians were adjusted on the basis of age and sex. The phenotypesfor self-reported Caucasians were adjusted by z-scoring the phenotypes amongst all individuals of thesame sex. To correct for the eﬀects of societal changes, a univariate linear regression was performedon z-scored phenotypes using year-of-birth as the dependent variable. The adjusted phenotype was setequal to the residual of the z-scored phenotype and the regression line. Before making these corrections,it was shown that the mean phenotypic value was indeed increasing with year-of-birth—this was seenin all three phenotypes: height, heel bone mineral density and educational attainment.Relatedness calculations were provided with the UKBB dataset in order to account for familystructure and cryptic relatedness. There were 107,163 familial relationships identiﬁed amongst UKBBparticipants which were at the level of third cousins or higher and, due to the large number ofrelationships, ﬁltering out these individuals results in a nontrivial decrease in the size of data availablefor model selection. To investigate the relevance of this issue, LASSO training was done both on thefull set of data and on a smaller data set where all ﬁrst degree cousin or stronger relations were removed(kinship > 0.10). After ﬁltering for kinship on the calls, this left 423,510 individuals for height and382,727 individuals for heel bone density. This unrelated dataset was used for model training usingrandom sets of 100, 150, ... , 400 thousand individuals and there was no discernible diﬀerence inthe results between using a training set drawn from the set of 423,510 kinship-ﬁltered individuals andindividuals from the unﬁltered set. Therefore we do not believe that the familial relationships have animportant impact on our results. 9 .3 L -penalized regression Consider the regression problem in generality. We have n observations of the phenotype, y I , with I = , . . . , n as the vector ® y . The genotype data is encoded in the n × p design matrix X I j with j = , . . . , p . The X I j is the number of copies of the most frequent minor allele of the j th SNP for the I th person, and thus takes values 0, 1, or 2. Missing values are mean-imputed.We use a standard linear model for the dependence of y on the SNP data x j . That is, we assume arelationship of the form y I = y + ˆ ® β · ® x I + e I , (2)where the errors, e I , are assumed to be (identically and independent) normally distributed withunknown variance σ e . The errors, e I , receive contributions from potential environmental eﬀects,gene–gene nonlinear eﬀects, and gene–environment nonlinear eﬀects. For discussion of methods torecover nonlinear eﬀects, see [22].We compute an estimator ® β ∗ for the vector of linear eﬀects, ˆ ® β ∈ R p , using L -penalized regression(LASSO) [12]. This corresponds to minimizing the objective function (after standardizing ® y and X ) ® β ∗ = argmin ® β ∈ R p O λ (® y , X ; ® β ) , O λ (® y , X ; ® β ) = (cid:13)(cid:13) ® y − X ® β (cid:13)(cid:13) + n λ k ® β k , (3)where λ is a penalty (hyper-)parameter and the L norm is deﬁned to be the sum of the absolute valuesof the coeﬃcients k ® β k = p Õ j = | β j | . (We use k · k with no subscript to denote the standard L norm.) The extra factor of n in the secondterm is a convention that factors out the explicit sample size scaling of n . The squaring in the ﬁrst termis (implicitly) of the L norm of the residual.The ﬁrst term is the standard ordinary least-squares (OLS) loss function. The eﬀect of the secondterm is to regularize the regression problem by favoring sparse solutions with the nonzero coeﬃcientsshrunk toward 0. This seems appropriate for genomic problems, since we expect that for any givenphenotype most SNPs have no eﬀect. Biasing the nonzero coeﬃcients toward 0 reduces variance andimproves the expected ﬁt for small sample size.Even for n ≪ p , LASSO can obtain an accurate ® β under the right conditions: the eﬀects vectormust be sparse and the heritability of the trait must be suﬃciently high (equivalently, the amount ofnoise variance is bounded). For ﬁxed σ e and sparse eﬀects vector, there is a critical sample size n ∗ (depending on σ e and the sparsity of the trait) above which one expects to get good recovery of ® β interms of the L error. A phase transition at n ∼ n ∗ has been demonstrated numerically for real andsimulated genomic data in [9].For our speciﬁc calculations we follow the following cross-validation procedure:1. Break the data into training sets, and smaller test and validation sets.2. Perform a standard GWAS on the training sets, and rank the SNPs by p -value.3. To ease the computational burden, restrict the calculation to a ﬁxed number of lowest p -valueSNPs on each training set. Replace any missing SNP values by the SNP-mean for the trainingdata.4. Perform LASSO on the standardized training data, scanning a range of values for the penalty λ that passes through the phase transition region of rapid variation in results.10. Choose the λ that has the maximum correlation on the test set, which was held back fromtraining.6. Finally, evaluate performance of optimal predictor β ∗ on validation sets. A.4 Coordinate Descent

Most algorithms for minimizing the objective function (3) use (some variation of) coordinate de-scent [23, 24]. The basic form of the algorithm is as follows. Proceeding from an initial guess ® β , wecycle through the p “coordinates” sequentially, minimizing O with respect to each β j (holding othersﬁxed). To that end, note that ∂ O ∂ β j = n  h x j i β j + Õ k , j h x j x k i β k − h x j y i + λ sgn ( β j )  = . (4)Thus, the updated coeﬃcient should satisfy β ∗ j = h x j i  h x j y i − Õ k , j h x j x k i β k − λ sgn ( β ∗ j )  . (5)To solve for β ∗ j , one should determine the λ = ( β ∗ j ) should be positive (negative) and subtract (add) the λ term. If the sign ﬂips, then the solutionis spurious, and the optimal solution is at β ∗ j =

0. (To see this note that for β ∗ j = + the derivative ispositive, and for β ∗ j = − the derivative is negative.)Introduce the “soft thresholding function” S ( z , γ ) = sgn ( z ) max (| z | − γ, ) . (6)Then, the update for the j th component of ® β is β ∗ j = h x j i S © « h x j y i − Õ k , j h x j x k i β k , λ ª®¬ . (7)The basic Coordinate descent algorithm is as shown in Alg. 1. A.5 Out-of-sample Validation

Model (i.e., predictor) construction was performed by implementing LASSO on the UK Biobank data.In order to validate models and check against overtraining, a second dataset is needed in order to testthe results. We 1) withheld a small subset of UKBB individuals from the initial training for in-samplevalidation, and 2) applied the model to individuals from a completely diﬀerent dataset (ARIC) forout-of-sample validation. In-sample validation was done by withholding a predetermined number ofrandomly selected individuals from the UK Biobank data before p-value cuts were applied to SNPs.The remaining individuals were used for LASSO training and the resulting model was applied to theindividuals initially held back to check in-sample validity.Out-of-sample validation is similar, except that we used a set of common SNPs for which statevalues can be imputed on the UKBB individuals and are also known for ARIC individuals. Initial We use a custom implementation in Julia [25] using safe screening ideas [26–29]. ata: X jI and y I with j = , . . . , p and I = , . . . , n Input:

Penalty parameter λ , tolerance ǫ , and (optionally) initial guess ® β Output: ® β solving LASSO optimization problem within convergence tolerance ǫ ® β ← ® β repeat ® β ← ® β for j in { , . . . , p } do β j ← h x j i S (cid:16) h x j y i − Í k , j h x j x k i β k , λ (cid:17) enduntil ( ® β − ® β ) < ǫ return ® β Algorithm 1:

Basic coordinate descent algorithm for LASSO.training of the model was performed using UKBB individuals, but its validity was then tested on theARIC data. Results using the un-imputed dataset reached correlation of ∼ ∼ ∼ β by the LASSO algorithm) by varianceexplained ( V i = ν ( − ν ) β i where ν is the minor allele frequency), then scanned down this list andlooked for a proxy match by distance in the corresponding dataset. For GIANT, we took the resultspublished online and extracted the top 3000 hits ordered by p-value. For SSGAC, we used the publishedresults and kept SNPs with p < − - a total of 316. For GEFOS, we kept all SNPs with p < − and then coarse grained SNPs within blocks of 10k base pairs, resulting in 3901 regions. These resultsare displayed in Figures (7), (8), (9). They show signiﬁcant overlap between regions of the genomenear previously known SNPs and regions identiﬁed by our algorithm. However, our activated SNPsare roughly uniformly distributed over the entire genome, and number in the many thousands for eachtrait. This means that many of our SNPs, including some of those that account for the most variance,are in regions not previously identiﬁed by earlier GWAS.Figure (10) shows number of activated SNPs by sign of eﬀect of the minor allele and minor allelefrequency (MAF). The height of each bar represents the number of + or − SNPs in a MAF bin ofwidth 0 . ∼ n ( ν ) = a ν − b to the range ν ∈ ( . , . ) where ν is MAF and n ( ν ) is the number of nonzero eﬀects. We exclude the smallestvalues of MAF because of incomplete discovery of SNPs in that region. The ± distributions are nearlysymmetrical ( a + = . , b + = . a − = . b − = .

40 150 160 170 180 190

Predicted Height (cm) A c t ua l H e i gh t ( c m ) Figure 6: Actual height (cm) versus predicted height (cm) using 2000 randomly selected individuals(roughly equal numbers of M and F; no corrections for age or gender) from the ARIC dataset. Errorbars indicate ± References [1] Jian Yang et al. “GCTA: A Tool for Genome-wide Complex Trait Analysis”. In:

The AmericanJournal of Human Genetics doi : . url : https://doi.org/10.1016/j.ajhg.2010.11.011 (cit. on p. 2).[2] Peter M. Visscher et al. “10 Years of GWAS Discovery: Biology, Function, and Translation”.In: The American Journal of Human Genetics doi : . url : https://doi.org/10.1016/j.ajhg.2017.06.005 (cit. on p. 2).[3] Eirini Marouli et al. “Rare and low-frequency coding variants alter human adult height”. In: Nature doi : . url : https://doi.org/10.1038/nature21039 (cit. on pp. 2, 12).13 . . . . Top N SNPs from LASSO % M a t c hed w = 10kw = 50kTop 3k Giant SNPs by P−value Figure 7: Matching between top SNPs activated in predictor model ordered by variance accountedfor (x axis) and SNPs identiﬁed previously by GIANT GWAS (height). Percent of previously knownSNPs matched is shown on y axis. Matching window size w given in bp. . . . . . Top N SNPs from LASSO % M a t c hed w = 10kw = 50kTop 312 SSGAC SNPs by P−valueP < 1e−6 Figure 8: Matching between top SNPs activated in predictor model ordered by variance accountedfor (x axis) and SNPs identiﬁed previously by SSGAC GWAS (Educational Attainment). Percent ofpreviously known SNPs matched is shown on y axis. Matching window size w given in bp.14 . . . . . Top N SNPs from LASSO % M a t c hed w = 15kw = 50kTop GEFOS SNPs by P−valueP < 1e−8 Figure 9: Matching between top SNPs activated in predictor model ordered by variance accounted for(x axis) and SNPs identiﬁed previously by GEFOS GWAS (Heel Bone Density). Percent of previouslyknown SNPs matched is shown on y axis. Matching window size w given in bp. N u m be r o f non z e r o E ff e c t s − − − − Positive EffectsNegative EffectsHeightr = 0.61Total Hits = 22k

Figure 10: Number of SNPs with positive (red) and negative (blue) minor allele eﬀect sizes. Curvesare constructed by ﬁtting a power law in MAF. 154] Unnur Styrkarsdottir et al. “Multiple Genetic Loci for Bone Mineral Density and Fractures”. In:

New England Journal of Medicine doi : . url : https://doi.org/10.1056/nejmoa0801197 (cit. on pp. 2, 12).[5] John A Morris et al. “Genome-Wide Association Study of Heel Bone Mineral Density Identiﬁes153 Novel Loci and Implicates Functional Involvement of GPC6 in Osteoporosis”. In: (). Toappear in Nature Genetics (cit. on pp. 2, 12).[6] Aysu Okbay et al. “Genome-wide association study identiﬁes 74 loci associated with educationalattainment”. In: Nature doi : . url : https://doi.org/10.1038/nature17671 (cit. on pp. 2, 12).[7] Gustavo de los Campos et al. “Genomic Heritability: What Is It?” In: PLOS Genetics doi : . url : https://doi.org/10.1371/journal.pgen.1005048 (cit. on pp. 2, 4).[8] Gustavo de los Campos et al. “Predicting genetic predisposition in humans: the promise ofwhole-genome markers”. In: Nature Reviews Genetics doi : . url : https://doi.org/10.1038/nrg2898 (cit. on p. 2).[9] Shashaank Vattikuti et al. “Applying compressed sensing to genome-wide association studies”.In: GigaScience issn : 2047-217X. doi : . url : http://dx.doi.org/10.1186/2047-217X-3-10 (cit. on pp. 2, 7, 10).[10] UK Biobank . Accessed: 2017-07-21. url : (cit. on pp. 2, 9).[11] Clare Bycroft et al. “Genome-wide genetic data on ~500, 000 UK Biobank participants”. In:(July 2017). doi : . url : https://doi.org/10.1101/166298 (cit. on pp. 2,9).[12] Robert Tibshirani. “Regression Shrinkage and Selection Via the Lasso”. In: Journal of the RoyalStatistical Society, Series B

58 (1994), pp. 267–288 (cit. on pp. 2, 10).[13] Siddharth Krishna Kumar et al. “Limitations of GCTA as a solution to the missing heritabilityproblem”. In:

Proceedings of the National Academy of Sciences doi : . url : https://doi.org/10.1073/pnas.1520109113 (cit. on p. 4).[14] Jian Yang et al. “GCTA-GREML accounts for linkage disequilibrium when estimating geneticvariance from genome-wide SNPs”. In: Proceedings of the National Academy of Sciences doi : . url : https://doi.org/10.1073/pnas.1602743113 (cit. on p. 4).[15] Siddharth Krishna Kumar et al. “Response to Commentary on "Limitations of GCTA as asolution to the missing heritability problem"”. In: (Feb. 2016). doi : . url : https://doi.org/10.1101/039594 (cit. on p. 4).[16] Eric R Gamazon and Danny S Park. “SNP-based heritability estimation: measurement noise,population stratiﬁcation, and stability”. In: (Feb. 2016). doi : . url : https://doi.org/10.1101/040055 (cit. on p. 4).[17] Robert Makowsky et al. “Beyond Missing Heritability: Prediction of Complex Traits”. In: PLoSGenetics doi : . url : https://doi.org/10.1371/journal.pgen.1002051 (cit. on pp. 4, 7).[18] “The decline of ischaemic heart disease mortality in the ARIC study communities. The ARICStudy Investigators”. In: Int J Epidemiol url : (cit. on p. 7).[21] Shaun Purcell et al. “PLINK: A Tool Set for Whole-Genome Association and Population-BasedLinkage Analyses”. In: The American Journal of Human Genetics doi : . url : https://doi.org/10.1086/519795 (cit. on p. 9).[22] Chiu Man Ho and Stephen DH Hsu. “Determination of nonlinear genetic architecture usingcompressed sensing”. In: GigaScience doi : . url : https://doi.org/10.1186/s13742-015-0081-6 (cit. on p. 10).[23] Jerome Friedman et al. “Pathwise coordinate optimization”. In: The Annals of Applied Statistics doi : . url : https://doi.org/10.1214/07-aoas131 (cit. on p. 11).[24] Jerome Friedman et al. “Regularization Paths for Generalized Linear Models via CoordinateDescent”. In: Journal of Statistical Software doi : . url : https://doi.org/10.18637/jss.v033.i01 (cit. on p. 11).[25] J. Bezanson et al. “Julia: A Fast Dynamic Language for Technical Computing”. In: ArXiv e-prints (Sept. 2012). arXiv: (cit. on p. 11).[26] L. El Ghaoui et al. “Safe Feature Elimination in Sparse Supervised Learning”. In:

Paciﬁc Journalof Optimization

Proceedings of The 31st International Conference on Machine Learning . Ed. by Eric P. Xing andTony Jebara. Vol. 32. 1. JMLR Workshop and Conference Proceedings, Jan. 2014, pp. 289–297(cit. on p. 11).[28] O. Fercoq et al. “Mind the duality gap: safer rules for the Lasso”. In:

ArXiv e-prints (May 2015).arXiv: (cit. on p. 11).[29] A. Malti and C. Herzet. “Safe screening tests for LASSO based on ﬁrmly non-expansiveness”.In: .IEEE, Mar. 2016. doi : . url : https://doi.org/10.1109/icassp.2016.7472575https://doi.org/10.1109/icassp.2016.7472575