From Genotype to Phenotype: polygenic prediction of complex human traits
Timothy G. Raben, Louis Lello, Erik Widen, and Stephen D.H. Hsu
Michigan State University, East Lansing, Michigan, 48824
Genomic Prediction, North Brunswick, New Jersey, 08902
Abstract
Decoding the genome confers the capability to predict characteristics of the organism (phenotype) from DNA (genotype). We describe the present status and future prospects of genomic prediction of complex traits in humans. Some highly heritable complex phenotypes such as height and other quantitative traits can already be predicted with reasonable accuracy from DNA alone. For many diseases, including important common conditions such as coronary artery disease, breast cancer, type I and II diabetes, individuals with outlier polygenic scores (e.g., top few percent) have been shown to have 5 or even 10 times higher risk than average. Several psychiatric conditions such as schizophrenia and autism also fall into this category. We discuss related topics such as the genetic architecture of complex traits, sibling validation of polygenic scores, and applications to adult health, in vitro fertilization (embryo selection), and genetic engineering.
A version of this article was prepared for Genomic Prediction of Complex Traits, Springer Nature book series Methods in Molecular Biology.
Keywords: genomics, complex trait prediction, PRS, in vitro fertilization, genetic engineering
Introduction
I, on the other hand, knew nothing, except ... physics and mathematics and an ability to turn my hand to new things.
(Francis Crick)
The challenge of decoding the genome has loomed large over biology since the time of Watson and Crick. Initially, decoding referred to the relationship between DNA and specific proteins or molecular mechanisms, but the ultimate goal is to deduce the relationship between DNA and phenotype: the character of the organism itself. How does Nature encode the traits of the organism in DNA? In this review we describe recent advances toward this goal, which have resulted from the application of machine learning (ML) to large genomic data sets. Genomic prediction is the real decoding of the genome: the creation of mathematical models which map genotypes to complex traits.
It is a peculiarity of ML and artificial intelligence (AI) applied to complex systems that these methods can often "solve" a problem without explicating, in a manner that humans can absorb, the intricate mechanisms that lie intermediate between input and output. For example, AlphaGo [1] achieved superhuman mastery of an ancient game that had been under serious study for thousands of years. Yet nowhere in the resulting neural network with millions of connection strengths is there a human-comprehensible guide to Go strategy or game dynamics. Similarly, genomic prediction has produced mathematical functions which predict quantitative human traits with surprising accuracy (e.g., height, bone density, and cholesterol or lipoprotein A levels in blood; see Table 1), typically using thousands of genetic variants as input (see next section for details), but without explicitly revealing the role of these variants in actual biochemical mechanisms. Characterizing these mechanisms, which are involved in phenomena such as bone growth, lipid metabolism, hormonal regulation, and protein interactions, will be a project which takes much longer to complete.
If recent trends persist, in particular the continued growth of large genotype-phenotype data sets, we will likely have good genomic predictors for a host of human traits within the next decade. Here good can mean capturing most of the a priori estimated heritability of the trait. Progress is data limited: algorithms and computational resources are not the bottleneck.
[Table 1. Columns: Phenotype, Correlation. Phenotypes: Heel bone density, BMI, Educational attainment, Apolipoprotein A, Apolipoprotein B, Cholesterol, Direct bilirubin, HDL cholesterol, Lipoprotein A, Platelet count, Total bilirubin, Total protein, Triglycerides.]
Table 1:
Examples of quantitative trait prediction. The last 10 traits listed are all obtained from standard blood test measurements (terminology from UK Biobank data fields). Uncertainties given in parentheses are the standard deviation obtained from validation of 5 different training runs, each of which produces a slightly different predictor. Predictors were trained with data and methods analogous to [2].
In the following sections we will discuss (1) Current status of human genomic prediction: examples of quantitative trait prediction and prediction of disease risk, out of sample validation of predictors, sibling validation, (2) Methods, Genetic Architecture, and Theory, and (3) Applications to In Vitro Fertilization and Gene Editing. The final section will discuss some (near term) future projections.
[Plot: AUC versus training sample size N for hypertension, hypothyroidism, and type 2 diabetes.]
Figure 1:
Prediction quality measured by AUC for three conditions (hypertension, hypothyroidism, type 2 diabetes) as a function of training sample size $N$. Evidence is strong that further increases in sample size will lead to improvement in accuracy. Reproduced from [2].
Present Status: 2020
Technological advances have reduced the (2020) cost of whole genome sequencing to under $1k and the cost of SNP array genotyping to roughly $20 [3]. In this section we will give an overview of results obtained from training on data sets collected using SNP arrays. For quantitative traits, sample sizes were in excess of 400k individuals. For disease risk training (polygenic risk scores, or PRS), typical sample sizes were tens of thousands of cases and at least as many controls, typically individuals late in life for whom medical records are available. See [2] for specific details.
In 2012 one of the authors [4] estimated that training on a few hundred thousand SNP genotypes using L1 penalization (see next section for details) would capture most of the common SNP heritability for height. In 2017, with the release of the UK Biobank data set of 500k genomes, this prediction came true. Accurate genomic prediction of adult height, with standard deviation of ∼ […]
[Plot: predicted height (cm) versus measured height (cm) for men and women.]
Figure 2:
Height prediction in males and females not used in predictor training. Predicted height computed from ∼20k SNPs is shown on the vertical axis, and actual height on the horizontal axis. In a typical bin the standard deviation of the distribution of actual heights relative to prediction is ∼ […] ∼0.65. Reproduced from [5].
Polygenic risk predictors for dozens of important disease conditions, including, e.g., diabetes, breast cancer, coronary artery disease, hypertension, schizophrenia, autism, and many more, have been published and validated by many research groups [2, 10–12].
We can roughly characterize the performance of these polygenic risk predictors as follows: individuals with very high PRS will typically have an incidence rate which is many times higher than the population average. For example, in [2] we found that for atrial fibrillation, 99th percentile PRS implies ∼10 times higher likelihood of case status. Similarly, low PRS indicates below average risk for the condition: we can identify individuals whose risk is an order of magnitude lower than in the general population.
Identification of outliers is possible even though the standard performance metric AUC (Area Under ROC Curve) value is modest: e.g., AUC ∼ 0.6. This is because the absolute risk as a function of PRS is highly nonlinear: outlier (e.g., 99th percentile) risk can be very high even if risk for individuals near the middle of the PRS distribution varies only modestly (see figure 3, which illustrates risk differentiation for the example phenotypes breast cancer and hypothyroidism). We report below on further explicit results.
[Plots: probability of condition versus PRS for breast cancer and hypothyroidism; series: inferred population, predicted, lifetime prevalence.]
Figure 3:
Incidence of breast cancer and hypothyroidism as a function of percentile polygenic risk score (PRS). At high PRS the likelihood of incidence increases nonlinearly, and at low PRS the likelihood decreases nonlinearly. Red curve is theoretical, modeling case and control populations with normal distributions shifted in mean PRS. Blue data points are calculated using individuals (not used in training) binned by PRS. Reproduced from [2].
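The theoretical curve described in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in [2]. It assumes that PRS is normally distributed with unit variance in both controls and cases, that the case distribution is shifted upward by `shift` standard deviations, and that the shift is related to AUC through the equal-variance normal identity AUC = Φ(shift/√2). The function names, the example AUC of 0.6, and the 10 percent prevalence are illustrative assumptions, and scores are expressed in control standard deviations rather than population percentiles.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def shift_from_auc(auc: float) -> float:
    """Mean shift (in SD units) between case and control PRS distributions
    that yields the given AUC, assuming equal-variance normal distributions."""
    return sqrt(2.0) * STD_NORMAL.inv_cdf(auc)


def risk_at_score(score: float, shift: float, prevalence: float) -> float:
    """P(case | PRS = score) from Bayes' rule, with controls ~ N(0, 1)
    and cases ~ N(shift, 1)."""
    case_density = NormalDist(shift, 1.0).pdf(score)
    control_density = STD_NORMAL.pdf(score)
    numerator = prevalence * case_density
    return numerator / (numerator + (1.0 - prevalence) * control_density)


if __name__ == "__main__":
    shift = shift_from_auc(0.60)  # assumed AUC, for illustration only
    prevalence = 0.10             # assumed prevalence, for illustration only
    for z in (-2.0, 0.0, 2.0, 3.0):
        risk = risk_at_score(z, shift, prevalence)
        print(f"PRS z = {z:+.1f}: risk = {risk:.3f}, "
              f"relative to prevalence = {risk / prevalence:.2f}")
```

Under these assumptions the ratio of case to control densities grows exponentially with the score, which is one way to see why risk in the tails can differ strongly from the population average even when AUC is modest.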
There is already significant interest in the application of PRS in a clinical setting, for example to identify high risk individuals who might receive early screening or preventative care [2, 13–24]. As a concrete example, women with high PRS for breast cancer can be offered early screening, which is already standard of care for those with BRCA risk variants [25, 26]. However, BRCA mutations affect no more than a few women per thousand in the general population [27–29]. Importantly, the number of (BRCA negative) women who are at high risk for breast cancer due to polygenic effects is an order of magnitude larger than the population of BRCA carriers [2, 10, 30–34]. From this one example it is clear that significant medical, public health, and cost benefits could result from PRS (e.g. [35]). It is well known that patients with atherosclerotic diseases, coronary artery disease (CAD), and lung diseases can benefit from early intervention [36–38]. In many instances where early treatment can be beneficial, stratification by age, gender, and ethnicity shows an exacerbation of poor outcomes [39]. Precision genetics is already used in identification of candidates for early intervention, and will become widespread in the near future (cf. Myriad's riskScore test and other examples [33, 34]). In figure 4, we illustrate the predicted risk of breast cancer and coronary artery disease as a function of age for high, medium and low risk groups, respectively.
[Plots: percent with condition versus age (bins from under 45 to over 69) for breast cancer and CAD; groups: top 4% PRS, middle (40% to 60%) PRS, bottom 4% PRS.]
Figure 4:
Breast cancer and coronary artery disease (CAD) risk progression with age and polygenic score (PRS). Outlier individuals with unusually high (top 4 percent) PRS are much more likely to be diagnosed with the condition than those with typical (40th to 60th percentile) PRS. Low PRS is associated with reduced incidence of the condition. Similar results are available for 20 or more common disease conditions, and have obvious utility for early screening, diagnosis, and prevention. PRS obtained from [10] and scored on a population within the UK Biobank [40].
It should be stressed that here we focus on purely genomic risk scores and correlations. That is, we are focused on relative genetic risk from SNP information alone. These results can be easily combined with information from other biomarkers (e.g., blood test results) or health-related indicators such as age, sex, BMI, blood pressure, etc. to obtain even stronger risk stratification [2, 10, 11, 15].
PRS with similar utility for risk outlier identification have been developed for psychiatric conditions such as autism and schizophrenia [41, 42].
There are now many validations of polygenic prediction in the scientific literature, conducted using groups of people born on different continents and in different decades than the original populations used in training [43, 44]. Here we discuss results showing that predictors can differentiate between siblings (which one has heart disease? is taller?), despite similarity in childhood environments and genotype. The predictors work almost as well in pairwise sibling comparisons as in comparisons between randomly selected strangers.
We tested a variety of polygenic predictors using tens of thousands of genetic siblings for whom we have SNP genotypes, health status, and phenotype information in late adulthood. Siblings have typically experienced similar environments during childhood, and exhibit negligible population stratification relative to each other. Therefore, the ability to predict differences in disease risk or complex trait values between siblings is a strong test of genomic prediction in humans. We compare validation results obtained using non-sibling subjects to those obtained among siblings and find that typically most of the predictive power persists in within-family designs. Given one sibling with normal-range PRS (less than 84th percentile) and one sibling with high PRS (top few percentiles), the predictors identify the affected sibling about 70-90 percent of the time across a variety of disease conditions, including breast cancer, heart attack, diabetes, etc. For height, the predictor correctly identifies the taller sibling roughly 80 percent of the time when the (male) height difference is 2 inches or more. Some disease prediction results are illustrated in figure 5, while a sibling validation of height prediction can be seen in figure 6.
[Plots: fraction correct versus Δ z score for CAD, type 1 diabetes, hypothyroidism, and hypertension; series: sibling pairs, random pairs.]
Figure 5:
Predictors tested on random (non-sibling) pairs and affected sibling pairs with a single case, for conditions coronary artery disease (CAD), type 1 diabetes (T1D), hypothyroidism, and hypertension. One individual in each pair is high risk (i.e., has a high polygenic risk score) and the other is normal risk (PRS < +1 SD, i.e., below the 84th percentile).
[Plot: fraction correct versus Δ z score for height; series: sibling pairs, random pairs.]
Figure 6:
Probability that polygenic predictor correctly identifies the taller individual (vertical axis) for pairs of random individuals and pairs of siblings. Horizontal axis shows absolute difference in z-scored height between the individuals in each pair. Quality of prediction is very similar between pairs of random individuals and sibling pairs, despite the siblings having experienced more similar childhood environments and sharing more alleles in common. Error bands include uncertainty due to limited numbers of individuals in each z-score bin. Reproduced from [45].
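The pairwise test summarized in Figure 6 can be sketched as follows. This is an illustrative reconstruction rather than the analysis code of [45]: each pair is represented by hypothetical (predicted, actual) z-scores for its two members, pairs are binned by the absolute difference in actual z-score, and the fraction of pairs that the predictor orders correctly is reported per bin. The function name, the bin width, and the toy data are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One individual is a (predicted_z, actual_z) tuple; a pair holds two individuals.
Individual = Tuple[float, float]
Pair = Tuple[Individual, Individual]


def fraction_correct_by_bin(pairs: Iterable[Pair],
                            bin_width: float = 0.5) -> Dict[float, float]:
    """Fraction of pairs in which the predictor ranks the two individuals
    in the same order as their actual values, binned by |actual difference|.
    Keys of the result are the lower edges of the bins."""
    correct: Dict[float, int] = defaultdict(int)
    total: Dict[float, int] = defaultdict(int)
    for (pred_a, actual_a), (pred_b, actual_b) in pairs:
        actual_diff = actual_a - actual_b
        pred_diff = pred_a - pred_b
        if actual_diff == 0 or pred_diff == 0:
            continue  # ties carry no ordering information
        bin_start = int(abs(actual_diff) // bin_width) * bin_width
        total[bin_start] += 1
        if (actual_diff > 0) == (pred_diff > 0):
            correct[bin_start] += 1
    return {b: correct[b] / total[b] for b in sorted(total)}


if __name__ == "__main__":
    toy_pairs = [
        ((1.0, 1.2), (0.0, 0.9)),  # small actual difference, ordered correctly
        ((0.0, 1.0), (0.4, 0.8)),  # small actual difference, ordered wrongly
        ((1.0, 2.0), (0.0, 0.9)),  # larger actual difference, ordered correctly
    ]
    print(fraction_correct_by_bin(toy_pairs))
```

Running the same function separately on sibling pairs and on randomly matched pairs gives the two curves that the figure compares.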
Sparse learning algorithms have been successfully applied to construct genomic predictors [5, 46–48]. These algorithms incorporate a prior that SNPs which materially affect the trait are only a small fraction of the (typically of order a million or more) candidate SNPs. In other words, the algorithms favor parsimony in the construction of models for genetic risk. This prior has been confirmed by many studies: most of the population variance for even the most polygenic traits (e.g., human height) is captured by at most tens of thousands of SNPs [49]. Although 10k is a large number, it is small compared to the millions of candidate common SNPs (i.e., polymorphisms found in at least ∼ […] within a specific ancestry group, as self-reported or according to PCA clustering. We elaborate more on cross-ancestral studies later in this section; for this method, both training and validation are performed within a specified ancestry.
Predictor training: In order to avoid difficulties arising from population structure, training is performed in a homogeneous population with similar ancestry. Standard tools using, e.g., principal components analysis, allow efficient categorization by ancestry. The set of candidate SNPs is typically either the full set of SNPs directly measured by the genotyping array, or a larger set obtained by imputation. Basic quality control is performed to avoid using extremely rare variants and poorly genotyped participants. The weights $\beta_j$ are chosen by minimizing the objective function
$$O = |\vec{y} - X\vec{\beta}|^2 + \lambda \sum_j |\beta_j|$$
for phenotype vector $\vec{y}$ and matrix $X$ of SNP genotypes for each sample. $\lambda$ is a hyperparameter of the model which tunes the level of sparsity imposed. The predictors are trained using $k$-fold cross-validation: a small subset of data is withheld from training and used for model selection; the entire process is repeated $k$ times with a different subset withheld every time.
Score/Validation: Each trained predictor is scored on its corresponding validation subset withheld from its $k$-fold training. We typically use standard prediction metrics such as AUC and explained variance to determine optimal parameter settings and to select top predictors from each cross-validation fold.
Evaluation: Once the optimal predictors have been selected they can be evaluated in a number of ways. Typically we use evaluation data sets composed of (1) individuals of similar ancestry to the training set, but not used in training, (2) individuals of adjacent ancestry not used in training (e.g., Eastern or Southern Europeans, adjacent to British / North-West Europeans), (3) individuals of generally similar ancestry but collected in entirely different cohorts, sometimes from another continent, (4) distant ancestry groups (e.g., non-Europeans of a specific ancestry such as East Asian or African), and most prominently (5) siblings of generally European ancestry (see [45] for a full paper on sibling evaluation). We again compute standard prediction metrics such as AUC, but also more relevant measures such as the absolute probability (rate of incidence) of the condition in outlier subgroups such as top few percent PRS, as illustrated in figure 3.
Predictors built with L1 penalization typically have between 100 and 20k active SNPs, depending on phenotype, distributed over many chromosomes (see figure 7 for the example of height). The predictor performance varies with the trait but is always strongly dependent on the sample size available for training. Empirically, we find an approximate sample size dependence $\sim \frac{N}{N+b}$ with the asymptotic behavior determined by the (linear/narrow-sense) heritability of the trait in question and by the number of candidate SNPs. In practice, there is typically a very steep gain in prediction power as the sample sizes grow from small to moderate, say from 1k to 50k samples. It is then followed by a region of less dramatic but steady performance gains until it eventually flattens out for extremely large data sets. Figure 1 shows how predictor performance improves with increasing sample sizes for hypertension, hypothyroidism and type 2 diabetes.
[Plot: effect size β versus base pair position along each chromosome.]
Figure 7:
Locations of ∼20k SNPs activated in height predictor on the genome, with individual chromosomes indicated. Vertical axis is effect size $\beta$ of minor allele. Positive $\beta$ value indicates a SNP for which the minor allele is associated with increased height. Reproduced from [5].
We have shown, e.g., for height and heel bone mineral density, that the UKB with its ∼ […]
Prediction across ancestry groups, causal variants
PRS training has so far overwhelmingly been conducted in populations of homogeneous ancestry, typically of European descent, for which large databanks first became available. This is because population stratification (patterns of correlation within the genome, which differ by ancestry) introduces special difficulties in statistical learning (e.g. [64]). Consequently, the majority of predictors have been trained on and work best in European populations. There are a few exceptions in which GWAS works well in diverse populations, e.g. [65], but performance of complex PRS falls off quickly as a function of genetic distance [66, 67]. The implications of this skewed focus are serious, as the majority of the world population, including minorities within countries of predominantly European descent, are left out from these new advancements in health care [68]. It is thus an urgent priority to correct this situation: (1) by building predictors using cohorts from other ancestries (e.g., East Asian or West African) as well as (2) developing techniques that can modify or adapt predictors trained in one ancestry group so that they work well in another, perhaps distant, ancestry group.
Cross-ancestral study and training of predictors provide a unique opportunity to explore the genetic architecture for common diseases, a research area in which important basic questions still remain. While PRS and GWAS identify candidate SNPs which are statistically associated with increased (or decreased) risk, they cannot determine which SNPs have a causal effect on individual biology.
Because SNPs often occur in correlated clusters in the genomes of a given population, there is always some ambiguity concerning whether a specific SNP is causal. These correlation patterns vary across populations.
Current predictors utilize SNPs which are merely tags (i.e., correlated in state) for the actual causal SNPs (or other structure). The quality of the tag may be much weaker in a distant population, causing the predictor to perform much worse. This is a problem to be solved.
By utilizing different patterns of correlation, we may be able to zero in on actual causal SNPs: they are likely to be consistently detected in predictor training across multiple distant populations. In other words, we can turn the problem described above into a tool for detecting candidate causal variants.
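The L1-penalized objective used in the predictor training step of this section, $O = |\vec{y} - X\vec{\beta}|^2 + \lambda \sum_j |\beta_j|$, can be minimized with cyclic coordinate descent. The sketch below is a minimal pure-Python illustration under stated assumptions. It is not the solver used for the predictors discussed here (the text does not specify one), it omits the intercept, genotype standardization, and the $k$-fold selection of $\lambda$, and the function names and toy data are hypothetical.

```python
from typing import List


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X: List[List[float]], y: List[float],
                             lam: float, n_sweeps: int = 100) -> List[float]:
    """Minimize |y - X beta|^2 + lam * sum_j |beta_j| by cyclic coordinate descent.
    X is a list of samples (rows), each a list of SNP values (columns)."""
    n_samples, n_snps = len(X), len(X[0])
    beta = [0.0] * n_snps
    residual = list(y)  # residual = y - X beta, and beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n_samples)) for j in range(n_snps)]
    for _ in range(n_sweeps):
        for j in range(n_snps):
            if col_sq[j] == 0.0:
                continue  # a SNP with no variation carries no information
            # Correlation of SNP j with the residual that excludes its own effect.
            rho = sum(X[i][j] * residual[i] for i in range(n_samples))
            rho += col_sq[j] * beta[j]
            # Exact one-dimensional minimizer of the objective in beta_j.
            new_beta = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n_samples):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


if __name__ == "__main__":
    # Toy data: the first SNP has a large effect, the second a negligible one.
    X = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    y = [3.0, 0.2, 0.0]
    print(lasso_coordinate_descent(X, y, lam=1.0))
```

The soft-threshold step is what sets most weights exactly to zero, which is how the penalty expresses the sparsity prior described in the text; larger values of `lam` leave fewer active SNPs.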
Using the SNPs activated in existing PRS, we can begin to investigate the genetic architecture of common disease risk [71]. There are many detailed results concerning specific conditions, but two general points should be emphasized:
1. Much of the genetic risk identified in polygenic predictors is controlled by variants outside genic (protein coding) regions, and not accessible through exome sequencing. This supports the notion that DNA information storage extends beyond specific genes.
2. The DNA regions used in disease risk predictors so far constructed seem to be largely disjoint, suggesting that most genetic disease risks are largely uncorrelated. It seems possible in theory for an individual to be a low-risk outlier in all conditions simultaneously.
The space of genetic variation is high dimensional, and extends far beyond individual (protein coding) genes. Intuitions about strong pleiotropy are likely wrong: they were developed before we knew anything about real genetic architectures. There seem to be many causal variants that could, in principle, be independently modified, and evidence to date suggests that large portions of genetic variance affecting different human traits and disease risks are independent.
In the final section below, we make some rough estimates concerning the total space of heritable individual differences (including both quantitative traits and disease risks) for humans, assuming approximate independence.
There is significant empirical evidence now that linear predictive models can capture much (nearly all?) of the estimated common SNP heritability of many traits. It may come as a surprise that genetic effects can be approximately additive, given the apparent complexity of biological systems. Nonlinear genetic effects certainly exist and are likely realized in every organism. However, quantitative differences between individuals within a species may be largely due to independent linear effects of specific genetic variants. As reflected in Fisher's Fundamental Theorem of Natural Selection [4], linear effects are the most readily evolvable in response to selection, whereas nonlinear "gadgets" (i.e., mechanisms which depend sensitively on multiple genetic switches) are more likely to be fragile to small changes. Evolutionary adaptations requiring significant changes to nonlinear gadgets are improbable and therefore require exponentially more time than simple adjustment of frequencies of alleles of linear effect. One might say that to first approximation, Biology = linear combinations of nonlinear gadgets, and most of the variation between individuals in a species is due to the (linear) way gadgets are combined, rather than in the realization of different gadgets in different individuals.
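The additive form of such a predictor is simple enough to write down directly. The sketch below is a minimal illustration, assuming genotypes are coded as minor allele counts (0, 1, or 2) and that the predictor stores weights only for its active SNPs. The function name, the SNP identifiers, and the weights are hypothetical, and SNPs missing from a genotype are treated as having zero minor alleles purely for simplicity.

```python
from typing import Dict


def polygenic_score(genotype: Dict[str, int], weights: Dict[str, float],
                    intercept: float = 0.0) -> float:
    """Additive (linear) predictor: intercept plus the sum over active SNPs of
    weight times minor allele count. SNPs absent from genotype count as 0."""
    return intercept + sum(
        weight * genotype.get(snp_id, 0) for snp_id, weight in weights.items()
    )


if __name__ == "__main__":
    # Hypothetical sparse predictor with two active SNPs.
    weights = {"snp_a": 0.5, "snp_b": -0.25}
    # Hypothetical individual; snp_c is genotyped but not used by the predictor.
    genotype = {"snp_a": 2, "snp_b": 1, "snp_c": 2}
    print(polygenic_score(genotype, weights))
```

Because the model is linear, changing the allele count at one SNP shifts the score by that SNP's weight regardless of the state of every other SNP. This is the sense in which the effects are independent.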
Today millions of babies are produced through In Vitro Fertilization (IVF). In most developed countries roughly 3-5 percent of all births are through IVF, and in Denmark the fraction is about 10 percent [72]. But when the technology was first introduced with the birth of Louise Brown in 1978, the pioneering scientists had to overcome significant resistance.
Wikipedia: ...During these controversial early years of IVF, Fishel and his colleagues received extensive opposition from critics both outside of and within the medical and scientific communities, including a civil writ for murder. Fishel has since stated that "the whole establishment was outraged" by their early work and that people thought that he was "potentially a mad scientist". [73]
In the past, parents with more viable embryos than they intended to use made a selection based on very little information, typically nothing more than the appearance or morphology of each blastocyst. With modern technology it has become common to genotype embryos before selection, in order to detect potential genetic issues such as trisomy 21 (Down Syndrome). Parents who are carriers of a single gene variant linked to a Mendelian condition can use genetic screening to avoid passing the risk variant on to their child. Millions of embryos are now genetically tested each year. With polygenic risk prediction, it is possible now to screen against outlier risk for many common disease conditions, not just rare single gene conditions. For example, the overwhelming majority of families with breast cancer history are not carriers of a BRCA risk variant, but rather have elevated polygenic risk. It is now possible for these families to select an embryo with average or even below average breast cancer risk if they so wish.
Beyond IVF embryo selection, the advent of CRISPR and other recent advances in genetic editing suggest that future technologies will permit germ line editing of humans, perhaps leading to consequences in human evolution.
Note that for genomic prediction it is enough to identify SNPs which are correlated to the phenotype. But to achieve the desired effect in editing one must identify the actual causal genetic variants. This step in the research program is highly nontrivial and may take longer to accomplish than the development of the molecular editing tools.
Highly polygenic traits imply a very large reservoir of extant variance already present in the population [4]. It is this extant variance that plant and animal breeders have used to advance agriculture for thousands of years. Roughly speaking, if a trait is controlled by $\sim N$ genetic variants, an increase in phenotype by one population standard deviation corresponds to changing $\sim N^{1/2}$ variants from the (−) to (+) state. Thus the maximum number of standard deviations that can be captured through editing could be as large as $N^{1/2}$!
The population geneticist James Crow of Wisconsin wrote [74]:
The most extensive selection experiment, at least the one that has continued for the longest time, is the selection for oil and protein content in maize (Dudley 2007). These experiments began near the end of the nineteenth century and still continue; there are now more than 100 generations of selection. Remarkably, selection for high oil content and similarly, but less strikingly, selection for high protein, continue to make progress. There seems to be no diminishing of selectable variance in the population. The effect of selection is enormous: the difference in oil content between the high and low selected strains is some 32 times the original standard deviation.
To take another example, wild chickens lay eggs at the rate of roughly one per month. Domesticated chickens have been bred to lay almost one egg per day. (Those are the eggs we have for breakfast!) Of all the wild chickens in evolutionary history, probably not a single one produced eggs at the rate of a modern farm chicken.
The corresponding ethical issues are complex and wide ranging. They deserve serious attention in what may be a relatively short interval before these capabilities become a reality in human genetic engineering and widespread clinical practice. We cannot do them justice here, but they include topics such as: the power dynamic between population geneticists and studied populations [75]; personal and communal identity issues [76]; comparisons of different types of pre-implantation testing [77, 78]; how to develop ethical guidelines in a fast changing field [79, 80]; disease specific concerns [81]; misinformation and media bias [82]; congenital vs adult-onset testing [83]; non-medical testing and sex selection [84, 85]; provider duties and obligations [86]; disparities in health care [67]; intersection of legal and religious concerns with genetics [87, 88]; evolution of the limits of gene editing [89]; ethical statements and disclosures of those using CRISPR [90]; patent competition and human application of CRISPR [91]; animal welfare [92]; and the extensive and intertwining history of genetics and eugenics [93–96].
Each society will decide for itself where to draw the line on human genetic engineering, but we can expect a diversity of perspectives. Almost certainly, some countries will allow genetic engineering, thereby opening the door for global elites who can afford to travel for access to reproductive technology. As with most technologies, such as IVF, the rich and powerful will be the first beneficiaries. Eventually, though, it is possible that many countries will not only legalize human genetic engineering, but even make it a (voluntary) part of their national healthcare systems. The alternative would be inequality of a kind never before experienced in human history.
The Future
We conclude with some predictions concerning future progress, and some unifying theoretical remarks related to high dimensionality and its role in genomic prediction.

Perhaps the easiest prediction to make is that there are still significant gains to be made from simply increasing the training data size. Already for some common disease conditions we can identify outliers (e.g., a few percent of the population) who have well over 50 percent probability of having the disease by late adulthood. This level of prediction will become available for many more conditions, and the size of the identifiable high-risk population will increase considerably. There will be important consequences for early screening, diagnosis, and prevention in medical care. Health insurance may also be transformed: individuals who know their risk profiles have an asymmetric information advantage over insurers. Perhaps we will see a day when no insurer will price a policy without DNA analysis. Or perhaps strong genomic prediction will force societies into a single-payer healthcare structure, in which all risks are pooled.

Next we list some startling developments that can already be anticipated and only await sufficient training data to be realized.

1. Face Recognition: AI algorithms use on the order of 100 features to identify human faces (e.g., distance between the eyes, or from nose to mouth, etc.). From identical twins, we know that each of these features, itself a complex quantitative trait, is highly heritable. With enough training data, already conveniently extracted by face recognition algorithms from ordinary photos, we expect that facial features will be predictable from DNA and hence faces themselves can be reconstructed from DNA alone. This will likely have applications in forensic science (e.g., to solve crimes) as well as in IVF (parents will have an idea of what their child will look like, at different ages, from an embryo genotype). It will also intersect with the ethical use of facial recognition technology and the protection of civil liberties [97–99].

2. Cognitive and Personality traits: Substantial progress has been made in the prediction of cognitive ability [44, 100]. Actual cognitive score correlates roughly 0.3 to 0.4 with predicted score. Quality of prediction appears to be entirely data limited, and we expect these correlations to increase considerably before the regime of diminishing returns is reached. Personality traits such as conscientiousness, extraversion, or agreeableness are known to be highly heritable, and we expect them to be predictable from genotype once sufficient phenotype data become available [101]. In general, behavioral traits are more greatly influenced by environmental factors than other phenotypes, exacerbating the above-mentioned ethical concerns.

3. Longevity: As we mentioned in our discussion of pleiotropy, genetic disease risks seem to be largely (although not completely) independent. This suggests that (at least theoretically) individuals could exist who have low risk across all common disease conditions that impact life expectancy. In the future we may be able to identify outliers in longevity, and understand better the limits to human life span.

We can begin to formulate a “grand unified theory” of human individual differences, using an information theoretic approach and what we already know about the genetic architecture and dimensionality of the space of human variation [102]. Some orders of magnitude:

• ∼10M common SNP differences between two individuals.
• ∼10k SNPs may control most of the variance for a typical complex trait.

We can take the number of common (e.g., MAF > 0.01) SNP differences that is typical between two individuals as an estimate of the amount of genomic information that determines the (heritable) individual differences among humans. In principle, there could be ∼1k (a few thousand) largely independent complex traits with little pleiotropy between them. These might include a hundred common disease or health risks, hundreds of cosmetic traits, including facial and body morphology parameters, dozens of psychometric variables, including personality traits, etc. Clearly, individual differences are well accommodated by a ∼1k dimensional phenotype space embedded in a ∼10M dimensional space of genetic variants.

Of course, it is an unrealistic idealization for the traits to be entirely independent in genetic architecture. We expect to find pleiotropy at some level: a subset of genetic variants will affect more than one trait. But these estimates suggest that a significant part of the genetic variance of each trait can be predicted using existing linear models, and potentially modified (e.g., via editing) independently of the other traits. This is simply a consequence of high dimensionality.

It seems certain that genomic prediction will have significant impacts on health care through better screening, diagnosis, and treatment of almost all important disease conditions. As we have emphasized, the main limiting factor is the availability of sufficiently large data sets with good phenotype information across populations of diverse ancestry. Computational and algorithmic methods are not (at least at the moment) the main constraint on progress. Perhaps some day soon we will have good predictors for almost all heritable individual differences in the human species. The dream of decoding the human genome is within our reach and important societal choices concerning how to apply the results are upon us.
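As a rough numerical illustration of the high-dimensionality argument above, the following Python sketch works through the quoted orders of magnitude (∼10M common SNPs, ∼10k SNPs per trait). It assumes, purely for illustration, that each trait's causal SNPs are placed uniformly at random and independently of every other trait. The function names (expected_overlap, simulate_overlap) and the scaled-down simulation sizes are hypothetical choices and are not taken from the analyses cited in this article. Under that assumption, two traits are expected to share only about 0.1% of their causal variants.

```python
import numpy as np


def expected_overlap(n_snps: int, s: int) -> float:
    """Expected number of SNPs shared by two independent random causal
    sets of size s drawn from n_snps variants (hypergeometric mean)."""
    return s * s / n_snps


def simulate_overlap(n_snps: int, s: int, n_traits: int, seed: int = 0) -> float:
    """Mean fraction of a trait's causal SNPs shared with another trait,
    averaged over all trait pairs, under purely random placement
    (an assumption of this sketch, not a claim about real architectures)."""
    rng = np.random.default_rng(seed)
    causal_sets = [
        set(rng.choice(n_snps, size=s, replace=False).tolist())
        for _ in range(n_traits)
    ]
    fractions = [
        len(causal_sets[i] & causal_sets[j]) / s
        for i in range(n_traits)
        for j in range(i + 1, n_traits)
    ]
    return float(np.mean(fractions))


# Orders of magnitude quoted in the text.
N_SNPS = 10_000_000      # ~10M common SNP differences
S_PER_TRAIT = 10_000     # ~10k SNPs per complex trait

# Number of non-overlapping 10k-SNP blocks that fit in 10M SNPs.
n_independent_traits = N_SNPS // S_PER_TRAIT              # 1000

# Expected pairwise sharing if causal sets are placed at random.
pairwise_shared = expected_overlap(N_SNPS, S_PER_TRAIT)   # 10.0 SNPs
pairwise_fraction = pairwise_shared / S_PER_TRAIT         # 0.001

# Scaled-down simulation with the same sparsity ratio s / n_snps = 1e-3.
sim_fraction = simulate_overlap(n_snps=1_000_000, s=1_000, n_traits=20)

print(n_independent_traits, pairwise_shared, pairwise_fraction, sim_fraction)
```

The simulation uses 1M variants and 1k causal SNPs per trait so that the sparsity ratio matches the full-scale numbers at much lower cost. Real genetic architectures depart from random placement (for example through linkage disequilibrium and shared biology), so the sketch conveys intuition only and is not an estimate of actual pleiotropy.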
Competing Interests
Stephen Hsu is a shareholder of Genomic Prediction, Inc. (GP), and serves on its Board of Directors. Louis Lello is an employee and shareholder of GP. Timothy Raben and Erik Widen have no commercial interests relevant to this research.

References
1. Gibney, E. Google AI algorithm masters ancient game of Go. Nature News 445 (2016) (cit. on p. 2).
2. Lello, L., Raben, T. G., Yong, S. Y., Tellier, L. C. & Hsu, S. D. H. Genomic prediction of 16 complex disease risks including heart attack, diabetes, breast and prostate cancer. Sci Rep
DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). Accessed: 2020-12-14 (cit. on p. 5).
4. Ho, C. M. & Hsu, S. D. Determination of nonlinear genetic architecture using compressed sensing. GigaScience https://doi.org/10.1186/s13742-015-0081-6 (Sept. 2015) (cit. on pp. 5, 17, 18).
5. Lello, L. et al. Accurate genomic prediction of human height. Genetics
et al. Efficient cross-trait penalized regression increases prediction accuracy in large cohorts using secondary phenotypes. Nature Communications. issn: 20411723. http://dx.doi.org/10.1038/s41467-019-08535-0 (2019) (cit. on p. 5).
7. Qian, J. et al. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK biobank. PLoS Genetics. issn: 15537404 (2020) (cit. on p. 5).
8. Yengo, L. et al. A meta-analysis of height in 4.1 million European-ancestry individuals identifies 10,000 SNPs accounting for nearly all heritability attributable to common variants. Oct. 26, 2020 (cit. on p. 5).
9. Kaiser, J. ‘Landmark’ study resolves a major mystery of how genes govern human height (2020) (cit. on p. 5).
10. Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature genetics
et al. Polygenic prediction of weight and obesity trajectories from birth to adulthood. Cell PMC6661115, 587–596 (2019) (cit. on pp. 6, 9).
12. Lewis, C. M. & Vassos, E. Polygenic risk scores: from research tools to clinical instruments. Genome Medicine https://doi.org/10.1186/s13073-020-00742-5 (2020) (cit. on p. 6).
13. Torkamani, A., Wineinger, N. E. & Topol, E. J. The personal and clinical utility of polygenic risk scores. Nature Reviews Genetics 581 (2018) (cit. on p. 7).
14. Liu, L. & Kiryluk, K. Genome-wide polygenic risk predictors for kidney disease. Nature Reviews Nephrology
Nature Reviews Genetics 392 (2016) (cit. on pp. 7, 9).
16. Euesden, J., Lewis, C. M. & O’Reilly, P. F. PRSice: polygenic risk score software. Bioinformatics
Nature Reviews Genetics
et al. Breast cancer risk prediction using a clinical risk model and polygenic risk score. Breast Cancer Research and Treatment
Genome Med (9(96), 2017) (cit. on p. 7).
20. Abraham, G. & Inouye, M. Genomic risk prediction of complex human disease and its clinical application. Current Opinion in Genetics & Development
BMJ Heart
et al. Genomics in clinical practice: lessons from the front lines. Science translational medicine. American Association for the Advancement of Science (2013) (cit. on p. 7).
23. Veenstra, D. L., Roth, J. A., Garrison, L. P., Ramsey, S. D. & Burke, W. A formal risk-benefit framework for genomic tests: facilitating the appropriate translation of genomics into clinical practice. Genetics in Medicine. Nature Publishing Group, 686–693 (2010) (cit. on p. 7).
24. Bowdin, S. et al. Recommendations for the integration of genomics into clinical practice. Genetics in Medicine
Jama
JNCI: Journal of the National Cancer Institute. Oxford University Press, 680–691 (2010) (cit. on p. 7).
27. Offit, K. BRCA Mutation Frequency and Penetrance: New Data, Old Debate. JNCI: Journal of the National Cancer Institute 23 (2006) (cit. on p. 7).
28. Ford, D., Easton, D. F. & Peto, J. Estimates of the gene frequency of BRCA1 and its contribution to breast and ovarian cancer incidence. American Journal of Human Genetics
et al. Prevalence of BRCA1 mutation carriers among U.S. non-Hispanic Whites. Cancer Epidemiol. Biomarkers Prev.
Nature Reviews Genetics
et al. Evaluation of Polygenic Risk Scores for Breast and Ovarian Cancer Risk Prediction in BRCA1 and BRCA2 Mutation Carriers. JNCI: Journal of the National Cancer Institute
et al. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. The American Journal of Human Genetics. Elsevier, 21–34 (2019) (cit. on p. 8).
33. Hughes, E. et al. Development and Validation of a Clinical Polygenic Risk Score to Predict Breast Cancer Risk. JCO Precision Oncology, https://doi.org/10.1200/PO.19.00360 (Aug. 6, 2020) (cit. on p. 8).
34. Myriad — Home (2020 (accessed November 10, 2020)) (cit. on p. 8).
35. Kakushadze, Z., Raghubanshi, R. & Yu, W. Estimating cost savings from early cancer diagnosis. Data 30 (2017) (cit. on p. 8).
36. Farpour-Lambert, N. J. et al. Physical activity reduces systemic blood pressure and improves early markers of atherosclerosis in pre-pubertal obese children. Journal of the American College of Cardiology
et al. Early versus delayed invasive intervention in acute coronary syndromes. New England Journal of Medicine
et al. The Inhaled Steroid Treatment As Regular Therapy in Early Asthma (START) study 5-year follow-up: effectiveness of early intervention with budesonide in mild persistent asthma. Journal of allergy and clinical immunology
Journal of National Black Nurses’ Association: JNBNA
Nature
et al. Identification of common genetic risk variants for autism spectrum disorder. Nature Genetics. issn: 15461718 (2019) (cit. on p. 9).
42. Ripke, S. et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature https://doi.org/10.1038/nature13595 (July 2014) (cit. on p. 9).
43. Wünnemann, F. et al. Validation of Genome-Wide Polygenic Risk Scores for Coronary Artery Disease in French Canadians. Circulation. Genomic and precision medicine. issn: 25748300 (2019) (cit. on p. 9).
44. Belsky, D. W. et al. Genetic analysis of social-class mobility in five longitudinal studies. Proceedings of the National Academy of Sciences E7275–E7284 (2018) (cit. on pp. 9, 21).
45. Lello, L., Raben, T. G. & Hsu, S. D. H. Sibling validation of polygenic risk scores and complex trait prediction. Scientific Reports https://doi.org/10.1038/s41598-020-69927-7 (2020) (cit. on pp. 10, 11, 13).
46. Vattikuti, S., Lee, J. J., Chang, C. C., Hsu, S. D. & Chow, C. C. Applying compressed sensing to genome-wide association studies. GigaScience 10 (2014) (cit. on pp. 11, 14).
47. Vattikuti, S., Lee, J. J., Chang, C. C., Hsu, S. D. H. & Chow, C. C. Applying compressed sensing to genome-wide association studies. GigaScience 10. issn: 2047-217X. http://dx.doi.org/10.1186/2047-217X-3-10 (2014) (cit. on pp. 11, 14).
48. Lee, J. J., Vattikuti, S. & Chow, C. C. Uncovering the genetic architectures of quantitative traits. Computational and Structural Biotechnology Journal PMC4816193, 28–34 (2016) (cit. on p. 11).
49. Lello, L. et al. Accurate genomic prediction of human height. Genetics
et al. The Polygenic Score Catalog: an open database for reproducibility and systematic evaluation. medRxiv (2020) (cit. on p. 12).
51. Privé, F., Aschard, H. & Blum, M. G. Efficient implementation of penalized regression for genetic risk prediction. Genetics
et al. From Disease Association to Risk Assessment: An Optimistic View from Genome-Wide Association Studies on Type 1 Diabetes. PLoS Genetics (ed Visscher, P. M.) e1000678. https://doi.org/10.1371%2Fjournal.pgen.1000678 (2009) (cit. on p. 12).
53. Abraham, G., Kowalczyk, A., Zobel, J. & Inouye, M. Performance and Robustness of Penalized and Unpenalized Methods for Genetic Prediction of Complex Human Disease. Genetic Epidemiology https://doi.org/10.1002/gepi.21698 (Nov. 2012) (cit. on p. 12).
54. Botta, V., Louppe, G., Geurts, P. & Wehenkel, L. Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies. PLoS ONE (ed Chen, L.) e93379. https://doi.org/10.1371%2Fjournal.pone.0093379 (2014) (cit. on p. 12).
55. Okser, S. et al. Regularized machine learning in the genetic prediction of complex traits. PLoS genetics e1004754 (2014) (cit. on p. 12).
56. De los Campos, G., Sorensen, D. & Gianola, D. Genomic Heritability: What Is It? PLOS Genetics (ed Barsh, G. S.) e1005048. https://doi.org/10.1371/journal.pgen.1005048 (May 2015) (cit. on p. 12).
57. Carvalho, C. M., Polson, N. G. & Scott, J. G. The horseshoe estimator for sparse signals. Biometrika
arXiv preprint arXiv:1912.02351 (2019) (cit. on p. 12).
59. Berger, S., Pérez-Rodríguez, P., Veturi, Y., Simianer, H. & de los Campos, G. Effectiveness of shrinkage and variable selection methods for the prediction of complex human traits using data from distantly related individuals. Annals of human genetics
Genetics
et al. Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits. G3: Genes, Genomes, Genetics
et al. Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits. G3: Genes, Genomes, Genetics
Trends in Genetics
et al. Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. Elife e39702 (2019) (cit. on p. 15).
65. Loos, R. J. & Yeo, G. S. The bigger picture of FTO—the first GWAS-identified obesity gene. Nature Reviews Endocrinology
et al. Generalization and Dilution of Association Results from European GWAS in Populations of Non-European Ancestry: The PAGE Study. PLOS Biology https://doi.org/10.1371/journal.pbio.1001661 (Sept. 2013) (cit. on p. 15).
67. Martin, A. R. et al. Human demographic history impacts genetic risk prediction across diverse populations. The American Journal of Human Genetics
et al. Current clinical use of polygenic scores will risk exacerbating health disparities. bioRxiv. eprint: . (2019) (cit. on p. 15).
69. Koyama, S. et al. Population-specific and trans-ancestry genome-wide analyses identify distinct and shared genetic risk loci for coronary artery disease. Nature Genetics. https://doi.org/10.1038%2Fs41588-020-0705-3 (2020) (cit. on p. 16).
70. Dehghani, N., Bras, J. & Guerreiro, R. How understudied populations have contributed to our understanding of Alzheimer’s disease genetics. bioRxiv. eprint: . (2020) (cit. on p. 16).
71. Yong, S. Y., Raben, T. G., Lello, L. & Hsu, S. D. Genetic Architecture of Complex Traits and Disease Risk Predictors. Scientific Reports (2020) (cit. on p. 16).
72. Sundhedsdatastyrelsen. Assisteret reproduktion 2018 tech. rep. Version 1.0 (Ørestads Boulevard 5, 2300 København S, 2020). (cit. on p. 17).
73. Wikipedia contributors. Simon Fishel — Wikipedia, The Free Encyclopedia [Online; accessed 3-December-2020]. 2020. https://en.wikipedia.org/w/index.php?title=Simon_Fishel&oldid=983918723 (cit. on p. 17).
74. Crow, J. F. On epistasis: Why it is unimportant in polygenic directional selection
Culture, Medicine and Psychiatry
The Journal of Law, Medicine & Ethics
Nature Reviews Genetics
Clinical genetics
Assisted reproduction: Ethical and legal issues in Seminars in Fetal and Neonatal Medicine (2014), 264–271 (cit. on p. 19).
80. Treff, N. R. et al. Utility and first clinical application of screening embryos for polygenic disease risk reduction. Frontiers in Endocrinology 845 (2019) (cit. on p. 19).
81. Sabatello, M. & Rasouly, H. M. The ethics of genetic testing for kidney diseases. Nature Reviews Nephrology
et al. A modern approach to the management of candidates for assisted reproductive technology procedures. Minerva ginecologica
et al. Use of preimplantation genetic testing for monogenic defects (PGT-M) for adult-onset conditions: an Ethics Committee opinion. Fertility and sterility
et al. Use of reproductive technology for sex selection for nonmedical reasons. Fertility and Sterility
et al. Disclosure of sex when incidentally revealed as part of preimplantation genetic testing (PGT): an Ethics Committee opinion. Fertility and sterility
et al. Transferring embryos with genetic anomalies detected in preimplantation testing: an Ethics Committee Opinion. Fertility and Sterility
Journal of Medical Ethics
The New England Journal of Medicine 207 (2004) (cit. on p. 19).
89. Brokowski, C. & Adli, M. CRISPR ethics: moral considerations for applications of a powerful tool. Journal of molecular biology
The CRISPR journal
Nature Biotechnology
Journal of Agricultural and Environmental Ethics
Human reproduction
The Oxford handbook of the history of eugenics (2010) (cit. on p. 19).
95. Ekberg, M. The old eugenics and the new genetics compared. Social History of Medicine
Journal of Medical Ethics
Biometric Technology Today
Technology in Society
Journal of information, communication and ethics in society (2004) (cit. on p. 20).
100. Lee, J. J. et al. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature Genetics. issn: 15461718 (2018) (cit. on p. 21).
101. Jang, K. L., Livesley, W. J. & Vernon, P. A. Heritability of the Big Five Personality Dimensions and Their Facets: A Twin Study. Journal of Personality. issn: 00223506 (1996) (cit. on p. 21).
102. Meisner, A. et al. Combined Utility of 25 Disease and Risk Factor Polygenic Risk Scores for Stratifying Risk of All-Cause Mortality.