Polygenic Risk Score in Africa Population: Progress and challenges
Yagoub Adam, Suraju Sadeeq, Judit Kumuthini, Olabode Ajayi, Gordon Wells, Rotimi Solomon, Olubanke Ogunlana, Emmanuel Adetiba, Emeka Iweala, Benedikt Brors, Ezekiel Adebiyi
Covenant University Bioinformatics Research (CUBRe), Covenant University, Ota, Ogun State, Nigeria
Dept of Computer & Information Sciences, Covenant University, Ota, Ogun State, Nigeria
Covenant Applied Informatics and Communication Africa Centre of Excellence (CApIC-ACE), Covenant University, Ota, Ogun State, Nigeria
Dept of Biochemistry, Covenant University, Ota, Ogun State, Nigeria
Dept of Electrical & Information Engineering (EIE), Covenant University, Ota, Ogun State, Nigeria
Centre for Proteomic and Genomic Research, Cape Town, Western Cape, South Africa
HRA, Institute for Systems Science, Durban University of Technology, Durban, South Africa
German Cancer Consortium (DKTK), Heidelberg, Germany
Applied Bioinformatics Division, German Cancer Research Center (DKFZ), Heidelberg, 69120, Germany
Yagoub Adam and Suraju Sadeeq are joint first authors.
* [email protected]
Abstract
Polygenic risk score (PRS) analysis is a powerful method that has been used to estimate an individual's genetic risk for targeted traits. PRS analysis can be used to obtain evidence of a genetic effect beyond Genome-Wide Association Study (GWAS) results, i.e. when there are no significant markers. PRS analysis has been widely applied to investigate the genetic basis of several traits, including rare diseases. However, the accuracy of PRS analysis depends on the genomic data of the underlying population. For instance, several studies have shown that obtaining high prediction power from PRS analysis is challenging for non-Europeans. In this manuscript, we reviewed the conventional PRS methods and their application to sub-Saharan African communities. We concluded that the limiting factor in applying PRS analysis to sub-Saharan populations is the lack of sufficient GWAS data. We also recommended developing African-specific PRS tools.
Keywords
Prediction medicine, GWAS, post-GWAS, PRS analysis, Africa population
Author summary
PRS analysis is currently not applicable to African communities because of the sparsity of Genome-Wide Association Study (GWAS) data for non-European ancestries. Therefore, extending the current PRS tools to handle diverse multi-ethnic data is crucial for estimating PRS values for personalized medicine across ethnic groups.
February 18, 2021. arXiv [q-bio.GN].
Introduction
Genome-wide association studies (GWAS) have been applied successfully to identify associations between hundreds of genomic variants and complex human traits [1]. In general, GWAS report single nucleotide polymorphisms (SNPs) as statistically significant genomic variants associated with the trait of interest when their p-values are less than 5e-09 in African populations (a threshold that also depends statistically on the number of SNPs analyzed) [2]. The statistically significant SNPs reported by GWAS are used to understand the biomolecular mechanisms of many phenotypic traits, including various human diseases. However, because of this statistical threshold, GWAS may fail to detect SNPs that are associated with low or moderate risks [3, 4]. Filtering out variants associated with low disease risk increases the GWAS false negative rate. Also, traditional GWAS cannot capture the polygenic nature of many complex traits [5]. Therefore, several post-GWAS approaches have been introduced to overcome these limitations [6, 7]. Because of privacy issues that restrict access to individual-level GWAS data sets, most post-GWAS approaches require only GWAS summary statistics. Public resources for GWAS summary statistics include the GWAS Catalog [8], GWAS Central [9] and the dbGaP database [10, 11]. One distinct post-GWAS approach is PRS analysis. PRS methods map genotype data, weighted using GWAS summary statistics, into a single variable that estimates an individual-level risk score for the phenotypic trait.
PRS analysis is used to predict an individual's heritable risk by incorporating all SNPs [12]. Therefore, a precise PRS value obtained from case-control studies can be used in personalized medicine. However, challenges exist when translating PRS values to clinical care [13]. To perform PRS analysis successfully, two distinct GWAS data sets are required.
The first sample (the training or base sample) is used to select the SNPs for PRS analysis. The second sample (the validation or target sample) is used to evaluate the prediction value of PRS methods. The following PRS approaches are discussed in the literature: (i) weighted methods that consider the effect sizes derived from GWAS results, (ii) unweighted methods that consider single-marker analysis, and (iii) shrinkage methods that consider multivariate analysis. In this review, we focused on the tools and methods that perform PRS analysis and on their applications in understanding the predictive power of PRS analysis. The reviewed PRS tools were chosen based on the following criteria:
1. The approach must perform PRS analysis based on "base" (GWAS) data (summary statistics) and "target" data (genotypes and phenotypes of each individual in the target sample).
2. The approach may involve linkage disequilibrium pruning.
3. The method or approach should be readily available in the form of a tool or package, so that it can be executed on any data set.
Besides reviewing the PRS methods, we aim to investigate the application of PRS analysis to African populations. It is important to note that the term "African population" includes all those whose ancestors are African (i.e. including Africans in the diaspora). However, in this manuscript, we review only the PRS studies on sub-Saharan Africa.
The different approaches under the umbrella of PRS analysis are presented in Figure 1 and Table 1. PRS methods can be categorized into two groups: (i) Bayesian-based methods and (ii) non-Bayesian methods. PRS methods can also be classified into two groups with respect to their treatment of linkage disequilibrium (LD): (i) PRS methods that incorporate LD, and (ii) PRS methods that apply LD pruning. However, to ease the understanding of their underlying algorithms, we grouped the PRS analysis approaches into four (see Table 2):
1. Clumping with thresholding (C+T), the standard approach to polygenic score analysis.
2. p-value thresholding approach.
3. Penalized regression approach.
4. Bayesian shrinkage approach.
In practice, the prediction accuracy of PRS analysis tends to improve if the markers are LD pruned. However, as noted by Chatterjee et al. [14], the absence of linkage disequilibrium (LD) limits the predictive accuracy of PRS analysis. In addition, a simulation performed by Vilhjálmsson et al. [15] shows that, in the presence of LD, the widely used approach of LD pruning followed by p-value thresholding (P+T) under-predicts the heritability explained by the SNPs. One method that incorporates LD is LDpred (Subsection 4.0.1), a Bayesian approach designed for the presence of LD. If loci are linked, the posterior mean effect can be derived analytically under a Gaussian infinitesimal prior. An arguably more reasonable prior for the effect sizes is a non-infinitesimal model in which only a fraction of the markers are causal. For this reason, consider the following Gaussian mixture prior:
$$\beta_i \overset{\text{iid}}{\sim} \begin{cases} N\left(0, \dfrac{h_g^2}{Mp}\right) & \text{with probability } p, \\ 0 & \text{with probability } 1-p, \end{cases} \tag{1}$$
where $p$ is the probability that a marker is drawn from a Gaussian distribution, i.e. the fraction of causal markers, $M$ is the number of markers and $h_g^2$ is the heritability explained by the genotypes. Under the infinitesimal prior (the special case $p = 1$), the posterior mean within an LD region can be approximated as
$$E\left(\beta^{l} \mid \tilde{\beta}^{l}, D\right) \approx \left(\frac{M}{N h_g^2} I + D_l\right)^{-1} \tilde{\beta}^{l}, \tag{2}$$
where $N$ is the training sample size, $D_l$ denotes the regional LD matrix within the region of LD and $\tilde{\beta}^{l}$ denotes the least-squares estimated effects within that region. The approximation assumes that the heritability explained by the region is small and that LD with SNPs outside of the region is negligible.
P+T methods are non-Bayesian approaches that apply informed LD pruning (LD clumping) (Figure 1). They are referred to as pruning and thresholding (P+T) PRS methods.
For instance, markers can first be pruned using a univariate regression coefficient (r) threshold of 0.2, after which p-value thresholding is applied. The p-value threshold is optimized over a grid with respect to prediction accuracy in the validation data. LD pruning that preferentially removes the less significant marker can yield more accurate predictions than pruning random markers. For a given p-value threshold, only SNPs that are statistically significant in the GWAS at that threshold are included. This technique effectively shrinks the estimates of all omitted SNPs to zero and performs no shrinkage on the effect size estimates of the included SNPs. Because the optimal p-value threshold is unknown a priori, the PRS is usually computed over a range of thresholds, its association with the target phenotype is evaluated at each threshold, and the prediction is optimized accordingly. This technique can be interpreted as a variable selection process that performs forward selection ordered by GWAS p-value, with a step size determined by the increment in the p-value threshold.
Bayesian approaches have been used to explicitly model pre-existing genetic architecture, accounting for the distribution of effect sizes with a prior that should improve the accuracy of a polygenic score. The main advantage of Bayesian-based PRS analysis is its ability to improve genomic risk prediction from summary statistics by taking into account linkage disequilibrium (LD) among markers [16].
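To make the role of LD in a Bayesian estimator concrete, the infinitesimal posterior mean of Eq. (2) can be sketched for a single LD region. This is a minimal illustration under stated assumptions, not the LDpred implementation: the function names, the toy two-SNP LD matrix and the values of M, N and the heritability are all hypothetical, and a small pure-Python solver stands in for a linear algebra library.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the caller's inputs are left untouched.
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def infinitesimal_posterior_mean(beta_marginal, ld_matrix, n_markers, n_samples, h2):
    """Eq. (2): (M / (N * h2) * I + D_l)^-1 applied to the marginal effects."""
    ridge = n_markers / (n_samples * h2)
    size = len(ld_matrix)
    shifted = [
        [ld_matrix[i][j] + (ridge if i == j else 0.0) for j in range(size)]
        for i in range(size)
    ]
    return solve_linear_system(shifted, beta_marginal)


# Hypothetical region: two SNPs in LD (r = 0.5) with marginal GWAS effects.
ld = [[1.0, 0.5], [0.5, 1.0]]
beta_tilde = [0.10, 0.05]
posterior = infinitesimal_posterior_mean(
    beta_tilde, ld, n_markers=1000, n_samples=10000, h2=0.5
)
print(posterior)
```

In this toy case the second SNP's effect is shrunk much more strongly than the first, because most of its marginal signal is explained by LD with its neighbour.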
The EB-PRS method is a novel approach based on Empirical Bayes theory that incorporates information across markers to improve prediction accuracy [17]. It aims to minimize the prediction error by leveraging the estimated distribution of effect sizes. Assuming the SNPs are independent, the optimal PRS value (in terms of achieving the best classification accuracy) is
$$S = \beta^{T} X = \sum_{i=1}^{m} \beta_i X_i, \tag{3}$$
where $m$ is the total number of genotyped SNPs, $X_i$ is the genotypic value and $\beta_i$ is the log-odds ratio (OR) of the $i$th SNP. The log-OR is a measure of the effect size defined by
$$\beta_i = \log\left(\frac{f_{i1}(1 - f_{i0})}{f_{i0}(1 - f_{i1})}\right), \tag{4}$$
where $f_{i0}$ and $f_{i1}$ are the reference allele frequencies among controls and cases respectively. If the SNP is not associated with disease, then $\beta_i = 0$.
In practice, the true values of the effect sizes are usually unknown and need to be estimated from the data. Song et al. [17] use the Empirical Bayes approach to estimate $\beta$ as the minimizer of the Bayes risk under the distribution estimated from the data. The estimators can be derived directly from GWAS summary statistics. Compared to other improved genetic risk prediction methods [15, 18], the EB-PRS method does not require external panels or datasets. While there are other methods that utilize effect size distributions for PRS value calculations, methods in this category such as [16, 19] have no tuning parameters or external input. However, the EB-PRS method has a theoretical superiority over the existing methods in this category in terms of minimizing the prediction error. The method was applied to six complex disease traits, namely asthma (AS), breast cancer (BC), celiac disease (CEL), Crohn's disease (CD), Parkinson's disease (PD) and type-2 diabetes (T2D), to illustrate the improved risk prediction performance in real data (Table 1).
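Equations (3) and (4) can be illustrated with a short sketch. It computes plain log-odds ratios from allele frequencies and sums them into a score; it does not perform the Empirical Bayes shrinkage that defines EB-PRS. The function names and the allele frequencies are hypothetical and chosen only for illustration.

```python
import math


def log_odds_ratio(freq_controls, freq_cases):
    """Eq. (4): log-OR from reference allele frequencies in controls and cases."""
    odds_cases = freq_cases / (1.0 - freq_cases)
    odds_controls = freq_controls / (1.0 - freq_controls)
    return math.log(odds_cases / odds_controls)


def polygenic_score(effect_sizes, genotypes):
    """Eq. (3): S = sum_i beta_i * X_i for one individual."""
    if len(effect_sizes) != len(genotypes):
        raise ValueError("effect_sizes and genotypes must have the same length")
    return sum(beta * x for beta, x in zip(effect_sizes, genotypes))


# Hypothetical (control, case) reference allele frequencies for three SNPs.
frequencies = [(0.20, 0.50), (0.30, 0.30), (0.40, 0.35)]
betas = [log_odds_ratio(f0, f1) for f0, f1 in frequencies]

# Genotypic values (reference allele counts) for one hypothetical individual.
individual = [2, 1, 0]
print(betas, polygenic_score(betas, individual))
```

The second SNP has equal frequencies in cases and controls, so its log-OR is zero and it contributes nothing to the score, as stated after Eq. (4).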
Furthermore, the authors reported significant improvement when comparing the EB-PRS method with the other methods considered: the unadjusted PRS method, P+T, LDpred-inf, LDpred and the method of Mak et al. [19]. Although the EB-PRS method can achieve better performance without tuning any parameters or utilizing external information, its performance may be improved with external information, e.g. the LD information used in LDpred. Also, in order to increase the prediction accuracy, Song et al. [17] suggested that the EB-PRS method could be further improved in the future by incorporating other available data, such as annotations or GWAS summary statistics for genetically correlated traits.
The PRS-CS method is based on a Bayesian high-dimensional regression framework for polygenic modeling and prediction:
$$Y_{N \times 1} = X_{N \times M}\, \beta_{M \times 1} + \epsilon_{N \times 1}, \tag{5}$$
where $N$ and $M$ denote the sample size and the number of genetic markers respectively, $Y$ is a vector of traits, $X$ is the genotype matrix, $\beta$ is a vector of effect sizes for the genetic markers and $\epsilon$ is a vector of residuals. By assigning appropriate priors to the regression coefficients $\beta$ to impose regularization, an additive PRS value can be calculated using posterior mean effect sizes.
Unlike LDpred [15] and the recently developed normal-mixture models [20, 21], which can incorporate genome-wide markers with varying genetic architectures with enhanced performance and flexibility, the PRS-CS method utilizes a Bayesian regression framework and places a conceptually different class of priors, the continuous shrinkage (CS) priors, on SNP effect sizes [22]. Continuous shrinkage priors allow for marker-specific adaptive shrinkage, i.e.
the amount of shrinkage applied to each genetic marker adapts to the strength of its association signal in the GWAS, which can accommodate diverse underlying genetic architectures.
Feng & Smoller [22] presented the PRS-CS-auto method, a fully Bayesian approach that enables automatic learning of a tuning parameter, $\phi$, from GWAS summary statistics. Analyses conducted on Biobank data indicate that, for many disease phenotypes, current GWAS sample sizes may not be large enough to learn $\phi$ accurately, so the prediction accuracy of PRS-CS-auto may be lower than that of PRS-CS and LDpred. However, simulation studies and quantitative trait analyses suggest that PRS-CS-auto can be useful when the training sample size is large or when an independent validation set is difficult to acquire. Although the PRS-CS method provides a substantial improvement over existing methods for polygenic prediction [15], the current prediction accuracy of PRS values is still lower than what can be considered clinically useful. Much work is still needed to further improve the predictive performance and translational value of PRS methods.
Recent studies [23–25] argued that jointly modeling multiple genetically correlated traits and functional annotations in polygenic modeling is expected to increase the predictive performance of PRS methods.
Since SNP effects are estimated with uncertainty and not all SNPs have an impact on the trait, using unadjusted effect size estimates of all SNPs can lead to a poorly estimated PRS with a high standard error [26]. Two shrinkage strategies have been implemented to address these problems: shrinkage of the effect estimates of all SNPs by adapted statistical techniques, and use of p-value filtering thresholds as the criterion for inclusion of SNPs.
PRS methods that perform shrinkage of all SNPs [15, 27] typically apply shrinkage/regularisation techniques such as LASSO/ridge regression [27], or Bayesian approaches that perform shrinkage through the specification of a prior distribution [15]. Varying degrees of shrinkage may be achieved under different methods or parameter settings. The most suitable shrinkage depends on the underlying mixture of null and true effect size distributions, which is likely to be a complex mixture that differs by trait. Since the optimal shrinkage parameters are unknown a priori, PRS estimation is usually tuned over a number of parameters. In the case of LDpred, for example, these include the fraction of causal variants [15].
p-value filtering thresholds as the criterion for inclusion of SNPs
In this process, the PRS estimate includes only SNPs with a GWAS p-value below a certain level (a fixed p-value cutoff), while all other SNPs are removed. This method shrinks all omitted SNPs to an estimated effect size of zero and performs no shrinkage on the effect size estimates of the included SNPs. Since the optimal p-value threshold is unknown a priori, the PRS is computed over a range of thresholds, its association with the target trait is tested at each threshold, and the prediction is optimized accordingly. This is analogous to the optimization of tuning parameters in systematic shrinkage techniques. The technique can be regarded as a parsimonious variable selection method: it effectively performs forward selection of variables (SNPs) ordered by GWAS p-value, with step sizes that depend on the increment of the p-value threshold. The chosen "optimal threshold" is therefore optimal only in the sense of this forward selection; a PRS derived from another subset of the SNPs may be more predictive of the target trait. However, given that GWAS are based on millions of SNPs, the number of SNP subsets that could be considered is too large to search.
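The threshold search described above can be sketched as a simple grid search. This is an illustrative toy, not the procedure of any particular tool: the function names, effect sizes, p-values, genotypes, phenotypes and threshold grid are all hypothetical, and the squared Pearson correlation is used here as a stand-in for whatever accuracy measure a real analysis would adopt.

```python
def score_at_threshold(betas, p_values, genotypes, threshold):
    """PRS for one individual using only SNPs whose GWAS p-value is below threshold."""
    return sum(b * g for b, p, g in zip(betas, p_values, genotypes) if p < threshold)


def squared_correlation(xs, ys):
    """Squared Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def best_threshold(betas, p_values, genotype_rows, phenotypes, thresholds):
    """Return the (threshold, R^2) pair with the highest R^2 in the target data."""
    results = []
    for threshold in thresholds:
        scores = [score_at_threshold(betas, p_values, row, threshold)
                  for row in genotype_rows]
        results.append((threshold, squared_correlation(scores, phenotypes)))
    return max(results, key=lambda item: item[1])


# Hypothetical base data (three SNPs) and target data (four individuals).
betas = [0.5, 0.3, -0.4]
p_values = [1e-8, 1e-3, 0.4]
genotype_rows = [[0, 0, 2], [1, 0, 0], [2, 1, 1], [1, 2, 0]]
phenotypes = [0.0, 0.5, 1.3, 1.1]  # driven by the first two SNPs only

print(best_threshold(betas, p_values, genotype_rows, phenotypes, [5e-8, 0.01, 1.0]))
```

In practice the threshold should be chosen in data that are independent of the data used to report final accuracy, since selecting the best threshold in the same sample inflates the apparent performance.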
Usually, association testing in GWAS is done one SNP at a time [26]. Combined with the strong correlation structure across the genome, this makes it very difficult to identify independent genetic effects. Although the power of GWAS can be enhanced by modelling several SNPs jointly [28], this requires raw data on all samples. In general, researchers therefore rely on standard (one-SNP-at-a-time) GWAS summary statistics for polygenic scoring. To estimate the PRS, there are two key options: (i) SNPs are clumped so that the retained SNPs are largely independent of each other, or (ii) all SNPs are included and the linkage disequilibrium (LD) between them is accounted for. In the standard approach to polygenic scoring, option (i) is usually preferred and is combined with p-value thresholding, while option (ii) is commonly preferred in methods that apply conventional shrinkage [15, 19] (see Table 2). Some researchers, however, perform p-value thresholding without clumping, so that the retained SNPs are no longer approximately independent. Although breaking this assumption may lead to only marginal losses in certain situations [19], Choi et al. [26] recommended clumping whenever non-shrunk GWAS effect size estimates are used. The standard method tends to perform comparably to more advanced approaches [15, 19], possibly because the clumping step captures conditionally independent effects. However, a criticism of clumping is that researchers usually use an arbitrarily chosen correlation threshold to eliminate SNPs in LD [29]. No technique is free of arbitrary choices, so this could be an area for further development of the classical method.
Fig 1. A general PRS analysis workflow. This is a typical polygenic risk score analysis workflow showing base data, target data and encapsulating different approaches.
Using summary statistics and individual-level genotype and phenotype data, approaches such as lasso/ridge regression, clumping and p-value thresholding can be employed to increase the predictive accuracy of PRS analysis. In addition, results may predict health or disease risk and provide insights for effective therapeutic interventions.
Table 1. Summary of polygenic risk score tools.
Tool | Approach | Computational platform | User friendly | Functionality | Link
LDpred [15] | Bayesian shrinkage prior | Python | Difficult | Uses a prior on effect sizes and LD information from an external reference panel | https://bitbucket.org/bjarni_vilhjalmsson/ldpred
PRS-CS [22] | Bayesian regression framework | Python | Difficult | Utilizes a high-dimensional Bayesian regression framework, placing a continuous shrinkage (CS) prior on SNP effect sizes | https://github.com/getian107/PRScs
EB-PRS [17] | Empirical Bayes approach | R | Difficult | A novel method that leverages information on effect sizes across all the markers | R CRAN
AnnoPred [18] | Bayesian shrinkage prior | Python | Difficult | A framework that leverages diverse types of genomic and epigenomic functional annotations | https://github.com/yiminghu/AnnoPred
PRSice [30] | Clumping + thresholding (C+T) | R | Difficult | Calculates, applies, evaluates and plots the results of PRS analysis | http://PRSice.info
PRSice2 [31] | Clumping + thresholding (C+T) | C++, R | Easy | An efficient and scalable software program for automating and simplifying PRS analyses on large-scale data | http://PRSice.info
LDpred2 [32] | Bayesian shrinkage | R | Difficult | A faster and more robust implementation of LDpred in the R package bigsnpr | https://privefl.github.io/bigsnpr/articles/LDpred2.html
BSLMM [33] | Bayesian sparse linear mixed model | R | Difficult | Prior specification for the hyper-parameters and a novel Markov chain Monte Carlo algorithm for posterior inference | http://stephenslab.chicago.edu/software.html
BayesR [21] | Hierarchical Bayesian mixture model | Fortran | Difficult | Bayesian mixture model that simultaneously allows variant discovery and estimation of the genetic variance explained by all variants | https://github.com/syntheke/bayesR
DPR software [34] | Latent Dirichlet process regression model | C++ | Easy | Dirichlet process regression to flexibly and adaptively model the effect size distribution | -
SMTpred [35] | - | Python | Difficult | Combines SNP effects or individual scores from multiple traits according to their sample size, SNP-heritability ($h^2$) and genetic correlation ($r_G$) | https://github.com/uqrmaie1/smtpred
Lassosum [19] | Penalised regression | R | Difficult | A method for constructing PGS using summary statistics and a reference panel in a penalized regression framework | https://github.com/tshmak/lassosum
Plink [36] | p-value thresholding approach | C/C++ | Easy | Open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics | http://zzz.bwh.harvard.edu/plink/
Table 2. Comparison of different approaches for performing PRS analyses.
Key factor | p-value thresholding without clumping | Penalised regression | Clumping + thresholding (C+T) | Bayesian shrinkage prior
Controlling for linkage disequilibrium | N/A | LD matrix is integral to algorithm | Clumping | Shrink effect sizes with respect to LD
Shrinkage of GWAS effect size estimates | p-value threshold | LASSO, Elastic Net, penalty parameters | p-value threshold | Prior distribution, e.g. fraction of causal SNPs
The key cause of confounding in GWAS (post-QC) is population structure, hence the possibility of false-positive results [26]. In general, structure in mating patterns induces structure in genetic variation that is closely associated with geographic location. Furthermore, environmental risk factors may be structured in a similar manner, creating the possibility of associations between genetic variants and the traits examined that are confounded by, for example, location [37, 38]. Usually, this issue is addressed in GWAS by adjusting for principal components (PCs) [37] or by using mixed models [39]. However, population structure presents a potentially greater problem in PRS analysis, since a large number of null variants are usually included in the PRS calculation. For example, allele frequencies may differ systematically between the base and target data as a result of genetic drift or the ascertainment of genotyped variants [40]. In addition, there is a danger that differences in null SNPs will produce a correlation between the PRS and the target trait if the distributions of environmental risk factors for the phenotype also differ between the base and target data; both conditions are highly probable in most PRS studies.
Even if the GWAS had fully controlled for population structure, confounding may be reintroduced, because correlated differences in allele frequencies and risk factors between the base and target data are not accounted for. Controlling for structure in the PRS analysis should be sufficient to prevent false positives if the base and target samples are drawn from the same or genetically similar populations. Choi et al. advised caution, given that drastic differences in the distribution of PRS between populations have been observed [?, 40, 41]. Such observations do not imply large differences in aetiology between populations, although genuine differences are likely to contribute owing to geographical, cultural and selection pressure variations. They do, however, call into question the reliability of using base and target data from different populations in PRS studies that do not address the confounding generated by geographical stratification [41]. It is also important to be aware that, with very large sample sizes, highly significant results can arise from subtle confounding. Problems of population structure are as significant as the genetic and environmental differences between individuals in the base and target populations. In the coming years, the generalizability of PRS methods across populations is expected to be an active field [?, 42].
PRS tools
The following sections describe some of the PRS tools that are commonly used to perform PRS analysis.
LDpred infers the posterior mean effect size of each marker by using a prior on effect sizes and LD information from an external reference panel [15]. It calculates the posterior mean effects from GWAS summary statistics by conditioning on a genetic architecture prior and on LD information from a reference panel. The inner product of these re-weighted effects and the test-sample genotypes is the posterior mean phenotype which, under the model assumptions and available data, is an optimal (minimum variance and unbiased) predictor. The prior on the effect sizes is a point-normal mixture distribution, which allows for non-infinitesimal genetic architectures. The prior has two parameters: the heritability explained by the genotypes, and the fraction of causal markers, i.e. the fraction of markers with non-zero effects. The heritability parameter is estimated from GWAS summary statistics and accounts for sampling noise and LD [43].
LDpred was applied to five diseases, schizophrenia (SCZ), multiple sclerosis (MS), BC, type 2 diabetes (T2D) and coronary artery disease (CAD), for which GWAS summary statistics from large samples (27,000 to 86,000 individuals) and raw genotypes for an independent validation dataset were available. LDpred outperformed the approach of pruning followed by thresholding [16], particularly at large sample sizes. For instance, in large datasets of schizophrenia and multiple sclerosis, the prediction $R^2$ increased from 20.1% to 25.3% and from 9.8% to 12.0%, respectively. In another test, LDpred was applied to predict SCZ risk in non-European validation samples of both African and Asian descent. Although prediction accuracies were lower in absolute terms, similar relative improvements were observed for LDpred over other methods.
LDpred is a popular and powerful method for deriving polygenic scores based only on summary statistics and a linkage disequilibrium (LD) matrix [44]. It assumes that a proportion p of variants is causal.
However, LDpred has several limitations that may result in limited predictive performance. The non-infinitesimal version of LDpred, a Gibbs sampler, is particularly sensitive to model misspecification when applied to summary statistics with large sample sizes. It is also unstable in long-range LD regions such as the human leukocyte antigen (HLA) region of chromosome 6. This issue has led to the removal of such regions from analyses [21, 23], which is unfortunate since this region of the genome contains many known disease-associated variants, particularly for autoimmune diseases and psychiatric disorders [45, 46]. A new version of LDpred that addresses these issues while markedly improving computational efficiency was recently presented by Privé et al. [32]. It is a faster and more robust implementation of LDpred in the R package bigsnpr.
This new version, LDpred2, has a sparse option that can learn effects that are exactly 0 and an auto option that learns parameters directly from the data. LDpred is widely used and has the potential to provide polygenic models with good predictive performance [47]. Yet it has some instability issues, which were pointed out by Marquez-Luna et al. [23] and by Lloyd-Jones et al. [21] and which likely contributed to the discrepancies in reported prediction accuracies [31, 48]. For instance, LDpred1 performs poorly in simulations where causal variants are in the HLA region. In contrast, LDpred2 performs very well. It uses a window size of 3 centiMorgan (cM), which is larger than the default value used in LDpred1 and enables LDpred2 to work well even when causal variants are in long-range LD regions. In another scenario, LDpred2-auto, which automatically estimates values for the hyper-parameters $p$ and $h^2$, performs as well as the other LDpred2 models in simulations but does not perform well for some of the real traits.
For type 1 diabetes (T1D), the genetic signal is mainly composed of large effects in the HLA region, and the summary statistics have a small sample size. It is unknown why LDpred2-auto performs poorly specifically for prostate cancer (PRCA). More studies are needed to understand the poor results of LDpred2-auto in these two cases.
In 2015, Euesden et al. [30] developed the first dedicated PRS analysis software, PRSice. PRSice is written in R, with wrappers for bash data management scripts and PLINK-1.9 to minimize computational time (Table 1). Consider $n$ individuals from the 'target phenotype' data set and a list of $m$ SNPs whose genotypes may or may not have an effect on the 'base phenotype'. The base and target phenotypes may be the same when assessing the shared genetic overlap of a phenotype between samples or populations. The genotype effects can be estimated from a univariate regression on the base phenotype for each SNP, as in a genome-wide association study (GWAS). In such a GWAS, for SNP $i$, where $i = 1, 2, \ldots, m$, a p-value, $P_i$, is calculated for the association between the phenotype and the SNP genotypes $G_{i,j} \in \{0, 1, 2\}$ of individual $j$, where $j = 1, 2, \ldots, n$. Under the usual additive assumption made in GWAS, a corresponding effect size $\beta_i$ is estimated for the effect of a unit increase in genotype $G_{i,j}$ on the phenotype. SNPs are generally selected for inclusion in a PRS according to the strength of evidence (p-value) for their association with the base phenotype in a GWAS: SNP $i$ is included in the PRS calculation if $P_i$ is smaller than a threshold, $P_T$. PRS values are typically calculated at a number of different p-value thresholds, $P_T$.
At threshold $P_T$, the PRS value for individual $j$ can be calculated as:
$$PRS_{P_T, j} = \sum_{i=1}^{m} \beta_i G_{i,j}. \tag{6}$$
The PRS value is calculated across all individuals, giving $n$ scores per threshold $P_T$. The association between these PRS values and the target phenotype can then be evaluated in an appropriate regression model (depending on the data type of the target phenotype, e.g. linear regression if the phenotype is continuous). PRSice was developed to fully automate PRS analyses, substantially expanding the capability of PLINK-1.9 [49]. In real data, there is usually some missing genotype data unless genotypes have already been imputed. PLINK-1.9 imputes any missing data according to mean allele frequencies. However, PRSice is not equipped to handle very large data sets, and a more memory-efficient approach is used in its advanced version, PRSice-2.
PRSice-2, an enhancement of PRSice, handles both genotyped and imputed data, provides empirical association p-values free from inflation due to overfitting, supports different inheritance models and evaluates multiple continuous and binary target traits simultaneously [31]. It streamlines the entire PRS analysis pipeline without generating intermediate files and performs all the main computations in C++, leading to a drastic speed-up in run time and a reduction in memory burden. Furthermore, PRSice-2 can directly process the BGEN imputation format and convert it to either best-guess genotypes or dosages when calculating the PRS value, without generating a large intermediate file. While PRS values based on best-guess genotypes are calculated as for genotyped input, dosage-based PRS values are calculated as
$$PRS = \sum_{i}^{m} \beta_i \sum_{j} w_{ij} X_j, \tag{7}$$
where $w_{ij}$ is the probability of observing genotype $j$, with $j \in \{0, 1, 2\}$, for the $i$th SNP; $m$ is the number of SNPs; and $\beta_i$ is the effect size of the $i$th SNP estimated from the relevant base genome-wide association study (GWAS) data. A simulation
A simulation study has been used to compare the performance of PRSice-2 to the alternative polygenic score software lassosum [19] and LDpred [15] in terms of run time, memory usage and predictive power on servers equipped with 286 Intel 8168 24 core processors at 2.7 GHz and 192 GB of RAM. Based on the simulation results, PRSice-2 showed the best performance in all settings and was significantly faster than lassosum and LDpred. Specifically, PRSice-2 can complete the full PRS analysis on 100,000 samples within 4 minutes, which is 179 times faster than the 10 hours required by lassosum and 241 times faster than the 13 hours 27 minutes required by LDpred. Similarly, PRSice-2 requires significantly less memory than lassosum and LDpred, requiring < 500 MB of memory for 100,000 samples as opposed to 11.2 GB required by lassosum and 45.2 GB required by LDpred.

Another case study compared the predictive power of PRSice-2 to lassosum and LDpred for quantitative traits with heritability of 0.2, base sample size of 50,000 and target sample size of 10,000. PRSice-2 has comparable predictive power to lassosum and LDpred, typically generating PRS values with predictive power higher than those of LDpred but not as high as lassosum. The simulation code was made available for others to inspect and repeat the analyses. While PRS values generated by PRSice-2 do not seem to fully optimize predictive accuracy, the simple approach and the typically fewer SNPs exploited allow for easier interpretation of the results compared with methods that use all SNPs [50].
lassosum is an alternative method that uses summary statistics to estimate PRS values and takes LD into account by using reference panels [19]. It builds on the commonly used LASSO and elastic net regression [51, 52]. Consider the linear regression given below:

$$y = X\beta + \epsilon, \tag{8}$$

where $X$ is an $n$-by-$p$ data matrix and $y$ denotes a vector of the observed outcome. LASSO is a commonly used method for deriving estimates of $\beta$ and predictors of $y$, especially in cases where $p$ is large and where it is reasonable to assume that many elements of $\beta$ are 0. LASSO obtains these estimates of $\beta$, given $y$ and $X$, by minimizing a penalized objective function.

To test the efficiency of lassosum relative to LDpred, simulation studies [44] were carried out using summary statistics for which LD was accounted for and Wellcome Trust Case Control Consortium (WTCCC) Phase 1 data for seven diseases. The performance of LDpred, lassosum, and simple soft-thresholding (setting s = 1 in lassosum) was comparable for most of the diseases in the WTCCC dataset, except for T1D, where lassosum seems to outperform LDpred. The performance of LDpred and lassosum was comparable when the number of causal SNPs was 1,000 and the sample size was 11,200 for the simulated phenotypes, and both were superior to soft-thresholding. However, LDpred's performance was considerably reduced when the sample size was halved, whereas lassosum was not affected in the same way. All methods performed equally when the number of causal SNPs was 25,000 and the sample size was 11,200. The fact that summary statistics can be confounded by population stratification and population heterogeneity makes real-life application of PRS difficult. However, these problems were not considered in the lassosum design.
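The LASSO estimation described for equation (8) can be illustrated with a short coordinate-descent sketch. This is plain LASSO on individual-level data, written under the assumed objective $\frac{1}{2n}\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert_1$; it is not the lassosum algorithm, which works from summary statistics and an LD reference panel. The function names, the penalty value and the simulated data are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z toward zero by lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n     # (1/n) * x_j' x_j for each column
    residual = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue                      # skip constant-zero columns
            residual += X[:, j] * beta[j]     # remove predictor j's current contribution
            rho = X[:, j] @ residual / n      # correlation with the partial residual
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
            residual -= X[:, j] * beta[j]     # add the updated contribution back
    return beta

# Hypothetical simulation: 20 predictors, only the first two are causal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
true_beta = np.zeros(20)
true_beta[:2] = [1.0, -1.0]
y = X @ true_beta + 0.1 * rng.normal(size=200)

beta_hat = lasso_coordinate_descent(X, y, lam=0.1)
print(np.round(beta_hat, 3))
```

The penalty both shrinks the causal effects toward zero and sets most non-causal coefficients exactly to zero, which is the sparsity property that motivates LASSO when many elements of $\beta$ are expected to be 0.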
One possible issue with the use of meta-analytical summary statistics is that the original data behind the summary statistics is an amalgamation of datasets from around the world, with correction for population stratification. Possibly, there is no single homogeneous dataset suitable as a reference panel. Further research is therefore required to determine the best approach here.

Schork et al. [53] have demonstrated that different genome regions have different false discovery rates, and thus different chances of being causally associated with a phenotype. Therefore, genome annotation information can theoretically be used to enhance performance. Similarly, it is possible to utilize the fact that certain phenotypes have common genetic determinants (pleiotropy) to improve PRS.

PLINK 1 is an open-source C/C++ tool set for performing genome-wide association studies (GWAS) and research in population genetics. However, the steady growth of data from imputation and whole-genome sequencing studies created an urgent need for faster and more scalable implementations of its key functions. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format.

To address these problems, Chang et al. [36] developed a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, $O(\sqrt{n})$-time/constant-space Hardy-Weinberg equilibrium calculation and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude and allow the program to handle data sets too large to fit into RAM. PLINK 1.9's core functional domains are unchanged from those of its predecessor, and it is usable as a drop-in replacement in most cases with no changes to existing scripts.
To support easier interoperation with newer software, features such as the import/export of VCF and Oxford-format files and an efficient cross-platform genomic relationship matrix (GRM) calculator have been introduced. Most pipelines currently employing PLINK 1.07 can expect to benefit from upgrading to PLINK 1.9. Despite its computational advances, PLINK 1.9 can still be an unsatisfactory tool for working with imputed genomic data due to the limitations of the PLINK 1 binary file format. To address this issue, the authors designed a new core file format in PLINK 2.0 capable of representing most of the information emitted by modern imputation tools.

Applying PRS analysis to multi-ethnic groups is still limited. However, novel PRS methods have been developed to address the applicability of PRS analysis across ethnic groups.
Multi-ethnic PRS analysis is a new PRS approach that combines PRS analyses based on two distinct populations [42]. For instance, multi-ethnic PRS analysis could merge PRS analysis based on European training data with PRS analysis based on training data from another population. The multi-ethnic PRS approach computes the PRS value for a target individual with genotypes $g$ as follows:

$$\mathrm{PRS} = \sum_{i=1}^{M} \hat{b}_i g_i, \tag{9}$$

where $M$ is the total number of genetic markers of the individual, and $\hat{b}_i$ is an estimate of the effect size of marker $i$. For a multi-ethnic PRS analysis, this approach uses a linear combination of the two distinct PRS values, weighted by mixing weight parameters $\alpha_i$.

PRS analysis could be modeled using the well-known approach of best linear unbiased predictors (BLUP) [54]. BLUP is used to consider and linearly model both random effects and fixed effects. It is also known as genomic best linear unbiased prediction (gBLUP) [55]. BLUP/gBLUP estimates PRS values using the following formula:

$$\mathrm{PRS} = X\beta + g + \epsilon, \tag{10}$$

where $\beta$ is a vector of fixed effects, $g$ is the total genetic effects of the training samples and $\epsilon$ are the normally distributed residuals. To evaluate the fixed effects, BLUP considers an individual GWAS indicator, the top 5 principal components (PCs) derived with all samples together and/or a list of the significant SNPs. The BLUP approach is computationally efficient. However, its limitation is that it requires individual-level genotype data. BLUP has been implemented in the GCTA software (Genome-wide Complex Trait Analysis, https://cnsgenomics.com/software/gcta/). Moreover, it has been extended to XP-BLUP to model PRS values for admixed populations [55]. BLUP has also been extended to MultiBLUP to include multiple random effects [56].

GeRSI uses mixed models, combining fixed-effects models and random-effects models, to control for population structure [57]. GeRSI performs Gibbs sampling to estimate individuals' genetic risk scores given the case-control study's genotypes under a random-effects model. GeRSI proposes conditional distributions of the genetic and environmental effects using the standard liability-threshold model. However, the limitation of GeRSI is that it requires individual-level genotypes, which will not be available to many bioinformaticians.
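As an illustration of the BLUP model in equation (10), the sketch below computes generalized least squares estimates of the fixed effects and the BLUP of the genetic effects for the training samples. It is a textbook-style sketch rather than the GCTA implementation. It assumes that $g \sim N(0, \sigma_g^2 K)$, where $K$ is a genomic relationship matrix built from standardized genotypes, and that the variance ratio $\lambda = \sigma_\epsilon^2 / \sigma_g^2$ is known, whereas in practice the variance components would have to be estimated. The function name `gblup` and all simulated inputs are hypothetical.

```python
import numpy as np

def gblup(y, X, Z, lam):
    """Fixed-effect estimates and BLUP of genetic effects for y = X b + g + e.

    y:   (n,) phenotypes
    X:   (n, q) fixed-effect covariates (e.g. intercept, principal components)
    Z:   (n, m) standardized genotype matrix
    lam: assumed known ratio sigma_e^2 / sigma_g^2
    """
    n, m = Z.shape
    K = Z @ Z.T / m                           # genomic relationship matrix
    V = K + lam * np.eye(n)                   # phenotypic covariance up to a scale factor
    Vinv_X = np.linalg.solve(V, X)
    Vinv_y = np.linalg.solve(V, y)
    beta_hat = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_y)   # generalized least squares
    g_hat = K @ np.linalg.solve(V, y - X @ beta_hat)         # BLUP of genetic effects
    return beta_hat, g_hat

# Hypothetical simulated training data.
rng = np.random.default_rng(1)
n, m = 50, 200
raw = rng.binomial(2, 0.3, size=(n, m)).astype(float)
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)               # standardize each SNP
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + Z @ rng.normal(scale=0.05, size=m) + rng.normal(size=n)

beta_hat, g_hat = gblup(y, X, Z, lam=1.0)
print(beta_hat, g_hat[:5])
```

As the assumed ratio `lam` grows, the genetic effects are shrunk toward zero and the fixed-effect estimates approach ordinary least squares, which illustrates how the random-effect term regularizes the prediction.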
XP-BLUP is an extension of the BLUP method that could be applied to trans-ethnic populations [55]. XP-BLUP utilizes trans-ethnic information to improve the predictive accuracy of PRS values in minority populations. XP-BLUP combines the linear mixed-effects model (LMM) of the GeRSI method with the BLUP method.
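The trans-ethnic idea can be illustrated with the linear combination used by the multi-ethnic PRS of equation (9). The sketch below assumes that the two mixing weights are estimated by ordinary least squares of the phenotype on the two population-specific PRS values in a validation sample. That estimation step, the function names and the simulated scores are illustrative assumptions rather than details taken from the text above.

```python
import numpy as np

def fit_mixing_weights(prs_a, prs_b, phenotype):
    """Estimate mixing weights by least squares of phenotype on two PRS values."""
    design = np.column_stack([np.ones_like(prs_a), prs_a, prs_b])
    coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    return coef[1], coef[2]                   # drop the intercept

def multi_ethnic_prs(prs_a, prs_b, alpha_a, alpha_b):
    """Linear combination of two population-specific PRS values."""
    return alpha_a * prs_a + alpha_b * prs_b

# Hypothetical validation sample with two correlated scores.
rng = np.random.default_rng(2)
prs_eur = rng.normal(size=300)                # PRS from European training data
prs_afr = 0.4 * prs_eur + rng.normal(size=300)  # PRS from target-population training data
phenotype = 0.2 * prs_eur + 0.6 * prs_afr + rng.normal(scale=0.5, size=300)

alpha_eur, alpha_afr = fit_mixing_weights(prs_eur, prs_afr, phenotype)
combined = multi_ethnic_prs(prs_eur, prs_afr, alpha_eur, alpha_afr)
print(alpha_eur, alpha_afr)
```

Because the weights are fitted and evaluated on the same sample here, the combined score can only look as good as or better than either single score in this sample. A fair comparison would estimate the weights and evaluate the combined score on separate samples.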
Within the current literature that addresses the statistical power of PRS analysis, most articles treat the sample size as the main determinant of power. For instance, in 2013, Dudbridge estimated the predictive power of the polygenic score using results from several published studies [12]. Dudbridge concluded that all published studies with a significant association of PRS values are statistically well-powered. Dudbridge also pointed out that the accuracy of the PRS analysis depends only on the size of the initial sample (training sample), and provided a mathematical model to estimate the statistical power of the PRS value as a function of sample size. In 2014, Wray et al. [27] suggested that performing PRS analysis on a sample of 2000 individuals is sufficient to obtain a statistically powered PRS value. In contrast, Dima and Breen in 2015 [58] demonstrated that a sample size of 1500 is enough to raise the predictive power to a statistically significant level. They also stated that the predictive power of polygenic risk scores is not good enough for clinical applications, but that PRS could be used as a biomarker for traits of interest within individuals. More recently, in 2017, Krapohl et al. [5] introduced a multi-polygenic score that is capable of increasing the predictive power of PRS analysis.

Regarding the relative accuracy of PRS values across ancestries, Yengo et al. [59] proposed a theoretical model to estimate it. Their method utilizes the minor allele frequencies (MAF) in the two populations, the LD between the causal SNPs and the heritabilities. The authors assumed that causal variants are shared across ancestries, although their effect sizes might vary. Based on their model, the authors concluded that LD and MAF differences across ancestries explained 70-80% of the loss of relative accuracy of European-based PRS values in African ancestry.
PRS analysis is still rarely applied to study traits in the African population. For instance, a PubMed search using the keywords "polygenic risk" + "African" returned only 78 hits. This number represents about 5.45% of the total hits obtained without the keyword "African". The traits studied using PRS analysis in the African population include type 1 and type 2 diabetes mellitus, depression, ischemic stroke, schizophrenia, sarcoidosis, Alzheimer's disease, obesity, insomnia disorder, post-traumatic stress and cancer. The following paragraphs present selected PRS studies done on sub-Saharan African populations and highlight their outcomes.

In 2020, Ekoru and his colleagues investigated the genetic risk scores for cardiometabolic traits in several African ancestries, including sub-Saharan African populations [60]. They concluded that the predictive power of the risk score is limited in the African ancestry populations. They stated that this limitation is due to the insufficient diversity among the genomic discovery samples. Therefore, they adjusted for ancestry-derived principal components to obtain up to 5-fold and 20-fold higher predictive power. However, they observed that the predictive power of genetic risk scores was higher in the African Americans (n = 9139) and the European Americans (n = 9594) relative to the sub-Saharan African populations (n = 5200). Based on their outcome, Ekoru and his colleagues concluded that PRS analysis performs poorly in sub-Saharan African populations. They also recommended paying attention to the representation of multi-ethnic populations in genomic studies to improve the power of the genetic risk scores.

In 2020, Hayat and her colleagues investigated the genetic associations between low serum LDL-cholesterol levels and selected genetic variants in sub-Saharan Africans from four countries: Kenya, South Africa, Ghana and Burkina Faso [61].
Using 1000 Genomes data from the African populations, they selected four genes for their investigation (LDLR, APOB, PCSK9, and LDLRAP1). They genotyped 19 SNPs in 1000 participants of the Human Heredity and Health in Africa (H3Africa) AWI-Gen Collaborative Center (Africa, Wits-INDEPTH Partnership for GENomic studies). Although they used a limited number of variants, their results showed significant associations of these SNPs with lower LDL-cholesterol levels in sub-Saharan Africans.

In 2020, Cavazos and Witte proposed the inclusion of variants discovered in various populations to improve PRS transferability to diverse populations [62]. They used simulated data for both the Yoruba group, representing the sub-Saharan African population, and European populations. They tested their findings on real data consisting of diabetes-free training samples of European ancestry (n = 123,665) and African descent (n = 7,564). They evaluated the performance of PRS analysis using genotype and phenotype data for test (predictive) samples of European ancestry (n = 394,472) and of African origin (n = 5,886) from the UK Biobank. Based on their findings, they concluded that incorporating variants selected from the European population will limit the accuracy of PRS values in non-European populations, including African communities. They also commented on the need for diverse GWAS data to improve PRS accuracy across populations.

In 2017, Marquez-Luna et al. [42] proposed a multi-ethnic PRS analysis to improve risk prediction in diverse populations, including the African community. To overcome the lack of sufficient training data for the African populations, the authors combined training data from European samples with training data from the target population. As the authors did not state whether they used sub-Saharan African communities, we did not include their study. However, this highlights the challenge of performing PRS analysis in sub-Saharan African populations due to the lack of sufficient training data.

In 2017, Vassos et al. examined PRS values in a group of individuals with first-episode psychosis [63]. For the control sample, they combined an African European sample (n = 70) and a sample of sub-Saharan African ancestry (n = 828). Their findings showed that the PRS had greater discriminative ability in Europeans (9.4%) than in Africans (only 1.1%).

Moreover, PRS analysis has been applied to investigate the risk score for prostate cancer. Prostate cancer is considered a complex genetic disease with high heritability that disproportionately affects men of African descent [64]. In a study to predict the risks of prostate cancer in urban African populations, involving seven African study sites as well as European men from the 1000 Genomes Project, the risks of prostate cancer were found to be much greater for African genomes than for European genomes (p-value < ...).

Many PRS methods have been developed and applied to test the risk score of individuals. Nevertheless, PRS analysis has not yet been used in the clinical field for the African population. There are still many limitations and challenges regarding the application of PRS analysis in the African population. One of these challenges is the lack of sufficient data to perform PRS analysis. For instance, querying the term "sub-saharan" in the GWAS Catalog repository returned only 70 publications out of 4,628 papers. Considering that several publications might use the same GWAS data, we could affirm that more GWAS experiments need to be done for the sub-Saharan African population. This shortage might be due to the lack of infrastructure and funding to perform GWAS experiments in many countries in Africa. It might also be due to the fact that many African scientists are still focusing their research on infectious diseases like malaria, tuberculosis and HIV.
However, giving funding priority to infectious diseases is necessary for the African communities, as these diseases account for a higher mortality rate in Africa.

Due to the lack of sufficient training and test data sets, some scientists choose to use training data from European samples, which decreases PRS prediction accuracy. Therefore, PRS analysis is not widely applied to clinical research in Africa. Moreover, considering the diversity among African populations, the model used for PRS analysis might not work for African sub-populations. Therefore, our future direction would be to develop African-specific PRS methods that combine different sources of information.

Another challenge in performing and applying PRS analysis in the African population is the lack of long-term funds for GWAS experiments. Therefore, African state authorities should be made aware of this challenge, so as to make more funds available for genomic research. However, the funds should not be limited to research institutes and principal investigators alone; they should equally be directed towards the provision of scholarships (postgraduate programs like PhD) and financial aid for young African researchers. There are some promising African research consortia, e.g. the pan African Bioinformatics Network for the Human Heredity and Health in Africa (H3ABioNet, h3abionet.org) and the Human Heredity and Health in Africa (H3Africa, h3africa.org), that are contributing in this regard. However, their funds come from outside Africa. There are also regional African efforts like the World Bank funded Africa Center of Excellence (ACE) I and its successor, the ACE Impact, but these initiatives include few genomic research projects.
A follow-up project to H3Africa, dedicated to data science health research and entitled Harnessing Data Science for Health Discovery and Innovation in Africa (DS-I Africa), will soon commence.

Moreover, the lack of a pan-African genomic advisory board remains another challenge for genomic research in Africa, including PRS analysis. The existence of such a research advisory board would help research transparency and establish ethical guidelines for performing genomic research. This could open the window to more grants from funding agencies such as the National Center for Biotechnology Information (NCBI). It is clear that without rigorous ethical guidelines and transparency policies, it is hard to get long-term funds.
PRS analysis has been successfully applied to estimate and identify individuals with genetic risk for many biological traits, especially type-2 diabetes, breast cancer and prostate cancer (see supplementary file S1). Most of these studies provide significant evidence of the success of PRS analysis in identifying patients who are at high risk of developing disease complications. Hence, the primary strength of PRS analysis is its capability of stratifying individuals based on their probability of developing a disease. The biological power of PRS analysis also arises from its potential capacity to identify therapeutic and genomic pathways for type-2 diabetes, breast cancer and prostate cancer. Moreover, applying PRS analysis to these traits showed that PRS results are reproducible in the European population.

Nonetheless, one weakness of applying PRS analysis to these traits is its limited ability to detect false-positive results. Also, most PRS studies are only available for European ancestries. Therefore, we cannot apply them to non-European communities. In addition, performing PRS analysis on sizeable multi-ethnic data is indispensable for obtaining more accurate PRS values across populations. Furthermore, applying PRS outcomes to personalized medicine requires robust validation procedures before broad clinical application in multi-ethnic communities.
There are several approaches under the umbrella of PRS analysis. GWAS are conducted on finite samples extracted from particular subsets of the human population. Moreover, the SNP effect size estimates are some combination of true effect and stochastic variation, thus producing 'winner's curse' among the top-ranking associations, and the estimated effects may not generalize well to different populations. Furthermore, the correlation between SNPs complicates the aggregation of SNP effects across the genome. Therefore, 'Linkage Disequilibrium' (LD) holds the key to applying PRS analysis across ethnic groups. Thus, critical factors in the development of methods for calculating PRS values are:
• The potential adjustment of GWAS estimated effect sizes, e.g. via shrinkage and incorporation of their uncertainty;
• The tailoring of PRS values to target populations; and
• The task of dealing with LD.

As members of the H3Africa consortium and the associated bioinformatics consortium, H3ABioNet (see h3abionet.org and https://sysbiolpgwas.waslitbre.org), we are working to extend existing methods to be applicable to African populations. Also, one future direction will be to develop an African-specific PRS method that combines the different sources of information.
10 Acknowledgements
Research reported in this publication is supported by the National Human Genome Research Institute (NHGRI), Office Of The Director, National Institutes Of Health (OD) under award number U24HG006941 and the World Bank funding for the ACE Impact projects. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health and the World Bank. Our special thanks to Kalyani Dhusia for her editorial assistance.
11 Competing interests
The authors declare that they have no competing interests.
12 Organization Description
H3ABioNet is a pan-African bioinformatics network comprising 28 bioinformatics research groups distributed amongst 16 African countries and 2 partner institutions in the USA. The consortium supports H3Africa researchers and their projects whilst developing bioinformatics capacity within Africa.

References
1. Bush WS. Genome-Wide Association Studies. In: Encyclopedia of Bioinformatics and Computational Biology. Elsevier; 2019. p. 235–241. Available from: https://doi.org/10.1016/b978-0-12-809633-8.20232-x.
2. Gurdasani D, Carstensen T, Fatumo S, Chen G, Franklin CS, Prado-Martinez J, et al. Uganda Genome Resource Enables Insights into Population History and Genomic Discovery in Africa. Cell. 2019;179(4):984–1002.e36. doi:10.1016/j.cell.2019.10.004.
3. Cantor RM, Lange K, Sinsheimer JS. Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application. The American Journal of Human Genetics. 2010;86(1):6–22. doi:10.1016/j.ajhg.2009.11.017.
4. Zhang Q, Long Q, Ott J. AprioriGWAS, a New Pattern Mining Strategy for Detecting Genetic Variants Associated with Disease through Interaction Effects. PLoS Computational Biology. 2014;10(6):e1003627. doi:10.1371/journal.pcbi.1003627.
5. Krapohl E, Patel H, Newhouse S, Curtis CJ, von Stumm S, Dale PS, et al. Multi-polygenic score approach to trait prediction. Molecular Psychiatry. 2017;23(5):1368–1374. doi:10.1038/mp.2017.163.
6. Pasaniuc B, Price AL. Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics. 2016;18(2):117–127. doi:10.1038/nrg.2016.142.
7. Chimusa ER, Dalvie S, Dandara C, Wonkam A, Mazandu GK. Post genome-wide association analysis: dissecting computational pathway/network-based approaches. Briefings in Bioinformatics. 2019;20(2):690–700. doi:10.1093/bib/bby035.
8. Buniello A, Macarthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Research. 2019. doi:10.1093/nar/gky1120.
9. Beck T, Shorter T, Brookes AJ. GWAS Central: a comprehensive resource for the discovery and comparison of genotype and phenotype data from genome-wide association studies. Nucleic Acids Research. 2019;48(D1):D933–D940. doi:10.1093/nar/gkz895.
10. Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, et al. The NCBI dbGaP database of genotypes and phenotypes. Nature Genetics. 2007;39(10):1181–1186. doi:10.1038/ng1007-1181.
11. Tryka KA, Hao L, Sturcke A, Jin Y, Wang ZY, Ziyabari L, et al. NCBI's Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Research. 2013;42(D1):D975–D979. doi:10.1093/nar/gkt1211.
12. Dudbridge F. Power and Predictive Accuracy of Polygenic Risk Scores. PLoS Genetics. 2013;9(3):e1003348. doi:10.1371/journal.pgen.1003348.
13. Lewis CM, Vassos E. Prospects for using risk scores in polygenic medicine. Genome Medicine. 2017;9(1). doi:10.1186/s13073-017-0489-y.
14. Chatterjee N, Wheeler B, Sampson J, Hartge P, Chanock SJ, Park JH. Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nature Genetics. 2013;45(4):400–405. doi:10.1038/ng.2579.
15. Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. The American Journal of Human Genetics. 2015;97(4):576–592. doi:10.1016/j.ajhg.2015.09.001.
16. So HC, Sham PC. Improving polygenic risk prediction from summary statistics by an empirical Bayes approach. Scientific Reports. 2017;7(1). doi:10.1038/srep41262.
17. Song S, Jiang W, Hou L, Zhao H. Leveraging effect size distributions to improve polygenic risk scores derived from summary statistics of genome-wide association studies. PLOS Computational Biology. 2020;16(2):e1007565. doi:10.1371/journal.pcbi.1007565.
18. Hu Y, Lu Q, Powles R, Yao X, Yang C, Fang F, et al. Leveraging functional annotations in genetic risk prediction for human complex diseases. PLOS Computational Biology. 2017;13(6):e1005589. doi:10.1371/journal.pcbi.1005589.
19. Mak TSH, Porsch RM, Choi SW, Zhou X, Sham PC. Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology. 2017;41(6):469–480. doi:10.1002/gepi.22050.
20. Zhang Y, Qi G, Park JH, Chatterjee N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nature Genetics. 2018;50(9):1318–1326. doi:10.1038/s41588-018-0193-x.
21. Lloyd-Jones LR, Zeng J, Sidorenko J, Yengo L, Moser G, Kemper KE, et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nature Communications. 2019;10(1). doi:10.1038/s41467-019-12653-0.
22. Ge T, Chen CY, Ni Y, Feng YCA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature Communications. 2019;10(1). doi:10.1038/s41467-019-09718-5.
23. Márquez-Luna C, Gazal S, Loh PR, Kim SS, Furlotte N, Auton A, et al. LDpred-funct: incorporating functional priors improves polygenic prediction accuracy in UK Biobank and 23andMe data sets. 2018. doi:10.1101/375337.
24. Shi J, Park JH, Duan J, Berndt ST, Moy W, Yu K, et al. Winner's Curse Correction and Variable Thresholding Improve Performance of Polygenic Risk Modeling Based on Genome-Wide Association Study Summary-Level Data. PLOS Genetics. 2016;12(12):e1006493. doi:10.1371/journal.pgen.1006493.
25. Turley P, Walters RK, Maghzian O, Okbay A, Lee JJ, et al. Author Correction: Multi-trait analysis of genome-wide association summary statistics using MTAG. Nature Genetics. 2019;51(8):1295–1295. doi:10.1038/s41588-019-0469-9.
26. Choi SW, Mak TSH, O'Reilly PF. A guide to performing Polygenic Risk Score analyses. 2018. doi:10.1101/416545.
27. Wray NR, Lee SH, Mehta D, Vinkhuyzen AAE, Dudbridge F, Middeldorp CM. Research Review: Polygenic methods and their application to psychiatric traits. Journal of Child Psychology and Psychiatry. 2014;55(10):1068–1087. doi:10.1111/jcpp.12295.
28. Loh PR, Kichaev G, Gazal S, Schoech AP, Price AL. Mixed-model association for biobank-scale datasets. Nature Genetics. 2018;50(7):906–908. doi:10.1038/s41588-018-0144-6.
29. Wray NR, Yang J, Hayes BJ, Price AL, Goddard ME, Visscher PM. Pitfalls of predicting complex traits from SNPs. Nature Reviews Genetics. 2013;14(7):507–515. doi:10.1038/nrg3457.
30. Euesden J, Lewis CM, O'Reilly PF. PRSice: Polygenic Risk Score software. Bioinformatics. 2015;31(9):1466–1468. doi:10.1093/bioinformatics/btu848.
31. Choi SW, O'Reilly PF. PRSice-2: Polygenic Risk Score software for biobank-scale data. GigaScience. 2019;8(7). doi:10.1093/gigascience/giz082.
32. Privé F, Arbel J, Vilhjálmsson BJ. LDpred2: better, faster, stronger. 2020. doi:10.1101/2020.04.28.066720.
33. Yang S, Zhou X. Accurate and Scalable Construction of Polygenic Scores in Large Biobank Data Sets. The American Journal of Human Genetics. 2020;106(5):679–693. doi:10.1016/j.ajhg.2020.03.013.
34. Zeng P, Zhou X. Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models. Nature Communications. 2017;8(1). doi:10.1038/s41467-017-00470-2.
35. Maier RM, Zhu Z, Lee SH, Trzaskowski M, Ruderfer DM, Stahl EA, et al. Improving genetic prediction by leveraging genetic correlations among human diseases and traits. Nature Communications. 2018;9(1). doi:10.1038/s41467-017-02769-6.
36. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4(1). doi:10.1186/s13742-015-0047-8.
37. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics. 2006;38(8):904–909. doi:10.1038/ng1847.
38. Astle W, Balding DJ. Population Structure and Cryptic Relatedness in Genetic Association Studies. Statistical Science. 2009;24(4):451–471. doi:10.1214/09-sts307.
39. Price AL, Zaitlen NA, Reich D, Patterson N. New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics. 2010;11(7):459–463. doi:10.1038/nrg2813.
40. Kim MS, Patel KP, Teng AK, Berens AJ, Lachance J. Genetic disease risks can be misestimated across global populations. Genome Biology. 2018;19(1). doi:10.1186/s13059-018-1561-7.
41. Martin AR, Gignoux CR, Walters RK, Wojcik GL, Neale BM, Gravel S, et al. Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. The American Journal of Human Genetics. 2017;100(4):635–649. doi:10.1016/j.ajhg.2017.03.004.
42. Márquez-Luna C, Loh PR, Price AL. Multiethnic polygenic risk scores improve risk prediction in diverse populations. Genetic Epidemiology. 2017;41(8):811–823. doi:10.1002/gepi.22083.
43. Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nature Genetics. 2015;47(11):1228–1235. doi:10.1038/ng.3404.
44. Vilhjálmsson BJ, Yang J, Finucane HK, Gusev A, Lindström S, Ripke S, et al. Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores. The American Journal of Human Genetics. 2015;97(4):576–592. doi:10.1016/j.ajhg.2015.09.001.
45. Mokhtari R, Lachman HM. The Major Histocompatibility Complex (MHC) in Schizophrenia: A Review. Journal of Clinical & Cellular Immunology. 2016;07(06). doi:10.4172/2155-9899.1000479.
46. Matzaraki V, Kumar V, Wijmenga C, Zhernakova A. The MHC locus and genetic susceptibility to autoimmune and infectious diseases. Genome Biology. 2017;18(1). doi:10.1186/s13059-017-1207-1.
47. Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature Genetics. 2018;50(9):1219–1224. doi:10.1038/s41588-018-0183-z.
48. Ge T, Chen CY, Ni Y, Feng YCA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature Communications. 2019;10(1). doi:10.1038/s41467-019-09718-5.
49. Chang CC, Chow CC, Tellier LCAM, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2014. doi:10.1186/s13742-015-0047-8.
50. Janssens ACJ, Joyner MJ. Polygenic Risk Scores That Predict Common Diseases Using Millions of Single Nucleotide Polymorphisms: Is More, Better? Clinical Chemistry. 2019;65(5):609–611. doi:10.1373/clinchem.2018.296103.
51. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B (Methodological). 1996;58(1):267–288.
52. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–320. doi:10.1111/j.1467-9868.2005.00503.x.
53. Schork AJ, Thompson WK, Pham P, Torkamani A, Roddey JC, Sullivan PF, et al. All SNPs Are Not Created Equal: Genome-Wide Association Studies Reveal a Consistent Pattern of Enrichment among Functionally Annotated SNPs. PLoS Genetics. 2013;9(4):e1003449. doi:10.1371/journal.pgen.1003449.
54. Chen CY, Han J, Hunter DJ, Kraft P, Price AL. Explicit Modeling of Ancestry Improves Polygenic Risk Scores and BLUP Prediction. Genetic Epidemiology. 2015;39(6):427–438. doi:10.1002/gepi.21906.
55. Clark SA, van der Werf J. Genomic Best Linear Unbiased Prediction (gBLUP) for the Estimation of Genomic Breeding Values. In: Methods in Molecular Biology. Humana Press; 2013. p. 321–330. Available from: https://doi.org/10.1007/978-1-62703-447-0_13