A deep learning classifier for local ancestry inference
Matthew Aguirre, Jan Sokol, Guhan Venkataraman, Alexander Ioannidis
Matthew Aguirre
Department of Biomedical Data Science, Stanford University
Jan Sokol
Department of Biomedical Data Science, Stanford University
Guhan Venkataraman
Department of Biomedical Data Science, Stanford University
Alexander Ioannidis
Department of Biomedical Data Science, Stanford University
Abstract
Local ancestry inference (LAI) identifies the ancestry of each segment of an individual’s genome and is an important step in medical and population genetic studies of diverse cohorts. Several techniques have been used for LAI, including Hidden Markov Models and Random Forests. Here, we formulate the LAI task as an image segmentation problem and develop a new LAI tool using a deep convolutional neural network with an encoder-decoder architecture. We train our model using complete genome sequences from 982 unadmixed individuals drawn from five continental ancestry groups, and we evaluate it using simulated admixed data derived from an additional 279 individuals selected from the same populations. We show that our model is able to learn admixture as a zero-shot task, yielding ancestry assignments that are nearly as accurate as those from the existing gold standard tool, RFMix.
Accepted to Learning Meaningful Representations of Life (LMRL), Workshop at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

Ancestry inference refers to the task of assigning genetic ancestry labels given an individual’s genomic sequence. Two forms are of note: global ancestry inference, which performs this task at an individual level, and local ancestry inference (LAI), which further segments an individual’s genetic ancestry into assignments that can differ across the genome.

Many models of ancestry inference have been proposed, spurring significant statistical methods development. The first model of global ancestry inference, STRUCTURE, was an independent discovery of Latent Dirichlet Allocation (LDA), and is still the preferred method for this task [1, 2, 3]. Early models for local ancestry inference, such as HAPMIX, followed in the tradition of the Kingman coalescent and used extensions of a haplotype-based model of linkage disequilibrium due to Li and Stephens [4, 5, 6, 7]. Other approaches to LAI have used Hidden Markov Models [8] or discriminative models over windows of chromosomal sequence [9, 10]. These tools have been reviewed in greater depth by Liu et al. [11].

For most modern studies, however, LAI is a pure prediction problem in that model parameters are not of interest: the goal is the ancestry assignments themselves. Further, the LAI task can also be framed as image segmentation. In this formulation, the genome sequence is a one-dimensional image. Each variant site constitutes a pixel, with two complementary channels for reference and alternate alleles relative to a genome standard. Here, we describe a deep learning model for LAI using a convolutional neural network (CNN) designed for image segmentation, and benchmark it against the existing gold standard LAI tool, RFMix [10]. A schematic of our model is shown in Figure 1.

Figure 1: Model overview. Input is a two-channel bit vector that encodes whether a sample’s maternal or paternal allele varies from the reference genome at a given site. Data are fed into a CNN based on the SegNet architecture [12], which consists of several conv/maxpool and conv/upsampling blocks. The output is a segmentation of the input sequence, in which each letter is assigned to a genetic ancestry group.
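As a rough illustration of the input encoding described in the Figure 1 caption, the sketch below stacks a pair of phased haplotypes into a two-channel bit matrix with one row per variant site. The function name `encode_two_channel` and the allele coding (0 = matches the reference, 1 = differs from it) are assumptions made for this example; the text does not specify how the encoding was implemented.

```python
from typing import List, Sequence


def encode_two_channel(maternal: Sequence[int], paternal: Sequence[int]) -> List[List[int]]:
    """Stack two phased haplotypes into a (sites x 2) bit matrix.

    Each input is a sequence of 0/1 alleles (assumed coding: 0 = reference,
    1 = alternate). Row i of the output holds the maternal and paternal bits
    for variant site i, so each site acts as one two-channel "pixel".
    """
    if len(maternal) != len(paternal):
        raise ValueError("haplotypes must cover the same variant sites")
    for allele in list(maternal) + list(paternal):
        if allele not in (0, 1):
            raise ValueError("only biallelic 0/1 alleles are supported")
    return [[m, p] for m, p in zip(maternal, paternal)]


# Toy example with four variant sites.
example = encode_two_channel([0, 1, 1, 0], [0, 0, 1, 1])
```

Restricting the input to 0/1 alleles mirrors the restriction to biallelic SNPs; multiallelic sites would need a different encoding.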
We assembled a dataset of whole genome sequences from real individuals in the 1000 Genomes Project (1kG, n = 2, ), Human Genome Diversity Project (HGDP, n = 929), and Simons Genome Diversity Project (SGDP, n = 279) [13, 14, 15]. These data were harmonized using standard bioinformatic tools such as PLINK and bcftools (to filter and merge genetic data), liftOver (to map variants to a common reference genome), and Beagle (to infer haplotypes from unphased samples) [16, 17, 18, 19].

We used ADMIXTURE, a tool that implements the STRUCTURE algorithm, to produce unsupervised genetic ancestry clusters, referencing metadata to interpret the resulting cluster assignments [3]. At K = 8 clusters, we find genetic clusters consistent with African (AFR), East Asian (EAS), European (EUR), Native American (NAT), Oceanian (OCE), South Asian (SAS), San-Mbuti (SMB), and West Asian (WAS) ancestries. We removed individuals in the three least common cluster groups, as well as admixed individuals (having < assignment to a single cluster), yielding a final dataset (n = 2, haplotypes) of five continental ancestries (AFR, EAS, EUR, NAT, SAS), which we split into train, development (dev), and test sets (Table 1). Related individuals were removed to avoid data leakage. For computational tractability, we subset the genetic data to include only biallelic (one alternate allele) single nucleotide polymorphisms (SNPs) on chromosome 20 that are present in the train set (p = 475, variant positions).

To evaluate our models under realistic scenarios of admixture, we simulated the progeny of the real dev and test set individuals using RFMix [10]. We varied the number of generations of mixture and the ancestral diversity of the founding members of the population so as to create a comprehensive set of benchmark datasets.
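To make the idea of simulated admixture concrete, the following is a deliberately simplified sketch of how labeled admixed haplotypes can be produced from unadmixed founders. It is not the RFMix simulator: the single uniformly placed crossover per meiosis, the random-mating scheme, and the names `recombine` and `simulate_admixture` are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Haplotype = Tuple[List[int], List[str]]  # (alleles, per-site ancestry labels)


def recombine(a: Haplotype, b: Haplotype, rng: random.Random) -> Haplotype:
    """Create one child haplotype with a single crossover between a and b."""
    n = len(a[0])
    point = rng.randint(1, n - 1)  # crossover position (assumed uniform)
    first, second = (a, b) if rng.random() < 0.5 else (b, a)
    alleles = first[0][:point] + second[0][point:]
    labels = first[1][:point] + second[1][point:]
    return alleles, labels


def simulate_admixture(
    founders: List[Haplotype], generations: int, size: int, seed: int = 0
) -> List[Haplotype]:
    """Randomly mate a population for a number of generations.

    Ancestry labels travel with the alleles, so every simulated haplotype
    comes with site-level ground truth for evaluating an LAI method.
    """
    rng = random.Random(seed)
    population = list(founders)
    for _ in range(generations):
        population = [
            recombine(*rng.sample(population, 2), rng) for _ in range(size)
        ]
    return population


# Toy founders: one "AFR" haplotype of all 0s and one "EUR" haplotype of all 1s.
founders = [([0] * 10, ["AFR"] * 10), ([1] * 10, ["EUR"] * 10)]
admixed = simulate_admixture(founders, generations=10, size=20, seed=1)
```

Increasing `generations` produces shorter ancestry tracts, which is one way the difficulty of a benchmark can be varied.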
For model selection (see Methods), we used a dev set consisting of 200 genotypes simulated by 10 generations of admixture across all five ancestries of dev set individuals.

Population   AFR   EAS   EUR   NAT   SAS
Train        289   389   118    51   135
Dev           52    70    24    16    24
Test          26    35    12     8    12

Table 1: Number of individuals of each ancestral composition in the train, dev, and test sets.

SegNet is a convolutional encoder-decoder neural network architecture designed for semantic segmentation [12]. In its original instance, SegNet consists of five “down” blocks, each containing two or three convolutional operators and a pooling layer, followed by five “up” blocks, each containing an up-sampling layer followed by two or three convolutional layers. Three SegNet hyperparameters are of interest: the number of blocks, the number of filters in the input layer, and the filter width. The number of filters increases by a factor of two in each down/up block, and filter width is fixed throughout.

We conducted a grid search to select an optimal network architecture using these three hyperparameters. Filter width and filter count values were roughly spaced exponentially in the range [2, ], and the number of blocks was 3, 4, or 5. All models were trained using categorical cross-entropy loss, evaluated over genetic variants and individuals. The best performing model (with minimal dev set loss) had 4 blocks and 16 input-layer filters of width 16. Complete results are available as a table on the project GitHub.

As a baseline, we evaluated RFMix on the same test data. This method is fully described by Maples et al. [10]; in brief, it uses several independent random forest classifiers to predict ancestry within chromosomal windows. Classifier output is then smoothed by a conditional random field (CRF) layer to yield final predictions. We used a version of RFMix (v2.03) from bioconda [20].

In all, we trained about 50 models to select optimal hyperparameters.
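The block structure described above can be tabulated with a short sketch that tracks filter counts and sequence length through a one-dimensional encoder-decoder. Several details here are assumptions rather than facts from the text: pooling and up-sampling by a factor of two, a decoder that mirrors the encoder's filter counts, and the hypothetical function name `segnet_schedule`.

```python
from typing import List, Tuple


def segnet_schedule(blocks: int, base_filters: int, length: int) -> List[Tuple[str, int, int]]:
    """List (stage, filters, sequence length) for a 1-D encoder-decoder.

    Assumptions for this sketch: each "down" block halves the length with a
    size-2 pooling step, each "up" block doubles it again, and the decoder
    mirrors the encoder's filter counts.
    """
    if length % (2 ** blocks) != 0:
        raise ValueError("length must be divisible by 2**blocks")
    stages = []
    filters = base_filters
    for i in range(blocks):
        length //= 2  # pooling halves the number of positions
        stages.append((f"down{i + 1}", filters, length))
        filters *= 2  # filter count doubles for the next down block
    for i in range(blocks):
        filters //= 2
        length *= 2  # up-sampling restores the resolution
        stages.append((f"up{i + 1}", filters, length))
    return stages


# Best dev-set configuration reported above: 4 blocks, 16 input-layer filters.
# The input length of 1024 sites is a toy value chosen for the example.
schedule = segnet_schedule(blocks=4, base_filters=16, length=1024)
for stage, filters, positions in schedule:
    print(f"{stage}: {filters} filters, {positions} positions")
```

Under these assumptions the output has the same length as the input, which is what allows one ancestry label to be emitted per variant site.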
Several perform moderately well, achieving base-pair level accuracy above 80% on the dev set (see Dataset), with the best model (see Methods) reaching 85.6% accuracy. We also found that our model predictions are, in places, noisy (Figure 2). We therefore post-processed model output with a mode filter, replacing the output for each position with the most common prediction within k positions on either side. We let k = 2000, for an overall window size of 4000, and found that this increased the accuracy of our predictions by a few percentage points.

Figure 2: LAI Example. Genetic ancestry ground truth labels (top), with SegNet (middle) and RFMix predictions (bottom). x-axis values are position indexes. SegNet output is subject to a mode filter.

In an independent test set of individuals simulated under the same conditions as the dev set, our model achieves 88.2% accuracy. Relative to the baseline method, RFMix, however, our model compares rather unfavorably (Table 2). The test set accuracy of RFMix, fit using the same training set as our model, is 97.2%. While both models do quite well identifying regions of African (AFR) and East Asian (EAS) ancestries, the SegNet model fares much worse with the other populations. This is not surprising, as AFR and EAS are over-represented classes in the training set (Table 1). Moreover, the errors made by our model (e.g. confusion between EUR/SAS, and between NAT/EAS) tend to reflect the more recent common ancestry between those groups.

SegNet   AFR     EAS     EUR     NAT     SAS
AFR      0.973   0.009   0.010   0.000   0.007
EAS      0.005   0.949   0.006   0.002   0.039
EUR      0.042   0.037   0.710   0.006   0.204
NAT      0.004   0.182   0.010   0.739   0.066
SAS      0.028   0.185   0.153   0.009   0.626

RFMix    AFR     EAS     EUR     NAT     SAS
AFR      0.981   0.004   0.013   0.000   0.002
EAS      0.003   0.989   0.002   0.002   0.004
EUR      0.006   0.026   0.939   0.005   0.025
NAT      0.001   0.011   0.029   0.954   0.004
SAS      0.003   0.037   0.021   0.003   0.937

Table 2: Model performance. Ground truth genotypes (rows) were generated by 10 generations of admixture between test set individuals. Confusion with output labels (columns) is normalized across rows. Diagonal entries are ancestry-specific sensitivity. All values rounded to thousandths.

Our model also generalizes across varied cohort demography. In another series of experiments, we generated 100 simulated individuals from 10 generations of admixture between founding populations of two ancestry groups at an unequally sampled (80:20) ratio. In Table 3, we highlight results from a cohort of 80% African and 20% South Asian ancestry. While RFMix achieves 98.4% accuracy after fitting a new model, our SegNet model reaches 89.8% without the need for retraining. We conducted similar experiments across all pairs of ancestry groups and find that model performance (i.e. class confusions from Table 2) is broadly consistent across the cohorts (full results on GitHub).

SegNet   AFR     EAS     EUR     NAT     SAS
AFR      0.989   0.001   0.004   0.000   0.006
SAS      0.018   0.143   0.202   0.003   0.634

RFMix    AFR     EAS     EUR     NAT     SAS
AFR      0.989   0.001   0.009   0.000   0.001
SAS      0.007   0.015   0.002   0.015   0.961

Table 3: Model performance in 80/20 ancestry composition cohort. Ground truth genotypes (rows) are assigned output labels (columns), with values normalized across rows. Diagonal entries are ancestry-specific sensitivity. All values rounded to thousandths.
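The mode-filter post-processing and the row-normalized confusion tables reported above can be sketched as follows. This is a minimal illustration rather than the code used to produce the results: the window is taken to be positions i-k through i+k, edge truncation and tie-breaking are assumptions, and the names `mode_filter` and `row_normalized_confusion` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def mode_filter(labels: Sequence[str], k: int) -> List[str]:
    """Replace each label with the most common label within k positions.

    The window covers positions i-k .. i+k and is truncated at the sequence
    ends (an assumption; edge handling is not specified in the text). Ties
    keep the original label when it is among the most common, otherwise the
    alphabetically first tied label is used.
    """
    n = len(labels)
    counts: Counter = Counter(labels[: min(n, k + 1)])
    smoothed = []
    for i in range(n):
        best = max(counts.values())
        tied = sorted(lab for lab, c in counts.items() if c == best)
        smoothed.append(labels[i] if labels[i] in tied else tied[0])
        # Slide the window: add position i+k+1, drop position i-k.
        if i + k + 1 < n:
            counts[labels[i + k + 1]] += 1
        if i - k >= 0:
            counts[labels[i - k]] -= 1
    return smoothed


def row_normalized_confusion(
    truth: Sequence[str], predicted: Sequence[str], classes: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Fraction of each true class (row) assigned to each predicted class."""
    counts = {t: Counter() for t in classes}
    for t, p in zip(truth, predicted):
        counts[t][p] += 1
    table = {}
    for t in classes:
        total = sum(counts[t].values())
        table[t] = {p: (counts[t][p] / total if total else 0.0) for p in classes}
    return table


# Toy example: two ancestry tracts with a few noisy calls.
truth = ["AFR"] * 6 + ["EUR"] * 6
raw = ["AFR", "AFR", "EUR", "AFR", "AFR", "AFR",
       "EUR", "EUR", "AFR", "EUR", "EUR", "EUR"]
smoothed = mode_filter(raw, k=2)
table = row_normalized_confusion(truth, smoothed, ["AFR", "EUR"])
```

In the toy example the filter removes the isolated miscalls inside each tract but shifts the tract boundary by one position, which illustrates the trade-off of a wide smoothing window near ancestry switch points.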
Here, we formulate the local ancestry inference task as a one-dimensional image segmentation problem, and develop a new LAI tool based on a deep convolutional neural network (SegNet). While there is a clear performance gap between the method we present here and the current gold standard, we foresee several ways to improve our model. One is data augmentation. This includes incorporating admixed genotypes during training, which may be generated on the fly within each epoch during training (as RFMix does), or which may be pre-computed. Another is adding data “channels” for biologically relevant quantities – such as recombination rate between sites (also considered by RFMix), or population-specific allele frequencies for each input variant. Architectural changes, including adding a conditional random field output layer to smooth output predictions, could also be helpful. Finally, more extensive use of regularization techniques like dropout may also improve performance, as most genetic variants are not informative with respect to ancestry.

Though work remains to make our model viable, we note that our approach is not without benefits or novelty. First, while the models in this work are not very resource-hungry (all run to completion in under two days on a single NVIDIA RTX 2080 Ti), the “train once, use anywhere” paradigm is immensely important for analyses of large or growing datasets [21]. The portability of the model allows ancestry learned on a privacy-protected or proprietary dataset to be applied by external users lacking access, something that is not possible with RFMix. Second, it warrants mention that deep learning techniques are underutilized in population genetics, and many popular genomics tools are not optimized for graphical processing resources. We believe there is significant room for methods development in this space, particularly for purely predictive tasks such as LAI. We anticipate this will be of interest as the size and scope of genomic data continue to grow.
Several commercial entities offer generic ancestry inference as a direct-to-consumer enterprise, which has remained largely uncontroversial, aside from data privacy concerns and occasional unexpected family discoveries. However, we must acknowledge that the field of genetics – in particular, genetic studies of differences between individuals of diverse backgrounds – has had a problematic past with respect to data collection (e.g. participant consent) and to propagating systems of racial hierarchy and racism (e.g. eugenics). Even today, there remains significant risk that certain individuals will promote misreadings of genetic studies to further personal or political agendas. In this respect, let us be very clear: we believe that race, as a social construct, has no biological basis and therefore no meaningful place in population genetic studies like this one.

Population genetic studies, using tools such as the ones we propose, can reveal a great deal about our origins and history as a species: our migrations, our interactions, and our families. For some, these stories may be a source of pride, or a tool to reclaim a lost, colonized past, while for others, they may be reminders of times where our ancestors have been less kind. These realities do not negate the ways we choose to describe ourselves in the present, and they do not foretell the type of people we may become.
References
[1] Jonathan K Pritchard, Matthew Stephens, and Peter Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959, 2000.
[2] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
[3] David H Alexander and Kenneth Lange. Enhancements to the admixture algorithm for individual ancestry estimation. BMC bioinformatics, 12(1):246, 2011.
[4] Alkes L Price, Arti Tandon, Nick Patterson, Kathleen C Barnes, Nicholas Rafaels, Ingo Ruczinski, Terri H Beaty, Rasika Mathias, David Reich, and Simon Myers. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS genetics, 5(6):e1000519, 2009.
[5] John Frank Charles Kingman. The coalescent. Stochastic processes and their applications, 13(3):235–248, 1982.
[6] Richard R Hudson. Properties of a neutral allele model with intragenic recombination. Theoretical population biology, 23(2):183–201, 1983.
[7] Na Li and Matthew Stephens. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165(4):2213–2233, 2003.
[8] Nick Patterson, Neil Hattangadi, Barton Lane, Kirk E Lohmueller, David A Hafler, Jorge R Oksenberg, Stephen L Hauser, Michael W Smith, Stephen J O’Brien, David Altshuler, et al. Methods for high-density admixture mapping of disease genes. The American Journal of Human Genetics, 74(5):979–1000, 2004.
[9] Sriram Sankararaman, Srinath Sridhar, Gad Kimmel, and Eran Halperin. Estimating local ancestry in admixed populations. The American Journal of Human Genetics, 82(2):290–303, 2008.
[10] Brian K Maples, Simon Gravel, Eimear E Kenny, and Carlos D Bustamante. Rfmix: a discriminative modeling approach for rapid and robust local-ancestry inference. The American Journal of Human Genetics, 93(2):278–288, 2013.
[11] Yushi Liu, Toru Nyunoya, Shuguang Leng, Steven A Belinsky, Yohannes Tesfaigzi, and Shannon Bruse. Softwares and methods for estimating genetic ancestry in human populations. Human genomics, 7(1):1, 2013.
[12] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
[13] 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526(7571):68, 2015.
[14] Anders Bergström, Shane A McCarthy, Ruoyun Hui, Mohamed A Almarri, Qasim Ayub, Petr Danecek, Yuan Chen, Sabine Felkel, Pille Hallast, Jack Kamm, et al. Insights into human genetic variation and population history from 929 diverse genomes. Science, 367(6484), 2020.
[15] Swapan Mallick, Heng Li, Mark Lipson, Iain Mathieson, Melissa Gymrek, Fernando Racimo, Mengyao Zhao, Niru Chennagiri, Susanne Nordenfelt, Arti Tandon, et al. The simons genome diversity project: 300 genomes from 142 diverse populations. Nature, 538(7624):201–206, 2016.
[16] Christopher C Chang, Carson C Chow, Laurent CAM Tellier, Shashaank Vattikuti, Shaun M Purcell, and James J Lee. Second-generation plink: rising to the challenge of larger and richer datasets. Gigascience, 4(1):s13742–015, 2015.
[17] Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. The sequence alignment/map format and samtools. Bioinformatics, 25(16):2078–2079, 2009.
[18] Robert M Kuhn, David Haussler, and W James Kent. The ucsc genome browser and associated tools. Briefings in bioinformatics, 14(2):144–161, 2013.
[19] Sharon R Browning and Brian L Browning. Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. The American Journal of Human Genetics, 81(5):1084–1097, 2007.
[20] Björn Grüning, Ryan Dale, Andreas Sjödin, Brad A Chapman, Jillian Rowe, Christopher H Tomkins-Tinch, Renan Valieris, and Johannes Köster. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature methods, 15(7):475–476, 2018.
[21] Keith Noto, Yong Wang, Shiya Song, Joshua G Schraiber, Alisa Sedghifar, Jake K Byrnes, David A Turissini, Eurie L Hong, and Catherine A Ball. Ancestry inference using reference labeled clusters of haplotypes. bioRxiv, 2020.