LAI-NET: LOCAL-ANCESTRY INFERENCE WITH NEURAL NETWORKS

Daniel Mas Montserrat*, Carlos Bustamante†, Alexander Ioannidis†
*Purdue University  †Stanford University

©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/ICASSP40776.2020.9053662
ABSTRACT
Local-ancestry inference (LAI), also referred to as ancestry deconvolution, provides high-resolution ancestry estimation along the human genome. In both research and industry, LAI is emerging as a critical step in DNA sequence analysis, with applications extending from polygenic risk scores (used to predict traits in embryos and disease risk in adults) to genome-wide association studies, and from pharmacogenomics to inference of human population history. While many LAI methods have been developed, advances in computing hardware (GPUs) combined with machine learning techniques, such as neural networks, are enabling the development of new methods that are fast, robust, and easily shared and stored. In this paper we develop the first neural network based LAI method, named LAI-Net, providing accuracy competitive with state-of-the-art methods and robustness to missing or noisy data, while having a small number of layers.

Index Terms — Local-Ancestry Inference, Genomics, Genetics, Neural Networks, Deep Learning
1. INTRODUCTION
Although most positions in the human DNA sequence (genome) do not vary between individuals, about two percent do.

Fig. 1. Illustration of the local-ancestry inference problem. An admixed pair of chromosomes is shown with the true ancestry (top) and the decoded ancestry (bottom). The admixed diploid individual has three ancestral population sources: African, European, and East Asian.

Annotated datasets typically contain only a small number of samples. Therefore, methods that are capable of processing high-dimensional information in an efficient manner are preferred. Additionally, many datasets containing human genomic sequences are proprietary, protected by privacy restrictions, or are otherwise not accessible to the public. Models that can be easily shared once trained can be useful in such scenarios. While the datasets with their de-identifiable genome-wide sequences remain securely private, models trained on them could be made publicly available.

In recent years, deep learning has proved useful in solving computer vision and natural language processing problems [12], and it is becoming widespread in the medical field. From analyzing MRI scans [13] and detecting tumors within images [14] to finding disease predisposition in the human genome [15], neural networks have provided useful and effective solutions. Several deep learning methods have recently been presented in the field of genomics [16, 17].

In this work we present a neural network named LAI-Net and its lightweight version, named Small LAI-Net. Both networks achieve state-of-the-art results on admixed individuals simulated from real human sequences. Additionally, experimental results show that the networks are robust to missing data and phasing errors.
2. ADMIXTURE SIMULATION DATASET
In this work we use full genome sequences obtained from human research participants through the 1000 Genomes Project [18]. We select a total of 1668 single-population individuals of East Asian (EAS), African (AFR), and European (EUR) ancestry. The East Asian group is composed of the following individuals: 103 Han Chinese in Beijing, China (CHB); 104 Japanese in Tokyo, Japan (JPT); 105 Southern Han Chinese (CHS); 93 Chinese Dai in Xishuangbanna, China (CDX); and 99 Kinh in Ho Chi Minh City, Vietnam (KHV). The African group is composed of the following individuals: 108 Yoruba in Ibadan, Nigeria (YRI); 99 Luhya in Webuye, Kenya (LWK); 113 Gambian in Western Divisions in the Gambia (GWD); 85 Mende in Sierra Leone (MSL); 99 Esan in Nigeria (ESN); 61 Americans of African Ancestry in Southwest USA (ASW); and 96 African Caribbeans in Barbados (ACB). Finally, the European group is composed of the following sub-populations: 99 Utah Residents (CEPH) with Northern and Western European Ancestry (CEU); 107 Toscani in Italia (TSI); 99 Finnish in Finland (FIN); 91 British in England and Scotland (GBR); and 107 Iberian Population in Spain (IBS).

Using the full genomes of these individuals, we simulated admixed descendants using Wright-Fisher forward simulation over a series of generations. In particular, from the 1668 single-population individuals, 1328 were selected to generate 600 admixed individuals for training, 170 were used to generate 400 admixed individuals for validation, and the remaining 170 were used to generate 400 admixed individuals for testing. The validation and testing sets were each generated using 10 individuals from each of the 17 different sub-populations. The 600 admixed individuals of the training set were composed of groups of 100 individuals generated after 2, 4, 8, 16, 32, and 64 generations. The 400 admixed individuals of each of the validation and testing sets were generated with 6, 12, 24, and 48 generations (100 individuals each).
(With increasing numbers of generations following initial admixture, descendants have increasing numbers of ancestry switches along the genome, leading to more challenging inference.) This simulation scheme allowed for training and testing the network over a wide range of generations, yielding a method that is robust to populations and individuals having different admixture histories.
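The forward-simulation scheme above can be sketched as follows. This is a simplified illustration, not the authors' simulation code: the function name, the Poisson model for breakpoint counts (rate growing with generations per Morgan), and the one-Morgan chromosome length are assumptions made for the example.

```python
import numpy as np

def simulate_admixed(haplotypes, labels, n_generations, morgans=1.0, rng=None):
    """Simulate one admixed haplotype with a simplified Wright-Fisher-style
    forward process: after g generations, ancestry switches accumulate
    roughly as a Poisson process with rate g per Morgan.

    haplotypes: (n_founders, n_snps) array of -1/1 encoded founder sequences
    labels:     (n_founders,) ancestry label of each founder
    Returns the admixed haplotype and its per-SNP ancestry labels.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n_founders, n_snps = haplotypes.shape
    # More generations since admixture -> more expected ancestry switches.
    n_breaks = rng.poisson(n_generations * morgans)
    breakpoints = np.sort(rng.integers(1, n_snps, size=n_breaks))
    segments = np.split(np.arange(n_snps), breakpoints)
    hap = np.empty(n_snps, dtype=haplotypes.dtype)
    anc = np.empty(n_snps, dtype=labels.dtype)
    for seg in segments:
        k = rng.integers(n_founders)  # draw a founder for this segment
        hap[seg] = haplotypes[k, seg]
        anc[seg] = labels[k]
    return hap, anc
```

Repeating this draw for many founders and generation counts yields training, validation, and testing sets of admixed individuals with known per-SNP ancestry labels.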
3. NEURAL NETWORK ARCHITECTURE
The proposed network, LAI-Net, is composed of two sub-networks: a classification network and a smoothing layer. The network is trained to infer the ancestry of phased diploid sequences. The first layer provides an initial ancestry estimate, ŷ1, within windowed regions of the chromosome sequence. The second layer smooths the estimates over multiple windows, providing the final estimate ŷ2. Figure 2 presents the network architecture.

The first sub-network consists of a set of classifiers within non-overlapping windows of a fixed size of 500 SNPs. The input of the network consists of the base-pairs of the SNPs encoded as -1 (common variant) and 1 (minority variant). Each classifier is composed of a linear layer of size 500 × 30 followed by a ReLU activation and batch normalization. A linear layer of size 30 × N_A, combined with a softmax function, maps the hidden layer to the probabilities for assignment to each of the possible ancestries (in this case N_A = 3: African, European, and East Asian). The first sub-network is used twice, once for each sequence of the diploid individual. The second sub-network, a smoothing layer, consists of a two-dimensional convolution layer that takes as input the concatenated probabilities of the first layer and outputs the classification estimate within each window. The convolution layer has a kernel size of 75 and N_A input and output channels. Therefore, the ancestry of each window is inferred by weighting the 75 initial neighboring estimates for both maternal and paternal sequences. The convolution is performed with the proper reflection padding in order to maintain the same input and output size of the layer. By using a convolutional layer we obtain invariance to the order in which the two sequences are presented (i.e. the output of the network is the same, up to a permutation, independent of whether the maternal or paternal sequence is presented first or last).

Fig. 2. LAI-Net architecture. A: input SNPs; B: output of the first layer, ŷ1, for the maternal and paternal sequences; C: output of the smoothing layer, ŷ2; D: inferred ancestry at each window, argmax ŷ2, for the maternal and paternal sequences. Each color represents a different ancestry (AFR, EUR, EAS).

The network is trained with a weighted sum of two cross-entropy loss functions: L(y, ŷ) = λ1 L_CE(y, ŷ1) + λ2 L_CE(y, ŷ2). The first loss function, L_CE(y, ŷ1), compares the estimate of the first sub-network with the true ancestries and updates the weights of the first sub-network. The second term, L_CE(y, ŷ2), compares the estimate of the last layer with the true ancestries and updates the weights of the overall network.
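The two-term training objective can be written out directly. A minimal NumPy sketch follows; the function names and the small eps guard against log(0) are choices made for the example:

```python
import numpy as np

def cross_entropy(y, probs, eps=1e-12):
    """Mean cross-entropy of per-window ancestry probabilities.
    y:     (n_windows,) integer true-ancestry labels
    probs: (n_windows, n_anc) predicted probabilities
    """
    return -np.mean(np.log(probs[np.arange(len(y)), y] + eps))

def lai_net_loss(y, y1, y2, lam1=1.0, lam2=1.0):
    """Weighted sum of the two cross-entropy terms: the first compares
    the first-layer estimate y1 with the true ancestries, the second
    compares the post-smoothing estimate y2 with the true ancestries."""
    return lam1 * cross_entropy(y, y1) + lam2 * cross_entropy(y, y2)
```

With lam1 > 0 the first-layer probabilities are supervised directly; setting lam1 = 0 leaves the first layer trained only through the smoothing layer's loss.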
When λ1 > 0, the output of the first layer, ŷ1, represents the probabilities estimated by the classifiers; otherwise the output of the classifiers can be interpreted as a hidden layer. In this work we use equal weights, λ1 = λ2.

Dropout regularization is applied to the input data. This models missing input SNPs and provides robustness to missing data, which is a common occurrence when using current commercial genotyping arrays. Experimental results, presented in Section 4, suggest that even with half of the SNP sites removed (treated as missing), the network is able to accurately estimate ancestry, suggesting that models trained on one genotyping array could even be applied to another genotyping array. (Different commercial genotyping arrays sequence different sets of SNPs, with intersections as low as 50%.)

While methods such as RFMix require the user to specify the number of generations since admixture, LAI-Net can handle populations with variable, or unknown, numbers of generations since admixture. Generation agnosticism is obtained by training the network with data simulated over a wide range of generations. However, even if only one generation is used for training, experimental results suggest that the network is still able to infer ancestry for other generations with only a small decrease in accuracy.

We present a lightweight version of the network that we name Small LAI-Net. This network follows the same scheme as LAI-Net, but with the hidden layer removed. Thus, the network is composed of only two layers: a set of linear classifiers of size 500 × N_A and a smoothing layer with a kernel of size 75 and N_A = 3 input and output channels. Albeit less accurate, this architecture has several advantages.
First, it is faster and substantially smaller than LAI-Net. Second, when λ1 > 0 during training, the output of the first layer, ŷ1, represents the probabilities of the linear classifiers. This leads to a more interpretable network, since the learned weights of the linear classifiers specify the importance of each SNP for assignment to each ancestry.
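As a concrete illustration of the two-layer design, the following NumPy sketch implements a Small-LAI-Net-style forward pass for a single haplotype: one linear classifier per non-overlapping 500-SNP window, followed by a smoothing convolution over neighboring window probabilities with reflection padding. This is a simplified re-implementation under assumed shapes, not the authors' code; the function names and the use of plain NumPy rather than a deep learning framework are choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def small_lai_net_forward(snps, W, b, K):
    """Forward pass of a Small-LAI-Net-style model (illustrative only).

    snps: (n_windows, 500) SNPs encoded as -1/1 (0 = missing)
    W:    (n_windows, 500, n_anc) per-window linear classifier weights
    b:    (n_windows, n_anc) per-window biases
    K:    (kernel, n_anc, n_anc) smoothing kernel over neighboring windows
    Returns (first-layer probabilities, smoothed probabilities).
    """
    # First layer: one linear classifier per non-overlapping window.
    logits = np.einsum('ws,wsa->wa', snps, W) + b
    y1 = softmax(logits)
    # Smoothing layer: convolve probabilities across neighboring windows,
    # with reflection padding so input and output lengths match.
    half = K.shape[0] // 2
    padded = np.pad(y1, ((half, half), (0, 0)), mode='reflect')
    smoothed = np.empty_like(logits)
    for w in range(y1.shape[0]):
        smoothed[w] = np.einsum('ka,kab->b', padded[w:w + K.shape[0]], K)
    return y1, softmax(smoothed)
```

The per-window argmax of the smoothed probabilities gives the inferred ancestry sequence along the chromosome.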
4. EXPERIMENTAL RESULTS
We used the simulated data previously described to train and test LAI-Net. The network was trained using the Adam optimizer over 100 epochs. The validation set was used to select the training parameters, λ1 and λ2, and the network hyperparameters: window size, hidden layer size, and smoothing kernel size. Table 1 presents the accuracy results for chromosome 20 of LAI-Net and Small LAI-Net, with and without the smoothing layer, compared with the accuracy of RFMix.

Table 1. Accuracy of RFMix, Small LAI-Net, and LAI-Net with smoothing (w/ s.) and without smoothing (w/out s.) on the validation and testing sets.
Method | Validation | Test
RFMix [4] | |
Small LAI-Net w/out s. | |
Small LAI-Net w/ s. | |
LAI-Net w/out s. | |
LAI-Net w/ s. | |

The trained models occupy ∼ MB and ∼ MB for Small LAI-Net and LAI-Net, respectively. These networks are trained here with data from chromosome 20; their size scales linearly with larger chromosomes.

4.1. Missing Data Robustness
Applications that work with genotype data commonly face data that is noisy or incomplete due to genotyping errors. In other cases only a subset of SNPs might be available due to differing commercial genotyping arrays. Therefore, robustness to missing data is an important element when deploying LAI methods. Current LAI techniques require the user to update the references (training panel) and re-train the model when large numbers of SNPs are missing (e.g. when using a genotyping array vs. whole-genome sequences, or when using different genotyping arrays); our method does not require this.

In order to evaluate the network's performance when larger amounts of data are missing, we trained and tested the network with different percentages of missing input SNPs. The structure of the network was not changed, and the missing SNPs were modeled by applying dropout to the input data in both training and testing (i.e. missing SNPs were set to 0). Table 2 presents the accuracy of the estimates of the first and second layers for different percentages of missing input SNPs.
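The dropout-style modeling of missing sites described above amounts to zeroing out a random fraction of the -1/1-encoded inputs. A minimal sketch, where the function name and uniform-at-random masking are assumptions made for illustration:

```python
import numpy as np

def mask_missing(snps, p_missing, rng):
    """Emulate missing SNPs: with the -1/1 encoding, each site is
    independently set to 0 with probability p_missing, at both
    training and testing time."""
    keep = rng.random(snps.shape) >= p_missing  # True = site observed
    return np.where(keep, snps, 0)
```

Because a masked site contributes exactly zero to every linear classifier's pre-activation, the remaining observed SNPs still drive the per-window ancestry estimate.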
Table 2. Accuracy of LAI-Net for different percentages of missing input SNPs, with and without the smoothing layer.

% Missing SNPs | w/out Smoothing | w/ Smoothing

4.2. Phasing Error Robustness

Humans carry two complete copies of the genome, one from each parent. Current sequencing technologies are typically unable to ascertain whether two neighboring SNP variants belong to the same sequence (maternal or paternal) or opposite sequences. That is, read base-pairs cannot be properly assigned to the paternal or maternal sequences. Assigning variants to their correct sequence is known as phasing, and statistical algorithms have been developed to solve this problem based on observed correlations between neighboring SNP variants in reference populations. Such methods include Beagle [19] and SHAPEIT [20]. However, these tools are not perfect, with occasional swaps occurring between the two sequences.

In order to evaluate the network's performance in the presence of phase errors, we trained and tested the network with data containing different percentages of phasing errors. In order to model these errors, we randomly swapped the genomic sequences at locations where the base-pairs differed between the maternal and paternal sequences. In other words, after encoding the SNPs as -1 and 1, the signs of the SNPs at positions where the paternal and maternal alleles are 1 and -1, or vice versa, were switched with a probability p.

Table 3 presents the accuracy results of LAI-Net when different values of p were used for training and evaluation. Results suggest that the network is able to handle small and medium levels of phasing error; however, the accuracy decreases considerably when very high phasing-error rates are present.

Table 3. Accuracy of LAI-Net, with and without the smoothing layer, for different percentages of phasing error. The networks are trained and evaluated with several values of p.

% Phasing Errors | w/out Smoothing | w/ Smoothing
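The per-site phase-error model described above can be sketched as follows. This is an illustrative re-implementation under the stated -1/1 encoding, with the function name chosen for the example:

```python
import numpy as np

def add_phase_errors(maternal, paternal, p, rng):
    """Introduce phasing errors: at heterozygous sites (where the -1/1
    encodings of the two haplotypes differ), swap the maternal and
    paternal alleles with probability p. Homozygous sites are unaffected,
    since swapping identical alleles changes nothing."""
    het = maternal != paternal
    swap = het & (rng.random(maternal.shape) < p)
    m, f = maternal.copy(), paternal.copy()
    m[swap] = paternal[swap]
    f[swap] = maternal[swap]
    return m, f
```

Each swap changes the sign of the affected site on both haplotypes while leaving the unphased genotype (the per-site pair of alleles) unchanged, which is exactly the ambiguity a phasing algorithm can get wrong.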
5. CONCLUSIONS
LAI methods are being used across a broadening array of applications by researchers and practitioners with widely different technical backgrounds. Thus, these methods, besides being accurate, need to be easy to share once trained and must be robust to missing data, allowing for application across differing genotyping platforms. In this work, we present an approach based on neural networks that provides accuracy competitive with state-of-the-art methods and a shareable model that can perform across different genotyping arrays. The ability to share trained models removes the burden of training (and finding appropriate reference populations) from the user, simplifying the use of LAI. Potential pitfalls are reduced, as is the level of experience required of the user, while the training time (the slowest and most computationally expensive step in LAI) need not be borne by the user. Most importantly, highly accurate models can be generated on data sets that cannot themselves be shared without breaching privacy restrictions.

6. REFERENCES

[1] J. Z. Li, D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, S. Ramachandran, H. M. Cann, G. S. Barsh, M. Feldman, L. L. Cavalli-Sforza, and R. M. Myers, "Worldwide human relationships inferred from genome-wide patterns of variation," Science, vol. 319, no. 5866, pp. 1100–1104, February 2008.
[2] M. DeGiorgio, M. Jakobsson, and N. A. Rosenberg, "Out of Africa: modern human origins special feature: explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa," Proceedings of the National Academy of Sciences of the United States of America, vol. 106, no. 38, pp. 16057–16062, September 2009.
[3] A. L. Price, A. Tandon, N. Patterson, K. C. Barnes, N. Rafaels, I. Ruczinski, T. H. Beaty, R. Mathias, D. Reich, and S. Myers, "Sensitive detection of chromosomal segments of distinct ancestry in admixed populations," PLoS Genetics, vol. 5, no. 6, pp. 1–18, June 2009.
[4] B. K. Maples, S. Gravel, E. E. Kenny, and C. D. Bustamante, "RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference," The American Journal of Human Genetics, vol. 93, no. 2, pp. 278–288, August 2013.
[5] A. R. Martin, M. Lin, J. M. Granka, J. W. Myrick, X. Liu, A. Sockell, E. G. Atkinson, C. J. Werely, M. Möller, M. S. Sandhu, et al., "An unexpectedly complex architecture for skin pigmentation in Africans," Cell, vol. 171, no. 6, pp. 1340–1353, November 2017.
[6] A. Moreno-Estrada et al., "The genetics of Mexico recapitulates Native American substructure and affects biomedical traits," Science, vol. 344, no. 6189, pp. 1280–1285, June 2014.
[7] K. Bryc, E. Y. Durand, J. M. Macpherson, D. Reich, and J. L. Mountain, "The genetic ancestry of African Americans, Latinos, and European Americans across the United States," American Journal of Human Genetics, vol. 96, no. 1, pp. 37–53, January 2015.
[8] D. G. Torgerson et al., "Case-control admixture mapping in Latino populations enriches for known asthma-associated genes," Journal of Allergy and Clinical Immunology, vol. 130, no. 1, pp. 76–82, 2012.
[9] H. Tang, M. Coram, P. Wang, X. Zhu, and N. Risch, "Reconstructing genetic ancestry blocks in admixed individuals," The American Journal of Human Genetics, vol. 79, pp. 1–12, May 2006.
[10] A. Sundquist, E. Fratkin, C. B. Do, and S. Batzoglou, "Effect of genetic divergence in identifying ancestral origin using HAPAA," Genome Research, vol. 18, pp. 676–682, April 2008.
[11] S. Sankararaman, S. Sridhar, G. Kimmel, and E. Halperin, "Estimating local ancestry in admixed populations," The American Journal of Human Genetics, vol. 82, no. 2, pp. 290–303, February 2008.
[12] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[13] A. S. Lundervold and A. Lundervold, "An overview of deep learning in medical imaging focusing on MRI," Zeitschrift für Medizinische Physik, vol. 29, no. 2, pp. 102–127, May 2019.
[14] D. S. Gareau, J. Correa da Rosa, S. Yagerman, J. A. Carucci, N. Gulati, F. Hueto, J. L. DeFazio, M. Suárez-Fariñas, A. Marghoob, and J. G. Krueger, "Digital imaging biomarkers feed machine learning for melanoma screening," Experimental Dermatology, vol. 26, no. 7, pp. 615–618, July 2017.
[15] X. Li, L. Liu, J. Zhou, and C. Wang, "Heterogeneity analysis and diagnosis of complex diseases based on deep learning method," Scientific Reports, vol. 8, no. 1, pp. 1–8, April 2018.
[16] D. Mas Montserrat, C. Bustamante, and A. Ioannidis, "Class-conditional VAE-GAN for local-ancestry simulation," Machine Learning in Computational Biology, December 2019, Vancouver, Canada.
[17] G. Eraslan, Ž. Avsec, J. Gagneur, and F. J. Theis, "Deep learning: new computational modelling techniques for genomics," Nature Reviews Genetics, pp. 389–403, April 2019.
[18] 1000 Genomes Project Consortium and others, "A global reference for human genetic variation," Nature, vol. 526, no. 7571, p. 68, 2015.
[19] S. R. Browning and B. L. Browning, "Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering," The American Journal of Human Genetics, vol. 81, no. 5, pp. 1084–1097, 2007.
[20] O. Delaneau, J. Marchini, and J.-F. Zagury, "A linear complexity phasing method for thousands of genomes,"