Class-Conditional VAE-GAN for Local-Ancestry Simulation

Daniel Mas Montserrat∗ (Purdue University), Carlos Bustamante (Stanford University), Alexander Ioannidis (Stanford University)
Abstract
Local ancestry inference (LAI) allows identification of the ancestry of all chromosomal segments in admixed individuals, and it is a critical step in the analysis of human genomes, with applications from pharmacogenomics and precision medicine to genome-wide association studies. In recent years, many LAI techniques have been developed in both industry [1] and academic research [2]. However, these methods require large training data sets of human genomic sequences from the ancestries of interest. Such reference data sets are usually limited, proprietary, protected by privacy restrictions, or otherwise not accessible to the public. Techniques to generate training samples that resemble real haploid sequences from ancestries of interest can be useful tools in such scenarios, since a generalized model can often be shared, but the unique human sample sequences cannot. In this work we present a class-conditional VAE-GAN to generate new human genomic sequences that can be used to train local ancestry inference (LAI) algorithms. We evaluate the quality of our generated data by comparing the performance of a state-of-the-art LAI method when trained with generated versus real data.
Human populations all share a common ancient origin in Africa [3], and a common set of variable sites, but correlations between neighboring sites along the genome, which are typically inherited together, vary between sub-populations around the globe [4]. These correlations along the genome, known as linkage, influence polygenic risk scores (PRS) [5], genome-wide association study (GWAS) results [6], and many other features of precision medicine. Unfortunately, large portions of the world's populations have not been included in modern genetic research studies, with over 80% of these studies to date including only individuals of European ancestry [7]. This has serious adverse consequences for the ability of associations learned in these modern studies to be applied to the rest of the world [5]. Deconvolving the ancestry of admixed individuals using local-ancestry inference can contribute to filling this gap and understanding the genetic architecture and associations of non-European ancestries, thus allowing the benefits of genomic medicine to accrue to a larger portion of the planet's population.

Many methods for local-ancestry inference exist and are open-source: HAPAA [8], HAPMIX [9] and SABER [10] infer local ancestry using Hidden Markov Models (HMMs), LAMP [11] uses probability maximization with a sliding window, and RFMix [2] uses random forests within windows. However, these algorithms all require accessible training data from relevant ancestries in order to recognize those ancestry segments.

∗ This work was conducted during an internship at Stanford University.

Preprint. Under review.

The challenge is that many data sets containing human genomic references are proprietary [12, 13], protected by privacy restrictions [14], or are otherwise not accessible to the public, especially data sets for under-served or sensitive populations. Generative models that can be easily shared once
trained can be useful in such scenarios. While the data sets with their de-anonymizable genome-wide sequences remain securely private, models trained on them could be made publicly available.

In recent years, deep learning has proven effective in solving computer vision and natural language processing problems [15]. These methods are being used in the biology, medical and genomics fields [16-19]. Specifically, deep learning-based generative methods have become increasingly popular in recent years. Generative networks such as Variational Autoencoders (VAEs) [20] contain a network that encodes the input data into a lower-dimensional space and a decoder that tries to reconstruct the original input. Generative Adversarial Networks (GANs) [21] have been able to generate samples that resemble the training data. GANs generate realistic data by using two competing networks: a generator that aims to create realistic new samples and a discriminator that classifies between real and generated samples. Many variants and extensions of GANs and VAEs have been presented recently [22-24].

In this work, we present a class-conditional Variational Autoencoder and Generative Adversarial Network (VAE-GAN) for human genome sequence simulation. The network combines a class-conditional VAE, shown in figure 1, with a class-conditional GAN, shown in figure 2. The network is able to simulate new single-ancestry sequences that resemble the sequences from the training set. The generated sequences are used to train RFMix.

Figure 1: The class-conditional VAE is composed of an encoder-decoder pair. The encoder transforms the input sequence x from the ancestry c into an embedded representation z. The decoder transforms the embedding z and ancestry c into a reconstruction of the input sequence, x̃.

In this work we use simulated datasets with ancestry data generated from out-of-Africa simulations using msprime [25].
This simulation models the origin and spread of humans as a single ancestral population that grew instantaneously within the continent of Africa. This population stayed at a constant size to the present day. At some point in the past, a small group of individuals migrated out of Africa and later split in two directions: some founding the present-day European populations, and another founding the present-day East Asian populations. Both populations grew exponentially after their separation. The parameters that determine the timing of these events, the effective population sizes, and the growth rates of the European and East Asian populations are presented in Gravel et al. [26].

Following the above out-of-Africa model, we generated three groups of 100 diploid individuals of single ancestry, one group each of African, European and East Asian ancestry. We divided these 300 simulated individuals into training, validation and testing sets with 240, 30 and 30 diploid individuals respectively. Later, the validation and testing individuals were used to generate admixed descendants using Wright-Fisher forward simulation over a series of generations. From 30 single-ancestry individuals, a total of 100 admixed individuals were generated, with the admixture event occurring 8 generations in their past, to create both validation and testing sets. The 240 single-ancestry individuals were used to train RFMix and the class-conditional VAE-GAN, and the 200 admixed individuals of the validation and testing sets were used to evaluate RFMix following training. Throughout, we use chromosome 20 of each individual for experiments.

Figure 2: The class-conditional GAN is composed of a decoder-discriminator pair. The decoder generates new samples x_fake from a Gaussian representation z_x and ancestry c. The discriminator separates between out-of-Africa sequences x_real and VAE-GAN generated sequences x_fake.

The proposed network splits the genome into fixed-size non-overlapping windows.
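The Wright-Fisher admixture step described above can be sketched as a minimal forward simulation. This is an illustrative sketch only, not the authors' exact pipeline: the single-crossover recombination model, the haploid mating pool, and the founder encoding are all simplifying assumptions. Carrying a per-position ancestry label alongside each haplotype is what provides the ground-truth local ancestry used to score LAI methods.

```python
import random

def wright_fisher_admixture(founders, n_generations, rng=None):
    """Forward-simulate admixed haplotypes from single-ancestry founders.

    Each founder is a (haplotype, ancestry_labels) pair; the per-position
    labels are carried through recombination, so the true local ancestry of
    every descendant segment is known. Simplifying assumption: one uniform
    crossover per gamete and a haploid, constant-size mating pool.
    """
    rng = rng or random.Random(0)
    seq_len = len(founders[0][0])

    def gamete(parent_a, parent_b):
        # Recombine two parental haplotypes at a single random breakpoint,
        # splicing both the SNP values and their ancestry labels.
        cut = rng.randrange(1, seq_len)
        return (parent_a[0][:cut] + parent_b[0][cut:],
                parent_a[1][:cut] + parent_b[1][cut:])

    population = list(founders)
    for _ in range(n_generations):
        population = [gamete(*rng.sample(population, 2))
                      for _ in range(len(population))]
    return population
```

With founders drawn from the three single-ancestry groups and `n_generations = 8`, this mirrors the admixture event described above.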
The SNPs within each window are used as the input for individual class-conditional VAE-GANs. The input SNPs are encoded as -1 and 1 for each base pair. Missing input SNPs are modeled by inputting a 0 in the corresponding position. The VAE-GANs are composed of three sub-networks: an encoder, a decoder, and a discriminator. Each sub-network is class-conditional (i.e. the ancestry is an additional input of the network). The encoder-decoder pair forms a VAE (figure 1), while the decoder-discriminator pair forms a GAN (figure 2).

The encoder, q(x; c), transforms the input SNPs x from the given ancestry c (represented with a one-hot encoding) into an isotropic Gaussian embedding space z. The network encodes the input sequence into the embedding space by estimating µ(x; c) and log Σ(x; c). The variance is estimated in logarithmic form to force Σ(x; c) > 0. The embedded representation of a sample x from an ancestry c is sampled as z_x ∼ N(µ(x; c), Σ(x; c)). The sampling can be performed with the reparametrization trick: z_x = µ(x; c) + Σ(x; c)^(1/2) ⊙ ε, where ε ∼ N(0, I) and ⊙ denotes element-wise multiplication. The encoder networks begin with an input linear layer of size (W + C) × H, where W is the window size, C is the number of ancestries, and H is the size of the hidden layer. Following the first layer, a ReLU non-linearity and batch normalization are used. Then, two linear layers of dimension H × J, where J is the dimension of the embedding space, are used to estimate µ(x; c) and log Σ(x; c).

The decoder, given an ancestry c and an embedded representation z_x, tries to reconstruct the input SNPs: x̃ = p(z_x; c). In order to obtain training samples for LAI methods, new sequences can be simulated by selecting the desired ancestry c, sampling a random embedding z ∼ N(0, I), and reconstructing the SNP sequence x_new = p(z; c).
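The encoding and sampling steps above can be illustrated in a few lines of plain Python. This is a sketch only; a real implementation would live in a deep-learning framework so that gradients flow through µ and log Σ, and the helper names here are our own, not the authors' code:

```python
import math
import random

def encode_snps(calls):
    """Encode SNP calls as described in the text: one allele maps to -1,
    the other to +1, and a missing call to 0."""
    table = {0: -1.0, 1: 1.0, None: 0.0}
    return [table[c] for c in calls]

def reparameterize(mu, log_var, rng=None):
    """Reparametrization trick: z = mu + Sigma^(1/2) * eps with eps ~ N(0, I).

    The encoder outputs log Sigma, which keeps the variance positive;
    exponentiating half of it recovers the standard deviation.
    """
    rng = rng or random.Random(0)
    eps = [rng.gauss(0.0, 1.0) for _ in mu]          # eps ~ N(0, I)
    return [m + math.exp(0.5 * lv) * e
            for m, lv, e in zip(mu, log_var, eps)]
```

Sampling a new sequence for an ancestry c then amounts to drawing z ∼ N(0, I) this way (with µ = 0, log Σ = 0) and running the decoder p(z; c) on it.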
The decoder networks start with an input layer of size (J + C) × H, followed by a ReLU non-linearity, batch normalization and an output linear layer of size H × W. The discriminator network is trained to distinguish the real samples from the fake samples: ŷ = D(x; c). The discriminator networks start with an input layer of size (W + C) × H, followed by a ReLU non-linearity, batch normalization and an output linear layer of size H × 1.

The encoder is trained by minimizing the mean square error between the input and reconstructed sequences together with the Kullback-Leibler divergence. The encoder loss function is as follows:

L_q(x, c) = ||x − x̃||² + (1/2) Σ_{j=1}^{J} (µ_j² + Σ_j − log Σ_j − 1)    (1)

where x and x̃ are the input and reconstructed sequences respectively, J is the dimension of the embedding space, µ_j is the j-th element of µ(x; c) and Σ_j is the j-th element of the diagonal of Σ(x; c). The decoder is trained by minimizing the mean square error of the reconstruction together with the adversarial loss:

L_p(x, z, c) = ||x − x̃||² + λ log(1 − D(p(z; c)))    (2)

where p(z; c) is a simulated sequence from a randomly selected ancestry c and z ∼ N(0, I). In our work we select λ = 0. The discriminator is trained using binary cross-entropy with real, x, and simulated data, p(z; c):

L_D(x, z, c) = −log(D(x)) − log(1 − D(p(z; c)))    (3)

Because the sequence is generated in a windowed approach, a different ancestry can be assigned to each window to simulate an admixed individual. However, in this work we focus on single-ancestry individuals. The network is trained to obtain haploid sequences, but by generating pairs of haploid sequences, diploid chromosomes can be simulated. In order to avoid duplicate or very similar individuals, we generate N times the number of desired individuals, compute the pair-wise correlations of the generated sequences, and then keep the desired number of individuals with the lowest average correlation.
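The three losses above can be written out per sample in plain Python. This is an illustrative sketch: real training would batch these in a deep-learning framework, and `lam` stands for the weight λ:

```python
import math

def mse(x, x_rec):
    """Squared reconstruction error ||x - x_rec||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))

def encoder_loss(x, x_rec, mu, sigma2):
    """Eq. (1): reconstruction error plus KL(N(mu, diag(sigma2)) || N(0, I))."""
    kl = 0.5 * sum(m * m + s - math.log(s) - 1.0 for m, s in zip(mu, sigma2))
    return mse(x, x_rec) + kl

def decoder_loss(x, x_rec, d_fake, lam):
    """Eq. (2): reconstruction error plus the weighted adversarial term,
    where d_fake = D(p(z; c)) is the discriminator's score on a sample."""
    return mse(x, x_rec) + lam * math.log(1.0 - d_fake)

def discriminator_loss(d_real, d_fake):
    """Eq. (3): binary cross-entropy on real vs. generated sequences."""
    return -math.log(d_real) - math.log(1.0 - d_fake)
```

As a sanity check: when µ = 0 and Σ = I the KL term vanishes, and a discriminator that outputs 0.5 on everything pays the maximal-uncertainty cost of 2 log 2.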
In this paper we use N = 2.

We use the single-ancestry out-of-Africa individuals of the training set to train each VAE-GAN. After training the networks, we generate a total of 80 synthetic samples per ancestry and train RFMix. RFMix is then evaluated with the admixed individuals in the validation set. We select the hyper-parameters of the VAE-GAN (window size, hidden layer size and embedding dimension) and the training parameters (learning rate, batch size and number of epochs) that provide the highest validation accuracy of RFMix. Finally, we compare the testing accuracy of RFMix when trained with out-of-Africa data versus when trained with data generated with the VAE-GANs. Additionally, we compare the results of including the discriminator and the adversarial loss (VAE-GAN) with only using a VAE.

Table 1 shows that RFMix obtains comparable accuracies when trained with out-of-Africa data and with simulated data. Accuracy results show that adding the discriminator and the adversarial loss helps the network learn to simulate human-chromosome sequences that are more similar to the original training data, and therefore more useful for training LAI methods, providing a significant increase in accuracy.

Table 1: Accuracy of RFMix [2] trained with real and generated data
Method                     | RFMix Validation Accuracy | RFMix Testing Accuracy
Out-of-Africa Data         |                           |
Generated Data (VAE)       |                           |
Generated Data (VAE-GAN)   |                           |
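The duplicate-avoidance heuristic described in the method (oversample candidate sequences, then keep those with the lowest average pairwise correlation) can be sketched as follows; `select_diverse` and the Pearson helper are our own illustrative names, not functions from the authors' code:

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb) if sa and sb else 1.0

def select_diverse(candidates, n_keep):
    """From an oversampled pool, keep the n_keep sequences whose average
    correlation with the rest of the pool is lowest (duplicates score ~1)."""
    def avg_corr(i):
        others = [pearson(candidates[i], candidates[j])
                  for j in range(len(candidates)) if j != i]
        return sum(others) / len(others)
    ranked = sorted(range(len(candidates)), key=avg_corr)
    return [candidates[i] for i in ranked[:n_keep]]
```

With the paper's N = 2, one would generate twice the desired number of sequences per ancestry and keep the least-correlated half.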
In this work we show a proof of concept for data generation using VAE-GANs. Such networks show promising results with out-of-Africa simulated data. Strong simulation methods allow researchers to infer ancestry using a wide range of existing tools without needing access to real data from sensitive populations, or from proprietary or protected databases. Besides simulation, generative models have the potential to estimate meaningful representations in the embedding space or to be useful tools for data imputation or reconstruction.

Future work includes using real human-genome sequences to train and evaluate our networks and studying how generative models can be used to help interpret the histories of populations.

References

[1] E. Y. Durand, C. B. Do, J. L. Mountain, and J. M. Macpherson, "Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution," bioRxiv, p. 010512, Oct. 2014.

[2] B. K. Maples, S. Gravel, E. E. Kenny, and C. D. Bustamante, "RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference," The American Journal of Human Genetics, vol. 93, pp. 278–288, August 2013.

[3] M. DeGiorgio, M. Jakobsson, and N. A. Rosenberg, "Out of Africa: modern human origins special feature: explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa," Proceedings of the National Academy of Sciences of the United States of America, vol. 106, pp. 16057–16062, September 2009.

[4] J. Z. Li, D. M. Absher, H. Tang, A. M. Southwick, A. M. Casto, S. Ramachandran, H. M. Cann, G. S. Barsh, M. Feldman, L. L. Cavalli-Sforza, and R. M. Myers, "Worldwide human relationships inferred from genome-wide patterns of variation," Science, vol. 319, pp. 1100–1104, February 2008.

[5] L. Duncan, H. Shen, B. Gelaye, J. Meijsen, K. Ressler, M. Feldman, R. Peterson, and B. Domingue, "Analysis of polygenic risk score usage and performance in diverse human populations," Nature Communications, vol. 10, pp. 1–9, July 2019.

[6] A. R. Martin, M. Lin, J. M. Granka, J. W. Myrick, X. Liu, A. Sockell, E. G. Atkinson, C. J. Werely, M. Möller, M. S. Sandhu, et al., "An unexpectedly complex architecture for skin pigmentation in Africans," Cell, vol. 171, pp. 1340–1353, November 2017.

[7] A. B. Popejoy and S. M. Fullerton, "Genomics is failing on diversity," Nature News, vol. 538, pp. 161–164, October 2016.

[8] A. Sundquist, E. Fratkin, C. B. Do, and S. Batzoglou, "Effect of genetic divergence in identifying ancestral origin using HAPAA," Genome Research, vol. 18, pp. 676–682, April 2008.

[9] A. L. Price, A. Tandon, N. Patterson, K. C. Barnes, N. Rafaels, I. Ruczinski, T. H. Beaty, R. Mathias, D. Reich, and S. Myers, "Sensitive Detection of Chromosomal Segments of Distinct Ancestry in Admixed Populations," PLoS Genetics, vol. 5, pp. 1–18, June 2009.

[10] H. Tang, M. Coram, P. Wang, X. Zhu, and N. Risch, "Reconstructing genetic ancestry blocks in admixed individuals," The American Journal of Human Genetics, vol. 79, pp. 1–12, May 2006.

[11] S. Sankararaman, S. Sridhar, G. Kimmel, and E. Halperin, "Estimating local ancestry in admixed populations," The American Journal of Human Genetics, vol. 82, pp. 290–303, February 2008.

[12] E. Han, Y. Wang, P. Carbonetto, R. E. Curtis, J. M. Granka, J. Byrnes, K. Noto, A. R. Kermany, N. M. Myres, M. J. Barber, K. A. Rand, S. Song, T. Roman, E. Battat, E. Elyashiv, H. Guturu, E. L. Hong, K. G. Chahine, and C. A. Ball, "Clustering of 770,000 genomes reveals post-colonial population structure of North America," Nature Communications, vol. 8, p. 14238, Feb. 2017.

[13] K. Bryc, E. Y. Durand, J. M. Macpherson, D. Reich, and J. L. Mountain, "The Genetic Ancestry of African Americans, Latinos, and European Americans across the United States," American Journal of Human Genetics, vol. 96, pp. 37–53, January 2015.

[14] G. L. Wojcik, M. Graff, K. K. Nishimura, R. Tao, J. Haessler, C. R. Gignoux, H. M. Highland, Y. M. Patel, E. P. Sorokin, C. L. Avery, G. M. Belbin, S. A. Bien, I. Cheng, S. Cullina, C. J. Hodonsky, Y. Hu, L. M. Huckins, J. Jeff, A. E. Justice, J. M. Kocarnik, U. Lim, B. M. Lin, Y. Lu, S. C. Nelson, S.-S. L. Park, H. Poisner, M. H. Preuss, M. A. Richard, C. Schurmann, V. W. Setiawan, A. Sockell, K. Vahi, M. Verbanck, A. Vishnu, R. W. Walker, K. L. Young, N. Zubair, V. Acuna-Alonso, J. L. Ambite, K. C. Barnes, E. Boerwinkle, E. P. Bottinger, C. D. Bustamante, C. Caberto, S. Canizales-Quinteros, M. P. Conomos, E. Deelman, R. Do, K. Doheny, L. Fernández-Rhodes, M. Fornage, B. Hailu, G. Heiss, B. M. Henn, L. A. Hindorff, R. D. Jackson, C. A. Laurie, C. C. Laurie, Y. Li, D.-Y. Lin, A. Moreno-Estrada, G. Nadkarni, P. J. Norman, L. C. Pooler, A. P. Reiner, J. Romm, C. Sabatti, K. Sandoval, X. Sheng, E. A. Stahl, D. O. Stram, T. A. Thornton, C. L. Wassel, L. R. Wilkens, C. A. Winkler, S. Yoneyama, S. Buyske, C. A. Haiman, C. Kooperberg, L. Le Marchand, R. J. F. Loos, T. C. Matise, K. E. North, U. Peters, E. E. Kenny, and C. S. Carlson, "Genetic analyses of diverse populations improves discovery for complex traits," Nature, vol. 570, pp. 514–518, June 2019.

[15] Y. LeCun, Y. Bengio, and G. Hinton, "Deep Learning," Nature, vol. 521, pp. 436–444, May 2015.

[16] D. S. Gareau, J. Correa da Rosa, S. Yagerman, J. A. Carucci, N. Gulati, F. Hueto, J. L. DeFazio, M. Suárez-Fariñas, A. Marghoob, and J. G. Krueger, "Digital imaging biomarkers feed machine learning for melanoma screening," Experimental Dermatology, vol. 26, pp. 615–618, July 2017.

[17] A. S. Lundervold and A. Lundervold, "An overview of deep learning in medical imaging focusing on MRI," Zeitschrift für Medizinische Physik, vol. 29, pp. 102–127, May 2019.

[18] X. Li, L. Liu, J. Zhou, and C. Wang, "Heterogeneity analysis and diagnosis of complex diseases based on deep learning method," Scientific Reports, vol. 8, pp. 1–8, April 2018.

[19] G. Eraslan, Ž. Avsec, J. Gagneur, and F. J. Theis, "Deep learning: new computational modelling techniques for genomics," Nature Reviews Genetics, pp. 389–403, April 2019.

[20] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," Proceedings of the International Conference on Learning Representations, pp. 1–14, April 2014.

[21] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

[22] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.

[23] A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," Proceedings of the International Conference on Learning Representations, May 2019. New Orleans, LA.

[24] K. Sohn, H. Lee, and X. Yan, "Learning structured output representation using deep conditional generative models," Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.

[25] J. Kelleher, A. M. Etheridge, and G. McVean, "Efficient coalescent simulation and genealogical analysis for large sample sizes," PLoS Computational Biology, vol. 12, pp. 1–22, May 2016.

[26] S. Gravel, B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth, A. G. Clark, F. Yu, R. A. Gibbs, C. D. Bustamante, the 1000 Genomes Project, et al., "Demographic history and rare allele sharing among human populations,"