Addressing Ancestry Disparities in Genomic Medicine: A Geographic-aware Algorithm
Daniel Mas Montserrat, Arvind Kumar, Carlos Bustamante, Alexander Ioannidis
Accepted as a workshop paper at AI4CC, ICLR 2020
Daniel Mas Montserrat ∗
Purdue University
Arvind Kumar
Stanford University
Carlos Bustamante
Stanford University
Alexander Ioannidis
Stanford University

ABSTRACT
With declining sequencing costs, a promising and affordable tool is emerging in cancer diagnostics: genomics [1]. By using association studies, genomic variants that predispose patients to specific cancers can be identified, while by using tumor genomics, cancer types can be characterized for targeted treatment. However, a severe disparity is rapidly emerging in this new area of precision cancer diagnosis and treatment planning, one which separates a few genetically well-characterized populations (predominantly European) from all other global populations. Here we discuss the problem of population-specific genetic associations, which is driving this disparity, and present a novel solution, coordinate-based local ancestry, for helping to address it. We demonstrate our boosting-based method on whole genome data from divergent groups across Africa and in the process observe signals that may stem from the transcontinental Bantu expansion.
INTRODUCTION
Cancer genomics depends upon the identification of variants that are associated with particular types of cancers. Because such variants are deleterious, they are not typically part of the ancient standing variation spread across all humans; instead they are more recent mutations specific to particular populations. Indeed, such variants are often present prominently only in particular ethnic groups due to genetic drift [2]. In addition, most associations are mapped not to causal variants, but to more common neighboring variants that are present on genotyping arrays. Since these neighboring variants are linked to the causal variant via correlation structures (linkage) that are specific to each population, the ancestry of the genomic segment in which the correlated variant is found becomes crucial. Indeed, as a result of linkage and epistatic effects, genomic variants that are associated with cancer in one ancestry may have no association [3], or may even have an opposite association [4], in another ancestry. This phenomenon persists even in admixed individuals possessing multiple ancestries, such as African Americans; in such individuals the ancestry (European or African) of the specific genomic fragment containing the associated variant has been found to reverse the association [5]. This phenomenon, dubbed “flip-flop,” is not an unusual case; rather, ancestry-specific effects in genetic association studies are the rule. For this reason, polygenic risk scores (PRS), increasingly important to genomic cancer prediction [6], have been found to be several times less accurate when used on populations of different ancestry from the one on which they were trained [7].

As a result of these ancestry-specific effects, accurately identifying the ancestry of each segment of the genome is becoming increasingly crucial for genomic medicine.
Algorithms for this task, known as local ancestry inference, have been developed both for historical population genetics [8–15] and for recreational consumer ancestry products [16], but none have been developed to date for the particular demands of clinical genomic medicine. Such an algorithm would need to provide ancestry not as a culturally defined label, but as continuous genetic coordinates that could be used as a covariate in prediction and association algorithms. This method is also important for deconvolving ancestry effects in genetic association studies. To date, most genome-wide association studies (GWAS) are conducted in populations of single ancestry (typically European) to avoid the confounding effect of ancestry, which can reverse associations. Researchers often avoid admixed populations, for instance African Americans or Hispanics, who encompass more than one ancestry, and avoid populations with too much genetic variation or too many diverse sub-populations, as is common within Africa. This has resulted in over 80% of the individuals in GWAS to date stemming from European ancestry (and only 2% from African ancestry) [17, 18]. A reliable coordinate-based local ancestry algorithm would allow such studies to embrace diversity, rather than intentionally eschewing it, by allowing an additional covariate along the genome (ancestry) to be used to remove the confounding effects of ancestry-dependent genomic associations. With such a tool, medical researchers would no longer need to avoid admixed and globally diverse genetic study cohorts.

∗ Work conducted during an internship at Stanford University.

ANCESTRY INFERENCE
Here we present an accurate coordinate-based local ancestry inference algorithm, XGMix, that can be used for addressing ancestry-specific associations and predictions. XGMix uses modern single-ancestry reference populations to accurately predict the latitude and longitude of the closest modern source population for each segment of an individual’s genome. These coordinate annotations along the genome can then be used as covariates for genome-wide association studies (GWAS) and for polygenic risk score (PRS) predictions.

Estimation of an individual’s ancestry, both globally and locally (i.e., assigning an ancestry estimate to each region of the chromosomal sequence), has been tackled with a wide range of methods and technologies [8–15]. Local ancestry inference has traditionally been framed as a classification problem using pre-defined ancestries. Classification approaches provide discrete ancestry labels but can be highly inaccurate for neighboring populations (or population gradients) and intractable for genetically diverse populations with multiple sources. Geographical regression along the genome, although a much more challenging problem, could provide a continuous representation of ancestry capable of capturing the complexities of worldwide populations.

XGMix consists of two layers of stacked gradient boosted trees (a genomic window-specific layer and a window-aggregating smoother) and can infer local ancestry with both classification probabilities and geographical coordinates along each phased chromosome. Here we demonstrate XGMix by training on whole genomes from real individuals from the five African populations included in the 1000 Genomes Project [19]. We simulate admixed individuals of various generations using Wright-Fisher simulation [13] to create ground truth labels of ancestry along the genome and split this data for training and testing.
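The two-layer structure described above (a window-specific layer followed by a smoother that aggregates neighboring windows) can be illustrated with a minimal, self-contained sketch. It is not the XGMix implementation: a tiny pure-Python gradient boosting regressor built from decision stumps stands in for a production boosted-tree library, a simple mosaic generator stands in for the Wright-Fisher simulation, and the two reference populations, their allele frequencies, the arc coordinates 0.0 and 1.0, the window size, the one-window context on each side, and all function names are assumptions chosen for illustration.

```python
import random

def fit_stump(X, r, max_thresholds=16):
    """Least-squares regression stump fitted to residuals r."""
    n, best = len(X), None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})[:-1]   # dropping the max keeps both sides non-empty
        step = max(1, len(values) // max_thresholds)  # subsample candidate thresholds
        for t in values[::step]:
            left = [r[i] for i in range(n) if X[i][j] <= t]
            right = [r[i] for i in range(n) if X[i][j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # every feature is constant: fall back to the mean residual
        mean = sum(r) / n
        return 0, float("inf"), mean, mean
    return best[1:]

def fit_gbm(X, y, rounds=10, lr=0.3):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(y) / len(y)
    pred, stumps = [base] * len(y), []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        j, t, lm, rm = fit_stump(X, resid)
        stumps.append((j, t, lm, rm))
        pred = [p + lr * (lm if row[j] <= t else rm) for p, row in zip(pred, X)]
    return base, lr, stumps

def predict_gbm(model, row):
    base, lr, stumps = model
    return base + lr * sum(lm if row[j] <= t else rm for j, t, lm, rm in stumps)

N_WINDOWS, SNPS_PER_WINDOW = 10, 6
rng = random.Random(0)
# Hypothetical allele frequencies for two reference populations, placed at
# assumed arc coordinates 0.0 and 1.0 (not real data).
FREQ = {0.0: [rng.uniform(0.1, 0.5) for _ in range(N_WINDOWS * SNPS_PER_WINDOW)],
        1.0: [rng.uniform(0.5, 0.9) for _ in range(N_WINDOWS * SNPS_PER_WINDOW)]}

def simulate_haplotype(switch_prob=0.2):
    """Mosaic haplotype with per-window ground truth (stand-in for Wright-Fisher)."""
    coord, snps, labels = rng.choice([0.0, 1.0]), [], []
    for w in range(N_WINDOWS):
        if w > 0 and rng.random() < switch_prob:
            coord = 1.0 - coord  # ancestry switch at a window boundary
        labels.append(coord)
        for s in range(SNPS_PER_WINDOW):
            snps.append(1 if rng.random() < FREQ[coord][w * SNPS_PER_WINDOW + s] else 0)
    return snps, labels

def window(snps, w):
    return snps[w * SNPS_PER_WINDOW:(w + 1) * SNPS_PER_WINDOW]

def layer1_predict(base_models, snps):
    """Layer 1: one boosted model per genomic window."""
    return [predict_gbm(m, window(snps, w)) for w, m in enumerate(base_models)]

def context(preds, w, k=1):
    """Layer-1 predictions for windows w-k..w+k, clamped at the chromosome ends."""
    return [preds[min(max(w + d, 0), N_WINDOWS - 1)] for d in range(-k, k + 1)]

def train(n_base=60, n_smooth=60):
    base_set = [simulate_haplotype() for _ in range(n_base)]
    base_models = [fit_gbm([window(s, w) for s, _ in base_set],
                           [lab[w] for _, lab in base_set]) for w in range(N_WINDOWS)]
    X2, y2 = [], []  # Layer 2 training rows: neighboring layer-1 outputs -> true coordinate
    for snps, labels in (simulate_haplotype() for _ in range(n_smooth)):
        preds = layer1_predict(base_models, snps)
        for w in range(N_WINDOWS):
            X2.append(context(preds, w))
            y2.append(labels[w])
    return base_models, fit_gbm(X2, y2)

def infer(base_models, smoother, snps):
    preds = layer1_predict(base_models, snps)
    return preds, [predict_gbm(smoother, context(preds, w)) for w in range(N_WINDOWS)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

base_models, smoother = train()
test_set = [simulate_haplotype() for _ in range(40)]
raw_err = smooth_err = 0.0
for snps, labels in test_set:
    raw, smooth = infer(base_models, smoother, snps)
    raw_err += mse(raw, labels) / len(test_set)
    smooth_err += mse(smooth, labels) / len(test_set)
print(f"window-level MSE: {raw_err:.3f}  smoothed MSE: {smooth_err:.3f}")
```

Training the smoother on haplotypes that the window-level models did not see is a design choice of this sketch, intended to keep the second layer from learning on overfit first-layer outputs; the text above does not specify how the two layers share training data.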
As these reference African populations lie close to a single arc along the globe, we estimate along this arc, getting geographic assignments for each genomic segment.

[Figure 1, panels (a), (b), (c). Locations shown: Kenya, Uganda, Congo, Cameroon, Nigeria, Sierra Leone, Gambia. Legend: Kenyan, Nigerian (Yoruba), Gambian.]
Figure 1: (a) The inferred coordinates for each genomic segment of an admixed Kenyan-Nigerian individual. The model was trained on all indicated African reference populations. (b-c) The inferred location of each genomic segment of a Kenyan-Nigerian (b) and Kenyan-Gambian (c) individual using the principal coordinate arc of the reference populations’ locations. The bimodal distribution of Kenyan segments (green) may reflect the historical Bantu expansion from Cameroon into Kenya.

REFERENCES

[1] K. Schwarze, J. Buchanan, J. M. Fermont, H. Dreau, M. W. Tilley, J. M. Taylor, P. Antoniou, S. J. L. Knight, C. Camps, M. M. Pentony, E. M. Kvikstad, S. Harris, N. Popitsch, A. T. Pagnamenta, A. Schuh, J. C. Taylor, and S. Wordsworth, “The complete costs of genome sequencing: a microcosting study in cancer and rare diseases from a single center in the United Kingdom,”
Genetics in Medicine, vol. 22, no. 1, pp. 85–94, January 2020.
[2] W. D. Foulkes, I. Thiffault, S. B. Gruber, M. Horwitz, N. Hamel, C. Lee, J. Shia, A. Markowitz, A. Figer, E. Friedman, D. Farber, C. M. T. Greenwood, J. D. Bonner, K. Nafa, T. Walsh, V. Marcus, L. Tomsho, J. Gebert, F. A. Macrae, C. L. Gaff, B. B.-d. Paillerets, P. K. Gregersen, J. N. Weitzel, P. H. Gordon, E. MacNamara, M. C. King, H. Hampel, A. de la Chapelle, J. Boyd, K. Offit, G. Rennert, G. Chong, and N. A. Ellis, “The Founder Mutation MSH2*1906G→C Is an Important Cause of Hereditary Nonpolyposis Colorectal Cancer in the Ashkenazi Jewish Population,” The American Journal of Human Genetics, vol. 71, no. 6, pp. 1395–1412, December 2002.
[3] S. Wang, F. Qian, Y. Zheng, T. Ogundiran, O. Ojengbede, W. Zheng, W. Blot, K. L. Nathanson, A. Hennis, B. Nemesure, S. Ambs, O. I. Olopade, and D. Huo, “Genetic variants demonstrating flip-flop phenomenon and breast cancer risk prediction among women of African ancestry,”
Breast Cancer Research and Treatment, vol. 168, no. 3, pp. 703–712, April 2018.
[4] F. Rajabli, B. E. Feliciano, K. Celis, K. L. Hamilton-Nelson, P. L. Whitehead, L. D. Adams, P. L. Bussies, C. P. Manrique, A. Rodriguez, V. Rodriguez, T. Starks, G. E. Byfield, C. B. S. Lopez, J. L. McCauley, H. Acosta, A. Chinea, B. W. Kunkle, C. Reitz, L. A. Farrer, G. D. Schellenberg, B. N. Vardarajan, J. M. Vance, M. L. Cuccaro, E. R. Martin, J. L. Haines, G. S. Byrd, G. W. Beecham, and M. A. Pericak-Vance, “Ancestral origin of ApoE ε …,” PLoS Genetics, vol. 14, no. 12, pp. e1007791, December 2018.
[5] F. Rajabli et al., “Ancestral origin of ApoE ε …,” PLoS Genetics, vol. 14, no. 12, pp. e1007791, December 2018.
[6] N. Mavaddat, K. Michailidou, J. Dennis, M. Lush, L. Fachal, A. Lee, J. P. Tyrer, T.-H. Chen, Q. Wang, M. K. Bolla, X. Yang, M. A. Adank, T. Ahearn, K. Aittomäki, J. Allen, I. L. Andrulis, H. Anton-Culver, N. N. Antonenkova, V. Arndt, K. J. Aronson, P. L. Auer, P. Auvinen, M. Barrdahl, L. E. Beane Freeman, M. W. Beckmann, S. Behrens, J. Benitez, M. Bermisheva, L. Bernstein, C. Blomqvist, N. V. Bogdanova, S. E. Bojesen, B. Bonanni, A.-L. Børresen-Dale, H. Brauch, M. Bremer, H. Brenner, A. Brentnall, I. W. Brock, A. Brooks-Wilson, S. Y. Brucker, T. Brüning, B. Burwinkel, D. Campa, B. D. Carter, J. E. Castelao, S. J. Chanock, R. Chlebowski, H. Christiansen, C. L. Clarke, J. M. Collée, E. Cordina-Duverger, S. Cornelissen, F. J. Couch, A. Cox, S. S. Cross, K. Czene, M. B. Daly, P. Devilee, T. Dörk, I. dos Santos-Silva, M. Dumont, L. Durcan, M. Dwek, D. M. Eccles, A. B. Ekici, A. H. Eliassen, C. Ellberg, C. Engel, M. Eriksson, D. G. Evans, P. A. Fasching, J. Figueroa, O. Fletcher, H. Flyger, A. Försti, L. Fritschi, M. Gabrielson, M. Gago-Dominguez, S. M. Gapstur, J. A. García-Sáenz, M. M. Gaudet, V. Georgoulias, G. G. Giles, I. R. Gilyazova, G. Glendon, M. S. Goldberg, D. E. Goldgar, A. González-Neira, G. I. Grenaker Alnæs, M. Grip, J. Gronwald, A. Grundy, P. Guénel, L. Haeberle, E. Hahnen, C. A. Haiman, N. Håkansson, U. Hamann, S. E. Hankinson, E. F. Harkness, S. N. Hart, W. He, A. Hein, J. Heyworth, P. Hillemanns, A. Hollestelle, M. J. Hooning, R. N. Hoover, J. L. Hopper, A. Howell, G. Huang, K. Humphreys, D. J. Hunter, M. Jakimovska, A. Jakubowska, W. Janni, E. M. John, N. Johnson, M. E. Jones, A. Jukkola-Vuorinen, A. Jung, R. Kaaks, K. Kaczmarek, V. Kataja, R. Keeman, M. J. Kerin, E. Khusnutdinova, J. I. Kiiski, J. A. Knight, Y.-D. Ko, V.-M. Kosma, S. Koutros, V. N. Kristensen, U. Krüger, T. Kühl, D. Lambrechts, L. Le Marchand, E. Lee, F. Lejbkowicz, J. Lilyquist, A. Lindblom, S. Lindström, J. Lissowska, W.-Y. Lo, S. Loibl, J. Long, J. Lubiński, M. P. Lux, R. J. MacInnis, T. Maishman, E. Makalic, I. Maleva Kostovska, A. Mannermaa, S. Manoukian, S. Margolin, J. W. M. Martens, M. E. Martinez, D. Mavroudis, C. McLean, A. Meindl, U. Menon, P. Middha, N. Miller, F. Moreno, A. M. Mulligan, C. Mulot, V. M. Muñoz-Garzon, S. L. Neuhausen, H. Nevanlinna, P. Neven, W. G. Newman, S. F. Nielsen, B. G. Nordestgaard, A. Norman, K. Offit, J. E. Olson, H. Olsson, N. Orr, V. S. Pankratz, T.-W. Park-Simon, J. I. A. Perez, C. Pérez-Barrios, P. Peterlongo, J. Peto, M. Pinchev, D. Plaseska-Karanfilska, E. C. Polley, R. Prentice, N. Presneau, D. Prokofyeva, K. Purrington, K. Pylkäs, B. Rack, P. Radice, R. Rau-Murthy, G. Rennert, H. S. Rennert, V. Rhenius, M. Robson, A. Romero, K. J. Ruddy, M. Ruebner, E. Saloustros, D. P. Sandler, E. J. Sawyer, D. F. Schmidt, R. K. Schmutzler, A. Schneeweiss, M. J. Schoemaker, F. Schumacher, P. Schürmann, L. Schwentner, C. Scott, R. J. Scott, C. Seynaeve, M. Shah, M. E. Sherman, M. J. Shrubsole, X.-O. Shu, S. Slager, A. Smeets, C. Sohn, P. Soucy, M. C. Southey, J. J. Spinelli, C. Stegmaier, J. Stone, A. J. Swerdlow, R. M. Tamimi, W. J. Tapper, J. A. Taylor, M. B. Terry, K. Thöne, R. A. E. M. Tollenaar, I. Tomlinson, T. Truong, M. Tzardi, H.-U. Ulmer, M. Untch, C. M. Vachon, E. M. van Veen, J. Vijai, C. R. Weinberg, C. Wendt, A. S. Whittemore, H. Wildiers, W. Willett, R. Winqvist, A. Wolk, X. R. Yang, D. Yannoukakos, Y. Zhang, W. Zheng, A. Ziogas, A. M. Dunning, D. J. Thompson, G. Chenevix-Trench, J. Chang-Claude, M. K. Schmidt, P. Hall, R. L. Milne, P. D. P. Pharoah, A. C. Antoniou, N. Chatterjee, P. Kraft, M. García-Closas, J. Simard, and D. F. Easton, “Polygenic Risk Scores for Prediction of Breast Cancer and Breast Cancer Subtypes,”
The American Journal of Human Genetics, vol. 104, no. 1, pp. 21–34, January 2019.
[7] A. R. Martin et al., “Clinical use of current polygenic risk scores may exacerbate health disparities,” Nature Genetics, vol. 51, no. 4, pp. 584–591, April 2019.
[8] H. Tang, M. Coram, P. Wang, X. Zhu, and N. Risch, “Reconstructing genetic ancestry blocks in admixed individuals,” The American Journal of Human Genetics, vol. 79, pp. 1–12, May 2006.
[9] A. Sundquist, E. Fratkin, C. B. Do, and S. Batzoglou, “Effect of genetic divergence in identifying ancestral origin using HAPAA,” Genome Research, vol. 18, pp. 676–682, April 2008.
[10] A. L. Price et al., “Sensitive Detection of Chromosomal Segments of Distinct Ancestry in Admixed Populations,” PLoS Genetics, vol. 5, no. 6, pp. 1–18, June 2009.
[11] S. Sankararaman, S. Sridhar, G. Kimmel, and E. Halperin, “Estimating local ancestry in admixed populations,” The American Journal of Human Genetics, vol. 82, no. 2, pp. 290–303, February 2008.
[12] E. Y. Durand, C. B. Do, J. L. Mountain, and J. M. Macpherson, “Ancestry Composition: A Novel, Efficient Pipeline for Ancestry Deconvolution,” bioRxiv, October 2014.
[13] B. K. Maples, S. Gravel, E. E. Kenny, and C. D. Bustamante, “RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference,” The American Journal of Human Genetics, vol. 93, no. 2, pp. 278–288, August 2013.
[14] D. Mas Montserrat, C. Bustamante, and A. Ioannidis, “Class-Conditional VAE-GAN for Local-Ancestry Simulation,” Machine Learning in Computational Biology, December 2019, Vancouver, Canada.
[15] D. Mas Montserrat, C. Bustamante, and A. Ioannidis, “LAI-Net: Local-Ancestry Inference With Neural Networks,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, May 2020, Barcelona, Spain.
[16] K. Bryc, E. Y. Durand, J. M. Macpherson, D. Reich, and J. L. Mountain, “The Genetic Ancestry of African Americans, Latinos, and European Americans across the United States,” The American Journal of Human Genetics, vol. 96, no. 1, pp. 37–53, January 2015.
[17] G. Sirugo, S. M. Williams, and S. A. Tishkoff, “The Missing Diversity in Human Genetic Studies,” Cell, vol. 177, no. 1, pp. 26–31, March 2019.
[18] A. B. Popejoy and S. M. Fullerton, “Genomics is failing on diversity,” Nature News, vol. 538, no. 7624, pp. 161–164, October 2016.
[19] 1000 Genomes Project Consortium and others, “A global reference for human genetic variation,”