Nucleotide 9-mers Characterize the Type II Diabetic Gut Metagenome
aa r X i v : . [ q - b i o . GN ] M a y Nucleotide 9-mers Characterize the Type II DiabeticGut Metagenome
Bal´azs Szalkai a , Vince Grolmusz a,b, ∗ a PIT Bioinformatics Group, E¨otv¨os University, H-1117 Budapest, Hungary b Uratim Ltd., H-1118 Budapest, Hungary
Abstract
Discoveries of new biomarkers for frequently occurring diseases are of specialimportance in today’s medicine. While fully developed type II diabetes (T2D)can be detected easily, the early identification of high risk individuals is an areaof interest in T2D, too. Metagenomic analysis of the human bacterial flora hasshown subtle changes in diabetic patients, but no specific microbes are knownto cause or promote the disease. Moderate changes were also detected in themicrobial gene composition of the metagenomes of diabetic patients, but again,no specific gene was found that is present in disease-related and missing inhealthy metagenome. However, these fine differences in microbial taxon- andgene composition are difficult to apply as quantitative biomarkers for diagnosingor predicting type II diabetes. In the present work we report some nucleotide9-mers with significantly differing frequencies in diabetic and healthy intestinalflora. To our knowledge, it is the first time such short DNA fragments have beenassociated with T2D. The automated, quantitative analysis of the frequencies ofshort nucleotide sequences seems to be more feasible than accurate phylogeneticand functional analysis, and thus it might be a promising direction of diagnosticresearch.Keywords: Type II diabetes, metagenome, biomarkers, G-C content, oligomers,short nucleotide sequences, tetranucleotides.
1. Introduction
Metagenomics [1] is rapidly gaining importance in clinical research [2, 3, 4, 5, 6,7, 8, 9], environmental studies [10, 11, 12] and biotechnology [13, 14, 15]. Numer- ∗ Corresponding author de novo assembled, and the bac-terial genes were mapped to a metagenomic gene catalog. Genes related tooxidative stress response were found more abundant in the samples originatingfrom diabetic subjects. Additionally, moderate changes in intestinal bacterialcomposition were detected, but no specific microbes were associated with themetagenomes of the type II diabetes (T2D) patients.After a very complex selection and filtering process, genome-specific nucleotidemarkers of length 50 were identified in [33]. The markers were applied forstrain/species identification, and also as markers for microbial species that mightplay a role in T2D and obesity in the data set of [3].Here we describe a very simple and straightforward approach for finding shortnucleotide sequences whose frequencies significantly differ in T2D and healthymetagenomes of the dataset of [3]. We identify several nucleotide 9-mers thatmay serve as quantitative biomarkers of the pre-diabetic state in the future. Toour knowledge, such short sequences have never been found to characterize T2Dor any other disease. 2e need to clarify that we do not state that the identified 9-mers will generallybe applicable as biomarkers for diabetes for all human populations. We be-lieve that “enterotype-specific” [34] quantitative biomarkers could be found foreach enterotypes by exhaustive searches described in the Methods section, andthose enterotype-specific biomarkers could serve as predictors of type 2 diabetesmellitus.
2. Discussion and Results
Our results are summarized on Table 1 and on Figure 1. Table 1 contains 207-, 8- and 9-mers of the highest statistical significance, distinguishing betweenthe diabetic and non-diabetic metagenomes of the study [3].Table 1 was prepared without considering complementarities between the shortnucleotide sequences. Therefore, the complements found with very close fre-quencies and statistical parameters independently verify our results. It is easyto recognize in Table 1 that TGTGGTA and TACCACA are exact comple-ments. The complement of TCCACAT, ATGTGGA, is almost the prefix ofATGTGGTAC. The complement of TGTGGTACT (line 3) is again the exactcomplement of AGTACCACA (line 6), just to mention some of the complemen-tarities in the table.Figure 1 gives the empirical cumulative distribution functions of the frequencyof 9-mer TGTGGTGTA in the diabetic and in the non-diabetic samples. Thedifference between the expected values (means) of the two distribution is obviouson the figure and is quantified statistically in Table 1.We also searched for short nucleotide sequences characterizing lean/obese andmale/female individuals in the dataset of [3]. Only one short sequence passedthe statistical significance bound in the lean/obese search, and none in themale/female search (c.f. Table S1 and S2 and Figure S1 in the Appendix).The source of the bias in short nucleotide sequence frequencies is most proba-bly due to the difference in the gene- and species composition of diabetic andhealthy metagenomes, found in [3, 33]. These frequencies could be measuredand evaluated more easily than the much more involved characteristics found in[3, 33].
3. Materials and Methods frequency of the sequence AAA included not only howmany times the sequence AAA occurs in a specific metagenome, but also howmany times AAG, CAA, ATA, ... occur in that metagenome, except that thenumber of occurrences for these related DNA fragments was divided by two.Let ℓ M denote the length in base-pairs (bp) of a metagenome M . Let d ( s, t )be the number of mismatches between the two sequences of same length, s and t (also called the Hamming distance). Let k M ( s ) denote the number ofexact matches of sequence s in metagenome M . Then f M ( s ) (the frequency ofsequence s with respect to metagenome M ) is defined by the formula f M ( s ) = 1 ℓ M k M ( s ) + 12 X d ( s,t )=1 k M ( t ) . This approach (counting some non-exact matches as well, but with the halfthe weight) yielded statistically better results when compared to the original,stricter counting process, which only allowed exact matches.We developed C++ programs for counting the fragments and analyzing theresults. Several partitions on the set of subjects were analyzed, by dividingthem into two groups by different attributes: diabetic/non-diabetic, lean/obeseand female/male. Our aim was to look for short DNA sequences whose meanfrequency differs for the two groups.To achieve this, first we calculated f M ( s ) for each raw/assembled metagenome M and each short DNA sequence s of length ℓ s ≤
9. Then, for each s wecalculated a p -value using Welch’s t-test, which showed whether the frequency s is the same in the two groups (i.e., p is large) or differs significantly (i.e., p ≤ . x the curves demonstrate the diabetic (solid line) and non-diabetic (dashedline) fraction of metagenomes with TGTGGTGTA frequency of at most x . Forexample, for x = 0 . x , while only 38% of the non-diabetic samples have thatfrequency less than x . Further empirical cumulative distribution functions aregiven in the Appendix.Since this was done for each short DNA fragment, the number of total statisticaltests done for a given division of subjects was equal to the number of possible s DNA sequences of length at most 9. As this is more than 300,000, there wasa high probability that one of the tests would yield a very low p -value but thelarge measured difference of means would be in fact due to mere chance.Therefore we utilized a two-step hypothesis testing procedure. First we com-puted the p -values for Study 1 (with 145 subjects, SRA accession numberSRA045646) only, which now became our training set . Then we sorted the pos-sible s sequence candidates by p -value ascending, and chose those 20 sequenceswhich had the lowest p -value. These were those sequences which showed promisethat their frequency might differ significantly between diabetic/non-diabetic,lean/obese and female/male individuals, depending our current partitioning ofthe subjects. Then we tested these selected sequences (and corresponding sta-tistical hypotheses) on the holdout set , which was the collection of metagenomesfrom Study 2 (SRA accession number SRA050230, 225 subjects). On this setwe performed only those 20 tests which qualified in the first round, which againyielded a second p -value for each of the 20 DNA sequences. Fragment Diabetic Non-diabetic p (training set) p (holdout set) p (corrected)TGTGGTGTA 4.475e-05 4.713e-05 GTGGTA 0.0006214 0.0006428
Table 1: Frequencies of 7-, 8- and 9-mers in diabetic vs. non-diabetic sampleswith the highest significance (training set: Study 1, holdout set: Study 2). Thetraining set p-values are highlighted for the statistically significant multimers( p < .
05 with Holm-Bonferroni correction). It is easy to recognize that TGTG-GTA and TACCACA are exact complements. The complement of TCCACAT,ATGTGGA, is almost the prefix of ATGTGGTAC. 9-mer TGTGGTACT (line3) is the exact complement of AGTACCACA (line 6). One can find furthercomplementarities in the table. These independently found complements withvery close frequencies and p-values strengthen our findings. More tables (forlean-obese and female-male distributions) are given in the Appendix.Then the Holm-Bonferroni correction was used to determine which of the se-quences had a significantly different frequency among the two groups. Thiscorrection algorithm effectively calculates an upper bound for a p -value whichtakes the fact that we performed multiple (i.e., 20) statistical tests into account.Since the frequencies in the second study are independent from those in the firststudy, the first one is indeed a suitable training set for the model, and we cansafely ignore that we performed over 300,000 statistical tests on the first study,since we use only the tests on the holdout set to make predictions.We have applied the raw, unassembled metagenomes from Study 1 and Study2 to look for short marker sequences of diabetes.Unfortunately, there was not enough information available to us to determinewhich subjects of Study 2 are lean/obese or female/male. Thus we had to usethe available assembled metagenomes in Study 1 to look for marker fragmentsfor sex and obesity. We partitioned the assembled metagenomes of the firststudy into two “random” groups: one of the groups consisted of those individualswith an odd subject ID, and the other group contained those with an even ID.One of these was the training set and the other became the holdout set, i.e.6hey took the role of Study 1 and Study 2 for the lean/obese and female/maleclassifications (Tables S1 and S2 in the Appendix).One sequence passed the significance threshold for the lean/obese division, andnone of the short sequences had a significant difference of frequency betweenthe two sexes.
4. ReferencesReferences [1] Committee on Metagenomics: Challenges and Functional Applications, Na-tional Research Council.
The New Science of Metagenomics: Revealing theSecrets of Our Microbial Planet . The National Academies Press, 2007.[2] Xiaokang Wu, Chaofeng Ma, Lei Han, Muhammad Nawaz, Fei Gao, XuyanZhang, Pengbo Yu, Chang’an Zhao, Lianchuan Li, Aiping Zhou, JuanWang, John E Moore, B. Cherie Millar, and Jiru Xu. Molecular char-acterisation of the faecal microbiota in patients with type II diabetes.
CurrMicrobiol , 61(1):69–78, Jul 2010.[3] Junjie Qin, Yingrui Li, Zhiming Cai, Shenghui Li, Jianfeng Zhu, Fan Zhang,Suisha Liang, Wenwei Zhang, Yuanlin Guan, Dongqian Shen, YangqingPeng, Dongya Zhang, Zhuye Jie, Wenxian Wu, Youwen Qin, Wenbin Xue,Junhua Li, Lingchuan Han, Donghui Lu, Peixian Wu, Yali Dai, XiaojuanSun, Zesong Li, Aifa Tang, Shilong Zhong, Xiaoping Li, Weineng Chen,Ran Xu, Mingbang Wang, Qiang Feng, Meihua Gong, Jing Yu, YanyanZhang, Ming Zhang, Torben Hansen, Gaston Sanchez, Jeroen Raes, GwenFalony, Shujiro Okuda, Mathieu Almeida, Emmanuelle LeChatelier, PierreRenault, Nicolas Pons, Jean-Michel Batto, Zhaoxi Zhang, Hua Chen, RuifuYang, Weimou Zheng, Songgang Li, Huanming Yang, Jian Wang, S DuskoEhrlich, Rasmus Nielsen, Oluf Pedersen, Karsten Kristiansen, and JunWang. A metagenome-wide association study of gut microbiota in type2 diabetes.
Nature , 490(7418):55–60, Oct 2012.[4] Josef Neu, Graciela Lorca, Sandra D K Kingma, and Eric W Triplett. Theintestinal microbiome: relationship to type 1 diabetes.
Endocrinol MetabClin North Am , 39(3):563–571, Sep 2010.[5] A. Lyra, S. Lahtinen, K. Tiihonen, and A. C. Ouwehand. Intestinal micro-biota and overweight.
Benef Microbes , 1(4):407–421, Nov 2010.[6] Patrice D Cani and Nathalie M Delzenne. The gut microbiome as thera-peutic target.
Pharmacol Ther , 130(2):202–212, May 2011.[7] William D Bradley, Catherine Zwingelstein, and Cristina M Rondinone.The emerging role of the intestine in metabolic diseases.
Arch PhysiolBiochem , 117(3):165–176, Jul 2011.78] Brian P Boerner and Nora E Sarvetnick. Type 1 diabetes: role of intestinalmicrobiome in humans and mice.
Ann N Y Acad Sci , 1243:103–118, Dec2011.[9] J. Amar, M. Serino, C. Lange, C. Chabo, J. Iacovoni, S. Mondot, P. Lepage,C. Klopp, J. Mariette, O. Bouchez, L. Perez, M. Courtney, M. Marre,P. Klopp, O. Lantieri, J. Dore, M. Charles, B. Balkau, R. Burcelin, andD.E.S.I.R. Study Group. Involvement of tissue bacteria in the onset ofdiabetes in humans: evidence for a concept.
Diabetologia , 54(12):3055–3061, Dec 2011.[10] Noah Fierer, Jonathan W. Leff, Byron J. Adams, Uffe N. Nielsen,Scott Thomas Bates, Christian L. Lauber, Sarah Owens, Jack A. Gilbert,Diana H. Wall, and J Gregory Caporaso. Cross-biome metagenomic analy-ses of soil microbial communities and their functional attributes.
Proc NatlAcad Sci U S A , 109(52):21390–21395, Dec 2012.[11] A. S. Pandit, M. N. Joshi, P. Bhargava, G. N. Ayachit, I. M. Shaikh, Z. M.Saiyed, A. K. Saxena, and S. B. Bagatharia. Metagenomes from the salinedesert of Kutch.
Genome Announc , 2(3), 2014.[12] Wei Xie, Fengping Wang, Lei Guo, Zeling Chen, Stefan M. Sievert, JunMeng, Guangrui Huang, Yuxin Li, Qingyu Yan, Shan Wu, Xin Wang,Shangwu Chen, Guangyuan He, Xiang Xiao, and Anlong Xu. Comparativemetagenomics of microbial communities inhabiting deep-sea hydrothermalvent chimneys with contrasting chemistries.
ISME J , 5(3):414–426, Mar2011.[13] Christel Schmeisser, Helen Steele, and Wolfgang R. Streit. Metagenomics,biotechnology with non-culturable microbes.
Appl Microbiol Biotechnol ,75(5):955–962, Jul 2007.[14] Manuel Ferrer, Olga Golyshina, Ana Beloqui, and Peter N. Golyshin. Min-ing enzymes from extreme environments.
Curr Opin Microbiol , 10(3):207–214, Jun 2007.[15] Helen L. Steele and Wolfgang R. Streit. Metagenomics: advances in ecologyand biotechnology.
FEMS Microbiol Lett , 247(2):105–111, Jun 2005.[16] Daniel H. Huson, Alexander F. Auch, Ji Qi, and Stephan C. Schuster.MEGAN analysis of metagenomic data.
Genome Res , 17(3):377–386, Mar2007.[17] Daniel H. Huson, Suparna Mitra, Hans-Joachim Ruscheweyh, Nico Weber,and Stephan C. Schuster. Integrative analysis of environmental sequencesusing MEGAN4.
Genome Res , 21(9):1552–1560, Sep 2011.[18] Daniel H. Huson and Suparna Mitra. Introduction to the analysis of en-vironmental sequences: metagenomics with MEGAN.
Methods Mol Biol ,856:415–429, 2012. 819] Andreas Wilke, Elizabeth M. Glass, Daniela Bartels, Jared Bischof, DanielBraithwaite, Mark D’Souza, Wolfgang Gerlach, Travis Harrison, KevinKeegan, Hunter Matthews, Renzo Kottmann, Tobias Paczian, Wei Tang,William L. Trimble, Pelin Yilmaz, Jared Wilkening, Narayan Desai, andFolker Meyer. A metagenomics portal for a democratized sequencing world.
Methods Enzymol , 531:487–523, 2013.[20] Elizabeth M. Glass, Jared Wilkening, Andreas Wilke, DionysiosAntonopoulos, and Folker Meyer. Using the metagenomics rast server (MG-RAST) for analyzing shotgun metagenomes.
Cold Spring Harb Protoc ,2010(1):pdb.prot5368, Jan 2010.[21] Martin Wu and Jonathan A Eisen. A simple, fast, and accurate method ofphylogenomic inference.
Genome Biol , 9(10):R151, 2008.[22] Martin Wu and Alexandra J Scott. Phylogenomic analysis of bacterial andarchaeal sequences with AMPHORA2.
Bioinformatics , 28(7):1033–1034,Apr 2012.[23] Csaba Kerepesi, Daniel Banky, and Vince Grolmusz. AmphoraNet: thewebserver implementation of the AMPHORA2 metagenomic workflowsuite.
Gene , 533(2):538–540, Jan 2014.[24] Csaba Kerepesi, Bal´azs Szalkai, and Vince Grolmusz. Visual analysis of thequantitative composition of metagenomic communities: the AmphoraVizuwebserver.
Microb Ecol , Oct 2014.[25] N. Sueoka. On the genetic basis of variation and heterogeneity of dna basecomposition.
Proc Natl Acad Sci U S A , 48:582–592, Apr 1962.[26] G. Bernardi and G. Bernardi. Compositional constraints and genome evo-lution.
J Mol Evol , 24(1-2):1–11, 1986.[27] L. D. Hurst and A. R. Merchant. High guanine-cytosine content is not anadaptation to high temperature: a comparative analysis amongst prokary-otes.
Proc Biol Sci , 268(1466):493–497, Mar 2001.[28] Hao Wu, Zhang Zhang, Songnian Hu, and Jun Yu. On the molecularmechanism of gc content variation among eubacterial genomes.
Biol Direct ,7:2, 2012.[29] S. Karlin, J. Mr´azek, and A. M. Campbell. Compositional biases of bacte-rial genomes and evolutionary implications.
J Bacteriol , 179(12):3899–3913,Jun 1997.[30] Isaam Saeed and Saman K. Halgamuge. The oligonucleotide frequencyderived error gradient and its application to the binning of metagenomefragments.
BMC Genomics , 10 Suppl 3:S10, 2009.931] Lutz Krause, Naryttza N. Diaz, Alexander Goesmann, Scott Kelley,Tim W. Nattkemper, Forest Rohwer, Robert A. Edwards, and Jens Stoye.Phylogenetic classification of short environmental dna fragments.
NucleicAcids Res , 36(7):2230–2239, Apr 2008.[32] E. L. Sonnhammer, S. R. Eddy, and R. Durbin. Pfam: a comprehensivedatabase of protein domain families based on seed alignments.
Proteins ,28(3):405–420, Jul 1997.[33] Qichao Tu, Zhili He, and Jizhong Zhou. Strain/species identification inmetagenomes using genome-specific markers.
Nucleic Acids Res , 42(8):e67,Apr 2014.[34] Manimozhiyan Arumugam, Jeroen Raes, Eric Pelletier, Denis Le Paslier,Takuji Yamada, Daniel R. Mende, Gabriel R. Fernandes, Julien Tap,Thomas Bruls, Jean-Michel Batto, Marcelo Bertalan, Natalia Borruel,Francesc Casellas, Leyden Fernandez, Laurent Gautier, Torben Hansen,Masahira Hattori, Tetsuya Hayashi, Michiel Kleerebezem, Ken Kurokawa,Marion Leclerc, Florence Levenez, Chaysavanh Manichanh, H BjørnNielsen, Trine Nielsen, Nicolas Pons, Julie Poulain, Junjie Qin, ThomasSicheritz-Ponten, Sebastian Tims, David Torrents, Edgardo Ugarte, Er-win G. Zoetendal, Jun Wang, Francisco Guarner, Oluf Pedersen, Willem M.de Vos, Søren Brunak, Joel Dor´e, MetaH. I. T Consortium , Mar´ıa Antol´ın,Fran¸cois Artiguenave, Herv´e M. Blottiere, Mathieu Almeida, Christian Bre-chot, Carlos Cara, Christian Chervaux, Antonella Cultrone, Christine De-lorme, G´erard Denariaz, Rozenn Dervyn, Konrad U. Foerstner, CarstenFriss, Maarten van de Guchte, Eric Guedon, Florence Haimet, WolfgangHuber, Johan van Hylckama-Vlieg, Alexandre Jamet, Catherine Juste,Ghalia Kaci, Jan Knol, Omar Lakhdari, Severine Layec, Karine Le Roux,Emmanuelle Maguin, Alexandre M´erieux, Raquel Melo Minardi, ChristineM’rini, Jean Muller, Raish Oozeer, Julian Parkhill, Pierre Renault, MariaRescigno, Nicolas Sanchez, Shinichi Sunagawa, Antonio Torrejon, KeithTurner, Gaetana Vandemeulebrouck, Encarna Varela, Yohanan Winograd-sky, Georg Zeller, Jean Weissenbach, S Dusko Ehrlich, and Peer Bork. En-terotypes of the human gut microbiome.
Nature , 473(7346):174–180, May2011. 10 . Appendix
Fragment Lean Obese p (training set) p (holdout set) p (corrected)CTCGTGACA 2.002e-05 1.901e-05
Table S1: Frequencies of ninemers of in lean vs. obese samples with the highestsignificance (training and holdout sets: two halves of Study 1). The boldfacenumber in column 4 denotes the significant difference by Holm-Bonferroni cor-rections, shown in the last column.11 ragment Male Female p (training set) p (holdout set) p (corrected)TAGTACTGG 2.748e-05 2.854e-05 0.006019 0.174548 3.490951TTCATAGGG 3.385e-05 3.479e-05 0.0005157 0.305204 5.798868AGTCTCAGG 2.314e-05 2.229e-05 0.007333 0.353644 6.365594GATGTGTCT 3.878e-05 3.841e-05 0.006985 0.452399 7.690776GTCTCACAC 1.635e-05 1.594e-05 0.00236 0.495140 7.922244CTCAGTCT 0.0001047 0.0001014 0.006424 0.512597 7.688961CATGTAACC 2.969e-05 2.932e-05 0.001608 0.515833 7.221663GCTTCAGAC 4.097e-05 3.98e-05 0.006813 0.546829 7.108781CTCTAACAC 2.147e-05 2.098e-05 0.006313 0.578498 6.941978ACAGACTCA 3.893e-05 3.82e-05 0.007392 0.582096 6.403058GGTCAATTC 4.215e-05 4.266e-05 0.006413 0.595760 5.957600TGTGAGTCT 2.247e-05 2.204e-05 0.007573 0.618236 5.564123CAGACTCAT 4.513e-05 4.426e-05 0.007669 0.619291 4.954327GTGTTAGAC 1.626e-05 1.596e-05 0.004958 0.625175 4.376228ACCTCTGTC 4.032e-05 3.957e-05 0.005543 0.729250 4.375499GTCTAACAC 1.634e-05 1.596e-05 0.002582 0.752233 3.761164AGGATGTGT 4.805e-05 4.725e-05 0.001627 0.795980 3.183920TCTCCTCAA 5.775e-05 5.652e-05 0.006681 0.909561 2.728684TCTCAGTCT 3.361e-05 3.263e-05 0.004097 0.945748 1.891496GGTGTGTCT 2.855e-05 2.794e-05 0.005231 0.949554 0.949554ragment Male Female p (training set) p (holdout set) p (corrected)TAGTACTGG 2.748e-05 2.854e-05 0.006019 0.174548 3.490951TTCATAGGG 3.385e-05 3.479e-05 0.0005157 0.305204 5.798868AGTCTCAGG 2.314e-05 2.229e-05 0.007333 0.353644 6.365594GATGTGTCT 3.878e-05 3.841e-05 0.006985 0.452399 7.690776GTCTCACAC 1.635e-05 1.594e-05 0.00236 0.495140 7.922244CTCAGTCT 0.0001047 0.0001014 0.006424 0.512597 7.688961CATGTAACC 2.969e-05 2.932e-05 0.001608 0.515833 7.221663GCTTCAGAC 4.097e-05 3.98e-05 0.006813 0.546829 7.108781CTCTAACAC 2.147e-05 2.098e-05 0.006313 0.578498 6.941978ACAGACTCA 3.893e-05 3.82e-05 0.007392 0.582096 6.403058GGTCAATTC 4.215e-05 4.266e-05 0.006413 0.595760 5.957600TGTGAGTCT 2.247e-05 2.204e-05 0.007573 0.618236 5.564123CAGACTCAT 4.513e-05 4.426e-05 0.007669 0.619291 4.954327GTGTTAGAC 1.626e-05 1.596e-05 0.004958 0.625175 4.376228ACCTCTGTC 4.032e-05 3.957e-05 0.005543 0.729250 4.375499GTCTAACAC 1.634e-05 1.596e-05 0.002582 0.752233 3.761164AGGATGTGT 4.805e-05 4.725e-05 0.001627 0.795980 3.183920TCTCCTCAA 5.775e-05 5.652e-05 0.006681 0.909561 2.728684TCTCAGTCT 3.361e-05 3.263e-05 0.004097 0.945748 1.891496GGTGTGTCT 2.855e-05 2.794e-05 0.005231 0.949554 0.949554