A reproducible effect size is more useful than an irreproducible hypothesis test to analyze high throughput sequencing datasets
Andrew D. Fernandes, Michael T.H.Q. Vu, Lisa-Monique Edward, Jean M. Macklaim, and Gregory B. Gloor*

May 14, 2019

Department of Biochemistry, University of Western Ontario, London, N6A 5C1, Canada, and DNA GenoTek, Ottawa.
*To whom correspondence should be addressed.
Abstract

Motivation:
High throughput sequencing is analyzed using a combination of null hypothesis significance testing and ad-hoc cutoffs. This framework is strongly affected by sample size, and is known to be irreproducible in underpowered studies, yet no suitable non-parametric alternative has been proposed.
Results:
Here we present implementations of non-parametric standardized median effect size estimates, E_d, for high-throughput sequencing datasets. Case studies are shown for transcriptome and amplicon-sequencing datasets. The E_d statistic is shown to be more reproducible and robust than p-values and requires sample sizes as small as 5 to reproducibly identify differentially abundant features.

Availability:
Source code and binaries freely available at: https://bioconductor.org/packages/ALDEx2.html, omicplotR, and https://github.com/ggloor/CoDaSeq.

Contact: [email protected], [email protected]
Supplementary information:
Supplementary data and code will be available when published.

Introduction

High throughput sequencing (HTS) datasets for transcriptomics, metagenomics and 16S rRNA gene sequencing are high dimensional, commonly generated in pilot-scale experiments, and analyzed using a null hypothesis significance testing framework [21]. Much effort has been spent identifying the best approaches and tools to determine what is 'significantly different' between groups [21, 22], but the answer seems to depend on the specific dataset and associated model parameters [14, 23, 25]. As commonly conducted, the investigator determines what is 'significantly different' using a null hypothesis significance approach and then decides what level of difference is 'biologically meaningful' among the 'significantly' different features. Graphically, this approach is represented by the Volcano plot [5], where the magnitude of change (difference) is plotted vs. the p-value. One under-appreciated consequence of pilot-scale research is that features with significant p-values will have dramatically exaggerated apparent effect sizes [13]. This explains in part why so many observations of apparently large effect fail to replicate in larger datasets [15]. In fact, both p-values and absolute differences are poor predictors of which differences would reproduce if the experiment were conducted again [6, 13].

As many have pointed out, p-values are not useful proxies for biological relevance since p-values are designed, colloquially speaking, to estimate the likelihood of no true difference; p-values are not a test that the alternate hypothesis is true. They can only be used to estimate false-discovery rates if the p-value is calculated from a distribution that is appropriate for the experimental data, if reasonable estimates of the statistical power exist, and if a reasonable estimate of the a priori probability that the null hypothesis is false exists [4, 13]. Simply put, p-values can only be used to test if there is no difference between groups, not to measure the magnitude of change between groups [3, 4]. The tension between the information that p-values provide and what the investigator needs is why magnitude-of-change cutoffs [5], or other ad-hoc methods, are used when deciding what is biologically relevant. Null hypothesis significance-based testing methods also have the property that the number of significant features identified is affected by the number of samples being compared. This leads to the concept of statistical power, where the experiment is designed such that statistical power is prioritized over biological significance.

On the other hand, a standardized effect size addresses the issues of interest to the biologist: "what is reproducibly different?" or "would I identify the same true positive features as differential if the experiment were repeated?" [3, 4, 11, 19]. Standardized effect size statistics start from the assumption that there is a difference, but that the difference can be arbitrarily close to zero. Unfortunately, standardized effect size metrics are not routinely used when analyzing HTS datasets, and one potential barrier to their use is that parametric effect size statistics may not be suitable for HTS datasets. Here we introduce a simple non-parametric standardized effect size statistic for distributions, E_d, that is implemented in the ALDEx2, omicplotR and CoDaSeq R packages.
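The exaggeration of apparent effect sizes among 'significant' features noted above can be illustrated with a short simulation. The sketch below, in R, uses arbitrary assumed values (5 samples per group, a true standardized difference of 0.5 for every feature) and is not drawn from any dataset analyzed here.

```r
# Winner's curse: with small n, only features whose apparent effect is
# (by chance) much larger than the true effect reach p < 0.05.
set.seed(42)
n <- 5          # samples per group (pilot scale; assumed)
true.d <- 0.5   # true standardized difference for every feature (assumed)

res <- replicate(10000, {
  a <- rnorm(n, mean = 0)
  b <- rnorm(n, mean = true.d)
  d <- (mean(b) - mean(a)) / sqrt((var(a) + var(b)) / 2)  # apparent Cohen's d
  c(p = t.test(a, b)$p.value, d = d)
})

sig <- res["p", ] < 0.05
mean(res["d", ])      # apparent effect over all features: close to the true 0.5
mean(res["d", sig])   # apparent effect among 'significant' hits: inflated ~3-fold
```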
The E_d statistic has been used in both meta-transcriptome and microbiome studies, for example see [2, 18], and has been shown to give remarkably reproducible results even with extremely small sample sizes [20]. E_d has a near monotonic relationship with p-values (Supplementary Figure 1). However, it is unknown how E_d compares with p-values or other effect size estimates, how many samples are required, and what its sensitivity and specificity characteristics are.

Approach and Methods
High throughput sequencing (HTS) machines output thousands to billions of 'reads': short nucleotide sequences that are derived from a DNA or RNA molecule in the sequencing 'library'. The library is a subset of the nucleic acid molecules that have been collected from an environment and made compatible with a particular HTS platform. The HTS instruments deliver these reads as integer 'counts' per genomic feature (gene, location, etc.). However, the counts are actually a single proxy for the probability of observing the particular read in a sample under a repeated sampling model; this is clear since technical replicates of the same library return different counts. The difference between technical replicates is consistent with multivariate Poisson sampling [7, 10]. The probability estimate is delivered by the instrument as an integer representation of the probability multiplied by the number of reads [7, 10]. Thus, the data returned by HTS are a type of count compositional data, where only the relationships between the features have meaning [1, 8, 12, 16, 17].

The ALDEx2 tool uses a combination of probabilistic modelling and compositional data analysis to determine the features that are different between groups, where that difference is insensitive to random sampling. Technical replicate variance estimation and conversion of the count data to probabilities is accomplished by Monte-Carlo sampling from the Dirichlet distribution [7, 10], which is conveniently also the conjugate prior for the multivariate Poisson process. The differences between features are linearized by applying a log-ratio transformation to the Dirichlet Monte-Carlo realizations and analyzed according to the rules of compositional data analysis [1, 7, 12, 24].
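As a concrete illustration of this conversion, the sketch below draws one Dirichlet Monte-Carlo instance per sample and applies a centred log-ratio (clr) transformation. It is a minimal sketch of the procedure just described, not the ALDEx2 internals; the uniform prior of 0.5 is this sketch's assumption.

```r
# Convert a table of counts (features x samples) into one clr-transformed
# Dirichlet Monte-Carlo instance; repeating this gives the MC realizations.
library(gtools)  # provides rdirichlet()

clr <- function(p) log2(p) - mean(log2(p))  # centred log-ratio, base 2

dirichlet.clr.instance <- function(counts, prior = 0.5) {
  apply(counts, 2, function(s) {
    # posterior probability of observing each feature in this sample
    p <- as.vector(gtools::rdirichlet(1, s + prior))
    clr(p)
  })
}

# mc <- replicate(128, dirichlet.clr.instance(counts))  # 128 instances, the ALDEx2 default
```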
Starting with two vectors $\vec{a}$ and $\vec{b}$ that correspond to the concatenated log-ratio transformed Dirichlet Monte-Carlo realizations of a feature in two groups, we need a method to determine the standardized effect size for the log-ratio transformed posterior probability estimates; that is, the difference between groups relative to an estimate of within-group dispersion. Since these posterior distributions can have heavy tails, be multimodal, and be skewed, any useful statistic should be insensitive to even extreme non-Normality and provide sensible answers even if the posterior distributions are almost Cauchy in one or both groups [7]. Below and in the Supplement we define the properties of the approach used.

Cohen's d is a parametric standardized effect size for the difference between the means of two groups, and a general formulation is given in Equation (1): the difference between the means of the two distributions divided by the pooled standard deviation, denoted $\sigma_{a,b}$. However, this metric depends upon the data being relatively Normal, which cannot be guaranteed for HTS data.

$$d = \frac{\mathrm{mean}(\vec{a}) - \mathrm{mean}(\vec{b})}{\sigma_{a,b}} \qquad (1)$$

We can define a non-parametric difference vector in Equation (2) as the signed difference between the two groups. We can further define a non-parametric dispersion vector as in Equation (3), where the notation $\rho\vec{a}$ indicates one or more random permutations of the vector. Finally, we can define an effect vector as in Equation (4) that is the ratio of these two non-parametric statistics.

$$\vec{\delta} = \vec{a} - \vec{b} \qquad (2)$$

$$\vec{\sigma} = \max\left\{\,\lvert\vec{a} - \rho\vec{a}\rvert,\ \lvert\vec{b} - \rho\vec{b}\rvert\,\right\} \qquad (3)$$

$$\vec{\varepsilon} = \vec{\delta} \,/\, \vec{\sigma} \qquad (4)$$

Taking the median of $\vec{\delta}$, $\vec{\sigma}$ and $\vec{\varepsilon}$ returns a robust estimate of the central tendency of these statistics ($\tilde{D}$, MMAD (median of the maximum absolute deviation), and E_d), and these are the 'diff.btw', 'diff.win' and 'effect' statistics reported by ALDEx2. $\tilde{D}$ is the same as the difference between the means or the difference between medians in a Normal distribution, as shown in Supplementary Figure 2. The MMAD metric is novel, and the Supplement shows it has a Gaussian efficiency of 52%, a breakdown point of 20% (Supplementary Figure 3), and is 1.42 times the size of the standard deviation on a Normal distribution. The E_d statistic is a standardized effect size and is approximately 0.7 of Cohen's d when comparing the difference between two Normal distributions. This is simply a Monte-Carlo estimate of a function of the respective random variables. Below and in Supplementary Figure 4 we show that this metric returns sensible values even with a Cauchy distribution.
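These statistics can be written directly in R. The function below is a sketch of Equations (2)-(4) under the assumption that the two realization vectors are subsampled to a common length; it is not the ALDEx2 source code.

```r
# delta, sigma and epsilon (Equations 2-4) for one feature, given the
# concatenated clr-transformed MC realizations of each group.
np.effect <- function(a, b) {
  n <- min(length(a), length(b))
  a <- sample(a, n); b <- sample(b, n)      # equalize lengths (this sketch's assumption)
  delta <- a - b                            # Eq. (2): between-group difference
  sigma <- pmax(abs(a - sample(a)),         # Eq. (3): within-group dispersion;
                abs(b - sample(b)))         #   sample(x) is a random permutation
  c(diff.btw = median(delta),               # D-tilde
    diff.win = median(sigma),               # MMAD
    effect   = median(delta / sigma))       # E_d, the median of Eq. (4)
}
```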
We used simple simulated datasets to determine baseline characteristics in a number of different distributions. We then used the data from a highly replicated RNA-seq experiment [21] and examined 100 random subsets of the data with between 2 and 40 samples in each group. For each random subset we collected the set of features that were called as differentially abundant at thresholds of E_d ≥ 1, or with an expected Benjamini-Hochberg adjusted p-value (q-score) ≤ 0.1.

Results and Conclusions
Measuring differential abundance in high throughput sequencing datasets is difficult for a variety of reasons. First, almost all experiments are underpowered. Second, the true distribution of the data is unknown. Third, when sample sizes are large, almost all features are identified as 'significantly different' by null hypothesis significance testing frameworks.

We began by examining the behaviour of the E_d and Cohen's d statistics in simulated distributions, as shown in Figure 1. For this, we generated 100,000 distributions for sample sizes of between 2 and 40 in each group, and calculated the median and 99th percentile effect size. When there is no difference between groups, as illustrated in the top two panels, all four distributions had a median effect size of 0 at all sample sizes. However, the 99th percentiles were different between the two statistics. We observe that the 99th percentile of the E_d metric increases more rapidly than does Cohen's d at very small sample sizes, and that the 99th percentile of the effect size of the Cauchy distribution is always somewhat less than the other three distributions for the E_d metric. In contrast, for Cohen's d the 99th percentile of the Cauchy distribution is much larger than the other distributions at low sample sizes, and tends to become smaller than the other distributions at large sample sizes. This behaviour is perhaps not surprising since the parametric Cohen's d effect size should be undefined in a Cauchy distribution.

The bottom two panels in Figure 1 show the behaviour when there is a small difference between groups; for reference, Supplementary Figure 5 shows single examples of the distributions used for comparison. Again, we can see that the distributions of the median and 99th percentile effect sizes at low sample sizes are remarkably similar when using the non-parametric E_d effect size metric, but diverge substantially when using the parametric Cohen's d. Thus, we conclude that if the data were drawn from a Normal distribution, Cohen's d would be preferred. However, given that distributions from actual sequencing datasets are unknown, and can be nearly Cauchy, multimodal or skewed [7], the non-parametric E_d effect size statistic gives a more stable and robust estimate of the standardized difference between distributions than does Cohen's d.
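The null simulations behind the top panels of Figure 1 can be reproduced in outline with the sketch below; the distribution parameters are illustrative assumptions, not the exact values used to generate the figure.

```r
# Median and 99th percentile of |E_d| when two groups of size n are drawn
# from the same distribution (the null case in the top panels of Figure 1).
set.seed(1)
ed <- function(a, b) {
  sigma <- pmax(abs(a - sample(a)), abs(b - sample(b)))
  median((a - b) / sigma)
}

rdist <- list(normal  = rnorm,
              uniform = runif,
              cauchy  = rcauchy,
              beta    = function(n) rbeta(n, 0.5, 5))  # heavily skewed (assumed shape)

null.ed <- function(n, dist, trials = 1000) {
  e <- replicate(trials, abs(ed(rdist[[dist]](n), rdist[[dist]](n))))
  quantile(e, c(0.50, 0.99))
}

# null.ed(10, "cauchy")  # e.g. the null E_d quantiles for n = 10 per group
```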
With this null behaviour information, we can examine an example dataset: the highly replicated RNA-seq dataset generated by [21]. In this dataset, the edgeR tool identified over 4600 out of 6349 genes as 'significant' (Benjamini-Hochberg adjusted p-value < 0.05) when all samples were included, using either the glm or exact test modes (Supplementary Table 1). Other widely used tools gave similar results [21]. The null hypothesis testing framework in ALDEx2 also returned at least 4300 genes in the same dataset. Identifying such a large proportion of genes as differentially abundant indicates that statistical significance is not informative for this type of experiment. Schurch et al. (and others) recommend adding a secondary threshold such as a fold-change cutoff to identify genes of interest for follow-up analyses [5, 21]. When sample sizes are sufficiently large, we would expect that the fold-change cutoff itself would be the primary determinant of difference; however, this approach would not include either the biological variance or the uncertainty of measurement in the analysis.

The relationship between sample size and the number of features identified as significantly different using a null hypothesis testing framework in this dataset is shown in the top panels of Figure 2. Here we are testing for the ability to detect all features that would have been observed as differentially expressed in the full dataset when using a random subset of the data. The top two panels were generated using the expected value of the Benjamini-Hochberg corrected Welch's t-test by ALDEx2, but similar plots can be observed using the edgeR tool or others [21]. As expected, we observe that the power of a t-test is strongly affected by sample size when the minimum absolute difference between groups is 0. However, when the minimum feature difference is a 2-fold change (D=2), the number of features identified is reduced approximately 10-fold, and a relatively small number of samples is required for acceptable power. The tradeoff here is that applying both the q-score and difference cutoffs results in an increase in the FDR at small numbers of samples. Note that all tools have difficulty estimating the actual FDR in many datasets [14, 23].

Figure 1: Characteristics of E_d and Cohen's d when detecting differential features in two groups drawn from random Normal, random uniform, random Cauchy and heavily skewed random β distributions, as a function of the per-group sample size. The top panels show the behaviour of the E_d and Cohen's d effect sizes when the two groups are drawn from the same distribution. The simulation included one hundred thousand random distributions for each. The bottom two panels show the behaviour when there is a difference between groups. Single examples of the distributions are shown in the Supplement. The dashed lines show the median effect size; the solid lines show the 99th percentile of the statistics.

Figure 2: Comparing E_d and adjusted p-values when detecting differential features between two groups. The top four panels show the median and 95% confidence interval of the true positive rate (TPR) and false positive rate (FPR) at different cutoff values. The top two panels show the values for the q-value, the Benjamini-Hochberg corrected p-value, as a function of sample size. The middle two panels show the values for E_d. Data were summarized using either a 0 or 2-fold difference cutoff (D). The number of the 6236 non-zero features that are identified as significantly different in the full dataset is shown in the top right corner of each plot. The bottom two panels show the effect size distribution of features identified as false positives by E_d at four different sample sizes. The dashed grey lines show the cutoff for a 10% FDR and an 80% power to detect.
Figure 3: Volcano and Effect plots of features identified by E_d, adjusted p-values and absolute differences when detecting differential features between two groups. These plots compare the features identified by E_d, by q-scores (FDR) and by a 2-fold change threshold in the full dataset. In these plots, all features that have a q-score less than 0.1 also have an effect size greater than 1. Thus, the features in magenta are only identified as significantly different by q-scores, those in orange are significantly different by both their q-score and their effect size, and features in blue are significant by their q-score, their effect size and their absolute difference. The dashed grey lines in the two plots demarcate the 2-fold difference location; note that the difference is on a log2 scale. The Volcano plot is rotated from its usual orientation, and the vertical solid line in the Volcano plot indicates a q-score of 0.1. The diagonal solid lines in the Effect plot indicate the boundary where the difference equals the dispersion; i.e., where the effect size is 1.

The middle two panels of Figure 2 show the behaviour of the E_d statistic in the same random datasets. Note that even when only two samples are used, the E_d statistic identifies over 80% of the features that are identified as different by the same statistic in the full dataset. Thus, the simple metric outlined here can correctly identify the 'true positive' set even when the number of samples is very small.
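A sketch of this subset analysis is shown below: for a random subset of n samples per group, it asks what fraction of the features called at E_d ≥ 1 in the full dataset are recovered, and what fraction of the subset's calls are absent from the full-data set. The helper name and the column indexing are illustrative assumptions; the aldex() call itself is the standard ALDEx2 interface.

```r
# Recovery of the full-data E_d calls in a random subset of n samples per group.
# 'reads': full count table; 'grp1', 'grp2': column indices of the two groups;
# 'full.pos': names of features with |E_d| >= 1 in the full dataset.
library(ALDEx2)

subset.recovery <- function(reads, grp1, grp2, n, full.pos) {
  cols <- c(sample(grp1, n), sample(grp2, n))
  x <- aldex(reads[, cols], rep(c("A", "B"), each = n), test = "t", effect = TRUE)
  called <- rownames(x)[abs(x$effect) >= 1]
  c(TPR = mean(full.pos %in% called),       # fraction of full-data calls recovered
    FDR = mean(!(called %in% full.pos)))    # calls not present in the full-data set
}
```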
The tradeoff when using this statistic is that at very low sample sizes the false discovery rate (FDR) is extreme; in this dataset, with a cutoff of E_d > 1, the FDR is 40% with two samples, but falls to less than 10% only when there are 15 or more samples. Interestingly, applying a fold-change cutoff to the E_d metric reduces the false discovery rate dramatically and also reduces the number of features identified as significantly different.

The bottom two panels of Figure 2 show the effect size in the full dataset of false positive features identified as different in subsets of the dataset. We can see that at a sample size of 2 the false positive features have true effect sizes that are nearly randomly distributed, but that at a sample size of 10 or more, the vast majority of false positive features have true effect sizes that are at least 50% of the desired effect size. When applying the difference cutoff, we observe that the majority of false positive features identified at small sample sizes have an effect size greater than the cutoff when the sample size is 5 or more; this implies that false positive features in this case pass the effect size cutoff but fail on the absolute difference cutoff. Supplementary Figure 6 shows a similar analysis for a 16S rRNA gene sequencing dataset [2] with similar results, even though the abundance/variance relationship between the features is very different than in the transcriptome experiment (Supplementary Figure 7). Supplementary Figure 8 shows that the effect thresholds from the simulated data in Figure 1 are appropriate, and perhaps even conservative, for real HTS data, and provides further evidence that the features identified as different in Figure 2 are likely to be reproducible.

Taking the data from Figures 1 and 2 and the Supplement together, we can provide guidance as to appropriate cutoff values when using the E_d statistic in HTS datasets. First, point estimates of E_d in real datasets and in simulated distributions are highly congruent. Thus, for every 100 features in an HTS dataset, we can use the curves in Figure 1 and Supplementary Figure 8 to estimate the number of false positive features expected if there is truly no difference between groups. The CoDaSeq R package contains a function that can be used to empirically determine thresholds for any desired percentile cutoff. As an example, the plots in Figure 1 suggest, as a rule of thumb, that for the experimentalist to be 99% confident in a true positive, effect sizes should be greater than 2 when the sample size is 4, greater than 1 when the sample size is 10, and greater than 0.5 for sample sizes larger than 40. This rule of thumb holds true regardless of the true underlying distribution and is appropriate whenever point estimates are computed. When computing the expected E_d, as the aldex.effect function does, these thresholds are much lower at small sample sizes, and effect sizes can be about 30% smaller.
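The text above refers to a CoDaSeq function for determining such thresholds empirically. The stand-in below illustrates the idea (simulating the null E_d at a given sample size and taking a percentile), but its name and interface are this sketch's own assumptions, not the CoDaSeq API.

```r
# Empirical null threshold: the |E_d| value that the given percentile of
# features would not exceed if there were truly no difference between groups.
# Hypothetical stand-in, not the CoDaSeq function.
effect.threshold <- function(n, percentile = 0.99, trials = 10000) {
  e <- replicate(trials, {
    a <- rnorm(n); b <- rnorm(n)
    sigma <- pmax(abs(a - sample(a)), abs(b - sample(b)))
    abs(median((a - b) / sigma))
  })
  unname(quantile(e, percentile))
}

# effect.threshold(4); effect.threshold(10)  # compare with the rule of thumb above
```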
Figure 3 shows how the different threshold cutoffs relate to each other when plotted as difference vs. q-score in a Volcano plot [5] and as dispersion vs. difference in an Effect plot [11]. These plots demonstrate the advantage of using a standardized effect metric over a q-score, either with or without an absolute difference cutoff. In the Volcano plot, the q-score-only features, coloured in magenta, are exclusively features that have q-scores near the upper bound of statistical significance. These include features with both large and small absolute differences, and the reason that a feature may have a large difference but a marginally significant q-score is not revealed on the Volcano plot. However, examination of the Effect plot shows that features that have a marginal q-score and a large difference are features with very large dispersion; that is, the between-group difference, $\tilde{D}$, is much smaller than the within-group dispersion, MMAD, as in Equation (4). Such features would not be expected to reproduce well in a new dataset because of their intrinsically high variance. In fact, these features are exactly those that are excluded by the E_d statistic. Furthermore, adding the constraint that features have at least a 2-fold difference does not exclude these high-dispersion features from consideration.

Examining the orange features in the Volcano and Effect plots, we can see that both q-scores and E_d can admit arbitrarily small absolute differences if the dispersion is very small. In the case of the E_d statistic, adding in the requirement for at least a 2-fold change reduces the number of features to only those that are both different and reproducible.

By default, we want to know both 'what is significant' and 'what is different' [5]. Both of these questions can be addressed with a standardized effect size statistic that scales the difference between features by their dispersion. We have found plots of difference and dispersion to be an exceedingly useful tool when examining HTS datasets [11]. Furthermore, datasets analyzed by this approach have proven to be remarkably reproducible, as shown by independent lab validation [18, 20].

The E_d statistic outlined here is a relatively robust statistic with the attractive property that it consistently identifies almost all of the same set of true features regardless of the underlying distribution, as shown in Figure 1, and of the number of samples, as shown in Figure 2. In marked contrast, even the best p-value based approaches can identify only a small proportion of the features at small sample sizes that would have been found in the full dataset [21]. Thus, the simple metric outlined here can correctly identify the 'true positive' set even when the number of samples is very small. Note that a fold-change threshold, as is commonly used, is not the same as a standardized effect statistic, and applying the threshold values of [21], while reducing the number of features that are found, does not necessarily enhance reproducibility (Figures 2 and 3).
The tradeoff when using the E_d statistic is that at very low sample sizes the false discovery rate can be extreme; in this dataset, with a cutoff of E_d > 1, the FDR is 40% with two samples, but falls to less than 10% only when there are 15 or more samples. Adding an absolute fold-change restriction reduces the FDR substantially. Further tempering this is the observation that the false positive features are frequently close to the cutoff in the full dataset; that is, false positive features typically are true positives at slightly lower effect sizes or absolute fold changes. This is in contrast to the well-known random uniform distribution of false positive p-values. The Supplement shows additional evidence that the E_d statistic is generally useful, having essentially the same characteristics in a 16S rRNA gene sequencing dataset, which has much larger per-feature dispersion.

This work describes the E_d statistic for examining high throughput sequencing datasets. The statistic is relatively robust and efficient, and answers the question most desired by the biologist, namely 'what is reproducibly different?'. E_d is computed in the ALDEx2 R package as the 'effect' output, where it is the median of the inferred technical and biological data, and in the CoDaSeq R package, where it acts only on the point estimates of the data. Interactive exploration of effect sizes can be done in the omicplotR Bioconductor package [9].
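A minimal end-to-end example with the ALDEx2 package is sketched below, using the selex test dataset that ships with the package; the row subset is only to keep the example fast, and the cutoff values echo the rule of thumb above.

```r
# Compute E_d ('effect'), D-tilde ('diff.btw') and MMAD ('diff.win') with ALDEx2.
library(ALDEx2)
data(selex)                              # test dataset distributed with ALDEx2
selex.sub <- selex[1201:1600, ]          # subset of features, for speed only
conds <- c(rep("NS", 7), rep("S", 7))    # two groups of 7 samples

x <- aldex(selex.sub, conds, mc.samples = 128, test = "t", effect = TRUE)

# features passing both the effect size and 2-fold (log2 difference >= 1) cutoffs
hits <- x[abs(x$effect) >= 1 & abs(x$diff.btw) >= 1, ]

aldex.plot(x, type = "MW", test = "welch")  # Effect plot: dispersion vs difference
```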
Author contributions

AF devised the d_NEF metric. MTHQV and LM did simulation studies to characterize the metric. JM characterized the properties of the metric in real data. GBG characterized the metric, did final simulation studies, and wrote the first draft of the manuscript and supplement. All authors provided input into the final form and substance of the manuscript.

References

[1] Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall.

[2] Bian, G., Gloor, G. B., Gong, A., Jia, C., Zhang, W., Hu, J., Zhang, H., Zhang, Y., Zhou, Z., Zhang, J., Burton, J. P., Reid, G., Xiao, Y., Zeng, Q., Yang, K., and Li, J. (2017). The gut microbiota of healthy aged Chinese is similar to that of the healthy young. mSphere, (5), e00327-17.

[3] Coe, R. (2002). It's the effect size, stupid: What effect size is and why it is important.

[4] Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci, (3), 140216.

[5] Cui, X. and Churchill, G. A. (2003). Statistical tests for differential expression in cDNA microarray experiments. Genome Biol, (4), 210.1-210.10.

[6] Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspect Psychol Sci, (4), 286-300.

[7] Fernandes, A. D., Macklaim, J. M., Linn, T. G., Reid, G., and Gloor, G. B. (2013). ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-seq. PLoS One, (7), e67019.

[8] Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., and Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 15.1-15.13.

[9] Giguere, D., Macklaim, J., and Gloor, G. (2019). omicplotR: Visual exploration of omic datasets using a shiny app. Bioconductor v1.4.0.

[10] Gloor, G. B., Macklaim, J. M., Vu, M., and Fernandes, A. D. (2016a). Compositional uncertainty should not be ignored in high-throughput sequencing data analysis. Austrian Journal of Statistics, 73-87.
[11] Gloor, G. B., Macklaim, J. M., and Fernandes, A. D. (2016b). Displaying variation in large datasets: Plotting a visual summary of effect sizes. Journal of Computational and Graphical Statistics, (3), 971-979.

[12] Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., and Egozcue, J. J. (2017). Microbiome datasets are compositional: And this is not optional. Front Microbiol, 2224.

[13] Halsey, L. G., Curran-Everett, D., Vowler, S. L., and Drummond, G. B. (2015). The fickle p value generates irreproducible results. Nat Methods, (3), 179-85.

[14] Hawinkel, S., Mattiello, F., Bijnens, L., and Thas, O. (2018). A broken promise: microbiome differential abundance methods do not control the false discovery rate. Briefings in Bioinformatics.

[15] Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med, (8), e124.

[16] Kaul, A., Mandal, S., Davidov, O., and Peddada, S. D. (2017). Analysis of microbiome data in the presence of excess zeros. Front Microbiol, 2114.

[17] Lovell, D., Pawlowsky-Glahn, V., Egozcue, J. J., Marguerat, S., and Bähler, J. (2015). Proportionality: a valid alternative to correlation for relative data. PLoS Comput Biol, (3), e1004075.

[18] Macklaim, J. M., Fernandes, A. D., Di Bella, J. M., Hammond, J.-A., Reid, G., and Gloor, G. B. (2013). Comparative meta-RNA-seq of the vaginal microbiota and differential expression by Lactobacillus iners in health and dysbiosis. Microbiome, (1), 12.

[19] Nakagawa, S. (2004). A farewell to Bonferroni: the problems of low statistical power and publication bias. Behavioral Ecology, (6), 1044-1045.

[20] Nelson, T. M., Borgogna, J.-L. C., Brotman, R. M., Ravel, J., Walk, S. T., and Yeoman, C. J. (2015). Vaginal biogenic amines: biomarkers of bacterial vaginosis or precursors to vaginal dysbiosis? Frontiers in Physiology.

[21] Schurch, N. J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A., Singh, V., Wrobel, N., Gharbi, K., Simpson, G. G., Owen-Hughes, T., Blaxter, M., and Barton, G. J. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA, (6), 839-51.
[22] Soneson, C. and Delorenzi, M. (2013). A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics, 91.

[23] Thorsen, J., Brejnrod, A., Mortensen, M., Rasmussen, M. A., Stokholm, J., Al-Soud, W. A., Sørensen, S., Bisgaard, H., and Waage, J. (2016). Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies. Microbiome, (1), 62.

[24] Tsilimigras, M. C. B. and Fodor, A. A. (2016). Compositional data analysis of the microbiome: fundamentals, tools, and challenges. Ann Epidemiol, (5), 330-5.

[25] Weiss, S., Xu, Z. Z., Peddada, S., Amir, A., Bittinger, K., Gonzalez, A., Lozupone, C., Zaneveld, J. R., Vázquez-Baeza, Y., Birmingham, A., Hyde, E. R., and Knight, R. (2017). Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome, (1), 27.