A reproducible effect size is more useful than an irreproducible hypothesis test to analyze high throughput sequencing datasets
Andrew D. Fernandes, Michael T.H.Q. Vu, Lisa-Monique Edward, Jean M. Macklaim, and Gregory B. Gloor*

May 14, 2019

Department of Biochemistry, University of Western Ontario, London, N6A 5C1, Canada, and DNA GenoTek, Ottawa.
*To whom correspondence should be addressed.
Abstract

Motivation:
High throughput sequencing is analyzed using a combination of null hypothesis significance testing and ad-hoc cutoffs. This framework is strongly affected by sample size, and is known to be irreproducible in underpowered studies, yet no suitable non-parametric alternative has been proposed.
Results:
Here we present implementations of non-parametric standardized median effect size estimates, E_d, for high-throughput sequencing datasets. Case studies are shown for transcriptome and amplicon-sequencing datasets. The E_d statistic is shown to be more reproducible and robust than p-values and requires sample sizes as small as 5 to reproducibly identify differentially abundant features.

Availability:
Source code and binaries freely available at: https://bioconductor.org/packages/ALDEx2.html, omicplotR, and https://github.com/ggloor/CoDaSeq.

Contact: [email protected], [email protected]
Supplementary information:
Supplementary data and code will be available when published.

Introduction

High throughput sequencing (HTS) datasets for transcriptomics, metagenomics and 16S rRNA gene sequencing are high dimensional, commonly generated in pilot-scale experiments, and analyzed using a null hypothesis significance testing framework [21]. Much effort has been spent identifying the best approaches and tools to determine what is 'significantly different' between groups [21, 22], but the answer seems to depend on the specific dataset and associated model parameters [14, 23, 25]. As commonly conducted, the investigator determines what is 'significantly different' using a null hypothesis significance approach and then decides what level of difference is 'biologically meaningful' among the 'significantly' different features. Graphically, this approach is represented by the Volcano plot [5], where the magnitude of change (difference) is plotted vs. the p-value. One under-appreciated consequence of pilot-scale research is that features with significant p-values will have dramatically exaggerated apparent effect sizes [13]. This explains in part why so many observations of apparently large effect fail to replicate in larger datasets [15]. In fact, both p-values and absolute differences are poor predictors of which differences would reproduce if the experiment were conducted again [6, 13].

As many have pointed out, p-values are not useful proxies for biological relevance since p-values are designed, colloquially speaking, to estimate the likelihood of no true difference; p-values are not a test that the alternate hypothesis is true. They can only be used to estimate false-discovery rates if the p-value is calculated from a distribution that is appropriate for the experimental data, if reasonable estimates of the statistical power exist, and if a reasonable estimate of the a priori probability that the null hypothesis is false exists [4, 13]. Simply put, p-values can only be used to test if there is no difference between groups, not to measure the magnitude of change between groups [3, 4]. The tension between the information that p-values provide and what the investigator needs is why magnitude-of-change cutoffs [5], or other ad-hoc methods, are used when deciding what is biologically relevant. Null hypothesis significance-based testing methods also have the property that the number of significant features identified is affected by the number of samples being compared. This leads to the concept of statistical power, where the experiment is designed such that statistical power is prioritized over biological significance.

On the other hand, a standardized effect size addresses the issues of interest to the biologist: "what is reproducibly different?" or "would I identify the same true positive features as differential if the experiment were repeated?" [3, 4, 11, 19]. Standardized effect size statistics start from the assumption that there is a difference, but that the difference can be arbitrarily close to zero. Unfortunately, standardized effect size metrics are not routinely used when analyzing HTS datasets, and one potential barrier to their use is that parametric effect size statistics may not be suitable for HTS datasets. Here we introduce a simple non-parametric standardized effect size statistic for distributions, E_d, that is implemented in the ALDEx2, omicplotR and CoDaSeq R packages.
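The exaggeration of apparent effect sizes among 'significant' features noted above can be illustrated with a short simulation. The sketch below, in R, uses arbitrary assumed values (5 samples per group, a true standardized difference of 0.5 for every feature) and is not drawn from any dataset analyzed here.

```r
# Winner's curse: with small n, only features whose apparent effect is
# (by chance) much larger than the true effect reach p < 0.05.
set.seed(42)
n <- 5          # samples per group (pilot scale; assumed)
true.d <- 0.5   # true standardized difference for every feature (assumed)

res <- replicate(10000, {
  a <- rnorm(n, mean = 0)
  b <- rnorm(n, mean = true.d)
  d <- (mean(b) - mean(a)) / sqrt((var(a) + var(b)) / 2)  # apparent Cohen's d
  c(p = t.test(a, b)$p.value, d = d)
})

sig <- res["p", ] < 0.05
mean(res["d", ])      # apparent effect over all features: close to the true 0.5
mean(res["d", sig])   # apparent effect among 'significant' hits: inflated ~3-fold
```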
The E_d statistic has been used in both meta-transcriptome and microbiome studies, for example see [2, 18], and has been shown to give remarkably reproducible results even with extremely small sample sizes [20]. E_d has a near monotonic relationship with p-values (Supplementary Figure 1). However, it is unknown how E_d compares with p-values or other effect size estimates, how many samples are required, and what its sensitivity and specificity characteristics are.

Approach and Methods
High throughput sequencing (HTS) machines output thousands to billions of 'reads': short nucleotide sequences that are derived from a DNA or RNA molecule in the sequencing 'library'. The library is a subset of the nucleic acid molecules that have been collected from an environment and made compatible with a particular HTS platform. The HTS instruments deliver these reads as integer 'counts' per genomic feature (gene, location, etc.). However, the counts are actually a single proxy for the probability of observing the particular read in a sample under a repeated sampling model; this is clear since technical replicates of the same library return different counts. The difference between technical replicates is consistent with multivariate Poisson sampling [7, 10]. The probability estimate is delivered by the instrument as an integer representation of the probability multiplied by the number of reads [7, 10]. Thus, the data returned by HTS are a type of count compositional data, where only the relationships between the features have meaning [1, 8, 12, 16, 17].

The ALDEx2 tool uses a combination of probabilistic modelling and compositional data analysis to determine the features that are different between groups, where that difference is insensitive to random sampling. Technical replicate variance estimation and conversion of the count data to probabilities is accomplished by Monte-Carlo sampling from the Dirichlet distribution [7, 10], which is conveniently also the conjugate prior for the multivariate Poisson process. The differences between features are linearized by applying a log-ratio transformation to the Dirichlet Monte-Carlo realizations and analyzed according to the rules of compositional data analysis [1, 7, 12, 24].
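As a concrete illustration of this conversion, the sketch below draws one Dirichlet Monte-Carlo instance per sample and applies a centred log-ratio (clr) transformation. It is a minimal sketch of the procedure just described, not the ALDEx2 internals; the uniform prior of 0.5 is this sketch's assumption.

```r
# Convert a table of counts (features x samples) into one clr-transformed
# Dirichlet Monte-Carlo instance; repeating this gives the MC realizations.
library(gtools)  # provides rdirichlet()

clr <- function(p) log2(p) - mean(log2(p))  # centred log-ratio, base 2

dirichlet.clr.instance <- function(counts, prior = 0.5) {
  apply(counts, 2, function(s) {
    # posterior probability of observing each feature in this sample
    p <- as.vector(gtools::rdirichlet(1, s + prior))
    clr(p)
  })
}

# mc <- replicate(128, dirichlet.clr.instance(counts))  # 128 instances, the ALDEx2 default
```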
Starting with two vectors $\vec{a}$ and $\vec{b}$ that correspond to the concatenated log-ratio transformed Dirichlet Monte-Carlo realizations of a feature in two groups, we need a method to determine the standardized effect size for the log-ratio transformed posterior probability estimates; that is, the difference between groups relative to an estimate of within-group dispersion. Since these posterior distributions can have heavy tails, be multimodal, and be skewed, any useful statistic should be insensitive to even extreme non-Normality and provide sensible answers even if the posterior distributions are almost Cauchy in one or both groups [7]. Below and in the Supplement we define the properties of the approach used.

Cohen's d is a parametric standardized effect size for the difference between the means of two groups, and a general formulation is given in Equation (1): the difference between the means of the two distributions divided by the pooled standard deviation, denoted $\sigma_{a,b}$. However, this metric depends upon the data being relatively Normal, which cannot be guaranteed for HTS data.

$$d = \frac{\mathrm{mean}(\vec{a}) - \mathrm{mean}(\vec{b})}{\sigma_{a,b}} \qquad (1)$$

We can define a non-parametric difference vector in Equation (2) as the signed difference between the two groups. We can further define a non-parametric dispersion vector as in Equation (3), where the notation $\rho\vec{a}$ indicates one or more random permutations of the vector. Finally, we can define an effect vector as in Equation (4) that is the ratio of these two non-parametric statistics.

$$\vec{\delta} = \vec{a} - \vec{b} \qquad (2)$$

$$\vec{\sigma} = \max\left\{\,\lvert\vec{a} - \rho\vec{a}\rvert,\ \lvert\vec{b} - \rho\vec{b}\rvert\,\right\} \qquad (3)$$

$$\vec{\varepsilon} = \vec{\delta} \,/\, \vec{\sigma} \qquad (4)$$

Taking the median of $\vec{\delta}$, $\vec{\sigma}$ and $\vec{\varepsilon}$ returns a robust estimate of the central tendency of these statistics ($\tilde{D}$, MMAD (median of the maximum absolute deviation), and E_d), and these are the 'diff.btw', 'diff.win' and 'effect' statistics reported by ALDEx2. $\tilde{D}$ is the same as the difference between the means or the difference between medians in a Normal distribution, as shown in Supplementary Figure 2. The MMAD metric is novel, and the Supplement shows it has a Gaussian efficiency of 52%, a breakdown point of 20% (Supplementary Figure 3), and is 1.42 times the size of the standard deviation on a Normal distribution. The E_d statistic is a standardized effect size and is approximately 0.7 of Cohen's d when comparing the difference between two Normal distributions. This is simply a Monte-Carlo estimate of a function of the respective random variables. Below and in Supplementary Figure 4 we show that this metric returns sensible values even with a Cauchy distribution.
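These statistics can be written directly in R. The function below is a sketch of Equations (2)-(4) under the assumption that the two realization vectors are subsampled to a common length; it is not the ALDEx2 source code.

```r
# delta, sigma and epsilon (Equations 2-4) for one feature, given the
# concatenated clr-transformed MC realizations of each group.
np.effect <- function(a, b) {
  n <- min(length(a), length(b))
  a <- sample(a, n); b <- sample(b, n)      # equalize lengths (this sketch's assumption)
  delta <- a - b                            # Eq. (2): between-group difference
  sigma <- pmax(abs(a - sample(a)),         # Eq. (3): within-group dispersion;
                abs(b - sample(b)))         #   sample(x) is a random permutation
  c(diff.btw = median(delta),               # D-tilde
    diff.win = median(sigma),               # MMAD
    effect   = median(delta / sigma))       # E_d, the median of Eq. (4)
}
```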
We used simple simulated datasets to determine baseline characteristics in a number of different distributions. We then used the data from a highly replicated RNA-seq experiment [21] and examined 100 random subsets of the data with between 2 and 40 samples in each group. For each random subset we collected the set of features that were called as differentially abundant at thresholds of E_d ≥ 1, or with an expected Benjamini-Hochberg adjusted p-value (q-score) ≤ 0.1.

Results and Conclusions
Measuring differential abundance in high throughput sequencing datasets is difficult for a variety of reasons. First, almost all experiments are underpowered. Second, the true distribution of the data is unknown. Third, when sample sizes are large, almost all features are identified as 'significantly different' by null hypothesis significance testing frameworks.

We began by examining the behaviour of the E_d and Cohen's d statistics in simulated distributions, as shown in Figure 1. For this, we generated 100,000 distributions for sample sizes of between 2 and 40 in each group, and calculated the median and 99th percentile effect size. When there is no difference between groups, as illustrated in the top two panels, all four distributions had a median effect size of 0 at all sample sizes. However, the 99th percentiles were different between the two statistics. We observe that the 99th percentile of the E_d metric increases more rapidly than does Cohen's d at very small sample sizes, and that the 99th percentile of the effect size of the Cauchy distribution is always somewhat less than the other three distributions for the E_d metric. In contrast, for Cohen's d the 99th percentile of the Cauchy distribution is much larger than the other distributions at low sample sizes, and tends to become smaller than the other distributions at large sample sizes. This behaviour is perhaps not surprising since the parametric Cohen's d effect size should be undefined in a Cauchy distribution.

The bottom two panels in Figure 1 show the behaviour when there is a small difference between groups; for reference, Supplementary Figure 5 shows single examples of the distributions used for comparison. Again, we can see that the distributions of the median and 99th percentile effect sizes at low sample sizes are remarkably similar when using the non-parametric E_d effect size metric, but diverge substantially when using the parametric Cohen's d. Thus, we conclude that if the data were drawn from a Normal distribution, Cohen's d would be preferred. However, given that distributions from actual sequencing datasets are unknown, and can be nearly Cauchy, multimodal or skewed [7], the non-parametric E_d effect size statistic gives a more stable and robust estimate of the standardized difference between distributions than does Cohen's d.
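The null simulations behind the top panels of Figure 1 can be reproduced in outline with the sketch below; the distribution parameters are illustrative assumptions, not the exact values used to generate the figure.

```r
# Median and 99th percentile of |E_d| when two groups of size n are drawn
# from the same distribution (the null case in the top panels of Figure 1).
set.seed(1)
ed <- function(a, b) {
  sigma <- pmax(abs(a - sample(a)), abs(b - sample(b)))
  median((a - b) / sigma)
}

rdist <- list(normal  = rnorm,
              uniform = runif,
              cauchy  = rcauchy,
              beta    = function(n) rbeta(n, 0.5, 5))  # heavily skewed (assumed shape)

null.ed <- function(n, dist, trials = 1000) {
  e <- replicate(trials, abs(ed(rdist[[dist]](n), rdist[[dist]](n))))
  quantile(e, c(0.50, 0.99))
}

# null.ed(10, "cauchy")  # e.g. the null E_d quantiles for n = 10 per group
```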
With this null behaviour information, we can examine an example dataset: the highly replicated RNA-seq dataset generated by [21]. In this dataset, the edgeR tool identified over 4600 out of 6349 genes as 'significant' (Benjamini-Hochberg adjusted p-value < 0.05) when all samples were included, using either the glm or exact test modes (Supplementary Table 1). Other widely used tools gave similar results [21]. The null hypothesis testing framework in ALDEx2 also returned at least 4300 genes in the same dataset. Identifying such a large proportion of genes as differentially abundant indicates that statistical significance is not informative for this type of experiment. Schurch et al. (and others) recommend adding a secondary threshold such as a fold-change cutoff to identify genes of interest for follow-up analyses [5, 21]. When sample sizes are sufficiently large, we would expect that the fold-change cutoff itself would be the primary determinant of difference; however, this approach would not include either the biological variance or the uncertainty of measurement in the analysis.

The relationship between sample size and the number of features identified as significantly different using a null hypothesis testing framework in this dataset is shown in the top panels of Figure 2. Here we are testing for the ability to detect all features that would have been observed as differentially expressed in the full dataset when using a random subset of the data. The top two panels were generated using the expected value of the Benjamini-Hochberg corrected Welch's t-test by ALDEx2, but similar plots can be observed using the edgeR tool or others [21]. As expected, we observe that the power of a t-test is strongly affected by sample size when the minimum absolute difference between groups is 0. However, when the minimum feature difference is a 2-fold change (D=2), the number of features identified is reduced approximately 10-fold, and a relatively small number of samples is required for acceptable power. The tradeoff here is that applying both the q-score and difference cutoffs results in an increase in the FDR at small numbers of samples. Note that all tools have difficulty estimating the actual FDR in many datasets [14, 23].

Figure 1: Characteristics of E_d and Cohen's d when detecting differential features in two groups drawn from random Normal, random uniform, random Cauchy and heavily skewed random β distributions, as a function of the per-group sample size. The top panels show the behaviour of the E_d and Cohen's d effect sizes when the two groups are drawn from the same distribution. The simulation included one hundred thousand random distributions for each. The bottom two panels show the behaviour when there is a difference between groups. Single examples of the distributions are shown in the Supplement. The dashed lines show the median effect size; the solid lines show the 99th percentile of the statistics.

Figure 2: Comparing E_d and adjusted p-values when detecting differential features between two groups. The top four panels show the median and 95% confidence interval of the true positive rate (TPR) and false positive rate (FPR) at different cutoff values. The top two panels show the values for the q-value, the Benjamini-Hochberg corrected p-value, as a function of sample size. The middle two panels show the values for E_d. Data were summarized using either a 0 or 2-fold difference cutoff (D). The number of the 6236 non-zero features that are identified as significantly different in the full dataset is shown in the top right corner of each plot. The bottom two panels show the effect size distribution of features identified as false positives by E_d at four different sample sizes. The dashed grey lines show the cutoff for a 10% FDR and an 80% power to detect.
Figure 3: Volcano and Effect plots of features identified by E_d, adjusted p-values and absolute differences when detecting differential features between two groups. These plots compare the features identified by E_d, by q-scores (FDR) and by a 2-fold change threshold in the full dataset. In these plots, all features that have a q-score less than 0.1 also have an effect size greater than 1. Thus, the features in magenta are only identified as significantly different by q-scores, those in orange are significantly different by both their q-score and their effect size, and features in blue are significant by their q-score, their effect size and their absolute difference. The dashed grey lines in the two plots demarcate the 2-fold difference location; note that the difference is on a log2 scale. The Volcano plot is rotated from its usual orientation, and the vertical solid line in the Volcano plot indicates a q-score of 0.1. The diagonal solid lines in the Effect plot indicate the boundary where the difference equals the dispersion; i.e., where the effect size is 1.

The middle two panels of Figure 2 show the behaviour of the E_d statistic in the same random datasets. Note that even when only two samples are used, the E_d statistic identifies over 80% of the features that are identified as different by the same statistic in the full dataset. Thus, the simple metric outlined here can correctly identify the 'true positive' set even when the number of samples is very small.
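A sketch of this subset analysis is shown below: for a random subset of n samples per group, it asks what fraction of the features called at E_d ≥ 1 in the full dataset are recovered, and what fraction of the subset's calls are absent from the full-data set. The helper name and the column indexing are illustrative assumptions; the aldex() call itself is the standard ALDEx2 interface.

```r
# Recovery of the full-data E_d calls in a random subset of n samples per group.
# 'reads': full count table; 'grp1', 'grp2': column indices of the two groups;
# 'full.pos': names of features with |E_d| >= 1 in the full dataset.
library(ALDEx2)

subset.recovery <- function(reads, grp1, grp2, n, full.pos) {
  cols <- c(sample(grp1, n), sample(grp2, n))
  x <- aldex(reads[, cols], rep(c("A", "B"), each = n), test = "t", effect = TRUE)
  called <- rownames(x)[abs(x$effect) >= 1]
  c(TPR = mean(full.pos %in% called),       # fraction of full-data calls recovered
    FDR = mean(!(called %in% full.pos)))    # calls not present in the full-data set
}
```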
The tradeoff when using this statistic is that at very low sample sizes the false discovery rate (FDR) is extreme; in this dataset, with a cutoff of E_d > 1, the FDR is 40% with two samples, but falls to less than 10% only when there are 15 or more samples. Interestingly, applying a fold-change cutoff to the E_d metric reduces the false discovery rate dramatically and also reduces the number of features identified as significantly different.

The bottom two panels of Figure 2 show the effect size in the full dataset of false positive features identified as different in subsets of the dataset. We can see that at a sample size of 2 the false positive features have true effect sizes that are nearly randomly distributed, but that at a sample size of 10 or more, the vast majority of false positive features have true effect sizes that are at least 50% of the desired effect size. When applying the difference cutoff, we observe that the majority of false positive features identified at small sample sizes have an effect size greater than the cutoff when the sample size is 5 or more; this implies that false positive features in this case pass the effect size cutoff but fail on the absolute difference cutoff. Supplementary Figure 6 shows a similar analysis for a 16S rRNA gene sequencing dataset [2] with similar results, even though the abundance/variance relationship between the features is very different than in the transcriptome experiment (Supplementary Figure 7). Supplementary Figure 8 shows that the effect thresholds from the simulated data in Figure 1 are appropriate, and perhaps even conservative, for real HTS data, and provides further evidence that the features identified as different in Figure 2 are likely to be reproducible.

Taking the data from Figures 1 and 2 and the Supplement together, we can provide guidance as to appropriate cutoff values when using the E_d statistic in HTS datasets. First, point estimates of E_d in real datasets and in simulated distributions are highly congruent. Thus, for every 100 features in an HTS dataset, we can use the curves in Figure 1 and Supplementary Figure 8 to estimate the number of false positive features expected if there is truly no difference between groups. The CoDaSeq R package contains a function that can be used to empirically determine thresholds for any desired percentile cutoff. As an example, the plots in Figure 1 suggest, as a rule of thumb, that for the experimentalist to be 99% confident in a true positive, effect sizes should be greater than 2 when the sample size is 4, greater than 1 when the sample size is 10, and greater than 0.5 for sample sizes larger than 40. This rule of thumb holds true regardless of the true underlying distribution and is appropriate whenever point estimates are computed. When computing the expected E_d, as the aldex.effect function does, these thresholds are much lower at small sample sizes, and effect sizes can be about 30% smaller.
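The text above refers to a CoDaSeq function for determining such thresholds empirically. The stand-in below illustrates the idea (simulating the null E_d at a given sample size and taking a percentile), but its name and interface are this sketch's own assumptions, not the CoDaSeq API.

```r
# Empirical null threshold: the |E_d| value that the given percentile of
# features would not exceed if there were truly no difference between groups.
# Hypothetical stand-in, not the CoDaSeq function.
effect.threshold <- function(n, percentile = 0.99, trials = 10000) {
  e <- replicate(trials, {
    a <- rnorm(n); b <- rnorm(n)
    sigma <- pmax(abs(a - sample(a)), abs(b - sample(b)))
    abs(median((a - b) / sigma))
  })
  unname(quantile(e, percentile))
}

# effect.threshold(4); effect.threshold(10)  # compare with the rule of thumb above
```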
Figure 3 shows how the different threshold cutoffs relate to each other when plotted as difference vs. q-score in a Volcano plot [5] and as dispersion vs. difference in an Effect plot [11]. These plots demonstrate the advantage of using a standardized effect metric over a q-score, either with or without an absolute difference cutoff. In the Volcano plot, the q-score-only features, coloured in magenta, are exclusively features that have q-scores near the upper bound of statistical significance. These include features with both large and small absolute differences, and the reason that a feature may have a large difference but a marginally significant q-score is not revealed on the Volcano plot. However, examination of the Effect plot shows that features that have a marginal q-score and a large difference are features with very large dispersion; that is, the between-group difference, $\tilde{D}$, is much smaller than the within-group dispersion, MMAD, as in Equation (4). Such features would not be expected to reproduce well in a new dataset because of their intrinsically high variance. In fact, these features are exactly those that are excluded by the E_d statistic. Furthermore, adding the constraint that features have at least a 2-fold difference does not exclude these high-dispersion features from consideration.

Examining the orange features in the Volcano and Effect plots, we can see that both q-scores and E_d can admit arbitrarily small absolute differences if the dispersion is very small. In the case of the E_d statistic, adding in the requirement for at least a 2-fold change reduces the number of features to only those that are both different and reproducible.

By default, we want to know both 'what is significant' and 'what is different' [5]. Both of these questions can be addressed with a standardized effect size statistic that scales the difference between features by their dispersion. We have found plots of difference and dispersion to be an exceedingly useful tool when examining HTS datasets [11]. Furthermore, datasets analyzed by this approach have proven to be remarkably reproducible, as shown by independent lab validation [18, 20].

The E_d statistic outlined here is a relatively robust statistic with the attractive property that it consistently identifies almost all of the same set of true features regardless of the underlying distribution, as shown in Figure 1, and of the number of samples, as shown in Figure 2. In marked contrast, even the best p-value based approaches can identify only a small proportion of the features at small sample sizes that would have been found in the full dataset [21]. Thus, the simple metric outlined here can correctly identify the 'true positive' set even when the number of samples is very small. Note that a fold-change threshold, as is commonly used, is not the same as a standardized effect statistic, and applying the threshold values of [21], while reducing the number of features that are found, does not necessarily enhance reproducibility (Figures 2 and 3).
The tradeoff when using the E_d statistic is that at very low sample sizes the false discovery rate can be extreme; in this dataset, with a cutoff of E_d > 1, the FDR is 40% with two samples, but falls to less than 10% only when there are 15 or more samples. Adding an absolute fold-change restriction reduces the FDR substantially. Further tempering this is the observation that the false positive features are frequently close to the cutoff in the full dataset; that is, false positive features typically are true positives at slightly lower effect sizes or absolute fold changes. This is in contrast to the well-known random uniform distribution of false positive p-values. The Supplement shows additional evidence that the E_d statistic is generally useful, having essentially the same characteristics in a 16S rRNA gene sequencing dataset, which has much larger per-feature dispersion.

This work describes the E_d statistic for examining high throughput sequencing datasets. The statistic is relatively robust and efficient, and answers the question most desired by the biologist, namely 'what is reproducibly different?'. E_d is computed in the ALDEx2 R package as the 'effect' output, where it is the median of the inferred technical and biological data, and in the CoDaSeq R package, where it acts only on the point estimates of the data. Interactive exploration of effect sizes can be done in the omicplotR Bioconductor package [9].
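A minimal end-to-end example with the ALDEx2 package is sketched below, using the selex test dataset that ships with the package; the row subset is only to keep the example fast, and the cutoff values echo the rule of thumb above.

```r
# Compute E_d ('effect'), D-tilde ('diff.btw') and MMAD ('diff.win') with ALDEx2.
library(ALDEx2)
data(selex)                              # test dataset distributed with ALDEx2
selex.sub <- selex[1201:1600, ]          # subset of features, for speed only
conds <- c(rep("NS", 7), rep("S", 7))    # two groups of 7 samples

x <- aldex(selex.sub, conds, mc.samples = 128, test = "t", effect = TRUE)

# features passing both the effect size and 2-fold (log2 difference >= 1) cutoffs
hits <- x[abs(x$effect) >= 1 & abs(x$diff.btw) >= 1, ]

aldex.plot(x, type = "MW", test = "welch")  # Effect plot: dispersion vs difference
```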
Author contributions

AF devised the d_NEF metric. MTHQV and LM did simulation studies to characterize the metric. JM characterized the properties of the metric in real data. GBG characterized the metric, did final simulation studies, and wrote the first draft of the manuscript and supplement. All authors provided input into the final form and substance of the manuscript.

References

[1] Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman & Hall.

[2] Bian, G., Gloor, G. B., Gong, A., Jia, C., Zhang, W., Hu, J., Zhang, H., Zhang, Y., Zhou, Z., Zhang, J., Burton, J. P., Reid, G., Xiao, Y., Zeng, Q., Yang, K., and Li, J. (2017). The gut microbiota of healthy aged Chinese is similar to that of the healthy young. mSphere, (5), e00327-17.

[3] Coe, R. (2002). It's the effect size, stupid: What effect size is and why it is important.

[4] Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. R Soc Open Sci, (3), 140216.

[5] Cui, X. and Churchill, G. A. (2003). Statistical tests for differential expression in cDNA microarray experiments. Genome Biol, (4), 210.1-210.10.

[6] Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspect Psychol Sci, (4), 286-300.

[7] Fernandes, A. D., Macklaim, J. M., Linn, T. G., Reid, G., and Gloor, G. B. (2013). ANOVA-like differential expression (ALDEx) analysis for mixed population RNA-seq. PLoS One, (7), e67019.

[8] Fernandes, A. D., Reid, J. N., Macklaim, J. M., McMurrough, T. A., Edgell, D. R., and Gloor, G. B. (2014). Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 15.1-15.13.

[9] Giguere, D., Macklaim, J., and Gloor, G. (2019). omicplotR: Visual exploration of omic datasets using a shiny app. Bioconductor v1.4.0.

[10] Gloor, G. B., Macklaim, J. M., Vu, M., and Fernandes, A. D. (2016a). Compositional uncertainty should not be ignored in high-throughput sequencing data analysis. Austrian Journal of Statistics, 73-87.
[11] Gloor, G. B., Macklaim, J. M., and Fernandes, A. D. (2016b). Displaying variation in large datasets: Plotting a visual summary of effect sizes. Journal of Computational and Graphical Statistics, (3), 971-979.

[12] Gloor, G. B., Macklaim, J. M., Pawlowsky-Glahn, V., and Egozcue, J. J. (2017). Microbiome datasets are compositional: And this is not optional. Front Microbiol, 2224.

[13] Halsey, L. G., Curran-Everett, D., Vowler, S. L., and Drummond, G. B. (2015). The fickle p value generates irreproducible results. Nat Methods, (3), 179-85.

[14] Hawinkel, S., Mattiello, F., Bijnens, L., and Thas, O. (2018). A broken promise: microbiome differential abundance methods do not control the false discovery rate. Briefings in Bioinformatics.

[15] Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Med, (8), e124.

[16] Kaul, A., Mandal, S., Davidov, O., and Peddada, S. D. (2017). Analysis of microbiome data in the presence of excess zeros. Front Microbiol, 2114.

[17] Lovell, D., Pawlowsky-Glahn, V., Egozcue, J. J., Marguerat, S., and Bähler, J. (2015). Proportionality: a valid alternative to correlation for relative data. PLoS Comput Biol, (3), e1004075.

[18] Macklaim, J. M., Fernandes, A. D., Di Bella, J. M., Hammond, J.-A., Reid, G., and Gloor, G. B. (2013). Comparative meta-RNA-seq of the vaginal microbiota and differential expression by Lactobacillus iners in health and dysbiosis. Microbiome, (1), 12.

[19] Nakagawa, S. (2004). A farewell to Bonferroni: the problems of low statistical power and publication bias. Behavioral Ecology, (6), 1044-1045.

[20] Nelson, T. M., Borgogna, J.-L. C., Brotman, R. M., Ravel, J., Walk, S. T., and Yeoman, C. J. (2015). Vaginal biogenic amines: biomarkers of bacterial vaginosis or precursors to vaginal dysbiosis? Frontiers in Physiology.

[21] Schurch, N. J., Schofield, P., Gierliński, M., Cole, C., Sherstnev, A., Singh, V., Wrobel, N., Gharbi, K., Simpson, G. G., Owen-Hughes, T., Blaxter, M., and Barton, G. J. (2016). How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA, (6), 839-51.
[22] Soneson, C. and Delorenzi, M. (2013). A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics, 91.

[23] Thorsen, J., Brejnrod, A., Mortensen, M., Rasmussen, M. A., Stokholm, J., Al-Soud, W. A., Sørensen, S., Bisgaard, H., and Waage, J. (2016). Large-scale benchmarking reveals false discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome studies. Microbiome, (1), 62.

[24] Tsilimigras, M. C. B. and Fodor, A. A. (2016). Compositional data analysis of the microbiome: fundamentals, tools, and challenges. Ann Epidemiol, (5), 330-5.

[25] Weiss, S., Xu, Z. Z., Peddada, S., Amir, A., Bittinger, K., Gonzalez, A., Lozupone, C., Zaneveld, J. R., Vázquez-Baeza, Y., Birmingham, A., Hyde, E. R., and Knight, R. (2017). Normalization and microbial differential abundance strategies depend upon data characteristics. Microbiome, (1), 27.