[PDF] Sparse and compositionally robust inference of microbial ecological networks

Abstract

16S-ribosomal sequencing and other metagonomic techniques provide snapshots of microbial communities, revealing phylogeny and the abundances of microbial populations across diverse ecosystems. While changes in microbial community structure are demonstrably associated with certain environmental conditions, identification of underlying mechanisms requires new statistical tools, as these datasets present several technical challenges. First, the abundances of microbial operational taxonomic units (OTUs) from 16S datasets are compositional, and thus, microbial abundances are not independent. Secondly, microbial sequencing-based studies typically measure hundreds of OTUs on only tens to hundreds of samples; thus, inference of OTU-OTU interaction networks is severely under-powered, and additional assumptions are required for accurate inference. Here, we present SPIEC-EASI (SParse InversE Covariance Estimation for Ecological Association Inference), a statistical method for the inference of microbial ecological interactions from metagenomic datasets that addresses both of these issues. SPIEC-EASI combines data transformations developed for compositional data analysis with a graphical model inference framework that assumes the underlying ecological interaction network is sparse. To reconstruct the interaction network, SPIEC-EASI relies on algorithms for sparse neighborhood and inverse covariance selection. Because no large-scale microbial ecological networks have been experimentally validated, SPIEC-EASI comprises computational tools to generate realistic OTU count data from a set of diverse underlying network topologies. SPIEC-EASI outperforms state-of-the-art methods in terms of edge recovery and network properties on realistic synthetic data under a variety of scenarios. SPIEC-EASI also reproducibly predicts previously unknown microbial interactions using data from the American Gut project.

Full PDF

SSparse and compositionally robust inference of microbialecological networks

Zachary D. Kurtz ∗ † , Christian L. Mueller ∗ ‡ , Emily R. Miraldi ∗ § , Dan R.Littman ¶ , Martin J. Blaser (cid:107) and Richard A. Bonneau ∀ Departments of Microbiology and Medicine, New York University School of Medicine, New York, NY 10016 Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY 10003 Courant Institute of Mathematical Sciences, New York University, New York, NY 10012 Simons Center for Data Analysis, Simons Foundation, New York, NY 10010

Abstract

16S ribosomal RNA (rRNA) gene and other environmental sequencing techniques provide snapshots ofmicrobial communities, revealing phylogeny and the abundances of microbial populations across diverseecosystems. While changes in microbial community structure are demonstrably associated with certain en-vironmental conditions (from metabolic and immunological health in mammals to ecological stability in soilsand oceans), identiﬁcation of underlying mechanisms requires new statistical tools, as these datasets presentseveral technical challenges. First, the abundances of microbial operational taxonomic units (OTUs) fromamplicon-based datasets are compositional. Counts are normalized to the total number of counts in the sam-ple. Thus, microbial abundances are not independent, and traditional statistical metrics (e.g., correlation)for the detection of OTU-OTU relationships can lead to spurious results. Secondly, microbial sequencing-based studies typically measure hundreds of OTUs on only tens to hundreds of samples; thus, inferenceof OTU-OTU association networks is severely under-powered, and additional information (or assumptions)are required for accurate inference. Here, we present SPIEC-EASI ( SP arse I nvers E C ovariance Estimationfor E cological A ssociation I nference), a statistical method for the inference of microbial ecological networksfrom amplicon sequencing datasets that addresses both of these issues. SPIEC-EASI combines data transfor-mations developed for compositional data analysis with a graphical model inference framework that assumesthe underlying ecological association network is sparse. To reconstruct the network, SPIEC-EASI relies onalgorithms for sparse neighborhood and inverse covariance selection. To provide a synthetic benchmark inthe absence of an experimentally validated gold-standard network, SPIEC-EASI is accompanied by a setof computational tools to generate OTU count data from a set of diverse underlying network topologies.SPIEC-EASI outperforms state-of-the-art methods to recover edges and network properties on syntheticdata under a variety of scenarios. SPIEC-EASI also reproducibly predicts previously unknown microbialassociations using data from the American Gut project. ∗ These authors contributed equally to this work † [email protected] ‡ [email protected] § [email protected] ¶ [email protected] (cid:107)

[email protected] ∀ [email protected] a r X i v : . [ s t a t . A P ] F e b Introduction

Low-cost metagenomic and amplicon-based sequencing promises to make the resolution of complex inter-actions between microbial populations and their surrounding environment a routine component of observa-tional ecology and experimental biology. Indeed, large-scale data collection eﬀorts (such as Earth MicrobiomeProject [1], the Human Microbiome Project [2], and the American Gut Project [3]) bring an ever-increasingnumber of samples from soil, marine and animal-associated microbiota to the public domain. Recent re-search eﬀorts in ecology, statistics, and computational biology have been aimed at reliably inferring novelbiological insights and testable hypotheses from population abundances and phylogenies. Classic objectivesin community ecology include, (i) the accurate estimation of the number of taxa (observed and unobserved)from microbial studies [4] and, related to that, (ii) the estimation of community diversity within and acrossdiﬀerent habitats from the modeled population counts [5]. Moreover, some microbial compositions appearto form distinct clusters, leading to the concept of enterotypes, or ecological steady states in the gut [6], buttheir existence has not been established with certainty [7]. Another aim of recent studies is the elucidationof connections between microbes and environmental or host covariates. Examples include a novel statis-tical regression framework for relating microbiome compositions and covariates in the context of nutrientintake [8], observations that microbiome compositions strongly correlate with disease status in new-onsetCrohn’s disease [9], and the connections between helminth infection and the microbiome diversity [10].One goal of microbiome studies is the accurate inference of microbial ecological interactions from population-level data [11]. ’Interactions’ are inferred by detecting signiﬁcant (typically non-directional) associations be-tween sampled populations, e.g., by measuring frequency of co-occurrence [12, 13]. Microbiota are measuredby proﬁling variable regions of bacterial 16S rRNA gene sequences. These regions are ampliﬁed, sequenced,and the resulting reads are then grouped into common Operational Taxonomic Units (OTUs) and quantiﬁed,with OTU counts serving as a proxy to the underlying microbial populations’ abundances. Knowledge ofinteraction networks (here, a measure of microbial association) provides a foundation to predictively modelthe interplay between environment and microbial populations. A recent example is the successful construc-tion of a dynamic diﬀerential equation model to describe the primary succession of intestinal microbiota inmice [14]. A commonly used tool to infer a network is correlation analysis; that is computing Pearson’scorrelation coeﬃcient among all pairs of OTU samples, and an interaction between microbes is assumedwhen the absolute value of the correlation coeﬃcient is suﬃciently high [15, 16].However, applying traditional correlation analysis to amplicon surveys of microbial population data islikely to yield spurious results [9, 17]. To limit experimental biases due to sampling depth, OTU count datais typically transformed by normalizing each OTU count to the total sum of counts in the sample. Thus,communities of microbial relative abundances, termed compositions, are not independent, and classicalcorrelation analysis may fail [18]. Recent methods such as Sparse Correlations for Compositional data(SparCC) [17] and Compositionally Corrected by REnormalization and PErmutation (CCREPE) [9, 11, 19]are designed to account for these compositional biases and represent the state of the art in the ﬁeld. Yet, itis not clear that correlation is the proper measure of association. For example, correlations can arise betweenOTUs that are indirectly connected in an ecological network (we expand on this point below).Dimensionality poses another challenge to statistical analysis of microbiome studies, as the number ofmeasured OTUs p is on the order of hundreds to thousands whereas the number of samples n generally rangesfrom tens to hundreds. This implies that any meaningful interaction inference scheme must operate in theunderdetermined data regime ( p > n ), which is viable only if additional assumptions about the interactionnetwork can be made. As technological developments lead to greater sequencing depths, new computationalmethods that address the ( p > n ) challenge will become increasingly important.In the present work, we present a novel strategy to infer networks from (potentially high-dimensional)community composition data. We introduce SPIEC-EASI ( SP arse I nvers E C ovariance Estimation for E cological AS sociation I nference, pronounced speakeasy), a new statistical method for the inference ofmicrobial ecological networks and generation of realistic synthetic data. SPIEC-EASI inference comprisestwo steps: First, a transformation from the ﬁeld of compositional data analysis is applied to the OTU data.Second, SPIEC-EASI estimates the interaction graph from the transformed data using one of two methods:(i) neighborhood selection [20, 21] and (ii) sparse inverse covariance selection [22, 23]. Unlike empirical cor-relation or covariance estimation, used in SparCC and CCREPE, our pipeline seeks to infer an underlyinggraphical model using the concept of conditional independence. Figure 1. Conditional independence vs correlation analysis for a toy dataset:

In an ecosystem,the abundance of any OTU is potentially dependent on the abundances of other OTUs in the ecologicalnetwork. Here, we simulate abundances from a network where OTU 3 directly inﬂuences (via some set ofbiological mechanisms) the abundances of OTUs 1, 2 and 4 ( a ). The inference goal here is to recover theunderlying network from the simulated data. b ) Absolute abundances of these four OTUs were drawn froma negative-binomial distribution across 500 samples according to the true network (as described in theMethods section). c ) Computing all pairwise Pearson correlation yields a symmetric matrix showingpatterns of association (positive correlations are green and negative are red). We thresholded entries of thecorrelation matrix to generate relevance networks. d ) A threshold at ρ ≥ | . | (represented by dashed andsolid edges) results in a network in which OTU 3 is connected to all other OTUs with an additionalconnection between OTU 2 and OTU 4. A more stringent threshold at ρ ≥ | . | , results in a sparserrelevance network (notably missing the edge between OTU 3 and OTU 1), and is represented in d by solidedges only. Importantly, no single threshold recovers the true underlying hub topology. e ) The inversesample covariance matrix yields a symmetric matrix where entries are approximately zero if thecorresponding OTU pairs are conditionally independent. The network ( f ) inferred from the non-zeroentries (colored in blue in e ) identiﬁes the correct hub network. Thus, it is possible to choose a thresholdfor the sample inverse covariance that faithfully recovers the true network. Such a threshold is notguaranteed to exist for correlation or covariance (the metric used by SparCC and CCREPE). Intuitively,this is because simultaneous direct connections can induce strong correlations between nodes that do nothave direct relationships (e.g. OTU 2-4). Conversely, weak correlations can arise between directlyconnected nodes (e.g. OTU 1-3). Although correlation is a useful measure of association in many contexts,it is a pairwise metric and therefore limited in a multivariate setting. On the other hand, SPIEC-EASI’sestimate of entries in the inverse covariance matrix depend on the conditional states of all available nodes.This feature helps SPIEC-EASI avoid detection of indirect network interactions.Informally, two nodes (e.g. OTUs) are conditionally independent if, given the state (e.g. abundance) ofall other nodes in the network, neither node provides additional information about the state of the other. Alink between any two nodes in the graphical model implies that the OTU abundances are not conditionallyindependent and that there is a (linear) relationship between them that cannot be better explained by analternate network wiring. In this way, our method avoids detection of correlated but indirectly connectedOTUs, thus ensuring parsimony of the resulting network model (for more detail, see Materials and Methodsand Figure 1). This model is an undirected graph where links between nodes represent signed associationsbetween OTUs. The use of graphical models has gained considerable popularity in network biology [24–26]and, more recently, in structural biology [27], particularly to correct for transitive correlations in proteinstructure prediction [28].To properly benchmark our inference scheme and compare its performance with other state-of-the-artschemes [9,17], SPIEC-EASI is accompanied by a synthetic data generation routine, which generates realisticsynthetic OTU data from networks with diverse topologies. This is signiﬁcant because, to date, (i) noexperimentally veriﬁed set of “gold-standard" microbial interactions exists, (ii) previous synthetic benchmarkdata do not accurately reﬂect the actual properties of microbiome data [17], and (iii) theoretical and empiricalwork from high-dimensional statistics [29–31] suggests that network topology can strongly impact networkrecovery and performance and thus must be considered in the design of synthetic datasets.We show that SPIEC-EASI is a scalable inference engine that (i) yields superior performance withrespect to state-of-the-art methods in terms of interaction recovery and network features in a diverse set ofrealistic synthetic benchmark scenarios, (ii) provides the most stable and reproducible network when appliedto real data, and (iii) reliably estimates an invertible covariance matrix which can be used for additionaldownstream statistical analysis. In agreement with statistical theory [29], inference on the synthetic datasetsdemonstrates that the degree distribution of the underlying network has the largest eﬀect on performance,and this eﬀect is observed across all methods tested. SPIEC-EASI network inference applied to actualdata from the American Gut Project (AGP) shows (i) that clusters of strongly connected components arelikely to contain OTUs with common family membership and (ii) that actual gut microbial networks arelikely composites of archetypical network topologies. In the Materials and Methods section, we presentstatistical and computational aspects of SPIEC-EASI. We then benchmark SPIEC-EASI, comparing it tocurrent inference schemes using synthetic data. We then apply SPIEC-EASI to measurements availablefrom the AGP database. The SPIEC-EASI pipeline is implemented in the R package [SpiecEasi] freelyavailable at https://github.com/zdk123/SpiecEasi . All presented numerical data is available at http://bonneaulab.bio.nyu.edu/ . Materials and Methods

SPIEC-EASI comprises both an inference and a synthetic data generation module. Figure 2 summarizes thekey components of the pipeline. In this section, we introduce all statistical and computational aspects of theinference scheme and then describe our approach for generating realistic synthetic datasets.

Real OTU dataGraphical model generatorSynthetic OTU dataMarginal ﬁtting Data processing and CLR transformationNeighborhood selection Inverse Covariance SelectionStability-based model selection Ecological network and regularized covariance matrix

Synthetic data generation Network inference

SPIEC-EASI pipeline

NORmal To Anything

Figure 2. Workﬂow of the SPIEC-EASI pipeline : The SPIEC-EASI pipeline consists of twoindependent parts for a ) synthetic data generation and b ) network inference. a ) Synthetic data generationrequires an OTU count table and a user-selected network topology. Internally, the parameters of astatistical distribution (the zero-inﬂated Negative binomial model is suggested) are ﬁt to the OTUmarginals of the real data, and are combined with the randomly-generated network in the Normal toAnything (NORTA) approach to generate correlated count data. b ) Network inference proceeds in threestages on synthetic or real OTU count data: First, data is pre-procssed and centered log-ratio (CLR)transformed to ensure compositional robustness. Next, the user selects one of two graphical modelinference procedures: 1) Neighborhood selection (the MB method) or 2) inverse covariance selection (theglasso method). SPIEC-EASI network inference assumes that the underlying network is sparse. We inferthe correct model sparseness by the Stability Approach to Regularization Selection (StARS), whichinvolves random subsampling of the dataset to ﬁnd a network with low variability in the selected set ofedges. SPIEC-EASI outputs include an ecological network (from the non-zero entries of the inversecovariance network) and an invertible covariance matrix. If the network was inferred from synthetic data,it can be compared with the input network to assess inference quality. Data processing and transformation of standard OTU count data

For this discussion, a table of OTU count data, typical output of 16S rRNA gene sequencing data curationpipelines (e.g., mothur [32], QIIME [33]) are given. The OTU data are stored in a matrix W ∈ N n × p where w ( j ) = [ w ( j )1 , w ( j )2 , . . . , w ( j ) p ] denotes the p -dimensional row vector of OTU counts from the j th sample, j = 1 , . . . , n , with total cumulative count m ( j ) = p (cid:80) i =1 w ( j ) i ; N denotes the set of natural numbers { , , , . . . } .As described above, to account for sampling biases, microbiome data is typically transformed by normalizingthe raw count data w ( j ) with respect to the total count m ( j ) of the sample [10]. We thus arrive at vectorsof relative abundances or compositions x ( j ) = [ w ( j )1 m ( j ) , w ( j )2 m ( j ) , . . . , w ( j ) p m ( j ) ] for sample j . Due to this normalizationOTU abundances are no longer independent, and the sample space of this p-part composition x ( j ) is notthe unconstrained Euclidean space but the p -dimensional unit simplex S p . = { x | x i > , (cid:80) pi =1 x i = 1 } . Thus,OTU compositions from n samples are constrained to lie in the unit simplex, X ∈ S n × p . This restrictionof the data to the simplex prohibits the application of standard statistical analysis techniques, such aslinear regression or empirical covariance estimation. Covariance matrices of compositional data exhibit, forinstance, a negative bias due to closure eﬀects.Major advances in the statistical analysis of compositional data were achieved by Aitchison in the 1980’s[18, 34]. Rather than considering compositions in the simplex, Aitchison proposed log-ratios, log[ x i x j ] , as abasis for studying compositional data. The simple equivalence log[ x i x j ] = log[ w i /mw j /m ] = log[ w i w j ] implies thatstatistical inferences drawn from analysis of log-ratios of compositions are equivalent to those that could bedrawn from the log-ratios of the unobserved absolute measurements, also termed the basis .Aitchison also proposed several statistically equivalent log-ratio transformations to remove the unit-sumconstraint of compositional data [18]. Here we apply the centered log-ratio (clr) transform: z . = clr( x ) = [log( x /g ( x )) , ..., log( x p /g ( x )] = [log( w /g ( w )) , ..., log( w p /g ( w ))] (1)where g ( x ) = (cid:20) p (cid:81) i =1 x i (cid:21) /p is the geometric mean of the composition vector. The clr transform is symmetricand isometric with respect to the component parts. The resulting vector z is constrained to a zero sum.The clr transform maps the data from the unit simplex to a p − -dimensional Euclidean space, and thecorresponding population covariance matrix Γ = Cov [clr(X)] ∈ R p × p is also singular [18]. The covariancematrix Γ is related to the population covariance of the log-transformed absolute abundances Ω = Cov [logW] via the relationship [34]:

Γ = GΩG (2)where

G = I p − p J , I p is the p -dimensional identity matrix, and J = [ j , j , . . . , j i , . . . , j p ] , j i = [1 , , . . . , the p -dimensional all-ones vector. For high-dimensional data, p >> , the matrix G is close to the identitymatrix, and thus we can assume that a ﬁnite sample estimator ˆΓ of Γ serves as a good approximation of ˆΩ .This approximation serves as the basis of our network inference scheme. Finally, because real-world OTUdata often contain samples with a zero count for low-abundance OTUs, we add a unit pseudo count to theoriginal count data to avoid numerical problems with the clr transform. Inference of microbial associations from microbial abundance datasets

Our key objective is to learn a network of pairwise taxon-taxon associations (putative interactions) fromclr-transformed microbiome compositions Z ∈ R n × p . We represent the network as an undirected, weightedgraph G = ( V, E ) , where the vertex set V = { v , . . . , v p } represents the p taxa (e.g., OTUs) and the edgeset E ⊂ V × V the possible associations among them. Our formal approach is to learn a probabilisticgraphical model [35] (i) that is consistent with the observed data and (ii) for which the (unknown) graph G encodes the conditional dependence structure between the random variables (in our case, the observedtaxa). Graphical models over undirected graphs (also known as Markov networks or Markov RandomFields) have a straightforward distributional interpretation when the data are drawn from a probabilitydistribution π ( x ) that belongs to an exponential family [36,37]. For example, when the data are drawn froma multivariate normal distribution π ( x ) = N ( x | µ, Σ) with mean µ and covariance Σ , the non-zero elementsof the oﬀ-diagonal entries of the inverse covariance matrix Θ = Σ − , also termed the precision matrix,deﬁnes the adjacency matrix of the graph G and thus describes the factorization of the normal distributioninto conditionally dependent components [35]. Conversely, if and only if an entry in Θ : Θ i,j = 0 , thenthe two variables are conditionally independent, and there is no edge between v i and v j in G . We seek toestimate the inverse covariance matrix from the data, thereby inferring associations based on conditionalindependence. This is fundamentally distinct from SparCC and CCREPE (see Table A.1), which essentiallyestimate pairwise correlations (though other pairwise metrics could be considered for CCREPE). We highlightthis key diﬀerence in Figure 1. For an intuitive introduction to graphical models in the context of biologicalnetworks see Bühlmann et. al , 2014 [38].Inferring the exact underlying graph structure in the presence of a ﬁnite amount of samples is, in general,intractable. However, two types of statistical inference procedures have been useful in high-dimensionalstatistics due to their provable performance guarantees under assumptions about the sample size n , di-mensionality p , underlying graph properties, and the generating distribution [29, 39]. The ﬁrst approach,neighborhood selection [20, 39], aims at reconstructing the graph on a node-by-node basis where, for eachnode, a penalized regression problem is solved. The second approach is the penalized maximum likelihoodmethod [22, 23], where the entire graph is reconstructed by solving a global optimization problem, the so-called covariance selection problem [40]. The key advantages of these approaches are that (i) their underlyinginference procedures can be formulated as convex (and hence tractable) optimization problems, and (ii) theyare applicable even in the underdetermined regime ( p > n ) , provided that certain structural assumptionsabout the underlying graph hold. One assumption is that the true underlying graph is reasonably sparse,e.g., that the number of taxon-taxon associations scales linearly with the number of measured taxa. Graphical model inference.

The SPIEC-EASI pipeline comprises two types of inference schemes, neigh-borhood and covariance selection. The neighborhood selection framework, introduced by M einshausen and B ühlmann [20] and thus often referred as the MB method, tackles graph inference by solving p regularizedlinear regression problems, leading to local conditional independence structure predictions for each node.Let us denote the i th column of the data matrix Z by Z i ∈ R n and the remaining columns by Z ¬ i ∈ R n × p − .For each node v i , we solve the following convex problem: ˆ β i,λ = arg min β ∈ R p − (cid:18) n (cid:107) Z i − Z ¬ i β (cid:107) + λ (cid:107) β (cid:107) (cid:19) , (3)where (cid:107) a (cid:107) = (cid:80) p − i =1 | a i | denotes the L1 norm, and λ ≥ is a scalar tuning parameter. This so-calledLASSO problem [41] aims at balancing the least-square ﬁt and the number of necessary predictors (thenon-zero components β j of β ) by tuning the λ parameter. We deﬁne the local neighborhood of a node v i as N λi = { j ⊂ { , . . . p } \ i : ˆ β i,λj (cid:54) = 0 } . The ﬁnal edge set E of G can be deﬁned via the intersection or theunion operation of the local neighborhoods. An edge e i,j between node v i and v j exists if j ∈ N λi ∩ i ∈ N λj or j ∈ N λi ∪ i ∈ N λj . For edges in the set j ∈ N λi ∩ i ∈ N λj , the edge weights, e i,j and e j,i , are estimated using theaverage of the two corresponding β entries. From a theoretical point of view, both edge selection choices areasymptotically consistent under certain technical assumptions [20]. The choice of the λ parameter controlsthe sparsity of the local neighborhood, which requires tuning [42]. We present our parameter selectionstrategy at the end of this section.The second inference approach, (inverse) covariance selection, relies on the following penalized maximumlikelihood approach. In the standard Gaussian setting, the related convex optimization problem reads: ˆΘ = arg min Θ ∈ PD (cid:16) − log det(Θ) + tr(Θ ˆΣ) + λ (cid:107) Θ (cid:107) (cid:17) , (4)where PD denotes the set of symmetric positive deﬁnite matrices { A : x T Ax > , ∀ x ∈ R p } , ˆΣ the empiricalcovariance estimate, (cid:107) · (cid:107) the element-wise L1 norm, and λ ≥ a scalar tuning parameter. For λ = 0 , theexpression is identical to the maximum likelihood estimate of a normal distribution N ( x | , Σ) . For non-zero λ , the objective function (also referred as the graphical Lasso [22]) encourages sparsity of the underlyingprecision matrix Θ . The non-zero, oﬀ-diagonal entries in Θ deﬁne the adjacency matrix of the interactiongraph G which, similar to MB, depends on the proper choice of the penalty parameter λ . Originally,this estimator was shown to have theoretical guarantees on consistency and recovery only under normalityassumptions [43]. However, recent theoretical [29, 44] work shows that distributional assumptions can beconsiderably relaxed, and the estimator is applicable to a larger class of problems, including inference ondiscrete (count) data. In addition, nonparametric approaches, such as sparse additive models, can be used to“gaussianize" the data prior to network inference [45]. We thus propose the following estimator for inferringmicrobial ecological associations. Given clr-transformed OTU data Z ∈ R n × p , we propose the modiﬁedoptimization problem: ˆΩ − = arg min Ω − ∈ P D (cid:16) − log det(Ω − ) + tr(Ω − ˆΓ) + λ (cid:13)(cid:13) Ω − (cid:13)(cid:13) (cid:17) , (5)where ˆΓ is the empirical covariance estimate of Z , and Ω − is the inverse covariance (or precision matrix) ofthe underlying (unknown) basis. As stated above, ˆΓ will be a good approximation for the basis covariancematrix ˆΩ because p >> . The resulting solution is constrained to the set of PD matrices, ensuring that thepenalized estimator has full rank p . The non-zero oﬀ-diagonal entries of the estimated matrix Ω − deﬁnethe inferred network G , and their values are the signed edge weights of the graph. To reduce the variance ofthe estimate, the covariance matrix ˆΓ can also be replaced by the empirical correlation matrix ˆ R = D ˆΓ D ,where D is a diagonal matrix that contains the inverse of the estimated element-wise standard deviations.The covariance selection approach has two advantages over the neighborhood selection framework. First,we obtain unique weights associated with each edge in the network. No averaging or subsequent edgeselection is necessary. Second, the covariance selection framework provides invertible precision and covariancematrix estimates that can be used in further downstream microbiome analysis tasks, such as regression anddiscriminant analysis [10]. Model selection.

For both neighborhood and covariance selection, the tuning parameter λ ∈ [0 , λ max ] controls the sparsity of the ﬁnal model. Rather than inferring a single graphical model, both methods producea λ -dependent solution path with the complete and the empty graph as extreme networks. A number ofmodel selection criteria, such as Bayesian Information Criteria [46] and resampling schemes [47], have beenused. Here we use a popular model selection scheme known as the St ability A pproach to R egularization S election (StARS) [48]. This method repeatedly takes random subsamples ( in the standard setting) ofthe data and estimates the entire graph solution path based on this subsample. For each subsample, the λ -dependent incidence frequencies of individual edges are retained, and a measure of overall edge stabilityis calculated. StARS selects the λ value at which subsampled non-empty graphs are the least variable(most stable) in terms of edge incidences. For the selected graph, the observed edge frequencies indicate thereproducibility, and likely the predictive power, and are used to rank edges according to conﬁdence. Theoretical and computational aspects.

Learning microbial graphical models with neighborhood orinverse covariance selection schemes has important theoretical and practical advantages over current meth-ods. A wealth of theoretical results are available that characterize conditions for asymptotic and ﬁnite sampleguarantees for the estimated networks [20, 29, 39, 43, 46]. Under certain model assumptions, the number ofsamples n necessary to infer the true topology of the graph in the neighborhood selection framework is knownto scale as n = O ( d log( p )) , where d is the maximum vertex (or node) degree of the underlying graph (i.e.the maximum size of any local neighborhood). Additional assumptions on the sample covariance matricesreduce the scaling to n = O ( d log( p )) [39]. This implies that graph recovery and precision matrix estimationis indeed possible even in the p >> n regime, and that the underlying graph topology strongly impacts edgerecovery. The latter observation means that, even if the number of interactions e is constant, graphs withlarge hub nodes, perhaps representing keystone species in microbial networks, or, more generally, scale-freegraphs with, a few highly connected nodes, will be more diﬃcult to recover than networks with evenly dis-tributed neighborhoods. In addition to these theoretical results, a second advantage is that well-established,eﬃcient, and scalable implementations are available to infer microbial ecological networks from OTU datain practice. Thus, SPIEC-EASI methods will eﬃciently scale as microbiome dataset dimensions grow (e.g.,due to technological advances that increase the number of OTUs detected per sample). The SPIEC-EASIinference engine relies on the R package huge [49], which includes algorithms to solve neighborhood andcovariance selection problems [20, 22], as well as the StARS model selection. Generation of synthetic microbial abundance datasets

Estimating the absolute and comparative performance of network inference schemes from biological dataremains a fundamental challenge in biology. In the context of gene regulatory network inference, recentcommunity-wide eﬀorts, such as the DREAM (Dialogue for Reverse Engineering Assessments and Methods)Challenges ( ), have considerably advanced our understanding aboutfeasibility, accuracy, and applicability of a large number of developed methods. In the DREAM challenges,both real data from “gold standard" regulatory networks (e.g., networks where the true topology is knownfrom independent experimental evidence) and realistic in-silico data (using, e.g., the GeneNetWeaver pipeline[50]) are included. In the context of microbiome data and microbial ecological networks, neither a goldstandard nor a realistic synthetic data generator exist. SPIEC-EASI is accompanied by a set of computationaltools that allow the generation of realistic synthetic OTU data. As outlined in Figure 2, real taxa countdata serve as input to SPIEC-EASI’s synthetic data generation pipeline. The pipeline enables one to: (i) ﬁtthe marginal distributions of the count data to a parametric statistical model and (ii) specify the underlyinggraphical model architecture (e.g., scale-free).

The NorTA approach.

The parametric statistical model and network topology are then combined in the’Normal To Anything’ (NorTA) [51] approach to generate synthetic OTU data that resemble real measure-ments from microbial communities but with user-speciﬁed network topologies. NorTA [51] is an approximate

Figure 3. a ) Bivariate illustration of the NorTA approach . First normal data, incorporating thetarget correlation structure, is generated. Uniform data are then generated for each margin via the normaldensity function. These is then converted to an arbitrary marginal distribution (Poisson and Zero-inﬂatedNegative Binomial shown as examples) via its quantile function. To generate realistic synthetic data,parameters for these margins are ﬁt to real data. b ) Examples of band-like, cluster, and scale-free networktopologiestechnique to generate arbitrary continuous and discrete multivariate distributions, given (1) a target cor-relation structure R with entries ρ i,j and (2) a target univariate marginal distribution U i . To achieve thistask, NorTA relies on normal-copula functions [51–53]. A n × p matrix of data U is sampled from a normaldistribution with zero mean and a p × p correlation matrix R N . For each marginal U i , the Normal cumulativedistribution function (CDF) is transformed to the target distribution via its inverse CDF. For any targetdistribution P with CDF Ξ , we can thus generate multivariate correlated data via U P i = Ξ − (Φ(U N i )) , (6)where U N ∼ N (0 , R N ) and Φ(U) = ˆ U −∞ √ σ e − u du , the CDF of a univariate normal. In Figure 3a, weillustrate this process for bivariate Poisson and negative binomial data ( n = 1000 and correlation ρ ij = 0 . ).In the original NorTA approach, an element-wise monotone transformation c U ( · ) with R N = c U ( R ) is appliedto account for slight diﬀerences in correlation structure between normal and target distribution samples [51].Here we neglect this transformation step because we observe that the log-transformed data from exponentialcount distributions, such as the Poisson and Negative binomial, are already close to R , provided that themean is greater than one, particularly when the counts data are log-transformed (Appendix A.4). In practice,SPIEC-EASI relies on routines from base R and VGAM packages [54, 55] to estimate the uniform quantilesof the normal data and to ﬁt the desired CDF with estimated parameters. Fitting marginal distribution to real OTU data.

Prior to ﬁtting marginal distributions to real data,several commonly used pre-processing steps are applied. For any given OTU abundance table of size n × K ,we ﬁrst select p non-zero columns. To account for experimental diﬀerences in sample sequencing, we thennormalize samples to a median sequencing depth by multiplying all counts by the ratio of minimum desirablesampling depth to the total sum of counts in that sample and rounding to the nearest whole number, whichis preferable to rarefaction [1]. These ﬁltered and sequencing-depth-normalized data serve as the marginalcounts, which are ﬁt to a parametric distribution U i and used as input to the NorTA approach. The concretetarget marginal distribution depends on the actual microbiome dataset. For gut microbiome data (e.g. fromHMP or APG), the zero-inﬂated Negative Binomial (ziNB) distribution is a good choice, as it accounts forboth overdispersion [1,57] and the preponderance of zero-count data points in microbial count datasets. Theﬁtting procedure is done within a maximum likelihood framework. The corresponding optimization problem0is solved with the Quasi-Newton methods with box-constraints, as implemented in the optim function inR [54]. In Appendix A.2.2, we use quantile-quantile plots to compare ziNB to several other candidatedistributions (e.g., lognormal, Poisson, NB) and show that ziNB has superior ﬁt. Generation of network topologies and correlation matrices.

Under normality assumption, the non-zero pattern of the precision matrix corresponds to the adjacency matrix of the underlying undirected graph.We use this property to generate target covariance (correlation) matrices originating from diﬀerent graphtopologies. The pipeline to generate a network structure for simulated data proceeds in three steps: (i)Generate an undirected graph, in the form of an adjacency matrix, with a desired topology and sparsity, (ii)convert the adjacency matrix to a positive-deﬁnite precision matrix by assigning positive and negative edgeweights and appropriate diagonal entries, and (iii) invert Θ and convert the resulting covariance matrix Σ to a correlation matrix ( R = D Σ D , where, D is a diagonal matrix with diagonal entries / √ σ i ).Among many potential graph structures, we focus on three representative network structures: band-like,cluster, and scale-free graphs (see Figure 3b for graphical examples). Maximum network degree stronglyimpacts network recovery, and thus our choice of network topologies spans a range of maximum degrees(band < cluster < scale-free). In addition, cluster and scale-free lend themselves to hypothetical ecologicalscenarios. Cluster graphs may be seen as archetypal models for microbial communities that populate diﬀerentdisjoint niches (clusters) and have only few associations across niches. Scale-free graphs, ubiquitous in manyother facets of network biology (such as gene regulatory, protein-protein and social networks), serve as abaseline model for a microbial community that comprises (1) a few “keystone" species (hub nodes with manypartners) that are essential for coordinating/stabilizing the community and (2) many dependent species thatare sparsely connected to each other. The sparsity of the networks is controlled by the number of edges, e < p ( p − / , in the graph. The topologies are generated according to the following algorithms, startingwith an empty p × p adjacency matrix:1. Band:

A band-type network consists of a chain of nodes that connect only their nearest neighbors.Let e = e used + e available , the number of edges already used and available, respectively. Fill the nextavailable oﬀ-diagonal vector with edges if and only if e available ≥ number of elements in the oﬀ-diagonal.2. Cluster:

A cluster network comprises h independent groups of randomly connected nodes. For given p and e we divide the set of nodes into h components of (approximately) identical size and set thenumber of edges in each component to e comp = e/h . For each component, we generate a random(Erdös-Renyi) graph for which we randomly assign an edge between two nodes in the cluster withprobability p = e comp h ( h − / . Scale-free:

The distribution of degrees, the number of edges per node, in a scale-free network isdescribed by a power law, implying that the central node or nodes (potentially keystone species in anecological network) have proportionally more connections. We use the standard preferential attachmentscheme [58] until p − edges are exhausted.After generation of these standard adjacency matrices, we randomly remove or add edges until the adjacencymatrix has exactly e edges. All schemes generate symmetric adjacency matrices that describes a graph, withentries of 1 if an edge exists and 0 otherwise.From the adjacency matrices, we generate precision matrices by uniformly sampling non-zero entries Θ i,j ∈ [ − Θ max , − Θ min ] ∪ [Θ min , Θ max ] , where Θ min , Θ max > are model parameters and describe the strengthof the conditional dependence among the nodes. To ensure that the precision matrix is positive deﬁnite withtunable condition number κ = cond (Θ) , we scale the diagonal entries Θ i,i by a constant c using binarysearch. The precision matrix Θ is then converted to a correlation matrix R to be used as input to the NorTAapproach. Results

Network inference on synthetic microbiome data

Given that no large-scale experimentally validated microbial ecological network exists, we use SPIEC-EASI’sdata generator capabilities to synthesize data whose OTU count distributions faithfully resemble microbiome1count data. By varying parameters known to inﬂuence network recovery (network topology, associationstrength, sample number) and quantifying performance on resulting networks, we rigorously assess SPIEC-EASI inference relative to state-of-the-art inference methods, SparCC [17] and CCREPE [9], as well asstandard Pearson correlation.

Benchmark setup

We modeled the synthetic datasets on American Gut Project data using SPIEC-EASI’s data generationmodule. The count data, accessed February, 2014 at and available at https://github.com/zdk123/SpiecEasi/tree/master/inst/extdata , come from two sampling rounds and com-prise several thousand OTUs. Round 1 data contains n = 304 , and Round 2 data contains n = 254 samples. As ﬁltering steps, OTUs were removed from the input data if present in fewer than of thesamples, while samples were removed if total sequencing depth fell below the 1st quartile (10,800 sequencereads). Thus, we arrived at a total of p = 205 distinct OTUs. We also generated smaller-dimensional datasets( p = 68 ) with fewer zero counts by requiring that OTUs be present in > of the samples. We used Round1 data and ﬁt the n count histograms to a ziNB distribution (for justiﬁcation of this, see Appendix A.2).The empirical eﬀective number n eﬀ is 13.5 for p = 205 and 7.5 for p = 68 data. The resulting parametrizedmarginal distributions served as input to NorTA.As described above, network topology is expected to inﬂuence network recovery; thus, we consider thethree previously described topologies (band-like, cluster, and scale-free) as representative microbial networks.We hypothesize that any method that successfully infers the sets of associations underlying these archetypalnetworks from synthetic datasets is likely to also perform well in the context of true microbiomes, whoseunderlying network architecture is unknown but expected to be a mixture of these network types. For allnetworks, we ﬁx the total number of edges e to the respective number of OTUs p , and we analyze a medium( p = 68 ) and a high-dimensional scenario ( p = 205 ). Microbial association strength is controlled by the rangeof values in oﬀ-diagonal entries in the precision matrices Θ and the condition number κ = cond (Θ) . We use Θ min = 2 and Θ max = 3 with either condition number κ = 10 or . In this setting, κ controls the spread ofthe absolute correlation values (and thus the strength of indirect associations) present in the synthetic data.The relationship between condition number and distribution of correlation is illustrated in Figure A.5. Foreach network type and size, we generate 20 distinct instances. For each instance, we then use the NorTAapproach to generate a maximum of n = 1360 synthetic microbial count data samples. In Figure A.3, wehighlight the ﬁdelity between the target microbiome dataset and the synthetic datasets, especially in termsof dataset dimensions and OTU count distributions. To assess the eﬀect of sample size on network recovery,we test methods on a range of sample sizes: n = 34 , , , .We compare SPIEC-EASI’s covariance selection method (referred to as S-E(glasso)) and neighborhoodselection method (referred to as S-E(MB)) to SparCC and CCREPE, methods which were also designedto be robust to compositional artifacts. As a baseline reference, we also compared all methods to Pearsoncorrelation, which is neither compositionally robust nor appropriate for estimating correlation in the under-determined regime. Both of these methods, however, infer interactions from correlations and do not considerthe concept of conditional independence. We improved the runtime of the original SparCC implementation(available at https://bitbucket.org/yonatanf/sparcc ) and include the updated code in our SPIEC-EASIpackage. The original SparCC package also includes a small benchmark test case available at https://bitbucket.org/yonatanf/sparcc/src/9a1142c179f7/example ). The recovery performance results for S-E(MB), SparCC, and CCREPE for this test case can be found in Figure A.12. The CCREPE implementationis downloaded from . Recovery of microbial networks

To quantify each method’s ability to recover the true underlying association network, we evaluated per-formance in terms of precision-recall (P-R) curves and area under P-R curves (AUPR). For each method,we ranked edge predictions according to conﬁdence. For SparCC, CCREPE and Pearson correlation, edgepredictions were ranked according to p-value. SPIEC-EASI edge predictions were ranked according to edgestability, inferred by StARS model selection step at the most stable tuning parameter λ StARS . Figure 4summarizes methods’ performance on 960 independent synthetic datasets for a total of 48 conditions (4samples sizes × × × Figure 4. Precision-recall performance on synthetic datasets . a ) Red=S-E(glasso),orange=S-E(MB), purple=SparCC, blue=CCREPE, green=Pearson correlation, black=random. Areaunder precision-recall (AUPR) vs. number of samples n for diﬀerent κ values are depicted. Bars representaverage over 20 synthetic datasets, and error bars represent standard error. Asterisks denote conditionsunder which SPIEC-EASI methods had signiﬁcantly higher AUPR relative to all other control methods(P<0.05 for all one-sided T tests). b ) Representative precision-recall curves for p = 68 , n = 102 , κ = 100 ;solid and dashed lines denote SPIEC-EASI and control methods, respectively.3We observe the following key trends. First, the performance of all methods improves with increasingsample size. Under certain scenarios, even near-perfect recovery (AUPR ≈ ) is possible in the large samplelimit ( n = 1360 ). Second, all methods show a clear dependence on the network topology. Best performanceis achieved for band graphs, followed by cluster and scale-free graphs. These results are consistent withtheoretical results [29], which show that the maximum node degree d reduces the probability of correctlyinferring network edges for ﬁxed sample size (scale-free networks have highest maximum degree, followed bycluster and then band.) Third, for most scenarios, the SPIEC-EASI methods, particularly S-E(MB), performas well or signiﬁcantly better than all control methods. Standard Pearson correlation is outperformed byall methods that take the compositional nature of the data into account. Forth, in the large sample limit( n = 1360 ), S-E(MB) is the only method that recovers a signiﬁcant portion of edges under all tested scenarios(particularly scale-free networks). Additionally, we observe (Figure A.12) that SPIEC-EASI performs wellon synthetic data generated by the SparCC benchmark [17, 59].The present results suggest that complete network recovery is likely an unrealistic goal for microbiomestudies, given that most studies have at most hundreds of samples. In addition, the P-R curves are basedon ranking predicted edges. To generate a ﬁnal network, conﬁdence-based criteria must be applied to selecta ﬁnal set of edges for network inclusion, and, to date, no optimal selection process exists. Nonetheless, ifwe focus on the set of high-conﬁdence interactions (i.e. the top-ranked entries in the edge list), we see thatS-E methods, particularly S-E(MB), can achieve very high precision for all network types (see Figure 4 fora representative P-R curve).Overall, these results suggest that S-E methods outperform current state-of-the-art methods in terms ofnetwork recovery under most tested scenarios, with S-E(MB) showing superior performance over S-E(glasso). Recovery of Global Network Properties

Accurate recovery of global network properties (e.g., degree distribution, number of connected components,shortest path lengths) from taxa abundance data would help deﬁne the underlying topology of the ecologicalnetwork (e.g., cluster versus scale-free) [15, 58]. At this point, little is known about the underlying topologyof microbial ecological networks, but, as elaborated in the Discussion, such information could be incorporatedas a constraint into SPIEC-EASI’s inference methods, thereby further improving prediction. Thus, we testedhow well SPIEC-EASI and other methods recover global network properties from the synthetic datasets andevaluate whether these methods might be able to provide insight into global network architecture, perhapseven in the underdetermined regime, where the prediction of individual edges is less accurate. To controlfor the disparate means by which individual methods ranked edge conﬁdences (e.g., stability for the SPIEC-EASI methods, estimation of p-values for SparCC, Pearson and CCREPE), for each synthetic dataset andmethod, ﬁnal networks were generated by selecting the top 205 predicted edges and comparing to the truesynthetic network topologies for p = 205 OTUs.We ﬁrst consider (node) degree distributions, where node degree is deﬁned as the number of edges eachnode has. In Figure 5, we show the empirical degree distribution and the underlying ground truth for allmethods and networks types, n = 1360 and κ = 100 . Scale-free networks are characterized by exponentialdegree distributions, in which few nodes (e.g., hubs and, in our context, potential keystone taxa) have veryhigh degree (e.g., interact with other taxa), while most nodes/taxa have few interactions. In contrast, nodes incluster networks have relatively even degree, which depends on cluster size. In the ecological context, clusternetworks would be consistent with niche communities that share few interactions with microbiota outside ofone’s niche community; this structure is also reﬂected in degree distributions. Using Kullback-Leibler (KL)divergence to measure the dissimilarity between methods’ predicted degree distributions and the true degreedistribution we see that S-E(MB) outperforms all other methods in recovering degree distributions (Figure5). This performance improvement also holds for smaller samples sizes.Another common topological feature is betweenness centrality, which, similar to degree, betweennesscentrality can be used to gauge the relative importance of a node (e.g., taxon) to the (ecological) network.Betweenness centrality, as the fraction of shortest paths between all other nodes in the network that containthe given node, highlights central nodes. The distribution of nodes’ betweenness centrality provides infor-mation about the network architecture (Figure A.6). Speciﬁcally, scale-free networks are expected to havea few nodes with very high betweenness centrality that connect most other nodes to each other; in scale-free networks, betweenness centrality can approach unity. For cluster networks, the maximum betweenness4 ********* K L D i ve r g e n ce Degree: Band (1360, 100) SE ( g l ass o ) SE ( M B ) S p a r CC CCR

EPE P ea r s on ****** ****** ****** Degree: Cluster (1360, 100) SE ( g l ass o ) SE ( M B ) S p a r CC CCR

EPE P ea r s on **** ** ***** Degree: Scale free (1360, 100) SE ( g l ass o ) SE ( M B ) S p a r CC CCR

EPE P ea r s on B a nd C l u s t e r S ca l e -f r ee Degree Degree Degree Degree Degree

Figure 5. a ) Predicted degree distributions (colored) are overlaid with the true degree distribution(white) for n = 1360 samples, p = 205 OTUs, κ = 100 . Lighter shades correspond to regions of overlapbetween predicted and true distributions. Dissimilarity between the distributions is measured by KLdivergence, D KL . b ) Bars represent the average D KL over three independent sets of synthetic datasets (7datasets per set); error bars represent standard error. Divergences were compared between S-E and controlmethods using one-sided T-tests; ***,**,* correspond to P<0.001, 0.01, and 0.05.5centrality is limited by the total number of independent clusters. In band networks, similar to scale-free, allnodes are connected; however, the degree is ﬁxed and so the betweenness centrality distribution is roughlyuniform from zero to one. For smaller sample sizes n < , no method dominates. However, for thelargest sample size, n = 1360 , S-E(MB) is again signiﬁcantly better than all other methods for ﬁve out ofsix conditions with the exception of scale-free networks κ = 10 , where SparCC recovery is best (Figure A.7).We next consider distributions over graph geodesic distances. The geodesic distance is the length ofthe shortest path between two nodes. Given the existence of highly connected hubs in scale-free networks,geodesic distances for scale-free networks tend to be short, a feature that is described as the "small-world"property. Thus, the geodesic distributions in the scale-free network are a lot smaller than for the band andcluster networks (Figure A.8). In recovery of geodesic distance distributions, S-E(MB) performs equivalentlyor signiﬁcantly better than all other methods for scale-free networks as well as band graphs across allconditions. For cluster networks, the other methods generally outperform SPIEC-EASI methods for smallersample sizes ( n < ). In the large sample limit n = 1360 , S-E(MB) has signiﬁcantly better recovery ofgeodesic distance distributions relative to all control methods, even for cluster graphs (Figure A.9).Finally, we analyzed the number and size of connected components in the inferred graphs. While allsynthetic band and scale-free synthetic networks form a single connected component containing all nodes,cluster networks have a variable number of connected components. In terms of cluster number recovery, allmethods predicted too many connected components. Overall, S-E methods had lower error rates for bandand scale-free networks over all sample sizes. For high sample number ( n = 1360 ), S-E(MB) had signiﬁcantlybetter recover of cluster size across all network types (Figure A.10), with nearly perfect recovery for clustergraphs ( D KL = 0 , Figure A.11). Inference of the American Gut network

Thus far we have used the n = 304 ﬁrst-round AGP samples as a means to construct realistic syntheticmicrobiome data sets with SPIEC-EASI’s data generation module. In this section, we apply SPIEC-EASIinference methods to construct ecological association networks from the AGP data directly. To do this, weﬁrst ﬁlter out rare OTUs by selecting only the top 205 OTUs (to match the dimensionality of the syntheticdata) in the combined AGP dataset (by frequency of presence) before adding a pseudo-count and total-sum normalization. Although there is no independent means to assess the accuracy of these hypotheticalnetworks, we can assess their reproducibility and consistency. For each method, we ﬁrst infer a singlerepresentative network of taxon-taxon interactions from Round 1 AGP abundance data. For SPIEC-EASI,the StARS model selection approach is used to select the ﬁnal model network. For SparCC, we use athreshold ρ t = 0 . to construct a relevance network from the SparCC-inferred correlation matrix; i.e. anedge between nodes v i , v j is present in the SparCC network if | ρ i,j | > ρ t [17]. Similarly, we use a q-valuecut-oﬀ of − to create an interaction network from CCREPE-corrected signiﬁcance scores of Pearson’scorrelation coeﬃcient [9]. For each method, we thus arrive at a reference network that can be consideredthe hypothetical gold standard. We then use the n = 254 Round 2 AGP samples as an independenttest set and learn a new model network from these data alone. We measure consistency between the twonetwork models by computing the Hamming distance between the reference and new network models, i.e.,the diﬀerence between the upper triangular part of the two adjacency matrices. For the present data, theHamming distance can vary between p ( p − / (no edges in common) and a minimum of foridentical networks. Conﬁdence intervals for Hamming distances can be obtained by combining Round 1 and2 samples into a uniﬁed dataset, repeatedly subsampling these data into two disjoint groups of size n and n , and repeating the entire inference procedure.Figure 6a shows network reproducibility for SPIEC-EASI methods, SparCC, and CCREPE. The S-E(MB)has smallest the Hamming distance, followed by S-E(glasso), SparCC, and CCREPE. In S-E(MB), the edgedisagreement is roughly 50 with very small error bars. At the other extreme, CCREPE edge disagreementis 250 edges and highly variable.These numerical experiments clearly demonstrate that SPIEC-EASI networks are more reproduciblethan other current methods. Finally, we use each inference method to construct a candidate American Gutmicrobiome association network from the uniﬁed dataset of size n + n = 558 (Figure 6c). We analyze thediﬀerences between the reconstructed networks by quantifying the number of unique and shared predictededges (Figure 6b). All four methods agree on a core network that consists of 127 edges. These edges are6 Figure 6. a ) Network reproducibility for inference methods (see main text for details). Bars representmean Hamming distance, and errorbars are 95% conﬁdence intervals. b ) Visualization of edge overlapbetween networks inferred with SPIEC-EASI, SparCC, and CCREPE. c ) Network visualizations with OTUnodes colored by Family lineage (or Order, when the Family of the OTU is unknown), edges are colored bysign (positive: green, negative: red), and the node diameter proportional to the geometric mean of thatOTU’s relative abundance.7mostly found within OTUs of the same taxonomic group. This phenomenon, termed assortativity, has alsobeen observed in other microbial network studies [11]. Assortativity is one of the most salient features of theAGP-derived networks, and, for all networks, the assortativity coeﬃcients for each network are close to unity(e.g., maximum assortativity, Figure A.13). The SparCC network comprises about twice as many edges asthe SPIEC-EASI networks. SparCC infers 147 distinct edges; these additional edges correspond to negativeassociations between OTUs of Ruminococcaceae (genus Faecalibacterium) and Enterobacteriacae families(various genera) and a dense web of correlations within Enterobacteriacae OTUs. Similarly, CCREPEidentiﬁed 152 edges uniquely, with many negative edges between Enterobacteriaceae and Lachnospiraceae(genera: Blautia, Roseburia and unknown); additionally, CCREPE uniquely predicted positive edges betweenthe Lachnospiraceae and Ruminococcaceae (genus: Faecalibacterium). Both SPIEC-EASI methods producerelatively sparse networks by comparison. S-E(glasso) infers a total of 271 total edges (with one uniqueedge), and S-E(MB) infers 206 edges with 25 unique edges. In scale with edge predictions, both CCREPEand SparCC infer networks with large maximum degree (33 and 30, respectively), while the S-E(MB) andS-E(glasso) networks have a maximum degree of sixteen and eight, respectively (Figure A.13). However,even though CCREPE and SparCC predict a similar number of total edges, the global network propertiesare distinct. CCREPE predicts a higher maximum betweenness centrality and a larger number of nodes inthe largest connected component (100).In summary, analysis of the AGP networks suggests that the SPIEC-EASI inference schemes constructmore reproducible taxon-taxon interactions than SparCC and CCREPE and infer considerably sparser modelnetworks than the other two methods. These observations may be explained as follows: SparCC andCCREPE aim to recover correlation networks, which contain both direct edges as well as indirect (e.g.,spurious) edges (due to correlation alone). SparCC and CCREPE may recover indirect edges less robustlythan direct edges, an explanation that would be consistent with the Hamming distance reproducibility anal-ysis. In addition, all methods’ resulting networks suggest that the topology of the American Gut associationnetwork cannot be attributed to a speciﬁc network class. Instead, these networks are a mixture of band,scale-free, and cluster network type, and they exhibit high phylogenetic assortativity within highly connectedcomponents. Discussion

Inferring interactions among diﬀerent microbial species within a community and understanding their in-ﬂuence on the environment is of central importance in ecology and medicine [19, 60]. An ever increasingnumber of recent amplicon-based sequencing studies have uncovered strong correlations between microbialcommunity composition and environment in diverse and highly relevant domains of life [1,9,10,61–63]. Thesestudies alone underscore the need to understand how the microbial communities adapt, develop, and inter-act with the environment [5]. Elucidation of species interactions in microbial communities across diﬀerentenvironments remains, however, a formidable challenge. Foremost, available high-throughput experimentaldata are compositional in nature, overdispersed, and usually underdetermined with respect to statisticalinference. In addition, for most microbes few to no ecological interactions are known, thus the ecologicalinteraction network must be constructed de novo , in the absence of guiding assumptions and a set of "goldstandard" interactions for validation.To overcome both challenges, we present SPIEC-EASI ( S parse I nvers E C ovariance Estimation for E cological A ssociation I nference), a computational framework that includes statistical methods for the in-ference of microbial ecological interactions from 16S rRNA gene sequencing datasets and a sophisticated syn-thetic microbiome data generator with controllable underlying species interaction topology. SPIEC-EASI’sinference engine includes two well-known graphical model estimators, neighborhood selection [20] and sparseinverse covariance selection [22, 23, 46] that are extended by compositionally robust data transformations forapplication to the speciﬁc context of microbial abundance data.The synthetic data pipeline was used to generate realistic-looking gut microbiome datasets for a controlledbenchmark of SPIEC-EASI’s inference performance relative to two state-of-the-art methods, SparCC [17]and CCREPE [9]. We showed that neighborhood selection (S-E(MB)) outperforms SparCC and CCREPEin terms of recovery of taxon-taxon interactions and global network topology features under almost all testedbenchmark scenarios, while covariance selection (S-E(glasso)) performs competitively with and sometimes8better than SparCC and CCREPE.Through our simulation study, we demonstrate that several other factors, in addition to total number ofsamples, aﬀect network recovery. Foremost and in agreement with theoretical results from high-dimensionalstatistics [29, 30, 39], network topology has a signiﬁcant impact, as network recovery performance is nearlydoubled from scale-free to cluster to band (Figure 4) for ﬁxed sample size, number of taxa, and conditionnumber. We also demonstrated dependence to strength of direct interactions (and thus strength of corre-lations) within a given network. Our simulation study provides the community with rough guidelines forrequisite sample sizes, given state-of-the-art network inference and basic assumptions about the underlyingnetwork. This is of obvious importance to experimental design and the estimation of statistical power. Here,we used the synthetic data pipeline to generate datasets characteristic of the gut microbiome. However,the SPIEC-EASI data generator is generic and therefore enables researchers to generate synthetic datasetsthat resemble microbiome samples in terms of taxa dispersion and marginal distributions from their ﬁeld ofresearch, such as soil or sea water ecosystems [1].Our application study on real American Gut Project (AGP) data revealed that inference with SPIEC-EASI produced more consistent and sparser interaction networks than SparCC and CCREPE. In addition,our AGP network analysis revealed several biologically relevant observations. Speciﬁcally, we observed thatOTUs were more likely to interact with phylogenetically related OTUs (Figure 6c and Figure A.13). Inaddition, our gut microbial interaction networks appear to be a composite of network types, as we ﬁndevidence for scale-free, band-like, and cluster subnetworks.An important advantage of neighborhood and covariance selection as underlying inference frameworks istheir ability to include prior knowledge about the underlying data or network structure from independentscientiﬁc studies in a principled manner. For example, in the neighborhood selection scheme, the standardLASSO approach can be augmented by a group penalty [64] that takes into account a priori known groupstructure. The assortativity observed in our gut microbial interaction networks suggests that such a groupingof OTUs based on phylogenetic relationship might improve inference. Moreover, if veriﬁed species interactionsare available for a certain microbial contexts, this knowledge can be included in covariance and neighborhoodselection by relaxing the penalty term on these interactions. This strategy has already been fruitfully appliedto inference of similarly high-dimensional transcriptional regulatory networks [65]. Finally, in agreement withtheoretical and empirical work in high-dimensional statistics, our synthetic benchmark results conﬁrmed thatnetworks with scale-free structures elude accurate inference even if the underlying network is globally sparse.Recent modiﬁed neighborhood [30] and covariance selection [31] schemes improve recovery of scale-freenetworks and can be conveniently included into SPIEC-EASI.Finally, although the main focus of this work is inference of microbial interaction networks, estimationof the regularized inverse covariance matrix with S-E(glasso) will be key to addressing several other impor-tant questions arising from microbiome studies. For example, statistical methods to infer which taxa areresponsive to design factors in 16S gene amplicon studies is an active area of research. Most methods testeach taxon independently one-at-a-time (see [1] and references therein) even though taxa are actually highlycorrelated and thought to ecologically interact. Inference of taxa responses from 16S rRNA gene sequenc-ing datasets could be improved by modeling this correlation structure through incorporation of the inversecovariance matrix into the statistical model [66].Other, more complex questions are motivated by a desire to understand why and how ecosystems evolvewith time. In the dynamic modeling setting, association networks have already been successfully used as anunderlying structure to ﬁt a diﬀerential-equation-based model of gut microbiome development in mice [14].Thus, association networks provide the underlying topology for dynamic models, which can be used todevelop hypotheses about how the ecosystems might respond to speciﬁc perturbations [5].In conclusion, SPIEC-EASI is an improvement over state-of-the-art methods for inference of microbialecological networks from microbiome composition datasets. We demonstrate this through rigorous bench-marking with synthetic networks and also through application to a true biological dataset. In addition, theLASSO underpinnings of the SPIEC-EASI inference methods provide a ﬂexible and principled mathematicalframework to incorporate additional information about microbial ecological association networks as it be-comes available, thereby improving prediction. We anticipate that SPIEC-EASI network inference will serveas a backbone for more sophisticated modeling endeavors, engendering new hypotheses and predictions ofrelevance to environmental ecology and medicine.9 Acknowledgments

We would like thank Eric Alm and Jonathan Friedman for many helpful discussions and the manuscriptreviewers for key improvements in the presentation of this work.

References

1. Gilbert J, Meyer F, Jansson J, Gordon J, Pace N, et al. (2010) The earth microbiome project: Meetingreport of the “1st emp meeting on sample selection and acquisition” at argonne national laboratoryoctober 6th 2010. Standards in Genomic Sciences 3.2. Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett C, Knight R, et al. (2007) The human microbiomeproject: exploring the microbial part of ourselves in a changing world. Nature 449: 804.3. AmGut. The american gut project. http://humanfoodproject.com/americangut/ . Accessed: 2014-01-30.4. Bunge J, Willis A, Walsh F (2014) Estimating the number of species in microbial diversity studies.Annual Review of Statistics and Its Application 1: 427-445.5. Foster JA, Krone SM, Forney LJ (2008) Application of ecological network theory to the human mi-crobiome. Interdisciplinary perspectives on infectious diseases 2008: 839501.6. Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, et al. (2011) Enterotypes of the humangut microbiome. Nature 473: 174–180.7. Koren O, Knights D, Gonzalez A, Waldron L, Segata N, et al. (2013) A guide to enterotypes acrossthe human body: meta-analysis of microbial community structures in human microbiome datasets.PLoS computational biology 9: e1002863.8. Chen J, Li H (2013) Variable selection for sparse dirichlet-multinomial regression with an applicationto microbiome data analysis. The Annals of Applied Statistics 7: 418–442.9. Gevers D, Kugathasan S, Denson LA, Vázquez-Baeza Y, Van Treuren W, et al. (2014) The treatment-naive microbiome in new-onset crohn’s disease. Cell Host & Microbe 15: 382–392.10. Lee SC, Tang MS, Lim YAL, Choy SH, Kurtz ZD, et al. (2014) Helminth colonization is associatedwith increased diversity of the gut microbiota. PLoS Negl Trop Dis 8: e2880.11. Faust K, Sathirapongsasuti JF, Izard J, Segata N, Gevers D, et al. (2012) Microbial Co-occurenceRelationships in the Human Microbiome. PLoS Computational Biology 8: e1002606.12. Fuhrman JA, Steele JA (2008) Community structure of marine bacterioplankton: Patterns, networks,and relationships to function. In: Aquatic Microbial Ecology. volume 53, pp. 69–81. doi:10.3354/ame01222.13. Barberán A, Bates ST, Casamayor EO, Fierer N (2012) Using network analysis to explore co-occurrence patterns in soil microbial communities. The ISME journal 6: 343–51.14. Marino S, Baxter NT, Huﬀnagle GB, Petrosino JF, Schloss PD (2014) Mathematical modeling ofprimary succession of murine intestinal microbiota. Proceedings of the National Academy of Sciencesof the United States of America 111: 439–44.15. Deng Y, Jiang YH, Yang Y, He Z, Luo F, et al. (2012) Molecular ecological network analyses. BMCBioinformatics 13: 113.16. Steele Ja, Countway PD, Xia L, Vigil PD, Beman JM, et al. (2011) Marine bacterial, archaeal andprotistan association networks reveal ecological linkages. The ISME journal 5: 1414–25.017. Friedman J, Alm EJ (2012) Inferring correlation networks from genomic survey data. PLoS compu-tational biology 8: e1002687.18. Aitchison J (1981) A new approach to null correlations of proportions. Mathematical Geology 13:175–189.19. Faust K, Raes J (2012) Microbial interactions: from networks to models. Nat Rev Micro 10: 538–550.20. Meinshausen N, Bühlmann P (2006) High Dimensional Graphs and Variable Selection with the Lasso.The Annals of Statistics 34: 1436–1462.21. Bonneau R, Reiss D, Shannon P, Facciotti M, Hood L, et al. (2006) The inferelator: an algorithm forlearning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biology7: R36.22. Friedman J, Hastie T, Tibshirani R (2008) Sparse inverse covariance estimation with the graphicallasso. Biostatistics (Oxford, England) 9: 432–441.23. Banerjee O, El Ghaoui L, d’Aspremont A (2008) Model selection through sparse maximum likelihoodestimation for multivariate gaussian or binary data. The Journal of Machine . . . 9: 485–516.24. Wille A, Zimmermann P, Vranová E, Fürholz A, Laule O, et al. (2004) Sparse graphical Gaussianmodeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol 5: R92.25. Friedman N (2004) Inferring Cellular Networks Using Probabilistic Graphical Models. Science 303:799–805.26. Bonneau R (2008) Learning biological networks: from modules to dynamics. Nature chemical biology4: 658–64.27. Jones DT, Buchan DWA, Cozzetto D, Pontil M (2011) Psicov: Precise structural contact predictionusing sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics .28. Marks DS, Hopf TA, Sander C (2012) Protein structure prediction from sequence variation. Naturebiotechnology 30: 1072–80.29. Ravikumar P, Wainwright MJ, Raskutti G, Yu B (2011) High-dimensional covariance estimation byminimizing L1 -penalized log-determinant divergence. Electronic Journal of Statistics 5: 935–980.30. Tandon R, Ravikumar P (2014) Learning Graphs with a Few Hubs. In: Proceedings of The 31stInternational Conference on Machine Learning. pp. 602–610. URL http://jmlr.org/proceedings/papers/v32/tandon14.html .31. Liu Q, Ihler AT (2011) Learning scale free networks by reweighted l1 regularization. In: AISTATS.pp. 40-48.32. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, et al. (2009) Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbialcommunities. Applied and Environmental Microbiology 75: 7537–7541.33. La Rosa PS, Brooks JP, Deych E, Boone EL, Edwards DJ, et al. (2012) Hypothesis testing and powercalculations for taxonomic-based human microbiome data. PloS one 7: e52078.34. Aitchison J (1986) The statistical analysis of compositional data. London; New York: Chapman andHall.35. Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. MIT press.36. Lauritzen SL (1996) Graphical models. Oxford University Press.37. Wainwright MJ, Jordan MI (2008) Graphical models, exponential families, and variational inference.Foundations and Trends in Machine Learning 1: 1–305.138. Bühlmann P, Kalisch M, Meier L (2014) High-dimensional statistics with a view toward applicationsin biology. Annual Review of Statistics and Its Application 1: 255–278.39. Ravikumar P, Wainwright MJ, Laﬀerty JD, et al. (2010) High-dimensional ising model selection usingl1-regularized logistic regression. The Annals of Statistics 38: 1287–1319.40. Dempster AP (1972) Covariance selection. Biometrics : 157–175.41. Tibshirani R (1996) Regression shrinkage and selection via the lasso. Journal of the Royal StatisticalSociety Series B (Methodological) 58: 267–288.42. Lederer J, Müller CL (2015) Don’t fall for tuning parameters: Tuning-free variable selection in highdimensions with the TREX. In: AAAI Conference on Artiﬁcial Intelligence.43. Lam C, Fan J (2009) Sparsistency and rates of convergence in large covariance matrix estimation. TheAnnals of Statistics 37: 4254–4278.44. Loh PL, Wainwright MJ (2013) Structure estimation for discrete graphical models: Generalized co-variance matrices and their inverses. The Annals of Statistics 41: 3022–3049.45. Liu H, Laﬀerty J, Wasserman L (2009) The nonparanormal: Semiparametric estimation of high di-mensional undirected graphs. J Mach Learn Res 10: 2295–2328.46. Yuan M, Lin Y (2007) Model selection and estimation in the Gaussian graphical model. Biometrika94: 19-35.47. Hastie T, Tibshirani R, Friedman J, Hastie T, Friedman J, et al. (2009) The Elements of StatisticalLearning. Springer.48. Liu H, Roeder K, Wasserman L (2010) Stability approach to regularization selection (stars) for highdimensional graphical models. Proceedings of the Twenty-Third Annual Conference on Neural Infor-mation Processing Systems (NIPS) : 1–14.49. Zhao T, Liu H, Roeder K (2012) The huge package for high-dimensional undirected graph estimationin R. The Journal of Machine Learning Research 13: 1059–1062.50. Schaﬀter T, Marbach D, Floreano D (2011) GeneNetWeaver: in silico benchmark generation andperformance proﬁling of network inference methods. Bioinformatics (Oxford, England) 27: 2263–2270.51. Nelsen RB (1999) An introduction to copulas. Springer Series in Statistics. Springer.52. Madsen L, Dalthorp D (2007) Simulating correlated count data. Environmental and Ecological Statis-tics 14: 129–148.53. Cario MC, Nelson BL (1997) Modeling and generating random vectors with arbitrary marginal distri-butions and correlation matrix. Industrial Engineering : 1–19.54. R Development Core Team (2011). R: A Language and Environment for Statistical Computing. URL .55. Yee TW (2007) The VGAM Package for Categorical Data Analysis. Journal of Statistical Software32: 1–34.56. McMurdie PJ, Holmes S (2014) Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmis-sible. PLoS Computational Biology 10: e1003531.57. La Rosa PS, Brooks JP, Deych E, Boone EL, Edwards DJ, et al. (2012) Hypothesis testing and powercalculations for taxonomic-based human microbiome data. PloS one 7: e52078.58. Barabási AL, Albert R (1999) Emergence of Scaling in Random Networks. Science 286: 509–512.259. Friedman J. SparCC package and data. https://bitbucket.org/yonatanf/sparcc/src/9a1142c179f7/example . Accessed: 2014-11-05.60. Longman RS, Yang Y, Diehl GE, Kim SV, Littman DR (2013) Microbiota: Host interactions inmucosal homeostasis and systemic autoimmunity. In: Cold Spring Harbor symposia on quantitativebiology. Cold Spring Harbor Laboratory Press, volume 78, pp. 193–201.61. HMP Consortium (2012) Structure, function and diversity of the healthy human microbiome. Nature486: 207–214.62. de Vos WM, de Vos EA (2012) Role of the intestinal microbiome in health and disease: from correlationto causation. Nutrition Reviews 70: S45–S56.63. Scher JU, Sczesnak A, Longman RS, Segata N, Ubeda C, et al. (2013) Expansion of intestinal prevotellacopri correlates with enhanced susceptibility to arthritis. eLife 2.64. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. Journalof the Royal Statistical Society: Series B (Statistical Methodology) 68: 49–67.65. Greenﬁeld A, Hafemeister C, Bonneau R (2013) Robust data-driven incorporation of prior knowledgeinto the inference of dynamic regulatory networks. Bioinformatics 29: 1060-1067.66. Lin W, Shi P, Feng R, Li H (2014) Variable selection in regression with compositional covariates.Biometrika accepted.3

A Appendix

A.1 Method comparison

SPIEC EASI SparCC CCREPE PearsonUnderlying Metric Conditional Independence Correlation Correlation CorrelationCompositional Correction Yes Yes Yes NoAitchison Measure Inverse of clr covariance matrix Aitchison Variation - -Network Assumptions Network sparsity Average correlation is zero - -

Table to compare some of the features of SPIEC-EASI, SparCC, CCREPE and Pearson’s correlation coeﬃ-cient.

A.2 Comparative marginal ﬁts to American Gut Project count data

We considered ﬁve common distributions for modeling OTU count data in SPIEC-EASI’s synthetic datagenerator. As as target count data we used rounded common-scale normalized data from the American GutProject (AGP). After ﬁltering, a total of p = 205 diﬀerent OTU counts w i ∈ N n of sample size n = 558 areavailable for marginal ﬁtting. We ﬁt the parameters of each distributional model to each OTU count vector w i independently using maximum likelihood (ML) estimation. A.2.1 Common count distributions

We considered the following ﬁve common distributions to model univariate count data:

Log-normal.

The log-normal distribution is a continuous distribution of a random variable U whoselogarithm is normally distributed, U ∼ ln N ( µ, σ ) , where µ and σ are the mean and standard deviation of ln( U ) , respectively. Its probability density function is given by P ( U = u ) = 1 uσ √ π e − (ln u − µ )22 σ , (7)To model discrete counts, the continuous values can be discretized by rounding to the nearest integer. Poisson.

The Poisson distribution models a given number of discrete “events" occurring in a ﬁxed interval(a of an ecosystem). In ecology, a Poisson distributed random variable U ∼ ln Pois( λ ) describes the numberof occurrences of a species across diﬀerent ecosystem samples. The probability mass function is given by: P ( U = u ) = λ u e − λ u ! , (8)where λ is the Poisson ’rate’ parameter that determines both mean and variance. Zero-inﬂated Poisson.

Zero-inﬂated models account for an excess of zero counts in real data that cannotbe handled by a single component model. In ecology, count data may be skewed toward zero due to a pre-ponderance of counts that fall below the sampling depth. The zero-inﬂated Poisson (ziP) model incorporatesan additional term to account for this feature. The random variable reads: U ∼ (cid:40) . φ Pois( λ ) with prob . − φ and its mass function is given by:4 P ( U = 0) = φ + (1 − φ ) e − λ (9) P ( U = u ) = (1 − φ ) λ u e − λ u ! (10)where φ is the probability of obtaining an excess zero and λ is the Poisson rate parameter as above. When φ = 0 , the ziP distribution reduces to the Poisson distribution. Negative Binomial.

The negative binomial (NB) distribution arises as a hierarchical mixture of Poissondistributions that can model the variability from multiple sources (such as, e.g., biological replication andlibrary preparation) [1]. The NB distribution is thus more appropriate for handling overdispersion in thedata. We assume the means of the Poisson variables to be Gamma-distributed random variables with shapehyper-parameter r and scale θ = p/ (1 − p ) . First, a random Poisson mean λ is sampled from a Gammadistribution Γ( r, θ ) , and then a random variable u from Pois( λ ) . The compact form of the mass distributionof a discrete NB random variable U ∼ NB(r; p) then reads: P ( U = u ) = Γ( r + u ) u ! Γ( r ) p u (1 − p ) r . (11)Here, overdispersion is controlled by the parameter r . The NB model collapses to the Poisson model for r → ∞ . Zero-inﬂated Negative Binomial

Similar to the ziP model, the zero-inﬂated NB (ziNB) distributioncan better account for excess zeros in the observed data. The standard NB distribution is augmented by azero-inﬂation term. We denote a ziNB random variable by: U ∼ (cid:40) . φ NB( r ; p ) with prob . − φ , (12)where φ is the probability of obtaining an excess zero. The corresponding ziNB mass function reads: P ( U = 0) = φ + (1 − φ ) p u (13) P ( U = u ) = (1 − φ ) Γ( r + u ) u ! Γ( r ) p u (1 − p ) r . (14)For φ = 0 the ziNB model reduces to the NB distribution. A.2.2 Goodness-of-ﬁt to the AGP data

The AGP data was normalized using common-scale normalization, and fractional counts were rounded. Weﬁt the count data to the ﬁve diﬀerent statistical models and drew synthetic data (under null correlationbetween OTU marginals) from these distributions with the ML ﬁtted parameters. Figure 7 shows QQ plotsof synthetic and real data for the ﬁve diﬀerent distributions.

Figure 7.

QQ plots of AGP data ﬁt to ﬁve diﬀerent distributions (shown on a log-log scale for bettervisibility). The r values correspond to the unscaled quantile-quantile relationships.5The ziNB distribution ( r = 0 . ) is the only model that can accurately model both tails of the OTUdata. We thus use the ziNB as the standard setting in SPIEC-EASI’s data generation model. References

1. McMurdie PJ, Holmes S (2014) Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmis-sible. PLoS Computational Biology 10: e1003531.

A.3 Heatmaps of AGP and synthetic data

20 40 60 80 10020406080100120140160180200

Samples O T U s Synthetic Dataset (Graph band, κ = 100)

20 40 60 80 10020406080100120140160180200

Samples O T U s Synthetic Dataset (Graph cluster, κ = 100)

20 40 60 80 10020406080100120140160180200

Samples O T U s Synthetic Dataset (Graph scale free, κ = 100)

20 40 60 80 10020406080100120140160180200

Samples O T U s Synthetic Dataset (Graph band, κ = 10)

20 40 60 80 10020406080100120140160180200

Samples O T U s Synthetic Dataset (Graph cluster, κ = 10)

20 40 60 80 10020406080100120140160180200

Samples O T U s Synthetic Dataset (Graph scale free, κ = 10)

20 40 60 80 10020406080100120140160180200

Samples O T U s American Gut Data

Figure 8.

Visual comparison of American Gut Project data with synthetically generated datasets.Heatmaps show log10(counts). For all network types (band, graph, scale-free), synthetic datasets areconsistent with real datasets in terms of number of OTUs, number of samples, and OTU count abundancesacross samples.6

A.4 Relationship between target and empirical correlation after NorTA trans-formation

Figure 9.

The recovery of empirical Pearson correlations generated from the NORTA process, usingzero-inﬂated Negative Binomial as a model (x-axis) verses the input multivariate Normal empiricalcorrelations (upper panels) or inverse correlations (lower panels) on untransformed counts ( a ) orlog-transformed counts ( b ). Simulated data are with p = 205 OTUs, n = 20 , samples with replicateson each plot.7 A.5 Eﬀect of Condition Number on Correlations Distribution Correlation C oun t s Correlation C oun t s Cluster Scale-free Band

Correlation C o u n t s Correlation

Correlation C o u n t s = 10 = 10 = 10 = 100 = 100 = 100 Figure 10.

The condition of a Precision matrix is the ratio of the largest to smallest eigenvalue/singularvalue. The relationship between condition number and correlation distribution for the synthetic networks;increasing condition number corresponds to increasing the strength of correlations in the network.

A.6 Betweenness Centrality for Synthetic Networks

Figure 11.

Examples of betweenness centrality distributions for each network type (in white) overlayedwith the distribution predicted by S-E(MB) (in orange) for κ = 100 , n = 1360 samples, p = 205 OTUs.8

A.7 Recovery Performance of Betweenness Centrality

34 68 102 13600246810 Sample Size (N) K L D i v e r gen c e Band ( κ = 10)

34 68 102 1360012345 Sample Size (N)

Cluster ( κ = 10)

34 68 102 136000.20.40.60.811.2 Sample Size (N)

Scale−free ( κ = 10)

34 68 102 13600246810 Sample Size (N) K L D i v e r gen c e Band ( κ = 100)

34 68 102 136001234 Sample Size (N)

Cluster ( κ = 100)

34 68 102 136000.20.40.60.811.21.4 Sample Size (N)

Scale−free ( κ = 100) Figure 12.

Performance results for betweenness centrality (red = S-E(glasso) orange = S-E(MB), purple= SparCC, blue = CCREPE, green = Pearson). Bars represent the average KL-divergence over threeindependent sets of synthetic datasets (7 datasets per set); error bars represent standard error. Asterisksindicate that an S-E method had siginiﬁcantly better recovery of the true betweenness centralitydistributions ( p < . for one-sided T tests in comparison to each control method). A.8 Geodesic Distance for Synthetic Networks

Figure 13.

Examples of geodesic distance distributions for each network type (in white) overlaid with thedistribution predicted by S-E(MB) (in orange) for κ = 100 , n = 1360 samples, p = 205 OTUs.9

A.9 Recovery Performance of Geodesic Distance

34 68 102 1360051015 Sample Size (N) K L D i v e r gen c e Band ( κ = 10)

34 68 102 1360012345 Sample Size (N)

Cluster ( κ = 10)

34 68 102 1360051015 Sample Size (N)

Scale−free ( κ = 10)

34 68 102 136005101520 Sample Size (N) K L D i v e r gen c e Band ( κ = 100)

34 68 102 136001234 Sample Size (N)

Cluster ( κ = 100)

34 68 102 136005101520 Sample Size (N)

Scale−free ( κ = 100) Figure 14.

Performance results for geodesic distance distributions (red = S-E(glasso), orange = S-E(MB),purple = SparCC, blue = CCREPE, green = Pearson). Bars represent the average KL-divergence overthree independent sets of synthetic datasets (7 datasets per set); error bars represent standard error.Asterisks indicate that an S-E method had signiﬁcantly better recovery of the true geodesic distancedistributions ( p < . for one-sided T tests in comparison to each control method). A.10 Cluster Sizes for Synthetic Networks

Figure 15.

Examples of cluster size distributions for each network type (in white) overlayed with thedistribution predicted by S-E(MB) (in orange), for κ = 100 , n = 1360 samples, p = 205 OTUs.0

A.11 Recovery Performance of Cluster Sizes

34 68 102 13600123456 Sample Size (N) K L D i v e r gen c e Band ( κ = 10)

34 68 102 136000.20.40.60.811.2 Sample Size (N)

Cluster ( κ = 10)

34 68 102 13600123456 Sample Size (N)

Scale−free ( κ = 10)

34 68 102 13600123456 Sample Size (N) K L D i v e r gen c e Band ( κ = 100)

34 68 102 136000.20.40.60.811.2 Sample Size (N)

Cluster ( κ = 100)

34 68 102 136001234567 Sample Size (N)

Scale−free ( κ = 100) Figure 16.

Performance results for cluster size distributions (red = S-E(glasso), orange = S-E(MB),purple = SparCC, blue = CCREPE, green = Pearson). Bars represent the average KL-divergence overthree independent sets of synthetic datasets (7 datasets per set); error bars represent standard error.Asterisks indicate that an S-E method had signiﬁcantly better recovery of the true cluster size distributions(

P < . for one-sided T tests in comparison to each control method).1 A.12 Synthetic test case from SparCC

Figure 17.

Using the example data provided by the SparCC package, we inferred networks from SparCC(threshold correlation at ± . ), S-E(MB) and CCREPE (thresholding q-value at ∗ − ). In this setting,SparCC and SPIEC-EASI correctly recover four true edges, including the association sign. A.13 Assortativity coeﬃcients in inferred American Gut networks