[PDF] CONCOCT: Clustering cONtigs on COverage and ComposiTion

Abstract

Metagenomics enables the reconstruction of microbial genomes in complex microbial communities without the need for culturing. Since assembly typically results in fragmented genomes the grouping of genome fragments (contigs) belonging to the same genome, a process referred to as binning, remains a major informatics challenge. Here we present CONCOCT, a computer program that combines three types of information - sequence composition, coverage across multiple sample, and read-pair linkage - to automatically bin contigs into genomes. We demonstrate high recall and precision rates of the program on artificial as well as real human gut metagenome datasets.

Full PDF

11 CONCOCT: Clustering cONtigs on COverage and ComposiTion

Johannes Alneberg , ∗ , Brynjar Sm´ari Bjarnason , ∗ , Ino de Bruijn , , Melanie Schirmer , Joshua Quick ,Umer Z. Ijaz , Nicholas J. Loman , Anders F. Andersson , † & Christopher Quince , † ∗ Joint ﬁrst authors † Correspondence and requests for materials should be addressed to A.F.A (email:[email protected]) or C.Q. (email: [email protected])

Abstract

Metagenomics enables the reconstruction of microbial genomes in complex microbial communities withoutthe need for culturing. Since assembly typically results in fragmented genomes the grouping of genomefragments (contigs) belonging to the same genome, a process referred to as binning, remains a majorinformatics challenge. Here we present CONCOCT, a computer program that combines three types ofinformation - sequence composition, coverage across multiple sample, and read-pair linkage - to auto-matically bin contigs into genomes. We demonstrate high recall and precision rates of the program onartiﬁcial as well as real human gut metagenome datasets.

Introduction

Metagenomic shotgun sequencing of microbial communities is widely used to investigate the geneticpotential and taxonomic composition of microbial communities. It has also been used to reconstructthe genomes of individual microbial populations [1, 2], and with the rise of high-throughput sequencingthis has become possible for highly complex environments such as sediments and soils [3]. Repeatedsequences within and across genomes, limited coverage, sequencing errors, and strain-level variationin gene content all result in fragmented genomes following genome assembly. A major bioinformaticschallenge in metagenomics is therefore the process of binning contigs into species-level groups. If closelyrelated reference genomes are lacking, binning has to be conducted in an unsupervised fashion. Whilenumerous binning methods exist, few combine multiple lines of evidence and most are not fully automatic.For example, a recent method utilises coverage across two samples, composition and linkage but througha two-dimensional visualisation approach requiring manual partitioning and validation [4]. Such anapproach is not reproducible and cannot scale to incorporate information from more than two samples.Here we present CONCOCT, a computer program that uses Gaussian mixture models (GMMs) [5] tocluster contigs into genomes based on sequence composition (kmer frequencies) and coverage acrossmultiple samples. To determine the number of clusters we perform model selection using the Bayesianinformation criteria (BIC) [5]. Since this typically results in slightly more clusters than number ofgenomes, it is followed by an automatic cluster-merging step that uses information on read-pair linkagewithin and across clusters. a r X i v : . [ q - b i o . GN ] D ec Results

Synthetic Mock Metagenome

In order to validate CONCOCT we constructed a synthetic mock metagenome data set of 64 samples,each sample comprising random paired-end reads from the same 41 genomes but with diﬀerent relativefrequencies. These frequencies were taken from true abundances of related organisms in 64 humanfecal samples determined by 16S rRNA sequencing as part of the Human Microbiome Project [6]. Theabundances of the 41 species across the samples are shown in Figure 1. These reads were then usedto generate a co-assembly of all 64 samples giving contigs. A contig is a stretch of consensus sequencegenerated by overlapping reads. Statistics summarising the co-assembly are given in Table 1. The readswere then mapped back onto the contigs to determine the contig coverages in each of the samples. Thecoverage is deﬁned as the average number of reads mapping to each position in the contig in that sample.These coverages across samples reﬂect the abundances of the underlying organisms across the samplesand, hence, can be used to disentangle which contig derives from which organism’s genome.The sequence composition of the contigs also contains information, since diﬀerent organisms havediﬀerent characteristic signatures of k-mers [7], speciﬁc n-grams of nucleotide sequence, reﬂecting muta-tional biases [8]. This forms the basis of most existing non-supervised binning methods [9–11]. To enableus to incorporate both composition and coverage we developed a unique pre-processing strategy. Foreach contig we generate a combined proﬁle, joining the coverage and composition vectors together. Thisis a high dimensional object possessing a total of 65 coverage dimensions - a normalised coverage in eachsample and a total coverage - plus 136 k-mer dimensions, if tetramers are used as here. We then performa principle components analysis (PCA), to reduce the dimensionality of this 201 dimensional combinedvector, keeping enough (22) dimensions D to maintain 90% of the information. The transformed vectorsthen distill information from both coverage and composition. In Figure 2 we plot the 10,458 contigs fromthe coassembly that exceeded 1000 base pairs (bp) in length in the ﬁrst two PCA dimensions. The speciesform distinct clusters even in this two dimensional visualisation, strongly implying that clustering themin the full D = 22 dimensional space should allow the diﬀerent species to be easily resolved.To cluster the contigs we require a method that is unsupervised, operates in any number of dimen-sions, and can automatically determine the number of clusters present. Model based clustering using anexplicit likelihood to describe the data as a mixture of cluster components each described by a separatedistribution fulﬁlls these criteria. It requires no human input. The number of dimensions is simply aparameter of the model allowing us to use all D dimensions and model selection based on the BIC canbe used to determine cluster number [5]. Then we simply need to determine the form of the distributionsdescribing the clusters, it is apparent from Figure 2 that each species forms an ellipsoid in the transformedspace, this strongly motivates the use of Gaussian mixture models (GMMs) with full covariance matrices,where each cluster component describes a fully ﬂexible ellipse. This also has the advantage that softwarepackages exist for eﬃciently ﬁtting these models [12].In CONCOCT we couple the above pre-processing strategy with a GMM implementation to auto-matically cluster metagenome contigs. Applying this to the synthetic community with 64 samples wepredicted an optimal cluster number of 56 as the one which minimises the BIC (Figure 3). The clustersare visualised in the ﬁrst two PCA dimensions in Figure 4. For the synthetic community we can directlycompare to the known assignments of contigs to genomes. We illustrate this graphically as a heat map inFigure 5 and present statistics quantifying the comparison in Table 2. We use three statistics precision,recall and the adjusted Rand index (see methods). The precision describes the purity of the clusters,this is remarkably high, 0.97, so that on average a cluster almost entirely contains contigs from the samegenome. The recall conversely gives the proportion of each genome that derives from the same cluster, itis how complete the genomes are, here we obtain 0.84, this is less than one because some genomes, as isclear from Figure 5, are split across multiple clusters. Finally, the adjusted Rand index summarises bothprecision and recall. For this we obtain 0.83 for the synthetic community.To illustrate how much additional information the coverage contains over composition alone and todetermine how many samples are necessary to resolve a data set of the complexity of the syntheticmock we reran CONCOCT for diﬀerent number of samples (Figure 6). The overall accuracy of theclusterings starts to decrease below 16 samples (Figure 6C), mostly due to a loss of precision althoughrecall is impacted too. Clustering without coverage performs poorly although better than random withan adjusted Rand index of 0.31.The basic CONCOCT algorithm performs very well in terms of precision but to improve recall weimplemented a further clustering step that utilises read pair linkage information. If a substantial portionof links from contigs in one cluster are to contigs in a diﬀerent cluster these clusters are merged, providedthe two clusters have similar coverage patterns. This reduced the number of clusters to 51 and improvingrecall to 0.91, without impacting precision, giving a substantially better adjusted Rand index of 0.94(Table 2).We also evaluate clusters based on 36 single-copy core genes (SCGs) that are found in almost all knownbacterial genomes once (Table 3). The counts of these genes are shown in Figure 7 for the clusters withoutlinkage. This suggests that cluster 39 is chimeric which is conﬁrmed by Figure 5, this cluster containsall the Akkermansia muciniphila contigs but also 17 contigs from

Coriobacterium glomerans . The otherclusters, however, appear pure, some are incomplete though, reﬂecting the splitting of genomes acrossclusters mentioned above. Read pair linkage clustering counteracts this problem without negativelyimpacting purity of the clusters (Figure 8).

Escherichia coli (STEC) O104:H4 outbreak metagenome

We next applied CONCOCT to a real metagenomic dataset consisting of a total of over 300 million readsfrom 53 fecal samples taken from individuals suspected of being infected during the 2011 outbreak of Shigatoxin-producing

Escherichia coli (STEC) O104:H4 [13]. 43 of these samples derive from individuals thattested positive for STEC by PCR, ten samples were diagnosed as containing other pathogens. Followingco-assembly of these reads, 22,585 contigs of length greater than 1,000 bp were obtained. We couldtaxonomically classify 8,058 of these contigs to species level. A total of 51 species were observed but only19 with greater than ﬁve assigned contigs. This agreed well with the 31 clusters predicted to be optimalby CONCOCT (Figure 9). In contrast to other studies of the human gut microbiome, the low diversitiesobserved here are likely because most samples are dominated by

E. coli . We compared the clusters to the36% of contigs which could be classiﬁed (Figure 10 and Table 2). As for the synthetic mock we obtaineda very high precision of 0.94. The recall was somewhat lower, 0.79, probably because in a real communityfrom multiple individuals some species will be split across separate, distantly related strains. In Figure 11we give the frequencies of the SCGs in each cluster wthis suggests that we have 11 complete and fairlypure genomes, with another ten fragments and then some clusters that lack markers.An important use of metagenomic data in pathogen discovery is the reconstruction of entire pathogengenomes, without requiring a reference sequence. We expect that the

E. coli outbreak genome will be moreabundant in recently infected individuals. Therefore we correlated the time since onset of diarrhea, ddays ,with the mean log-coverage proﬁles for the 43 STEC samples. These are obtained by back-transformingthe means of each cluster component (Table 4). There are ten clusters with a false discovery rate ofless than 10%. The relative abundance of these clusters across samples ordered by ddays is shown inthe top panel of Figure 12. The two most signiﬁcant clusters, k = 7 and k = 18, are both stronglynegatively correlated with ddays , they are also predominately made up of contigs assigned to Escherichiacoli . They are therefore candidates for the outbreak genome. We can verify this because the outbreakgenome is known [13], in Table 5 we give the number of contigs from each cluster that map either toparts of the

Escherichia coli

O104:H4 genome that are speciﬁc to that strain or core

E. coli genes. 94.8%of the mapped contigs derive from clusters 7 and 18. These two clusters together represent the outbreakgenome. Many more contigs mapping to the

E. coli core genome derive from cluster 18 and a higherproportion of outbreak-speciﬁc contigs derive from cluster 7 (Table 5 and Figure 13). The separationof the outbreak strain into core and outbreak portions probably reﬂects both the greater variation innucleotide composition in non-core bacterial genes and diﬀerences in coverage where other non-outbreak

E. coli strains are present. From the lower panel of Figure 12 it is apparent that there is a greatervariability in cluster 7 than 18. This leads to the single genome being best described by two componentswith diﬀerent variances.We have focused on the outbreak

E. coli clusters but the clustering approach in CONCOCT facilitatesa whole community analysis, this is evident from Figure 12, there are two further clusters negativelycorrelated with ddays , clusters 8 and 1, very few contigs from the former can be classiﬁed. Cluster 1 isfrom

Enterococcus faecium , potentially evidence of this organism increasing in abundance in associationwith STEC infection. The other clusters are all positively associated with ddays . These are organismsthat may be important in the recovery from STEC infection, they include, cluster 29 -

Acidaminococcusintestini , and cluster 16 -

Ruminococcus sp. SR1/5 . The other clusters are unknown Bacteroides orClostridia species. Note that as the clusters move from negative to positive correlation with ddays their means move to the right in the transformed lower panel of Figure 12. The most important PCAdimension, PCA1, is separating outbreak associated organisms from the healthy commensal microbiota.Finally, applying linkage clustering reduces the cluster number from 31 to 29 and improves recall to 0.87,giving an adjusted Rand index of 0.82 (see Table 2 and Figure 14 for SCGs). In fact, two of the clustersjoined were the

Escherichia coli clusters 7 and 18 recovering the near-complete outbreak genome as asingle cluster.The CONCOCT software is downloadable from https://github.com/BinPro/CONCOCT and test datafrom https://github.com/BinPro/CONCOCT-test-data . DiscussionMaterials and Methods

Synthetic HMP data set

We based the simulation of our mock data set on 64 16S rRNA samples from the HMP project [6]. Thesesamples were denoised with the AmpliconNoise pipeline [14] and OTUs constructed at 3% to approximatespecies. This generated a total 6,839 OTUs with known relative abundance proﬁles across the samples.After ﬁltering out OTUs with a total of less than 50 counts summed across samples, we used BLASTto match the remaining OTU sequences against the NCBI whole genome database [15]. We were ableto identify distinct organisms for 41 OTUs (with a blast identity of > in vivo mock community was prepared withthe TruSeq library preparations method and sequenced on a HiSeq. We also inferred the fragment sizedistribution from this data set and used it for the simulation of the paired-end 2x100bp reads. Theread simulation programme outputs reads in fasta format which we converted into a pseudo fastq formatfor the downstream analysis assuming uniform quality scores. The read names contain information onthe genome from which the respective read originated for the subsequent validation of the CONCOCTalgorithm. E. coli

O104:H4 outbreak

In addition to the synthetic community we considered a real data set. This consisted of over 300 millionreads ﬁltered for human DNA from 53 metagenomic datasets from individuals presenting during the 2011outbreak of Shiga-toxin producing

Escherichia coli (STEC) O104:H4 [13]. 43 of these samples were fromSTEC infected individuals, with the remaining ones from patients with clinical diagnoses of

Clostridiumdiﬃcile , Salmonella enterica and

Campylobacter jejuni . Coassembly and mapping

For both the simulated and real data a coassembly of reads from all samples was performed. Ray version2.1.0 [16] was used to generate the coassemblies, because it is able to handle large datasets by distributingthe computation over multiple nodes and is speciﬁcally designed to handle metagenomics data. For the

E. coli

O104:H4 outbreak only 10% of the reads were used in the coassembly in order to complete on amachine with 256GB RAM with a k-mer length of 31. For the synthetic data, the k-mer length was setto 41, this took six hours using 1,024 cores on a Cray XE6 system.To determine coverage of the coassembly per sample, the reads were mapped back with bwa version0.7.5a-r405 [17] using the mem algorithm. The coverages were subsequently computed with BEDTools[18]. The linkage between contigs was determined using the bam_to_linkage.py script, which is partof the CONCOCT package. The script searches for paired reads in a bam ﬁle that align to the tips oftwo diﬀerent contigs, where a tip was chosen to be 500 bases. Since the synthetic coassembly communityis based on simulated reads, it was furthermore possible to determine the origin of each read mappingto a certain contig. Each contig can then be assigned to a genome based on a majority vote. Thisscript can also be found in CONCOCT: contig_read_count_per_genome.py . The

E. coli contigs weretaxonomically classiﬁed by searching against the NCBI database using

TAXAassign v0.4 [19]. Thisprograms uses BLAST [15] to search for matches from the NCBI nucleotide database that are withina given identity and query coverage. Taxonomy is then assigned based on the top n hits that matchthese criteria using a consensus approach i.e. we assign at a given level if at least 80% of the hits atthat level have the same taxa. We used 90% for both identity and coverage, and the top 100 sequences.These values were chosen to ensure a stringency suﬃcient that we could be conﬁdent in species levelassignments. Preprocessing

Prior to preprocessing contigs are ﬁltered by length, only contigs greater than a minimum size are used.This is an adjustable parameter but was ﬁxed at 1000 base pairs throughout this study. Each ﬁlteredcontig indexed i = 1 , . . . , N is represented by a coverage vector and a composition vector. The coveragevector is simply the average number of reads per base from each of M samples, indexed j = 1 , . . . , M ,mapping to that contig. We will denote the coverage vector for each contig by Y i = ( Y i, , . . . , Y i,M ).The composition vector contains the frequency for each k-mer and its reverse complement in that contig.In all the results presented here a k-mer frequency of 4 was used although CONCOCT can acceptany k-mer length as a parameter. The frequencies are combined with complements since sequencingis bidirectional. The dimension V of the composition is 136 for tetramers due to palindromic k-mers.We denote the composition vector for each contig by Z i = ( Z i, , . . . , Z i,V ). Prior to normalisation weadded a small pseudo-count to both coverage and composition vectors. This removes non-zero entries,necessary to allow the log-transform below. It is equivalent to assuming a uniform Dirichlet prior on therelative frequencies. For the composition we simply add a single count to each k-mer Z (cid:48) i,j = Z i,j + 1but for the coverage we imagine mapping an extra read of length 100bp to each contig in each sample Y (cid:48) i,j = Y i,j + 100 /L i where L i is the contig length.Coverage vectors are normalised, ﬁrstly over samples, so that diﬀerent read numbers from a sampleare accounted for: Y (cid:48)(cid:48) i,j = Y (cid:48) i,j (cid:80) Nk =1 Y (cid:48) k,j , (1)and over contigs to give coverage proﬁles p . This normalises for coverage variation within a genomeensuring that this does not mask co-occurrence of contigs: p i,j = Y (cid:48)(cid:48) i,j (cid:80) Mk =1 Y (cid:48)(cid:48) i,k . (2)However, the total coverage does contain further information that may potentially discriminate organisms,so we keep this as an additional variable Y (cid:48)(cid:48) i,. = (cid:80) Mk =1 Y (cid:48)(cid:48) i,k . We also normalise composition to givecomposition proﬁles q . This accounts for diﬀerent contig lengths: q i,j = Z (cid:48) i,j (cid:80) Vk =1 Z (cid:48) i,k . (3)These two vectors and the total coverage are joined together and log-transformed to give a combinedlog-proﬁle x i = (cid:2) log( p i, ) , . . . , log( p i,M ) , log( Y (cid:48)(cid:48) i,. ) , log( q i, ) , . . . , log( q i,V ) (cid:3) of dimension E = M + V + 1.This vector then represents both coverage and composition, the transform expands the domain of thenormalised variables to the negative half-space, and gives improved results over not-transforming oralternatives such as the square root. Finally, a dimensionality reduction using principle componentsanalysis, implemented as a singular value decomposition, was performed on the N × E matrix of log-proﬁles X with rows corresponding to the vectors x i and thus elements, x i,j . The number of components, D , necessary to explain 90% of the variance in the data were kept. This reduces the dimensionality ofthe clustering problem whilst keeping the majority of the coverage and composition information. We willdenote the transformed data N × D matrix by V with row vectors v i for each contig i . Gaussian mixture model (GMM)

To cluster contigs into bins we use a Gaussian mixture model. This deﬁnes the data likelihood as a sumof K Gaussians: L (cid:0) V| K, π k , µ k , Σ k (cid:1) = N (cid:88) i =1 K (cid:88) k =1 π k N (cid:0) v i | µ k , Σ k (cid:1) , (4)where N denotes a Gaussian or normal distribution. Each of these Gaussian components corresponds toa diﬀerent cluster. They are each parameterised by their own mean vectors, µ k and standard deviations,Σ k for which full covariance structures were used. This allows a very ﬂexible speciﬁcation of the clusteras an ellipse in the D dimensional reduced space. The proportions of each mixture component in thedata set are given by the K dimensional vector π = ( π , . . . , π K ). These are eﬀectively prior probabilitiesthat a contig will derive from a cluster before considering its data vector. We ﬁt the GMM using thescikit-learn Python library [12]. This uses an expectation-maximisation (EM) algorithm to determineboth the mixture model parameters and, crucially, the responsibilities, z i,k , which can be viewed as theprobability that contig i derives from cluster, k . For most of the statistics, we use a hard assignment,obtained by assigning contig i to the cluster for which the z i,k is maximised denoting this γ i ∈ , . . . , K .To determine the correct number of clusters, K in a sample we adopt a model selection approachbased on the Bayesian information criterion (BIC). The BIC is deﬁned as: BIC = − ln (cid:2) L (cid:0) V| K, π k , µ k , Σ k (cid:1)(cid:3) + p.ln ( N ) (5)where p is the total number of parameters, p = K.D ( D + 3) /

2. The BIC therefore penalises badly ﬁttingmodels and parameter number simultaneously. We try a range of K values and select the model thatminimises the BIC. This is a standard approach to model selection for GMMs [5]. It is also useful toconsider the cluster means in the original space, transforming the µ k vectors from the D dimensionalreduced space back to the original E dimensions of the combined log-proﬁles, we will denote these back-transformed vectors as µ (cid:48) k . In particular, the ﬁrst M +1 dimensions of these vectors give the mean clusterproﬁle of the component which can be correlated with sample-speciﬁc meta-data. Evaluating clusterings by comparison to known genome assignments

For the synthetic communities we know the genome each contig derived from, deﬁning this as the genomefrom which the majority of reads mapping to that contig derived. We can view these as class labels.These class assignments represent the idealised grouping, a perfect clustering would predict a K equal tothe number of classes or genomes, 41, and with each cluster purely composed of contigs from one of thegenomes. For the E. coli

O104:H4 outbreak data we do not know the true species assignments for themajority of contigs but we do for those that we could unambiguously assign with the

TAXAAssign script,8,058 out of the 22,585 contigs with length > T P . Similarly if a pairof elements deriving from diﬀerent classes are placed in diﬀerent clusters then this is considered a truenegative,

T N . The Rand index lying between 0 and 1, is simply the number of corrects pairs,

T N + T P ,divided by the total number of pairs (cid:0) N (cid:1) . However, given a random classiﬁcation and a random clusteringwe would expect a non-zero Rand index just by chance. The adjusted Rand index accounts for this bysubtracting the expected value given ﬁxed class and cluster sizes [20] and normalising so that values arestill smaller than equal to 1 which indicates a perfect clustering. In addition to the adjusted Rand indexwe used two further measures that help to understand in what way a clustering is deviating from theclassiﬁcation. The ﬁrst is the recall, here we calculate the number of contigs from each species or classthat are placed in the same cluster, sum over all classes and divide by N . This indicates how completethe genome bins are. The second is the precision, which inversely is calculated by summing the contigsin each cluster that derive from the same species and dividing by N , it indicates how pure the clustersare. These statistics are generated by the script Validate.pl distributed in CONCOCT from the clusterand class assignments.

Evaluating clusters with single-copy core genes (SCGs)

Evaluating clusters with single-copy core genes is an alternative way of evaluating cluster completenessand purity. We utilize housekeeping genes that typically occur in single copies in microbial genomes. Toselect appropriate genes we downloaded all complete microbial genomes from NCBI (August 25, 2013)and selected one random genome for each genus. We counted the frequency of each Cluster of OrthologousGroups (COGs) [21] of genes in these 525 genomes. Since there are very few COGs that occur once inevery genome we instead applied the more relaxed criteria of being present in greater than 97.0% of thegenomes and having an average frequency of less than 1.03. This resulted in 36 COGs. Twenty-seven ofthese are shared with the list of 40 COGs that was selected in a similar way in an earlier study [22]. Thescript

Validate_scg.pl generates a table of counts for these COGs (or another list of genes provided bythe user) within the diﬀerent clusters output by CONCOCT. The 36 selected COGs are given in Table 3.

Incorporating linkage

The clusters produce by the GMM are of a high precision mostly containing contigs from just a singlespecies but some species are split between multiple clusters. To address this we incorporate a furthersource of information, paired-end linkage. The standard Illumina sequencing protocol can generate readsfrom either end of a longer fragment, these reads are known as paired. In some cases the paired reads willnot map to the same contig but diﬀerent, this is referred to as linkage. This suggests that these two contigsin fact derived from the same genome sequence. We use this information to devise a further clusteringalgorithm, but one which is a hierarchical agglomerative clusterer, joining the clusters generated by theGMM if there are suﬃcient links between but with the restriction that the mean coverage proﬁles of theclusters are similar enough. We begin by deﬁning a symmetric N × N linkage matrix, L , with elements l i,j corresponding to the number of linked reads between two contigs i and j . If the cluster of contigs k with members U k comprising all contigs with γ i = k derives from the same genome as another cluster l with members U l then we would expect a suﬃciently large portion of the links formed by U k to join with U l . We quantify this by deﬁning a K × K link transition matrix T with elements t k,l : t k,l = (cid:80) i ∈ U k (cid:80) j ∈ U l H [ l i,j − l m ] (cid:80) i ∈ U k (cid:80) Nj =1 H [ l i,j − l m ] (6)where H [ l i,j − l m ] is a Heaviside function such that: H [ l i,j − l m ] = (cid:40) , if l i,j − l m < . , otherwise . (7)The constant l m acts as a threshold above which we consider two contigs to have enough linked readsto count as ‘linked’. This helps eliminate noise due to badly mapped reads or chimeric contigs, for allresults here we used l m = 10. All linked pairs are given equal weighting through the Heaviside functionto account for diﬀerent coverage levels. The T matrix gives the proportion of links from cluster k tocluster l normalised by the total number of links associated with k . We then calculate the mean coverageproﬁle of each cluster: (cid:104) p k (cid:105) = (cid:80) i ∈ U k p i,j | U k | , (8)and a K × K coverage overlap matrix O which gives the degree of overlap between any two clustercoverage proﬁles: o k,l = M (cid:88) j =1 min( (cid:104) p k,j (cid:105) , (cid:104) p l,j (cid:105) ) . (9)The clustering algorithm itself consists of the following steps:1. Join the two clusters k (cid:48) , l (cid:48) with largest oﬀ-diagonal value t k,l but only if o k,l > o m

2. Update the T and O matrices by creating a new aggregate cluster comprising the union of U k (cid:48) and U l (cid:48) Repeat until largest t k,l < t m .Clusters are joined provided they are suﬃciently correlated until the majority of links are within ratherthan between clusters. We used values of t m = 0 .

05 and o m = 0 . Acknowledgments

This research arose out of a workshop funded through the COST project ES1103 graciously hosted byPedro Fernandes at the Instituto Gulbenkian de Ciˆencia in Lisbon. This work was funded by the SwedishResearch Councils VR [grant 2011-5689], FORMAS [grant 2009-1174] and EU-FP7 [grant BLUEPRINT]through grants to A.F.A. C.Q. is funded by an EPSRC Career Acceleration Fellowship - EP/H003851/1.M.S. is supported by Unilever R&D Port Sunlight, Bebington, United Kingdom. The computationswere performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) atPDC Centre for High Performance Computing (PDC-HPC) and Uppsala Multidisciplinary Center forAdvanced Computational Science (UPPMAX).

References

1. Tyson G, Chapman J, Hugenholtz P, Allen E, Ram R, et al. (2004) Community structure andmetabolism through reconstruction of microbial genomes from the environment. Nature 428: 37-43.2. Herlemann D, Lundin D, Labrenz M, Jrgens K, Zheng Z, et al. (2013) Metagenomic de novoassembly of an aquatic representative of the verrucomicrobial class spartobacteria. MBio 4: e00569-12.3. Kantor R, Wrighton K, Handley K, Sharon I, Hug L, et al. (2013) Small genomes and sparsemetabolisms of sediment-associated bacteria from four candidate phyla. MBio 4: e00708-13.4. Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW, et al. (2013) Genome sequencesof rare, uncultured bacteria obtained by diﬀerential coverage binning of multiple metagenomes.Nature Biotech 31: 533+.5. Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comp J 41: 578-588.6. Huttenhower C, Gevers D, Knight R, Abubucker S, Badger JH, et al. (2012) Structure, functionand diversity of the healthy human microbiome. Nature 486: 207-214.7. Sandberg R, Winberg G, Branden C, Kaske A, Ernberg I, et al. (2001) Capturing whole-genomecharacteristics in short sequences using a nave bayesian classiﬁer. Genome Res 11: 1404-9.8. Knight R, Freeland S, Landweber L (2001) A simple model based on mutation and selection explainstrends in codon and aminoacid usage and gc composition within and across genomes. Genome Biol.9. Chatterji S, Yamazaki I, Bai Z (2008) CompostBin : A DNA composition-based algorithm forbinning environmental shotgun reads. In: Research in Computational Molecular Biology. pp. 17–28.10. Dick G, Andersson A, Baker B, Simmons S, Thomas B, et al. (2009) Community-wide analysis ofmicrobial genome sequence signatures. Genome Biol 10: R85.11. Wang Y, Leung H, Yiu S, Chin F (2012) Metacluster 5.0: a two-round binning approach formetagenomic data for low-abundance species in a noisy sample. Bioinformatics 28: i356-i362.12. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, et al. (2011) Scikit-learn: Machinelearning in Python. J Mach Learn Res 12: 2825–2830.013. Loman NJ, Constantinidou C, Christner M, Rohde H, Chan J, et al. (2013) A culture-independentsequence-based metagenomics approach to the investigation of an outbreak ofshiga-toxigenic escherichia coli o104:h4. JAMA .14. Quince C, Lanzen A, Davenport RJ, Turnbaugh PJ (2011) Removing noise from pyrosequencedamplicons. BMC Bioinf 12.15. Altschul S, Gish W, Miller W, Myers E, Lipman D (1990) Basic local alignment search tool. J MolBiol 215: 403-410.16. Boisvert S, Raymond F, Godzaridis E, Laviolette F, Corbeil J (2012) Ray Meta: scalable de novometagenome assembly and proﬁling. Genome Biol 13:R122.17. Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform.Bioinformatics 26: 589–595.18. Quinlan A, Hall I (2010) Bedtools: a ﬂexible suite of utilities for comparing genomic features.Bioinformatics 26: 841-842.19. Ijaz U (2013). TAXAassign v0.4. https://github.com/umerijaz/taxaassign .20. Hubert L, Arabie P (1985) Comparing partitions. J Classif 2: 193-218.21. Tatusov R, Koonin E, Lipman D (1997) A genomic perspective on protein families. Science 278:631-7.22. Ciccarelli F, Doerks T, von Mering C, Creevey C, Snel B, et al. (2006) Toward automatic recon-struction of a highly resolved tree of life. Science 311: 1283-7.1

Figures Sample9711 Sample8397 Sample9016 Sample1324 Sample1818 Sample1840 Sample1340 Sample1195 Sample1131 Sample454 Sample1234 Sample1100 Sample9290 Sample548 Sample507 Sample872 Sample8690 Sample8652 Sample1278 Sample9904 Sample10092 Sample353 Sample624 Sample8986 Sample8459 Sample871 Sample1215 Sample1140 Sample868 Sample564 Sample8999 Sample9919 Sample1298 Sample1128 Sample1164 Sample120 Sample1293 Sample1221 Sample1036 Sample1129 Sample1150 Sample1095 Sample9408 Sample8554 Sample9512 Sample177 Sample263 Sample746 Sample904 Sample1006 Sample424 Sample302 Sample230 Sample9278 Sample631 Sample9152 Sample710 Sample9453 Sample1023 Sample9918 Sample118 Sample8702 Sample8655 Sample215 Sample234 Sample8784 Sample717 Sample9054 Sample321 Sample8429 Sample1328 Sample1146 Sample9047 Sample290 Sample9421 Sample9494 Sample8475 Sample9467 Sample609 Sample812 Sample8504 Sample9500 Sample712 Sample244 Sample330 Sample8757 Sample961 Sample10138 Sample9243 Sample1388 Sample1252 Sample9150 Sample1180 Sample1356 Sample1593 Sample9267 Sample620 Sample8462 Sample9737 Sample838 Sample1316 Sample1253 Sample1063 Sample9971 Sample767 Sample10123 Sample8703 Sample134 Sample1720 Sample427 Sample8801 Sample882 Sample9903 Sample9479 Sample1274 Sample1377 Sample9961 Sample9111 Sample983 Sample127 Sample9932 Sample8671 Sample1295 Sample9441 Sample803 Sample943 Sample8415 Sample8744 Sample8668 Sample919 Sample8485 Sample10072 Sample9509 Sample8544 Sample482 Sample9085 Sample8526 Sample8835 Sample8513 Sample343 Sample9752 Sample416 Sample9305 Sample10153 Sample9098 Sample8775 Sample1822 Sample1801 Sample9951 Sample8681 Sample8649 Sample9125 Sample8658 Sample1848 Sample1781 Sample387 Sample512 Sample9072 Sample9502 Sample616 Sample9763 Sample9254 Sample522 Sample477 Sample495 Sample687 Sample9778 Sample9901 Sample8495 Sample9497 Sample827 Sample853 Sample759 Sample491 Sample906 Sample10107 Sample371 Sample733 Sample9724 Sample8818 Sample8716 Sample8660 Sample509 Sample409 Sample8445 Sample261 Sample8727 Sample8664

Bact._fragilisNostoc_sp.F._prausnitziiFus._nucleatumR._obeumV._parvulaC._cellulovoransParab._distasonisAc._fermentansR._bromiiR._champ.H._parainfluenzaeCopr._sp._ART55Y._pseudotuber.Bif._adolescentisCopr._catusAk._muciniphilaC._saccharo.Bif._longumOdor._splanch.Al._finegoldiiAl._shahiiAc._intestiniEub._eligensR._sp._SR1R._torquesC._phytofer.Corio._glomeransBact._helcogenesC._saccharo._K10C._difficileEub._rectaleC._thermocellumButyr.−p._bacter.Eub._rectale_M104Eub._siraeumY._enterocoliticaOsc._valericig.Ros._intestinalisBact._thetaiota.Bact._vulgatus

Value

Color Keyand Histogram C oun t Figure 1. Heat map of the relative abundances of the 41 genomes in the synthetic mockcommunity distributed across the 64 HMP samples.

The species and samples have beenpositioned according to similarity. The relative abundances have been square root transformed toemphasise rare species and the inset scale should be interpreted accordingly.3 −30−20−10010 −40 −20 0 20

PCA1 P C A Species lllllllllllllllllllll

Ac._fermantansAc._intestiniAk._mucinphiliaAl._finegoldiiAl._shahiiBact._fragilisBact._helcogenesBact._thetaiota.Bact._vulgatusBif._adolescentisBif._longumButyr.−p._bacter.C._cellulovoransC._difficileC._phytofermentansC._saccharolyticumC._saccharolyticum_K10C._thermocellumCopr._cattusCopr._sp._ART55Corio._glomeransEub._eligensEub._rectale_33656Eub._rectale_M104Eub._siraeumF._prausnitziiFus._nucleatumH._parainfluenzaeNostoc_sp._PCC_7107Odor._splanchnicusOsc._valericigenesParab._distasonisR._bromiiR._champR._obeumR._sp._SR1/5R._torquesRos._intestinalisV._parvulaY._enterocoliticaY._pseudotuber.

Figure 2. PCA plot of the mock metagenome species.

The 10,458 synthetic community contigsof length > lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll Number of components K B I C Figure 3. The Bayesian information criterion (BIC) as a function of cluster number K forthe synthetic mock community. Results are the minimum from ﬁve runs with independent randomstarting conditions. The optimal K with the smallest BIC is 56.5 −30−20−10010 −40 −20 0 20 PCA1 P C A Figure 4. PCA plot of the mock metagenome clusters.

The 10,458 synthetic community contigsof length > D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D Y._pseudotuber.Y._enterocoliticaV._parvulaRos._intestinalisR._torquesR._sp._SR1/5R._obeumR._champ.R._bromiiParab._distasonisOsc._valericig.Odor._splanch.Nostoc_sp.H._parainfluenzaeFus._nucleatumF._prausnitziiEub._siraeumEub._rectale_M104Eub._rectaleEub._eligensCorio._glomeransCopr._sp._ART55Copr._catusC._thermocellumC._saccharo._K10C._saccharo.C._phytofer.C._difficileC._cellulovoransButyr.−p._bacter.Bif._longumBif._adolescentisBact._vulgatusBact._thetaiota.Bact._helcogenesBact._fragilisAl._shahiiAl._finegoldiiAk._mucinphiliaAc._intestiniAc._fermantans

Value

Color Keyand Histogram C oun t Figure 5. Confusion matrix for the mock metagenome.

A heatmap visualisation of theconfusion matrix comparing the CONCOCT contig clusterings (without linkage) for the optimal 56cluster solution with the species assignments for the synthetic mock contigs. Each column is a clusternamed D k where k is the cluster index. The rows correspond to the species and the intensities reﬂectthe proportion of each cluster deriving from each species.7 l llllllll llllllll llllllll llllllll llll ll l A Sample no. R e c a ll l llllllll llllllll llllllll llllllll llll ll l B Sample no. P r e c i s i on l llllllll llllllll llllllll llllllll llll ll l C Sample no. A d j . R and i nde x l llllllll llllllll llllllll llllllll llll ll l D Sample no. N o . o f c l u s t e r s K Figure 6. Impact of sample number on accuracy of synthetic mock community clustering.

CONCOCT was run with variable sample numbers the optimal K determined (D) and recall (A),precision (B) and adjusted Rand index (C) calculated.8

D9D1D14D31D0D27D15D41D2D35D7D55D13D49D52D6D28D45D38D24D50D54D37D4D43D51D33D21D20D47D10D16D19D26D25D32D42D53D11D30D34D40D5D17D29D36D48D8D18D3D44D22D23D12D46D39 C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG variable C l u s t e r COG counts

Figure 7. Single-copy core gene frequencies in the mock metagenome clusters.

A heatmapvisualisation of the number of single-copy core genes in each cluster for the optimal 56 cluster solutiongenerated by CONCOCT without linkage applied to the synthetic mock community.9

E6E11E27E37E23E0E31E10E4E45E24E41E34E20E50E48E33E39E47E17E29E16E46E43E7E12E2E15E22E21E28E38E49E26E3E30E36E8E13E25E32E44E5E1E14E40E18E19E9E42E35 C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG variable C l u s t e r COG counts

Figure 8. Single-copy core gene frequencies in the mock metagenome clusters.

A heatmapvisualisation of the number of single-copy core genes in each cluster for the optimal 51 cluster solutiongenerated by CONCOCT with linkage applied to the synthetic mock community.0 l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l l

Number of components K B I C Figure 9. BIC as a function of K for the Escherichia coli (STEC) O104:H4 data set.

TheBayesian information criterion (BIC) as a function of cluster number K for the Escherichia coli (STEC)O104:H4 data set. Results are the minimum from ﬁve runs with independent random startingconditions. The optimal K with the smallest BIC is 31.1 D D D D D D D D D D D D D D D D D D D D butyr.−p._bacter._SSC/2Rum._torquesRum._sp._SR1Rum._obeumRos._intestinalisPrev._denticolaParabact._distasonisF._prausnitziiEub._rectaleEnt._faeciumE._coliCopr._catusBact._xylanisolvensBact._vulgatusBact._thetaiota.Bact._fragilisAl._shahiiAl._finegoldiiAc._intestini Value

Color Keyand Histogram C oun t Figure 10. Confusion matrix for the

Escherichia coli (STEC) O104:H4 metagenome.

Aheatmap visualisation of the confusion matrix comparing the CONCOCT contig clusterings (withoutlinkage) for the optimal 31 cluster solution of the shiga-toxigenic

Escherichia coli (STEC) O104:H4outbreak. Each column is a cluster named D k where k is the cluster index. The rows correspond to thespecies assignments, the intensities reﬂect the proportion of each cluster deriving from each species.Only clusters and taxa with greater than ﬁve labelled representatives are shown.2 D0D15D17D30D6D8D7D12D13D27D21D25D16D26D10D20D3D4D2D28D11D23D24D19D14D18D22D29D9D5D1 C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG variable C l u s t e r COG counts

Figure 11. Single-copy core gene frequencies in the

Escherichia coli (STEC) O104:H4metagenome clusters.

A heatmap visualisation of the number of single-copy core genes in eachcluster for the optimal 31 cluster solution without linkage of CONCOCT applied to the shiga-toxigenic

Escherichia coli (STEC) O104:H4 outbreak.3

Days since onset of diarrhea C l u s t e r value −15−10−50510 −20 −10 0 10 PCA1 P C A Clusters

Figure 12. Visualisation of key cluster abundances and location in PCA map from the

Escherichia coli (STEC) O104:H4 metagenome.

Top panel: heat map giving the relativeabundance of the ten clusters that correlated signiﬁcantly with days since onset of diarrhea, ddays , ineach of the 43 STEC positive samples. The STEC samples have been ordered by ddays . The clustersare ordered bottom to top from negative to positive correlation coeﬃcients (Table 4). Bottom panel:the same ten clusters plotted in the ﬁrst two PCA components of the transformed combined logcoverage composition proﬁles.4

Genome position (megabases) T o t a l c o v e r age dep t h Cluster

Figure 13. Mapping of contigs to known outbreak genome.

The mapping of contigs to theknown

Escherichia coli (STEC) O104:H4 outbreak genome with cluster discriminated by colour andtotal coverage across all samples shown on the y-axis.

E0E13E15E27E6E7E11E25E19E14E24E18E3E9E4E2E23E10E21E22E17E12E20E26E16E8E5E1 C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG C OG variable C l u s t e r COG counts

Figure 14. Single-copy core gene frequencies in the

Escherichia coli (STEC) O104:H4metagenome clusters following application of linkage.

A heatmap visualisation of the number ofsingle-copy core genes in each cluster for the optimal 28 cluster solution with linkage of CONCOCTapplied to the shiga-toxigenic

Escherichia coli (STEC) O104:H4 outbreak.5

Tables > N length %age ref.Mock 41 1.504 × Table 1.

Co-assembly statistics for the synthetic mock and the

Escherichia coli O104:H4 outbreak.Ray version 2.1.0 [16] was used to generate the coassemblies. The E. coli assembly was a randomsubsample of 10% of the reads in the study [13].Sample

N N (cid:48)

S K

Rec. Prec. Adj. RandMock 10,459 10,459 41 56 0.84 0.97 0.83+ linkage 10,459 10,459 41 51 0.91 0.97 0.94E. coli 22,585 8,058 51 31 0.79 0.94 0.73+ linkage 22,585 8,058 51 28 0.87 0.94 0.82

Table 2.

Cluster validation statistics for the synthetic mock community and the

Escherichia coliO104:H4 outbreak. The number of contigs with length > bp clustered is N , the number with classlabels N (cid:48) , the number of distinct classes i.e. species S , the optimal cluster number K , Rec. is the recall,Prec. the precision and Adj. Rand the adjusted Rand index.7 COG Name Presence (%) Mean frequencyCOG0016 Phenylalanyl-tRNA synthetase alpha subunit 99.6 1.02COG0060 Isoleucyl-tRNA synthetase 99.6 1.01COG0184 Ribosomal protein S15P/S13E 99.6 1COG0049 Ribosomal protein S7 99.4 1COG0088 Ribosomal protein L4 99.4 1COG0092 Ribosomal protein S3 99.4 1.01COG0094 Ribosomal protein L5 RPL11 99.4 1COG0197 Ribosomal protein L16/L10E 99.4 1COG0201 Preprotein translocase subunit SecY 99.4 1.01COG0532 Translation initiation factor 2 99.4 1.01COG0048 Ribosomal protein S12 99.2 1COG0052 Ribosomal protein S2 99.2 1COG0080 Ribosomal protein L11 99.2 1COG0081 Ribosomal protein L1 99.2 1COG0087 Ribosomal protein L3 99.2 1COG0090 Ribosomal protein L2 99.2 1COG0093 Ribosomal protein L14 99.2 1COG0096 Ribosomal protein S8 99.2 1COG0097 Ribosomal protein L6P/L9E 99.2 1COG0103 Ribosomal protein S9 99.2 1COG0256 Ribosomal protein L18 99.2 1COG0051 Ribosomal protein S10 99 1.02COG0072 Phenylalanyl-tRNA synthetase beta subunit 99 1COG0089 Ribosomal protein L23 99 1COG0091 Ribosomal protein L22 99 1COG0100 Ribosomal protein S11 99 1COG0102 Ribosomal protein L13 99 1COG0185 Ribosomal protein S19 99 1COG0200 Ribosomal protein L15 99 0.99COG0244 Ribosomal protein L10 99 0.99COG0186 Ribosomal protein S17 98.9 1COG0198 Ribosomal protein L24 98.5 0.99COG0541 Signal recognition particle GTPase 98.5 1COG0552 Signal recognition particle GTPase 98.5 1COG0504 CTP synthase (UTP-ammonia lyase) 97.9 1.02COG0130 Pseudouridine synthase 97.5 0.99

Table 3.

The list of single-copy core genes (SCGs) used for evaluating cluster completeness and purity,the percentage of genomes in which they are present, and their mean frequencies within genomes.Calculations were based on 525 microbial genomes, each representing a unique genus.Cluster - D k p a p r D7 0.001394846 4.499503e-05 -0.580448290D18 0.014409298 9.296321e-04 -0.486898053D19 0.022560790 2.585918e-03 0.448066807D2 0.022560790 2.911070e-03 0.443265420D8 0.090729544 2.026110e-02 -0.352944330D1 0.090729544 2.239690e-02 -0.347547700D16 0.090729544 2.655809e-02 0.338165455D3 0.090729544 2.733903e-02 0.336543155D29 0.090729544 2.855075e-02 0.334100554D12 0.090729544 2.926759e-02 0.332695738

Table 4.

Correlation between days since onset of diarrhea ddays i for each sample i and the ﬁrst M components of the back-transformed cluster coverage proﬁles µ (cid:48) k . The p and r values give thesigniﬁcance and correlation respectively, between the mean log coverage proﬁle at a sample µ (cid:48) k,i for acluster denoted D k and ddays i using Pearson’s test. The p a denote Benjamini-Hochberg corrected falsediscovery rates. Only clusters with p a < . k D0 D1 D7 D8 D14 D18No. contigs | U k |

843 958 437 862 732 953Outbreak speciﬁc 1 1 205 0 10 18Core 0 0 203 3 65 935Plamid pAA 0 0 22 0 0 0Plasmid pESBL 0 0 16 0 0 0

Table 5.

Comparison of clusters to the

E coli core and outbreak

E. coli

O104:H4 STEC 280 genomes.Each of the 22,585 contigs of greater than 1000bp were mapped to the two genomes and outbreakplasmids using MUMMER. Only six clusters had any contigs mapping and for these the numbersclassiﬁed to either core

E. coli genes present in both genomes or outbreak speciﬁc genes found only in

E. coli

O104:H4 are shown. In addition, to the number mapping to two outbreak associated clusters.The total number of contigs in each cluster ||

O104:H4 are shown. In addition, to the number mapping to two outbreak associated clusters.The total number of contigs in each cluster || U k ||