Nature Biotechnology | 2019
Detecting contamination in viromes using ViromeQC
Abstract
To the Editor: Eukaryotic viruses and bacteriophages have important roles in microbiomes, but characterization of viruses in metagenomic data is difficult. Viral-like particle (VLP) purification enables enrichment for viruses from microbiome samples before sequencing, but contamination can result in misleading conclusions. We present a software tool named ViromeQC for analyzing virome data. Here, we demonstrate the utility of ViromeQC by applying it to 2,050 human, animal and environmental samples from 35 metagenomic virome sequencing studies that used one of the available VLP enrichment techniques. The resulting analysis reveals these viromes to be rife with bacterial, archaeal and fungal contamination. Most samples show only modest virus enrichment, and such enrichment is highly variable between viromes in the same study. To address these issues, we present a validated contamination quality-control pipeline to enable more robust virome metagenomic analyses.

Viruses affect the ecology and composition of microbial communities1,2. Bacteriophages (viruses of bacteria and archaea) are extremely abundant and diverse, and they affect microbiomes in several ways, including transduction, which is an important mechanism of lateral gene transfer3. Metagenomics can be used to characterize phage populations, but phages are so diverse, and evolve so rapidly, that they are poorly represented in sequence databases. Also, there are no universal viral genetic markers, and the overall biomass of viruses, compared with that of other microorganisms in a sample, is low. For these reasons, phage sequences are difficult to identify in metagenomes, although specific methods that are partly based on sequence characteristics of known phages have been reported4,5. VLP purification can be used to enrich microbiome samples for viral nucleic acids6, thereby improving virus detection.
VLP protocols have various goals, ranging from untargeted analyses of highly purified phage populations to targeted identification of rare sequences of viral pathogens in diagnostic samples. These methods typically include filtration through small-pore-size filters that retain bacteria, cesium chloride gradient purification, treatment with chloroform to disrupt membranes, and exposure to nucleases to reduce the concentration of free DNA and RNA. If the aim is to use metagenomics to detect known viral pathogens, a low-purity sample may suffice because identification will be by alignment of sequence reads to viral databases. However, if the aim is to detect unknown viruses or report all viruses in a sample, a high-purity sample is required. When coupled with untargeted shotgun sequencing7, VLP enrichment has underpinned many studies in human8,9, environmental10,11 and built-environment settings12, but there is no single VLP enrichment protocol that is optimal for all sample types.

Regardless of the VLP protocol, non-viral genetic material remains after enrichment13. These unwanted nucleic acids are contaminants, and their presence particularly confounds the de novo discovery of phages in untargeted virome sequencing. If the VLP virome is pure, it is possible to assemble reads into possibly fragmented viral genomes without using computational prediction approaches, which are unavoidably affected by low-confidence calls and false negatives4,5. The fraction of next-generation sequencing reads belonging to viruses in the VLP sample correlates with the performance of de novo recovery of new viruses, but methods for evaluating VLP purity in samples have not been systematically explored. Studies have assessed contamination of VLP preparations by PCR amplification of prokaryotic 16S rRNA gene sequences before virome sequencing11,14–19. Others have mapped next-generation virome sequencing output against the 16S rRNA gene, or a different marker9,20–24.
However, these studies have not provided a validated pipeline to quantify viral enrichment in viromes or unenriched samples. Although efforts toward VLP-protocol optimization have been reported24, the largest meta-analysis of post-sequencing non-viral quantification to date considered just 67 viromes13. As the use of VLP enrichment for virome sequencing is increasing, we set out to evaluate non-viral contamination in >2,000 virome samples.

To assess the enrichment rates of publicly available viromes, we applied our method (Supplementary Methods) to a collection of 2,050 VLP samples (Supplementary Table 1). As controls, we included 2,189 metagenomes that were not enriched for viruses from the curatedMetagenomicData25 and the National Center for Biotechnology Information Sequence Read Archive (NCBI-SRA)26 repositories, as well as 108 publicly accessible synthetic metagenomes27,28 and one mock community (Supplementary Table 2). After uniform preprocessing to remove low-quality reads (Supplementary Methods), we computed the percentage of raw reads in each sample that align to the small subunit ribosomal RNA gene (SSU rRNA), which has never been found in a viral genome. This provided a proxy for non-viral microbial sequence abundance13. We estimated the abundance of bacterial and archaeal 16S and microeukaryotic 18S ribosomal genes in all of the viromes and metagenomes. Unenriched metagenomes provided a baseline estimation of the environment-specific rRNA gene abundance, from which we calculated the relative enrichment of viromes with respect to the metagenomes. Environmental and human/animal unenriched metagenomes had a median rRNA gene abundance of 0.08% (n = 320, interquartile range = 0.07%) and 0.25% (n = 1,551, interquartile range = 0.1%), respectively (Fig. 1).
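The calculation described above can be illustrated with a minimal Python sketch. It is not the ViromeQC implementation: the exact scoring is given in the Supplementary Methods, which are not reproduced here. The sketch assumes that relative enrichment can be expressed as the ratio of the baseline median rRNA gene abundance in unenriched metagenomes to the rRNA gene abundance in a virome. The function names `rrna_percent` and `fold_enrichment` and all numbers are hypothetical and chosen only for illustration.

```python
from statistics import median
from typing import Sequence


def rrna_percent(rrna_aligned_reads: int, total_reads: int) -> float:
    """Percentage of reads in a sample that align to SSU rRNA genes."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * rrna_aligned_reads / total_reads


def fold_enrichment(sample_pct: float, baseline_pcts: Sequence[float]) -> float:
    """Assumed enrichment measure: baseline median / sample abundance.

    Values above 1 mean the sample carries less rRNA signal than a typical
    unenriched metagenome from the same environment type.
    """
    baseline = median(baseline_pcts)
    if sample_pct == 0:
        # No rRNA reads detected; the ratio is unbounded.
        return float("inf")
    return baseline / sample_pct


# Illustrative values only, not data from the study.
baseline_metagenomes = [0.20, 0.25, 0.30]  # rRNA % in unenriched samples
virome_pct = rrna_percent(rrna_aligned_reads=50, total_reads=1_000_000)
score = fold_enrichment(virome_pct, baseline_metagenomes)
print(f"rRNA abundance: {virome_pct:.4f}%  enrichment: {score:.1f}-fold")
```

In this toy example the virome has roughly one fiftieth of the baseline rRNA gene abundance, so it would be scored as about 50-fold enriched under the assumed ratio. Using the median of the unenriched metagenomes as the reference keeps the baseline robust to a few outlying samples.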
Prokaryotic and micro-eukaryotic contamination of viromes estimated by the quantification of the SSU rRNA revealed a wide range of enrichment efficiencies, with a large fraction of samples (n = 567, 28.7%) having no virus enrichment at all and >50% (n = 990) having less than threefold enrichment. A substantially smaller fraction of samples (n = 339, 17.15%) showed high enrichment (>100-fold). Differences in enrichment rates were not clearly associated with any one VLP purification method, although the heterogeneity of protocols makes it difficult to provide statistical support for this observation. According to taxonomic annotations of the rRNA gene sequences retrieved in viromes, the largest source of contamination was bacterial DNA (1,466 samples), with 88 samples having higher abundances of eukaryotic-associated SSU rRNAs (Supplementary Table 3). The rRNA gene abundance variability was higher in viromes than in metagenomes (Mann–Whitney U test P = 7.5 × 10⁻⁸, Supplementary Fig. 1), revealing not only that many viromes are poorly enriched for