Bacteriophage classification for assembled contigs using Graph Convolutional Network
BB ACTERIOPHAGE CLASSIFICATION FOR ASSEMBLED CONTIGSUSING G RAPH C ONVOLUTIONAL N ETWORK
Jiayu SHANG
Dept. of Electrical EngineeringCity University of Hong KongKowloon, Hong Kong SAR, China [email protected]
Jingzhe Jiang
South China Sea Fisheries Research InstituteChinese Academy of Fishery SciencesGuangzhou, Guangdong Province, China [email protected]
Yanni Sun
Dept. of Electrical EngineeringCity University of Hong KongKowloon, Hong Kong SAR, China [email protected]
February 9, 2021 A BSTRACT
Motivation:
Bacteriophages (aka phages), which mainly infect bacteria, play key roles in the biologyof microbes. As the most abundant biological entities on the planet, the number of discovered phagesis only the tip of the iceberg. Recently, many new phages have been revealed using high throughputsequencing, particularly metagenomic sequencing. Compared to the fast accumulation of phage-likesequences, there is a serious lag in taxonomic classification of phages. High diversity, abundance, andlimited known phages pose great challenges for taxonomic analysis. In particular, alignment-basedtools have difficulty in classifying fast accumulating contigs assembled from metagenomic data.
Results:
In this work, we present a novel semi-supervised learning model, named PhaGCN, toconduct taxonomic classification for phage contigs. In this learning model, we construct a knowledgegraph by combining the DNA sequence features learned by convolutional neural network (CNN) andprotein sequence similarity gained from gene-sharing network. Then we apply graph convolutionalnetwork (GCN) to utilize both the labeled and unlabeled samples in training to enhance the learningability. We tested PhaGCN on both simulated and real sequencing data. The results clearly show thatour method competes favorably against available phage classification tools.
Availability:
The source code of PhaGCN is available via: https://github.com/KennthShang/PhaGCN
Contact: [email protected]
Bacteriophages (or phages), which mainly infect bacteria, are among the most common and diverse biological entitiesin the biosphere [1]. They regulate the actions of the ecosystem through killing, metabolic reprogramming, or genetransfer [2, 3]. As a major agent of horizontal gene transfer between bacteria, phages can change the virulence ofbacteria and indirectly cause human diseases. There are active studies that use phages for applications such as phagetherapy [4], disease diagnostics [5, 6], and antimicrobial drug discovery [7].Despite important functions of phages, our understanding of them is still very limited. Metagenomic sequencing, whichallows us to obtain total genomic DNA directly from host-associated and environmental samples, has contributedsignificantly to new phage discovery [8, 9, 10, 11]. In particular, metagenomic sequencing allows sequencing ofuncultured dark matter of the microbial biosphere, which can contain a large amount of phages [12]. The advancements a r X i v : . [ q - b i o . GN ] F e b PREPRINT - F
EBRUARY
9, 2021of high-throughput sequencing, assembly, and contig scaffolding have led to phage-like contigs or genomes fromdifferent types of samples. According to the RefSeq database supported by the National Center for Biotechnologyinformation (NCBI) , the number of identified phages changed from 1,468 in 2015 to 3,852 in 2020 in the RefSeqdatabase, which is more than twice of increase. Despite the increase, known phages is just the tip of the iceberg of thevirome on the planet [13]. How to automatically and accurately mine phages and assign their taxonomic groups fromvast amount of sequencing data remains a challenging problem.There are two specific challenges for phage classification. First, the phages with known taxa are very limited. TheInternational Committee on Taxonomy of Viruses (ICTV) is responsible for the official virus taxonomy and organizesviruses in order, family, subfamily, genera and species. Current ICTV classification procedures cannot catch up withnew phage discovery. For example, one of the phage order named
Caudovirales has 3,691 reference genomes. However,there are more than 1,800 new
Caudovirales sequences found in 2020 that are unclassified into families. Limitedlabeled genomes pose challenges for both alignment-based and learning-based classification. Second, many phages indifferent taxa can share protein homologs, which adds ambiguity for alignment-based taxonomic classification. Forexample, more than 7,616 (~10%) proteins in all annotated phage proteins are shared by phages in different familiesunder
Caudovirales . In addition, more than 18,970 (~27%) pairs of highly similar proteins (e-value of BLASTP result < − ) are encoded by phages in different families. Therefore, using homology search alone can return ambiguousclassification.In this work, we present a method that automates taxonomic classification for contigs, which are the outputs of assembly.Although taxonomic classification can be conducted on both reads and contigs [14], recombination in viruses can makeread-level taxonomic classification difficult. In addition, more distinctive features can be derived from contigs andthus can lead to improved classification accuracy. Current metagenomic assembly tools, such as MEGAHIT [15],have been extensively tested and can produce quality contigs from complex datasets. Thus, our tool accepts contigs asinput. In order to address the aforementioned challenges, we developed a semi-supervised learning framework thatincorporated the automatically learned features for each contig via a CNN, the protein sequence similarity, and thegene-sharing features between contigs/genomes. Both the unlabeled and labeled sequences were utilized for trainingin a graph convolutional neural network (GCN). We will demonstrate that the features from the unlabeled sequences(contigs) improve the learning ability and accuracy for phage classification. Below we summarize related work forphage classification. Many attempts have been made for phage taxonomic classification. They can be roughly divided into two groups:alignment-based [16, 17, 18] and learning-based [19, 20, 21, 22]. Alignment-based methods utilize either nucleotide-level or protein-level homology search between query contigs and reference genomes for assigning the taxon for thequery. ClassiPhage [18] and Phage Orthologous Groups (POGs) [16] are two representative alignment-based phageclassification tools. POGs extract taxon-specific marker genes and align query sequences against the marker genes usingBLASTP. If there are statistical significant alignment for the contigs, the label of the best-aligned marker gene will beassigned to the contigs. ClassiPhage builds a profile Hidden Markov Model (pHMM) for each phage taxonomic groupand apply HMM-based alignment for classification. There are two limitations with alignment-based method. First, asgenes or proteins can be shared by different taxa, alignment-based method may lead to ambiguous label assignment orreturn a label with a higher rank using the lowest common ancestor (LCA) in the phylogenetic tree. Second, as phagesare highly abundant and diverse, alignment-based methods are not able to assign taxa for new species that harbor novelproteins or lack quality alignments with the references. For example, under the
Caudovirales order, 187,006 proteinsare named as hypothetical proteins without known family labels. 13,382 proteins from phages released in 2020 do nothave BLASTP results with the phages released before 2020. Thus, using only sequence similarity cannot provide idealresolution.There are a number of learning-based tools for microbe classification such as the Naïve Bayes classifier [23] and CNN[24]. These methods aim to automatically learn features from the reference sequences and predict taxonomic labels ofbacteria or RNA viruses. The most relevant learning-based tool to phage classification is vConTACT 2.0 [22], whichapplies a graph clustering algorithm to assign labels for unknown contigs. In order to leverage gene organizationconservation for phage classification, vConTACT utilizes a clustering algorithm to construct a gene-sharing network[21, 22]. If the reference genomes and contigs are in the same cluster, the labels of the reference genomes will beassigned to those contigs. While these gene-sharing network methods present satisfactory performance on classificationof complete genomes, the classification accuracy decreases as the length of the contigs becomes shorter. The decreasedperformance stems from the fact that short contigs do not contain many proteins and thus do not lead to valid edges inthe gene-sharing network. As a result, the clustering algorithms fail to group contigs and reference genomes in thesame cluster. Then, no labels will be assigned to these contigs.2
PREPRINT - F
EBRUARY
9, 2021Given the enormous diversity of phages and the sheer amount of unlabeled phages, we formulate the phage classificationproblem as a semi-supervised learning problem. We choose the GCN as our learning model and combine the strengthof both the alignment-based and the learning-based methods. First, we utilize the pair-wise BLASTP result betweencontigs and references to improve the weighted edges in the gene-sharing network. Second, to handle the situation thatshort contigs lack gene organization-related features, a CNN-based model is adopted to encode nucleotide informationfrom the sequence. The GCN model allows us to utilize features from both labeled and unlabeled samples in trainingand thus lead to more accurate and sensitive phage classification. We compared our tool (named PhaGCN) withthree state-of-the-art models specifically designed for phage classification: Phage Orthologous Groups (POG) [16],vConTACT 2.0 [22], and ClassiPhage [18]. The experimental results demonstrated that PhaGCN outperforms otherpopular methods.
Figure 1: The pipeline of PhaGCN. A1: cut the contigs into 2kbp segments. A2: feature learning from the inputsusing CNN. A3: construct nodes using encoded vectors. B1: contig translation using 6 reading frames. B2: filtershort translations (12 amino acids). B3: align contigs against reference database using BLASTP. B4: choose the besttranslated frame to gain BLASTP result. B5: use the BLASTP result to construct protein clusters. B6 and B7: defineedges based on the sum of the E weight and P weight . C1: construct the knowledge graph for GCN.Semi-supervised learning is a machine learning approach that combines a small amount of labeled data with a largeamount of unlabeled data during training. The main purpose of using the unlabeled data is to utilize their conjunctioninformation with the labeled data to improve the classification accuracy. Because the number of reference (labeled)phage genomes is small and new (unlabeled) phage contigs are increasing quickly, we formulate the phage classificationproblem as a semi-supervised learning problem.One of the semi-supervised learning approaches, named GCN, is based on deep learning. The basic idea of GCN is toapply a convolutional layer on a graph to utilize the features on non-Euclidean structure [25]. The purpose of the graphconvolutional layer is to automatically learn the edge weights. Then, unlabeled samples/nodes can be represented as theweighted sum of their neighbor samples/nodes. In biological data analysis, there exist many non-Euclidean structuressuch as protein topology graph on the supersecondary structure, gene-sharing network, and diseases-gene relationshipgraph. GCN is expected to render high classification performance by employing the structural data. For example, GCNhas shown promising results in finding the relationship between long non-coding RNAs and diseases [26, 27]. In phageclassification, different phage genomes and contigs can share genes or proteins, which can be encoded in the graph ofGCN. In addition, the nodes in GCN can embed automatically learned feature from nucleotide sequences. During thetraining, convolution is conducted for each node and its neighbors defined by the graph. The learned edge weights willthen be applied for classifying samples without labels.The input to our GCN model is a knowledge graph. There are two key components in the knowledge graph: nodeencoding and edge construction. The node is a numerical vector learned from contigs using a CNN. The edge encodes3 PREPRINT - F
EBRUARY
9, 2021features from both the sequence similarity and the organization of genes. Fig. 1 contains the major components fornode and edge construction. To encode a sequence using a node, a pre-trained convolutional neural network (CNN) isadopted to capture features from the input DNA sequence (A1-A3). The CNN model is trained to convert proximatesubstrings into vectors of high similarity. The edge construction consists of several steps. We employ a greedy searchalgorithm to find the best BLASTP results between the 6-frame translations of the contigs and the database (B1-B4).Then the Markov clustering algorithm (MCL) is applied to generate protein clusters from the BLASTP result (B5)[22]. Based on the results of BLASTP (sequence similarity) and MCL (shared proteins), we define the edges betweensequences (contigs and reference genomes) using two metrics: P weight and E weight (B6-B7). By combining the node’sfeatures and edges (C1), we construct the knowledge graph and feed it to the GCN to classify new phage contigs. CNN can automatically learn motif-related features for sequence classification [28, 29]. Although CNN can be directlyapplied to phage classification, our experiments will show that using CNN alone cannot render the best classificationperformance. Thus, we only train the CNN for encoding input contigs.As shown in Fig. 2, there are two slightly different network structures in the CNN for “train mode” and “encodingmode”, respectively. In the train mode (Fig. 2(A)), we use the reference database to train the CNN model. In theencoding mode (Fig. 2(B)), the output of the first dense layer in the pre-trained CNN will be used to encode sequencesinto numerical vectors.
Train mode:
Because the CNN model can only handle fixed length input, all the inputs will be cut into 2kbp segments with user-specified stride value (default 50). The segment has the same label as the underlying genome according to the ICTVtaxonomic classification.The CNN model contains three different parts: embedding layer, convolutional layer, and dense layer. The embeddinglayer is used to convert the DNA sequence into numerical inputs for convolution. There are two major methods for theembedding layer: one-hot embedding and skip-gram embedding [30]. As shown in our previous work of using CNNfor classifying RNA viruses [24], the skip-gram based embedding can improve CNN’s learning ability. Thus, in thiswork, we implemented a skip-gram embedding layer that can map proximate k-mers into highly similar vectors. Wetrained the embedding layer using k-mers and their neighboring (proximate) k-mers so that the embedding layer canlearn their adjacent relationship. Specifically, in order to train the embedding layer, we use a 3-mer at position i asinput and 3-mers located at i + j as output, where − m ≤ j ≤ m . m is the hyperparameter that can be specified forthe skip-gram model. We employ 100 hidden units in the embedding layer to encode the 3-mers and the output of theembedded vector has 100 dimensions. Thus, each 2kbp segment is converted into embedded matrix M ∈ R , × .Figure 2: The structures of the CNN model in PhaGCN for training (A) and encoding (B). R1,R2,R3 are the referencegenomes used for training. C1,C2,C3 are the contigs that need to be encoded. In train mode, sequences will be fed toCNN to update parameters during back propagation. In encoding mode, The pre-trained CNN will be used to encodethe sequences into numerical vectors. Then these vectors will be adopted as node features in the knowledge graph.4 PREPRINT - F
EBRUARY
9, 2021 Z i ( M, w conv ) =
ReLU ( (cid:80) n conv j =1 w jconv ∗ M [ i : i + d − d ] + b ) (1) H (0) = M axpool ( Z ( M, w conv )) (2) H ( l +1) = ReLU ( H ( l ) , w ( l ) ) (3) output of train mode = Sof tM ax ( H (2) , w (2) ) (4)Then, the embedded matrix M will be fed into the convolutional layer. Eq. 1 is the convolution function. b is the biasterm; d and d are the filter sizes. Since the embedded vector has 100 dimensions, d = 100 . M [ i : i + d − d ] defines a 2D window size of d × d of the embedded matrix M . ReLU is the activation function. The convolutionalfilters w conv contain n conv
2D matrices and w jconv is the j -th filter. We applied filters repeatedly to each possiblewindow of the input embedded matrix to produce a feature map. Then the dense layer is applied to compress thefeatures captured by the convolutional layer as shown in Eq. 2 and Eq. 3. First, max-pooling (Eq. 2) is applied to thefeature map to capture the most useful information from the convolutional layer. Second, we use two dense layerswith ReLU activation function (Eq. 3) to learn and compress the feature map. H ( l ) is the feature map in hidden layer l and w ( l ) is the weight parameters in the l -th hidden layer. Since we only has two dense layers, l ∈ { , } . Finally,the SoftMax function (Eq. 4) is adopted to generate the prediction. As shown in Fig. 2, in the train mode, we employCrossEntropyLoss to calculate the error between prediction and real label and backpropagate the loss to update theparameters in the model. The detailed parameters are listed in our Github repository. Encoding mode:
After training the CNN model, we utilize the pre-trained parameters to convert contigs into numericalvectors. The main difference in the encoding mode is that we only use the output of the first dense layer as the learnedfeature rather than using the SoftMax function for prediction. Eq. 5 shows the equation to convert x (an input 2kbpsegment) into the output of the first dense layer. If a contig is cut into multiple segments of length 2kbp, we will conductvector addition for all the segments’ outputs and divide it by the number of segments. Thus, contigs of different lengthsare always converted into vectors of the same size (determined by the units of the dense layer, default 512). Out ( x ) = ReLU ( pool ( Z ( M, w conv ) , w (0) ) (5) The edges connect nodes that are likely in the same taxonomic group. We define the edge by incorporating both thenumber of shared protein clusters and also the average protein similarity between two sequences. Intuitively, if twosequences share a large number of common protein clusters with high similarity, they tend to belong to the same taxa.In order to quantify the significance of two sequences sharing c common proteins, we first define protein clusters. Apair of proteins from two sequences is called a shared protein if they are in the same protein cluster. We follow the idea in [21, 22] to construct protein clusters. We start by extracting proteins from all sequences. For thegenomes in the reference database, proteins are downloaded from NCBI RefSeq. For the input contigs, DNA sequencesare translated into amino acid sequences using 6 reading frames. We employ BLASTP to conduct all-against-allpairwise alignment between contigs’ 6-frame translations and proteins encoded by the genomes. If there are multiplealignments for different reading frames of a contig, only the best frame is kept. Then we create a weighted graph wherethe node is a protein sequence in the contig or genome and the edge represents an alignment with e-value less than athreshold. The edge weight is the e-value. Then protein clusters are subsequently identified using the Markov clusteringalgorithm (MCL). Finally, clusters that contain at least two proteins will be kept.
The edge is defined by computing two metrics: P weight and E weight . P weight is adopted to calculate the expectednumber of sequences sharing at least an observed number of common proteins (i.e. c proteins). Suppose we havetwo sequences A and B , whose proteins are in a and b protein clusters, respectively. Following vConTACT [21], weutilize the hypergeometric formula to estimate the probability of A and B sharing c common protein clusters. In thisformulation, the total number of protein clusters from all the sequences is n (i.e. the total number of draws in the5 PREPRINT - F
EBRUARY
9, 2021hypergeometric distribution). A success draw is a protein cluster that is shared by two sequences and a failure draw is aprotein cluster unique to one sequence. Thus, Eq. 6 estimates the probability that two genomes have at least c commonprotein clusters. y s the variable representing observed common proteins. Eq. 7 then computes the expected number ofpairs with at least c common proteins out of (cid:0) N (cid:1) sequence pairs. N is the number of sequences (contigs and referencegenomes). With increase of c , P in Eq. 6 becomes small enough to return a positive P weight . P ( y ≥ c ) = (cid:80) min ( a,b ) i = c (cid:0) ia (cid:1)(cid:0) b − in − a (cid:1)(cid:0) bn (cid:1) (6) P weight = − log ( P ( y ≥ c ) × (cid:18) N (cid:19) ) (7)While P weight is used to evaluate whether two sequences share a significant number of common proteins, E weight isadopted to calculate the sequence similarity using alignments’ e-values. We calculate the average e-value for each twosequences. N c is the number of proteins shared by genome A and genome B and e value ( i ) means the e-value of the i th protein pair. Then a score matrix S ∈ R N × N is calculated using Eq. 8. S ( A, B ) means the edge score betweengenome A and B. For each genome A , S ( A, A (cid:48) ) is ranked for all A’s adjacent nodes A (cid:48) . Users can decide how manyedges to keep by specifying a threshold. By default, we only keep the top 5 edges for each genome A . For all the keptedges, E weight = S ( A, B ) . S ( A, B ) = (cid:40) , if no alignment result − log ( (cid:80) N c i =0 e value ( i ) N c ) , otherwise (8) Edges in the knowledge graph:
The final edge in the knowledge graph is defined based on the sum of P weight and E weight , which is proportional tothe logarithm of the joint probability of two sequences having enough common proteins of high similarity. An edge isdefined when the sum is above a threshold τ (Eq. 9). Edge = (cid:26) , if P weight + E weight > τ , otherwise (9) After constructing the knowledge graph, we train a GCN to assign labels for all unlabeled contigs. H ( l +1) = ReLU ( ˜ D − ˜ G ˜ D H ( l ) W ( l ) ) (10) Out = Sof tM ax ( H (2) W (2) ) (11)The basic concept of graph convolutional layer is shown in Eq. 10. Suppose we have N sequences (nodes) in theknowledge graph. G is the R N × N adjacency matrix of the knowledge graph and I N is an R N × N identity matrix. ˜ G is calculated with ˜ G = G + I N . ˜ D is the R N × N diagonal matrix calculated with D ij = (cid:80) j ˜ G ij . H ( l ) is thefeature map in the l -th hidden layer; H (0) is the node feature matrix; and W ( l ) is a matrix of weight parameters. Afterthe graph convolutional layer, we apply a dense layer and use the SoftMax function to calculate the output matrix Out ∈ R N × n label (Eq. 11). Because we have two graph convolutional layers and one dense layer in our model, l ∈ { , } . The output dimension n label is decided by the number of classes in the database. As shown in Fig. 3, onlythe labeled samples will be used to calculate the loss in the training process. We adopt L2 loss to calculate the errorbetween prediction and the labeled samples and back propagate the loss to update the weight parameters. After trainingthe GCN model, we freeze the parameters and use the SoftMax value of the unlabeled samples to assign their labels inthe test mode. The detailed parameters are listed in our Github repository.6 PREPRINT - F
EBRUARY
9, 2021Figure 3: The structure of the GCN model in PhaGCN. Both feature from labeled (green, red, and blue color) andunlabled samples (grey color) will be used in training and prediction. In train mode, only labeled samples will beutilize to calculate the loss and update the parameters. In test mode, the SoftMax function will be applied to generateprediction for unlabeled samples.
We demonstrate the performance of PhaGCN on classifying contigs in families under
Caudovirales , which is an ordercontaining the majority of known phages from RefSeq (95.8% of total phage reference genomes). We downloaded the
Caudovirales reference genomes from the NCBI RefSeq database. As shown in Table 1, there are 3,639 genomes from8 different families. As the lower ranks contain few genomes in each group, we focus on family-level classification inthe experiment. Name Number of genomes
Ackermannviridae Autographiviridae
Demerecviridae Drexlerviridae
Herelleviridae
Myoviridae
Podoviridae
Siphoviridae
Caudovirales . PhaGCN was tested on both simulated and real sequencing data. For the simulated data, we applied two differentmethods to generate contigs with known labels: 1. randomly sample contigs from the reference genome; 2. simulatereads with ART-Illumina [31] and run MEGAHIT [15] to assemble contigs. After validating PhaGCN on simulated datawith known ground truth, we downloaded two real sequencing datasets from NCBI SRA and evaluated PhaGCN onassembled contigs. As phages are highly abundant in marine environment samples [12], we tested PhaGCN on virus-likecontigs from 71 metagenomic data sets that are sequenced from oyster. We recorded macro-accuracy, macro-recall,and macro-precision for each experiment (Eq.12, Eq. 13, and Eq. 14) when the ground truth can be derived. N class is the total number of classes. T P is the True positive,
T N is the True negative, and
F N is the false negative. Asmacro-average will compute each metric independently for each class and then take the average, these metrics treat allclasses equally. 7
PREPRINT - F
EBRUARY
9, 2021
Acc macro = (cid:80) N class i =0 Acc i N class (12) P recision macro = (cid:80) N class i =0 P recision i N class = (cid:80) N class i =0 T P i T P i + F P i N class (13) Recall macro = (cid:80) N class i =0 Recall i N class = (cid:80) N class i =0 T P i T P i + F N i N class (14)The main purpose of PhaGCN is to classify new phages that do not have reference genomes in the training data. Whenthe training and testing data share common genomes, high accuracy may be attributed to memorization rather thanlearning. Thus, in all the experiments conducted using PhaGCN, we use species-masking, meaning that the species inthe testing data will be removed from the training data so that they do not share any phages. The test contigs are thustreated as part of novel species.We compare our results with three representative and widely used pipelines: vConTACT 2.0 [22], Phage OrthologousGroups (POGs) [16], and ClassiPhage [18]. In addition, as CNN itself can conduct classification, we also comparedwith the CNN model we trained for PhaGCN. In this experiment, we randomly chose 20 genomes from each family as the testing species. The remaining genomeswere used as the training set. Thus the testing data contain 160 genomes from all 8 families under
Caudovirales . Then,we sampled contigs from each test genome by generating random starting positions. To estimate the impact of contiglength on PhaGCN, we generated contigs in three length ranges: [4kbp, 8kbp], [8kbp, 12kbp], and [12kbp, 16kbp]. Foreach of the length range, we generated 10 contigs. Thus we have 1,600 contigs for testing. Finally, we repeated thisexperiment for three times and recorded the average performance of the three experiments. In total, for each lengthrange, 4,800 contigs from 480 genomes were tested. We also recorded the results of using complete genomes as the testdata. To have a fair comparison, we applied the same method to construct the training set (or reference database) andthe test set for vConTACT 2.0, POGs, and ClassiPhage.Fig. 4 shows that PhaGCN outperforms other state-of-the-art tools across different length range. With the increase ofcontig length, the performance of all pipelines increases. This is expected because longer contigs contain more proteins,which can lead to better classification performance. We also evaluated the classification performance of only using theCNN model in PhaGCN. Both the recall and precision of using GCN is better than using only CNN, showing that theknowledge graph enhances the learning ability of the model. In addition, the classification performance of PhaGCN isstable with the change of the contig length, making it useful for classifying short contigs.Fig. 5 shows the classification performance of all tools on randomly sampled contigs. Note that each tool can outputeither a family label or no label at all (no prediction). Thus, the bar height shows the percentage of predicted contigs. Inaddition, each bar is divided into two parts. The top part (solid) is the misclassification rate while the bottom part withpatterns corresponds to the macro-accuracy for the classified contigs. The result shows that PhaGCN always has thelargest number of predicted contigs with the highest classification accuracy.Although vConTACT 2.0 can assign the correct labels to most of the predicted contigs as shown in Fig. 5, it onlygenerated predictions for a small number of the contigs in two families (
Podoviridae and
Siphoviridae ). For familieswithout any prediction, the precision is 0. Thus, the macro-precision of vConTACT is small.
In this experiment, we downloaded all newly released
Escherichia coli phages under
Caudovirales in 2020 fromNCBI RefSeq and used them as the testing species (a total of 99 species are downloaded). And we downloaded all
Caudovirales phages released in 2019 from NCBI RefSeq and used them as the training set. Consequently, these newlyreleased
Escherichia coli phages can be treated as unknown species for our model. Then we applied ART-Illumina tosimulate reads from the testing sequences. The parameters used for generating reads are -p, -l 150, -ss HS25, -f 20, -m200 and -s 10. The output contains 150bp paired-end reads simulated under HiSeq 2500. We mixed all the simulatedreads in one dataset and run MEGAHIT to assemble them. In order to quantify the performance of different tools, wedetermine the correct label of contigs by aligning them against the test genomes using BLAST in glocal mode. Onlycontigs of length above 2kbp with taxon-specific alignment results and query coverage >
85% will be kept. As a result,a total of 301 contigs were used as input for comparison. 8
PREPRINT - F
EBRUARY
9, 2021Figure 4: The precision and recall (Eq. 13 and Eq. 14) of PhaGCN, vConTACT2.0, ClassiPhage, POGs, and the CNNmodel in PhaGCN on simulated contigs, which are randomly sampled from phage genomes. X-axis: the length range ofcontigs. For each length range, there are 1,600 randomly sampled contigs from 20 genomes of 8 families. The reportedperformance is averaged on three such sets of contigs for each length range.Figure 5: The percentage of classified contigs and classification accuracy (Eq.12) of each model on simulated contigs,which are randomly sampled from phage genomes. Each bar shows the percentage of classified contigs. The solid partshows the misclassification rate and the bottom part with patterns represents the macro-accuracy.As shown in Fig. 6 and Fig. 7, PhaGCN outperforms other tools across different length, which is consistent with theconclusion in Section 3.2. We also find that when the length of the contigs becomes shorter ([2kbp, 4kbp]), PhaGCN9
PREPRINT - F
EBRUARY
9, 2021Figure 6: The precision and recall (Eq. 13 and Eq. 14) of PhaGCN, vConTACT2.0, ClassiPhage, POGs and CNNmodel in PhaGCN on contigs that are assembled using MEGAHIT from simulated reads. 301 contigs assembled fromsimulated reads of 99 species are used as inputs. The numbers of contigs for each length range are 101, 107, 51, and 42,respectively.Figure 7: The percentage of classified contigs and classification accuracy (Eq.12) of each model on contigs that areassembled using MEGAHIT from simulated reads. Each bar shows the percentage of classified contigs. The solid partshows the misclassification rate and the bottom part with patterns represents the macro-accuracy.can still achieve over 80% accuracy. Although the reads simulated by ART-Illumina contain sequencing error, the10
PREPRINT - F
EBRUARY
9, 2021performance of PhaGCN still achieves high accuracy (100% for contigs over 8kbp). Because we only tested newlyreleased
Escherichia coli phages in 2020, the classification performance of all methods were slightly better than theresults in Section 3.2.
The most resource demanding component in PhaGCN is the sequence alignment. We used it to produce the proteinclusters by conducting pairwise alignments between contigs and reference sequences. PhaGCN produces proteinclusters for each set of input contigs by assuming that they may contain novel proteins. In addition, alignment is alsoconducted for defining the edge in the knowledge graph. Table 2 shows the average elapsed time of classifying 100contigs for each tool. PhaGCN is not the fastest program. Optimization can be applied to reduce the number of pairwisealignments. For example, we can produce a database of protein clusters and reduce the number of pairwise alignments.
Program PhaGCN vConTACT 2.0 ClassiPhage POGsElapsed time(min/100 contigs) 25 32 4 7
Table 2: The average elapsed time to predict labels of 100 contigs for each method. All the methods are run on Intel ® Xeon ® Gold 6258R CPU with 8 cores.
In this experiment, we searched for real sequencing data that contain
Caudovirales at NCBI SRA and downloaded twodatasets, SRR12949983 and SRR13132427. Then we used MEGAHIT to assemble reads into contigs on these twodatasets separately. To quantify the performance of phage classification on these two datasets, we used the providedtaxonomic analysis by NCBI SRA as the ground truth. The phages in the two datasets provided by NCBI SRA are listedin Table 3. We used the same method introduced in Section 3.3 to label the contigs and removed all these genomes inTable 3 from the reference database before training PhaGCN. The contigs in the test data are listed in Table 4.SRR12949983 SRR13132427
Escherichia phage C5 Escherichia phage V18Salmonella phage C2 Escherichia virus FV3Serratia phage Pila Escherichia virus JES2013Escherichia virus E112 Escherichia phage CEC_Kaz_2018Escherichia virus ECML134 Escherichia phage SECphi18Escherichia virus T4 Escherichia phage vB_EcoS_PNS1
Table 3: Reference species in SRR12949983 and SRR13132427. These species are shown in the associated taxonomicanalysis at NCBI SRA. We use them as the ground truth to assign labels to the assembled contigs.
SRR12949983 SRR13132427
No. of contigs assembled by MEGAHIT (>2kbp)
24 (including 1 bacterial and 1 unknown) 20 (including 1 bacterial)
Phage family No. of contigs Phage family No. of contigs
Autographiviridae Myoviridae Myoviridae Siphoviridae Table 4: Contigs assembled by MEGAHIT on the two datasets and their family labels derived using alignment againstthe species in Table 3.We compared the phage labels assigned by PhaGCN, vConTACT, POGs, and ClassiPhage on the assembled contigs.Because the real sequencing data might contain bacteria sequences, we run DeepVirFinder [32] to reject the contigs thatbelong to bacteria. As shown in Fig. 8, there is only one bacterial contig in each dataset. All other contigs (43 contigs)were fed into the four tools for classification.Fig. 8 reveals that PhaGCN achieves better classification performance than vConTACT, POGs, and ClassiPhage.PhaGCN has 100% accuracy in both datasets. Many contigs could not be classified by vConTACT 2.0 (unlabeled in Fig.8). POGs and ClassiPhage are able to assign labels for all contigs but a number of them have wrong family labels.11
PREPRINT - F
EBRUARY
9, 2021In the dataset SRR121949983, there is one contig lacking ground truth because the contig cannot be aligned to anyreference genome in Table 3. We thus conducted homology search at NCBI against Nucleotide Collection (nr/nt)database. The search shows that this contig belongs to
Siphoviridae . We input this contig to the four tools and onlyPhaGCN assigned the correct label,
Siphoviridae , to this contig.Figure 8: Classification result of PhaGCN, vConTACT 2.0, POGs, and ClassiPhage on SRR12949983(A) andSRR13132427(B).Figure 9: Composition analysis of SRR12949983(A) and SRR13132427(B). Left: composition analysis published atNCBI SRA. Right: composition analysis presented by PhaGCN.Furthermore, we showed the composition of these two datasets before and after using PhaGCN in Fig. 9. The resultsrevealed that our model can greatly improve the composition analysis for the dark matter. Thus, PhaGCN can benefitmetagenomic analysis. 12
PREPRINT - F
EBRUARY
9, 2021
After validating PhaGCN on simulated datasets and two real sequencing datasets, we applied PhaGCN to contigsassembled from metagenomic data of oyster samples. The samples were collected by the co-author Dr. Jiang betweenApril, 2016 to July, 2019 from various sites along the coast of South China Sea. Metagenomic sequencing wasconducted from samples in the Gill, visceral mass, and mantle tissues of oyster using viral-like particle enrichment andprotocols in [33, 34]. There are about 2.5 billions of raw reads from 71 libraries.
Ackermannviridae Autographiviridae Demerecviridae Drexlerviridae Herelleviridae Myoviridae Podoviridae Siphoviridae unclassifiedvConTACT 0 0 0 0 0 50 116 102 22698PhaGCN 150 1727 301 74 32 5880 2682 6173 5767
Table 5: Prediction results of PhaGCN and vConTACT for contigs produced from oyster metagenomic data.After applying standard quality control and MEGAHIT with the default setting, there are about 3,375,091 contigs oflength above 500bp. After removing contigs that can be aligned to bacteria, archaea, eukaryota, we kept 22,966 contigswith length above 4,000bp as input to PhaGCN. Of them, 17,199 contigs can be assigned to
Caudovirales by PhaGCN.When users apply composition analysis for their samples, precision is important for generating valid hypothesis.Because vConTACT 2.0 has higher accuracy for what they can predict than ClassiPhage and POGs based on ourexperiments, we compared PhaGCN with vConTACT 2.0 in this experiment and summarized the results in Table 5.Although we don’t have the ground truth for this large-scale metagenomic sequencing data, the numbers of predictedcontigs are consistent with the results of experiments on simulated and real sequencing datasets. In addition the contigsthat can be assigned with labels by vConTACT 2.0 are significantly less than PhaGCN. ~74.8% contigs are predicted byPhaGCN while ~1.1% contigs are predicted by vConTACT 2.0. The contigs predicted by vConTACT is a subset ofPhaGCN. Also, PhaGCN can identify more families from the dataset. Because the classification accuracy of PhaGCNis more than 92% when the length of contigs is over 4kbp, the result can provide reliable family-level compositionanalysis for the oyster metagenomic data.
In this work, we demonstrate that PhaGCN can render better performance for novel phage classification. The majorimprovement of our method stems from combined strength of the reference-based model and the learning-based modelusing the knowledge graph: the nodes contain automatically learned features from nucleotide sequences and the edgesare created by protein-based alignment. Then the semi-supervised GCN is applied on the knowledge graph to utilizeboth labeled and unlabeled data for training.Figure 10: (A): classification result of three added families:
Rudiviridae , Microviridin , and
Inoviridae . (B): PhaGCNcan reject non-
Caudovirales phages.As
Caudovirales is the order with the most number of sequenced phages from RefSeq, we validated PhaGCN onclassifying families in this order. But PhaGCN can be conveniently extended to other taxa. We extended PhaGCN byadding families that contain at least 10 genomes. Using this criterion, three families (
Rudiviridae , Microviridin , and
Inoviridae ) were added. We used the leave-one-species-out method to choose testing species from these three familiesand use the same method introduced in Section 3.2 to generate contigs. The classification results in Fig. 10(A) showthat PhaCN can correctly classify almost all of the contigs from the extended families.Using knowledge graph enables PhaGCN to detect targeted phage families, which is useful for applications where onlysome phages are of interest. Specifically, phages that are not in the training families usually won’t form edges with the13
PREPRINT - F
EBRUARY
9, 2021nodes in the graph and thus will not be mis-classified into
Caudovirales . We validated the detection ability of PhaGCNby testing whether PhaGCN can reject contigs that do not belong to
Caudovirales . We downloaded 4,218 phagegenomes that do not belong to
Caudovirales from RefSeq according to the ICTV taxonomic affiliation information. Foreach genome, we apply the same method introduced in Section 3.2 to generate 10 contigs for each of them. Thus, atotal of 42,180 contigs were tested. As shown in Fig. 10(B), only 3 of them are accepted (predicted) by PhaGCN. Thisexperiment demonstrates that PhaGCN can be applied for targeted phage detection.In summary, PhaGCN provides a useful tool for phage classification. It can identify phage contigs from the metagenomicsamples with higher accuracy than other phage classification tools.
Funding
This work was supported by Hong Kong RGC GRF 9042828 and HKIDS (9360163) and NSF of China (31972847).
References [1] Stephen McGrath, D van Sinderen, et al.
Bacteriophage: genetics and molecular biology.
Caister Academic Press,2007.[2] Lucía Fernández, Ana Rodríguez, and Pilar García. Phage or foe: an insight into the impact of viral predation onmicrobial communities.
The ISME journal , 12(5):1171–1179, 2018.[3] Bonnie L Hurwitz and Jana M U’Ren. Viral metabolic reprogramming in marine ecosystems.
Current opinion inmicrobiology , 31:161–168, 2016.[4] Catherine Loc-Carrillo and Stephen T Abedon. Pros and cons of phage therapy.
Bacteriophage , 1(2):111–114,2011.[5] Lin-Fa Wang and Meng Yu. Epitope identification and discovery using phage display libraries: applications invaccine development and diagnostics.
Current drug targets , 5(1):1–15, 2004.[6] Justyna Bazan, Ireneusz Całkosi´nski, and Andrzej Gamian. Phage display—A powerful technique for im-munotherapy: 1. Introduction and potential of therapeutic applications.
Human vaccines & immunotherapeutics ,8(12):1817–1828, 2012.[7] Jing Liu, Mohammed Dehbi, Greg Moeck, Francis Arhin, Pascale Bauda, Dominique Bergeron, Mario Callejo,Vincent Ferretti, Nhuan Ha, Tony Kwan, et al. Antimicrobial drug discovery through bacteriophage genomics.
Nature biotechnology , 22(2):185–191, 2004.[8] Bas E JDutilh, Noriko Cassman, Katelyn McNair, Savannah E Sanchez, Genivaldo G Z Silva, Lance Boling,Jeremy J Barr, Daan R Speth, Victor Seguritan, Ramy K Aziz, Ben Felts, Elizabeth A Dinsdale, John L Mokili,and Robert A Edwards. A highly abundant bacteriophage discovered in the unknown sequences of human faecalmetagenomes.
Nature Communications , 5:4498, 2014.[9] Kira Moon, Ilnam Kang, Suhyun Kim, Sang-Jong Kim, and Jang-Cheon Cho. Genomic and ecological study oftwo distinctive freshwater bacteriophages infecting a Comamonadaceae bacterium.
Scientific reports , 8(1):1–9,2018.[10] Kira Moon, Jeong Ho Jeon, Ilnam Kang, Kwang Seung Park, Kihyun Lee, Chang-Jun Cha, Sang Hee Lee, andJang-Cheon Cho. Freshwater viral metagenome reveals novel and functional phage-borne antibiotic resistancegenes.
Microbiome , 8:1–15, 2020.[11] Kira Moon, Suhyun Kim, Ilnam Kang, and Jang-Cheon Cho. Viral metagenomes of Lake Soyang, the largestfreshwater lake in South Korea.
Scientific Data , 7(1):1–6, 2020.[12] Blanca Perez Sepulveda, Tamsin Redgwell, Branko Rihtman, Frances Pitt, David J. Scanlan, and Andrew Millard.Marine phage genomics: the tip of the iceberg.
FEMS Microbiology Letters , 363(15), 06 2016. fnw158.[13] Tasha M Santiago-Rodriguez and Emily B Hollister. Human virome and disease: high-throughput sequencingfor virus discovery, identification of phage-bacteria dysbiosis and development of therapeutic approaches withemphasis on the human gut.
Viruses , 11(7):656, 2019.[14] Kevin P Keegan, Elizabeth M Glass, and Folker Meyer. MG-RAST, a metagenomics service for analysis ofmicrobial community structure and function. In
Microbial environmental genomics (MEG) , pages 207–233.Springer, 2016. 14
PREPRINT - F
EBRUARY
9, 2021[15] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. MEGAHIT: an ultra-fastsingle-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.
Bioinformatics ,31(10):1674–1676, 2015.[16] David M Kristensen, Alison S Waller, Takuji Yamada, Peer Bork, Arcady R Mushegian, and Eugene V Koonin. Or-thologous gene clusters and taxon signature genes for viruses of prokaryotes.
Journal of bacteriology , 195(5):941–950, 2013.[17] Pakorn Aiewsakun, Evelien M Adriaenssens, Rob Lavigne, Andrew M Kropinski, and Peter Simmonds. Evaluationof the genomic diversity of viruses infecting bacteria, archaea and eukaryotes using a common bioinformaticplatform: steps towards a unified taxonomy.
The Journal of general virology , 99(9):1331, 2018.[18] Cynthia Maria Chibani, Anton Farr, Sandra Klama, Sascha Dietrich, and Heiko Liesegang. Classifying theunclassified: a phage classification method.
Viruses , 11(2):195, 2019.[19] Forest Rohwer and Rob Edwards. The Phage Proteomic Tree: a genome-based taxonomy for phage.
Journal ofbacteriology , 184(16):4529–4535, 2002.[20] Ho Bin Jang, Fernand F Fagutao, Seong Won Nho, Seong Bin Park, In Seok Cha, Jong Earn Yu, Jung Seok Lee,Se Pyeong Im, Takashi Aoki, and Tae Sung Jung. Phylogenomic network and comparative genomics reveal adiverged member of the φ kz-related group, marine Vibrio phage φ JM-2012.
Journal of virology , 87(23):12866–12878, 2013.[21] Benjamin Bolduc, Ho Bin Jang, Guilhem Doulcier, Zhi-Qiang You, Simon Roux, and Matthew B Sullivan.vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria.
PeerJ ,5:e3243, 2017.[22] Ho Bin Jang, Benjamin Bolduc, Olivier Zablocki, Jens H Kuhn, Simon Roux, Evelien M Adriaenssens, J RodneyBrister, Andrew M Kropinski, Mart Krupovic, Rob Lavigne, et al. Taxonomic assignment of uncultivatedprokaryotic virus genomes is enabled by gene-sharing networks.
Nature biotechnology , 37(6):632–639, 2019.[23] Qiong Wang, George M Garrity, James M Tiedje, and James R Cole. Naive Bayesian classifier for rapid assignmentof rRNA sequences into the new bacterial taxonomy.
Applied and environmental microbiology , 73(16):5261–5267,2007.[24] Jiayu Shang and Yanni Sun. CHEER: hierarCHical taxonomic classification for viral mEtagEnomic data via deepleaRning.
Methods , 2020.[25] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXivpreprint arXiv:1609.02907 , 2016.[26] Tianyi Zhao, Yang Hu, Jiajie Peng, and Liang Cheng. DeepLGP: a novel deep learning method for prioritizinglncRNA target genes.
Bioinformatics , 36(16):4466–4472, 2020.[27] Tanvir Alam, Hamada RH Al-Absi, and Sebastian Schmeier. Deep Learning in LncRNAome: Contribution,Challenges, and Perspectives.
Non-coding RNA , 6(4):47, 2020.[28] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificitiesof DNA-and RNA-binding proteins by deep learning.
Nature biotechnology , 33(8):831–838, 2015.[29] Seokjun Seo, Minsik Oh, Youngjune Park, and Sun Kim. DeepFam: deep learning based alignment-free methodfor protein family modeling and prediction.
Bioinformatics , 34(13):i254–i262, 2018.[30] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of wordsand phrases and their compositionality. arXiv preprint arXiv:1310.4546 , 2013.[31] Weichun Huang, Leping Li, Jason R Myers, and Gabor T Marth. ART: a next-generation sequencing readsimulator.
Bioinformatics , 28(4):593–594, 2012.[32] Jie Ren, Kai Song, Chao Deng, Nathan A Ahlgren, Jed A Fuhrman, Yi Li, Xiaohui Xie, Ryan Poplin, and FengzhuSun. Identifying viruses from metagenomic data using deep learning.
Quantitative Biology , pages 1–14, 2020.[33] Hong-Ying Wei, Sheng Huang, Tuo Yao, Fang Gao, Jing-Zhe Jiang, and Jiang-Yong Wang. Detection of virusesin abalone tissue using metagenomics technology.
Aquaculture Research , 49(8):2704–2713, 2018.[34] Jiang Jingzhe and Wei Hongying. Isolation of Viral Like Particles (VLP) from Tissues of Molluscs. protocols.ioprotocols.io