[PDF] Bacteriophage classification for assembled contigs using Graph Convolutional Network

Abstract

Motivation: Bacteriophages (aka phages), which mainly infect bacteria, play key roles in the biology of microbes. As the most abundant biological entities on the planet, the number of discovered phages is only the tip of the iceberg. Recently, many new phages have been revealed using high throughput sequencing, particularly metagenomic sequencing. Compared to the fast accumulation of phage-like sequences, there is a serious lag in taxonomic classification of phages. High diversity, abundance, and limited known phages pose great challenges for taxonomic analysis. In particular, alignment-based tools have difficulty in classifying fast accumulating contigs assembled from metagenomic data. Results: In this work, we present a novel semi-supervised learning model, named PhaGCN, to conduct taxonomic classification for phage contigs. In this learning model, we construct a knowledge graph by combining the DNA sequence features learned by convolutional neural network (CNN) and protein sequence similarity gained from gene-sharing network. Then we apply graph convolutional network (GCN) to utilize both the labeled and unlabeled samples in training to enhance the learning ability. We tested PhaGCN on both simulated and real sequencing data. The results clearly show that our method competes favorably against available phage classification tools.

Full PDF

BB ACTERIOPHAGE CLASSIFICATION FOR ASSEMBLED CONTIGSUSING G RAPH C ONVOLUTIONAL N ETWORK

Jiayu SHANG

Dept. of Electrical EngineeringCity University of Hong KongKowloon, Hong Kong SAR, China [email protected]

Jingzhe Jiang

South China Sea Fisheries Research InstituteChinese Academy of Fishery SciencesGuangzhou, Guangdong Province, China [email protected]

Yanni Sun

Dept. of Electrical EngineeringCity University of Hong KongKowloon, Hong Kong SAR, China [email protected]

February 9, 2021 A BSTRACT

Motivation:

Bacteriophages (aka phages), which mainly infect bacteria, play key roles in the biologyof microbes. As the most abundant biological entities on the planet, the number of discovered phagesis only the tip of the iceberg. Recently, many new phages have been revealed using high throughputsequencing, particularly metagenomic sequencing. Compared to the fast accumulation of phage-likesequences, there is a serious lag in taxonomic classiﬁcation of phages. High diversity, abundance, andlimited known phages pose great challenges for taxonomic analysis. In particular, alignment-basedtools have difﬁculty in classifying fast accumulating contigs assembled from metagenomic data.

Results:

In this work, we present a novel semi-supervised learning model, named PhaGCN, toconduct taxonomic classiﬁcation for phage contigs. In this learning model, we construct a knowledgegraph by combining the DNA sequence features learned by convolutional neural network (CNN) andprotein sequence similarity gained from gene-sharing network. Then we apply graph convolutionalnetwork (GCN) to utilize both the labeled and unlabeled samples in training to enhance the learningability. We tested PhaGCN on both simulated and real sequencing data. The results clearly show thatour method competes favorably against available phage classiﬁcation tools.

Availability:

The source code of PhaGCN is available via: https://github.com/KennthShang/PhaGCN

Contact: [email protected]

Bacteriophages (or phages), which mainly infect bacteria, are among the most common and diverse biological entitiesin the biosphere [1]. They regulate the actions of the ecosystem through killing, metabolic reprogramming, or genetransfer [2, 3]. As a major agent of horizontal gene transfer between bacteria, phages can change the virulence ofbacteria and indirectly cause human diseases. There are active studies that use phages for applications such as phagetherapy [4], disease diagnostics [5, 6], and antimicrobial drug discovery [7].Despite important functions of phages, our understanding of them is still very limited. Metagenomic sequencing, whichallows us to obtain total genomic DNA directly from host-associated and environmental samples, has contributedsigniﬁcantly to new phage discovery [8, 9, 10, 11]. In particular, metagenomic sequencing allows sequencing ofuncultured dark matter of the microbial biosphere, which can contain a large amount of phages [12]. The advancements a r X i v : . [ q - b i o . GN ] F e b PREPRINT - F

EBRUARY

9, 2021of high-throughput sequencing, assembly, and contig scaffolding have led to phage-like contigs or genomes fromdifferent types of samples. According to the RefSeq database supported by the National Center for Biotechnologyinformation (NCBI) , the number of identiﬁed phages changed from 1,468 in 2015 to 3,852 in 2020 in the RefSeqdatabase, which is more than twice of increase. Despite the increase, known phages is just the tip of the iceberg of thevirome on the planet [13]. How to automatically and accurately mine phages and assign their taxonomic groups fromvast amount of sequencing data remains a challenging problem.There are two speciﬁc challenges for phage classiﬁcation. First, the phages with known taxa are very limited. TheInternational Committee on Taxonomy of Viruses (ICTV) is responsible for the ofﬁcial virus taxonomy and organizesviruses in order, family, subfamily, genera and species. Current ICTV classiﬁcation procedures cannot catch up withnew phage discovery. For example, one of the phage order named

Caudovirales has 3,691 reference genomes. However,there are more than 1,800 new

Caudovirales sequences found in 2020 that are unclassiﬁed into families. Limitedlabeled genomes pose challenges for both alignment-based and learning-based classiﬁcation. Second, many phages indifferent taxa can share protein homologs, which adds ambiguity for alignment-based taxonomic classiﬁcation. Forexample, more than 7,616 (~10%) proteins in all annotated phage proteins are shared by phages in different familiesunder

Caudovirales . In addition, more than 18,970 (~27%) pairs of highly similar proteins (e-value of BLASTP result < − ) are encoded by phages in different families. Therefore, using homology search alone can return ambiguousclassiﬁcation.In this work, we present a method that automates taxonomic classiﬁcation for contigs, which are the outputs of assembly.Although taxonomic classiﬁcation can be conducted on both reads and contigs [14], recombination in viruses can makeread-level taxonomic classiﬁcation difﬁcult. In addition, more distinctive features can be derived from contigs andthus can lead to improved classiﬁcation accuracy. Current metagenomic assembly tools, such as MEGAHIT [15],have been extensively tested and can produce quality contigs from complex datasets. Thus, our tool accepts contigs asinput. In order to address the aforementioned challenges, we developed a semi-supervised learning framework thatincorporated the automatically learned features for each contig via a CNN, the protein sequence similarity, and thegene-sharing features between contigs/genomes. Both the unlabeled and labeled sequences were utilized for trainingin a graph convolutional neural network (GCN). We will demonstrate that the features from the unlabeled sequences(contigs) improve the learning ability and accuracy for phage classiﬁcation. Below we summarize related work forphage classiﬁcation. Many attempts have been made for phage taxonomic classiﬁcation. They can be roughly divided into two groups:alignment-based [16, 17, 18] and learning-based [19, 20, 21, 22]. Alignment-based methods utilize either nucleotide-level or protein-level homology search between query contigs and reference genomes for assigning the taxon for thequery. ClassiPhage [18] and Phage Orthologous Groups (POGs) [16] are two representative alignment-based phageclassiﬁcation tools. POGs extract taxon-speciﬁc marker genes and align query sequences against the marker genes usingBLASTP. If there are statistical signiﬁcant alignment for the contigs, the label of the best-aligned marker gene will beassigned to the contigs. ClassiPhage builds a proﬁle Hidden Markov Model (pHMM) for each phage taxonomic groupand apply HMM-based alignment for classiﬁcation. There are two limitations with alignment-based method. First, asgenes or proteins can be shared by different taxa, alignment-based method may lead to ambiguous label assignment orreturn a label with a higher rank using the lowest common ancestor (LCA) in the phylogenetic tree. Second, as phagesare highly abundant and diverse, alignment-based methods are not able to assign taxa for new species that harbor novelproteins or lack quality alignments with the references. For example, under the

Caudovirales order, 187,006 proteinsare named as hypothetical proteins without known family labels. 13,382 proteins from phages released in 2020 do nothave BLASTP results with the phages released before 2020. Thus, using only sequence similarity cannot provide idealresolution.There are a number of learning-based tools for microbe classiﬁcation such as the Naïve Bayes classiﬁer [23] and CNN[24]. These methods aim to automatically learn features from the reference sequences and predict taxonomic labels ofbacteria or RNA viruses. The most relevant learning-based tool to phage classiﬁcation is vConTACT 2.0 [22], whichapplies a graph clustering algorithm to assign labels for unknown contigs. In order to leverage gene organizationconservation for phage classiﬁcation, vConTACT utilizes a clustering algorithm to construct a gene-sharing network[21, 22]. If the reference genomes and contigs are in the same cluster, the labels of the reference genomes will beassigned to those contigs. While these gene-sharing network methods present satisfactory performance on classiﬁcationof complete genomes, the classiﬁcation accuracy decreases as the length of the contigs becomes shorter. The decreasedperformance stems from the fact that short contigs do not contain many proteins and thus do not lead to valid edges inthe gene-sharing network. As a result, the clustering algorithms fail to group contigs and reference genomes in thesame cluster. Then, no labels will be assigned to these contigs.2

PREPRINT - F

EBRUARY

9, 2021Given the enormous diversity of phages and the sheer amount of unlabeled phages, we formulate the phage classiﬁcationproblem as a semi-supervised learning problem. We choose the GCN as our learning model and combine the strengthof both the alignment-based and the learning-based methods. First, we utilize the pair-wise BLASTP result betweencontigs and references to improve the weighted edges in the gene-sharing network. Second, to handle the situation thatshort contigs lack gene organization-related features, a CNN-based model is adopted to encode nucleotide informationfrom the sequence. The GCN model allows us to utilize features from both labeled and unlabeled samples in trainingand thus lead to more accurate and sensitive phage classiﬁcation. We compared our tool (named PhaGCN) withthree state-of-the-art models speciﬁcally designed for phage classiﬁcation: Phage Orthologous Groups (POG) [16],vConTACT 2.0 [22], and ClassiPhage [18]. The experimental results demonstrated that PhaGCN outperforms otherpopular methods.

Figure 1: The pipeline of PhaGCN. A1: cut the contigs into 2kbp segments. A2: feature learning from the inputsusing CNN. A3: construct nodes using encoded vectors. B1: contig translation using 6 reading frames. B2: ﬁltershort translations (12 amino acids). B3: align contigs against reference database using BLASTP. B4: choose the besttranslated frame to gain BLASTP result. B5: use the BLASTP result to construct protein clusters. B6 and B7: deﬁneedges based on the sum of the E weight and P weight . C1: construct the knowledge graph for GCN.Semi-supervised learning is a machine learning approach that combines a small amount of labeled data with a largeamount of unlabeled data during training. The main purpose of using the unlabeled data is to utilize their conjunctioninformation with the labeled data to improve the classiﬁcation accuracy. Because the number of reference (labeled)phage genomes is small and new (unlabeled) phage contigs are increasing quickly, we formulate the phage classiﬁcationproblem as a semi-supervised learning problem.One of the semi-supervised learning approaches, named GCN, is based on deep learning. The basic idea of GCN is toapply a convolutional layer on a graph to utilize the features on non-Euclidean structure [25]. The purpose of the graphconvolutional layer is to automatically learn the edge weights. Then, unlabeled samples/nodes can be represented as theweighted sum of their neighbor samples/nodes. In biological data analysis, there exist many non-Euclidean structuressuch as protein topology graph on the supersecondary structure, gene-sharing network, and diseases-gene relationshipgraph. GCN is expected to render high classiﬁcation performance by employing the structural data. For example, GCNhas shown promising results in ﬁnding the relationship between long non-coding RNAs and diseases [26, 27]. In phageclassiﬁcation, different phage genomes and contigs can share genes or proteins, which can be encoded in the graph ofGCN. In addition, the nodes in GCN can embed automatically learned feature from nucleotide sequences. During thetraining, convolution is conducted for each node and its neighbors deﬁned by the graph. The learned edge weights willthen be applied for classifying samples without labels.The input to our GCN model is a knowledge graph. There are two key components in the knowledge graph: nodeencoding and edge construction. The node is a numerical vector learned from contigs using a CNN. The edge encodes3 PREPRINT - F

EBRUARY

9, 2021features from both the sequence similarity and the organization of genes. Fig. 1 contains the major components fornode and edge construction. To encode a sequence using a node, a pre-trained convolutional neural network (CNN) isadopted to capture features from the input DNA sequence (A1-A3). The CNN model is trained to convert proximatesubstrings into vectors of high similarity. The edge construction consists of several steps. We employ a greedy searchalgorithm to ﬁnd the best BLASTP results between the 6-frame translations of the contigs and the database (B1-B4).Then the Markov clustering algorithm (MCL) is applied to generate protein clusters from the BLASTP result (B5)[22]. Based on the results of BLASTP (sequence similarity) and MCL (shared proteins), we deﬁne the edges betweensequences (contigs and reference genomes) using two metrics: P weight and E weight (B6-B7). By combining the node’sfeatures and edges (C1), we construct the knowledge graph and feed it to the GCN to classify new phage contigs. CNN can automatically learn motif-related features for sequence classiﬁcation [28, 29]. Although CNN can be directlyapplied to phage classiﬁcation, our experiments will show that using CNN alone cannot render the best classiﬁcationperformance. Thus, we only train the CNN for encoding input contigs.As shown in Fig. 2, there are two slightly different network structures in the CNN for “train mode” and “encodingmode”, respectively. In the train mode (Fig. 2(A)), we use the reference database to train the CNN model. In theencoding mode (Fig. 2(B)), the output of the ﬁrst dense layer in the pre-trained CNN will be used to encode sequencesinto numerical vectors.

Train mode:

Because the CNN model can only handle ﬁxed length input, all the inputs will be cut into 2kbp segments with user-speciﬁed stride value (default 50). The segment has the same label as the underlying genome according to the ICTVtaxonomic classiﬁcation.The CNN model contains three different parts: embedding layer, convolutional layer, and dense layer. The embeddinglayer is used to convert the DNA sequence into numerical inputs for convolution. There are two major methods for theembedding layer: one-hot embedding and skip-gram embedding [30]. As shown in our previous work of using CNNfor classifying RNA viruses [24], the skip-gram based embedding can improve CNN’s learning ability. Thus, in thiswork, we implemented a skip-gram embedding layer that can map proximate k-mers into highly similar vectors. Wetrained the embedding layer using k-mers and their neighboring (proximate) k-mers so that the embedding layer canlearn their adjacent relationship. Speciﬁcally, in order to train the embedding layer, we use a 3-mer at position i asinput and 3-mers located at i + j as output, where − m ≤ j ≤ m . m is the hyperparameter that can be speciﬁed forthe skip-gram model. We employ 100 hidden units in the embedding layer to encode the 3-mers and the output of theembedded vector has 100 dimensions. Thus, each 2kbp segment is converted into embedded matrix M ∈ R , × .Figure 2: The structures of the CNN model in PhaGCN for training (A) and encoding (B). R1,R2,R3 are the referencegenomes used for training. C1,C2,C3 are the contigs that need to be encoded. In train mode, sequences will be fed toCNN to update parameters during back propagation. In encoding mode, The pre-trained CNN will be used to encodethe sequences into numerical vectors. Then these vectors will be adopted as node features in the knowledge graph.4 PREPRINT - F

EBRUARY

9, 2021 Z i ( M, w conv ) =

ReLU ( (cid:80) n conv j =1 w jconv ∗ M [ i : i + d − d ] + b ) (1) H (0) = M axpool ( Z ( M, w conv )) (2) H ( l +1) = ReLU ( H ( l ) , w ( l ) ) (3) output of train mode = Sof tM ax ( H (2) , w (2) ) (4)Then, the embedded matrix M will be fed into the convolutional layer. Eq. 1 is the convolution function. b is the biasterm; d and d are the ﬁlter sizes. Since the embedded vector has 100 dimensions, d = 100 . M [ i : i + d − d ] deﬁnes a 2D window size of d × d of the embedded matrix M . ReLU is the activation function. The convolutionalﬁlters w conv contain n conv

2D matrices and w jconv is the j -th ﬁlter. We applied ﬁlters repeatedly to each possiblewindow of the input embedded matrix to produce a feature map. Then the dense layer is applied to compress thefeatures captured by the convolutional layer as shown in Eq. 2 and Eq. 3. First, max-pooling (Eq. 2) is applied to thefeature map to capture the most useful information from the convolutional layer. Second, we use two dense layerswith ReLU activation function (Eq. 3) to learn and compress the feature map. H ( l ) is the feature map in hidden layer l and w ( l ) is the weight parameters in the l -th hidden layer. Since we only has two dense layers, l ∈ { , } . Finally,the SoftMax function (Eq. 4) is adopted to generate the prediction. As shown in Fig. 2, in the train mode, we employCrossEntropyLoss to calculate the error between prediction and real label and backpropagate the loss to update theparameters in the model. The detailed parameters are listed in our Github repository. Encoding mode:

After training the CNN model, we utilize the pre-trained parameters to convert contigs into numericalvectors. The main difference in the encoding mode is that we only use the output of the ﬁrst dense layer as the learnedfeature rather than using the SoftMax function for prediction. Eq. 5 shows the equation to convert x (an input 2kbpsegment) into the output of the ﬁrst dense layer. If a contig is cut into multiple segments of length 2kbp, we will conductvector addition for all the segments’ outputs and divide it by the number of segments. Thus, contigs of different lengthsare always converted into vectors of the same size (determined by the units of the dense layer, default 512). Out ( x ) = ReLU ( pool ( Z ( M, w conv ) , w (0) ) (5) The edges connect nodes that are likely in the same taxonomic group. We deﬁne the edge by incorporating both thenumber of shared protein clusters and also the average protein similarity between two sequences. Intuitively, if twosequences share a large number of common protein clusters with high similarity, they tend to belong to the same taxa.In order to quantify the signiﬁcance of two sequences sharing c common proteins, we ﬁrst deﬁne protein clusters. Apair of proteins from two sequences is called a shared protein if they are in the same protein cluster. We follow the idea in [21, 22] to construct protein clusters. We start by extracting proteins from all sequences. For thegenomes in the reference database, proteins are downloaded from NCBI RefSeq. For the input contigs, DNA sequencesare translated into amino acid sequences using 6 reading frames. We employ BLASTP to conduct all-against-allpairwise alignment between contigs’ 6-frame translations and proteins encoded by the genomes. If there are multiplealignments for different reading frames of a contig, only the best frame is kept. Then we create a weighted graph wherethe node is a protein sequence in the contig or genome and the edge represents an alignment with e-value less than athreshold. The edge weight is the e-value. Then protein clusters are subsequently identiﬁed using the Markov clusteringalgorithm (MCL). Finally, clusters that contain at least two proteins will be kept.

The edge is deﬁned by computing two metrics: P weight and E weight . P weight is adopted to calculate the expectednumber of sequences sharing at least an observed number of common proteins (i.e. c proteins). Suppose we havetwo sequences A and B , whose proteins are in a and b protein clusters, respectively. Following vConTACT [21], weutilize the hypergeometric formula to estimate the probability of A and B sharing c common protein clusters. In thisformulation, the total number of protein clusters from all the sequences is n (i.e. the total number of draws in the5 PREPRINT - F

EBRUARY

9, 2021hypergeometric distribution). A success draw is a protein cluster that is shared by two sequences and a failure draw is aprotein cluster unique to one sequence. Thus, Eq. 6 estimates the probability that two genomes have at least c commonprotein clusters. y s the variable representing observed common proteins. Eq. 7 then computes the expected number ofpairs with at least c common proteins out of (cid:0) N (cid:1) sequence pairs. N is the number of sequences (contigs and referencegenomes). With increase of c , P in Eq. 6 becomes small enough to return a positive P weight . P ( y ≥ c ) = (cid:80) min ( a,b ) i = c (cid:0) ia (cid:1)(cid:0) b − in − a (cid:1)(cid:0) bn (cid:1) (6) P weight = − log ( P ( y ≥ c ) × (cid:18) N (cid:19) ) (7)While P weight is used to evaluate whether two sequences share a signiﬁcant number of common proteins, E weight isadopted to calculate the sequence similarity using alignments’ e-values. We calculate the average e-value for each twosequences. N c is the number of proteins shared by genome A and genome B and e value ( i ) means the e-value of the i th protein pair. Then a score matrix S ∈ R N × N is calculated using Eq. 8. S ( A, B ) means the edge score betweengenome A and B. For each genome A , S ( A, A (cid:48) ) is ranked for all A’s adjacent nodes A (cid:48) . Users can decide how manyedges to keep by specifying a threshold. By default, we only keep the top 5 edges for each genome A . For all the keptedges, E weight = S ( A, B ) . S ( A, B ) = (cid:40) , if no alignment result − log ( (cid:80) N c i =0 e value ( i ) N c ) , otherwise (8) Edges in the knowledge graph:

The ﬁnal edge in the knowledge graph is deﬁned based on the sum of P weight and E weight , which is proportional tothe logarithm of the joint probability of two sequences having enough common proteins of high similarity. An edge isdeﬁned when the sum is above a threshold τ (Eq. 9). Edge = (cid:26) , if P weight + E weight > τ , otherwise (9) After constructing the knowledge graph, we train a GCN to assign labels for all unlabeled contigs. H ( l +1) = ReLU ( ˜ D − ˜ G ˜ D H ( l ) W ( l ) ) (10) Out = Sof tM ax ( H (2) W (2) ) (11)The basic concept of graph convolutional layer is shown in Eq. 10. Suppose we have N sequences (nodes) in theknowledge graph. G is the R N × N adjacency matrix of the knowledge graph and I N is an R N × N identity matrix. ˜ G is calculated with ˜ G = G + I N . ˜ D is the R N × N diagonal matrix calculated with D ij = (cid:80) j ˜ G ij . H ( l ) is thefeature map in the l -th hidden layer; H (0) is the node feature matrix; and W ( l ) is a matrix of weight parameters. Afterthe graph convolutional layer, we apply a dense layer and use the SoftMax function to calculate the output matrix Out ∈ R N × n label (Eq. 11). Because we have two graph convolutional layers and one dense layer in our model, l ∈ { , } . The output dimension n label is decided by the number of classes in the database. As shown in Fig. 3, onlythe labeled samples will be used to calculate the loss in the training process. We adopt L2 loss to calculate the errorbetween prediction and the labeled samples and back propagate the loss to update the weight parameters. After trainingthe GCN model, we freeze the parameters and use the SoftMax value of the unlabeled samples to assign their labels inthe test mode. The detailed parameters are listed in our Github repository.6 PREPRINT - F

EBRUARY

9, 2021Figure 3: The structure of the GCN model in PhaGCN. Both feature from labeled (green, red, and blue color) andunlabled samples (grey color) will be used in training and prediction. In train mode, only labeled samples will beutilize to calculate the loss and update the parameters. In test mode, the SoftMax function will be applied to generateprediction for unlabeled samples.

We demonstrate the performance of PhaGCN on classifying contigs in families under

Caudovirales , which is an ordercontaining the majority of known phages from RefSeq (95.8% of total phage reference genomes). We downloaded the

Caudovirales reference genomes from the NCBI RefSeq database. As shown in Table 1, there are 3,639 genomes from8 different families. As the lower ranks contain few genomes in each group, we focus on family-level classiﬁcation inthe experiment. Name Number of genomes

Ackermannviridae Autographiviridae

Demerecviridae Drexlerviridae

Herelleviridae

Myoviridae

Podoviridae

Siphoviridae

Caudovirales . PhaGCN was tested on both simulated and real sequencing data. For the simulated data, we applied two differentmethods to generate contigs with known labels: 1. randomly sample contigs from the reference genome; 2. simulatereads with ART-Illumina [31] and run MEGAHIT [15] to assemble contigs. After validating PhaGCN on simulated datawith known ground truth, we downloaded two real sequencing datasets from NCBI SRA and evaluated PhaGCN onassembled contigs. As phages are highly abundant in marine environment samples [12], we tested PhaGCN on virus-likecontigs from 71 metagenomic data sets that are sequenced from oyster. We recorded macro-accuracy, macro-recall,and macro-precision for each experiment (Eq.12, Eq. 13, and Eq. 14) when the ground truth can be derived. N class is the total number of classes. T P is the True positive,

T N is the True negative, and

F N is the false negative. Asmacro-average will compute each metric independently for each class and then take the average, these metrics treat allclasses equally. 7

PREPRINT - F

EBRUARY

9, 2021

Acc macro = (cid:80) N class i =0 Acc i N class (12) P recision macro = (cid:80) N class i =0 P recision i N class = (cid:80) N class i =0 T P i T P i + F P i N class (13) Recall macro = (cid:80) N class i =0 Recall i N class = (cid:80) N class i =0 T P i T P i + F N i N class (14)The main purpose of PhaGCN is to classify new phages that do not have reference genomes in the training data. Whenthe training and testing data share common genomes, high accuracy may be attributed to memorization rather thanlearning. Thus, in all the experiments conducted using PhaGCN, we use species-masking, meaning that the species inthe testing data will be removed from the training data so that they do not share any phages. The test contigs are thustreated as part of novel species.We compare our results with three representative and widely used pipelines: vConTACT 2.0 [22], Phage OrthologousGroups (POGs) [16], and ClassiPhage [18]. In addition, as CNN itself can conduct classiﬁcation, we also comparedwith the CNN model we trained for PhaGCN. In this experiment, we randomly chose 20 genomes from each family as the testing species. The remaining genomeswere used as the training set. Thus the testing data contain 160 genomes from all 8 families under

Caudovirales . Then,we sampled contigs from each test genome by generating random starting positions. To estimate the impact of contiglength on PhaGCN, we generated contigs in three length ranges: [4kbp, 8kbp], [8kbp, 12kbp], and [12kbp, 16kbp]. Foreach of the length range, we generated 10 contigs. Thus we have 1,600 contigs for testing. Finally, we repeated thisexperiment for three times and recorded the average performance of the three experiments. In total, for each lengthrange, 4,800 contigs from 480 genomes were tested. We also recorded the results of using complete genomes as the testdata. To have a fair comparison, we applied the same method to construct the training set (or reference database) andthe test set for vConTACT 2.0, POGs, and ClassiPhage.Fig. 4 shows that PhaGCN outperforms other state-of-the-art tools across different length range. With the increase ofcontig length, the performance of all pipelines increases. This is expected because longer contigs contain more proteins,which can lead to better classiﬁcation performance. We also evaluated the classiﬁcation performance of only using theCNN model in PhaGCN. Both the recall and precision of using GCN is better than using only CNN, showing that theknowledge graph enhances the learning ability of the model. In addition, the classiﬁcation performance of PhaGCN isstable with the change of the contig length, making it useful for classifying short contigs.Fig. 5 shows the classiﬁcation performance of all tools on randomly sampled contigs. Note that each tool can outputeither a family label or no label at all (no prediction). Thus, the bar height shows the percentage of predicted contigs. Inaddition, each bar is divided into two parts. The top part (solid) is the misclassiﬁcation rate while the bottom part withpatterns corresponds to the macro-accuracy for the classiﬁed contigs. The result shows that PhaGCN always has thelargest number of predicted contigs with the highest classiﬁcation accuracy.Although vConTACT 2.0 can assign the correct labels to most of the predicted contigs as shown in Fig. 5, it onlygenerated predictions for a small number of the contigs in two families (

Podoviridae and

Siphoviridae ). For familieswithout any prediction, the precision is 0. Thus, the macro-precision of vConTACT is small.

In this experiment, we downloaded all newly released

Escherichia coli phages under

Caudovirales in 2020 fromNCBI RefSeq and used them as the testing species (a total of 99 species are downloaded). And we downloaded all

Caudovirales phages released in 2019 from NCBI RefSeq and used them as the training set. Consequently, these newlyreleased

Escherichia coli phages can be treated as unknown species for our model. Then we applied ART-Illumina tosimulate reads from the testing sequences. The parameters used for generating reads are -p, -l 150, -ss HS25, -f 20, -m200 and -s 10. The output contains 150bp paired-end reads simulated under HiSeq 2500. We mixed all the simulatedreads in one dataset and run MEGAHIT to assemble them. In order to quantify the performance of different tools, wedetermine the correct label of contigs by aligning them against the test genomes using BLAST in glocal mode. Onlycontigs of length above 2kbp with taxon-speciﬁc alignment results and query coverage >

85% will be kept. As a result,a total of 301 contigs were used as input for comparison. 8

PREPRINT - F

EBRUARY

9, 2021Figure 4: The precision and recall (Eq. 13 and Eq. 14) of PhaGCN, vConTACT2.0, ClassiPhage, POGs, and the CNNmodel in PhaGCN on simulated contigs, which are randomly sampled from phage genomes. X-axis: the length range ofcontigs. For each length range, there are 1,600 randomly sampled contigs from 20 genomes of 8 families. The reportedperformance is averaged on three such sets of contigs for each length range.Figure 5: The percentage of classiﬁed contigs and classiﬁcation accuracy (Eq.12) of each model on simulated contigs,which are randomly sampled from phage genomes. Each bar shows the percentage of classiﬁed contigs. The solid partshows the misclassiﬁcation rate and the bottom part with patterns represents the macro-accuracy.As shown in Fig. 6 and Fig. 7, PhaGCN outperforms other tools across different length, which is consistent with theconclusion in Section 3.2. We also ﬁnd that when the length of the contigs becomes shorter ([2kbp, 4kbp]), PhaGCN9

PREPRINT - F

EBRUARY

9, 2021Figure 6: The precision and recall (Eq. 13 and Eq. 14) of PhaGCN, vConTACT2.0, ClassiPhage, POGs and CNNmodel in PhaGCN on contigs that are assembled using MEGAHIT from simulated reads. 301 contigs assembled fromsimulated reads of 99 species are used as inputs. The numbers of contigs for each length range are 101, 107, 51, and 42,respectively.Figure 7: The percentage of classiﬁed contigs and classiﬁcation accuracy (Eq.12) of each model on contigs that areassembled using MEGAHIT from simulated reads. Each bar shows the percentage of classiﬁed contigs. The solid partshows the misclassiﬁcation rate and the bottom part with patterns represents the macro-accuracy.can still achieve over 80% accuracy. Although the reads simulated by ART-Illumina contain sequencing error, the10

PREPRINT - F

EBRUARY

9, 2021performance of PhaGCN still achieves high accuracy (100% for contigs over 8kbp). Because we only tested newlyreleased

Escherichia coli phages in 2020, the classiﬁcation performance of all methods were slightly better than theresults in Section 3.2.

The most resource demanding component in PhaGCN is the sequence alignment. We used it to produce the proteinclusters by conducting pairwise alignments between contigs and reference sequences. PhaGCN produces proteinclusters for each set of input contigs by assuming that they may contain novel proteins. In addition, alignment is alsoconducted for deﬁning the edge in the knowledge graph. Table 2 shows the average elapsed time of classifying 100contigs for each tool. PhaGCN is not the fastest program. Optimization can be applied to reduce the number of pairwisealignments. For example, we can produce a database of protein clusters and reduce the number of pairwise alignments.

Program PhaGCN vConTACT 2.0 ClassiPhage POGsElapsed time(min/100 contigs) 25 32 4 7

Table 2: The average elapsed time to predict labels of 100 contigs for each method. All the methods are run on Intel ® Xeon ® Gold 6258R CPU with 8 cores.

In this experiment, we searched for real sequencing data that contain

Caudovirales at NCBI SRA and downloaded twodatasets, SRR12949983 and SRR13132427. Then we used MEGAHIT to assemble reads into contigs on these twodatasets separately. To quantify the performance of phage classiﬁcation on these two datasets, we used the providedtaxonomic analysis by NCBI SRA as the ground truth. The phages in the two datasets provided by NCBI SRA are listedin Table 3. We used the same method introduced in Section 3.3 to label the contigs and removed all these genomes inTable 3 from the reference database before training PhaGCN. The contigs in the test data are listed in Table 4.SRR12949983 SRR13132427

Escherichia phage C5 Escherichia phage V18Salmonella phage C2 Escherichia virus FV3Serratia phage Pila Escherichia virus JES2013Escherichia virus E112 Escherichia phage CEC_Kaz_2018Escherichia virus ECML134 Escherichia phage SECphi18Escherichia virus T4 Escherichia phage vB_EcoS_PNS1

Table 3: Reference species in SRR12949983 and SRR13132427. These species are shown in the associated taxonomicanalysis at NCBI SRA. We use them as the ground truth to assign labels to the assembled contigs.

SRR12949983 SRR13132427

No. of contigs assembled by MEGAHIT (>2kbp)

24 (including 1 bacterial and 1 unknown) 20 (including 1 bacterial)

Phage family No. of contigs Phage family No. of contigs

Autographiviridae Myoviridae Myoviridae Siphoviridae Table 4: Contigs assembled by MEGAHIT on the two datasets and their family labels derived using alignment againstthe species in Table 3.We compared the phage labels assigned by PhaGCN, vConTACT, POGs, and ClassiPhage on the assembled contigs.Because the real sequencing data might contain bacteria sequences, we run DeepVirFinder [32] to reject the contigs thatbelong to bacteria. As shown in Fig. 8, there is only one bacterial contig in each dataset. All other contigs (43 contigs)were fed into the four tools for classiﬁcation.Fig. 8 reveals that PhaGCN achieves better classiﬁcation performance than vConTACT, POGs, and ClassiPhage.PhaGCN has 100% accuracy in both datasets. Many contigs could not be classiﬁed by vConTACT 2.0 (unlabeled in Fig.8). POGs and ClassiPhage are able to assign labels for all contigs but a number of them have wrong family labels.11

PREPRINT - F

EBRUARY

9, 2021In the dataset SRR121949983, there is one contig lacking ground truth because the contig cannot be aligned to anyreference genome in Table 3. We thus conducted homology search at NCBI against Nucleotide Collection (nr/nt)database. The search shows that this contig belongs to

Siphoviridae . We input this contig to the four tools and onlyPhaGCN assigned the correct label,

Siphoviridae , to this contig.Figure 8: Classiﬁcation result of PhaGCN, vConTACT 2.0, POGs, and ClassiPhage on SRR12949983(A) andSRR13132427(B).Figure 9: Composition analysis of SRR12949983(A) and SRR13132427(B). Left: composition analysis published atNCBI SRA. Right: composition analysis presented by PhaGCN.Furthermore, we showed the composition of these two datasets before and after using PhaGCN in Fig. 9. The resultsrevealed that our model can greatly improve the composition analysis for the dark matter. Thus, PhaGCN can beneﬁtmetagenomic analysis. 12

PREPRINT - F

EBRUARY

9, 2021

After validating PhaGCN on simulated datasets and two real sequencing datasets, we applied PhaGCN to contigsassembled from metagenomic data of oyster samples. The samples were collected by the co-author Dr. Jiang betweenApril, 2016 to July, 2019 from various sites along the coast of South China Sea. Metagenomic sequencing wasconducted from samples in the Gill, visceral mass, and mantle tissues of oyster using viral-like particle enrichment andprotocols in [33, 34]. There are about 2.5 billions of raw reads from 71 libraries.

Ackermannviridae Autographiviridae Demerecviridae Drexlerviridae Herelleviridae Myoviridae Podoviridae Siphoviridae unclassiﬁedvConTACT 0 0 0 0 0 50 116 102 22698PhaGCN 150 1727 301 74 32 5880 2682 6173 5767

Table 5: Prediction results of PhaGCN and vConTACT for contigs produced from oyster metagenomic data.After applying standard quality control and MEGAHIT with the default setting, there are about 3,375,091 contigs oflength above 500bp. After removing contigs that can be aligned to bacteria, archaea, eukaryota, we kept 22,966 contigswith length above 4,000bp as input to PhaGCN. Of them, 17,199 contigs can be assigned to

Caudovirales by PhaGCN.When users apply composition analysis for their samples, precision is important for generating valid hypothesis.Because vConTACT 2.0 has higher accuracy for what they can predict than ClassiPhage and POGs based on ourexperiments, we compared PhaGCN with vConTACT 2.0 in this experiment and summarized the results in Table 5.Although we don’t have the ground truth for this large-scale metagenomic sequencing data, the numbers of predictedcontigs are consistent with the results of experiments on simulated and real sequencing datasets. In addition the contigsthat can be assigned with labels by vConTACT 2.0 are signiﬁcantly less than PhaGCN. ~74.8% contigs are predicted byPhaGCN while ~1.1% contigs are predicted by vConTACT 2.0. The contigs predicted by vConTACT is a subset ofPhaGCN. Also, PhaGCN can identify more families from the dataset. Because the classiﬁcation accuracy of PhaGCNis more than 92% when the length of contigs is over 4kbp, the result can provide reliable family-level compositionanalysis for the oyster metagenomic data.

In this work, we demonstrate that PhaGCN can render better performance for novel phage classiﬁcation. The majorimprovement of our method stems from combined strength of the reference-based model and the learning-based modelusing the knowledge graph: the nodes contain automatically learned features from nucleotide sequences and the edgesare created by protein-based alignment. Then the semi-supervised GCN is applied on the knowledge graph to utilizeboth labeled and unlabeled data for training.Figure 10: (A): classiﬁcation result of three added families:

Rudiviridae , Microviridin , and

Inoviridae . (B): PhaGCNcan reject non-

Caudovirales phages.As

Caudovirales is the order with the most number of sequenced phages from RefSeq, we validated PhaGCN onclassifying families in this order. But PhaGCN can be conveniently extended to other taxa. We extended PhaGCN byadding families that contain at least 10 genomes. Using this criterion, three families (

Rudiviridae , Microviridin , and

Inoviridae ) were added. We used the leave-one-species-out method to choose testing species from these three familiesand use the same method introduced in Section 3.2 to generate contigs. The classiﬁcation results in Fig. 10(A) showthat PhaCN can correctly classify almost all of the contigs from the extended families.Using knowledge graph enables PhaGCN to detect targeted phage families, which is useful for applications where onlysome phages are of interest. Speciﬁcally, phages that are not in the training families usually won’t form edges with the13

PREPRINT - F

EBRUARY

9, 2021nodes in the graph and thus will not be mis-classiﬁed into

Caudovirales . We validated the detection ability of PhaGCNby testing whether PhaGCN can reject contigs that do not belong to

Caudovirales . We downloaded 4,218 phagegenomes that do not belong to

Caudovirales from RefSeq according to the ICTV taxonomic afﬁliation information. Foreach genome, we apply the same method introduced in Section 3.2 to generate 10 contigs for each of them. Thus, atotal of 42,180 contigs were tested. As shown in Fig. 10(B), only 3 of them are accepted (predicted) by PhaGCN. Thisexperiment demonstrates that PhaGCN can be applied for targeted phage detection.In summary, PhaGCN provides a useful tool for phage classiﬁcation. It can identify phage contigs from the metagenomicsamples with higher accuracy than other phage classiﬁcation tools.

Funding

This work was supported by Hong Kong RGC GRF 9042828 and HKIDS (9360163) and NSF of China (31972847).

References [1] Stephen McGrath, D van Sinderen, et al.

Bacteriophage: genetics and molecular biology.

Caister Academic Press,2007.[2] Lucía Fernández, Ana Rodríguez, and Pilar García. Phage or foe: an insight into the impact of viral predation onmicrobial communities.

The ISME journal , 12(5):1171–1179, 2018.[3] Bonnie L Hurwitz and Jana M U’Ren. Viral metabolic reprogramming in marine ecosystems.

Current opinion inmicrobiology , 31:161–168, 2016.[4] Catherine Loc-Carrillo and Stephen T Abedon. Pros and cons of phage therapy.

Bacteriophage , 1(2):111–114,2011.[5] Lin-Fa Wang and Meng Yu. Epitope identiﬁcation and discovery using phage display libraries: applications invaccine development and diagnostics.

Current drug targets , 5(1):1–15, 2004.[6] Justyna Bazan, Ireneusz Całkosi´nski, and Andrzej Gamian. Phage display—A powerful technique for im-munotherapy: 1. Introduction and potential of therapeutic applications.

Human vaccines & immunotherapeutics ,8(12):1817–1828, 2012.[7] Jing Liu, Mohammed Dehbi, Greg Moeck, Francis Arhin, Pascale Bauda, Dominique Bergeron, Mario Callejo,Vincent Ferretti, Nhuan Ha, Tony Kwan, et al. Antimicrobial drug discovery through bacteriophage genomics.

Nature biotechnology , 22(2):185–191, 2004.[8] Bas E JDutilh, Noriko Cassman, Katelyn McNair, Savannah E Sanchez, Genivaldo G Z Silva, Lance Boling,Jeremy J Barr, Daan R Speth, Victor Seguritan, Ramy K Aziz, Ben Felts, Elizabeth A Dinsdale, John L Mokili,and Robert A Edwards. A highly abundant bacteriophage discovered in the unknown sequences of human faecalmetagenomes.

Nature Communications , 5:4498, 2014.[9] Kira Moon, Ilnam Kang, Suhyun Kim, Sang-Jong Kim, and Jang-Cheon Cho. Genomic and ecological study oftwo distinctive freshwater bacteriophages infecting a Comamonadaceae bacterium.

Scientiﬁc reports , 8(1):1–9,2018.[10] Kira Moon, Jeong Ho Jeon, Ilnam Kang, Kwang Seung Park, Kihyun Lee, Chang-Jun Cha, Sang Hee Lee, andJang-Cheon Cho. Freshwater viral metagenome reveals novel and functional phage-borne antibiotic resistancegenes.

Microbiome , 8:1–15, 2020.[11] Kira Moon, Suhyun Kim, Ilnam Kang, and Jang-Cheon Cho. Viral metagenomes of Lake Soyang, the largestfreshwater lake in South Korea.

Scientiﬁc Data , 7(1):1–6, 2020.[12] Blanca Perez Sepulveda, Tamsin Redgwell, Branko Rihtman, Frances Pitt, David J. Scanlan, and Andrew Millard.Marine phage genomics: the tip of the iceberg.

FEMS Microbiology Letters , 363(15), 06 2016. fnw158.[13] Tasha M Santiago-Rodriguez and Emily B Hollister. Human virome and disease: high-throughput sequencingfor virus discovery, identiﬁcation of phage-bacteria dysbiosis and development of therapeutic approaches withemphasis on the human gut.

Viruses , 11(7):656, 2019.[14] Kevin P Keegan, Elizabeth M Glass, and Folker Meyer. MG-RAST, a metagenomics service for analysis ofmicrobial community structure and function. In

Microbial environmental genomics (MEG) , pages 207–233.Springer, 2016. 14

PREPRINT - F

EBRUARY

9, 2021[15] Dinghua Li, Chi-Man Liu, Ruibang Luo, Kunihiko Sadakane, and Tak-Wah Lam. MEGAHIT: an ultra-fastsingle-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.

Bioinformatics ,31(10):1674–1676, 2015.[16] David M Kristensen, Alison S Waller, Takuji Yamada, Peer Bork, Arcady R Mushegian, and Eugene V Koonin. Or-thologous gene clusters and taxon signature genes for viruses of prokaryotes.

Journal of bacteriology , 195(5):941–950, 2013.[17] Pakorn Aiewsakun, Evelien M Adriaenssens, Rob Lavigne, Andrew M Kropinski, and Peter Simmonds. Evaluationof the genomic diversity of viruses infecting bacteria, archaea and eukaryotes using a common bioinformaticplatform: steps towards a uniﬁed taxonomy.

The Journal of general virology , 99(9):1331, 2018.[18] Cynthia Maria Chibani, Anton Farr, Sandra Klama, Sascha Dietrich, and Heiko Liesegang. Classifying theunclassiﬁed: a phage classiﬁcation method.

Viruses , 11(2):195, 2019.[19] Forest Rohwer and Rob Edwards. The Phage Proteomic Tree: a genome-based taxonomy for phage.

Journal ofbacteriology , 184(16):4529–4535, 2002.[20] Ho Bin Jang, Fernand F Fagutao, Seong Won Nho, Seong Bin Park, In Seok Cha, Jong Earn Yu, Jung Seok Lee,Se Pyeong Im, Takashi Aoki, and Tae Sung Jung. Phylogenomic network and comparative genomics reveal adiverged member of the φ kz-related group, marine Vibrio phage φ JM-2012.

Journal of virology , 87(23):12866–12878, 2013.[21] Benjamin Bolduc, Ho Bin Jang, Guilhem Doulcier, Zhi-Qiang You, Simon Roux, and Matthew B Sullivan.vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria.

PeerJ ,5:e3243, 2017.[22] Ho Bin Jang, Benjamin Bolduc, Olivier Zablocki, Jens H Kuhn, Simon Roux, Evelien M Adriaenssens, J RodneyBrister, Andrew M Kropinski, Mart Krupovic, Rob Lavigne, et al. Taxonomic assignment of uncultivatedprokaryotic virus genomes is enabled by gene-sharing networks.

Nature biotechnology , 37(6):632–639, 2019.[23] Qiong Wang, George M Garrity, James M Tiedje, and James R Cole. Naive Bayesian classiﬁer for rapid assignmentof rRNA sequences into the new bacterial taxonomy.

Applied and environmental microbiology , 73(16):5261–5267,2007.[24] Jiayu Shang and Yanni Sun. CHEER: hierarCHical taxonomic classiﬁcation for viral mEtagEnomic data via deepleaRning.

Methods , 2020.[25] Thomas N Kipf and Max Welling. Semi-supervised classiﬁcation with graph convolutional networks. arXivpreprint arXiv:1609.02907 , 2016.[26] Tianyi Zhao, Yang Hu, Jiajie Peng, and Liang Cheng. DeepLGP: a novel deep learning method for prioritizinglncRNA target genes.

Bioinformatics , 36(16):4466–4472, 2020.[27] Tanvir Alam, Hamada RH Al-Absi, and Sebastian Schmeier. Deep Learning in LncRNAome: Contribution,Challenges, and Perspectives.

Non-coding RNA , 6(4):47, 2020.[28] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence speciﬁcitiesof DNA-and RNA-binding proteins by deep learning.

Nature biotechnology , 33(8):831–838, 2015.[29] Seokjun Seo, Minsik Oh, Youngjune Park, and Sun Kim. DeepFam: deep learning based alignment-free methodfor protein family modeling and prediction.

Bioinformatics , 34(13):i254–i262, 2018.[30] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of wordsand phrases and their compositionality. arXiv preprint arXiv:1310.4546 , 2013.[31] Weichun Huang, Leping Li, Jason R Myers, and Gabor T Marth. ART: a next-generation sequencing readsimulator.

Bioinformatics , 28(4):593–594, 2012.[32] Jie Ren, Kai Song, Chao Deng, Nathan A Ahlgren, Jed A Fuhrman, Yi Li, Xiaohui Xie, Ryan Poplin, and FengzhuSun. Identifying viruses from metagenomic data using deep learning.

Quantitative Biology , pages 1–14, 2020.[33] Hong-Ying Wei, Sheng Huang, Tuo Yao, Fang Gao, Jing-Zhe Jiang, and Jiang-Yong Wang. Detection of virusesin abalone tissue using metagenomics technology.

Aquaculture Research , 49(8):2704–2713, 2018.[34] Jiang Jingzhe and Wei Hongying. Isolation of Viral Like Particles (VLP) from Tissues of Molluscs. protocols.ioprotocols.io