[PDF] ProDOMA: improve PROtein DOMAin classification for third-generation sequencing reads using deep learning

Abstract

Motivation: With the development of third-generation sequencing technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce important insights into the biological functions of the underlying data. However, the high error rate in third-generation sequencing data raises a new challenge to established domain analysis pipelines. The state-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory accuracy of domain classification in third-generation sequencing data. New computational methods are still needed to improve the performance of domain prediction in long noisy reads. Results: In this work, we introduce ProDOMA, a deep learning model that conducts domain classification for third-generation sequencing reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate our problem as an open-set problem and thus our model can reject unrelated DNA reads such as those from noncoding regions. In the experiments on simulated reads of protein coding sequences and real reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification. In summary, ProDOMA is a useful end-to-end protein domain analysis tool for long noisy reads without relying on error correction. Availability: The source code and the trained model are freely available at this https URL. Contact: [email protected] http URL

Full PDF

PP RO DOMA:

IMPROVE

PRO

TEIN

DOMA

IN CLASSIFICATIONFOR THIRD - GENERATION SEQUENCING READS USING DEEPLEARNING

Nan Du

Dept. of Computer Science and EngineeringMichigan State UniversityEast Lansing, MI 48824, United States [email protected]

Jiayu Shang

Dept. of Electrical EngineeringCity University of Hong KongKowloon, Hong Kong SAR, China [email protected]

Yanni Sun

Dept. of Electrical EngineeringCity University of Hong KongKowloon, Hong Kong SAR, China [email protected]

September 29, 2020 A BSTRACT

Motivation:

With the development of third-generation sequencing technologies, people are able toobtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domainannotation without assembly, thus can produce important insights into the biological functions ofthe underlying data. However, the high error rate in third-generation sequencing data raises a newchallenge to established domain analysis pipelines. The state-of-the-art methods are not optimizedfor noisy reads and have shown unsatisfactory accuracy of domain classiﬁcation in third-generationsequencing data. New computational methods are still needed to improve the performance of domainprediction in long noisy reads.

Results:

In this work, we introduce ProDOMA, a deep learning model that conducts domainclassiﬁcation for third-generation sequencing reads. It uses deep neural networks with 3-frametranslation encoding to learn conserved features from partially correct translations. In addition, weformulate our problem as an open-set problem and thus our model can reject unrelated DNA readssuch as those from noncoding regions. In the experiments on simulated reads of protein codingsequences and real reads from the human genome, our model outperforms HMMER and DeepFamon protein domain classiﬁcation. In summary, ProDOMA is a useful end-to-end protein domainanalysis tool for long noisy reads without relying on error correction.

Availability:

The source code and the trained model are freely available athttps://github.com/strideradu/ProDOMA.

Contact: [email protected]

Third-generation sequencing technologies, such as Paciﬁc Biosciences single-molecule real-time sequencing (PacBio)and Oxford Nanopore sequencing (Nanopore), produce longer reads than next generation sequencing (NGS) technolo-gies. With increased read length, long reads can contain complete genes or protein domains, making gene-centricfunctional analysis for high throughput sequencing data more applicable [1, 2, 3]. In gene-centric analysis, often thereare speciﬁc sets of genes in pathways that are of special interest, for example G protein-coupled receptor (GPCR) genes a r X i v : . [ q - b i o . GN ] S e p PREPRINT - S

EPTEMBER

29, 2020in intracellular signaling pathways for environmental sensing, while other genes in the assemblies provide little insightto the speciﬁc questions.One basic step in gene-centric analysis is to assign sequences into different functional categories, such as families ofprotein domains (or domains for short), which are independent folding and functional units in a majority of annotatedprotein sequences. There are a number of tools available for protein domain annotation. They can be roughly dividedinto two groups depending on how they utilize the available protein domain sequences. One group of methods rely onalignments against the references. HMMER is the state-of-the-art proﬁle search tool based on proﬁle hidden Markovmodels (pHMM) [4, 5]. But the speed of the pHMM homology search suffers from the increase of the families.Extensive research has been conducted to improve the efﬁciency of the proﬁle homology search [6].The other group of tools are alignment-free [7]. Recent developments in deep learning have led to alignment-freeapproaches with automatic feature extraction [8, 9, 10, 11]. A review of some available methods and their applicationscan be found in [12]. Of the learning-based tools, the most relevant one to protein domain annotation is DeepFam[9], which used convolutional neural networks (CNN) to classify protein sequences into protein/domain families. Theauthors showed that it outperformed HMMER and previous alignment-free methods on protein domain classiﬁcation.Also, DeepFam is fast and the speed is not affected much by the number of families. For example, DeepFam is at leastten times faster than HMMER when 1,000 query sequences are searched against thousands of protein families [9]. Thusdeep learning-based methods have advantages for applications that do not need detailed alignments.Despite the success of existing protein domain annotation tools, they are not ideal choices for domain identiﬁcationin error-prone reads. In particular, most of these errors are insertions and deletions, which can cause frameshiftsduring translation [13]. Without knowing the errors and their positions, the frameshifts can lead to only short ornon-signiﬁcant alignments [14]. As the translation of each reading frame is partially correct, it also leads to poorclassiﬁcation performance for existing learning-based models.

As there are error correction tools for third-generation sequencing data [15, 13], an alternative pipeline is to applytools such as HMMER and DeepFam to error-corrected sequences. Error correction tools can be generally dividedinto hybrid and standalone depending on whether they need short reads for error correction. Recently, several groupsconducted comprehensive review and comparison of existing error correction tools [15, 13]. None of these tool canachieve optimal performance across all tested data sets. Particularly, the performance of standalone tools is profoundlyaffected by the coverage of the aligned sequences against the chosen backbone sequences. When the coverage is low(e.g. <50X for LoRMA [16]), less regions of the reads can be corrected. In addition, we found that error correction toolshave difﬁculty correcting mixed reads from homologous genes within the same family, such as GPCR. The similaritiesbetween different genes/domain sequences can confuse the error correction method.

In this work, we designed and implemented ProDOMA, a deep learning based method to predict the protein domains forthird-generation sequencing reads. By training a CNN-based model using 3-frame translation encoding, ProDOMA isable to classify error-containing reads into their correct domains with signiﬁcantly better accuracy. Compared to previousworks, ProDOMA has several merits. First, it does not require error correction. As a result, it has robust performancefor low coverage data. Second, although we use simulated PacBio reads as the training data, the experimental resultsshow that ProDOMA performs well on real PacBio and Nanopore data. Third, unlike previous deep learning works thatwere designed for classiﬁcation, ProDOMA can also be used for detection by distinguishing targeted domain homologsfrom irrelevant coding or non-coding sequences. The detection performance is better than HMMER after ProDOMAadopts a modiﬁed loss function from targeted image detection.The classiﬁcation accuracy of ProDOMA consistently outperformed the state-of-the-art method like HMMER andDeepFam across various error rates. And its performance is not affected by coverage. We tested it on real third-generation sequencing datasets, focusing on its function on detecting targeted domains using Outlier Exposure [17].ProDOMA achieved higher recall with comparable precision to HMMER.

Figure 1 sketches the architecture of ProDOMA. It is based on CNN because the convolutional ﬁlters can representmotifs, which are important for sequence classiﬁcation [9]. It incorporates 3-frame encoding to convert DNA reads intoa 3-channel tensor as the input to a neural network. From the input, ProDOMA automatically extracts features by using2

PREPRINT - S

EPTEMBER

29, 2020convolutional layer and max-over-time pooling layer. A classiﬁer with two fully connected layers was used to generatethe probabilities of the sequence against all input protein domains. To exclude the unrelated coding or non-coding DNAsequences, we trained the CNN using a modiﬁed loss function so that out-of-distribution samples tend to have uniformdistribution on softmax values.Figure 1: The overview of ProDOMA. The input sequence was translated and encoded to a 3-channel tensor. c i isdeﬁned in Equation (1). In the classiﬁcation task, the model directly outputs the family with the largest score asthe prediction. In the detection task, the maximum of softmax score needs to compare with a speciﬁed threshold todetermine whether the input contains a trained domain family or should be rejected. With frequent insertion and deletion errors, the correct translation of a read is composed of fragments of differentreading frames. In order to train the CNN to learn the conserved motifs from erroneous translations, we implementedand compared multiple encoding methods (see Section 2.3). The empirical results show that the 3-frame encodingscheme achieved the best performance. In this scheme, each DNA sequences is translated into 3 protein sequencesusing 3 reading frames. To accommodate domain identiﬁcation on the reverse strand, the reverse complement of thesequence can be used as input. All the residues in the translated sequence are one-hot encoded using a 21-dimensionalvector following IUPAC amino acid code notation. Then we combine three matrices into a single 3-channel tensor likean RGB image.Given a translated sequence of length n , the encoded input is a tensor with size × n × . The pseudo-code of 3-frameencoding can be found in Algorithm 1. Algorithm 1

Input:

DNA sequence x with length L , peptide to index dictionary idx , peptide alphabet size | Σ | , output sequencelength n . Output:

Input tensor for neural networks with size × n × . Initialize an array arr with dimensions × n × | Σ | with all 0 for i = 1 to do x i ← x [ i :] y i ← translation of x i (cid:46) translate x i into y i for residue a at position k in y i do if k ≤ n then arr [ i, k, idx [ a ]] ← (cid:46) one-hot encoding for each frame end if end for end for arr is the input tensor for neural networks 3 PREPRINT - S

EPTEMBER

29, 2020

ProDOMA consists of two convolutional layers, one max-over-time pooling layer, one hidden linear layer, and onelinear output layer with the softmax function. For a multi-channel input that we have from 3-frame encoding, wetransform the output arr from Algorithm 1 into a feature value using the following equation. c i = f ( (cid:88) j =1 w j · arr [ j ][ i : i + h − | Σ | ] + b ) (1) b is the bias term and h is the ﬁlter size. f is the activation function ReLU [18]. The ﬁlter consists of three 2D matrices w j for j = 1 , , and , corresponding to three reading frames. arr [ j ][ i : i + h − | Σ | ] deﬁnes a 2D window ofsize h × | Σ | for the one-hot matrix with reading frame j . We applied ﬁlters repeatedly to each possible window ofthe input one-hot matrix to produce the feature map. Then the max-over-time pooling is applied to the feature map tocapture the maximum value max( c i ) as the feature from this particular ﬁlter. The max-over-time pooling is ﬂexiblewith different input length.Algorithm 1 described how a single ﬁlter in the convolutional layer works. In our application, we used multiple ﬁlterswith various sizes to extract features of different lengths.ProDOMA has two convolutional layers. The ﬁrst convolutional layer uses consistent ﬁlter size to extract low-levelmotif-like patterns directly from 3-frame encoding input. Then second convolutional layer extracts high-level, intricatepatterns with varying distance from the output of the ﬁrst convolutional layer. By repeatedly applying the operations,we can ﬁnally generate a feature map. Then the max-over-time pooling was applied to keep the most important features.Dropout [19] is also used after pooling to prevent overﬁtting and to learn robust features. A two-layer classiﬁer withsoftmax function transfers the features to a vector of probabilities over each label. For classiﬁcation, we select the labelwith the maximum probability as the prediction from ProDOMA. We also tested other encoding methods with similar model structure to ProDOMA: (1) DNA one-hot encoding, whichdirectly transfers DNA sequence to one hot encoding matrix of size L × . For a fair comparison, we used ﬁlter sizes thatare 3 times as long as we used for 3-frame encoding; (2) 3-branch model, where we constructed a network architecturewith three branches processing each of the 3-frame translated protein sequence separately. Each of the branches consistsof identical convolution layers, and all the parameters are shared between the same layer of 3 branches. In other words,the Equation (1) becomes c i ( j ) = f ( w j · arr [ j ][ i : i + h − | Σ | ] + b ) for j = 1 to . In the 3-branch model,each branch models the translated protein sequences separately before the merging layer right before the two-layerclassiﬁer. In contrast, in our 3-frame encoding, all three translated protein sequences were processed and combined bythe 3-channel convolution ﬁlter in the ﬁrst convolutional layer.Our experimental results show that 3-frame encoding is a better encoding scheme, possibly because it can effectivelyencode the original DNA sequence information and also helps convolutional ﬁlters extract useful features for predictionof the protein domains (See results in Section 3.1.1). In addition, our experiments show that changing the order of theinput reading frames does not affect the classiﬁcation accuracy. We have described how ProDOMA predicts the domain labels for given DNA reads using CNN and softmax. However,with the close-set property of softmax, the classiﬁer will always assign a label for the input sample, even if the inputis not related to any label in the model (we call such inputs out-of-distribution samples, compared to in-distributionsamples). For example, in RNA-Seq data, not every read encodes targeted domain families in the model. In realapplications, this close-set property will lead to an undesired high false-positive rate. To address the problem, we adoptOutlier Exposure (OE) [17] with a threshold on softmax prediction probability [20] to distinguish the out-of-distributioninputs from in-distribution ones.

Usually, the samples from the out-of-distribution dataset tend to have small softmax values because their normalizedprobabilities are more uniformly distributed than the samples from the in-distribution dataset.Following [20], we extracted the maximum value of the softmax probability vector from the output of ProDOMAfor each input sample. We separated the in-distribution samples from the out-of-distribution samples by specifying4

PREPRINT - S

EPTEMBER

29, 2020a threshold on the maximum softmax probability. A holdout dataset with both in-distribution and out-of-distributionsamples was used to empirically determine the best threshold that can produce the largest F score: F = 2 · precision · recallrecall + precision .Then this learned softmax threshold is used to reject any sample with smaller softmax values. The performance of thisbaseline model is shown in the Figure 2(A). Figure 2: The histograms of maximum softmax values for in-distribution and out-of-distribution samples from basemodel ( A ) and model with Outlier Exposure ( B ). “In correct”: in-distribution samples with correct classiﬁcation. “InIncorrect”: in-distribution samples with incorrect classiﬁcation.To further improve the performance of the out-of-distribution sample detection, we adopt the Outlier Exposure (OE)method introduced by [17]. As we discussed previously, we expect the out-of-distribution samples to have uniformlydistributed softmax probabilities for all classes. However, as such inputs were never processed in training, sometimesthe model will yield unexpected high conﬁdence prediction for out-of-distribution inputs (Figure 2(A)). To address theproblem, we expose the model to out-of-distribution samples in the training process to let the model effectively learnthe heuristics for detecting out-of-distribution inputs. Compared with the threshold baseline, we need to introduce anew dataset with out-of-distribution samples in the training process.Given a model g and the learning objective L , the objective of OE is to minimize the original loss function with anauxiliary loss term to regularize the out-of-distribution examples. OE can be formulated as minimizing the followingobjective [17]: E ( x,y ) ∼D in [ L ( g ( x ) , y ) + λ E x (cid:48) ∼D out [ L OE ( g ( x (cid:48) ) , g ( x ) , y )]] (2) L OE = − (cid:88) i P ( i ) ln Q ( i ) P ( i ) (3) D in is the original in-distribution dataset, D out is the out-of-distribution dataset for OE. In the original classiﬁcation task,we use the cross-entropy loss function L . In order to force the out-of-distribution samples to have uniform distributionon all labels, we minimize the KL-divergence between out-of-distribution and the uniform distribution as shown inEquation (3). P (i) is the predicted distribution of out-of-distribution samples from the model and Q (i) is a normalizeduniform distribution. In the experiment, we use λ = 0 . for the coefﬁcient of the auxiliary loss. More detailed andcomprehensive description of OE can be found in the original publication [17].Figure 2 presents the distribution of the maximum softmax score for each input sequence with and without OE for thethreshold calibration dataset we used in Section 3.2. Without OE, there are still a lot of out-of-distribution samples withlarge softmax scores (0.5 to 1). With OE, most of the out-of-distribution samples accumulate with small softmax scores(0 to 0.4). With OE, the overlapping area between the two distributions (red vs combined green and blue) is decreasedfrom 26.06% to 21.99%. In addition, for the in-distribution samples with small softmax values, their classiﬁcationresults tend to be wrong (blue in Figure 2). Thus, using OE can provide better classiﬁcation accuracy at a cost ofrejecting some in-distribution samples. 5 PREPRINT - S

EPTEMBER

29, 2020

To evaluate ProDOMA, we applied ProDOMA on both simulated and real datasets: a simulated PacBio G protein-coupled receptor (GPCR) coding sequences (CDS) dataset [7], and two real third-generation sequencing datasets ofhuman genome [21, 22]. GPCR is a large protein family that is involved in many critical physiological processes, suchas visual sense, gustatory sense, sense of smell, regulation of immune system activity, and so on [23]. In addition, GPCRis a very diverse set of protein sequences and thus can pose challenges for classiﬁcation. It consists of 8,222 proteinsequences belonging to 5 families, 38 subfamilies, and 86 sub-subfamilies. Following DeepFam, all the experiments areconducted on the sub-subfamilies.We compared the performance of ProDOMA with HMMER and DeepFam, which are representatives of alignment-basedand alignment-free domain classiﬁcation tools. In both experiments, ProDOMA was trained with simulated PacBioreads from the GPCR CDS downloaded from NCBI. The simulation was conducted using a popular simulation toolPBSIM [24] with default setup and error rates from 1% to 15%. Following their instructions and design principle,HMMER and DeepFam were trained using the correct protein sequences in the GPCR dataset.In our ﬁrst experiment, we tested ProDOMA and its alternative implementations on simulated PacBio reads. In thesecond experiment, we tested ProDOMA on real PacBio and Nanopore reads from human data. All speciﬁc commands,parameters, and output of our experiments can be found along with the source code of ProDOMA.

The reference coding sequences of each sub-subfamily are divided into 80% training samples and 20% test samples.The number of reference sequences in each class is shown in Table S1 in Supplementary File 1. Then we used PBSIM togenerate simulated PacBio reads with 80X coverage on the plus strand for training and test samples with speciﬁed errorrates. As a result, the training set has 939,888 simulated reads, and the test dataset has 228,388 simulated reads for 86sub-subfamilies, respectively. Our strategy of conducting simulation after splitting the coding sequences can guaranteethat there is no overlap of the GPCR CDS sequences between the training and test datasets, which is important formeaningful evaluations. In our experiments, we used 5-fold cross validation. Thus, the above training and testingdataset construction process was repeated for ﬁve times.

We conducted a series of experiments by varying the key opponents in our base models: the number of convolutionallayers, the number of convolution ﬁlters, the size of convolution ﬁlters, and different encoding strategies. Totally, wecompared 14 different combinations of hyperparameters or architectures in the experiments. We listed all variationsand their accuracy in Table S2 and Figure S1 in the supplementary ﬁle, respectively. As shown in Figure S1, thehighest accuracy is 86.74%, which is achieved by using 3-frame encoding with two convolution layers. Based on thecomparision, the key factors affecting the performance are the encoding strategies and the size of ﬁlters.

Reads starting from different positions in the same transcript can have different reading frames corresponding to thesame translation. In our model training process, the three channels always take translations of reading frame 1, 2,and 3 of a read as input. It is thus fair to ask whether this speciﬁc order affects the classiﬁcation performance. Weinvestigate this question by inputting different orders of reading frames of test sequences to our trained model. Thus,we generated 6 inputs from each reads with different frame orders. As a result, 1,370,328 validation samples are testedin the experiment. Figure 3 shows the classiﬁcation accuracy of 5-fold cross validation using different reading frameorders as input. The performance keeps nearly the same.

Following the design principles and the instructions of HMMER and DeepFam, the training of HMMER and DeepFamwas conducted using correct protein sequences, rather than DNA sequences. The test sequences are simulated longreads from reference CDs. Their three-frame translations are used as input to HMMER and DeepFam. As long as oneof the three translated sequences is classiﬁed to the correct sub-subfamily, we call this a correct prediction.The classiﬁcation accuracy of ProDOMA and DeepFam was measured using 5-fold cross-validation. As it is tediousto perform 5-fold cross validation for HMMER, we used all 5-fold correctly translated protein sequences to train thepHMM model, which will favor HMMER as the trained model has seen the test sequences. MAFFT [25] was used togenerate the multiple sequence alignment for each sub-subfamily. Then we used hmmbuild in the HMMER package to6

PREPRINT - S

EPTEMBER

29, 2020Figure 3: The mean, min, and max value of classiﬁcation accuracy using different orders of reading frames as input.X-axis: order of reading frames as input. Y-axis: classiﬁcation accuracy.build pHMM models for each sub-subfamily. For each test DNA sequence, 3-frame translations were applied to getthree peptide sequences. All the translated sequences were tested using hmmscan against all 86 pHMM models we built.Figure 4: The mean, min, and max value of classiﬁcation results of ProDOMA, HMMER, and DeepFam on classifyingfour sets of simulated long reads with different error rates. Each test set contains roughly 228K simulated reads from 86sub-subfamilies. X-axis: different error rates. Y-axis: classiﬁcation accuracy.Figure 4 compared the classiﬁcation performance of all methods on the simulated PacBio reads. For this data set, ourmethod achieved better performance for datasets with different error rates. The high error rates heavily impacted theperformance of HMMER and DeepFam. It is expected because the proﬁle HMM search is much more sensitive toframeshifts caused by gaps, and DeepFam is designed for classifying relatively complete error-free protein sequences.7

PREPRINT - S

EPTEMBER

29, 2020

As there are error correction tools for long reads [15], existing domain classiﬁcation tools such as HMMER can beapplied on error-corrected reads. We conducted an experiment to test whether family classiﬁcation using correctedreads can achieve comparable performance to classiﬁcation of error-free sequences.Hybrid error correction tools tend to perform better than self-correction tools. However, not every sequence projecthas budget and manpower to generate both short read and long read libraries and data. Based on these recent reviews[15, 13], we chose LoRMA [16], a more recently published self-error correction tool, with the lowest error rate aftercorrection [15]. Note that LoRMA failed to generate outputs if we use all the simulated reads as input, probably becauseof the similarity between the homologous GPCR sequences or the large graph produced by the reads. Thus, we runLoRMA for each set of reads simulated from the same reference sequence of GPCR in order to achieve the best errorcorrection performance. We showed the error correction performance in Table S3 in Supplementary File 1. Althoughthe number of reads remained after error correction increased with the increase of the coverage, it still discarded alarge number of reads (e.g. 75.9% reads are discarded when the coverage is 30x). Also, the error correction becamesigniﬁcantly slower with the increase of the coverage.

All simulated reads Corrected readsProDOMA HMMER DeepFam ProDOMA HMMER DeepFamBest Average Best Average10x 87.12% 62.12% 25.06% 11.07% 95.05% 97.42% 69.06% 32.35%20x 86.47% 63.25% 25.01% 10.98% 94.94% 97.59% 77.36% 37.31%30x 86.69% 62.56% 25.18% 11.02% 94.79% 98.16% 80.76% 38.41%

Table 1: The classiﬁcation accuracy comparison between ProDOMA, HMMER and DeepFam on corrected reads. Theﬁrst column is the coverage.We recorded the domain classiﬁcation results for corrected reads in Table 1. HMMER achieved the best accuracyfor corrected reads. ProDOMA achieved comparable accuracy on corrected reads. With or without error correction,ProDOMA’s performance is consistent with the change of the coverage. Without relying on error correction, ProDOMAcan conduct domain classiﬁcation for all reads and thus lead to a more accurate estimation of the domain/familyabundance. As DeepFam will assign a family to each translation, it is unknown which one should be chosen in practicalapplications. “Best” means that we regard a read as a correct classiﬁcation as long as one of the translation is correctlyclassiﬁed. “Average” is the performance averaged on three translations.

As GPCR is a large protein family, some of their coding sequences can show high diversity, posing a challenge forboth alignment-based and alignment-free homology search. We plot the change of F1 score of each protein family withthe change of the intra-class identity in Figure 5. As DeepFam’s performance is inferior to others, we did not includeDeepFam in this experiment. The intra-class identities were calculated using the alistat tool provided in the HMMERpackage. The sub-subfamilies with low identities are more likely to have remote homologs. F1 score is the harmonicmean of recall and precision. For one protein family A, let its true member sequences be set A + and the predictedsequence set be A pred . The recall is thus | A + ∩ A pred || A + | . Precision is | A + ∩ A pred || A pred | .Figure 5 shows that both HMMER and ProDOMA suffer from low intra-class identity but the performance of ProDOMAdeceases slower with the decrease of the identity. One possible reason is that deep learning can learn more degeneratefeatures that are hard to model by HMMs. With a large amount of data generated by third-generation sequencing platforms, we require both high accuracy andefﬁciency for the algorithms. We run ProDOMA, DeepFam, and HMMER using Intel R (cid:13) Xeon R (cid:13) Gold 6148 CPU with20 cores at the High-Performance Computing Center at Michigan State University. We also tested ProDOMA andDeepFam with NVIDIA R (cid:13) Tesla R (cid:13) V100 GPU with Apex acceleration library (HMMER doesn’t support GPU). For eachmethod, We measured its execution time by averaging 5 independent trials with randomly selected 10,000 sequences.In CPU, HMMER with the default setup runs much faster than ProDOMA and DeepFam. One reason is that with highsequencing error rates, the alignment against many candidate sub-subfamilies cannot pass the ﬁlter stage of HMMER,8

PREPRINT - S

EPTEMBER

29, 2020Figure 5: Scatter plot and their logistic regression curves for all sub-subfamilies. The error bar (regions surrounding thecurves) is the 95% conﬁdence interval of the ﬁtting. Y-axis: F1 score of the test sequences. X-axis: intra-class identity.

Setup ProDOMA DeepFam HMMER HMMER − max CPU 1168.78s 276.74s 312.13s 3470.04sGPU 25.71s 20.37s unavailable unavailable

Table 2: The average elapsed time to predict sub-subfamily labels of 10,000 simulated PacBio reads for each method.9

PREPRINT - S

EPTEMBER

29, 2020skipping the expensive pHMM alignment. By turning off all ﬁlters, the sensitivity of HMMER increases, but at a largecost in speed. With --max (turning off all ﬁlters), HMMER is much slower than deep learning-based methods. WithGPU acceleration, the running time of ProDOMA is much shorter than the running time of HMMER with the defaultsetup.

To evaluate ProDOMA’s performance on real third-generation sequencing dataset, we tested ProDOMA on the

H.sapiens

10x Coverage data from PacBio [22] and Oxford Nanopore Human Reference Datasets Rel6 [26]. In thisexperiment, as the real genome sequencing data contains non-GPCR coding sequences, we will also test the performanceof ProDOMA on detecting out-of-distribution samples.

Training dataset.

The training dataset is the same as the previous experiments: long reads simulated using PBSIM.To apply Outlier Exposure, we constructed a dataset that mixed the previous 5-fold training dataset with an outlierdataset. In order to generate the outlier dataset with similar distributions to the real out-of-distribution samples, wesimulated a PacBio human genome dataset from GRCh37/hg19 human reference genome [27]. Then we kept thesimulated reads that cannot be aligned to any GPCR CDS by BLASR in the outlier dataset [28]. As a result, the outlierdataset has 800,000 simulated reads. Then we retrained ProDOMA with OE as discussed in Methods.

Test dataset.

We used two test datasets: a PacBio RS II test dataset from PacBio SMRT Sequencing for CHM1TERThuman cell line; and a Nanopore test dataset from Oxford Nanopore MinION on CEPH1463 (NA12878/GM12878,Ceph/Utah pedigree).We determine the ground truth of these test reads using sequence similarity, which favors alignment-based tools. Weﬁrst aligned all reads against GPCR CDS dataset using BLASR. We extracted the reads with alignment length longerthan 60% of aligned CDS sequence as our in-distribution test samples. For each sample, the ground truth is given bythe label of aligned CDS.We also randomly selected reads that were not aligned to any GPCR CDS as our out-of-distribution samples. As aresult, the dataset consists of 50% reads that are coding GPCR and 50% out-of-distribution reads. For reads longerthan 3,000 bps in the test dataset, we cropped each read into fragments that are at most 3,000 bps to be consistent withtraining dataset. As a result, the PacBio test dataset has 12,716 reads, and Nanopore test dataset has 43,654 reads,including both the GPCR CDS and the out-of-distribution samples.

Method PacBio NanoporeRecall Precision F1 score Recall Precision F1 scoreHMMER 0.1494 0.9507 0.2583 0.3984

Table 3: The performance of protein domain prediction with out-of-distribution examples using ProDOMA with OutlierExposure (OE), and HMMER on the real PacBio and Nanopore dataset.As the purpose of this experiment is to evaluate the performance of GPCR CDS detection, we computed the recall,precision, and F1 score of each tool on labeling reads from the 86 classes. Recall is the ratio of correctly predictedreads to the total number of reads from the 86 classes. Precision is the ratio of the reads from the 86 classes to thetotal number of reads predicted with the 86 class labels. F1 score is the harmonic mean of recall and precision and wereported micro-F1 score for our multi-classiﬁcation model.We benchmarked the OE model with HMMER, which is highly accurate in distinguishing protein domains from othersequences. As shown in Table 3, both methods have low recall but high speciﬁcity because many reads are rejected andnot classiﬁed into any of the 86 classes. Still, ProDOMA with OE achieved signiﬁcant improvement on recall while theprecision is comparable with HMMER (Table 3). In general, both methods have better performance on the Nanoporedataset. Note that in the whole pipeline, we only used simulated PacBio reads for training. This result suggests that ourstrategy is robust with different types of long reads. 10

PREPRINT - S

EPTEMBER

29, 2020

In this work, we showed that ProDOMA can render better accuracy for domain classiﬁcation in long noisy reads. Themajor differences between the architecture of ProDOMA and other CNN-based sequence classiﬁcation models arethe coding strategy designed for erroneous reads and the modiﬁed loss function to reject out-of-distribution samples.In order to interpret why this 3-frame encoding works, we extracted sequences corresponding to the most frequentlyactivated ﬁlters, which often represent the well-conserved motifs. Then we used Weblogo [29] to generate the logosfrom their alignments as shown in the supplementary ﬁle (Figure S3 to Figure S8). An example is provided in Figure 6.Because we translated the original input DNA sequences into 3 peptide sequences, we have 3 logos associated with 3reading frames respectively. These ﬁgures show that the logos corresponding to the three ﬁlters are very similar. This isconsistent with Figure 3, revealing that most ﬁlters for different channels are learning conserved motifs from error-freeregions.Figure 6: Logos derived from most frequently activated convolutional ﬁlters (index: 1033) for Cholecystokinin.As the model is trained using sequences with 3,000 nt, padding is applied to all inputs shorter than 3,000. Sequenceslonger than 3,000 will be cut into substrings of length 3,000, which will then be fed into the model. Thus our CNNmodel cannot accurately detect the start and end positions of domains. Instead, a model that can accurately assignweights for each base is needed for detecting the entering and existing positions of each domain. We plan to exploredeep learning-based annotation our future work.In summary, ProDOMA provides a complementary tool to current third-generation sequence analysis pipelines ongene-centric function analysis. It can directly identify protein domains in long noisy reads without relying on errorcorrection and its performance is robust to low coverage data and can tolerate higher error rates than other domainclassiﬁcation tools.

Funding

The computation was conducted at HPCC of Michigan State University. This work has been supported by the CityUniversity of Hong Kong 7200620.

References [1] Fuhao Zhang, Hong Song, Min Zeng, Yaohang Li, Lukasz Kurgan, and Min Li. Deepfunc: a deep learningframework for accurate prediction of protein functions from protein sequences and interactions.

Proteomics ,19(12):1900019, 2019. 11

PREPRINT - S

EPTEMBER

29, 2020[2] Nguyen Quoc Khanh Le and Van-Nui Nguyen. SNARE-CNN: a 2D convolutional neural network architecture toidentify SNARE proteins from high-throughput sequencing data.

PeerJ Computer Science , 5:e177, 2019.[3] Jiajun Hong, Yongchao Luo, Yang Zhang, Junbiao Ying, Weiwei Xue, Tian Xie, Lin Tao, and Feng Zhu. Proteinfunctional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by asequence-based deep learning.

Brieﬁngs in bioinformatics , 21(4):1437–1447, 2020.[4] Sean R. Eddy. Proﬁle hidden Markov models.

Bioinformatics (Oxford, England) , 14(9):755–763, 1998.[5] Sara El-Gebali, Jaina Mistry, Alex Bateman, Sean R Eddy, Aurélien Luciani, Simon C Potter, Matloob Qureshi,Lorna J Richardson, Gustavo A Salazar, Alfredo Smart, et al. The Pfam protein families database in 2019.

Nucleicacids research , 47(D1):D427–D432, 2018.[6] Sean R. Eddy, Travis J. Wheeler, and the HMMER development team. HMMER 3.1b2, 2015.[7] Matthew N Davies, David E Gloriam, Andrew Secker, Alex A Freitas, Miguel Mendao, Jon Timmis, and Darren RFlower. Proteomic applications of automated GPCR classiﬁcation.

Proteomics , 7(16):2800–2814, 2007.[8] Shumin Li, Junjie Chen, and Bin Liu. Protein remote homology detection based on bidirectional long short-termmemory.

BMC bioinformatics , 18(1):443, 2017.[9] Seokjun Seo, Minsik Oh, Youngjune Park, and Sun Kim. DeepFam: deep learning based alignment-free methodfor protein family modeling and prediction.

Bioinformatics , 34(13):i254–i262, 2018.[10] Brandon Carter, Maxwell Bileschi, Jamie Smith, Theo Sanderson, Drew Bryant, David Belanger, and Lucy JColwell. Critiquing protein family classiﬁcation models using sufﬁcient input subsets.

Journal of ComputationalBiology , 2019.[11] Da Zhang and Mansur Kabuka. Protein Family Classiﬁcation from Scratch: A CNN based Deep LearningApproach.

IEEE/ACM Transactions on Computational Biology and Bioinformatics , 2020.[12] Andrzej Zielezinski, Hani Z. Girgis, Guillaume Bernard, Chris-Andre Leimeister, Kujin Tang, Thomas Dencker,Anna Katharina Lau, Sophie Röhling, Jae Jin Choi, Michael S. Waterman, Matteo Comin, Sung-Hou Kim, SusanaVinga, Jonas S. Almeida, Cheong Xin Chan, Benjamin T. James, Fengzhu Sun, Burkhard Morgenstern, andWojciech M. Karlowski. Benchmarking of alignment-free sequence comparison methods.

Genome Biology ,20(1):144, 2019.[13] Shuhua Fu, Anqi Wang, and Kin Fai Au. A comparative evaluation of hybrid error correction methods forerror-prone long reads.

Genome biology , 20(1):26, 2019.[14] Nan Du and Yanni Sun. Improve homology search sensitivity of PacBio data by correcting frameshifts.

Bioinfor-matics , 32(17):i529–i537, 2016.[15] Leandro Lima, Camille Marchet, Ségolène Caboche, Corinne Da Silva, Benjamin Istace, Jean-Marc Aury, HélèneTouzet, and Rayan Chikhi. Comparative assessment of long-read error correction software applied to NanoporeRNA-sequencing data.

Brieﬁngs in bioinformatics , 21(4):1164–1181, 2020.[16] L Salmela, R Walve, and E Rivals. Accurate self-correction of errors in long reads using de Bruijn graphs.

Bioinformatics , 33(6):799–806, 2017.[17] Dan Hendrycks, Mantas Mazeika, and Thomas G Dietterich. Deep Anomaly Detection with Outlier Exposure. In

International Conference on Learning Representations , 2019.[18] Vinod Nair and Geoffrey E Hinton. Rectiﬁed linear units improve restricted boltzmann machines. In

Proceedingsof the 27th international conference on machine learning (ICML-10) , pages 807–814, 2010.[19] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simpleway to prevent neural networks from overﬁtting.

The journal of machine learning research , 15(1):1929–1958,2014.[20] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassiﬁed and out-of-distribution examples inneural networks. arXiv preprint arXiv:1610.02136 , 2016.[21] Mark JP Chaisson, John Huddleston, Megan Y Dennis, Peter H Sudmant, Maika Malig, Fereydoun Hormozdiari,Francesca Antonacci, Urvashi Surti, Richard Sandstrom, Matthew Boitano, et al. Resolving the complexity of thehuman genome using single-molecule sequencing.

Nature , 517(7536):608–611, 2015.[22] Paciﬁc Biosciences. H. sapiens 10x Sequence Coverage with PacBio data, 2014.[23] B Trzaskowski, D Latek, S Yuan, U Ghoshdastider, A Debinski, and S Filipek. Action of molecular switches inGPCRs-theoretical and experimental studies.

Current medicinal chemistry , 19(8):1090–1109, 2012.[24] Yukiteru Ono, Kiyoshi Asai, and Michiaki Hamada. PBSIM: PacBio reads simulator—toward accurate genomeassembly.

Bioinformatics , 29(1):119–121, 2012. 12

PREPRINT - S

EPTEMBER

29, 2020[25] Kazutaka Katoh and Daron M Standley. MAFFT multiple sequence alignment software version 7: improvementsin performance and usability.

Molecular biology and evolution , 30(4):772–780, 2013.[26] Miten Jain, Sergey Koren, Karen H Miga, Josh Quick, Arthur C Rand, Thomas A Sasani, John R Tyson, Andrew DBeggs, Alexander T Dilthey, Ian T Fiddes, et al. Nanopore sequencing and assembly of a human genome withultra-long reads.

Nature biotechnology , 36(4):338, 2018.[27] Deanna M Church, Valerie A Schneider, Tina Graves, Katherine Auger, Fiona Cunningham, Nathan Bouk,Hsiu-Chuan Chen, Richa Agarwala, William M McLaren, Graham RS Ritchie, et al. Modernizing referencegenome assemblies.

PLoS biology , 9(7):e1001091, 2011.[28] Mark J Chaisson and Glenn Tesler. Mapping single molecule sequencing reads using basic local alignment withsuccessive reﬁnement (BLASR): application and theory.

BMC bioinformatics , 13(1):238, 2012.[29] Gavin E Crooks, Gary Hon, John-Marc Chandonia, and Steven E Brenner. WebLogo: a sequence logo generator.