i6mA-CNN: a convolution based computational approach towards identification of DNA N6-methyladenine sites in rice genome
Ruhul Amin, Chowdhury Rafeed Rahman, Md. Sadrul Islam Toaha, Swakkhar Shatabda
ii6mA-CNN: a convolution based computational approachtowards identification of DNA N6-methyladenine sites inrice genome
Ruhul Amin, Chowdhury Rafeed Rahman, Md. Sadrul Islam Toaha,Swakkhar Shatabda
Department of Computer Science and Engineering, United International University,Dhaka, 1207, Bangladesh
Abstract
DNA N6-methylation (6mA) in Adenine nucleotide is a post replicationmodification and is responsible for many biological functions. Experimentalmethods for genome wide 6mA site detection is an expensive and manuallabour intensive process. Automated and accurate computational meth-ods can help to identify 6mA sites in long genomes saving significant timeand money. Our study develops a convolutional neural network based tooli6mA-CNN capable of identifying 6mA sites in the rice genome. Our modelcoordinates among multiple types of features such as PseAAC inspired cus-tomized feature vector, multiple one hot representations and dinucleotidephysicochemical properties. It achieves area under the receiver operatingcharacteristic curve of 0.98 with an overall accuracy of 0.94 using 5 foldcross validation on benchmark dataset. Finally, we evaluate our model ontwo other plant genome 6mA site identification datasets besides rice. Resultssuggest that our proposed tool is able to generalize its ability of 6mA siteidentification on plant genomes irrespective of plant species. Web tool forthis research can be found at: https://cutt.ly/Co6KuWG . Supplementarydata (benchmark dataset, independent test dataset, comparison purposedataset, trained model, physicochemical property values, attention mecha-nism details for motif finding) are available at https: // cutt. ly/ PpDdeDH . Keywords:
N6-methyladenine site, Convolutional Neural Network, feature ∗ Corresponding author ∗∗ Corresponding author
Email addresses: [email protected] (Chowdhury Rafeed Rahman), [email protected] (Swakkhar Shatabda)
Preprint submitted to Genomics August 12, 2020 a r X i v : . [ q - b i o . GN ] A ug election
1. Introduction
N4-methylcytosine, N6-methyladenine and 5-methylcytosine exist in DNAof different species [1]. DNA N -methylation is a part of epigenetics wheremethyl groups are transferred to DNA molecules [2]. It is one kind of postreplication modification. Such modification has been identified in three king-doms of life such as Bacteria, Archaea, and Eukaryotes [3]. N -methylationoccurs when methyl group is transferred to the 6 th position of purine ringof Adenine nucleotide. Such modification is responsible for discriminationbetween original DNA strand and newly synthesized DNA strand, gene tran-scription regulation, replication and repair [4].Laboratory based experimental methods such as sequencing (SMRT-seq,MeDIP-seq) [5], methylated DNA Immunoprecipitation [6], capillary elec-trophoresis and laser-induced fluorescence [7] have been proposed to identify N -methyladenine (6mA) sites in the genome. 6mA profile of rice has beenobtained recently using mass spectrometry analysis and 6mA Immunopre-cipitation followed by sequencing (IP-seq) [8]. Genome wide identificationof 6mA sites is labor intensive and costly using such experimental methods.Computational experiment based pLogo plot ([9]) on benchmark dataset re-garding rice 6mA site identification shows nucleotide distribution to be dif-ferent near 6mA and non-6mA sites [10]. So, accurate computational meth-ods can be effective for automatic 6mA site identification in rice genome.iDNA6mA-PseKNC tool was established for N -methyladenine site iden-tification on Mus musculus genome dataset containing 1934 samples in eachclass [11]. They used custom feature vector for each sample inspired byPseAAC [12]. They used LibSVM package 3.18 for their classification task.This same package was used in i6mA-Pred tool for the same task in ricegenome dataset containing 880 samples in each class using similar featurevector [13]. They used Maximum Relevance Maximum Distance (MRMD)method ([14]) along with Incremental Feature Selection (IFS) for limitingfeature space. This same dataset was used to train and build iDNA6mA toolwhich utilized one hot matrix as sample feature in [15]. They used sequentialone dimensional convolutional neural network architecture for classification.This same approach was used in [16] to develop SNNRice6mA tool. Theonly difference was that they used group normalization layer in betweenconvolution layers and an adaptive learning rate. The same benchmarkrice genome dataset was used to build i6mA-DNCP tool [17]. They used2i-nucleotide composition frequency and dinucleotide based DNA proper-ties as sample features. TreeBagging algorithm was used as their classifier.They implemented heuristic based DNA property selection for selecting fourdinucleotide based DNA properties for their classification. iDNA6mA-Ricetool was developed using a rice genome dataset containing 1,54,000 samplesin each of the two classes [10]. They used one hot vector based linear featurespace with Random Forest algorithm. MM-6mAPred tool was developed in[18], where neighbour dependency information was used to detect 6mA sitein rice genome leveraging Markov Model. 6mA-Finder tool was developedin [19] using seven sequence oriented information and three physicochemicalbased features. They used Random Forest classifier on an optimal featuregroup using Recursive feature elimination (RFE) strategy.The main limitation of these methods is that they require single featurevector for training and validation purpose irrespective of the heterogeneityof the features they are using. Moreover, most of these proposed algorithmsare unable to use two dimensional matrix oriented features which limits localand sequential pattern detection capability.We propose i6mA-CNN, a one dimensional CNN based branched clas-sifier which can identify 6mA sites in plant genome. We work with fourdifferent types of feature matrices used for DNA sequence representation- PseAAC inspired customized feature matrix, monomer one hot matrix,dimer one hot matrix and dimer physicochemical properties. We use cor-relation based dimer physicochemical property selection technique in orderto remove unnecessary and/ or redundant property features. Our proposedmodel is able to coordinate between these heterogeneous feature matricesand learn distinguishing features between 6mA and non-6mA sites success-fully. Accurate experimental results on a large benchmark dataset and suc-cess on independent test datasets show the effectiveness of our proposedtool.
2. Materials and Methods
Gene Expression Omnibus (GEO) database maintained by National Cen-ter for Biotechnology Information (NCBI) has a total of 2,65,290 6mA sitecontaining sequences each of 41 base pair (bp) length with 6mA site at thecenter [20]. CD-HIT program ([21]) with 80% (following similar researchconducted in [10]) similarity threshold has been implemented on these se-quences in order to avoid redundancy bias which resulted in 1,54,000 mA3ite containing sequences. This is our positive set of samples. Same num-ber of negative samples (no 6mA site) have been obtained from NCBI toformulate a class balanced dataset following some specific criteria. Each41 bp long negative sequence has Adenine at the center, where the centeris not methylated (proven by experiment). 6mA sites have a tendency ofoccurring near GAGG rich sequences [22]. In order to avoid such bias whileclassification, negative sequences rich with GAGG motifs were consideredonly to make negative data more objective.
A sample sequence of our dataset looks as follows: D = N , N , N , · · · , N L where, L is the length of the DNA sequence, which is 41 in our case and N i ∈ { A, T, C, G } . With appropriate mathematical representation of thesesequences, deep learning models can learn class distinguishing features fromthe sequence local and global patterns [23]. It is clear from Figure 1 that ourmodel is capable of coordinating between multiple types of feature vectorsand matrices using branch like structure which has been explained in detailin Subsection 2.3. We use the following sequence representation features inour model: PseAAC (Pseudo Amino Acid Composition) has shown success in deal-ing with protein sequences [24, 25]. Inspired by such success, PseKNC(Pseudo K-tuple Nucleotide Composition) has been introduced for dealingwith DNA/RNA sequences in computational biology [26, 27]. The generalform of PseKNC has been designed in such a way that their values will de-pend on user specified feature extraction techniques from DNA samples.The four types of nucleotides of DNA can be classified into three categoriesof attributes. These attributes along with representation code have beenprovided in Table 1. The categories are as follows:
Ring number:
A and G have two rings, whereas C and T have only onering.
Chemical functionality:
A and C are from amino group, whereas G andT are from keto group.
Hydrogen bonding:
C and G are bonded to each other with 3 hydrogenbonds (strong), whereas A and T with only two (weak).4he above mentioned properties have significant impact on biological func-tions as well [28]. The i th nucleotide N i of a sequence can be represented by( x i , y i , z i ), where x i , y i and z i represent ring structure, functional group andhydrogen bonding code (see Table 1) of nucleotide N i , respectively. Besidesthese three properties, we consider lingering density of each nucleotide topreserve sequence oriented features. Lingering density of nucleotide N i isdefined below: D i = L i (cid:80) L i j =1 f ( N j )Here, L i is the length of the substring up to nucleotide N i and f ( N j ) = (cid:40) , if N i = N j , otherwiseSo, for each nucleotide, we have four properties to calculate. Since eachsequence in our dataset is 41 bp long, we have a 41 × , , , , [0 , , , . , [1 , , , . , [1 , , , . , [1 , , , . Tool iDNA6mA was developed based on monomer one hot encoding[15]. Their one dimensional convolutional neural network (1D CNN) wasquite successful in class discrimination based on one hot encoding basedmathematical representation. Each branch of our designed model is also1D CNN oriented. So, we have used this particular feature. There are fourtypes of nucleotides in DNA sequence and each of them is represented by
Category Attribute Nucleotides Code
RingStructure Purine A, G 1Pyrimidine C, T 0FunctionalGroup Amino A, C 1Keto G, T 0HydrogenBonding Strong C, G 1Weak A, T 0Table 1: Nucleotide categorical attribute-wise encoding5 four size one hot vector. A, T, C and G are represented by (1,0,0,0),(0,1,0,0), (0,0,1,0) and (0,0,0,1), respectively. So, a 41 bp long sequence isrepresented by a 41 × There is statistically significant difference (considering p-value of 0.05)between 6mA and non-6mA site containing samples in terms of dinucleotidecomposition frequencies [17]. There are sixteen possible unique dimers suchas AA, AT, AC, . . . in DNA sequences. We represent each dimer by a 16size one hot vector. 1D CNN based feature extraction technique is able tolearn from class distinguishing dimer patterns which include difference infrequency. In a 41 length sequence, there are 40 overlapping dimers such as N N , N N , N N , . . . . So, each sample of our dataset is represented bya 40 ×
16 dimensional matrix.
According to [29], Adenine methylation may have impact on DNA struc-ture and its stability. We have used the 90 dinucleotide physicochemicalproperties used by PseKNC tool [26]. So, each dimer of our sample se-quence is represented by a 90 size Z-score normalized vector. As a result,each 41 bp long sample has been represented by a 40 ×
90 dimensional matrix.This is certainly a large matrix. If the dinucleotide physicochemical propertyset contains features which include redundant or less important informationin terms of our classification task, it will lead to overfitting problem [30].The steps of dinucleotide physicochemical property (DPP) selection are asfollows: • We use Logistic regression based model for 6mA site identification taskusing one DPP at a time.
Pos Nuc X Y Z D
We note the 5 fold cross validation Mathew’s correlation coefficient(MCC) score for each DPP and select the the top 20 DPPs accordingto obtained MCC score. The higher the MCC, the better is the DPPfor our identification task. It is to note that MCC score is better thanF1 score and accuracy for binary classification evaluation [31]. • We select the topmost DPP. Now we want to remove redundant infor-mation. Each DPP is represented by a 16 size vector as there are 16possible dimers. DPPs which have a high positive or negative correla-tion coefficient represent the same type of information. We calculatePearson’s correlation coefficient for each of the 19 other selected DPPswith the topmost DPP. • We leave out those DPPs which have correlation score greater than0.9 (positive correlation) or less than -0.9 (negative correlation) withthe topmost DPP.Thus we select nine DPPs out of 90 which include
Roll rise, major Groovedistance, twist stiffness, protein induced deformability, mobility to bend to-wards major groove, melting temperature, Hartman trans free energy, proteinDNA twist and tip.2.3. Model Architecture
Our model architecture has been shown in Figure 1. There are fourbranches for four different feature matrices (described in detail in Subsec-tion
Sequence Representation ). Each branch has three 1D convolutionlayers put sequentially. The kernel size and filter number of each convolu-tion layer have been shown in the figure. 1D CNN plays important role inposition independent local feature extraction [32, 33]. Each branch of thearchitecture learns class distinguishing features regarding its correspondinginput feature matrix independent of the other three branches. The monomerone hot representation of the leftmost branch represents raw representationof sample sequence. Model parameters of this branch learn from this rawrepresentation free from any feature specific knowledge bias. This branchhas the potential of learning those class specific hidden features which maynot be apparent to humans. The second branch which has dimer one hotrepresentation as input helps the model learn statistically significant dif-ferences between 6mA and non-6mA site containing samples in terms ofdinucleotide pattern and frequency of occurrence. Such difference has beenpointed out in [17]. The third branch uses dimer physicochemical propertybased representation. These properties played an important role in DNA7 nput shape(41, 4) Input shape(40, 16) Input shape(40, 9) Input shape(41, 4)Mono-MerOne Hot Di-MerOne Hot Di-nucleotide Property PseAAC Inspired FeaturesFilters = 128Kernel = 5 Filters = 128Kernel = 5 Filters = 128Kernel = 5 Filters = 128Kernel = 5 1DConvolutionFilters = 64Kernel = 3 Filters = 64Kernel = 3 Filters = 128Kernel = 3 Filters = 64Kernel = 3 1DConvolutionFilters = 32Kernel = 3 Filters = 32Kernel = 3 Filters = 64Kernel = 3 Filters = 32Kernel = 3 1DConvolutionConcatenate
RD = Relu + Dropout (0.5)
RD RD RD RDRD RD RD RDNode = 1Flatten Flatten Flatten FlattenNode = 1 Node = 1 Node = 1 Node = 1 Dense LayerRD RD RD RD
S = Sigmoid
Dense LayerOutputSS S S S
Figure 1: Proposed Model Architecture specific classification and identification tasks such as promoter classification[33], 6mA site identification [17], recombination spot identification [34] andDNase I hypersensitive sites identification [35]. This third branch facilitates8earning the interaction between these property values in a sequence orderbasis. The last branch is based on learning from ring structure, functionalgroup, hydrogen bonding and lingering density based nucleotide features.These features have impact on DNA biological functions [28, 36]. This lastbranch parameters learn to identify patterns from these sequence coupledfeatures.We need to have a way to combine these branches for a mutual classifi-cation decision depending on the four heterogeneous feature matrices. Theconvolutional part of each branch i returns a matrix of dimension m i × n i .We flatten and convert this matrix into a linear vector. We pass this vectorthrough a dense node having Sigmoid activation function, which outputs avalue in the range of 0 to 1. This single value obtained from the i th branchis its representative for 6mA identification task. We have total four valuesin a range of 0 to 1 from the four branches, which we concatenate and turninto a four size vector. This four size vector is the representative vector ofall four branches. The final dense layer with Sigmoid activation functionprovides the output value depending on the four values of the concatenatedvector. If the output value is greater than 0.5, then we consider the samplesequence to have 6mA site. Else, we decide that there is no 6 mA site inthe input sequence. Accuracy (Acc), sensitivity (Sn), specificity (Sp) and Mathew’s corre-lation coefficient (MCC) have been used as evaluation metrics which aredescribed as follows:
Acc = T P + T NT P + T N + F P + F NSn = T PT P + F NSp = T NT N + F PM CC = T P × T N − F P × F N (cid:112) ( T P + F P )( T P + F N )( T N + F P )( T N + F N )Here, TP, FP, TN and FN represent true positive, false positive, true neg-ative and false negative, respectively. We consider the 6mA site containingsamples as positive class samples. Apart from the above four metrics, wehave also used area under receiver operating characteristic (auROC) curvefor performance evaluation. This metric provides classifier performance irre-9pective of provided threshold for class discrimination to the classifier. Thisscore has an impact on classification confidence value for each sample aswell. Acc, Sn, Sp and auROC lie in the range [0, 1] while for MCC score,the range is from -1 to 1. Higher value indicates better classification ability.We have tuned and selected our model hyperparameters using 5 foldcross validation on our benchmark dataset. We took a local search orientedapproach where we went in uphill direction in terms of obtained MCC value.The tuned model hyperparameters are as follows: • Convolution layer:
Change in number of convolution layers in eachbranch, convolution filter number and kernel size in each convolutionlayer • Learning rate:
Experimentation with different learning rates suchas 0.01, 0.001, 0.0001, 0.00001 and adaptive learning rate scheme • Dropout rate:
Testing out dropout rate of 0.1, 0.3 and 0.5. • Optimizer:
Model training with adaptive moment estimation (ADAM),stochastic gradient descent (SGD) and Nesterov accelerated gradient(NAG) optimizer. • Activation function:
Experimentation with Relu, Tanh and LeakyReluactivation function in initial convolution layersAs shown in Figure 1, Each branch has three convolution layers. Thesethree consecutive layers contain kernel size of 5, 3 and 3, whereas the num-ber of filters are 128, 64 and 32, respectively. Learning rate of 0.0001 hasbeen used in order to avoid divergence while training. Dropout of 0.5 hasbeen used for overfitting avoidance. Addition of dropout changes networkconnectivity on each training epoch and thus helps model to understand thetraining sample features instead of memorizing them [37]. ADAM has beenchosen as the optimizer for model weight update during training. Each in-termediate convolution layer output goes through Relu activation functionwhich is very popular for being effective and simple [38].
3. Results and Discussion i6mA-Pred [13], iDNA6mA-Rice [10], iDNA6mA [15], i6mA-DNCP [17],MM-6mAPred [18], SNNRice6mA [16] and 6mA-Finder [19] are some ofthe state-of-art tools which can identify 6mA sites in rice genome. For10omparison purpose of our proposed tool with these state-of-the-art tools,consistency in dataset usage and evaluation method are necessary.The benchmark dataset that we have used for this research has also beenused for constructing the latest rice genome i6mA site identification tool iDNA6mA-Rice and SNNRice6mA [10, 16]. Tools for the same task havebeen developed in [13, 15, 17, 18, 19], though they used a different datasetfor benchmarking which contains 880 samples in each of the two classes.They used either Jackknife testing or 10 fold cross validation for algorithmand model selection. For comparison purpose, we also test our custom modelon the same 880 sample per class dataset using similar validation methods.This comparison purpose dataset has been obtained from NCBI GEO underaccession number GSE103145. All sequences with 6mA site in the center andwith modification score greater than 30 (according to Methylome AnalysisTechnical Note, 30 is the minimum score to call a nucleotide as modified)have been considered as positive samples. Each sample sequence is 41 bplong. CD-HIT ([21]) with 60% similarity threshold has been applied on thechosen positive sequences in order to remove redundancy. Thus the 880positive samples were obtained. The 880 negative samples of this datasethave been obtained in a manner similar to our negative benchmark datasetsamples. Detailed performance comparison between the tools used for 6mAsite identification has been provided in Table 3. All the tool results exceptfor our proposed tool i6mA-CNN have been obtained from correspondingresearch articles ([13, 10, 15, 17, 18, 16, 19]). Our proposed tool clearlyoutperforms all other tools for similar task by all five performance metrics.Tool i6mA-CNN achieves a high 5 fold cross validation auROC score of 0.98on our benchmark dataset. ROC curve has been shown in Figure 2.Independent test set performance of a tool can help in proving the gener-alization ability of that tool. Following procedures similar to our benchmarkdataset construction, we have constructed two independent test datasetsin order to evaluate the robustness of our proposed tool in terms of N -Methyladenine site identification in plant genome. Fragaria Vesca (Wildstrawberry) and Arabidopsis thaliana (cress weed plant) 6mA site containingsequences have been obtained from the MDR database [39] and from NCBIGEO with accession number GSE81597 [40], respectively. Performance ofi6mA-CNN on these datasets have been shown in detail in Table 4. Resultsshow the robustness of the proposed tool for 6mA site identification in plantgenome.Our model is a CNN based branched model with no time stamp orientedsequential information processing capability. There are reasons why we didnot use recurrent layers which utilize time stamp information while working11 ataset EvaluationMethod Tool Sn Sp Acc MCC auROC i6mA-Pred 83.41% 83.64% 83.52% 0.67 0.91JackknifeTesting i6mA-CNN 90.61% 94.12% 93.15% 0.85 0.98 iDNA6mA 86.7% 86.59% 86.64% 0.73 0.93i6mA-DNCP 84.09% 88.07% 86.08% 0.72 0.93MM-6mAPred 89.32% 90.11% 89.72% 0.796mA-Finder 0.94Dataset 1 10 FoldCrossValidation i6mA-CNN 90.35% 94.62% 92.48% 0.85 0.98 iDNA6mA-Rice 93% 90.5% 91.7% 0.83 0.96SNNRice6mA 94.33% 89.75% 92.04% 0.84 0.97Dataset 2 5 FoldCrossValidation i6mA-CNN 95.13% 92.81% 93.97% 0.88 0.98 Table 3: i6mA-CNN performance comparison with state-of-the-art tools.Best performance results have been marked in bold.
Dataset 1 is the 880sample per class dataset used for comparison purpose, while
Dataset 2 isour benchmark dataset with 1,54,000 samples in each class
Figure 2: ROC curve of i6mA-CNN on benchmark dataset ndependent TestDataset TotalSample No. TPSample No. FNSample No. SuccessRate Fragaria Vesca 8979 8642 337 96.25%Arabidopsis Thaliana 67650 63252 4398 93.5%Table 4: i6mA-CNN performance on independent test datasetswith sequence type data. First, it is not possible to provide multiple featurerepresentations simultaneously in each time stamp input in recurrent neu-ral networks. Second, each time stamp of such networks only accept linearvector representation which would limit our current ability of using matri-ces. Third, recurrent neural networks such as RNN, GRU or LSTM giveimportance to positional information [41]. But classification of genome re-lated sample sequence often depends on the presence of potential motif typepatterns irrespective of their position in the sequence [22]. Finally, train-ing and validation using recurrent layer based networks are costly becauseof sequential type of computation. 1D CNN based model having multipleconvolution layers have the capability of encoding and learning both shortand long range patterns. Parallel processing is possible in such networkswhich allow fast computation. Our customized model architecture is able towork with multiple heterogeneous features simultaneously and is able to uti-lize the benefits of feature matrix representation. Because of such inherentcharacteristics, our proposed tool is fast, accurate and shows performancesuperior to the remaining state-of-the-art tools for 6mA site identificationin rice genome and also generalizes well to other plant genomes for the sametask.
Motif Sequence Active Occurrence %
ATAT 5.2AGGC 4.1AAAA 4.0AAAT 3.9GAGG 3.5Table 5: Potential motifs for N -methyladenine site identificationWe have found potential motifs which our deep learning based model ac-tively searches for in order to perform 6mA site identification in rice genome.13otifs are class specific substrings found frequently in the sample sequencesof that class. Following the procedure shown in [33], we have customized anattention model (details of this model have been provided as supplementaryinformation) based on our proposed CNN model provided in Figure 1 witha view to identifying the potential motifs. A list of potential motifs foundfor 6mA site identification on our benchmark dataset have been provided inTable 5. The top motif that we have found is ATAT . It was found to beactive 5.2% of the time during 6mA site identification. The pLogo plot con-structed in [10] showed the significant difference of nucleotide distributionin postive and negative samples. This analysis has not been repeated in ourcurrent work.
4. Conclusion
Tool i6mA-CNN is the outcome of the current research aimed at 6mA siteidentification in rice genome. The combination of four different feature ma-trices and redundant feature reduction have facilitated our customized CNNbased architecture to be robust and accurate. Experimental results on tworice genome datasets and two other independent plant genome datasets re-lated to 6mA site identification show the effectiveness of our tool. Proposedi6mA-CNN is expected to be an effective tool for automation in epigeneticsrelated research. Future research may aim at constructing a tool that canperform 6mA site identification in both plant and animal genome effectively.