DeepAcid: Classification of macromolecule type based on sequences of amino acids
Sarwar Khan
Taiwan International Graduate Program, National Chengchi University; Institute of Information Science, Academia Sinica
Abstract—The study of amino acid sequences is vital in the life sciences. In this paper, we use deep learning to solve the problem of macromolecule classification from amino acid sequences. Deep learning has emerged as a strong and efficient framework that can be applied to a broad spectrum of complex learning problems which were difficult to solve using traditional machine learning techniques in the past. We use word embedding from NLP to represent amino acid sequences as vectors, and we apply different deep learning models, such as CNN, LSTM, and GRU, to classify macromolecules. A convolutional neural network can extract features from amino acid sequences represented as vectors; the extracted features are then fed to different types of models to train a robust classifier. Our results show that word2vec as the embedding combined with VGG-16 performs better than LSTM and GRU. Our approach achieves an error rate of 1.5%. Code is available at https://github.com/saysarwar/DeepAcid
Index Terms—Protein classification, amino acid, GRU, CNN, embedding, LSTM
I. INTRODUCTION
The last decade has witnessed the great success of deep learning, as it has brought revolutionary advances in many application domains, including computer vision, natural language processing, and signal processing. The key idea behind deep learning is to consider feature learning and classification in the same network architecture, and to use back-propagation to update model parameters so as to learn discriminative feature representations. More importantly, many novel deep learning methods have been devised and have improved classification performance significantly [4], [11], [16].

Lee et al. [10] targeted learning an informative feature representation of protein sequences as the input of neural network models to predict the protein family a sequence belongs to. Hou et al. [5] proposed a framework with a deep 1D CNN (DeepSF), robust for both fold recognition and the study of the sequence-structure relationship, to classify protein sequences. Nguyen et al. [13] developed a framework with a convolutional neural network which used the idea of translation to convert DNA sequences into word sequences for final classification.

The revolution in machine learning, particularly deep learning [7]-[9], has made it possible to study and extract complex patterns from data in order to make machine models more robust. The study of DNA is an important factor in the life sciences for understanding organisms. Current sequencing technologies have made it possible to read DNA sequences at lower cost. DNA databases are growing day by day, and we need to use the power of modern computing to help understand DNA. One of the most important and basic tasks is to classify DNA sequences.

This work focuses on the classification of macromolecules based on amino acid sequences. Within all lifeforms on Earth, from the tiniest bacterium to the giant sperm whale, there are four major classes of organic macromolecules that are always found and are essential to life: carbohydrates, lipids (or fats), proteins, and nucleic acids. All of the major macromolecule classes are similar in that they are large polymers assembled from small repeating monomer subunits. Proteins are large, complex molecules that play many critical roles in the body. They are made up of hundreds or thousands of smaller units called amino acids, which are attached to one another in long chains. There are 20 different types of amino acids that can be combined to make a protein: alanine, arginine, asparagine, aspartic acid, cysteine, glutamic acid, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine. The sequence of amino acids determines each protein's unique 3-dimensional structure and its specific function.

Carbohydrates are polymers that include both sugars and polymers of sugars, and they serve as fuel and building materials both within and outside of cells. For instance, fructose and glucose are examples of carbohydrates which are essential to life. Nucleic acids are polymeric macromolecules that are essential for all known forms of life. The two types of nucleic acids are DNA and RNA, which are both found in the nuclei of cells. They allow organisms to reproduce their complex components.

II. PROBLEM STATEMENT
The interaction of protein with protein and protein with DNA/RNA plays a pivotal role in protein function. Experimental detection of residues on protein-protein interaction surfaces must come from the determination of the structure of protein-protein, protein-DNA, and protein-RNA complexes. However, experimental determination of such complexes lags far behind the number of known protein sequences. Hence, there is a need for the development of reliable computational methods for identifying protein-protein, protein-RNA, and protein-DNA interface residues. Identification of macromolecules and detection of the specific amino acid residues that contribute to the strength of interactions is an important problem with broad applications, ranging from rational drug design to the analysis of metabolic and signal transduction networks. Against this background, this project is aimed at developing a machine learning algorithm that identifies the macromolecule type given the sequence of amino acids and the residue count.

III. MATERIAL AND METHOD

A. Dataset
The dataset contains two files with a different number of entries. Figure 1 shows the first five rows of the first file, which has 467304 entries with five columns; Table I gives its description. As we can see from Table I, we have four types of macromolecule: protein, DNA, RNA, and protein/DNA/RNA hybrid. We dropped the other types during the pre-processing step. The second file is also arranged by structureId. It contains protein meta-data, i.e., resolution, extraction method, experimental technique, etc., and has 141401 entries with 14 columns. We merge both files based on structureId. The very first step of pre-processing is to drop all entries with NaN values or with a missing label or sequence. After removing the missing values, each sequence is checked for tags or numbers, which are removed. Once the data is cleaned, we divide each sequence into tri-grams, so that each sequence becomes a series of three-character strings, e.g., (CGC GAA TTC GCG). The final block may not contain all three characters, so we append 0s to make the slices equal (padding). The final output contains two columns, one for the sequence and the other for the label. There are 432474 rows in the processed data with the four labels discussed earlier. This is unbalanced data, and we discuss the augmentation in a coming section. We also create a special balanced and normalized version of the dataset: we take 424 sequences of each class to build a mini dataset and test the performance of all the models on it. This mini dataset gives almost the same results as the whole set with some down/up-sampling.
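A minimal sketch of these pre-processing steps in Python/pandas follows. The DataFrame names (df_seq, df_meta), the exact label strings, and the helper make_trigrams are our assumptions for illustration, not taken from the released code:

```python
import pandas as pd

def make_trigrams(seq: str) -> str:
    """Split a sequence into space-separated tri-grams, padding the last
    block with '0' so every token has exactly three characters."""
    seq = seq + "0" * (-len(seq) % 3)          # pad length to a multiple of 3
    return " ".join(seq[i:i + 3] for i in range(0, len(seq), 3))

# Merge the two PDB files on structureId and drop incomplete rows.
df = pd.merge(df_seq, df_meta, on="structureId")
df = df.dropna(subset=["sequence", "macromoleculeType"])

# Keep only the four macromolecule types used in the paper
# (label strings assumed; check against the actual files).
keep = ["Protein", "DNA", "RNA", "DNA/RNA Hybrid"]
df = df[df["macromoleculeType"].isin(keep)]

df["trigrams"] = df["sequence"].apply(make_trigrams)
```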
B. Biological Structures in the Dataset
DNA makes RNA, RNA makes amino acids, and amino acids make proteins; this is known as the central dogma of life. DNA and RNA are made of nucleotides, which are of four types (A, T/U, C, G). A nucleotide sequence is a combination of these nucleotides in a row. Three nucleotides combine to form a codon, which is a building block of amino acids.
TABLE I
DATASET DESCRIPTION

Label              Type                      Data structure  Unique Entries
structureId        Structure ID              object          140250
chainId            Chain ID                  object          2837
sequence           Protein sequence          object          104813
residueCount       No. of residues (ATCG's)  integer         4737
macromoleculeType  Type of macromolecule     object          14

Fig. 1. PDB data sequence
Amino acids then combine to form proteins; to make a protein, at least 20 amino acids are necessary. Let's explain this with a real example. ATT is a codon, which is simply three nucleotides. This codon represents an amino acid (isoleucine) denoted by the letter "I". TTT is another codon which represents another amino acid (phenylalanine), denoted by the letter "F". These, "IF", combine along with others to make a protein. The letters in a codon represent nucleotides, while the letters in a protein sequence represent amino acids. At least 20 amino acids must combine to make one functional protein. The maximum number depends on when the machinery reaches a stop codon and stops making that protein; that may happen after 20 amino acids or after 500. The amino acid sequence determines the type of protein.
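To make the codon-to-amino-acid mapping concrete, here is a small hypothetical sketch. CODON_TABLE covers only the two example codons from the text plus one stop codon, not the full genetic code:

```python
# Partial codon table: only the examples from the text plus one stop codon.
CODON_TABLE = {"ATT": "I",   # isoleucine
               "TTT": "F",   # phenylalanine
               "TAA": "*"}   # a stop codon: translation ends here

def translate(dna: str) -> str:
    """Translate a DNA string codon-by-codon until a stop codon appears."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "?")  # '?' = codon not in table
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATTTTTTAA"))  # -> "IF"
```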
IV. PROPOSED METHOD
The task of classifying macromolecule type from a sequence can be seen as a sequence classification problem. This is analogous to the sentence classification task in NLP. Thus, we apply a skip-gram analysis from NLP research to model our problem. There are various sequence models in the deep learning domain; among them, we used LSTM, GRU, and 1D convolution for our problem. Recurrent neural networks, such as the Long Short-Term Memory (LSTM) network, are specifically designed to support sequences of input data. Figure 3 shows the layout of the model. They are capable of learning the complex dynamics within the temporal ordering of input sequences, as well as using internal memory to remember or exploit information across long input sequences. As is crucial for any deep learning task, we applied a data pre-processing step before feeding data to our model. This step includes handling missing values and downsampling the dominant class to balance the distribution of the data.

Fig. 2. Proposed method using embedding and CNN.
Fig. 3. LSTM model using embedding at input
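To make the recurrent branch of Fig. 3 concrete, the following is a minimal Keras sketch of an LSTM classifier over embedded tri-gram tokens. The 50-dimension embedding and 512 cells follow the values reported in Sections V and VI; the vocabulary size, sequence length, and remaining details are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB = 8000      # assumed tri-gram vocabulary size
MAX_LEN = 350     # assumed padded sequence length

model = models.Sequential([
    layers.Embedding(VOCAB, 50),       # learned 50-d embedding at the input
    layers.LSTM(512),                  # 512 cells, as reported in Sec. VI
    layers.Dense(4, activation="softmax"),  # four macromolecule classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

A GRU variant would simply swap layers.LSTM(512) for layers.GRU(512), since the two models share the same structure in our experiments.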
Figure 2 shows the block diagram of the proposed system, which can be broadly divided into three parts. The first is dataset processing; the second is embedding. We have different choices of embedding, and word2vec [12], FastText [6], and GloVe [14] are well-known embedding techniques in NLP. These embeddings have been tested on different bioinformatics tasks with promising results; we used word2vec in this task. The one-hot vector is another well-known embedding method for amino acid representation, and we used it for comparison with the other embedding techniques.
Fig. 4. Word2vec model: relations between sequences
The final part is the CNN. We have to determine the number of layers this network needs, and we also need to find the hyper-parameters, along with the size (height and width) of the model. The output layer is a softmax layer used to classify the sequence. The coming sections explain these parts in detail.
A. Word Embedding

According to Wikipedia, "Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers." We used two types of embedding techniques in this work; the first is word2vec [12]. The reason we need these embedding techniques is that deep learning or machine learning models only deal with real numbers. Embedding not only converts text or sequences into numbers but also captures relationships between them. There are two algorithms for word2vec, skip-gram and Continuous Bag Of Words (CBOW); we do not explain these algorithms here. After using both algorithms with all the models, we found that skip-gram works better in this case. We used the tri-gram data to generate our own word2vec model. Figure 4 shows the relationships between sequences. We tried different output dimensions, i.e., 100, 150, and 300, and found that the 300-dimension output gives the best performance.

Fig. 5. CNN model parameters with dimensions
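Since the word2vec model is built with gensim (Section VI), the training step might look like the following sketch. The 300-dimension skip-gram setting is the one reported above; window, min_count, and workers are assumed values:

```python
from gensim.models import Word2Vec

# sentences: list of tri-gram token lists,
# e.g. [["CGC", "GAA", "TTC", "GCG"], ...]
w2v = Word2Vec(
    sentences=sentences,
    vector_size=300,   # 300-d output performed best in our runs
    sg=1,              # skip-gram, which worked better than CBOW here
    window=5,          # assumed context window
    min_count=1,       # keep rare tri-grams
    workers=4,
)
vec = w2v.wv["CGC"]    # 300-d vector for one tri-gram
```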
B. Convolutional Neural Network
The convolutional neural network (CNN) is the most famous network in deep learning. We used the network architecture of VGG [15], [2]. We use Convolution1D, since we are dealing with one-dimensional data. The network has 4 convolution layers, each followed by a max-pooling layer. It also includes batch normalization and dropout in order to prevent the model from over-fitting. The final max-pooling layer is followed by two dense layers. We set the following hyper-parameters, which gave us the best results: learning rate 0.001, batch size 512, cross-entropy loss, Adam optimizer, 20 epochs, dropout rate 0.5, ReLU activation, and a softmax final layer. Figure 5 shows the model architecture along with some model parameters.
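A minimal Keras sketch of this architecture follows. The four conv blocks, batch normalization, dropout 0.5, Adam with learning rate 0.001, and the softmax output are as stated above; the filter counts, kernel sizes, and dense width are placeholders, since the paper does not specify them:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB, MAX_LEN, EMB_DIM = 8000, 350, 50   # assumed sizes (see Sec. V-A)

def conv_block(x, filters):
    """Conv1D -> batch norm -> ReLU -> max-pool, as described above."""
    x = layers.Conv1D(filters, kernel_size=3, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    return layers.MaxPooling1D(pool_size=2)(x)

inp = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB, EMB_DIM)(inp)
for f in (64, 128, 256, 512):             # 4 convolution layers
    x = conv_block(x, f)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)   # first dense layer
x = layers.Dropout(0.5)(x)                    # dropout 0.5, as stated
out = layers.Dense(4, activation="softmax")(x)  # second dense layer

model = models.Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # lr 0.001
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```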
V. EXPERIMENTS

We evaluate our models on the Protein Data Bank (PDB), a database of three-dimensional structural data for large biological molecules such as proteins and nucleic acids. Performance comparisons between different models only make sense if we keep the pre-processing and embedding the same. Our experiments show that word2vec with 300 dimensions performs best, and we keep this setting unless otherwise mentioned.
A. CNN Model Results
The CNN model was explained in Section IV and is shown in Figure 5. The embedding dimension is 50 for this model. Figure 6 shows the validation and training loss over 50 epochs. We used an early stopping algorithm in order to save and use our best model. The training loss is 0.028, while the validation loss is 0.034. Figure 7 shows the accuracy curves for the same setting. The final training accuracy is 99.2% and the test accuracy is 98.8%. Figure 8 shows the precision, recall, and F1 score of the CNN model; the micro average and macro average are almost the same. Figure 9 shows the confusion matrix of all four classes. The accuracy of each class is almost the same, as can be seen from the diagonal color.
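In Keras, the early stopping and best-model saving described here might be set up as in the following sketch; the patience value and checkpoint file name are assumptions:

```python
from tensorflow.keras import callbacks

cbs = [
    # Stop when validation loss stops improving; restore the best weights.
    callbacks.EarlyStopping(monitor="val_loss", patience=5,
                            restore_best_weights=True),
    # Also keep the best model on disk.
    callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss",
                              save_best_only=True),
]
history = model.fit(x_train, y_train, validation_split=0.1,
                    epochs=50, batch_size=512, callbacks=cbs)
```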
VI. ADDITIONAL SIMULATION
Goodfellow et al. [3] explain the "no free lunch theorem". In a very broad sense, it states that, when averaged over all possible problems, no algorithm performs better than all others. Keeping this in mind, we start with a simple machine learning algorithm, the random forest. Random forests have already been used in natural language processing (NLP) and have been quite successful. The pre-processing step is the same, and we use tri-grams for the word2vec embedding (Figure 4).
Fig. 6. Training and validation loss

We used gensim to build our own word2vec model with skip-gram, and we visualize the results in Figure 4. We used tri-grams, bi-grams, and uni-grams as features and averaged them to get a single feature vector for each input sequence. The same process is repeated for the testing data using the same word2vec model. The random forest was initialized with 100 estimators, and the results show 96.83% test accuracy on more than 90 thousand test sequences. We tested the random forest on both balanced and unbalanced data, and the results were almost the same. Table II shows the precision, recall, and F1 score for each class.

Fig. 7. CNN model training and validation accuracy

Fig. 8. Precision, recall and F1-score

Compared to the CNN, these results are not as good, but the random forest is much faster and less expensive than the CNN. This comparison may not be entirely fair, but we report these results to show that traditional machine learning algorithms also sometimes work well. The next step is to compare different state-of-the-art algorithms, such as LSTM, random forest, and GRU, with the CNN [2].
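A sketch of this random forest pipeline, assuming the gensim model w2v from Section IV-A and per-sequence n-gram token lists; the averaging helper sequence_vector is our own illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def sequence_vector(tokens, w2v, dim=300):
    """Average the word2vec vectors of all n-gram tokens in one sequence."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_train = np.stack([sequence_vector(t, w2v) for t in train_tokens])
X_test = np.stack([sequence_vector(t, w2v) for t in test_tokens])

rf = RandomForestClassifier(n_estimators=100)  # 100 estimators, as stated
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```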
TABLE II
RANDOM FOREST RESULTS IN TERMS OF PRECISION AND RECALL

Label         Precision  Recall  F1 Score
Protein       0.96       0.99    0.97
DNA           0.90       0.80    0.85
RNA           0.93       0.69    0.79
Hybrid        0.94       0.88    0.91
Macro Avg     0.93       0.84    0.88
Weighted Avg  0.96       0.96    0.96
Figure 3 shows the network diagram of the LSTM. We use the TensorFlow embedding layer in this case with 50-dimension vectors. The LSTM and GRU have the same network structure and use the same number of cells, 512 in this case.

Fig. 9. Confusion matrix for all classes

Table III shows the comparison of accuracy and loss across the different networks. The table clearly shows that VGG-16 [15] gives the best performance, with word2vec [12] as the embedding. The seven-layer CNN also performs better than the LSTM and GRU models. All the convolutional neural networks are one-dimensional. As discussed, we take the full dataset, apply undersampling, then use word2vec and feed the result to the CNN; the results are shown in Figure 10.
TABLE III
PERFORMANCE OF DIFFERENT NETWORKS

Model        Train-Acc  Train-Loss  Val-Acc  Val-Loss
CNN          98.19%     0.0486      97.74%   0.0819
GRU          90.79%     0.2691      89.70%   0.2715
LSTM         95.12%     0.3982      95.14%   0.1962
CNN-GRU      94.85%     0.1509      92.74%   0.1962
RF [1]       95.17%     0.4019      94.87%   0.4906
VGG16 [15]   99.11%     0.0288      98.1%    0.0297

Fig. 10. Precision-recall for the full dataset

REFERENCES

[1] Leo Breiman. Random forests. Machine Learning, 2001.
[2] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John Hopcroft, and Kilian Weinberger. Snapshot ensembles: Train 1, get M for free. In International Conference on Learning Representations, 2017.
[3] Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[5] Jie Hou, Badri Adhikari, and Jianlin Cheng. DeepSF: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics, 34(8):1295–1303, 2017.
[6] Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov. FastText: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016.
[7] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, pages 1097–1105, USA, 2012. Curran Associates Inc.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998.
[9] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[10] Timothy K. Lee and Tuan Nguyen. Protein family classification with neural networks, 2016.
[11] Chien-Liang Liu, Wen-Hoar Hsaio, and Yao-Chung Tu. Time series classification with multivariate convolutional neural network. IEEE Transactions on Industrial Electronics, 66(6):4788–4797, 2019.
[12] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
[13] Ngoc Giang Nguyen, Vu Anh Tran, Duc Luu Ngo, Dau Phan, Favorisen Rosyking Lumbanraja, Mohammad Reza Faisal, Bahriddin Abapihi, Mamoru Kubo, and Kenji Satou. DNA sequence classification by convolutional neural network. Journal of Biomedical Science and Engineering, 9(05):280, 2016.
[14] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[16] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.