Attention based convolutional neural network for predicting RNA-protein binding sites
arXiv preprint [q-bio.GN]
Xiaoyong Pan
Department of Medical Informatics, Erasmus Medical Center, Rotterdam, The Netherlands [email protected]
Junchi Yan
IBM Research, Shanghai, China [email protected]
Abstract
RNA-binding proteins (RBPs) play crucial roles in many biological processes, e.g. gene regulation. Computational identification of RBP binding sites on RNAs is urgently needed. In particular, RBPs bind to RNAs by recognizing sequence motifs. Thus, quickly locating those motifs on RNA sequences is crucial for efficiently determining whether the RNAs interact with the RBPs. In this study, we present an attention-based convolutional neural network, iDeepA, to predict RNA-protein binding sites from raw RNA sequences. We first encode RNA sequences with one-hot encoding. Next, we design a deep learning model with a convolutional neural network (CNN) and an attention mechanism, which automatically searches for important positions, e.g. binding motifs, to learn discriminative high-level features for predicting RBP binding sites. We evaluate iDeepA on publicly available gold-standard RBP binding sites derived from CLIP-seq data. The results demonstrate that iDeepA achieves performance comparable to other state-of-the-art methods.
RNA-binding proteins (RBPs) make up about 10% of the eukaryotic proteome and are closely associated with many biological processes [1]. Identifying whether an RNA binds to an RBP is important for further analysis of the RNA's functions. Many experimental technologies have been developed, such as CLIP-seq. However, they are still time-consuming and costly. Thus, computational identification of RBP binding sites is urgently needed. To this end, many machine learning based methods have been proposed. For example, GraphProt encodes RNA sequences and structures in a graph, which is then fed into a support vector machine to classify bound sites from unbound sites [2]. iONMF integrates multiple sources of data to predict RBP binding sites using orthogonal matrix factorization [3].
Recently, deep learning has been successfully applied to predicting RNA binding sites. For example, deepnet-rbp applies a deep belief network to integrate k-mer frequency features of sequences and structures to model RBP targets [4]. DeepBind [5] applies a convolutional neural network (CNN) [6] to identify RBP binding sequence specificity. iDeep uses multimodal deep learning to integrate different sources of data to infer RBP binding sites and sequence motifs [7]. iDeepS infers sequence and structure motifs simultaneously using a convolutional neural network and a long short-term memory network [8]. Most of the above deep learning methods are built around a CNN, which demonstrates high accuracy for identifying RBP binding sites.
It is commonly assumed that an RNA sequence that can be bound by an RBP contains at least one binding subsequence (motif) of this RBP. Therefore, it is fairly intuitive to put more attention on this motif subsequence along the RNA sequence. To better model this characteristic of RBP binding sites, we introduce an attention mechanism [9]. An attention mechanism allows deep learning models to focus selectively on only the important features. Deep models augmented with attention mechanisms have achieved great success in machine translation [9, 10] and computational biology [11].
In this study, we propose an attention-based convolutional neural network model, iDeepA, to predict RBP binding sites from RNA sequences alone. iDeepA combines learned features from CNNs and two levels of attention to locate important subsequences.
We download the RBP binding site dataset derived from CLIP-seq from GraphProt ( ) [2]. It contains 24 experiments covering 21 RBPs. For each RBP, there are thousands of bound RNA subsequences of variable length, and almost the same number of negative sequences, selected because there is no evidence that they are bound by this RBP.
In this study, we present a CNN based method with an attention mechanism to classify RBP bound sites from unbound sites (Figure 1). We first encode RNA sequences with one-hot encoding, which indicates the presence of nucleotides A, C, G and U. The one-hot encoded matrix is then fed into a CNN, which involves convolution, activation, and max-pooling operations. The CNN layer preserves the spatial information and outputs feature maps for subsequent processing. Inspired by [9, 10], we introduce an attention mechanism to attend differentially to related motifs and locate important positions for predicting RBP binding sites. We extract three levels of abstract features: 1) The output feature maps from the CNN. 2) The outputs from attention model 1 for the sequence dimension, whose input is one copy of the two-dimensional hidden states from the CNN. 3) The outputs from attention model 2 for the feature map dimension, whose input is the transposition of the hidden states from the CNN. For both attention models, we use the same structure, with a feedforward neural network as decoder to generate a representation vector. The output $O$ from an attention model is:
$$O = \sum_{t=1}^{T} h_t \, \alpha_t \tag{1}$$
where $h_t$ is a hidden state from the CNN and $\alpha_t$ is the softmax weight of hidden state $h_t$:
$$\alpha_t = \frac{\exp(e_t)}{\sum_{i=1}^{T} \exp(e_i)} \tag{2}$$
where $e_t$ is generated from the hidden state $h_t$ by a feedforward neural network.
Augmented with the attention mechanism, the model learns a soft transformation between the input and output sequences. Finally, the outputs from the CNN layer and the two attention models are connected to two fully connected layers. The last layer is a sigmoid layer used to classify RBP bound sites from unbound sites. We optimize a categorical cross-entropy loss function using RMSProp [12] for 30 epochs. iDeepA is implemented using the Keras 1.1.2 library https://github.com/fchollet/keras .
We compare iDeepA with other state-of-the-art methods: GraphProt, deepnet-rbp, DeepBind and MILCNN.
A negative sequence contains no binding site, while a positive sequence contains at least one binding site of this RBP. It is therefore natural to treat each sequence as a bag whose subsequences are instances. Inspired by this characteristic, MILCNN first breaks each RNA sequence into multiple overlapping fixed-length subsequences; each subsequence is an instance and each sequence is a bag of instances. Next, MILCNN trains a CNN under the multiple instance learning framework. Multiple instance learning has been used for predicting protein-DNA interactions [14].
Figure 1: The flowchart of iDeepA. iDeepA first encodes the sequence into a one-hot matrix, which is fed into a CNN to output feature maps. Next, we input the last hidden states of the CNN to one attention model, and their transposition to another attention model. In the end, the outputs from the two attention models and the CNN are combined and passed to two fully connected layers to predict RBP binding sites.
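The one-hot encoding and the attention pooling of Equations (1) and (2) can be illustrated with a minimal NumPy sketch. It is not the iDeepA implementation: the scoring network (one tanh hidden layer), the array sizes, the random weights, and the names `one_hot` and `attention_pool` are assumptions made for illustration, and a random matrix stands in for the CNN feature maps.

```python
import numpy as np

NUCLEOTIDES = "ACGU"

def one_hot(seq):
    """Encode an RNA string as a (length, 4) matrix over A, C, G, U.
    Symbols outside ACGU (e.g. N) are left as all-zero rows."""
    seq = seq.upper().replace("T", "U")
    mat = np.zeros((len(seq), len(NUCLEOTIDES)))
    for i, base in enumerate(seq):
        j = NUCLEOTIDES.find(base)
        if j >= 0:
            mat[i, j] = 1.0
    return mat

def attention_pool(h, w, b, v):
    """Attention pooling over the rows of h (shape (T, d)).
    Scores e_t come from a small feedforward net (an assumed form):
    e_t = v . tanh(h_t W + b). Returns (O, alpha) as in Eq. (1)-(2)."""
    e = np.tanh(h @ w + b) @ v            # (T,) unnormalized scores
    e = e - e.max()                       # stabilize the softmax
    alpha = np.exp(e) / np.exp(e).sum()   # Eq. (2): softmax weights
    return alpha @ h, alpha               # Eq. (1): weighted sum of h_t

rng = np.random.default_rng(0)
x = one_hot("AUGGCUACGUAGC")              # input matrix, shape (13, 4)

# Stand-in for CNN feature maps: 10 positions x 16 filters (hypothetical sizes).
h = rng.normal(size=(10, 16))

# Attention model 1: attends over sequence positions.
o_seq, alpha_seq = attention_pool(
    h, rng.normal(size=(16, 8)), np.zeros(8), rng.normal(size=8))

# Attention model 2: same structure applied to the transposed feature maps,
# so it attends over feature maps instead of positions.
o_map, alpha_map = attention_pool(
    h.T, rng.normal(size=(10, 8)), np.zeros(8), rng.normal(size=8))

# The three feature levels would then be concatenated for the dense layers.
features = np.concatenate([h.ravel(), o_seq, o_map])
```

In this sketch `alpha_seq` has one weight per sequence position and `alpha_map` one weight per feature map, which mirrors the two attention models described in the text.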
GraphProt, deepnet-rbp, MILCNN, DeepBind and iDeepA achieve average AUCs of 0.887, 0.902, 0.861, 0.921 and 0.921, respectively, across the 24 experiments (Figure 2). iDeepA and DeepBind yield similar average AUCs, which are higher than those of the other three methods. In addition, iDeepA improves on DeepBind for some RBPs with small training sets, on which DeepBind does not achieve a high AUC. For example, iDeepA obtains an AUC of 0.839 for C17ORF85 with only 4000 training samples, a relative increase of 11% over DeepBind's AUC of 0.755. The results indicate that introducing the attention mechanism can enhance learning on small datasets relative to DeepBind, and that it quickly focuses on important subsequences. However, introducing the attention mechanism does not improve performance on RBPs with a large number of training samples, possibly because feeding more samples into model training lets the models converge to a similar optimum. In addition, MILCNN yields lower performance than the other methods, possibly because the training RNA sequences are themselves subsequences anchored at the peak centers derived from CLIP-seq, so breaking them into shorter subsequences may also break the binding sites.
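The suspected failure mode of MILCNN, where splitting a sequence into fixed-length instances cuts through a binding site, can be illustrated with a toy example. This is only a sketch: the window length, stride, sequence, and motif are hypothetical and are not MILCNN's actual settings, and `split_into_instances` and `motif_survives` are names introduced here.

```python
def split_into_instances(seq, window, stride):
    """Split seq into overlapping fixed-length windows (a bag of instances).
    Returns (start, subsequence) pairs; a final window is added so the
    end of the sequence is always covered."""
    if len(seq) <= window:
        return [(0, seq)]
    last = len(seq) - window
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)
    return [(s, seq[s:s + window]) for s in starts]

def motif_survives(seq, motif, window, stride):
    """True if at least one instance still contains the whole motif."""
    return any(motif in sub
               for _, sub in split_into_instances(seq, window, stride))

# Toy sequence with a 6-nt motif in the middle (hypothetical example).
seq = "AAAAUGCAUGAAAA"
motif = "UGCAUG"

dense = motif_survives(seq, motif, window=8, stride=2)   # True
sparse = motif_survives(seq, motif, window=8, stride=6)  # False
```

With a stride of 2 one window covers the full motif, while with a stride of 6 the two windows `AAAAUGCA` and `CAUGAAAA` each hold only part of it, so no single instance carries the complete binding site.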
In this study, we present an attention-based CNN method to predict RBP binding sites. Our method, iDeepA, yields performance comparable to other state-of-the-art methods. However, we have not yet investigated whether the attention can be used to identify interpretable motifs. In future work, we expect to improve the interpretability of iDeepA and to evaluate it comprehensively on larger datasets with more RBPs.
Figure 2: The AUCs of different methods (GraphProt, deepnet-rbp, MILCNN, DeepBind and iDeepA) for predicting RBP binding sites in each CLIP-seq experiment. The AUCs of GraphProt and deepnet-rbp are taken from the original papers; the other three methods were run on the same training and test sets with similar CNN architectures.
References
[1] Ray, D., Kazan, H., et al. (2013) A compendium of RNA-binding motifs for decoding gene regulation. Nature, 172-7. doi: 10.1038/nature12311.
[2] Maticzka, D., Lange, S.J., Costa, F., Backofen, R. (2014) GraphProt: modeling binding preferences of RNA-binding proteins. Genome Biol., R17. doi: 10.1186/gb-2014-15-1-r17.
[3] Stražar, M., Žitnik, M., Zupan, B., Ule, J., Curk, T. (2016) Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. Bioinformatics, 1527-35. doi: 10.1093/bioinformatics/btw003.
[4] Zhang, S., Zhou, J., Hu, H., Gong, H., Chen, L., Cheng, C., Zeng, J. (2015) A deep learning framework for modeling structural features of RNA-binding protein targets. Nucleic Acids Res., e32. doi: 10.1093/nar/gkv1025.
[5] Alipanahi, B., et al. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 831-838.
[6] LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE, 2278-2324.
[7] Pan, X. & Shen, H.B. (2016) RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach. BMC Bioinformatics.
[10] et al. (2017) Attention Is All You Need. arXiv:1706.03762.
[11] Wang, D. et al. (2017) MusiteDeep: a Deep-learning Framework for General and Kinase-specific Phosphorylation Site Prediction. Bioinformatics, btx496. https://doi.org/10.1093/bioinformatics/btx496
[12] Tieleman, T. & Hinton, G.E. (2012) Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Bioinformatics. :2097-2105.