Protein Secondary Structure Prediction with Long Short Term Memory Networks
Søren Kaae Sønderby, SOREN.SONDERBY@BIO.KU.DK
Ole Winther, OLWI@DTU.DK
Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
Department for Applied Mathematics and Computer Science, Technical University of Denmark (DTU), 2800 Lyngby, Denmark
Abstract
Prediction of protein secondary structure from the amino acid sequence is a classical bioinformatics problem. Common methods use feed-forward neural networks or SVMs combined with a sliding window, as these models do not naturally handle sequential data. Recurrent neural networks are a generalization of the feed-forward neural network that naturally handles sequential data. We use a bidirectional recurrent neural network with long short term memory cells for prediction of secondary structure and evaluate using the CB513 dataset. On the secondary structure 8-class problem we report better performance (0.674) than the state of the art (0.664). Our model includes feed-forward networks between the long short term memory cells, a path that can be further explored.
1. INTRODUCTION
Recently Long Short Term Memory (LSTM) [Hochreiter et al., 1997] recurrent neural networks (RNNs) have shown good performance in a number of tasks, including machine translation [Sutskever et al., 2014] and speech recognition [Graves & Jaitly, 2014]. This paper uses the LSTM for prediction of protein secondary structure. Many machine learning algorithms have been applied to this problem: Qian & Sejnowski 1988 introduced neural networks, Jones 1999 discovered that the use of evolutionary information, through position specific scoring matrices, improved performance, and Baldi et al. 1999 introduced RNNs for secondary structure prediction. Recent work includes conditional random field hybrid models [Maaten et al., 2011; Peng et al., 2009; Wang et al., 2011] and generative stochastic networks [Troyanskaya, 2014]. A common approach to secondary structure prediction is to use a non-sequential model, typically feed-forward neural networks or SVMs [Hua & Sun, 2001; Jones, 1999]. These models are not ideal for classifying data which cannot naturally be presented as a vector of fixed dimensionality, which is why a sliding-window approach is typically used to circumvent this problem. Window-based models can only learn dependencies within the input window; recent methods for learning other dependencies include conditional random field hybrid models. RNNs can be applied to sequential data of any length, and should theoretically be able to learn long-term dependencies. In practice RNNs suffer from exploding or vanishing gradients [Bengio et al., 1994], and on a secondary structure prediction task Baldi et al. 1999 reported that their RNNs were only able to learn dependencies of ± amino acids relative to the target. The LSTM cell was invented to solve the vanishing gradients problem and enables the network to learn dependencies over hundreds of time steps. The contribution of this paper is the application of bidirectional LSTM networks [Graves, 2012] to protein secondary structure prediction.
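To make the contrast with window-based models concrete, the following is a minimal sketch of the sliding-window feature construction such models rely on; the window size, zero-padding scheme, and shapes are illustrative assumptions, not details taken from the cited methods.

```python
import numpy as np

def sliding_windows(seq_features, window=5):
    """Build a fixed-size window around each residue so a non-sequential
    classifier can be applied. seq_features: (seq_len, n_features) array.
    Positions past either end of the sequence are zero-padded.
    Returns an array of shape (seq_len, window * n_features)."""
    seq_len, n_feat = seq_features.shape
    half = window // 2
    padded = np.vstack([np.zeros((half, n_feat)),
                        seq_features,
                        np.zeros((half, n_feat))])
    # One flattened window per residue position.
    return np.stack([padded[i:i + window].ravel() for i in range(seq_len)])

# A toy sequence of 5 residues with 4 features each:
X = sliding_windows(np.random.randn(5, 4), window=5)
print(X.shape)  # (5, 20)
```

Whatever the window size, the classifier can by construction never use information outside it, which is the limitation the recurrent models in this paper avoid.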
Our model architecture uses feed-forward neural networks for concatenation of predictions from the forward and backward networks in the bidirectional model, and the model also includes feed-forward neural networks between hidden states in the recurrent network, see Figure 1. The use of feed-forward neural networks "inside" the recurrent neural network has also been explored by [Pascanu et al., 2013]. This work primarily differs from the work by Baldi et al. 1999 in the introduction of the LSTM cell, the availability of much larger datasets and the possibility of training larger models by using a GPU.

2. MATERIALS AND METHODS

The LSTM cell is implemented as described in [Graves, 2013], however without peepholes, because recent papers have shown good performance without peepholes [Sutskever et al., 2014; Zaremba & Sutskever, 2014; Zaremba et al., 2014]. When predicting target x_t a (forward) RNN only knows the past sequence, x_1 … x_{t−1}. In tasks where the entire sequence is known beforehand, e.g. secondary structure prediction, this is not desirable. Schuster & Paliwal 1997 introduced the bidirectional RNN as an elegant solution to this problem. One trains two separate RNNs: the forward RNN starts the recursion from x_1 and goes forwards, the backwards model starts at x_n and goes backwards. The predictions from the forward and backward networks are combined and normalized, see Figure 1. The standard method for combining the forward and backward models is to normalize the activations from each layer in a softmax layer [Graves, 2012]. We expand the standard stacked bidirectional LSTM model by introducing a feed-forward network responsible for concatenating the output from the forward and backward networks into a single softmax prediction. Secondly we expand the model by inserting a feed-forward network between recurrent hidden states, see equation (7), along with shortcut connections between the recurrent hidden layers.
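As an illustrative sketch (not the authors' actual Theano/Lasagne implementation), the combination of per-position forward and backward activations through a feed-forward concatenation network with a softmax output could look like the following; all weight shapes and the two-layer structure are assumptions for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def combine_bidirectional(h_fwd, h_bwd, W1, b1, W2, b2):
    """Concatenate forward and backward hidden states at each position and
    map them through a small ReLU network to per-class probabilities.
    h_fwd, h_bwd: (seq_len, n_hidden). Returns (seq_len, n_classes)."""
    h = np.concatenate([h_fwd, h_bwd], axis=1)  # (seq_len, 2 * n_hidden)
    a = np.maximum(0.0, h @ W1 + b1)            # ReLU concatenation network
    return softmax(a @ W2 + b2)                 # normalized class probabilities
```

Each row of the result sums to one, so it can be read directly as a distribution over the eight secondary structure classes at that residue.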
Similar ideas have been explored for RNNs by [Pascanu et al., 2013]. Figure 2 shows an LSTM cell. Equations (1) to (10) describe the forward recursions for a single LSTM layer; h_t^rec is forwarded to the next time slice and h_t is passed upwards in a multilayer LSTM.

i_t = σ(x_t W_xi + h_{t−1} W_hi + b_i)    (1)
f_t = σ(x_t W_xf + h_{t−1} W_hf + b_f)    (2)
o_t = σ(x_t W_xo + h_{t−1} W_ho + b_o)    (3)
g_t = tanh(x_t W_xg + h_{t−1} W_hg + b_g)    (4)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ g_t    (5)
h_t = o_t ⊙ tanh(c_t)    (6)
h_t^rec = h_t + feedforwardnet(h_t)    (7)
σ(z) = 1 / (1 + exp(−z))    (8)
⊙ : elementwise multiplication    (9)
x_t : input from the previous layer, h_t^{l−1}    (10)

We use the dataset from Troyanskaya 2014. The dataset consists of amino acid sequences labeled with secondary structure. Sequences and structures were downloaded from PDB and annotated with the DSSP program [Kabsch & Sander, 1983]. In the literature it is common to map the 8-class DSSP output (Q8) to helix, sheets and coils (Q3), see Table 1. We use the original 8-class output, which is a harder problem. Each amino acid is encoded as a 42-dimensional vector: 21 dimensions for orthogonal encoding and 21 dimensions for sequence profiles. For further descriptions see Troyanskaya 2014. The full dataset has 6128 non-homologous sequences (identity less than 30%). This set is further filtered such that no sequence has more than 25% identity with the CB513 dataset [Cuff & Barton, 1999]. The dataset is divided into a training set (n=5278) and a validation set (n=256); the CB513 dataset is used for testing.

The LSTM is implemented in Theano [Bastien et al., 2012] using the Lasagne library. The model has 3 layers with either 300 or 500 LSTM units in each layer. The feed-forward network, eq. (7), is a two-layer ReLU network with 300 or 500 units in each layer; this network has skip connections.
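The recursions in equations (1)-(7) transcribe almost directly into code. The sketch below is an illustrative NumPy version of a single forward step, not the paper's Theano/Lasagne implementation; the dictionary keys for the weights and the `feedforwardnet` callable (standing in for the two-layer ReLU network of eq. (7)) are naming conventions chosen for this example.

```python
import numpy as np

def sigmoid(z):  # eq. (8)
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b, feedforwardnet):
    """One forward step of the LSTM layer in equations (1)-(7).
    W maps names like 'xi', 'hi' to weight matrices; b maps 'i', 'f',
    'o', 'g' to bias vectors. Returns (h_t, h_t_rec, c_t), where h_t is
    passed upwards and h_t_rec is forwarded to the next time slice."""
    i = sigmoid(x_t @ W['xi'] + h_prev @ W['hi'] + b['i'])   # input gate, eq. (1)
    f = sigmoid(x_t @ W['xf'] + h_prev @ W['hf'] + b['f'])   # forget gate, eq. (2)
    o = sigmoid(x_t @ W['xo'] + h_prev @ W['ho'] + b['o'])   # output gate, eq. (3)
    g = np.tanh(x_t @ W['xg'] + h_prev @ W['hg'] + b['g'])   # modulation gate, eq. (4)
    c = f * c_prev + i * g                                   # cell state, eq. (5)
    h = o * np.tanh(c)                                       # layer output, eq. (6)
    h_rec = h + feedforwardnet(h)                            # eq. (7)
    return h, h_rec, c
```

In a full unrolled pass, `h_rec` from one step would be supplied as `h_prev` at the next, since the text states that h_t^rec is what is forwarded to the next time slice.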
The output from the bidirectional forward and backward networks is concatenated into a single vector which is passed through a two-layer ReLU network with 200 or 400 hidden units in each layer.

Figure 1.

Unrolled recurrent neural networks. Left: unidirectional LSTM with a single layer. Right: bidirectional LSTM with a single layer. The forward LSTM (red arrows) starts at the first time step and the backwards LSTM (blue arrows) starts at time n; they then go forwards and backwards respectively. The errors from the forward and backward nets are combined using a feed-forward net and the result is used for backpropagation. Note the feed-forward nets between time slices. The figure shows a single-layer model, but the model is easily extended with more layers. Adapted from [Graves, 2012].

The concatenation network is regularized using 50% dropout. In the LSTM cells all initial weights are sampled uniformly between −0.05 and 0.05 and biases are initialized at zero. In the fully connected layers weights are initialized using Lasagne's default settings (https://github.com/benanne/Lasagne). The LSTM initial hidden and cell states are learned. The learning rate is controlled with AdaDelta using default settings (ρ = 0.95, ε = 1e−6) [Zeiler, 2012]. After each epoch we calculate the norm of the gradient updates divided by the batch size:

norm = ‖gradient updates‖ / batch size

If the norm exceeds 0.5 all gradients are scaled with 0.5/norm. The batch size is 128.

Table 1.
Description of protein secondary structure classes and class frequencies in the dataset. In the literature the 8-class DSSP output is typically mapped to 3 classes. The 8-to-3 class mappings are included for reference.

DSSP class    3-class    Frequency    Name
H             H          0.34535      α-helix
E             E          0.21781      β-strand
L             C          0.19185      loop or irregular
T             C          0.11284      β-turn
S             C          0.08258      bend
G             H          0.03911      3₁₀-helix
B             E          0.01029      β-bridge
I             C          0.00018      π-helix

Figure 2.
LSTM memory cell. i: input gate, f: forget gate, o: output gate, g: input modulation gate, c: memory cell. Blue arrow heads are c_{t−1} and red arrow heads are c_t. The notation corresponds to equations (1) to (10), such that W_xo is the weights from x to the output gate, W_hf is the weights from h_{t−1} to the forget gate, etc. Adapted from [Zaremba & Sutskever, 2014].

Table 2.
Test set per-amino-acid accuracy for CB513. ∗Reported by Wang et al. 2011.

Method                                        Q8 accuracy
[Pollastri et al., 2002] (BRNN)∗              0.511
Wang et al. 2011 (CNF, 5-model ensemble)
Troyanskaya 2014 (GSN)                        0.664
LSTM small
LSTM large                                    0.674
3. RESULTS
The LSTM network has a correct classification rate of 0.674, better than the current state-of-the-art performance achieved by a generative stochastic network (GSN) [Bengio & Thibodeau-Laufer, 2013; Troyanskaya, 2014] and a conditional neural field (CNF) [Lafferty et al., 2001; Peng et al., 2009]. Furthermore the LSTM network performs significantly better than the bidirectional RNN (BRNN) used in SSpro8, which has a correct classification rate of 0.511 [Pollastri et al., 2002]; see Table 2.
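The per-amino-acid (Q8) accuracy reported above is simply the fraction of residues assigned the correct 8-class label over the whole test set; a minimal sketch, with toy labels for illustration:

```python
import numpy as np

def q8_accuracy(pred_labels, true_labels):
    """Per-amino-acid accuracy: fraction of residues whose predicted
    8-class label matches the reference label. Inputs are the label
    sequences concatenated over all proteins in the test set."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return float(np.mean(pred == true))

print(q8_accuracy(list("HHEEC"), list("HHEEL")))  # 4 of 5 correct -> 0.8
```

Because the classes are heavily imbalanced (Table 1), this single number can be dominated by the frequent classes, which is worth keeping in mind when comparing methods.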
4. DISCUSSION AND CONCLUSION
We used the LSTM RNN for prediction of protein secondary structure. To our knowledge the CB513 performance of 0.674 is currently state-of-the-art. Comparison with the SSpro8 method shows that the LSTM significantly improves the performance. Similarly the LSTM performs better than both conditional neural field and GSN methods. Inspired by Pascanu et al. 2013 we used a feed-forward network between the recurrent connections. We showed that an LSTM with this architecture and a feed-forward neural net for concatenation of the forward and backward nets performs significantly better than existing methods for secondary structure prediction. Future work includes investigation of different architectures for the feed-forward networks.
5. AUTHORS CONTRIBUTIONS
SS is a PhD student under the supervision of OW. SS developed the model and performed the experiments. Both authors read and approved the final version of the article.

6. ACKNOWLEDGEMENTS
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research. We wish to acknowledge funding from the Novo Nordisk Foundation.
References
Baldi, P, Brunak, S, and Frasconi, P. Exploiting the past and the future in protein secondary structure prediction. Bioinformatics, 15(11):937–946, 1999.

Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian, Bergeron, Arnaud, Bouchard, Nicolas, Warde-Farley, David, and Bengio, Yoshua. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, November 2012.

Bengio, Y, Simard, P, and Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

Bengio, Yoshua and Thibodeau-Laufer, Éric. Deep Generative Stochastic Networks Trainable by Backprop. arXiv preprint arXiv:1306.1091, 2013.

Cuff, JA and Barton, GJ. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics, 34(4):508–519, 1999.

Graves, A. Supervised sequence labelling with recurrent neural networks. Springer, 2012. ISBN 978-3-642-24797-2.

Graves, A and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772, 2014.

Graves, Alex. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Hochreiter, S and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hua, S and Sun, Z. A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. Journal of Molecular Biology, 308:397–407, 2001. ISSN 0022-2836. doi: 10.1006/jmbi.2001.4580.

Jones, DT. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2):195–202, 1999.

Kabsch, W and Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577–2637, 1983.

Lafferty, John, McCallum, Andrew, and Pereira, Fernando C N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289, 2001.

Maaten, L, Welling, M, and Saul, LK. Hidden-unit conditional random fields. In International Conference on Artificial Intelligence and Statistics, pp. 479–488, 2011.

Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, and Bengio, Yoshua. How to Construct Deep Recurrent Neural Networks. arXiv preprint arXiv:1312.6026, pp. 1–10, 2013.

Peng, J, Bo, L, and Xu, J. Conditional neural fields. In Advances in Neural Information Processing Systems 22, pp. 1419–1427, 2009.

Pollastri, Gianluca, Przybylski, Dariusz, Rost, Burkhard, and Baldi, Pierre. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, 47:228–235, 2002. ISSN 1097-0134. doi: 10.1002/prot.10082.

Qian, N and Sejnowski, T J. Predicting the secondary structure of globular proteins using neural network models. Journal of Molecular Biology, 202:865–884, 1988. ISSN 0022-2836.

Schuster, M and Paliwal, KK. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.

Sutskever, I, Vinyals, O, and Le, QV. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.

Troyanskaya, Olga G. Deep Supervised and Convolutional Generative Stochastic Network for Protein Secondary Structure Prediction. Proceedings of the 31st International Conference on Machine Learning, 32:745–753, 2014.

Wang, Zhiyong, Zhao, Feng, Peng, Jian, and Xu, Jinbo. Protein 8-class secondary structure prediction using conditional neural fields. Proteomics, 11:3786–3792, 2011. ISSN 1615-9853. doi: 10.1002/pmic.201100196.

Zaremba, Wojciech and Sutskever, Ilya. Learning to Execute. arXiv preprint arXiv:1410.4615, October 2014.

Zaremba, Wojciech, Kurach, Karol, and Fergus, Rob. Learning to Discover Efficient Mathematical Identities. In Advances in Neural Information Processing Systems, pp. 1278–1286, June 2014.

Zeiler, Matthew D. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701, 2012.