Application and Comparison of Deep Learning Methods in the Prediction of RNA Sequence Degradation and Stability
Ankit Singhal
Abstract

mRNA vaccines are receiving increased interest as potential alternatives to conventional methods for the prevention of several diseases, including Covid-19. This paper proposes and evaluates three deep learning models (Long Short Term Memory networks, Gated Recurrent Unit networks, and Graph Convolutional Networks) as methods to predict the stability/reactivity and risk of degradation of RNA sequences. Reasonably accurate results were generated, with the Graph Convolutional Network being the best predictor of reactivity (RMSE = 0.249), while the Gated Recurrent Unit network was the best at predicting risks of degradation under various circumstances (RMSE = 0.266). The results suggest the feasibility of applying such methods in mRNA vaccine research in the near future.
1 Introduction

Over the last two decades, there has been increasing interest in RNA-based technologies for the creation of prophylactic vaccines. They are widely regarded as plausible alternatives to conventional approaches due to their high potency, capacity for quick and low-cost manufacturing, and relatively safe administration. They are currently being researched for many diseases, including SARS-CoV-2, although none have yet been approved for human use [6, 8]. One of the primary barriers in the creation of such therapeutics is the fragility of the RNA molecule; RNA is susceptible to rapid degradation within minutes to hours [2] and as such needs to be freeze dried or incubated at low temperatures in order to be kept stable. In the current pandemic, vaccines are seen as the most promising means to control the novel coronavirus, but with current limitations on mRNA vaccines, efficient in vivo delivery of such molecules seems improbable; it would likely only reach a fraction of the population.

Therefore, research into the stability and degradation of RNA molecules has received continued interest, to date largely consisting of traditional statistical approaches and biophysical models. However, it remains unclear exactly which parts of RNA molecules are more prone to spontaneous degradation, and it is thus difficult to accurately predict the reactivity and degradation of mRNA [7]. Therefore, experimentation, an incredibly time-consuming process, is the default method for determining these values.

The aim of this manuscript is to present three possible deep learning approaches to this problem through the use of the Stanford OpenVaccine dataset. Two variants of Recurrent Neural Networks (RNNs) are employed, Long Short Term Memory networks (LSTMs) and Gated Recurrent Unit networks (GRUs), along with a variant of a Graph Neural Network (GNN), the Graph Convolutional Network (GCN).
These models are applied and compared to assess whether machine learning methods can provide helpful results in predicting the reactivity and degradation of mRNA molecules.

The majority of this work was done in the Python programming language using a TensorFlow back end with a surface-level Keras API. Refer to Table 1 for all the software/packages used throughout the creation and testing of the models, along with related information.

| Software/Package | Use | Developer |
| --- | --- | --- |
| Python 3.7 | Used to write the code for the models discussed. | Python Software Foundation |
| TensorFlow | Used as a back end for the majority of the models presented. | Google |
| Keras | Used as a high-level API for some parts of the model. | Google |
| SKLearn | Used for k-fold cross validation. | Cournapeau and Matthieu |
| Pandas | Used for data handling. | McKinney |
| Numpy | Used for data handling. | Oliphant |
| Matplotlib | Used to create figures in manuscript. | Droettboom and Caswell |
| ARNiE | Used to generate augmentation data. | Das Lab |
Table 1: All the packages/software used in the creation of the models presented.
The dataset used in this manuscript to evaluate the models is the "Stanford OpenVaccine" dataset [3]. It consists of 6034 RNA sequences. The training/validation set consisted of 3029 of these sequences with a length of 107 nucleotide bases. The testing dataset consists of 3005 sequences with a length of 130 nucleotide bases. Due to experimental limitations, measurements on the final bases in the sequences cannot be carried out; therefore, only the first 68 bases (for the sequences of length 107) and the first 91 bases (for the sequences of length 130) have experimental data associated with them.

Three predictors were associated with each sequence: the sequence itself (described in A, G, C, and U bases), the expected structure of the molecule, and the predicted loop type (derived from a simulation of the secondary structure of the RNA molecule). A Base Pair Probability Matrix (BPPM) was also provided for each individual sequence, indicating the probability of certain base-pair interactions. Five experimentally determined sets of values (henceforth referred to as 'target values') were also given for the first 68 or 91 bases in the sequence: reactivity values, degradation values at pH 10, degradation values at pH 10 with added Magnesium, degradation values at 50 °C, and degradation values at 50 °C with added Magnesium. Refer to Table 2 for more information.

| Feature | Classification | Description | Sample |
| --- | --- | --- | --- |
| Sequence | Input | A sequence of 107 letters corresponding to the four bases in the sequence. | A, G, U, U, C, ... |
| Structure | Input | Expected structure of the molecule (length = 107). '(' and ')' refer to a base pair interaction; '.' indicates no BP interaction. | (..()...)()(... |
| Predicted loop type | Input | Predicted secondary structure of the RNA molecule at different points. 'S' refers to a stem structure, 'M' multiloop, 'I' internal loop, 'B' bulge, 'H' hairpin loop, 'E' dangling end, 'X' external loop. | S, S, M, S, H, ... |
| Reactivity | Target | Reactivity values at each individual point in the sequence. | 1.23, 3.46, ... |
| Deg pH 10 | Target | Degradation values at pH 10. | 0.89, 2.44, ... |
| Deg pH 10 Mg | Target | Degradation values at pH 10 with added Mg. | 1.28, 0.88, ... |
| Deg 50 °C | Target | Degradation values at 50 °C. | 2.02, 1.87, ... |
| Deg 50 °C Mg | Target | Degradation values at 50 °C with added Mg. | 1.11, 2.44, ... |
Table 2: Inputs and target features of the Stanford OpenVaccine dataset.
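The categorical inputs in Table 2 (sequence, structure, and loop type) must be numerically encoded before being fed to a model; as described later, the data in this work was one-hot encoded. A minimal sketch in plain Python, shown here for the sequence feature only (the exact preprocessing pipeline used for the paper may differ):

```python
# One-hot encode an RNA sequence string (illustrative sketch).
BASES = ["A", "G", "C", "U"]  # vocabulary for the sequence feature

def one_hot_sequence(seq):
    """Map each base of an RNA sequence to a 4-dimensional one-hot vector."""
    index = {b: i for i, b in enumerate(BASES)}
    encoded = []
    for base in seq:
        vec = [0] * len(BASES)
        vec[index[base]] = 1
        encoded.append(vec)
    return encoded

# Example: the first three bases of a sequence
print(one_hot_sequence("AGU"))  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
```

The structure and predicted-loop-type features can be encoded the same way with their own vocabularies ('(', ')', '.' and S, M, I, B, H, E, X respectively).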
The task of the algorithms presented in this paper is to take the sequence and other structural features of an RNA molecule (the features marked as 'inputs') and predict its stability through the five target values.
As mentioned earlier, each base of the sequence has five target and two further structural features associated with it. This will be represented as a feature matrix for all three ML models. What is of particular interest, however, is the Base Pair Probability Matrix associated with each sequence, which takes the form N × N, where N is the number of bases in the sequence. It can be represented as a standard matrix (refer to Figure 1 for a visualization) or as a graph (refer to Figure 2) with nodes and edges. This distinction is important, for the former form is used for the two RNN architectures whereas the latter is used for the GCN.

Figure 1: Visualization of a sample BPPM. Purple indicates no base pair interaction; the greener a color, the higher the interaction between those two bases.

Figure 2: Visualization of an excerpt from a sample graph structure of a BPPM. Information in the nodes corresponds to the type of base and its place in the sequence; the width of, and number next to, the edges refer to the interaction between the two bases (0-1).

Data from the dataset was first mapped to vector matrices and one-hot encoded. After splitting the training and testing data according to the specifications outlined in Section 2.1.2, the training data was augmented using the ARNiE package, the same package used to create the data in the first place. Finally, the resulting data was used to create the model using the architectures explained below, and predictions were generated on the test data before being evaluated. Refer to Figure 3.

Figure 3: High-level diagram of the overall model architecture. 'Model' in the diagram refers to one of the three ML approaches discussed in the next section.
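The two BPPM representations (dense matrix for the RNNs, graph for the GCN) are directly convertible. The sketch below builds a graph edge list from a dense probability matrix; the pruning threshold and the toy data are illustrative assumptions, not values taken from this work:

```python
def bppm_to_edges(bppm, threshold=0.0):
    """Convert a dense N x N base-pair probability matrix into a graph
    edge list [(i, j, p), ...], keeping only pairs whose probability
    exceeds `threshold`. Only the upper triangle is scanned, since the
    BPPM is symmetric."""
    n = len(bppm)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if bppm[i][j] > threshold:
                edges.append((i, j, bppm[i][j]))
    return edges

# Toy 4-base example: one strong pairing (0, 3) and one weak one (1, 2)
bppm = [
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0],
]
print(bppm_to_edges(bppm, threshold=0.05))  # [(0, 3, 0.9), (1, 2, 0.1)]
```

The edge weights (0-1) correspond to the numbers next to the edges in Figure 2.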
2.2.2 Long Short Term Memory Networks

A Long Short Term Memory network (LSTM) is a type of RNN introduced by Hochreiter & Schmidhuber in 1997 [4] and further popularized by many other pieces of work following it. Like other RNNs, it essentially consists of copies of the same network in which the output of the previous network (or cell) informs the next one, allowing information to persist. A single cell in an LSTM consists of four distinct neural network layers, three of which make 'decisions' about the persistence of information, as shown in equations 1-3:

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)    (1)
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)    (2)
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)    (3)

Here the output of the previous cell, along with the new input, is passed through a basic neural layer with a non-linear activation function (traditionally sigmoid functions were used, but in this paper all activation functions were ReLU). In parallel, candidate values for the cell state are also produced, as shown by equation 4:

\tilde{C}_t = \phi_h(W_C [h_{t-1}, x_t] + b_C)    (4)

where the activation function in this case is a hyperbolic tangent (compressing the values between -1 and 1). Finally, using these output matrices, one can then update the cell state as given by equation 5 and produce new outputs as shown in equation 6:

C_t = f_t \cdot C_{t-1} + \tilde{C}_t \cdot i_t    (5)
h_t = o_t \cdot \phi_h(C_t)    (6)

This output is then fed into the next cell. This model was used with the normal vector matrix BPPM as represented in Figure 1.

2.2.3 Gated Recurrent Unit Networks

Introduced by Cho et al. in 2014 [1], a Gated Recurrent Unit network is a variation of the traditional LSTM network with the primary advantage of faster computation (due to fewer neural nets in each cell).
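To make the LSTM cell equations 1-6 concrete, here is a minimal single-cell step in plain Python with scalar states (an illustrative sketch only; the models in this paper were built with TensorFlow/Keras layers, and the weight values below are invented):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step (equations 1-6), with scalar states for
    readability; real layers use vectors and weight matrices."""
    f = sigmoid(W["f"][0] * h_prev + W["f"][1] * x_t + b["f"])          # eq. 1: forget gate
    i = sigmoid(W["i"][0] * h_prev + W["i"][1] * x_t + b["i"])          # eq. 2: input gate
    o = sigmoid(W["o"][0] * h_prev + W["o"][1] * x_t + b["o"])          # eq. 3: output gate
    c_tilde = math.tanh(W["c"][0] * h_prev + W["c"][1] * x_t + b["c"])  # eq. 4: candidate state
    c = f * c_prev + i * c_tilde                                        # eq. 5: new cell state
    h = o * math.tanh(c)                                                # eq. 6: new output
    return h, c

# Toy parameters: every weight 0.5, every bias 0 (invented values)
W = {g: (0.5, 0.5) for g in "fioc"}
b = {g: 0.0 for g in "fioc"}
h, c = lstm_cell_step(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W, b=b)
print(round(h, 4), round(c, 4))
```

Chaining `lstm_cell_step` over a sequence, feeding each step's `h` and `c` into the next, reproduces the recurrent structure described above.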
Like in the LSTM, decision matrices are produced through two neural networks with non-linear activation functions, as shown in equations 7 and 8:

r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)    (7)
z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)    (8)

Candidate values are also generated through a hyperbolic-tangent-based network; however, in this case it is done directly for the output, as there is no cell state, unlike in an LSTM. The candidates are then chosen by the 'decision' matrices to produce the output, as shown in equations 9 and 10:

\tilde{h}_t = \phi_h(W_h [h_{t-1}, x_t] + b_h)    (9)
h_t = (1 - z_t) \cdot h_{t-1} + \tilde{h}_t \cdot z_t    (10)

Like any other form of RNN, this output is then passed to the next hidden layer.

2.2.4 Graph Convolutional Networks

Graph Convolutional Networks were introduced by Kipf & Welling in 2017 [5] and provide a novel way to analyze arbitrarily structured data in the form of a graph. A GCN is not a form of RNN, although they are both connectionist models. A GCN operates on a graph defined by G = (V, E), where V is the set of nodes and E the set of edges. Each node in the graph aggregates the features of the surrounding nodes and itself and uses the following neural net (refer to equation 11) to generate an output, which is then assigned to the node:

h_t = \sigma(W_h \cdot D^{-1} [h_{t-1}, \hat{A}] + b_h)    (11)

where D is the diagonal node degree matrix of the graph and \hat{A} is equal to A + I, A being the adjacency matrix (taking the form N × N) representing the graph (in this case, the BPPM; see Figure 2). h takes the form N × F, i.e. it is a feature matrix. After an output is generated, this process can be repeated, the output of the nodes propagating outwards each time. Due to the localized nature of this problem, I used only two repetitions; any more resulted in negligible influence.

2.2.5 Performance Assessment

The models will be evaluated based on the error produced in the prediction of the target values. The two loss measures employed in this paper are Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), described in equations 12 and 13.
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}    (12)

MAE = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|    (13)

The difference between the two metrics is that in MAE all the errors are weighted equally when averaged, whereas RMSE, having a quadratic term, punishes larger individual errors more than smaller ones.

After being built using the specifications discussed earlier in this manuscript, the models were trained on the 'training' dataset discussed in Section 2.1.2. For k-fold cross validation in the models presented, k = 4, and this was initially repeated three times to improve accuracy. The data was split into four groups for cross validation, as larger values of k were found to have negligible influence on the model results. Training and validation loss (RMSE) for the three models are shown in Figure 4 and Table 3.

| Model | Training RMSE | Training MAE | Validation RMSE | Validation MAE |
| --- | --- | --- | --- | --- |
| LSTM | 0.1089 | 0.0663 | 0.1758 | 0.1052 |
| GRU | 0.1143 | 0.0693 | 0.1787 | 0.1072 |
| GCN | 0.1752 | 0.1055 | 0.2232 | 0.1394 |

Table 3: Mean loss values of the models after the training and validation processes.
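Equations 12 and 13 translate directly into code. A minimal plain-Python sketch of the two metrics (for illustration; the evaluation code used for the paper is not reproduced here):

```python
import math

def rmse(y_pred, y_true):
    """Root Mean Square Error (equation 12)."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n)

def mae(y_pred, y_true):
    """Mean Absolute Error (equation 13)."""
    n = len(y_true)
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / n

# One large error among otherwise perfect predictions:
# RMSE penalizes the outlier more heavily than MAE does.
y_true = [1.0, 1.0, 1.0, 1.0]
y_pred = [1.0, 1.0, 1.0, 3.0]   # single error of 2
print(mae(y_pred, y_true))      # 0.5
print(rmse(y_pred, y_true))     # 1.0
```

The example illustrates the point made above: the same total error yields a higher RMSE than MAE because the quadratic term weights the single large deviation more.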
As can be seen after the initial training and validation phase, the LSTM maintains the best performance. The GRU scores were similar to those of the LSTM; however, the GCN was in a distant third place, averaging about 0.05-0.06 higher RMSE values in comparison to the other two.

(a) LSTM Model (b) GRU Model (c) GCN Model
Figure 4: Training and validation loss for the twelve histories of the ML Models.
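The k-fold cross validation used during training (k = 4, as described above) partitions the data so that each sample serves in the validation set exactly once. A minimal index-splitting sketch in plain Python (illustrative; the paper used scikit-learn's k-fold utilities, per Table 1):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) pairs for k-fold cross
    validation; each sample appears in a validation set exactly once."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

# 8 samples, k = 4 -> four folds, each with 2 validation samples
folds = list(k_fold_indices(8, 4))
print(folds[0])  # ([2, 3, 4, 5, 6, 7], [0, 1])
```

Repeating this whole procedure several times (three, in this paper) and averaging the per-fold losses gives the mean values reported in Table 3.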
The performance of the three machine learning models on the training dataset is presented in Figures 5-9. The GRU maintained the lowest loss values for all target values, the only exceptions being the reactivity (see Figure 5) and the degradation at pH 10 (RMSE only; see Figure 6), for which the GCN had the best predictions. Notably, in Figures 7, 8, and 9, the GCN had lower RMSE values than the LSTM while the opposite was true for the MAE values, suggesting that the GCN is not as prone to large individual errors.
Figure 5: Results (loss) over 20 trials of the models predicting the reactivity values of the training set sequences (note change in scale between plots). The GCN performed the best, both in terms of RMSE and MAE, followed by the GRU and LSTM respectively. Boxplots show the minimum, maximum, and median values along with the 25th and 75th percentiles.

Figure 6: Results (loss) over 20 trials of the models predicting the degradation at pH 10 of the training set sequences (note change in scale between plots). The GCN performed the best in terms of RMSE; however, the GRU performed better in terms of MAE, suggesting the GRU is more prone to larger errors. Boxplots show the minimum, maximum, and median values along with the 25th and 75th percentiles.

Figure 7: Results (loss) over 20 trials of the models predicting the degradation at pH 10 with added Magnesium of the training set sequences (note change in scale between plots). The GRU performed the best, both in terms of RMSE and MAE, followed by the GCN in RMSE and the LSTM in MAE, further reaffirming the notion that the GCN is less prone to larger individual errors. Boxplots show the minimum, maximum, and median values along with the 25th and 75th percentiles.

Figure 8: Results (loss) over 20 trials of the models predicting the degradation at 50 °C of the training set sequences (note change in scale between plots). The GRU performed the best, both in terms of RMSE and MAE, with the GCN second in RMSE and the LSTM second in MAE, similar to Figure 7. Boxplots show the minimum, maximum, and median values along with the 25th and 75th percentiles.

Figure 9: Results (loss) over 20 trials of the models predicting the degradation at 50 °C with added Magnesium of the training set sequences (note change in scale between plots). The GRU performed the best, both in terms of RMSE and MAE, with the GCN second in RMSE and the LSTM second in MAE, similar to Figures 7 and 8. Boxplots show the minimum, maximum, and median values along with the 25th and 75th percentiles.
Mean loss across the 20 trials for each individual target was calculated, along with that of all targets combined, for each model. Results are shown in Figure 10.

Figure 10: Mean loss results (RMSE and MAE; note change in scale between plots) for the three machine learning methods. The GRU performed the best across both metrics.
Although the creation of stable mRNA molecules remains difficult, datasets of sequences and corresponding information are becoming more popular and widely available. Through the use of deep learning architectures, reasonable predictions of structural features can be obtained, as demonstrated in this manuscript, with mean RMSE values ranging from 0.24 to 0.31. The usage of such techniques has the capacity to increase the speed and efficiency of mRNA vaccine discovery and has further implications in other related fields of research.

However, such methods, as presented in this paper, are not without their limitations. It is important to note that while reasonably accurate results are produced, the error, when taking into account the scale of the values being predicted (0.5-4), is not insignificant; this could result in incorrectly predicting the stability of an important molecule, which may then be overlooked during the discovery phase. Therefore, as an extension, I would suggest adding, at the end of each of the models, a simple binary classification system that predicts whether the molecule is stable and aims to minimize the false negative rate, effectively resulting in the model being used as a screening test to remove highly unstable sequences rather than as a full-fledged research tool.

Another limitation of this method is the length of the sequences of the mRNA molecules used. The lengths used in this paper ranged from 107-130 bases, while an actual Covid-19 vaccine would likely be 3000-4000 bases long [2].
As an improvement, this model could be trained on and applied to longer RNA sequences to see how this impacts the accuracy of such methods.

Despite these limitations, the results of this work show that such prediction algorithms are feasible and have the potential to save time during research processes, an especially valuable commodity during disease outbreaks. In the long term, such techniques may also help researchers better understand the reasoning behind the stability of certain RNA molecules and aid in the development of related technologies. It is hoped that this work will also be of some use to other data scientists in creating better prediction models for this field.