NEURAL REPRESENTATION AND GENERATION FOR RNA SECONDARY STRUCTURES

Published as a conference paper at ICLR 2021
Zichao Yan
School of Computer Science, McGill University, Mila
[email protected]

William L. Hamilton
School of Computer Science, McGill University, Mila
[email protected]

Mathieu Blanchette
School of Computer Science, McGill University
[email protected]

ABSTRACT
Our work is concerned with the generation and targeted design of RNA, a type of genetic macromolecule that can adopt complex structures which influence its cellular activities and functions. The design of large-scale and complex biological structures spurs dedicated graph-based deep generative modeling techniques, which represent a key but underappreciated aspect of computational drug discovery. In this work, we investigate the principles behind representing and generating different RNA structural modalities, and propose a flexible framework to jointly embed and generate these molecular structures along with their sequence in a meaningful latent space. Equipped with a deep understanding of RNA molecular structures, our most sophisticated encoding and decoding methods operate on the molecular graph as well as the junction tree hierarchy, integrating strong inductive bias about RNA structural regularity and folding mechanisms, such that high structural validity, stability and diversity of generated RNAs are achieved. We also seek to adequately organize the latent space of RNA molecular embeddings with regard to the interaction with proteins, and targeted optimization is used to navigate this latent space to search for desired novel RNA molecules.
1 INTRODUCTION
There is an increasing interest in developing deep generative models for biochemical data, especially in the context of generating drug-like molecules. Learning generative models of biochemical molecules can facilitate the development and discovery of novel treatments for various diseases, reducing the lead time for discovering promising new therapies and potentially translating into reduced costs for drug development (Stokes et al., 2020). Indeed, the study of generative models for molecules has become a rich and active subfield within machine learning, with standard benchmarks (Sterling & Irwin, 2015), a set of well-known baseline approaches (Gómez-Bombarelli et al., 2018; Kusner et al., 2017; Liu et al., 2018; Jin et al., 2018), and high-profile cases of real-world impact (e.g., the LambdaZero project for exascale search of drug-like molecules).

Prior work in this space has focused primarily on the generation of small molecules (with less than 100 atoms), leaving the development of generative models for larger and more complicated biologics and biosimilar drugs (e.g., RNA and protein peptides) an open area for research. Developing generative models for larger biochemicals is critical in order to expand the frontiers of automated treatment design. More generally, developing effective representation learning for such complex biochemicals will allow machine learning systems to integrate knowledge and interactions involving these biologically-rich structures.

In this work, we take a first step towards the development of deep generative models for complex biomolecules, focusing on the representation and generation of RNA structures. RNA plays a crucial role in protein transcription and various regulatory processes within cells, which can be influenced by its structure (Crick, 1970; Stefl et al., 2005), and RNA-based therapies are an increasingly active area of research (Pardi et al., 2018; Schlake et al., 2012), making it a natural focus for the development of deep generative models. The key challenge in generating RNA molecules, compared to the generation of small molecules, is that RNA involves a hierarchical, multi-scale structure, including a primary sequential structure based on the sequence of nucleic acids as well as more complex secondary and tertiary structures based on the way that the RNA strand folds onto itself. An effective generative model for RNA must be able to generate sequences that give rise to these more complex emergent structures.

There has been prior work on optimizing or designing RNA sequences, using reinforcement learning or blackbox optimization, to generate particular RNA secondary structures (Runge et al., 2019; Churkin et al., 2018). However, these prior works generally focus on optimizing sequences to conform to a specific secondary structure. In contrast, our goal is to define a generative model, which can facilitate the sampling and generation of diverse RNA molecules with meaningful secondary structures, while also providing a novel avenue for targeted RNA design via search over a tractable latent space.
Key contributions. We propose a series of benchmark tasks and deep generative models for the task of RNA generation, with the goal of facilitating future work on this important and challenging problem. We propose three interrelated benchmark tasks for RNA representation and generation:

1. Unsupervised generation: generating stable, valid, and diverse RNAs that exhibit complex secondary structures.

2. Semi-supervised learning: learning latent representations of RNA structure that correlate with known RNA functional properties.

3. Targeted generation: generating RNAs that exhibit particular functional properties.

These three tasks build upon each other, with the first task only requiring the generation of stable and valid molecules, while the latter two tasks involve representing and generating RNAs that exhibit particular properties. In addition to proposing these novel benchmarks for the field, we introduce and evaluate three generative models for RNA. All three models build upon variational autoencoders (VAEs) (Kingma & Welling, 2014) augmented with normalizing flows (Rezende & Mohamed, 2015; Kingma et al., 2016), and they differ in how they represent the RNA structure. To help readers better understand RNA structures and properties, a self-contained explanation is provided in appendix B. The simplest model (termed LSTMVAE) learns using a string-based representation of RNA structure. The second model (termed GraphVAE) leverages a graph-based representation and a graph neural network (GNN) encoder approach (Gilmer et al., 2017). Finally, the most sophisticated model (termed HierVAE) introduces and leverages a novel hierarchical decomposition of the RNA structure. Extensive experiments on our newly proposed benchmarks highlight how the hierarchical approach allows more effective representation and generation of complex RNA structures, while also highlighting important challenges for future work in the area.
2 TASK DESCRIPTION
Given a dataset of RNA molecules, i.e. sequences of nucleotides and corresponding secondary structures, our goals are to: (a) learn to generate structurally stable, diverse, and valid RNA molecules that reflect the distribution of the training dataset; and (b) learn latent representations that reflect the functional properties of RNA. A key factor in both the representation and generation processes is that we seek to jointly represent and generate both the primary sequence as well as the secondary structure conformation. Together, these two goals lay the foundations for generating novel RNAs that satisfy certain functional properties. To meet these goals, we create two types of benchmark datasets, each one focusing on one aspect of the above-mentioned goals:
Unlabeled and variable-length RNA.
The first dataset contains unlabeled RNA with moderate and highly-variable length (32-512 nts), obtained from the human transcriptome (Aken et al., 2016), through which we focus on the generation aspect of structured RNA and evaluate the validity, stability and diversity of generated RNA molecules. In particular, our goal with this dataset is to jointly generate RNA sequences and secondary structures that are biochemically feasible (i.e., valid), have low free energy (i.e., stable), and are distinct from the training data (i.e., diverse). We give an extended assessment of the generation aspect under different circumstances, e.g., when constraining the generation procedures with explicit rules.
Labeled RNA.
The second dataset is pulled and processed from a previous study of in vitro RNA-protein interactions, which features labeled RNAs of shorter and uniform length (40 nts) (Cook et al., 2017). With this dataset, our objective is slightly expanded (in addition to goal (a)), so that the latent space is adequately organized and reflective of the interaction with proteins. Therefore, a key assessment for the latent space is the AUROC for the classification of protein binding, which is crucial for the design of desired novel RNA molecules.

Essentially, this creates slight variations in the task formulation, with the first dataset suited to unsupervised learning of a generative model, while the second dataset involves additional supervision (e.g., for a semi-supervised model or targeted generation). Our specific modeling choices, to be introduced in section 3, are invariant to these different task formulations, and flexible enough to handle different representations of RNA secondary structures. We refer readers to appendix C for a detailed explanation of the datasets and of the evaluation metrics on the generated molecules and latent embeddings.
3 METHODS
In this section, we introduce three different generative models for RNA. All three models are based upon the variational autoencoder (VAE) framework, involving three key components:

1. A probabilistic encoder network $q_\phi(z|x)$, which generates a distribution over latent states given an input representation of an RNA. We experiment with three different types of input encodings for RNA sequence and secondary structure (see Figure S1): a dot-bracket annotated string, a graph with an adjacency matrix representing base-pairings, and a graph augmented with a hierarchical junction tree annotation for the secondary structure.

2. A probabilistic decoder network $p_\theta(x|z)$, which defines a joint distribution over RNA sequences and secondary structures, conditioned on a latent input. As with the encoder network, we design architectures based on a linearized string decoding approach and a graph-based hierarchical junction-tree decoding approach.

3. A parameterized prior $p_\psi(z)$, which defines a prior distribution over latent states and is learned based on a continuous normalizing flow (CNF) (Chen et al., 2018).

Figure 1: Hierarchical encoding. Panel (A) shows the minimum free energy structure (for illustration purposes only), which is represented as a graph. In panel (B), the original molecule is decomposed by grouping structurally adjacent nucleotides into subgraphs, giving rise to a higher-order junction tree representation.

For all the approaches we propose, the model is optimized via stochastic gradient descent to minimize the negative evidence lower bound (ELBO):

$$\mathcal{L} = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + \beta \, \mathrm{KL}(q_\phi(z|x) \,\|\, p_\psi(z))$$

where $\beta$ is a term that allows KL-annealing over the strength of the prior regularization; a minimal sketch of this objective in code is given below.
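The following PyTorch sketch shows how this $\beta$-weighted objective can be assembled. The `encoder`, `decoder` and `prior` interfaces are illustrative stand-ins for any of the three model variants (they are not the paper's actual API), and the KL term is estimated by Monte Carlo since the CNF prior admits no closed-form KL against a Gaussian posterior.

```python
import torch

def elbo_loss(x, encoder, decoder, prior, beta):
    """Negative ELBO with a beta-weighted KL term (hypothetical interfaces)."""
    mu, log_sigma = encoder(x)                    # parameters of q_phi(z|x)
    z = mu + log_sigma.exp() * torch.randn_like(mu)  # reparameterization trick
    recon_nll = -decoder.log_prob(x, z)           # -E_q[log p_theta(x|z)]
    # Monte-Carlo estimate of KL(q_phi(z|x) || p_psi(z)):
    log_q = torch.distributions.Normal(mu, log_sigma.exp()).log_prob(z).sum(-1)
    kl = log_q - prior.log_prob(z)                # prior.log_prob: CNF density (Eq. S9)
    return (recon_nll + beta * kl).mean()
```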
In the following sections, we explain our three different instantiations of the encoder (section 3.1) and decoder (section 3.2), as well as our procedures to structurally constrain the decoding process using domain knowledge (section 3.3) and to avoid posterior collapse (section 3.4).

3.1 ENCODING RNA SECONDARY STRUCTURES
The input to the encoder is a structured RNA molecule, with its sequence given by an ordered array of nucleotides $x_1 \ldots x_L$ with $x_i \in \{A, C, G, U\}$, where $L$ is the length of the sequence, and its secondary structure, represented either as (1) a dot-bracket string $S = \dot{x}_1 \ldots \dot{x}_L$ with $\dot{x}_i \in \{., (, )\}$; (2) or as a graph $G$ with two types of edges: covalent bonds along the RNA backbone, and hydrogen bonds between the base-pairs¹ (we use $x_{uv}$ to denote edge features between nucleotides $u$ and $v$); (3) or as a hypergraph $T$: a depth-first ordered array of subgraphs $\hat{G}_1 \ldots \hat{G}_D$, with $L(\hat{G}_i) \in \{S, H, I, M\}$ indicating the subgraph label, and $I(\hat{G}_i) = \{j \mid j \in \{1 \ldots L\}\}$ indicating the assignment of nucleotides to each subgraph.
Encoding RNA secondary structure as sequence. First, we obtain a joint encoding over the nucleotides and the dot-bracket annotation, using the joint sequence-structure vocabulary $\{A, C, G, U\} \times \{., (, )\}$. These one-hot encodings are then processed by a stacked bidirectional LSTM (Hochreiter & Schmidhuber, 1997), followed by a multi-head self-attention module (Vaswani et al., 2017) to weigh different positions along the RNA backbone. A global max-pooling is used to aggregate the information into $h_S$; we then obtain the mean $\mu_S$ and log variance $\log \sigma_S$ from $h_S$ through linear transformations, and draw the latent encoding $z_S$ from $\mathcal{N}(\mu_S, \sigma_S)$ using the reparameterization trick (Kingma & Welling, 2014).
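A minimal sketch of this sequence encoder is shown below; the layer counts, head count and module names are illustrative assumptions (Table S1 lists the dimensionalities actually used in the paper).

```python
import torch
import torch.nn as nn

class SeqStructEncoder(nn.Module):
    """One-hot joint tokens -> stacked BiLSTM -> self-attention -> max-pool -> z_S."""
    def __init__(self, vocab=12, hidden=512, latent=128, heads=4):
        super().__init__()  # vocab = |{A,C,G,U}| x |{., (, )}| = 12
        self.lstm = nn.LSTM(vocab, hidden // 2, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)

    def forward(self, x_onehot):                  # (batch, L, vocab)
        h, _ = self.lstm(x_onehot)                # (batch, L, hidden)
        h, _ = self.attn(h, h, h)                 # weigh positions along the backbone
        h_s = h.max(dim=1).values                 # global max-pooling -> h_S
        mu, logvar = self.to_mu(h_s), self.to_logvar(h_s)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # z_S
        return z, mu, logvar
```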
Learning graph representations of RNA secondary structure. To encode the graph view $G$ of an RNA secondary structure, we pass rounds of neural messages along the RNA structure, which falls into the framework of the Message Passing Neural Network (MPNN) as originally discussed in Gilmer et al. (2017) and similarly motivated by Jin et al. (2018).

For much longer RNAs, it is conceptually beneficial to pass more rounds of messages so that a nucleotide may receive information on its broader structural context. However, this may introduce undesired effects such as training instability and over-smoothing. Therefore, we combine our MPNN network with a gating mechanism, which is collectively referred to as the G-MPNN:

$$\hat{v}^{t-1}_{uv} = \sigma\Big(W^g_{\mathrm{local}}[x_u \,\|\, x_{uv}] + W^g_{\mathrm{msg}} \sum_{w \in N(u)} v^{t-1}_{wu}\Big) \quad (1)$$
$$v^t_{uv} = \mathrm{GRU}(\hat{v}^{t-1}_{uv}, v^{t-1}_{uv}) \quad (2)$$

where $[\ldots \| \ldots]$ denotes concatenation, $\sigma$ denotes the activation function, and GRU indicates the gated recurrent unit (Cho et al., 2014). After $T$ iterations of message passing, the final nucleotide-level embedding is given by $h_u = \sigma(W^g_{\mathrm{emb}}[x_u \,\|\, \sum_{v \in N(u)} v^T_{vu}])$. Before pooling the nucleotide-level embeddings into a graph-level representation, we pass $h_1 \ldots h_L$ through a single bidirectional LSTM layer, obtaining $\hat{h}_1 \ldots \hat{h}_L$ at each step, and $h_G = \max(\{\hat{h}_i \mid i \in 1 \ldots L\})$. The latent encoding $z_G$ is similarly obtained from $h_G$ using the reparameterization trick.
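The sketch below transcribes Eqs. 1-2 into PyTorch with naive Python-loop edge bookkeeping for readability; the class and argument names are ours, and ReLU stands in for the unspecified activation $\sigma$.

```python
import torch
import torch.nn as nn

class GMPNN(nn.Module):
    """Gated message passing over directed edges of the secondary-structure graph."""
    def __init__(self, node_dim, edge_dim, hidden):
        super().__init__()
        self.hidden = hidden
        self.w_local = nn.Linear(node_dim + edge_dim, hidden)
        self.w_msg = nn.Linear(hidden, hidden, bias=False)
        self.gru = nn.GRUCell(hidden, hidden)      # Eq. 2: gated update
        self.w_emb = nn.Linear(node_dim + hidden, hidden)

    def forward(self, x, edge_attr, neighbors, T=5):
        # x: (N, node_dim); neighbors[u]: list of nodes adjacent to u;
        # edge_attr[(u, v)]: (edge_dim,) features x_uv of the directed edge u -> v.
        edges = [(u, v) for u in neighbors for v in neighbors[u]]
        msg = {e: torch.zeros(self.hidden) for e in edges}
        for _ in range(T):
            new_msg = {}
            for (u, v) in edges:                   # Eq. 1: aggregate incoming messages
                agg = torch.stack([msg[(w, u)] for w in neighbors[u]]).sum(0)
                pre = torch.relu(self.w_local(torch.cat([x[u], edge_attr[(u, v)]]))
                                 + self.w_msg(agg))
                new_msg[(u, v)] = self.gru(pre.unsqueeze(0),
                                           msg[(u, v)].unsqueeze(0)).squeeze(0)
            msg = new_msg
        # Readout h_u over final incoming messages; every nucleotide has at least
        # one backbone neighbor, so the stack below is never empty.
        h = [torch.relu(self.w_emb(torch.cat(
                [x[u], torch.stack([msg[(w, u)] for w in neighbors[u]]).sum(0)])))
             for u in sorted(neighbors)]
        return torch.stack(h)                      # nucleotide embeddings h_1 ... h_L
```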
Hierarchical encoding of the RNA hypergraph. To encode the junction tree $T$ of an RNA, we employ a type of GRU specifically suited to tree-like structures, which has previously been applied in works such as GGNN (Li et al., 2016) and JTVAE (Jin et al., 2018). We refer to this tree encoding network as T-GRU, and the format of its input is shown in Figure 1.

One major distinction between our RNA junction tree and the one used for chemical compounds (Jin et al., 2018) is that an RNA subgraph assumes a more variable nucleotide composition, such that it is impossible to enumerate the subgraphs based on the observed data. Therefore, we need to dynamically compute the features of each node in an RNA junction tree from its contained nucleotides, in a hierarchical manner that leverages the nucleotide-level embeddings learnt by the G-MPNN.

Considering a subgraph $\hat{G}_i$ in the junction tree $T$, we initialize its node feature with $x_{\hat{G}_i} = [L(\hat{G}_i) \,\|\, \max_{u \in I(\hat{G}_i)} h_u]$. Notably, $\max_{u \in I(\hat{G}_i)} h_u$ is a max-pooling over all nucleotides assigned to $\hat{G}_i$, and the nucleotide embedding $h_u$ comes from the G-MPNN. To compute and pass neural messages between adjacent subgraphs in the RNA junction tree $T$, we use the T-GRU network in Eq. 3, and compute the embeddings of the subgraphs with Eq. 4:

$$v^t_{\hat{G}_i, \hat{G}_j} = \mathrm{T\text{-}GRU}\big(x_{\hat{G}_i}, \{v^{t-1}_{\hat{G}_k, \hat{G}_i} \mid \hat{G}_k \in N(\hat{G}_i)\}\big) \quad (3)$$
$$h_{\hat{G}_i} = \sigma\Big(W^t_{\mathrm{emb}}\big[x_{\hat{G}_i} \,\big\|\, \sum_{\hat{G}_k \in N(\hat{G}_i)} v^T_{\hat{G}_k, \hat{G}_i}\big]\Big) \quad (4)$$

with the details of T-GRU provided in appendix D. Further, we obtain a depth-first traversal of the subgraph embeddings $h_{\hat{G}_1} \ldots h_{\hat{G}_{D'}}$, which is also the order used for the hierarchical decoding discussed later. This ordered array of embeddings is processed by another bidirectional LSTM, and the final tree-level representation $h_T$ is again given by max-pooling over the bi-LSTM outputs. Likewise, the latent encoding $z_T$ is obtained from $h_T$.

¹ We do not differentiate the number of hydrogen bonds, which can differ depending on the base-pair: for example, G-C has three hydrogen bonds whereas A-U only has two.
3.2 MOLECULAR GENERATION
Decoding linearized sequence and structure.
In this setting, the decoder simply autoregressively decodes a token at each step, from the joint sequence-structure vocabulary mentioned in section 3.1, plus one additional symbol to signal the end of decoding. To simplify the design choice, we use a single-layered forward-directional LSTM, and its hidden state is initialized with the latent encoding $z$, which can be either $z_S$, $z_G$ or $z_T$.

Figure 2: Hierarchical decoding of a structured RNA, involving three types of predictions: on the topological level, the node level, and the nucleotide level. These three types of prediction are interleaved into the procedures of decoding the junction tree structure of the RNA and the nucleotide segments.
Hierarchically decoding the hypergraph and nucleotide segments. The inputs to this more sophisticated hierarchical decoder are the latent encodings $z_G$, which contains order and basic connectivity information of the nucleotides, and $z_T$, which contains higher-order information about the arrangement of nucleotide branches and their interactions. We give a concise description of the decoding procedure here, along with a detailed algorithm in appendix E. On a high level, we hierarchically decode the tree structure in a depth-first manner, and autoregressively generate a nucleotide segment for each visited tree branch. For these purposes, we interleave three types of prediction (Figure 2).

Denote the current tree node at decode step $t$ and at the $i$-th visit as $\hat{G}_{t,i}$, whose features include (1) its node label $L(\hat{G}_{t,i})$ and (2) a summary over the already existing $i-1$ nucleotide segments, $\max\{h^{l,j}_u \mid u \in \hat{G}_{t,i},\, l < t,\, j < i\}$.
3.3 STRUCTURALLY CONSTRAINED DECODING

To better regulate the decoding process so that generated RNAs have valid secondary structures, a set of constraints can be added to the decoding procedures at the inference stage. Essentially, a valid RNA secondary structure needs to observe the following rules: (1) base-pairing complementarity, meaning only the canonical base-pairs and Wobble base-pairs are allowed, i.e. [A-U], [G-C] and [G-U]; (2) a hairpin loop should have a minimum of three unpaired nucleotides, i.e. for any two paired bases at positions $i$ and $j$, $|i - j| > 3$; (3) each nucleotide can only be paired once, and overlapping pairs are disallowed.

Figure 3: RNAs generated with structural constraints from HierVAE on a random axis in the latent space (step size: 1e-4), for short (top), medium-length (middle), and long (bottom) RNAs. The free energy (FE) and its deviation (DEV) from the MFE are given for each structure.

We translate the above rules into specific and applicable constraints, depending on the specific decoder. For the sake of space, we only give a broad remark here and leave more details to the appendix; a minimal sketch of a checker for these rules is also given below.
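For concreteness, the following sketch implements rules (1)-(3) for a sequence paired with a dot-bracket string; the function and variable names are ours.

```python
# Canonical Watson-Crick pairs plus Wobble pairs (rule 1).
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"),
               ("G", "U"), ("U", "G")}

def is_valid(seq: str, dotbracket: str) -> bool:
    """Check the three validity rules on a (sequence, dot-bracket) pair."""
    stack = []
    for j, (nuc, sym) in enumerate(zip(seq, dotbracket)):
        if sym == "(":
            stack.append((nuc, j))
        elif sym == ")":
            if not stack:
                return False                      # rule 3: unmatched closing bracket
            nuc_i, i = stack.pop()
            if (nuc_i, nuc) not in VALID_PAIRS:
                return False                      # rule 1: complementarity
            if j - i <= 3:
                return False                      # rule 2: >= 3 unpaired in hairpins
    return not stack                              # rule 3: nothing left open
```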
Linearized decoding constraints. Since the linearized decoder simply proceeds in an autoregressive fashion, constraints can easily be enforced: at each step, a nucleotide with an appropriate structural annotation is sampled by applying masks and re-normalizing the probabilities. Likewise, a stop token can only be sampled once all opening nucleotides have been closed. More details follow in appendix F.
Hierarchical decoding constraints.
The specific set of constraints for hierarchical decoding is discussed in appendix G. Overall, considering the different natures of the three associated types of prediction, each one requires a different set of strategies, which are once again applied by adding proper masks before sampling. In the algorithm in appendix E, these constraints are applied at the lines marked with asterisks (lines 13, 14 and 24).

3.4 AVOIDING POSTERIOR COLLAPSE
As discussed in a line of previous works, VAEs with strong autoregressive decoders are susceptible to posterior collapse, an issue where the decoder simply ignores the latent encoding of the encoder (He et al., 2019). Therefore, to avoid posterior collapse, we make use of a carefully chosen KL annealing schedule during training to help the encoder adapt the information content of its latent encoding in coordination with the decoder. This schedule is detailed in section 4. We also learn a parameterized prior as suggested in Chen et al. (2017), but using a CNF instead, following an implementation similar to Yang et al. (2019), with details given in appendix H.

Our KL annealing schedule is chosen based on empirical observations, as, to our knowledge, no principled method for selecting such a schedule yet exists. We used diagnostic metrics such as mutual information (He et al., 2019) and active units (Burda et al., 2016), along with a validation set, to select a proper KL annealing schedule, which is described in section 4.
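A sketch of such a schedule is given below. Section 4 specifies the warm-up length and the maximum $\beta$, but not the exact ramp shape, so the linear ramp here is an assumption.

```python
def kl_beta(epoch, warmup=5, ramp=15, beta_max=2e-3):
    """KL annealing: hold beta at 0 during warm-up, then ramp linearly to beta_max
    (e.g. 2e-3 for HierVAE, 3e-3 for LSTMVAE/GraphVAE; see section 4)."""
    if epoch < warmup:
        return 0.0                # let the decoder settle before regularizing
    return beta_max * min(1.0, (epoch - warmup) / ramp)
```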
4 RESULTS
We consider three modes of evaluation: (1) unsupervised RNA generation; (2) generation using semi-supervised VAE models; and (3) targeted RNA design from an organized latent space. Results are presented below, and relevant hyperparameters can be found in Table S1.
Unsupervised RNA generation.
Here, we evaluate generated RNAs from models trained on the unlabeled RNA dataset for 20 epochs, using a KL annealing schedule with 5 epochs of warm-up, followed by gradually increasing the KL annealing term to 3e-3 (for LSTMVAE and GraphVAE) or 2e-3 (for HierVAE). The KL annealing schedule was chosen using a validation set of 1,280 RNAs.

Table 1: We evaluate RNAs sampled from the posterior distribution q(z|x), with a held-out test set of 20,000 RNAs; each molecule is encoded and decoded 5 times. We also evaluate samples from the prior distribution N(0, I) subject to the transformation of a latent CNF, where we sample 10,000 encodings and each encoding is decoded 10 times. Normed refers to length-normalized FE DEV.

                           Posterior Decoding                          Prior Decoding
Model      Validity↑  FE DEV↓  Normed↓  Diversity↑    Validity↑  FE DEV↓  Normed↓  Diversity↑
Constrained & Stochastic
LSTMVAE    99.47%     18.197   0.061    6.786         99.50%     18.536   0.062    6.789
GraphVAE   99.47%     17.275   0.061    6.790         99.36%     18.534   0.065    6.791
HierVAE    99.97%      8.678   0.035    6.787         99.86%      8.676   0.036    6.791
Unconstrained & Stochastic
LSTMVAE    62.98%      8.700   0.048    6.791         62.58%      9.060   0.049    6.793
GraphVAE   65.79%      9.508   0.051    6.792         63.45%     10.166   0.055    6.794
HierVAE    94.51%      8.257   0.035    6.787         92.75%      7.897   0.037    6.791

Table 1 compares the generation capability of the different models, from the posterior as well as the prior distribution, and in scenarios with and without structural constraints applied to the decoding process. It clearly shows that our most advanced model, HierVAE, which employs a hierarchical view of the structure in its encoding and decoding, achieves the best performance across the different evaluation regimes, generating valid and stable RNAs even when the decoding process is unconstrained. It is also observed that, despite the structural constraints, the validity of the generated RNAs is always slightly below 100%. This can be explained by the threshold hyperparameters which set the maximum number of topological prediction steps as well as the maximal length of each nucleotide segment, as shown in Algorithm 1 in appendix E.

To further separate the benefits of model training from those of the structural constraints, we sample RNAs from the prior of an untrained
HierVAE model. With structural constraints, the validity amounts to 66.34%, with an extremely high free energy deviation of 22.613. Without structural constraints, the validity drops to a mere 9.37%, and the model can only decode short single-stranded RNAs, as it lacks the knowledge needed to construct more complex structures. This comparison illustrates that model training is essential for obtaining stable RNA folds.

The junction tree hierarchy of RNAs developed in our work shares certain modelling similarities with the probabilistic context-free grammars (Dowell & Eddy, 2004) used by covariance models (CM) (Eddy & Durbin, 1994). Infernal (Nawrocki & Eddy, 2013) is one of the representative works based on CMs, and is capable of sampling RNA secondary structures from a CM built around a consensus secondary structure for a conserved RNA family. However, due to the lack of homologous sequences in our dataset, Infernal is seriously limited here and can only sample single-stranded RNAs.

Figure 3 illustrates RNA structures generated using HierVAE from a randomly chosen short path through the latent space. Notably, the latent encoding provided by HierVAE translates smoothly into the RNA structure domain: nearby points in the latent space result in highly similar, yet different, structures. The generated structures are particularly stable for short and medium-size RNAs, and slightly less so for longer RNAs with highly complex structures. A side-by-side comparison between generated RNA secondary structures and MFE structures in Figure S3 shows that generated structures can evolve smoothly in the latent space along with their corresponding MFE structures. We also visualize neighborhoods of a cysteine-carrying transfer RNA and a 5S ribosomal RNA in Figures S4 and S5.
Semi-supervised RNA generation.
We then evaluate our generative approaches in a semi-supervised setting using seven RBP binding datasets from RNAcompete-S. First, we compare the efficacy of different representational choices while excluding the generative components, i.e. we jointly train VAE encoders followed by simple MLP classifiers on top of the latent encodings for binary classification of RBP binding.

Table 2: Training semi-supervised HierVAE on the labeled RNAcompete-S datasets. A test split is used to evaluate the accuracy of the embedding classifiers and of RNAs decoded from the posterior distribution under two settings: constrained and stochastic (C&S), and unconstrained and deterministic (NC&D). RECON ACC refers to reconstruction accuracy, which measures the percentage of RNA molecules decoded exactly as the input.

           Test     Posterior C&S                    Posterior NC&D
Dataset    AUROC    Valid  FE DEV  RECON ACC         Valid   FE DEV  RECON ACC
HuR        0.884    100%   0.426   55.85%            99.34%  0.269   68.73%
PTB        0.907    100%   0.409   55.24%            92.07%  0.570   51.96%
QKI        0.824    100%   0.439   55.80%            99.22%  0.296   66.83%
Vts1       0.900    100%   0.539   47.96%            98.90%  0.367   60.60%
RBMY       0.878    100%   0.634   48.09%            98.64%  0.419   61.84%
SF2        0.900    100%   0.616   44.57%            98.88%  0.409   57.15%
SLBP       0.792    100%   0.459   53.67%            98.60%  0.306   65.49%

Table S3 shows that incorporating RNA secondary structure is overall beneficial for classification accuracy, except for RBMY, where a model with access to the RNA sequence alone (LSTM-SeqOnly) has the best performance. Notably, different choices for representing RNA secondary structure do not lead to large variations in performance, with the exception of HuR and SLBP, where graph-based representations have an advantage over the linearized structural representation. On the other hand, sequence-based models often have comparable performance, possibly due to the capability of inferring RNA secondary structures from short RNA sequences. It is also worth exploring other in-vitro selection protocols such as HTR-SELEX, which can select RNAs with higher binding affinities than RNAcompete-S, which only involves a single selection step.

Next, we train full generative models (encoder, decoder, latent CNF and MLP embedding classifier), and show the results in Table 2. Since our strategy for targeted RNA design makes use of seed molecules in the latent space, we mainly sample RNAs from the posterior distribution of these semi-supervised VAE models. Therefore, we select a KL annealing schedule that tends to retain more information in the latent encodings, i.e. setting the maximum β to 5e-4 and training for 10 epochs.

Results are promising in that the classification AUROC measured on the held-out test set is comparable to the fully supervised classification models in Table S3, and much better than models only using fixed, pretrained VAE embeddings, as shown in Table S2. Also, RNA structures generated from the posterior distribution, even under the setting of unconstrained and deterministic decoding, have high success rates, very stable conformations and good reconstruction accuracy.

Table 3: Designing novel RNAs with higher chances of RBP binding.

Dataset    Success    Improvement
HuR        96.88%     0.561 ± …
…          …          …
Targeted RNA design. We next study the task of designing RNAs with high RBP binding affinity. We start from the latent encodings of 10,000 randomly chosen RNA molecules that have negative labels in each RNAcompete-S test set, and use activation maximization to gradually alter the latent encodings so that the predicted binding probability from the embedding classifiers increases. These embedding classifiers have been trained jointly with the VAE models, with accuracy reported earlier (Table 2). Then, we use separately trained full classifiers (also shown earlier in Table S3) as proxy oracles for evaluating the "ground truth" probability of RBP binding. Table 3 reports the success rate (fraction of RNAs whose "ground truth" RBP binding probability was improved), along with the average improvement in binding probability. An example trajectory of optimized RNAs is shown in Figure S6.
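The sketch below shows the activation-maximization step on a latent code; the classifier interface (a callable returning a binding probability), the step size and the iteration count are illustrative assumptions.

```python
import torch

def optimize_latent(z_seed, clf, steps=100, lr=0.1):
    """Gradient ascent on the embedding classifier's predicted P(bind) w.r.t. z."""
    z = z_seed.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -clf(z).log()          # maximize predicted binding probability
        loss.sum().backward()
        opt.step()
    return z.detach()                 # decode this latent code into candidate RNAs
```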
5 RELATED WORK
Over the years, the field of computational drug discovery has witnessed the emergence of graph-centric approaches. One of the earliest methods, proposed in Gómez-Bombarelli et al. (2018), is defined on the linearized format of molecular structures and represents a family of methods that rely on sequential models to represent and generate SMILES strings of chemical compounds. Later methods have sought to build more chemical priors into the model, via (1) leveraging graph-based representation and generation techniques, (2) enforcing direct chemical constraints on the decoding process, (3) considering a multi-scale view of the molecular structures, or (4) using reinforcement learning to integrate more training signal about molecular structure and function. As a result, greater success has been achieved by models such as Kusner et al. (2017); Liu et al. (2018); Jin et al. (2018); You et al. (2018) at generating and searching for valid and more useful chemical compounds.

Graph representation learning is at the heart of these more recent approaches, helping to understand the rules governing the formation of these molecular structures, as well as the correspondence between structures and functions. Duvenaud et al. (2015) were among the first to apply GNNs to learn molecular fingerprints, and the general neural message passing framework for molecules was proposed in Gilmer et al. (2017), which demonstrated the power of MPNNs across various molecular benchmarking tasks. These prior works on molecular MPNNs, together with GNN architectures developed in other areas, such as those considering relational edges (Schlichtkrull et al., 2018) and attention (Velickovic et al., 2018), have laid the foundation for the success of these deep generative models.

Despite the fact that RNA molecules can adopt complex structures, dedicated graph representation learning techniques have been scarce, with some recent works beginning to leverage graph-related learning techniques to predict RNA folding (Chen et al., 2020; Singh et al., 2019) and to represent RNA molecular structures (Yan et al., 2020; Oliver et al., 2020). Prior to our work, the design of RNA has mostly focused on the inverse design problem, which is to conditionally generate an RNA sequence whose MFE secondary structure corresponds to an input secondary structure. Therefore, this line of prior work has predominantly relied on sequential techniques, with some representative methods based on reinforcement learning (Runge et al., 2019), or more classically framed as a combinatorial optimization problem and solved with sampling-based techniques (Churkin et al., 2018). These prior works are mainly concerned with querying an energy model with fixed thermodynamic parameters and fixed dynamics of RNA folding, which is in itself limited compared to learning-based approaches (Chen et al., 2020; Singh et al., 2019), and they are unable to model a joint distribution over RNA sequences and possible folds.
6 CONCLUSION AND FUTURE WORKS
In this work we propose the first graph-based deep generative approach for jointly embedding and generating RNA sequence and structure, along with a series of benchmarking tasks. Our work demonstrates strong performance at generating diverse, valid and stable RNA secondary structures with useful properties.

For future work, there are several important directions to consider. First, it would be beneficial to obtain non-coding RNA families from the RFAM database (Kalvari et al., 2017), which would help our models learn more biologically-meaningful representations indicative of RNA homology and function, in addition to the evolutionarily conserved RNA structural motifs that would enable the generation of more stable RNA secondary structures. In that context, a detailed comparison to Infernal and other probabilistic context-free grammar models would be meaningful.

On the methodological side, in light of the recent advances in pretraining protein sequence models across a large evolutionary scale (Rives et al., 2019; Elnaggar et al., 2020), our models for RNA may similarly benefit from such a procedure with data collected from RFAM. After the pretraining step, reinforcement learning can be used to finetune the generative component of our model with customizable rewards defined jointly on RNA structural validity, folding stability, and functions such as binding to certain proteins.

On the evaluation side, it would be of great interest to analyze our models for any potential RNA tertiary structural motifs and to compare them with those deposited in the CaRNAval (Reinharz et al., 2018) or RNA 3D Motif Atlas (Parlea et al., 2016) databases. Our models would also need modifications to allow non-canonical interactions and pseudoknots, which are common in RNA tertiary structures.

All in all, the representation, generation and design of structured RNA molecules represent a rich, promising, and challenging area for future research in computational biology and drug discovery, and an opportunity to develop fundamentally new machine learning approaches.

REFERENCES
Bronwen L Aken, Premanand Achuthan, Wasiu Akanni, M Ridwan Amode, Friederike Bernsdorff, Jyothish Bhai, Konstantinos Billis, Denise Carvalho-Silva, Carla Cummins, Peter Clapham, et al. Ensembl 2017. Nucleic Acids Research, 45(D1):D635-D642, 2016.

Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations (ICLR), 2016.

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. In International Conference on Learning Representations (ICLR), 2017.

Xinshi Chen, Yu Li, Ramzan Umarov, Xin Gao, and Le Song. RNA secondary structure prediction by learning unrolled algorithms. In International Conference on Learning Representations (ICLR), 2020.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.

A. Churkin, M. D. Retwitzer, V. Reinharz, Y. Ponty, J. Waldispuhl, and D. Barash. Design of RNAs: comparing programs for inverse RNA folding. Brief Bioinformatics, 19(2):350-358, 2018.

K. B. Cook, S. Vembu, K. C. H. Ha, H. Zheng, K. U. Laverty, T. R. Hughes, D. Ray, and Q. D. Morris. RNAcompete-S: Combined RNA sequence/structure preferences for RNA binding proteins derived from a single-step in vitro selection. Methods, 126:18-28, 2017.

Francis Crick. Central dogma of molecular biology. Nature, 227(5258):561-563, 1970.

Robin D. Dowell and Sean R. Eddy. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5(1):71, 2004.

David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Gómez-Bombarelli, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems (NeurIPS), 2015.

Sean R. Eddy and Richard Durbin. RNA sequence analysis using covariance models. Nucleic Acids Research, 22(11):2079-2088, 1994.

Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik, and Burkhard Rost. ProtTrans: Towards cracking the language of life's code through self-supervised deep learning and high performance computing. bioRxiv, 2020.

Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning (ICML), 2017.

Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268-276, 2018.

Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations (ICLR), 2019.

Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. Lagging inference networks and posterior collapse in variational autoencoders. In International Conference on Learning Representations (ICLR), 2019.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Wengong Jin, Regina Barzilay, and Tommi S. Jaakkola. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning (ICML), 2018.

Ioanna Kalvari, Joanna Argasinska, Natalia Quinones-Olvera, Eric P Nawrocki, Elena Rivas, Sean R Eddy, Alex Bateman, Robert D Finn, and Anton I Petrov. Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Research, 46(D1):D335-D342, 2017.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems (NeurIPS), 2016.

Matt J. Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In International Conference on Machine Learning (ICML), 2017.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. In International Conference on Learning Representations (ICLR), 2016.

Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L. Gaunt. Constrained graph variational autoencoders for molecule design. In Advances in Neural Information Processing Systems (NeurIPS), 2018.

R. Lorenz, S. H. Bernhart, C. Honer Zu Siederdissen, H. Tafer, C. Flamm, P. F. Stadler, and I. L. Hofacker. ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6:26, 2011.

David H. Mathews, Matthew D. Disney, Jessica L. Childs, Susan J. Schroeder, Michael Zuker, and Douglas H. Turner. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences of the United States of America, 101(19):7287, 2004.

Eric P. Nawrocki and Sean R. Eddy. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics, 29(22):2933-2935, 2013.

Carlos Oliver, Vincent Mallet, Roman Sarrazin Gendron, Vladimir Reinharz, William L Hamilton, Nicolas Moitessier, and Jérôme Waldispühl. Augmented base pairing networks encode RNA-small molecule binding preferences. Nucleic Acids Research, 48(14):7690-7699, 2020.

Norbert Pardi, Michael J. Hogan, Frederick W. Porter, and Drew Weissman. mRNA vaccines - a new era in vaccinology. Nature Reviews Drug Discovery, 17(4):261-279, 2018.

Lorena G. Parlea, Blake A. Sweeney, Maryam Hosseini-Asanjan, Craig L. Zirbel, and Neocles B. Leontis. The RNA 3D Motif Atlas: Computational methods for extraction, organization and evaluation of RNA motifs. Methods, 103:99-119, 2016.

D. Ray, H. Kazan, E. T. Chan, L. Pena Castillo, S. Chaudhry, S. Talukder, B. J. Blencowe, Q. Morris, and T. R. Hughes. Rapid and systematic analysis of the RNA recognition specificities of RNA-binding proteins. Nature Biotechnology, 27(7):667-70, 2009.

D. Ray, H. Kazan, K. B. Cook, M. T. Weirauch, H. S. Najafabadi, X. Li, S. Gueroussov, M. Albu, H. Zheng, A. Yang, H. Na, M. Irimia, L. H. Matzat, R. K. Dale, S. A. Smith, C. A. Yarosh, S. M. Kelly, B. Nabet, D. Mecenas, W. Li, R. S. Laishram, M. Qiao, H. D. Lipshitz, F. Piano, A. H. Corbett, R. P. Carstens, B. J. Frey, R. A. Anderson, K. W. Lynch, L. O. Penalva, E. P. Lei, A. G. Fraser, B. J. Blencowe, Q. D. Morris, and T. R. Hughes. A compendium of RNA-binding motifs for decoding gene regulation. Nature, 499(7457):172-7, 2013.

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations (ICLR), 2018.

Vladimir Reinharz, Antoine Soulé, Eric Westhof, Jérôme Waldispühl, and Alain Denise. Mining for recurrent long-range interactions in RNA structures reveals embedded hierarchies in network families. Nucleic Acids Research, 46(8):3841-3851, 2018.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning (ICML), 2015.

Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv, 2019.

Frederic Runge, Danny Stoll, Stefan Falkner, and Frank Hutter. Learning to design RNA. In International Conference on Learning Representations (ICLR), 2019.

Roman Sarrazin-Gendron, Hua-Ting Yao, Vladimir Reinharz, Carlos G. Oliver, Yann Ponty, and Jérôme Waldispühl. Stochastic sampling of structural contexts improves the scalability and accuracy of RNA 3D module identification. In Russell Schwartz (ed.), Research in Computational Molecular Biology, pp. 186-201. Springer International Publishing, 2020.

Thomas Schlake, Andreas Thess, Mariola Fotin-Mleczek, and Karl-Josef Kallen. Developing mRNA-vaccine technologies. RNA Biology, 9(11):1319-1330, 2012.

Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In The Semantic Web - 15th International Conference, ESWC 2018, pp. 593-607, 2018.

Jaswinder Singh, Jack Hanson, Kuldip Paliwal, and Yaoqi Zhou. RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning. Nature Communications, 10(1):5407, 2019.

Richard Stefl, Lenka Skrisovska, and Frédéric H. T. Allain. RNA sequence- and shape-dependent recognition by proteins in the ribonucleoprotein particle. EMBO Reports, 6(1):33-38, 2005.

Teague Sterling and John J. Irwin. ZINC 15 - ligand discovery for everyone. Journal of Chemical Information and Modeling, 55(11):2324-2337, 2015.

J. M. Stokes, K. Yang, K. Swanson, W. Jin, A. Cubillos-Ruiz, N. M. Donghia, C. R. MacNair, S. French, L. A. Carfrae, Z. Bloom-Ackermann, V. M. Tran, A. Chiappino-Pepe, A. H. Badran, I. W. Andrews, E. J. Chory, G. M. Church, E. D. Brown, T. S. Jaakkola, R. Barzilay, and J. J. Collins. A deep learning approach to antibiotic discovery. Cell, 181(2):475-483, 2020.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.

Zichao Yan, William L. Hamilton, and Mathieu Blanchette. Graph neural representational learning of RNA secondary structures for predicting RNA-protein interactions. Bioinformatics, 36(Supplement 1):i276-i284, 2020.

Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge J. Belongie, and Bharath Hariharan. PointFlow: 3D point cloud generation with continuous normalizing flows. In International Conference on Computer Vision (ICCV), 2019.

Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay S. Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
A ACKNOWLEDGEMENTS
We would like to thank all members of the Hamilton lab, the Blanchette lab, and the four anonymous reviewers for their insightful suggestions. This work was funded by a Genome Quebec/Canada grant to MB and by an Institut de valorisation des données (IVADO) PhD excellence scholarship to ZY. WLH is supported by a Canada CIFAR AI Chair. We also thank Compute Canada for providing the computational resources.
B BACKGROUND: RNA STRUCTURE AND KEY PROPERTIES
Figure S1: A nested RNA secondary structure can be represented by: (A) a dot-bracket annotation, where base-pairs correspond to matching parentheses; or (B) a molecular planar graph with two types of edges, corresponding to consecutive nucleotides (backbone) and base-pairing interactions; or (C) a junction tree, where nodes are labeled as stems (S), hairpins (H), internal loops (I), or multiloops (M), and edges correspond to the connections between these elements. All three forms are equivalent.
The representation of an RNA molecule starts from its primary sequence structure, i.e. a single chain of nucleotides (adenine (A), cytosine (C), guanine (G) and uracil (U)). RNA sequences are flexible and can fold onto themselves, enabling the formation of bonds between complementary nucleotides (Watson-Crick base-pairs [A-U, G-C] and Wobble base-pairs [G-U]), hence stabilizing the molecule.² The set of pairs of interacting nucleotides in an RNA forms its so-called RNA secondary structure. In computational analyses of RNA, it is standard to assume that a secondary structure is nested: if [i, j] and [k, l] form base pairs with i < k, then either l < j (nesting) or k > j (non-overlapping). This enables simple string or planar graph representations (Figure S1 A, B).

The nested structure assumption means that secondary structures can be modelled by a probabilistic context-free grammar (Dowell & Eddy, 2004), or by the closely related junction tree structure (Figure S1 C) (Sarrazin-Gendron et al., 2020), where each hypernode corresponds to a particular secondary substructure element: (1) stem: consecutive stacked base-pairs locally forming a double-stranded structure; (2) hairpin loop: an unpaired region closed by a base-pair; (3) internal loop: unpaired regions located between two stems; (4) multiloop: unpaired regions at the junction of at least three stems. Edges link elements that are adjacent in the structure.
Validity and stability of RNA folding. The notion of free energy of RNA secondary structures can be used to characterize the stability of a particular conformation. Given an RNA sequence, there are combinatorially many valid RNA secondary structures, all of which need to obey a set of constraints (summarized in section 3.3). However, some structures are more stable than others by having lower free energy; these structures are therefore more likely to exist (and hence be useful) in reality, due to the stochastic nature of RNA folding. The free energy of an RNA secondary structure can be estimated by an energy-based model with thermodynamic parameters obtained from experiments (Mathews et al., 2004), from which the minimum free energy (MFE) structure can be predicted, up to a reasonable approximation (Lorenz et al., 2011).³

² There exist other, non-canonical base-pairs, which are excluded from our current work.
³ Throughout this work, we use RNAfold (Lorenz et al., 2011) to compute free energies as well as MFE structures, due to its interpretability and acceptable accuracy for moderately sized RNAs.
C DATASET AND METRICS
The unlabeled dataset is obtained from the complete human transcriptome, downloaded from the Ensembl database (Aken et al. (2016); version GRCh38). We slice the transcripts into snippets with lengths randomly drawn between 32 and 512 nts, and use RNAfold to obtain the MFE structures. We randomly split the dataset into a training set that contains 1,149,859 RNAs, and 20,000 held-out RNAs for evaluating decoding from the posterior distribution. More information on the structural diversity and complexity of this dataset is shown in Figure S2; it presents significant challenges for our algorithms.

The labeled dataset is pulled from a previous study on the sequence and structural binding preferences of RNA binding proteins (RBP), using an in vitro selection protocol called RNAcompete-S (Cook et al., 2017), which generates synthesized RNA sequences bound or unbound to a given RBP. RNAs in this experiment are of uniform length, i.e. 40 nts, and offer a rich abundance of RNA secondary structures compared to predecessor protocols such as RNAcompete (Ray et al., 2009; 2013). Since no benchmark has been established since its publication, we randomly sample 500,000 positive sequences bound to each RBP, and the same number of negative sequences from the pool of unbound sequences, to curate a dataset for each of the seven RBPs investigated in this paper. Then, 80% of all RNAs are randomly assigned to the train split, and the rest go to the test split.

Our evaluation scheme for the generated RNA secondary structures includes the following metrics (a sketch of their computation is given after this list):

• validity: percentage of generated RNA secondary structures that conform to the structural constraints specified in section 3.3.

• free energy deviation (FE DEV): difference in free energy between the generated RNA secondary structure and the MFE structure of the corresponding sequence, which quantifies the gap between both structures from an energy perspective. A lower FE DEV indicates higher stability of the generated RNAs.

• free energy deviation normalized by length (Normed FE DEV): FE DEV divided by the length of the generated RNA, which distributes the contribution of the total FE DEV to each base.

• diversity: entropy of the normalized counts of 5-mer substrings, which directly measures the diversity of RNA sequences, and indirectly that of RNA secondary structures when combined with FE DEV, since monolithic structures over diverse sequences would lead to high FE DEV.
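The following sketch computes FE DEV and the k-mer diversity, assuming the ViennaRNA Python bindings (the paper computes free energies with RNAfold); the function names are ours.

```python
import math
from collections import Counter
import RNA  # ViennaRNA Python bindings

def fe_dev(seq: str, generated_ss: str, normalize: bool = False) -> float:
    """Free energy gap between a generated structure and the MFE structure."""
    fc = RNA.fold_compound(seq)
    mfe_ss, mfe = fc.mfe()                       # minimum free energy structure
    dev = fc.eval_structure(generated_ss) - mfe  # FE DEV
    return dev / len(seq) if normalize else dev  # Normed FE DEV if requested

def diversity(seqs, k=5) -> float:
    """Entropy of the normalized k-mer counts over a set of sequences."""
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())
```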
D TREE ENCODING: T-GRU
Following Eq. 3, T-GRU computes a new message $v^t_{\hat{G}_i, \hat{G}_j}$ from $\hat{G}_i$ to $\hat{G}_j$, based on the features of $\hat{G}_i$, denoted by $x_{\hat{G}_i}$, as well as the neural messages from neighboring subgraphs to $\hat{G}_i$, i.e. $\{v^{t-1}_{\hat{G}_k, \hat{G}_i} \mid \hat{G}_k \in N(\hat{G}_i)\}$. The internal structure of T-GRU is equivalent to the tree encoder employed in Jin et al. (2018), which is essentially a neural analogue of the belief propagation algorithm on junction trees. Nevertheless, we write down the message passing formulas of T-GRU here:

$$s_{\hat{G}_i, \hat{G}_j} = \sum_{\hat{G}_k \in N(\hat{G}_i)} v^{t-1}_{\hat{G}_k, \hat{G}_i} \quad (S1)$$
$$z_{\hat{G}_i, \hat{G}_j} = \sigma(W^z [x_{\hat{G}_i} \,\|\, s_{\hat{G}_i, \hat{G}_j}] + b^z) \quad (S2)$$
$$r_{\hat{G}_k, \hat{G}_i} = \sigma(W^r [x_{\hat{G}_i} \,\|\, v^{t-1}_{\hat{G}_k, \hat{G}_i}] + b^r) \quad (S3)$$
$$\hat{v}_{\hat{G}_i, \hat{G}_j} = \tanh\Big(W \big[x_{\hat{G}_i} \,\big\|\, \sum_{\hat{G}_k \in N(\hat{G}_i)} r_{\hat{G}_k, \hat{G}_i} \cdot v^{t-1}_{\hat{G}_k, \hat{G}_i}\big]\Big) \quad (S4)$$
$$v^t_{\hat{G}_i, \hat{G}_j} = (1 - z_{\hat{G}_i, \hat{G}_j}) \odot s_{\hat{G}_i, \hat{G}_j} + z_{\hat{G}_i, \hat{G}_j} \odot \hat{v}_{\hat{G}_i, \hat{G}_j} \quad (S5)$$
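A direct transcription of Eqs. S1-S5 into PyTorch for a single receiving node is sketched below; the dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class TGRU(nn.Module):
    """T-GRU message update (Eqs. S1-S5) for one node and its incoming messages."""
    def __init__(self, node_dim, hidden):
        super().__init__()
        self.w_z = nn.Linear(node_dim + hidden, hidden)
        self.w_r = nn.Linear(node_dim + hidden, hidden)
        self.w = nn.Linear(node_dim + hidden, hidden)

    def forward(self, x_i, incoming):
        # x_i: (node_dim,) node features; incoming: (K, hidden) messages, K >= 1.
        s = incoming.sum(dim=0)                                        # Eq. S1
        z = torch.sigmoid(self.w_z(torch.cat([x_i, s])))               # Eq. S2
        r = torch.sigmoid(self.w_r(torch.cat(                          # Eq. S3
            [x_i.expand(incoming.size(0), -1), incoming], dim=1)))
        v_hat = torch.tanh(self.w(torch.cat(                           # Eq. S4
            [x_i, (r * incoming).sum(dim=0)])))
        return (1 - z) * s + z * v_hat                                 # Eq. S5
```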
E ALGORITHM FOR HIERARCHICALLY DECODING STRUCTURED RNA
Algorithm 1: DFS decode RNA secondary structure

 1: Given: z_T, z_G, M_TI, M_SI (a)
 2: Initialize: stack ← [ ]
 3: function decode(z_T, z_G)
 4:   root ← sample(MLP_node(z_T))
 5:   root.add_incoming_message(z_T)
 6:   stack.push((root, ∅))
 7:   t ← 1
 8:   while t ≤ M_TI and stack.size() ≥ 1 do
 9:     c_node, last_nuc ← stack.get_last_item()
10:     all_msg ← {msg | ∀ msg ∈ c_node.get_incoming_message()}
11:     local_field ← [c_node.label() ‖ c_node.get_segment_features()]
12:     new_msg ← T-GRU(local_field, all_msg)
        // topological prediction
13:     is_backtrack ← sample(MLP_topo(z_T)) *
        // nucleotide segment prediction
14:     new_msg, last_nuc, decoded_segment, segment_features ← decode_segment(new_msg, last_nuc, z_T, z_G, M_SI) *
15:     c_node.add_decoded_segment(decoded_segment)
16:     c_node.add_segment_features(segment_features)
17:     if is_backtrack = True then
          // backtrack to the parent node
18:       c_node.add_incoming_message(new_msg)
19:       p_node, _ ← stack.get_penultimate_item()
20:       p_node.add_neighbor(c_node)
21:       stack.update_penultimate_item((p_node, last_nuc))
22:       stack.pop()
23:     else
          // predict and expand to a new tree node
24:       new_node ← sample(MLP_node(new_msg)) *
25:       new_node.add_incoming_message(new_msg)
26:       new_node.add_neighbor((c_node, last_nuc))
27:       stack.push((new_node, last_nuc))
28:     end
29:     t ← t + 1
30:   end
31:   return root

Lines marked with * are where the structural constraints of appendix G are applied.

(a) M_TI refers to the threshold which sets the maximum allowed number of topological prediction steps; M_SI is another threshold limiting the length of each decoded nucleotide segment.
F DETAILS FOR APPLYING RNA STRUCTURAL CONSTRAINTS TO LINEARIZED DECODING PROCEDURES
When decoding from the joint vocabulary of sequence and dot-bracket structure ($\{A, C, G, U\} \times \{., (, )\}$), whenever a nucleotide $nuc_i$ with a left bracket is sampled at step $i$, we append it to a stack, i.e. $\{(nuc_{i_1}, i_1) \ldots (nuc_{i_k}, i_k)\}$. Then, at decode step $j$, with $(nuc_i, i)$ at the top of the stack:

• if $|i - j| \leq 3$, a proper mask is added to the categorical logits of the vocabulary to avoid sampling any nucleotide with a right bracket, which means only an unpaired nucleotide or one that comes with a left bracket can be sampled;

• if $|i - j| > 3$, a mask is applied to make sure that only a nucleotide complementary to $nuc_i$ can be sampled with a right bracket. Sampling nucleotides with other forms of structure is still allowed.

As soon as a nucleotide with a closing right bracket is sampled, we pop $(nuc_i, i)$ from the stack. The special symbol for stopping the decoding can only be sampled once the stack has become empty.
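The sketch below shows this mask construction; the vocabulary bookkeeping and helper names are hypothetical, and in practice the disallowed tokens would have their logits masked out before re-normalizing and sampling.

```python
# Complementary bases under canonical and Wobble pairing.
COMPLEMENT = {"A": {"U"}, "U": {"A", "G"}, "G": {"C", "U"}, "C": {"G"}}

def allowed_tokens(stack, j, vocab):
    """stack holds (nuc_i, i) for open '(' tokens; j is the current decode step.
    vocab is the joint sequence-structure alphabet, e.g. ("A", "."), ("G", "(")."""
    allowed = set()
    for nuc, struct in vocab:
        if struct in (".", "("):
            allowed.add((nuc, struct))        # unpaired or new opening: always fine
        elif stack:                           # ')' must close the top of the stack
            nuc_i, i = stack[-1]
            if abs(i - j) > 3 and nuc in COMPLEMENT[nuc_i]:
                allowed.add((nuc, struct))    # complementary and hairpin-legal
    if not stack:
        allowed.add("STOP")                   # stop only once everything is closed
    return allowed
```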
G DETAILS FOR APPLYING RNA STRUCTURAL CONSTRAINTS TO HIERARCHICAL DECODING PROCEDURES
Additional constraints are enforced during the hierarchical decoding process to ensure the validity of the decoded RNA secondary structure. Recall from section 3.2 that three types of predictions are involved in the hierarchical decoding; each type is therefore associated with its own set of rules. All rules can be enforced by adding proper masks to the categorical logits before sampling, as detailed below.

Constraints for making the topological prediction, when the current node is a:
• stem node: the algorithm always expands to a new node upon its first visit, and backtracks to its parent node upon a re-visit;
• hairpin node: the algorithm always backtracks;
• internal loop: the algorithm acts as for a stem node;
• multiloop: the algorithm always expands upon the first visit and the next re-visit. Further re-visits to the same multiloop node are not regulated.

Constraints for predicting a new tree node, when the current node is a:
• stem node: its child node, when it exists, can be either a hairpin loop, an internal loop, or a multiloop;
• hairpin node, internal loop or multiloop: its child node must be a stem node.

Constraints for decoding a nucleotide segment. Due to the non-empty intersection between adjacent subgraphs, the start token for decoding a segment at the current node is always the last nucleotide decoded at the previous node; therefore, the algorithm needs to decode at least one new nucleotide for each segment. When the current node is a:
• stem node, upon its first visit (i.e. decoding the first segment of a stem): there is no need for constraints. Upon its re-visit, the algorithm needs to decode exactly the complementary bases, in reverse order, according to the first decoded segment;
• hairpin node: the decoder needs to decode at least four nucleotides before seeing the stop symbol, unless the hairpin is also the root node;
• internal loop node, upon its first visit: no constraint is necessary. Upon its re-visit, the algorithm needs to decode at least one unpaired nucleotide, on the condition that the first decoded internal loop segment does not contain any unpaired nucleotides;
• multiloop node: there is no need for constraints.
H DETAILS FOR PARAMETERIZING PRIOR DISTRIBUTION USING NORMALIZING FLOW
A normalizing flow involves a series of bijective transformations with tractable Jacobian log-determinants, which map an observed datapoint x ∼ p_θ(x) from a complex distribution to a simpler one, such as the standard normal distribution.

Consider the simplified case where we have a single bijective function f_θ : Z → X that maps simple latent variables z to an observed datapoint x. Then, using the change-of-variable theorem, the likelihood of the observed datapoint can be evaluated as:

p_\theta(x) = p_z\big(f_\theta^{-1}(x)\big)\,\Big|\det\frac{\partial f_\theta^{-1}(x)}{\partial x}\Big|   (S6)

where p_z(·) denotes some simple base distribution, e.g. N(0, I). It then becomes clear that the efficiency of this scheme relies heavily on how efficiently the forward mapping f_θ can be inverted and its Jacobian log-determinant computed.

In this project, we use a type of continuous normalizing flow (CNF) which simplifies the above computation (Chen et al., 2018). Consider a time-continuous dynamics f_ψ(z(t), t) over some intermediate data representation z(t), with z(t_0) ∼ p_z(·). The transformation of variables, along with its inverse mapping, can be expressed as:

z \triangleq z(t_1) = z(t_0) + \int_{t_0}^{t_1} f_\psi(z(t), t)\,dt   (S7)

z(t_0) = z(t_1) + \int_{t_1}^{t_0} f_\psi(z(t), t)\,dt   (S8)

and the change of probability density can be expressed as:

\log p_\psi(z) = \log p_z(z(t_0)) - \int_{t_0}^{t_1} \mathrm{tr}\Big(\frac{\partial f_\psi}{\partial z(t)}\Big)\,dt   (S9)

Note that invertibility is no longer a concern under some mild constraints (Chen et al., 2018). Moreover, Eq. S9 only involves a lightweight trace operation on the Jacobian rather than the evaluation of its log-determinant.

We therefore learn a parameterized prior using a CNF, and use the following decomposition of the KL term in the VAE objective:

\mathrm{KL}\big(q_\phi(z|x)\,\|\,p_\psi(z)\big) = -\mathbb{E}_{z\sim q_\phi(z|x)}\big[\log p_\psi(z)\big] - \mathcal{H}\big[q_\phi(z|x)\big]   (S10)

During training, the CNF parameterized by ψ transforms complex latent encodings z ∼ q_φ(z|x) into simple z(t_0) ∼ N(0, I), with an exact likelihood given by Eq. S9 that enters Eq. S10 in the complete training objective. During inference, we simply sample z(t_0) ∼ N(0, I) and use the CNF to transform it in the reverse direction into z ∼ p_ψ(·), which should lie close to the approximate posterior.

Our specific parameterization of the CNF follows Yang et al. (2019) and Grathwohl et al. (2019), interleaving two hidden concatsquash layers of dimensionality 256 with Tanh non-linearities.
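To make Eqs. S7–S9 concrete, the sketch below computes a CNF log-likelihood with a fixed-step Euler integrator and an exact autograd trace. The dynamics network is a plain MLP standing in for the concatsquash layers, and the step count and layer sizes are illustrative assumptions; practical implementations use adaptive ODE solvers and stochastic trace estimators.

```python
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Time-dependent dynamics f_psi(z(t), t): a small MLP standing in for the
    concatsquash layers used in the paper (an illustrative simplification)."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z, t):
        t_col = t.expand(z.size(0), 1)  # broadcast the scalar time over the batch
        return self.net(torch.cat([z, t_col], dim=1))

def exact_trace(f, z):
    """tr(df/dz), computed one output dimension at a time with autograd."""
    trace = torch.zeros_like(z[:, 0])
    for i in range(z.size(1)):
        grad_i = torch.autograd.grad(f[:, i].sum(), z, create_graph=True)[0]
        trace = trace + grad_i[:, i]
    return trace

def cnf_log_prob(func, z, n_steps=20, t0=0.0, t1=1.0):
    """Eqs. S7-S9 with fixed-step Euler: transport z = z(t1) backwards to z(t0)
    while accumulating the trace integral along the trajectory."""
    dt = (t1 - t0) / n_steps
    trace_int = torch.zeros(z.size(0))
    zt = z.detach().requires_grad_(True)  # local graph for the trace computation
    for k in range(n_steps):
        t = torch.tensor(t1 - k * dt)
        f = func(zt, t)
        trace_int = trace_int + exact_trace(f, zt) * dt
        zt = zt - f * dt  # integrate the dynamics in reverse time
    base = torch.distributions.Normal(0.0, 1.0)
    # Eq. S9: log p_psi(z) = log p_z(z(t0)) - int tr(df/dz) dt
    return base.log_prob(zt).sum(dim=1) - trace_int

# Usage sketch: func = ODEFunc(dim=128); logp = cnf_log_prob(func, torch.randn(32, 128))
```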
I INFORMATION OF THE UNLABELED RNA DATASET
Figure S2: Statistics of the unlabeled RNA dataset. (A) The number of hypernodes appears to grow linearly with the length of the RNA, and (B) the junction tree height also grows with length, but on a more moderate scale. (C) and (D) show bar-plots of the number of hypernodes and of the tree height, indicating that the junction trees of RNAs can reach significant depth, which contributes to the diversity and complexity of the RNA secondary structures represented in this dataset.
J HYPERPARAMETERS
Table S1: Hyperparameters for training the VAE and full classifier models. Note that hidden units refers to the dimensionality of the encoders and decoders of the LSTMVAE, GraphVAE and HierVAE models. Dropout is applied to the embedding MLP classifier (which contains one hidden layer) when training semi-supervised VAEs.

For VAE models:
latent dimensionality           128
hidden units                    512
G-MPNN iterations               5
T-GRU iterations                10
learning rate                   1e-3
batch size                      32
optimizer                       AMSGrad (Reddi et al., 2018)
dropout ratio                   0.2
M_TI                            300
M_SI (hierarchical decoder)     100
M_SI (linearized decoder)       1000

For full classifier models (overriding some of the above hyperparameters):
learning rate                   2e-4
epochs                          200
early stopping epochs           5
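For reference, these settings translate directly into a training configuration; the key names below are illustrative stand-ins, not identifiers from the released code.

```python
# Illustrative training configuration mirroring Table S1 (key names hypothetical;
# values are from the table).
VAE_CONFIG = {
    "latent_dim": 128,
    "hidden_units": 512,
    "g_mpnn_iterations": 5,
    "t_gru_iterations": 10,
    "learning_rate": 1e-3,
    "batch_size": 32,
    "optimizer": "AMSGrad",   # Reddi et al., 2018
    "dropout": 0.2,
    "max_topo_steps": 300,    # M_TI
    "max_segment_len": {"hierarchical": 100, "linearized": 1000},  # M_SI
}
CLASSIFIER_OVERRIDES = {"learning_rate": 2e-4, "epochs": 200, "early_stopping_epochs": 5}
```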
K RNACOMPETE-S CLASSIFIERS ON PRETRAINED AND FIXED VAE EMBEDDINGS
Table S2: Performance of simple MLP classifiers on top of fixed latent embeddings from VAE models, which have been pretrained on the unlabeled RNA dataset as originally shown in Table 1.

RBP     LSTMVAE   GraphVAE   HierVAE
HuR     0.867     0.858      0.860
PTB     0.886     0.878      0.883
QKI     0.748     0.756      0.746
Vts1    0.775     0.758      0.774
RBMY    0.734     0.725      0.731
SF2     0.867     0.862      0.866
SLBP    0.749     0.737      0.747
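This evaluation protocol amounts to a probing setup: the pretrained encoder is frozen and only a small MLP head is fit on its embeddings. A minimal PyTorch sketch, where the encoder interface and head dimensions are assumed placeholders:

```python
import torch
import torch.nn as nn

# Hypothetical probe: `encoder` is a frozen, pretrained VAE encoder that emits
# one latent embedding per RNA; only the MLP head below is trained.
class MLPProbe(nn.Module):
    def __init__(self, latent_dim=128, hidden=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 1),  # binary bound/unbound logit
        )

    def forward(self, z):
        return self.head(z).squeeze(-1)

def train_step(encoder, probe, optimizer, batch, labels):
    encoder.eval()              # encoder stays fixed throughout training
    with torch.no_grad():
        z = encoder(batch)      # fixed latent embeddings
    loss = nn.functional.binary_cross_entropy_with_logits(probe(z), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```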
L END-TO-END RNACOMPETE-S CLASSIFIERS
Table S3: We use the same encoding architectures as in the generative models, and report their AUROC (mean ± standard deviation across 6 runs) for each RNAcompete-S RBP dataset.

RBP     LSTM-SeqOnly   LSTM   Graph   Hierarchical
HuR     0.880
PTB     0.900
QKI
Vts1
RBMY
SF2
SLBP
M ALTERNATIVE HIERVAE TRAINING ON RNACOMPETE-S
Table S4: Training HierVAE on the supervised RNAcompete-S datasets. All models are trained for 20 epochs: 5 warm-up epochs, 6 epochs linearly raising beta from 0 to 3e-3, and 9 remaining epochs with beta fixed at 3e-3 (see the schedule sketch below). AUROC is measured on the test set, and posterior decoding is evaluated with the final model.

                  Test      Post R&S                       Post NR&D
Dataset           AUROC     Valid   FE DEV   RECON ACC     Valid    FE DEV   RECON ACC
HuR               0.871     100%    0.951    18.97%        99.34%   0.702    31.52%
PTB               0.899     100%    0.826    21.17%        98.64%   0.674    31.28%
QKI               0.822     100%    0.867    17.82%        99.40%   0.627    30.62%
Vts1              0.874     100%    1.056    13.71%        99.39%   0.770    24.97%
RBMY              0.872     100%    0.963    11.86%        98.68%   0.690    22.91%
SF2               0.874     100%    0.921    14.44%        99.32%   0.668    25.99%
SLBP              0.764     100%    1.033    14.84%        99.44%   0.743    26.93%
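The beta schedule described in the caption is straightforward to reproduce; a sketch (0-based epoch indexing is an assumption):

```python
def beta_schedule(epoch, warmup=5, ramp=6, beta_max=3e-3):
    """KL weight per epoch: 0 during warm-up, a linear ramp up to beta_max over
    `ramp` epochs, then held fixed (as described for Table S4)."""
    if epoch < warmup:
        return 0.0
    if epoch < warmup + ramp:
        return beta_max * (epoch - warmup + 1) / ramp
    return beta_max

# e.g. [beta_schedule(e) for e in range(20)] gives 5 zeros, a 6-epoch linear
# ramp reaching 3e-3, then 9 epochs fixed at 3e-3.
```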
N COMPARISON OF GENERATED RNA SECONDARY STRUCTURES TO MFE STRUCTURES
Figure S3: A comparison of generated RNAs (left) to their corresponding MFE structures (right). RNAs are generated with structural constraints from HierVAE along three random axes. The ground-truth MFE structures are predicted by RNAfold; the generated RNAs evolve smoothly in the latent space, and their corresponding MFE structures likewise show relatively smooth transitions.
O NEIGHBORHOOD VISUALIZATION OF A CYSTEINE-CARRYING TRANSFER-RNA
Figure S4: Neighborhood visualization of tRNA-Cys, which is marked by the red bounding box in the center; the walk in the latent space takes place along two random orthogonal axes. Note that the secondary structure of tRNA-Cys plotted in the figure differs from the one deposited online (https://rnacentral.org/rna/URS00001F47B5/9606), since it is predicted by RNAfold.
P NEIGHBORHOOD VISUALIZATION OF A 5S RIBOSOMAL RNA
Figure S5: Neighborhood visualization of a 5S ribosomal RNA, which is marked by the red bounding box in the center; the walk in the latent space takes place along two random orthogonal axes. Note that the secondary structure of this 5S ribosomal RNA plotted in the figure differs from the one deposited online (https://rnacentral.org/rna/URS000075B93F/9606), since it is predicted by RNAfold.
Q TARGETED RNA GENERATION: AN EXAMPLE
Figure S6: An example of searching for novel structured RNAs with a higher chance of binding to HuR. The optimization takes place in the latent space of HierVAE, starting from the initial encoding of a random RNA molecule in the test set, and at each step altering the latent encoding using activation maximization on the embedding classifier (see the sketch below). The trajectory of generated RNAs is shown in order from left to right and top to bottom, and the field
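A minimal sketch of such activation maximization in the latent space, assuming a trained classifier head clf that maps latent codes to a binding logit (the interface names, step count, and learning rate are hypothetical):

```python
import torch

def activation_maximization(z_init, clf, steps=50, lr=0.1):
    """Gradient ascent on the classifier's binding logit with respect to the
    latent code; each intermediate z can be decoded into an RNA candidate."""
    z = z_init.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z], lr=lr)
    trajectory = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = -clf(z).sum()   # maximize the predicted binding score
        loss.backward()
        optimizer.step()
        trajectory.append(z.detach().clone())
    return trajectory

# Each point along `trajectory` would then be decoded (e.g., by the HierVAE
# decoder) to obtain sequence/structure candidates like those in Figure S6.
```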