The Role of Nucleobase Interactions in RNA Structure and Dynamics
TThe Role of Nucleobase Interactions in RNA Structure and Dynamics
Sandro Bottaro , ∗ , Francesco di Palma , Giovanni Bussi , ∗ Scuola Internazionale Superiore di Studi Avanzati,International School for Advanced Studies, 265, Via Bonomea I-34136 Trieste, Italy
The intricate network of interactions observed in RNA three-dimensional structures is often de-scribed in terms of a multitude of geometrical properties, including helical parameters, base pair-ing/stacking, hydrogen bonding and backbone conformation. We show that a simple molecular rep-resentation consisting in one oriented bead per nucleotide can account for the fundamental structuralproperties of RNA. In this framework, canonical Watson-Crick, non-Watson-Crick base-pairing andbase-stacking interactions can be unambiguously identified within a well-defined interaction shell.We validate this representation by performing two independent, complementary tests. First, weuse it to construct a sequence-independent, knowledge-based scoring function for RNA structuralprediction, which compares favorably to fully atomistic, state-of-the-art techniques. Second, we de-fine a metric to measure deviation between RNA structures that directly reports on the differencesin the base-base interaction network. The effectiveness of this metric is tested with respect to theability to discriminate between structurally and kinetically distant RNA conformations, performingbetter compared to standard techniques. Taken together, our results suggest that this minimalist,nucleobase-centric representation captures the main interactions that are relevant for describingRNA structure and dynamics.
I. INTRODUCTION
Ribonucleic acid (RNA) plays a fundamental role in a large variety of biological processes, such as enzymatic catalysis[1], protein synthesis [2], and gene regulation [3]. RNA molecules fold in a well-defined three-dimensional structuredictated by their nucleotide sequence [4], that can be determined by means of X-ray crystallography or nuclearmagnetic resonance experiments [5]. Despite recent important advances in the field, RNA structure determination isstill a complex and expensive procedure. Computational/theoretical models, and in particular structure predictionmethods, ideally complement experiments, as they can in principle provide the tertiary, native structure for a givenRNA sequence.Although traditionally considered simpler than the protein folding problem [4], de novo prediction of RNA three-dimensional structure still represents a challenging task [6]. Atomistic molecular dynamics methods have been so farlimited to the predictive folding of very small systems [7, 8], whereas coarse-grained, physics-based models have provenuseful in folding larger RNA structures [9, 10]. However, the most popular and successful RNA structure predictionmethods are the so-called knowledge-based techniques, that typically employ fragment libraries for constructing RNAstructures. These fragments are either combined using secondary structure information [11, 12] or used to generatea large number of plausible, putative structures that are subsequently ranked according to a scoring function [13–17]. Knowledge-based methods often rely on two assumptions: First, that the main structural features of RNAmolecules can be described in terms of few relevant observables. Second, that the distribution of these observablesin a dataset of experimental structures displays specific features, that can be used for generating and/or scoringthree-dimensional structures [18]. A crucial issue is the choice of the aforementioned observables. While backboneatoms and torsional angles are the most natural choice for representing protein structures [19], RNA molecules aremore difficult to treat, due to the presence of a larger number of flexible backbone degrees of freedom and because ofthe important structural role of base-base interactions. For this reason RNA molecules are often described in termsof several structural observables, including backbone dihedrals [20, 21], pseudo dihedrals [22], helical parameters [23],hydrogen-bond networks, and stacking interactions [24, 25].The proper choice of simplified, insightful descriptors for RNA molecules is not limited to the development ofstructure prediction methods, but is a general and relevant issue in RNA structural analysis, as in the definition ∗ To whom correspondence should be addressed. Email: [email protected], [email protected] a r X i v : . [ q - b i o . B M ] O c t of metrics for comparing three-dimensional structures. The well-known root mean square deviation (RMSD) afteroptimal superposition [26] is known to be suboptimal in the analysis of RNA structure, as it does not provide reliableinformation about the differences in the base interaction network [6, 27]. This motivated the development of x y C2 C4 C6yxPyrimidinePurine z ( Å ) ρ (Å) StackingWatson-Crick Non-Watson-CrickStacking ab ρθ z r jk Base j Base k 0 12ab r = √2.5 ~ r =1~ r = 2.4~ -44 c FIG. 1: (a) Definition of the local coordinate system for purines and pyrimidines. (b) The vector r jk describes the position ofbase ring k in the coordinate system constructed on base ring j and is expressed in cylindrical coordinates ρ, θ, z . (c) Observedpositions of neighboring bases in the crystal structure of the H. marismortui large ribosomal subunit projected on the z − ρ plane. Isolines of the ellipsoidal distance ˜ r are shown. Each point is colored according to the interaction type: canonicalWatson-Crick pairs in orange, base-base non Watson-Crick pairs in blue and stacking in green. Base-sugar, base-phosphate andnon-annotated pairs are shown in light grey. On the right, empirical distribution along the z coordinate obtained consideringpoints with ˜ r < √ .
5. Pairing and stacking regions can be unambiguously identified ( | z | ≤ . | z | > . alternative RNA-specific deviation measures based on selected geometrical properties. These measures can be usedfor comparing observed structures with a priori known patterns, as is done for instance in annotation methods [28]and in motif recognition algorithms [29–31]. Other RNA specific similarity measures were introduced with the purposeof assessing the quality of RNA structural predictions [6] or for clustering/identifying recurrent structural motifs inlarge databases of known structures [21, 29]. The definition of a proper structural deviation measure is also a centralissue in the construction of Markov state models [32], that typically assume the geometric similarity to provide anadequate measure of kinetic proximity.The analysis of RNA three-dimensional structure, the construction of a knowledge-based structure predictionmethod, and the definition of a structural deviation share a common trait: all of them are based on the appar-ently arbitrary choice of the structural observables used to represent RNA. Is there a simple representation of anRNA molecule that recapitulates its key structural and dynamical properties?In this Paper we introduce a minimalist representation for RNA molecules consisting in one oriented bead pernucleotide. We first show that this description captures the fundamental base-pair and base-stacking interactionsobserved in experimental structures. Using this representation we define a simple sequence-independent scoringfunction for RNA structure prediction, which performs on par or better compared to existing atomistic techniques.Based on the same molecular description, we then introduce a metric for calculating structural deviation. Whencompared with the standard RMSD measure, this metric correlates better with the physical time required to explorethe conformational space in molecular dynamics simulations. As a further proof of the capability of this metric todistinguish relevant structural differences, we show that it can be successfully used for recognizing local structuralmotifs, reproducing results obtained with state-of-the-art techniques. Finally, in the Discussion Section, we summarizethe analogies between these applications and we proceed with an extensive comparison with existing techniques. Thesoftware for performing the calculations is freely available at https://github.com/srnas/barnaba. II. METHODS
This Section is structured as follows: first, we introduce a simple representation for RNA three-dimensional struc-tures that has an intuitive interpretation in terms of base-stacking and base-pairing interactions. We then use thismolecular description to define i) a scoring function for RNA structure prediction ( E SCORE) and ii) a metric forcalculating deviations between RNA three dimensional structures ( E RMSD). p(x,y) 10 -4 ρ (Å) 468 2 θ=0 ° ° ° ρ (Å) 468 2 θ=0 ° ° ° H o o g s t e e n E d g e W a t s o n - C r i c k E d g e S u g a r E d g e Pairing zone ( r < √ ≤ ~ Stacking zone ( r < √ > ~ FIG. 2: Empirical density distributions of neighboring nucleobases (˜ r < √ . θ − ρ plane. Densities are computed with a uniform binning in Cartesiancoordinates and using a Gaussian kernel density estimation with bandwidth 0.25 ˚A. Hoogsteen, Watson-Crick, and Sugar edgesare indicated. The 6-membered and 5-membered (for purine only) rings are sketched in red. Molecular representation
We construct a local coordinate system in the center of the 6-membered rings, as shown inFig. 1a. Following this definition, the relative position and orientation between two nucleobases is described by a vector r , that is conveniently expressed in cylindrical coordinates ρ , θ , and z (Fig. 1b). Note that r is invariant for rotationsaround the axis connecting the 6-membered rings. We highlight that this definition is similar to the local referentialsintroduced by Gendron and Major [25]. The use of a nucleotide-independent centroid makes it straightforward tocompare and combine collection of position vectors deriving from different combinations of nucleobases. This is ofparticular importance for constructing the knowledge-based scoring function (see below). The position vector r hasan intuitive interpretation in terms of base-stacking and base-pairing interactions. This aspect is illustrated in Fig. 1c,that shows the distribution of r vectors for all neighboring bases in the crystal structure of the Haloarcula marismortui large ribosomal subunit (PDB code 1S72) [2] projected on the ρ and z coordinates. In the figure, different colorscorrespond to different types of interactions detected by MC-annotate. Due to steric hindrance, no points are observedin a forbidden ellipsoidal region. Furthermore, almost all the base-stacking and base-pairing interactions ( ≈ . r = (cid:16) r x a , r y a , r z b (cid:17) = (cid:16) ρa cos θ, ρa sin θ, zb (cid:17) (1)with a = 5˚A and b = 3˚A, so that pairs of bases in the interaction shell are such that 1 < ˜ r < √ .
5. The majority ofbase-base contacts lying in this interaction shell are annotated either as Watson-Crick/non-Watson-Crick or as basestacking, as detailed in Table I. Within this region we distinguish a pairing zone and a stacking zone , according to thetype of featured interactions. The tri-modal histogram in Fig. 1c shows that these two zones can be defined withoutambiguity considering pairs such that the projection of r along the z axis is larger (stacking) or smaller (pairing) than2 ˚A.It is well known that the strength and nature of pairing and stacking interactions depend on the base-base distance,on the angle θ as well as on other angular parameters (e.g., twist, roll, tilt) in a non-trivial manner [24, 33]. Suchdependence can be observed in Fig. 2, where the points belonging to the pairing and stacking zone of Fig. 1c areprojected on two separate ρ - θ planes. These distributions give an average picture containing contributions fromdifferent base pair types (purine-purine, purine-pyrimidine and pyrimidine-pyrimidine) and with weights dictated bythe employed dataset. Nevertheless, the observations below hold also when considering the 16 possible combinationsof base pairs individually and different datasets (see Supporting Data (SD) Figs. 1-3). In the pairing zone (Fig.2, left panel) we first observe a dominant peak centered around ( ρ =5 . θ =60 ◦ ), corresponding to the positionof canonical Watson-Crick base pairs as well as wobble (GU) base pairs. The two other peaks correspond insteadto base pairs interacting through the Hoogsteen or sugar edge [13]. One can also appreciate the absence of basesin the region occupied by the sugar (190 ◦ < θ < ◦ ). The probability distribution in the stacking zone (Fig. 2,right panel) shows a broad peak in the proximity of the origin and extending up to ρ ≈ TABLE I:
Number of base-base interactions detected in the crystal structure of the
H. marismortui largeribosomal subunit.
Interaction Type Occurrences a ˜ r < √ . b Stacking 2328 2318Watson-Crick 723 723Non-Watson-Crick (base-base) 399 382Base-sugar/phosphate interactions 781 398Not annotated — 1214 a Number of interactions detected by MC-Annotate [25]. b Base pair (j,k) is counted when both ˜ r jk and ˜ r kj < √ . to the typical radius of the 6-membered ring ( ≈ . et al. [34]. This feature is more evident inpyrimidine-pyrimidine and purine-purine pairs, for which high overlap is the exception rather than the rule (see SDFig. 3), whereas overlap is systematically observed in pyrimidine-purine pairs. The fact that bases in the stackingzone are very often “imbricated,” similarly to roof tiles, rather than literally stacked one on top of the other, doesnot imply that they are not interacting. Indeed, base-base interaction is not limited to π - π stacking but also includeselectrostatic effects, London dispersion attraction, short range repulsion as well as backbone-induced effects [35]. Scoring Function
The empirical distribution of all r vectors observed in experimental RNA structures displaysvery specific features, as described above. We use these geometrical propensities to construct a knowledge basedscoring function for RNA structure prediction. More precisely, we define a function of the atomic coordinates (in thiscase of the r vectors in a molecule) that quantifies the compatibility of a given RNA 3D conformation with respectto the expected distribution observed in native RNA structures. This scoring function, called E SCORE, is defined asa weighted sum over all pairs of bases in a molecule: E SCORE = (cid:88) j,k p ( r jk ) (2)Notice that both permutations of the j, k indexes should be included in the sum since the vectors r jk and r kj providetwo partially independent pieces of information. The weights p ( r ) are given by the empirical probability distributionof nucleobases in the crystal structure of the H. marismortui large ribosomal subunit. With this definition, a lowerweight is assigned to structures with pairs of bases in relative positions not compatible with the training set, includingthose with steric clashes. The probability distribution p ( r ) is calculated considering only pairs of bases within theinteraction shell (˜ r < √ .
5) and using a Gaussian kernel density estimation with bandwidth = 0 .
25 ˚A. Note that p ( r )does not depend on the identity of the nucleotides, and the RNA molecule is therefore treated as a homopolymer.The sum of the probabilities is used instead of their product, in order to decrease the impact of low-count regionson the scoring function. This procedure has been first introduced in the field of hidden Markov modeling for speechrecognition [36] and has a formal justification known as “summation hack” [37]. A non-redundant set of high resolutionstructures [16] was also employed as a training dataset for the E SCORE, producing similar results.
Structural deviation
The collection of the scaled vectors ˜ r (see Eq. 1) in a RNA molecule provides informationabout the relative base arrangement. One could thus define a metric for measuring the distance between two RNAstructures, α and β , as d = (cid:115) N (cid:88) j,k | ˜ r αjk − ˜ r βjk | (3)Similarly to Eq. 2, also here both permutations of the j, k indexes should be included in the sum. The d quantity ishighly non-local, because large differences in distant pairs give large contributions to the sum. To remove this effect itis convenient to introduce a cutoff distance ˜ r cutoff in the same spirit as we did for the scoring function discussed above.However, we notice that a cutoff would introduce discontinuities in the metric. A possible solution is to introduce afunction that maps the vectors ˜ r to new vectors G (˜ r ) with the following properties:1. | G (˜ r α ) − G (˜ r β ) | ≈ | ˜ r α − ˜ r β | if ˜ r α , ˜ r β (cid:28) ˜ r cutoff .2. | G (˜ r α ) − G (˜ r β ) | = 0 if ˜ r α , ˜ r β ≥ ˜ r cutoff .3. G (˜ r ) is a continuous function.There is no requirement here on the dimensionality of the G vector. The above properties are satisfied by the4-dimensional vectorial function G (˜ r ) = sin ( γ ˜ r ) ˜ r x ˜ r sin ( γ ˜ r ) ˜ r y ˜ r sin ( γ ˜ r ) ˜ r z ˜ r γ ˜ r ) × Θ(˜ r cutoff − ˜ r ) γ (4)where γ = π/ ˜ r cutoff and Θ is the Heaviside step function. We note that the dependence of G on the direction of ˜ r vanishes as ˜ r approaches the cutoff value ˜ r cutoff . We thus set ˜ r cutoff = 2 .
4, so that for the typical distances betweenstacked and paired bases (˜ r < √ .
5, see Fig. 1c) the contribution of the ˜ r directionality is significant. In-depthdiscussion on Eq. 4 can be found in SD Text. 4. Having defined the mapping function G (˜ r ), the E RMSD distancereads E RMSD = (cid:115) N (cid:88) j,k | G (˜ r αjk ) − G (˜ r βjk ) | (5)Note that E RMSD is proportional to the Euclidean distance between G vectors, which are non-linear functions of theatomic coordinates. As such, E RMSD is positive semidefinite, symmetric, and satisfies triangular inequality. Duringthe developmental stage we tested a non-vectorial version of the E RMSD: instead of using the vectorial function G (˜ r ),we considered the scalar, continuous function G s (˜ r ) = (˜ r cutoff − ˜ r ) × Θ(˜ r cutoff − ˜ r ) (6)which is similar to a contact-map distance [38, 39]. This scalar version differs from E RMSD when considering structureswith 4 nucleotides or less, while the two quantities are highly correlated when analyzing larger structures (Figure SD5).
III. RESULTSA. Scoring Function for RNA Structure Prediction
We assess the quality of the E SCORE on its capability of recognizing the folded state among a set of decoys [40]. Weconsider here a total of 39 different decoy sets previously used to benchmark other knowledge-based scoring functions:20 decoy sets generated de novo using the FARNA algorithm [13] and 19 additional decoy sets from the work ofBernauer et al. [16] obtained by gradual deformation of the native structures. As a measure of performance weuse the normalized rank [41], defined as the percentage of decoys scoring better than the native structure. In orderto simulate a realistic structure prediction experiment, we additionally report the RMSD from native of the bestscoring structure in the decoy set. Because the RMSD has been reported to be a suboptimal indicator of structuralproximity for RNA, we also compute the interaction fidelity network (INF), which measures the overlap betweenannotations in two structures [27], as well as the E RMSD introduced here. Equivalent analyses are performed usingtwo state-of-the-art scoring functions, namely the FARFAR [15] and the all-atom knowledge-based scoring functionRASP [17]. Both FARFAR and RASP are defined at atomistic resolution. FARFAR combines the low resolutionFARNA potential with explicit terms for solvation and hydrogen bonding, while RASP is a statistical potential basedon pairwise interatomic distances. These two scoring functions are among the best available [6] and the only ones forwhich it was possible to perform automatic scoring of a large set of decoys.Figure 3 summarizes the results obtained on all the 39 decoy sets. For each method we report the number of decoysets below the value indicated on the horizontal axis. Thus, the better is a scoring function, the faster the curvereaches the maximum number 39. In general, whereas FARFAR performs well in ranking and RASP in finding thebest decoys, E SCORE performs well in both tasks (See also Table SD 6). This is a remarkable result consideringthat E SCORE employs a coarse-grained representation (one bead per nucleotide) and is not sequence dependent. Thescoring results on all RNA decoys, together with a short discussion of the cases for which E SCORE fails to identifycorrectly the native structure can be found in SD Figs. 7-8.
SCORE FARFAR RASP E Normalized rank
RMSD (Å) E RMSD N u m b e r o f d ec oy s e t s FIG. 3: Quantitative comparison of our scoring function E SCORE with FARFAR [15] and RASP [17] on 39 different decoysets. In the upper left panel, number of decoy sets with normalized rank lower than the value on the horizontal axis. Forexample, the normalized rank is 0 . E SCORE, 33/39 using FARFAR, and 31/39using RASP. The faster the curve reaches 39, the better the performance of the scoring function. In the other panels, equivalentplots for RMSD, INF [27], and E RMSD of the best scoring decoy to the native structure, as labeled. See Table SD6 for acomplete list of results.
B. RNA Structural Deviation
The definition of an accurate measure of structural deviation for comparing RNA three-dimensional structures is arelevant problem, as the standard RMSD approach does not report on differences of base-stacking and base-pairingpatterns, that are the fundamental, key features of RNA molecules [25]. As a striking example, the RMSD betweenan ideal A-form double helix
GGGGCCCC and the mismatched structure
GGGGCCCC is 1.9˚A, meaning that the threshold of 4˚Athat is sometimes employed to consider two structures as similar [8] may not be sufficiently strict. A similar issuearises in kinetic modeling of proteins, where the assumption that geometrically close configurations (in terms of theRMSD distance) are also kinetically close has been shown to be problematic in discretizing the register shift dynamicsof β strands [32].In order to elucidate the difference between the standard RMSD and the E RMSD, we compute these two quantitieson a series of snapshots taken from an atomistic molecular dynamics simulation of a short stem-loop. These structureshave been obtained by repeatedly melting and folding (see SD Text 9), and thus include near-native structures,partially folded, misfolded as well as extended conformations. Figure 4 shows that structures with similar RMSD donot necessarily present the same base-base interaction network of the native state: in fact, the secondary structurecan be substantially different. Conversely, the E RMSD discriminates structures with near-native base-base contacts( E RMSD < .
8) from structures with non-native base-base interactions ( E RMSD > E RMSD correspond to high INF values, except for the region around E RMSD= 1 .
5, RMSD= 2 .
7. In this regionwe observe structures with formed stem and distorted loop (high INF) but also structures with formed loop andincorrect stem (low INF). While Fig. 4 serves as an illustrative example, the validity of the E RMSD is more rigorouslyassessed by performing two tests that probe the accuracy of the method in the analysis of structural and dynamicalproperties of RNA molecules. More precisely, in the following two Subsections we demonstrate i) that the E RMSDis a good measure of kinetic proximity between configurations, performing considerably better compared to standardstructural distances and ii) that the E RMSD is highly specific in recognizing known structural motifs within a largeset of crystallographic structures.
Structural and Kinetic Proximity
The correspondence between a structural deviation (RMSD, E RMSD, etc.) andthe kinetic distance (defined as the time needed to evolve from one configuration to the other), can be quantitativelyassessed by computing the coefficient of variation (CV).The CV of a structural deviation d is defined as the ratio of the standard deviation to the mean calculated onstructures separated by a time lag τ CV d ( τ ) = σ d ( τ ) (cid:104) d ( τ ) (cid:105) (7)To a large CV corresponds a large standard deviation, and therefore a high uncertainty between geometric and kineticdistance [43]. Note that the standard deviation is normalized with the mean, and the CV is thus scale-invariant.In the approximation that the deviations at fixed time lag are normally distributed, this analysis is equivalent tothe histogram overlap discussed in Ref. [39]. The connection between kinetic and structural distance is inevitablylimited by the fact that the typical time necessary for a transition is in general different from the time necessary forthe reversed transition, in contrast with the symmetric definition of structural deviations. Nevertheless, this analysisprovides insightful and intuitive information for the purpose of qualitatively comparing different metrics.In Fig. 5 we show the CV as a function of time separation in molecular dynamics simulation of the add riboswitch[44, 45]. The molecular dynamics simulation was initialized in the native state and remained stable throughoutthe 100ns run, while experiencing non trivial dynamics such as base-pairing breakage/formation and double-helixbreathing. The CV for E RMSD is compared with RMSD and with dRMSD. Additionally, the scalar variant of E RMSD, based on Eq. 6, is also presented.We first observe that for short time separations the correspondence between geometric and kinetic distance is high(low CV) in all cases, and it diminishes at large time lag. The CV for the E RMSD is consistently lower compared toboth RMSD and dRMSD, suggesting this measure to better reflect the temporal separation between configurations.We also note that, in this test, the scalar and vectorial version of the E RMSD behave very similarly.
Search of Structural Motifs
RNA molecules display a great variety of base-base interaction patterns (RNA motifs) R M S D ( Å ) E RMSD G CCUU C GGGC G CCUU C GGGC G C CUU C GGG C G CCU U C GGGC E INF 0.82RMSD 3.2 ÅRMSD 0.6 INF 0.96RMSD 0.8 Å E RMSD 0.4 INF 0.11RMSD 3.2 Å E RMSD 1.8 INF 0.44RMSD 2.2 Å E RMSD 1.3
FIG. 4: Heavy-atom RMSD from native versus E RMSD of snapshots taken from an atomistic molecular dynamics simulation.The difference in the secondary structure between the native state and samples is calculated using the INF measure, whichranges from 1 (identical secondary structure) to zero (completely different). Very different secondary structures can be foundbelow 4 ˚A RMSD. Four selected structures with different values of RMSD, E RMSD and INF are shown, together with theirsecondary structure. Figures were generated with Varna [42]. that play a key role in RNA folding and that mediate important biological processes [46]. A structural deviation shouldtherefore accurately capture such RNA-specific features. This property can be assessed by searching for known RNAstructural motifs in the database of experimentally solved structures. Here, we use the E RMSD to search for 5 knowninternal loop motifs (K-Turn, C-loop, Sarcin Ricin, triple sheared, double sheared) and two hairpin loops (T-loop andGNRA) within the RNA structure atlas database [47] release 1.43, composed by 772 crystal structures. As a reference,we compared the results with “find RNA 3D” (FR3D) [29], the most extensively used and carefully tested method forRNA motif recognition. FR3D employs a reduced representation with one centroid per nucleobase, and computes thegeometrical deviation by combining fitting and orientation errors after optimal superposition. Reference and motifstructures were obtained from the RNA 3D motif atlas release 1.13. Except for the K-turn motif, we were able tocorrectly identify almost all FR3D matches, including motifs with bulged bases (Table II). By visual inspection, it canbe seen that most of the false positive (i.e., motifs that were found using the E RMSD and not by FR3D) are either bona fide matches or known motif variants, while the residual discrepancies are due to differences in the employedcutoff value [47]. We refer the reader to SD Text 10 for a detailed description of the search strategy and to SD Text11 for the full list of results.
IV. DISCUSSION
A fully atomistic description of an RNA molecule requires on average ≈
20 non-hydrogen atoms per nucleotide,and thus ≈
60 degrees of freedom per nucleotide. Biologically relevant RNA structures are composed of hundredsor even thousands of nucleotides that display an intricate network of interactions. The complexity of the problemis further increased when considering that many RNA molecules are highly dynamic entities, and multiple chainconfigurations are needed for describing their mechanism and function [3]. In order to provide a simpler representation,RNA structures are usually analyzed in terms of sugar pucker conformations, backbone dihedrals [20, 21] or pseudodihedrals [22]. Complementary approaches consider instead base-pair parameters (propeller, slide, roll, twist, etc.)and hydrogen bonding to classify base-pair interactions in terms of recurrent geometrical configurations observed inexperimentally solved RNA structures [23–25].A key result of the present work is that the main structural and dynamical properties of RNA can be describedin terms of the spatial arrangement of the nucleobases only. Since nucleobases are substantially rigid, this implies6 degrees of freedom per nucleotide. This observation is in full accordance with previous analysis of RNA crystalstructures [25, 48], where the position and orientation of nucleobases only were considered. This work complementsthe analysis by showing that nucleobase position and orientation are sufficient to correctly describe the RNA dynamicsas obtained from explicit solvent molecular dynamics simulations.As shown in Fig. 1, RNA base-base interactions mainly occur in a restricted neighborhood of the nucleobasethat is well approximated by an ellipsoidal shell. Since specific regions of the ellipsoidal shell surrounding nucleobasescorrespond to different types of pairing and stacking interactions, it is in principle possible to infer secondary structure C o e ff i c i e n t o f v a r i a ti on ( C V ) E RMSD E RMSD scalar
FIG. 5: Coefficient of variation calculated on a 100ns molecular dynamics simulation of the add riboswitch. The coefficientof variation for the E RMSD is compared with heavy-atoms RMSD, with the distance RMSD (dRMSD) and with the scalarvariant of the E RMSD. Standard deviations are calculated using a blocking procedure. information directly from the pairing and stacking plots of Fig. 2. This procedure is formally similar to the oneemployed in the analysis of protein structures, where the angular preferences of the backbone approximately correspondto different secondary structures [19]. Our approach depends on nucleobase positions only and is thus complementaryto that proposed in Ref. [22], where backbone pseudo dihedrals were used to classify RNA secondary structures.By considering only the relative spatial arrangement of nucleobases, we introduce a knowledge-based scoring functionfor RNA structure prediction ( E SCORE). In contrast with several studies underlining the importance of an atomisticdescription for an accurate scoring [15–17], our work shows that a minimalist description of nucleobase arrangementis sufficient to produce equivalent or better results. Two aspects are worth highlighting. First, while the backbonecan be ignored for scoring, one cannot avoid including the chain connectivity in the sampling algorithm used togenerate the decoys. Second, the outcome of the scoring benchmarks is strongly dependent on the employed decoysets. Some of the decoy sets generated using the FARNA algorithm are very challenging, as all scoring functions( E SCORE, FARFAR and RASP) are highly degenerate, and assign good scores to structures that are very far fromnative (see e.g. SD Fig. 6, decoy set 1A4D and Ref. [16]). These decoys can be therefore used to improve the currentmethodologies for de novo predictions of RNA structure.A second aspect intimately connected with the choice of molecular representation is the definition of a proper struc-tural proximity measure. Typically, structural deviations are used either to analyze and compare three-dimensionalstructures (e.g., RMSD, dRMSD, INF [27]) or to recognize and cluster RNA structural motifs [29–31]. The E RMSDintroduced here is suitable for both tasks. A distinctive feature of the E RMSD is the use of the ellipsoidal metric ˜ r ,that naturally stems from the empirical data and that takes into account both relative distance and orientation ofnucleobases. Interestingly, the intrinsic base anisotropy has been also discussed in the context of the electronic polar-izability of nucleobases [35] and anisotropic interactions have been used in coarse-grain models for proteins [49]. Wenote that the E RMSD is conceptually reminiscent of the INF measure [27], that also considers the differences in thenetwork of base-pairing and base stacking interactions. With respect to INF, however, the E RMSD presents importantadvantages: i) it does not depend on the annotation scheme employed and therefore on the quality of the structure,ii) it is a continuous function of the atomic coordinates, and iii) it is well defined even when the annotation is notinformative (e.g., for unfolded states, unstructured, or single-stranded RNAs). The last two items make the E RMSDparticularly well-suited for studying dynamical processes, such as RNA folding or binding events. The E RMSD alsoshares a number of similarities with the structural distance used in FR3D for recognizing three-dimensional motifs[29]. However, in the E RMSD calculation we introduce important simplifications: i) the definition of the local coor-dinate system is not nucleobase-specific and requires the knowledge of three atoms per nucleobase only (Fig. 1a), ii)the E RMSD calculation does not require least square fitting, and iii) the problem of combining the positional andorientational base-base distance is circumvented by introducing an ellipsoidal metric. The fact that the two methodsproduce similar results (Table 2) strongly suggests that for most practical uses the E RMSD is accurate enough tocapture the specific pattern of base-base interactions.In this study we introduce a minimalist representation of RNA three-dimensional structures that solely considers therelative orientation and position of nucleobases. Based on this representation, we construct a scoring function for RNAstructure prediction and we define a metric for calculating deviation between RNA three-dimensional structures. Inboth cases, the results are on par or better compared with state-of-the-art techniques. Taken together, the presentedresults suggest that this minimalist representation captures the main interactions that are relevant for describing RNAstructure and dynamics. As a final remark, we note that the methodology presented here can be readily applied toDNA molecules as well.
TABLE II: Structural search of known motifs within the RNA motif atlas. The table summarizes the number of found motifscompared with FR3D (http://rna.bgsu.edu/rna3dhub/motifs). See SD Text 11 for a full list.Motif Reference PDB Found motifs False positiveT-LOOP 3RG5 57/66 17GNRA 4A1B 217/219 63C-LOOP 3V2F 18 /22 7D-SHEAR 1SAQ 23/26 3T-SHEAR 3V2F 24/31 18K-TURN 3GX5 10/22 15SARCIN 1Q96 13/16 3 V. AVAILABILITY
Python scripts to compute E SCORE and E RMSD from PDB structures are available athttps://github.com/srnas/barnaba.
VI. SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online. Supplementary References [50–60].
VII. FUNDING
The research leading to these results has received funding from the European Research Council under the EuropeanUnion’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement n. 306662, S-RNA-S.
1. Conflict of interest statement.
None declared.
VIII. ACKNOWLEDGMENTS
We are grateful to Francesco Colizzi, Richard Andre Cunha, Eli Hershkovits, Alessandro Laio, Gian GaetanoTartaglia, Eric Westhof, and Craig Zirbel for carefully reading the manuscript and providing several enlightening sug-gestions. Massimiliano Bonomi, Thomas Hamelryck and Andrea Pagnani are also acknowledged for useful discussions. [1] Kruger, K., Grabowski, P. J., Zaug, A. J., Sands, J., Gottschling, D. E., and Cech, T. R. (1982) Self-splicing RNA:autoexcision and autocyclization of the ribosomal RNA intervening sequence of tetrahymena.
Cell , 147–157.[2] Klein, D., Moore, P., and Steitz, T. (2004) The roles of ribosomal proteins in the structure assembly, and evolution of thelarge ribosomal subunit. J. Mol. Biol. , 141–177.[3] Serganov, A. and Nudler, E. (2013) A decade of riboswitches.
Cell , 17–24.[4] Tinoco, I. and Bustamante, C. (1999) How RNA folds.
J. Mol. Biol. , 271–281.[5] Tinoco, I. (1996) Nucleic acid structures, energetics, and dynamics.
J. Phys. Chem. , 13311–13322.[6] Cruz, J. A., Blanchet, M.-F., Boniecki, M., Bujnicki, J. M., Chen, S.-J., Cao, S., Das, R., Ding, F., Dokholyan, N. V.,Flores, S. C., et al. (2012) RNA-puzzles: A CASP-like evaluation of RNA three-dimensional structure prediction.
RNA , 610–625.[7] Kuhrova, P., Banas, P., Best, R. B., Sponer, J., and Otyepka, M. (2013) Computer folding of RNA tetraloops? Are wethere yet? J. Chem. Theory Comput. , 2115–2125.[8] Chen, A. A. and Garcia, A. E. (2013) High-resolution reversible folding of hyperstable RNA tetraloops using moleculardynamics simulations. Proc. Natl. Acad. Sci. U.S.A. , 16820–16825.[9] Ding, F., Sharma, S., Chalasani, P., Demidov, V. V., Broude, N. E., and Dokholyan, N. V. (2008) Ab initio RNA foldingby discrete molecular dynamics: from structure prediction to folding mechanisms.
RNA , 1164–1173.[10] Denesyuk, N. A. and Thirumalai, D. (2013) Coarse-grained model for predicting RNA folding thermodynamics. J. Phys.Chem. B , 4901–4911.[11] Parisien, M. and Major, F. (2008) The MC-fold and MC-sym pipeline infers RNA structure from sequence data.
Nature , 51–55.[12] Cao, S. and Chen, S.-J. (2011) Physics-based de novo prediction of RNA 3D structures.
J. Phys. Chem. B , 4216–4226.[13] Das, R. and Baker, D. (2007) Automated de novo prediction of native-like RNA tertiary structures.
Proc. Natl. Acad. Sci.U.S.A. , 14664–14669.[14] Frellsen, J., Moltke, I., Thiim, M., Mardia, K. V., Ferkinghoff-Borg, J., and Hamelryck, T. (2009) A probabilistic modelof RNA conformational space.
PLoS Comput. Biol. , e1000406.[15] Das, R., Karanicolas, J., and Baker, D. (2010) Atomic accuracy in predicting and designing noncanonical RNA structure. Nat. Methods , 291–294.[16] Bernauer, J., Huang, X., Sim, A. Y., and Levitt, M. (2011) Fully differentiable coarse-grained and all-atom knowledge-basedpotentials for RNA structure evaluation. RNA , 1066–1075. [17] Capriotti, E., Norambuena, T., Marti-Renom, M. A., and Melo, F. (2011) All-atom knowledge-based potential for RNAstructure prediction and assessment. Bioinformatics , 1086–1093.[18] Tanaka, S. and Scheraga, H. A. (1976) Medium-and long-range interaction parameters between amino acids for predictingthree-dimensional structures of proteins. Macromolecules , 945–950.[19] Ramachandran, G., Ramakrishnan, C. T., and Sasisekharan, V. (1963) Stereochemistry of polypeptide chain configurations. J. Mol. Biol. , 95–99.[20] Murray, L. J., Arendall, W. B., Richardson, D. C., and Richardson, J. S. (2003) RNA backbone is rotameric. Proc. Natl.Acad. Sci. U.S.A. , 13904–13909.[21] Hershkovitz, E., Sapiro, G., Tannenbaum, A., and Williams, L. D. (2006) Statistical analysis of RNA backbone.
IEEE/ACMTrans. Comput. Biol. Bioinform. , 33.[22] Duarte, C. M. and Pyle, A. M. (1998) Stepping through an RNA structure: a novel approach to conformational analysis. J. Mol. Biol. , 1465–1478.[23] Saenger, W. (1984) Principles of nucleic acid structure., volume , Springer-Verlag New York, .[24] Leontis, N. B. and Westhof, E. (2001) Geometric nomenclature and classification of RNA base pairs. RNA , 499–512.[25] Gendron, P., Lemieux, S., and Major, F. (2001) Quantitative analysis of nucleic acid three-dimensional structures. J. Mol.Biol. , 919–936.[26] Kabsch, W. (1976) A solution for the best rotation to relate two sets of vectors.
Acta Crystallogr. A , 922–923.[27] Parisien, M., Cruz, J. A., Westhof, E., and Major, F. (2009) New metrics for comparing and assessing discrepancies betweenRNA 3D structures and models. RNA , 1875–1885.[28] Yang, H., Jossinet, F., Leontis, N., Chen, L., Westbrook, J., Berman, H., and Westhof, E. (2003) Tools for the automaticidentification and classification of RNA base pairs. Nucleic Acids Res. , 3450–3460.[29] Sarver, M., Zirbel, C. L., Stombaugh, J., Mokdad, A., and Leontis, N. B. (2008) FR3D: finding local and compositerecurrent structural motifs in RNA 3D structures. J. Math. Biol. , 215–252.[30] Apostolico, A., Ciriello, G., Guerra, C., Heitsch, C. E., Hsiao, C., and Williams, L. D. (2009) Finding 3D motifs inribosomal RNA structures. Nucleic Acids Res. , e29–e29.[31] Zhong, C., Tang, H., and Zhang, S. (2010) Rnamotifscan: automatic identification of RNA structural motifs using secondarystructural alignment. Nucleic Acids Res. , e176–e176.[32] Beauchamp, K. A., McGibbon, R., Lin, Y.-S., and Pande, V. S. (2012) Simple few-state models reveal hidden complexityin protein folding. Proc. Natl. Acad. Sci. U.S.A. , 17807–17813.[33] Sponer, J., Leszczynski, J., and Hobza, P. (2001) Electronic properties, hydrogen bonding, stacking, and cation binding ofDNA and RNA bases.
Biopolymers , 3–31.[34] Bugg, C., Thomas, J., Sundaralingam, M. t., and Rao, S. (1971) Stereochemistry of nucleic acids and their constituents.X. Solid-slate base-slacking patterns in nucleic acid constituents and polynucleotides. Biopolymers , 175–219.[35] Sponer, J., Riley, K. E., and Hobza, P. (2013) Nature and magnitude of aromatic stacking of nucleic acid bases. Phys.Chem. Chem. Phys. , 2595–2610.[36] Juang, B.-H., Hou, W., and Lee, C.-H. (1997) Minimum classification error rate methods for speech recognition. IEEETrans. Audio Speech Lang. Processing , 257–265.[37] Minka, T. The summation hack as an outlier model. (2003).[38] Bonomi, M., Branduardi, D., Gervasio, F. L., and Parrinello, M. (2008) The unfolded ensemble and folding mechanism ofthe C-terminal GB1 β -hairpin. J. Am. Chem. Soc. , 13938–13944.[39] Cossio, P., Laio, A., and Pietrucci, F. (2011) Which similarity measure is better for analyzing protein structures in amolecular dynamics trajectory?
Phys. Chem. Chem. Phys. , 10421–10425.[40] Simons, K. T., Kooperberg, C., Huang, E., and Baker, D. (1997) Assembly of protein tertiary structures from fragmentswith similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. , 209–225.[41] Cossio, P., Granata, D., Laio, A., Seno, F., and Trovato, A. (2012) A simple and efficient statistical potential for scoringensembles of protein structures.
Sci. Rep. , 351.[42] Darty, K., Denise, A., and Ponty, Y. (2009) VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics , 1974.[43] Zhou, T. and Caflisch, A. (2012) Distribution of reciprocal of interatomic distances: a fast structural metric. J. Chem.Theory Comput. , 2930–2937.[44] Serganov, A., Yuan, Y.-R., Pikovskaya, O., Polonskaia, A., Malinina, L., Phan, A. T., Hobartner, C., Micura, R., Breaker,R. R., and Patel, D. J. (2004) Structural basis for discriminative regulation of gene expression by adenine-and guanine-sensing mRNAs. Chemi. Biol. , 1729–1741.[45] Di Palma, F., Colizzi, F., and Bussi, G. (2013) Ligand-induced stabilization of the aptamer terminal helix in the addadenine riboswitch. RNA , 1517–1524.[46] Moore, P. B. (1999) Structural motifs in RNA. Annu. Rev. Biochem. , 287–300.[47] Petrov, A. I., Zirbel, C. L., and Leontis, N. B. (2013) Automated classification of RNA 3D motifs and the RNA 3D motifatlas. RNA , 1327–1340.[48] Lemieux, S. and Major, F. (2006) Automated extraction and classification of rna tertiary structure cyclic motifs. NucleicAcids Res. , 2340–2346.[49] Fogolari, F., Esposito, G., Viglino, P., and Cattarinussi, S. (1996) Modeling of polypeptide chains as C α chains, C α chainswith C β , and C α chains with ellipsoidal lateral chains. Biophys. J. , 1183–1197.[50] Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., Impey, R. W., and Klein, M. L. (1983) Comparison of simple potentialfunctions for simulating liquid water. J. Chem. Phys. , 926–935. [51] Darden, T., York, D., and Pedersen, L. (1993) Particle mesh ewald: An n log (n) method for ewald sums in large systems. J. Chem. Phys. , 10089–10092.[52] Hess, B., Bekker, H., Berendsen, H. J., and Fraaije, J. G. (1997) LINCS: a linear constraint solver for molecular simulations. J. Comput. Chem. , 1463–1472.[53] Berendsen, H. J., Postma, J. P. M., vanGunsteren, W. F., DiNola, A., and Haak, J. (1984) Molecular dynamics withcoupling to an external bath. J. Chem. Phys. , 3684–3690.[54] Ennifar, E., Nikulin, A., Tishchenko, S., Serganov, A., Nevskaya, N., Garber, M., Ehresmann, B., Ehresmann, C., Nikonov,S., and Dumas, P. (2000) The crystal structure of UUCG tetraloop. J. Mol. Biol. , 35–42.[55] Hornak, V., Abel, R., Okur, A., Strockbine, B., Roitberg, A., and Simmerling, C. (2006) Comparison of multiple Amberforce fields and development of improved protein backbone parameters.
Proteins , 712–725.[56] P´erez, A., March´an, I., Svozil, D., Sponer, J., Cheatham III, T. E., Laughton, C. A., and Orozco, M. (2007) Refinementof the AMBER force field for nucleic acids: Improving the description of α γ conformers. Biophys. J. , 3817–3829.[57] Bussi, G., Donadio, D., and Parrinello, M. (2007) Canonical sampling through velocity rescaling. J. Chem. Phys. ,014101.[58] Banas, P., Hollas, D., Zgarbov´a, M., Jurecka, P., Orozco, M., Cheatham III, T. E., Sponer, J., and Otyepka, M. (2010)Performance of molecular mechanics force fields for RNA simulations: stability of UUCG and GNRA hairpins.
J. Chem.Theory Comput. , 3836–3849.[59] Pronk, S., P´all, S., Schulz, R., Larsson, P., Bjelkmar, P., Apostolov, R., Shirts, M. R., Smith, J. C., Kasson, P. M.,van derSpoel, D., Hess, B., and Lindhal, E. (2013) Gromacs 4.5: a high-throughput and highly parallel open sourcemolecular simulation toolkit. Bioinformatics , 845–854.[60] Tribello, G. A., Bonomi, M., Branduardi, D., Camilloni, C., and Bussi, G. (2014) PLUMED 2: New feathers for an oldbird. Comput. Phys. Commun.185