[PDF] Comparing secondary structures of RNA and calculating the free energy of an interior loop using a novel method for calculating free energy

Abstract

The thesis consists of two projects. In the first project, we present a software that analyses RNA secondary structures and compares them. The goal of this software is to find the differences between two secondary structures (experimental or predicted) in order to improve or compare algorithms for predicting secondary structures. Then, a comparison between secondary structures predicted by the Vienna package to those found experimentally is presented and cases in which there exists a difference between the prediction and the experimental structure are identified. As the differences originate mainly from faces and hydrogen bonds that are not allowed by the Vienna package, it is suggested that prediction may be improved by integrating them into the software. In the second project we calculate the free energy of an interior loop using Monte-Carlo simulation. We first present a semi-coarse grained model for interior loops of RNA, and the energy model for the different interactions. We then introduce the Monte-Carlo simulations and the method of Parallel Tempering which enables good sampling of configuration space by simulating a system simultaneously at several temperatures. Next we present Thermodynamic Integration, which is a method for calculating free energy differences. Then, We introduce a method that calculates the free energy significantly faster since we need to use only one parameter, T . To implement this method, we had to reach a regime in which the partition functions of the two systems are equal, which isn't satisfied for systems that have different entropies in the high temperature limit, so a solution to this problem had to be found. Free energy values calculated for various interior loops are shown, and may, if verified or with a more realistic modelling, be integrated into the Vienna package and supply an alternative to the experiments done.

Full PDF

aa r X i v : . [ q - b i o . B M ] D ec Thesis for the degree Master of ScienceBy Asaf FarhiComparing secondary structures of RNA andcalculating the free energy of an interior loop using anovel method for calculating free energy diﬀerences

Advisor: Prof. Eytan Domany

April 2011 cknowledgements

Firstly, I want to thank my supervisor Prof. Eytan Domany. I would like to thank himfor helping with identifying the upcoming obstacles in the project with his intuition andfor introducing me the interesting and skillful people I had the honor to work with. Iappreciate him for being ﬂexible with the time frames as the project was highly demand-ing, and for investing time in reading the thesis and helping with his exceptional skill forexplaining things.I want to thank Dr. Michael Bon for introducing me the world of RNA (I used to callhim “Doctor of RNAs”) and sharing his vast knowledge in biology, chemistry, physics andcomputer science. Next, I want to thank Dr. Guy Hed for the fruitful collaboration andhis contribution in the Monte Carlo simulations and the physics.I also want to thank Prof. Nestor Caticha for helping with his vast experience inMonte Carlo simulations.I want to mention that I really enjoyed working at the Weizmann institute that pro-vided an excellent environment for scientiﬁc research. Speciﬁcally I want to thank toMichal Shoval, Perla Zalcberg, Rachel Goldman and Yossi Drier that are to be appreci-ated for their dedication. I want to thank my group for the social environment and myfamily for the support.There is a provisional patent pending that includes part of the content of this thesis. [email protected] bstract The thesis consists of two projects. In the ﬁrst project, we present a software that anal-yses RNA secondary structures and compares them. The goal of this software is to ﬁndthe diﬀerences between two secondary structures (experimental or predicted) in order toimprove or compare algorithms for predicting secondary structures. Then, a comparisonbetween secondary structures predicted by the Vienna package to those found experimen-tally is presented and cases in which there exists a diﬀerence between the prediction andthe experimental structure are identiﬁed. As the diﬀerences originate mainly from facesand hydrogen bonds that are not allowed by the Vienna package, it is suggested thatprediction may be improved by integrating them into the software.In the second project we calculate the free energy of an interior loop using Monte-Carlo simulation. We ﬁrst present a semi-coarse grained model for interior loops ofRNA, and the energy model for the diﬀerent interactions. We then introduce the Monte-Carlo simulations and the method of Parallel Tempering which enables good sampling ofconﬁguration space by simulating a system simultaneously at several temperatures. Nextwe present Thermodynamic Integration, which is a method for calculating free energydiﬀerences. In this method simulations done at several values of a parameter λ are usedto yield the free energy diﬀerence in the form of an integral over λ . Hence, if simulation ateach value of λ necessitates parallel tempering, one has to simulate the system at a set of λ and T values. We introduce a method that calculates the free energy signiﬁcantly fastersince we need to use only one parameter, T . To implement this method, we had to reach aregime in which the partition functions of the two systems are equal, which isn’t satisﬁedfor systems that have diﬀerent entropies in the high temperature limit, so a solution tothis problem had to be found. Free energy values calculated for various interior loopsare shown, and may, if veriﬁed or with a more realistic modelling, be integrated into theVienna package and supply an alternative to the experiments done. More applications ofthis method are also suggested. 3 ontents ∆ F by parallel tempering . . . . . . . . . . . . . . . 252.5.5 Passing the steric energy barriers by introducing a cutoﬀ energy . 262.6 The Software and the simulations . . . . . . . . . . . . . . . . . . . . . . 292.6.1 General description . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.6.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.6.3 Visualization of the simulation . . . . . . . . . . . . . . . . . . . . 302.6.4 Simulation details . . . . . . . . . . . . . . . . . . . . . . . . . . . 302.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.7.1 Consistency checks . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.7.2 Energetically favorable micro-states . . . . . . . . . . . . . . . . . 322.7.3 Acceptance rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352.7.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422.8.1 Improvement proposals . . . . . . . . . . . . . . . . . . . . . . . . 43 Structural analysis and comparison of RNA secondarystructures

In this section, the components of the secondary structures of RNA will be deﬁned. Then,a software that analyzes these components and compares secondary structures, will beintroduced. The main goal of the software is to compare two secondary structures and ﬁndthe diﬀerences between them in order to improve algorithms of prediction or to comparetwo algorithms for secondary structure prediction. The software can be further used todetect elements that are repeated in a certain group of RNAs and suggest a possiblefunctionality.

By convention, the beginning of the RNA molecule is marked by 5’ and the end is markedby 3’. The N nucleotides are referred to as vertices and the N-1 covalent bonds betweenconsecutive nucleotides are referred to as exterior edges. Base pairing via Hydrogen bondsis represented by a line segment called interior edge. The pairings can be between thefollowing pairs: G-C, A-U and G-U. The entire collection of edges and vertices is calleda graph. An admissible secondary structure is deﬁned as a graph whose interior edgesdon’t intersect or cross each other (such crossings are called psuedo-knots).A face of a graph is deﬁned to be a planar region, bounded on all sides by edges. Aface with a single interior edge is called hairpin loop (see example in Fig 3). Faces withtwo interior edges are classiﬁed into 3 groups. If the interior edges are separated by singleexterior edges on both sides, the face is called a stacking region. If they are separatedby a single exterior edge on one side, but by more on the other side, the face is calleda bulge loop. Otherwise the face is referred to as an interior loop. A face with three ormore interior edges is referred to as a multi-loop or bifurcation loop (see Fig. 1).Figure 1: Illustration of the diﬀerent faces7he word hairpin refers to a structure or substructure, whose faces are consecutiveregion of stacking regions, bulges and interior loops, ending with a hairpin loop [1].

The software’s main tasks are to re-generate the structure from the output that representsthe hydrogen bond, to analyse it and to compare it with another structure of similarsequence. The software is suited for the output of the Vienna package in which a matchedpair is represented by parentheses (example is presented in Figure 2), and can be easilyadopted to suit the output that presents the paired bases, which is also common. It wasprogrammed in c++ in Unix environment (Unix compiler), using Eclipse.

The software uses tree representation (example is presented in Figure 2) for the structures,which is used for secondary structures and assumes no psuedo-knots. The software dividesthe structure into substructures, which include a sequence of faces with 2 interior edgesand a hairpin loop or a multiloop as their last face. Each substructure may containsons, which are substructures, that are connected to it via a multiloop. In the case of amultiloop, it contains the ﬁrst pair of the multiloop, and the sons contain as their ﬁrstpair the other pairs of the multiloop.Figure 2: An example of a secondary structure ( a ) and its parenthesis ( b ) and treerepresentations ( c )( a )( b ) (((....))).((((((...))(((..))))))). ( c ) structure ւց structure hairpin structure ւ ց hairpin structure hairpin structure8n the process of identifying the tree representation the faces are detected, classiﬁedand stored in each sub-structure (the pairs are also stored in a diﬀerent list). It is hereto mention that the software also stores in each face the unbound nucleotides. A detaileddescription of the algorithm is given in Appendix 3.1. • The functionality of the software in generating structures was veriﬁed on ﬁve RNAoutputs.

In order to compare two structures of the same RNA, the faces were compared. Theywere divided into three groups: identical , meaning that they have the same structure,nucleotide sequences and nucleotide indexes.

Similar , meaning that they have the samestructure, nucleotide sequence and not the same nucleotide indexes. And diﬀerent , mean-ing that it doesn’t fall into the two categories mentioned (A detailed description of thealgorithm is given in Appendix 3.2). This division enables us, in case of diﬀerent struc-tures, to identify the list of faces that have diﬀerent free energy values. Thus, faces whichare suspected of having incorrect energy values may be focused on. The functionality ofthis comparison was checked on the structures of the RNAs: PDB_00030, PDB_00213,PDB_00394 and yielded the expected results.

The main uses of the structural analysis can be the following: • Comparison between structures of minimum free energy, found by diﬀerent algo-rithms, and thus trying to extract in which cases they yield diﬀerent results. Itcan be used to compare the mfe (minimum free energy) structures predicted by theVienna package and by other algorithms that have diﬀerent free energy parameters.When the structures are diﬀerent, the software can output the faces predicted byboth and the origin of the diﬀerence can be estimated. • Comparison between structures of minimum free energy of a certain algorithm andthe experimental results. One can assess which are the cases where diﬀerencesexist. Here again, the origin of diﬀerence can be identiﬁed and the parameters ofthe algorithm can be corrected accordingly. • Identiﬁcation of structural motives (found by analyzing the structures predicted bythe algorithms) that are common in a group of RNAs, which may lead to classiﬁca-tion and possible functionality. It can be also used to associate RNA to a group ofRNAs once its structural motives are known. An example for such structural simi-larity is the family of tRNA that includes a helix, which leads to a multiloop which isfollowed by three helices that end with an hairpin loop (typical secondary structureof tRNA is shown in Fig 3). A possible application, can be the PASRs (promoterassociated small RNAs) that are transcribed from the promoter’s regions. ThesePASRs bind at the promoter, and reduce the expression of the associated gene. Thesecondary structure of PASRs can be predicted using existing software. Then, using9his software, structural motives that are common can be detected and a structuralpossible classiﬁcation, and even a functional mechanism can be estimated [2].Figure 3: Typical secondary structure of tRNA • Identiﬁcation of certain sequences that are repeated in a group of RNAs, especiallyin unpaired nucleotides, which may provide clues about their function (can be doneusing experimental or predicted secondary structures). An example of repeatedsequences is the selenoprotein mRNA sequences. These mRNAs include the UGAcodon which has a dual function. It signals both the termination of protein synthe-sis and incorporation of the amino acid selenocysteine. Decoding of selenoproteintranslation requires the SECIS element, a stem-loop motif in the 3’-UTR of themRNA, carrying short or large apical loop. SECIS elements contain an Adenosinethat precedes the quartet of non–Watson-Crick base pairs, a UGA_GA motif in thequartet, and two adenosines in the apical loop or bulge [3][4].

The following RNA sequences: PDB 30,32,45,213,394,547,584,750,1236 were compared(experimental structures versus Vienna package results) to get initial results for the dif-ferences between the structures.It seems that two of the main sources of diﬀerence between the experimental structuresand the ones predicted by the algorithm were the following: • Bonds that appeared in the experimental structures and weren’t allowed by thealgorithms (Vienna package and others), like A-A U-C G-A G-G.10

Hairpin Loops, containing less than 4 exterior edges (3 in the tested sequences) wereincluded in the experimental structures and aren’t allowed by the algorithms.It was seen that in some cases (PDB 32,213,394,547,750,1236), the diﬀerences were local,meaning that there were identical faces in a substructure, then diﬀerent faces, and thenan identical pair which terminates the local region of diﬀerence.Looking in the literature about the pairs which were detected in the experimentalstructures checked, and weren’t allowed by the algorithms, it seems that these pairsalso show up in other cases in RNA structures. Such an example is G-A/A-G pairingfound in SECIS element in the 3’ untranslated region of Se-protein mRNA and in thefunctional site of the hammerhead ribozyme. The measured melting temperature valuesof RNA oligonucleotides having this pair showed an intermediate value, between that ofthe Watson-Crick base pairs and that of non-base pairs [5][6]. Although rare, A-A andG-G pairs also occur in some cases. It was seen that a single G-G pair has a stabilizingeﬀect and can stabilize an internal loop [7] and that a group of Adenine bases can haveinteractions with each other in a bulge [8]. Another study shows that the water-mediatedUC pair is a structurally autonomous building block of the RNA [9]. Thus it seems thatthe structure of the RNA may be aﬀected also by interactions that are less dominant.The tetraloops (Hairpin Loops, containing 3 exterior edges) that were seen in theexperimental data, seem to belong to the three types of tetraloops that are common inRNA: GNRA,UNCG and CUUG (R=purine, N=every nucleotide). Meaning that thealgorithms exclude these types of faces although if formed they may lead to a lower totalfree energy.Since the total number of diﬀerent faces may be large, it is possible to detect regionsof diﬀerence and output local diﬀerences. Thus diﬀerences containing smaller number offaces can be detected, and tendencies may be pointed out. It can be done by detectinga sequence of identical faces followed by a sequence of diﬀerent faces that ends with anidentical pair. Similar faces that belong to that region can also be detected, and thenumber of diﬀerent faces can be further reduced.The local diﬀerences that were found are the following: (it should be noted here thatthe solid and dashed lines represent covalent and hydrogen bonds respectively and thatthe arrows point to the experimental data)11igure 4: local diﬀerences for PDB 32Experimental structure Vienna package prediction(a)(b)(c) Hairpin loop 12igure 5: local diﬀerences for PDB 213Experimental structure Vienna package prediction(a)(b)(c) Hairpin loop(d) Hairpin loop(e) 13igure 6: local diﬀerences for PDB 394Experimental structure Vienna package prediction(a) Hairpin loop(b) Multiloop

It can be concluded that the energy parameters of the algorithms give good predictionsfor the structures, and in most cases the general structure is well predicted. However,if we assume that the total free energy of a structure is the sum of the free energies ofthe faces, there are still faces whose free energy parameters aren’t estimated well enough.Local diﬀerences may lead to more speciﬁc cases of mismatch, and thus help us focus onthe faces that still have to be investigated.Certain diﬀerences between the structure of PDB 394 found experimentally and theone predicted by the Vienna package may suggest that the free energy of a multiloop mayvary signiﬁcantly by its extension by 2 nucleotides. (the multiloop was extended ratherthan form a G-C hydrogen bond as suggested by the Vienna package). This may be dueto the entropic loss caused by its reduction by two nucleotides.Since the diﬀerences mainly originate from faces and hydrogen bonds that are notallowed by the Vienna package it is suggested that prediction may be improved by in-tegrating parameters of these faces into the software and predicting unallowed hydrogenbonds according to the relevant face respectively. This may lead to a more accurate anddetailed description of the pairs in the secondary structure prediction.14

Calculation of free energy of an interior loop

In the last decade, RNA has gained much attention, due the discovery that it fulﬁlls aconsiderable diversity of biological tasks, including enzymatic and regulatory functions.Since RNA structure is highly related to its functionality, its of a top priority to predictand analyse secondary and tertiary RNA structures.The most common software that predict secondary structure are the Vienna pack-age/Mfold . These software assume that the total free energy of a secondary structure isthe sum of the free energies of its faces (planar regions, bounded on all sides by covalentor hydrogen bonds between nucleotides) [1]. These free energy parameters are extractedfrom experiments in which initial concentrations of reactors that are RNA segments arechanged, and the melting temperature for the reaction is measured (read Appendix 3.3.1for more details). However, in most of these experiments the heat capacity was assumednot to depend on the temperature, so they are to be redone (read Appendix 3.3.2 formore details). Here, we try to check the feasibility of a numerical calculation of the freeenergy values of faces, using a Monte-Carlo simulation. In the case that such a calculationwill be feasible, it might provide an alternative to the experiments that are done, andsupply more speciﬁc information, since in the experiments the faces’ free energy valuesare derived by ﬁtting measured values of the larger secondary structures in which theyare included.This work is also relevant for calculation of thermodynamic properties of a completeRNA sequence and its tertiary free energy landscape, as it can be easily adopted to do so(assuming no pseudoknots). This is of importance due to the relatively small number oftheoretical works on RNAs as compared to the vast number of studies on proteins. Thesetheoretical works on RNAs are composed of mostly MD simulations in which a deﬁnitionof the coordinate of interest is introduced in advance, in order to investigate the freeenergy landscape. There are relatively fewer Monte-Carlo simulations that involve thecalculation of thermodynamic properties.In this work we present how the information gathered in a parallel tempering procedure(details are in 2.4) can be used to calculate the free energy diﬀerences. In this calculationwe had to reach equality of partition functions and since we didn’t want to be restrictedto systems that have the same high temperature description, we had to go into imaginaryregimes in which the steric constraints were relaxed. This regime was reached due tothe use of cutoﬀ energy which is also relevant to sampling problems that appear in theThermodynamic Integration method [10][11][12].

The goals of this project are the following: • Model the RNA and the diﬀerent interactions between the sites. • Perform Monte-Carlo simulation in which the space of states will be sampled cor-rectly, and that will converge and be consistent.15

Writing a software that will run in reasonable time frames. • Calculation of free energy of internal loops that can, in a more elaborate model,generate the free energy values used in the Vienna package.

In this project, we modeled and simulated an interior loop composed of 6 nucleotides- 3 in each strand. In this type of interior loop the ﬁrst and last pair are bound incanonical or wobble base pairing, and the pair in the middle can have in the generalcase all combinations of bases, not belonging to the canonical or wobble base pairings.Speciﬁcally in this project we chose to simulate the following interior loop:Figure 7: Simulated Interior loop

G CX AG C where X stands for all types of nucleotides (A,C,G,U).

The model is partly coarse grained and partly in full atomistic detail. It is coarse-grained in its backbone description and in full atomistic detail in the description of thenucleobases. This choice of modelling of the structure is due to the important role thebases play in the interactions, as the base pairing and the stacking energies are thedominant ones.The model reduces the complexity of the backbone to two point interaction sites, onefor the phosphate and one for the starting point of the ribose. The phosphate interactionsite was chosen to reside where the P atom is located (all relevant atoms are presented inﬁg 8), and the (beginning of the) ribose interaction site was chosen to reside where the C4’atom is located. Reasoning for this choice is the fact that the torsion angles about the twoC-O bonds, are preferably in the trans rotational isomeric state [13], which means thatthe P-C4’ and C4’-P bond lengths don’t vary much. It should be noted that accordingto the location of the C4’ and P atoms, the building of the RNA backbone in atomicdetail [14] and as a result the ribose can be achieved (The torsion angle δ determinesthe conﬁguration of the ribose [15]).The bases were assumed to be rigid planar objectsas implied by [16]. The modelling of the bases included all the atoms (see Fig. 9 fordetails), what enables them to form Watson-Crick, and Hoogsteen base pairing [17]. Theatoms’ coordinates were calculated in the system of coordinates whose two base vectorsreside in the bases’ plane (based on the data form [18]). The bases were assumed to beconnected to the C4’ atom from the N1 atom in purines and the N9 in pyrimidines. The16istances between the two backbone sites, as well as the distance between the C4’ atomand the N1/N9 atom were chosen to be ﬁxed. This was supported by an analysis of PDBﬁles performed in which they showed small standard deviations (it was also reported in[19] that the backbone bonds deﬁned have lengths of ∼ . ◦ A ). The distances betweenthe interaction sites were chosen to be the ones found in a speciﬁc pdb ﬁle (1AFX.pdb),for simplicity. It has to be mentioned here, that this modelling of the interior loopdoesn’t necessarily have less degrees of freedom than a full atomistic model. However,the treatment in terms of programming time is shorter.For an interior loop, there are two pairs of canonical/wobble hydrogen bonds. Thepairing is between the ﬁrst base in one strand and the last base in the second strand, andbetween the last base in one strand and the ﬁrst base in the second strand. In the model,it was assumed that the orientation of the paired bases is constant with respect to eachother. Regarding the other bases, they were assumed to have freedom of movement - theC4’-N bond was free to rotate and the base was free to rotate with respect to this bond.All the interaction sites in the backbone were also free to move with the restrictions thatthe ﬁrst pair of bases is constant in space (due to rotation and translation symmetry),and the last pair of bases is treated like a rigid object (assumed to have ﬁxed pairinggeometry) and free to move.Figure 8: Atomic numbering scheme and deﬁnition of torsion angles17igure 9: The four bases and their atomsA UG C The following energy terms were considered in the model:

One of the dominant terms is the stacking energy betweenthe bases, which originates from dispersive and repulsive interactions. The stacking energyis calculated between all the atoms in the bases and is represented by the 6-12 LennardJones potential: H stacking = X α,β α = β X i,j ǫ ij "(cid:18) r m ri ( α ) j ( β ) r i ( α ) j ( β ) (cid:19) − (cid:18) r m ri ( α ) j ( β ) r i ( α ) j ( β ) (cid:19) (1)where r i ( α ) j ( β ) and r m ri ( α ) j ( β ) are the distance between atoms and the distance atwhich the energy is minimum respectively ( i ( α ) is atom i of base α ).It can be easily seen that the minimum energy is ǫ and that we get inﬁnite energy at r → and zero energy at r → ∞ . 18he distances of minimum energy were chosen from [20] and the energy values werechosen from [21] in order to be consistent with behavior of bases as reported in [22].The dominant energy terms have distance of minimum energy in the range ◦ A

Another dominating term is the base-pairing energythat originate mainly from the hydrogen bonds.The potential is much more localized and was chosen to be: H bp = X i,j h bp ij T ij (2)where h bp ij = ε ij "(cid:18) r m ij r ij (cid:19) + tgh (cid:18) (cid:18) r ij − r m ij (cid:19)(cid:19) − − . (cid:18) r m ij r ij (cid:19) (3)and T ij = (cid:26) cos θ ij h bp < h bp ≥ (4) θ ij is the angle between the acceptor atom, the polar hydrogen and the donor atom(see Fig 10), and i, j are the indexes of the hydrogen and the acceptor respectively.The term tgh (cid:16) (cid:16) r ij − r mij (cid:17)(cid:17) − was introduced to achieve locality in the potential(keeping the energy zero at inﬁnity), and the term cos θ was introduced to achieve direc-tionality.Figure 10: The angle between the acceptor atom, the polar hydrogen and the donor atomThe distance of min. energy is r m ij = 1 . ◦ A and the min. energies values ε ij are chosenaccording to the acceptor and the electronegativity of the donor. Explicitly, ε ij is the19roduct of the diﬀerence in electronegativity between the hydrogen atom and the donor,and a factor that accounts for the eﬀect of the number of the lone pairs of the acceptor -1 for N and 2 for O (electrons in the outer shell can participate in the interaction).Another proposed potential is with: T ij = (cid:26) cos θ ij h bp < h bp ≥ (5)and with the factor that accounts for the eﬀect of the number of the lone pairs ofthe acceptor - 1 for N and 1.15 for O. This choice is pretty much in accordance with theenergy parameters in [23]. Coulombic interactions are taken into account using theDebye-Huckel approximation, which is valid for the low-salt physiological conditions.The Coulombic interactions in the model are between the oxygens next to the phos-phates, and are calculated between phosphates of consecutive nucleotides (as in [24]). V qq = N X i

Our goal is to ﬁnd the free energy diﬀerence between two systems: two interior loops,which diﬀer in a base. Speciﬁcally we will calculate the free energy diﬀerence between theinterior loops ˜ G , whose second base is G, and ˜ C , with the base C, as shown in Fig 12.Figure 12: ˜ C and ˜ G G C

C AG C

System ˜ C G CG AG C

System ˜ G We denote this diﬀerence by △ F ˜ C → ˜ G ( β ) = F ˜ G ( β ) − F ˜ C ( β ) .The state of the bases G, C (the coordinates of all their atoms) for a given location ofthe N1/N9 atom of the base (see Fig 9 for details) is determined by three angles, sincethe bases are assumed to be rigid planar objects. Therefore the same degrees of freedomare associated with these diﬀerent bases, and since the two systems ˜ G, ˜ C diﬀer just in thebases, the two systems have the same degrees of freedom.The diﬀerence of free energy between any two systems that have the same degrees offreedom can, in principle, be calculated by the Thermodynamic Integration (ThI) method,as described in section 2.5.2. ThI requires running simulations of diﬀerent instances(several values of a parameter λ i , i = 1 , , ...m ) of a hybrid ˜ G − ˜ C system.23hese systems have a rugged energy landscape (with many local minima), which meansthat the decorrelation time in a “naive” Monte-Carlo simulation are exponentially long.One of the widely used methods to alleviate the problem of equilibrating systems withsuch landscapes is parallel tempering (PT), which we describe below in Section 2.5.3. PTinvolves simulating the system of interest, in parallel, at several (inverse) temperatures β i , i = 1 , , ...n .The natural way to calculate the free energy diﬀerence for our two systems, △ F ˜ C → ˜ G ( β ) is, therefore, by ThI, equilibrating m systems, one for each value of λ i . For each suchequilibration PT is used, at n temperatures, hence all together nm systems are simulated.In section 2.5.4 we describe a method, Temperature Integration (TeI), which allows us tocalculate the free energy diﬀerence much more eﬃciently than the use of ThI with paralleltempering. Thermodynamic Integration (ThI) is one of the most commonly used methods for calcu-lating the free energy diﬀerence between two systems. It is described in detail in Appendix3.5. This method is used to calculate the free energy diﬀerence between two systems withthe same degrees of freedom [12, 10, 11], by varying a parameter λ that interpolates be-tween the two compared systems. We will use the systems ˜ C and ˜ G described above todemonstrate application of this general method in a speciﬁc context.Denote the Hamiltonians of the two systems by H ˜ C ( θ ) and H ˜ G ( θ ) , where θ denotesthe coordinates of the system. Noting that the two systems have the same coordinatespace, we deﬁne a λ - weighted hybrid system, characterized by the Hamiltonian H ( λ ) : H ( λ, θ ) = λH ˜ G ( θ ) + (1 − λ ) H ˜ C ( θ ) (16)As shown in Appendix 3.5, the free energy diﬀerence is given by △ F ˜ C → ˜ G ( β ) = Z (cid:2) h H ˜ G ( θ ) i λ − h H ˜ C ( θ ) i λ (cid:3) dλ (17)where h X i λ denotes the equilibrium average of X in the ensemble characterized by H ( λ ) .The expression for △ F ˜ C → ˜ G ( β ) , written explicitly, takes the form: △ F ˜ C → ˜ G ( β ) = Z (Z [ H ˜ G ( θ ) − H ˜ C ( θ )] e − β [ λH ˜ G ( θ )+(1 − λ ) H ˜ C ( θ ) ] d θ Z ( λ ) ) dλ (18)with Z ( λ ) = Z e − β [ λH ˜ G ( θ )+(1 − λ ) H ˜ C ( θ ) ] d θ (19)The integration is performed numerically, with the integrand evaluated at each one ofa set of values of λ by Monte Carlo simulations. As implied by (18), the two systems arerequired to have the same degrees of freedom.24 .5.3 Thermodynamic integration with parallel tempering The systems ˜ C and ˜ G , and the λ -weighted system have rugged energy landscapes withmany local minima. As the decorrelation times grow exponentially with △ E/kT , where △ E is the energy barrier between nearby valleys, equilibration times in these systemscan be rather long. In order to overcome this problem, and obtain the thermodynamicaverages h H ( θ ) i λ , one can use the Parallel Tempering procedure [26, 27]. However, imple-menting both thermodynamic integration and parallel tempering yields an unnecessaryoverhead in running time, as we will demonstrate.In order to implement thermodynamic integration we ﬁrst choose a set of values λ i , i =1 , ...m , that will enable us to have a good sampling of the function h H ( θ ) i λ for theintegration in eq. (17). Thus, m is related to the desired precision of the integration.In principle parallel tempering should be performed for each λ i , simulating each of the m λ i weighted systems over a set of temperatures, given by β j , j = 1 , ...n . Finally, usingthe calculated values of h H ( θ ) i λ i at a temperature of interest β , we approximate theintegral of eq. (17) by a sum over m terms, to get the free energy diﬀerence △ F ˜ C → ˜ G ( β ) .The procedure involves running Monte-Carlo simulations over m λ -values, and n β -values, that is, over a grid of m × n instances of the hybrid system. An illustration of thisgrid is presented in ﬁgure 13.Figure 13: Illustration of the grid of values over which the Monte-Carlo simulations areperformed. The lowest values of β correspond to some very high ﬁnite temperature. B e t a [ M o l e / k C a l ] Lambda ∆ F by parallel tempering We present now a method which uses only simulations obtained in the process of paralleltempering (skipping the need for simulations at a set of λ i values), performed for the twosystems ˜ C and ˜ G , to obtain the free energy diﬀerence △ F ˜ C → ˜ G ( β ) . This method can beapplied to any two systems that have the same degrees of freedom θ .As β → , the limiting value of the partition function of a system yields the phasespace volume. In particular, if systems ˜ C and ˜ G , which have the same coordinate space25 θ } , have the same β → limit, we have Z ˜ G ( β →

0) = Z ˜ C ( β →

0) = Z d θ . (20)In this case we can use the following identity to obtain, for a ﬁnite β , the diﬀerenceof the free energies: ln Z ( β ) − ln Z ( β →

0) = Z β d ln Zdβ dβ = − Z β h H i dβ . (21)Using equations (20) and (21) we obtain △ F ˜ C → ˜ G ( β ) = 1 β [ln Z ˜ C ( β ) − ln Z ˜ G ( β )]= 1 β  β Z h H ˜ G i dβ − β Z h H ˜ C i dβ  . (22)For each of the systems ˜ C and ˜ G we estimate the integrals on the right hand sideby parallel tempering, sampling the system at a series of values β , . . . , β n . We choosevalues such that β − n is much larger than the internal energy of the system at β , so Z ( β n ) ∼ = Z ( β → .The condition of eq. (20) poses a problem for the particular systems ˜ C and ˜ G studiedhere. The steric excluded volume interactions are not bound in our model, and hence,the available phase space volumes for the two systems are not equal at any temperature,and equation (20) does not hold. In order to satisfy the condition stated above we hadto set a cutoﬀ over the steric interactions, E cutoﬀ . In the next section we show that ourresults do not depend on the choice of E cutoﬀ and β n , as long as E cutoﬀ is much largerthan any typical interaction energy in the system at β , and β − n ≫ E cutoﬀ . We now tackle the problem of sampling at temperatures which are higher than all energyterms, by using a cutoﬀ energy in the steric interaction potential. We denote the cutoﬀenergy and the distance from which the energy will be set to the cutoﬀ energy by E cutoﬀ and r cutoﬀ respectively.The modiﬁed steric potential can be written as follows: (cid:0) σr (cid:1) → H steric ( r ) = (cid:26) (cid:0) σr (cid:1) r > r cutoﬀ E cutoﬀ r ≤ r cutoﬀ

26n the following ﬁgure an example of such a steric potential is presented:Figure 14: Plot of an example of a steric interaction potential that includes a cutoﬀ energy E n e r g y The proposed calculation of the free energy diﬀerence between the two systems atthe temperature of interest β is legitimate only if our choice of the cutoﬀ energy has anegligible eﬀect on the partition function of each of the two systems at β . In addition, thehighest temperature used, β , must be such that the equality of the partition functionsof the two systems is satisﬁed to a good accuracy.We denote the Hamiltonian with the cutoﬀ energy by H ′ , and write the requirementsstated above explicitly as follows: lnZ ˜ G ( β , H ) ≃ lnZ ˜ G ( β , H ′ ) (23) lnZ ˜ C ( β , H ) ≃ lnZ ˜ C ( β , H ′ ) (24) lnZ ˜ G ( β , H ′ ) ≃ lnZ ˜ C ( β , H ′ ) (25)In order for the cutoﬀ to have a negligible eﬀect on the partition functions at thetemperature of interest it has to be set to a value that satisﬁes E cut oﬀ ≫ kT .As for β , if the cutoﬀ energy satisﬁes E cut oﬀ ≪ kT , the systems will have almostequal probability to be in all the regions of their phase space, including ones in which themolecules entered the “volume” of one another. Thus, the partition functions of the twosystems will be almost equal.Hence if these requirements are satisﬁed one can write: lnZ ˜ C ( β , H ) − lnZ ˜ G ( β , H ) ≃ (26) lnZ ˜ C ( β , H ′ ) − lnZ ˜ G ( β , H ′ ) ≃ (27) lnZ ˜ G ( β , H ′ ) − lnZ ˜ G ( β , H ′ ) − [ lnZ ˜ C ( β , H ′ ) − lnZ ˜ C ( β , H ′ )] (28)27sing the identity in eq (21), we can write: △ F C → G ( β , H ) = 1 β [ lnZ ˜ C ( β , H ) − lnZ ˜ G ( β , H )] ≃ β  β Z β h H ′ G i d ˜ β − β Z β h H ′ C i d ˜ β  (29)So the calculation of free energy will be negligibly aﬀected by the use of a cutoﬀ energyas long as we fulﬁll the relevant conditions.It has to be mentioned that this use of cutoﬀ energy is relevant also to ThI and similarmethods. In these methods if the systems have diﬀerent β → limit, at λ = 0 , due tothe steric energy terms (represented by (cid:0) σr (cid:1) ) micro-states may have inﬁnite energy. Asa result the internal energy will be inﬁnite and the sampling of the internal energy willbe infeasible.It can be seen that since the free energy diﬀerence between systems that have theHamiltonian with the cutoﬀ H ′ is almost equal to the one calculated between systemswith the Hamiltonian H △ F C → G ( β , H ) ≃ β [ lnZ ˜ C ( β , H ′ ) − lnZ ˜ G ( β , H ′ )] , ThI can be derived for systems with H ′ and yield almost the same result.In conclusion, with the use of cutoﬀ energy in ThI, that enables us to sample theintegrand in all cases, the free energy diﬀerence between any two systems that have thesame degrees of freedom can be calculated. • Cutoﬀ energies chosen for the simulation : E cutoff = 5 [ KCal/M ole ] , E cutoff =4 . KCal/M ole ] . These cutoﬀ energies satisfy the conditions mentioned above.Hence, the results of the calculations of the free energy for the two cutoﬀs shouldbe similar. 28 .6 The Software and the simulations The project was written in c/c++ in Unix environment under Eclipse. It was mostlywritten in object oriented c++, and at places where performance was critical it waswritten in c. The model was divided into classes that represent its components (Oligonu-cleotide,Nucleotide, Base etc.). In cases where the behavior of the class varies accordingto its type, derived classes were used (classes that inherit properties of a base class andimplement more properties according to their type). For example nucleotides were dividedinto regular nucleotides, ﬁrst nucleotide in ﬁrst strand, last nucleotide in ﬁrst strand etc.that inherited the properties of the class Nucleotide.The project is mostly written in a general form and can be applied to all kinds of faces(hairpin loops, bulges etc.) with slight software changes.In order to to have all the initial information of the interior loop, a PDB ﬁle wasread (1AFX.pdb) and the information of the nucleotides comprising the interior loop wasstored.The bases’ atoms, throughout the program, were generated using two base vectors thatspan the 2d plane in which the base resides (since the base is rigid and planar). Sincethe number of conﬁgurations over which we sum their energy values is huge, Kahan’salgorithm was used to reduce accumulative errors of the ﬂoating point numbers [28].In order to enable us to change the parameters needed for the runs without having tocompile the software each time, parameter ﬁles were written and read in run time.

As the computational power needed in the simulation is big, optimization of the runningtime was taken seriously.It was done according to the instructions in [29] with the aid of Visual Studio per-formance wizard, which enabled us to identify the bottlenecks and the time consumingparts of the program. The code was revised in many places and the running time wasminimized mostly due to the following steps : • Avoidance of using library functions for calculations if their full functionality isn’tused (factor of ∼ • Avoidance of memory allocations in places where it’s not obligatory (factor of ∼ • Use of inline functions in cases in which they are used frequently. Inlined functionsare copied into place instead of being called (factor of ∼ ∼ ∼ ∼ In order to visualize the simulation, the capability to write the information of the oligonu-cleotide instance to a PDB ﬁle was added. Since we also wanted to visualize the devel-opment of the simulation in time, PYMOL which is a software that generates a videofrom PDB ﬁles, was used. PYMOL commands have been studied in order to enable us tocreate movies. Then, the commands were integrated into the software in order to generateautomatically a script ﬁle that, when double clicked, shows a movie of the PDB ﬁles.

The simulation for each of the four systems consisted of parallel tempering over ﬁvetemperature ranges (see Appendix (3.6.1)), each performed on a single core. For eachreplica there were 200,000 conﬁguration exchange attempts with replicas at adjacenttemperatures. Since there are ∼ Monte-Carlo moves between adjacent conﬁgurationsexchange attempts, there were ∼ · Monte Carlo moves for each temperature. Overall,for the four systems for all the temperatures simulated for the two cutoﬀ energies ∼ · Monte-Carlo micro-states were generated, from which a movie of a length of ∼ yearscan be created (30 FPS). 30 .7 Results The consistency of the results was checked in several ways: • In the canonical ensemble, the following relation is satisﬁed: D ( △ E ) E = − ∂U∂β .Since in the simulation we can calculate the variance of the energy and also theinternal energy as a function of beta, we were able to plot and compare the twosides of the equation. The results for the diﬀerent bases are shown in ﬁgure 15:Figure 15: Variance of the energy and crude partial derivative of the energy with respectto beta as a function of temperature [(Kcal/Mole)^2]

320 340 360 380 400 420 440 460 480020406080100120

ATemperature [K]

Variance of the energyCrude partial derivative of the energy w.r to beta

320 340 360 380 400 420 440 460 480020406080100120

CTemperature [K]

Variance of the energyCrude partial derivative of the energy w.r to beta

320 340 360 380 400 420 440 460 480020406080100120

GTemperature [K]

Variance of the energyCrude partial derivative of the energy w.r to beta

320 340 360 380 400 420 440 460 480020406080100120

UTemperature [K]

Variance of the energyCrude partial derivative of the energy w.r to beta

It can be seen that the agreement is good, up to deviations that originate mainlyfrom the relatively large steps in beta and from the higher standard deviations inthe internal energy near the peaks. 31

The energy distribution function at given betas β , β are: P ( E, β ) = g ( E ) exp ( − β E ) Z β , P ( E, β ) = g ( E ) exp ( − β E ) Z β . Hence, we can estimate P ( E, β ) using P ( E, β ) asfollows: P ( E, β ) est = P ( E,β ) exp ( − E ( β − β )) R P ( E,β ) exp ( − E ( β − β )) dE . The results are shown in thefollowing ﬁgure:Figure 16: Energy distribution for T=310K,T=320K, and estimation for T=320K withU as the second base -40 -30 -20 -10 0 10 20 30 4000.0020.0040.0060.0080.010.0120.014 Energy [kCal/Mole] P r obab ili t y d i s t r i bu t i on T=310KT=320KEstimation at T=320K

It can be seen that there is good agreement between the energy distribution atT=320K (green line) and the one estimated at T=320K (red line), based on thedistribution at T=310K (the red and green lines practically coincide). • It had to be veriﬁed that the integral β R β h H X i d ˜ β is independent of the cutoﬀenergy as discussed in 2.5.5, and the results are shown later on. We ran the simulation and captured the energetically favorable micro-states in orderto asses the energy model. The selected micro-states, and their energy values for theconﬁguration with diﬀerent bases are presented in the following ﬁgures: (1-3 ﬁrst strand,4-6 second strand. atoms by their color: white - Hydrogen, blue- Nitrogen, red- Oxygen,green - Carbon) 32igure 17: Micro-state of the system with base A with low energy. Energy values(kCal/Mole): Total -23.12, Stacking -20.46, Hydrogen -5.19, Electrical 1.99, Steric 0.54Figure 18: Micro-state of the system with base C with low energy. Energy values(kCal/Mole): Total -23.05 , Stacking -20.38, Hydrogen -5.29 , Electrical 1.9 , Steric0.72 33igure 19: Micro-state of the system with base G with low energy. Energy values(kCal/Mole): Total <23 , Stacking -20.42 , Hydrogen <-8.5, Electrical 2.09, Steric 3.29Figure 20: Micro-state of the system with base U with low energy. Energy values(kCal/Mole): Total -23.04, Stacking -16.78, Hydrogen -8.63, Electrical 1.88, Steric 0.534t can be seen that the geometries of the 2nd base with respect to the 5th basein the low energy states are similar to the ones reported in the literature. The A-Gpairing geometry is similar to the one presented in [17]. The A-A pairing geometry isalso similar to the one presented in [17] and involves Hoogsteen base pairing. The U-Apairing geometry is similar to the well known Watson-Crick Base pairing.

The acceptance rates of the internal and external moves as a function of temperature areshown in the following ﬁgure:Figure 21: Acceptance rate of internal and external moves - system with U as the secondbase

300 320 340 360 380 400 420 440 460 4800.40.50.60.70.80.91 Temperature [K] Acceptance rate internal moveAcceptance rate external move

As expected, these acceptance rates are larger for higher temperatures.The conﬁguration exchange acceptance rates were high since we took small steps intemperature in order to sample well all the temperature range (values around ∼ . ).35 .7.4 Results Here we present the average of the total, stacking, Hydrogen bond, electrical, and stericenergies for the four systems:Figure 22: The diﬀerent average energies as a function of temperature for system withthe second base A,C

300 320 340 360 380 400 420 440 460 480-25-20-15-10-505

ATemperature [K] E ne r g y [ k C a l / M o l e ] Total energyStacking energyHydrogen energyElectrical energySterical energy

300 320 340 360 380 400 420 440 460 480-25-20-15-10-505

CTemperature [K] E ne r g y [ k C a l / M o l e ] Total energyStacking energyHydrogen energyElectrical energySterical energy

300 320 340 360 380 400 420 440 460 480-25-20-15-10-505

GTemperature [K] E ne r g y [ k C a l / M o l e ] Total energyStacking energyHydrogen energyElectrical energySterical energy

300 320 340 360 380 400 420 440 460 480-25-20-15-10-505

UTemperature [K] E ne r g y [ k C a l / M o l e ] Total energyStacking energyHydrogen energyElectrical energySterical energy

It can be seen that the dominating terms are the stacking and Hydrogen interaction(the Hydrogen interaction energy of the ﬁrst and last pair aren’t taken into account). Asexpected all the energies are monotonically increasing function of the temperature. Themost signiﬁcant change is in the stacking and Hydrogen energies since there’s unwindingof the pair in the middle. 37he following is a graph of the heat capacity at constant pressure C P = (cid:0) ∂H∂T (cid:1) N.P forthe four systems:Figure 24: C p as a function of temperature for the diﬀerent bases

300 350 400 450 50000.10.20.30.4 A C p [ k C a l / M o l e K ] Temperature [K] 300 350 400 450 50000.10.20.30.4 C C p [ k C a l / M o l e K ] Temperature [K]300 350 400 450 50000.10.20.30.4 G C p [ k C a l / M o l e K ] Temperature [K] 300 350 400 450 50000.10.20.30.4 U C p [ k C a l / M o l e K ] Temperature [K]

In all the conﬁgurations there is a maximum in the heat capacity. This maximumrepresents the transition from a system in which the dominant energy terms are thestacking and Hydrogen energies to a system which is dominated mainly by the stericenergy terms. 38ere we present the normalized energy histogram for various temperatures for a systemwith the second base U:Figure 25: Normalized energy histogram for diﬀerent temperatures for a system with thesecond base U -40 -30 -20 -10 0 10 20 30 4000.0020.0040.0060.0080.010.0120.014 Energy [kCal/Mole] P r obab ili t y d i s t r i bu t i on The probability for a given energy for the continuous case is: P ( E ) = g ( E ) e − E/kT /Z and in the MC simulation it’s simply the normalized histogram of the energies.It can be concluded from the graph that there are two peaks in the density of statesat these temperatures. This is since there is an increase in the energy distribution thatcan’t be attributed to e − E/kT . As a result, the standard deviation in the energy attemperatures around 370K is high. The following graphs are the internal energies as afunction of β :Figure 26: Internal energy as a function of beta for the diﬀerent bases E ne r g y [ k C a l / M o l e ] ACGU βs appear because the high temperature enablesthe molecules to cross one another so the average steric energy is high. In the next ﬁgurewe compared the energy as a function of β for conﬁgurations with bases A and C.Figure 27: Log of the energy as a function of log β - comparison between interior loopswith A and C -5 -4 -3 -2 -1 Beta [Mole/kCal] E ne r g y [ k C a l / M o l e ] AC As can be seen from the ﬁgure, there is saturation in the energy at small β s. This isdue to the fact that kT at these temperatures is much higher than the energy terms andthere is free movement. The fact that the conﬁguration with A has higher internal energyat low β s can be explained by the higher number of atoms that A contains compared toC, which results in more atoms on average that have high energy terms.As we raise the cutoﬀ energy, the temperature from which the behavior of the systemdiﬀers from the original one is higher. Together with it, the internal energy that we’ll haveat inﬁnite temperatures will be higher as the molecules are free to enter the volume of oneanother and will experience higher energies on average (since the cutoﬀ energy is higher).This is in agreement with the assumption that the integral of the internal energies overbeta is independent of the cutoﬀ energies as long as we satisfy the relevant conditions. Inthe following graph we can see that for the higher cutoﬀ we have lower internal energies at β ∼ M ole/kCal ] and higher internal energies for β < − [ M ole/kCal ] . This behavioris more discernible as the diﬀerence between the cutoﬀ energies becomes larger.40igure 28: Log of the energy as a function of log of beta for the two cutoﬀs - A -5 -4 -3 -2 -1 Beta [K] E ne r g y [ k C a l / M o l e ] Ecutoff=5[kCal/Mole]Ecutoff=4.4[kCal/Mole]

We calculated the values of the integral of the internal energy over β for the chosencutoﬀ energies. The results are shown in the following ﬁgure :Figure 29: Values of the integral for the diﬀerent cutoﬀ energies (in kCal/Mole) at bodytemperature E cutoﬀ = 5 kCal/M ole E cutoﬀ = 4 . kCal/M ole β R β h H A i d ˜ β β R β h H C i d ˜ β β R β h H G i d ˜ β -1.72 -1.66 β R β h H U i d ˜ β E cutoﬀ = 5 kCal/M ole E cutoﬀ = 4 . kCal/M oleU A ( T = 310 K ) -23.29 -23.23 U C ( T = 310 K ) -20.04 -19.85 U G ( T = 310 K ) -29.08 -28.97 U U ( T = 310 K ) -24.01 -23.9841ince we have the internal energy of the diﬀerent conﬁgurations (see Fig 30), we canalso calculate the entropy diﬀerence.For example, the entropy diﬀerence between A and G is calculated as follows for E cutoﬀ = 5 [ kCal/M ole ] : ∆ F A → G = ∆ H A → G − T ∆ S A → G = − . − · ∆ S A → G = − .

45 [ kCal/M ole ] ⇒ ∆ S A → G = − . kCal/KM ole ] , − T ∆ S A → G = 3 .

34 [ kCal/M ole ] For E cutoﬀ = 4 . kCal/M ole ] we have: ∆ F A → G = ∆ H A → G − T ∆ S A → G = − . − · ∆ S A → G = − .

49 [ kCal/M ole ] ⇒ ∆ S A → G = − . kCal/KM ole ] , − T ∆ S A → G = 3 .

25 [ kCal/M ole ] It can be seen that there is ∼ deviation in the entropy value between the systemswith the two cutoﬀs. This deviation can be minimized by using higher cutoﬀs (the cutoﬀssatisfy E cutoﬀ ≈ kT body , E cutoﬀ ≈ kT body ) and larger number of iterations in the MCsimulations.We can also calculate the entropies for higher temperatures in the same manner.Results are shown in the following ﬁgure:Figure 31: ∆ S A → G as a function of temperature

310 315 320 325 330 335 340-0.018-0.017-0.016-0.015-0.014-0.013-0.012-0.011-0.01 Temperature [K] A - G E n t r op y d i ff e r en c e [ kc a l / K M o l e ] The entropy diﬀerence in absolute value gets larger as we raise the temperature. Theentropy is a monotonically increasing function of the internal energy, which is a mono-tonically increasing function of the temperature. Hence, as we increase the temperature,the entropies are expected to get larger and it’s likely that the entropy diﬀerence betweenthe systems will also be such.

In conclusion, we presented models for faces and the interactions in RNAs and introduceda procedure to calculate the free energy of interior loops, and faces in general. Thisprocedure, if veriﬁed or improved may suggest an alternative to the experiments in whichfree energy of faces are calculated. Under the assumptions that the secondary structure42s known and that faces are decoupled, simulations on several faces can also be performedtogether to generate the whole tertiary structure, and more global properties can becollected. In the calculation of free energy we showed a method in which the informationgathered in the course of parallel tempering can be used to calculate the free energydiﬀerence between systems. In this method, if the energy landscape is rugged, instead ofsampling both over the space of λ , the parameter related to the TI procedure, and thetemperature, we sample just over the temperatures of the conﬁgurations of interest (upto inﬁnity). This method is much easier to implement than ThI and is likely to be muchfaster in rugged energy landscapes. This is crucial when the cost of the calculation ishigh, as is often the case in RNA tertiary structures simulations. In the method we hadto bring the two systems to regimes at which the partition functions are equal. Althoughthis is easily satisﬁed for systems that have the same β → limit, we wanted the conditionto be satisﬁed for all systems that have the same degrees of freedom. We introduced acutoﬀ energy in the repulsion term which enabled the molecules to enter the “volume” ofone another and as a result the condition could be satisﬁed. This use cutoﬀ energy canalso be applied to other methods such as ThI and enable the calculation of free energy inthe cases where the sampling of the integrand isn’t feasible.Regarding the optimization of software written in c/c++ languages, though the rangeof possible actions is huge, if the bottlenecks are identiﬁed and the relevant actions areperformed (as listed in 2.6.2), together with the use of the proper compiler, and thecapabilities of vectorization of the processor, a tremendous improvement in the runningtime can be achieved in a relatively short time. As reported by [30], the backbone bond angles deﬁned in our model are in the range of ◦ − ◦ . The P-C-N bond angle also showed small standard deviation by PDB analysis.In order to take it into account we can add the following energy terms: H bond angle = P i kT (cid:20)(cid:16) θ PCPi − θ PCPavg σ θP CP (cid:17) + (cid:16) θ CPCi − θ CPCavg σ θCPC (cid:17) + (cid:16) θ PCNi − θ PCNavg σ θPCN (cid:17) (cid:21) Alternatively, we can switch to full atomistic modelling of the backbone and the ribose.As will become clear later, this modelling will include 7 degrees of freedom per nucleotidefor the seven torsion angles. A CONROT (concerted rotation) Monte-Carlo move can beused, in which 7 consecutive torsion angles are changed keeping the bond angles, distancesand the rest of the conﬁguration ﬁxed [31, 32]. As the backbone torsion angle δ is directlycorrelated to the ribose dihedral angle ν which determines the ribose conﬁguration [33],it seems that the ribose doesn’t introduce a degree of freedom [15] other than the torsionangle χ that deﬁnes the orientation of the base with respect to the ribose. Since thefaces are assumed to be decoupled from one another, the conﬁguration of the ribose andhence δ and the torsion angle χ have to be constant in the nucleotides that are sharedby two faces. Hence, the degrees of freedom that we have consist of the 7 torsion angles α, β, γ, δ, ε, ζ, χ for the nucleotides that are not shared between faces and 5 torsion angles α, β, γ, ε, ζ for the nucleotides that are shared between faces. A more natural choice ofdivision of the atoms that belong to a nucleotide would be C4’ to C4’ since the faces areassumed to be decoupled 43he ribose also has δ dependent energy landscape [33] which can be taken into account.Regarding the Monte Carlo moves, we can perform the following moves: • CONROT move in the backbone that won’t change δ of the unshared nucleotides. • Rotation of the χ torsion angle in the unshared nucleotides. • A move that will aﬀect the location of the ribose and the base of the shared nu-cleotides and that will keep δ , χ , δ , χ constant. Such a move can be the fol-lowing: we can deﬁne virtual bond C ′ − C ′ that connects the two shared nu-cleotides. Since the bond and torsion angles between these atoms are assumed tobe constant, the length of this bond can be considered to be ﬁxed. Since rotationaround the ε or γ torsion angle rotates the virtual bond, the virtual bond angles O ′ − C ′ − C ′ , C ′ − C ′ − C ′ can be considered ﬁxed. Now, since the require-ments for the CONROT move are satisﬁed (ﬁxed bond length and bond angle), the C ′ − C ′ can be considered as a bond in a CONROT move that includes twomore atoms and involves the changing the location of the ribose and the base ofthe shared nucleotides. This use of virtual bonds in the CONROT move is rathergeneral and can be used in other places.44 Appendixes

The software reads the string of ’(’, ’.’ and ’)’ characters from left to right and pro-cesses the information iteratively. In each iteration, the software runs over the ’(’ and ’.’characters until it reaches a ’)’ character. In this process it saves the indexes of the ’(’characters in the “bases to be paired” list. Then it runs over the ’)’ characters and pairsthem with the bases from the list (saves them in the “paired bases” list) until it reachesa ’(’ character. At this point we have a list of “paired bases” and a list of “bases to bepaired” . Here we stop to deﬁne a process in which a son is generated and all the pairedbases are moved to it (which will be called “process” from now on). We also mentionhere that each structure has, as one of its ﬁelds, the number of unpaired bases after theprocess (The initial value for this ﬁeld is -1). If, at this point of the iteration, there aremore unpaired bases in the list, a process is done - a son is generated, and the pairs arestored in it (we’ll call this condition 1). If, it continued the pairing task with the father (ifduring the iteration, there are more ’)’ characters after all the bases have been paired, itwill continue the pairing task with the father ( ∗ ) ) , and the father’s substructure has lessunpaired bases, than it had before, it means that it detected a multiloop (called condition2). In that case, a son which is a multiloop is created and all the structure’s sons aremoved to it. If a son was generated, in the next iteration, if a ’(’ is detected ﬁrst, then anew son is generated, and the iteration is processed with this son ( ∗∗ ) .For example, for the following string: ( .... ( .... ) .... ( ... ( ... ) .... ( .... ) ... ) .... ( .... ) .... ) A..B...B ′ ...C..D.D ′ ..E..E ′ ..C ′ ..F..F ′ ..A ′ (Where the letters specify locations in the string).Using the algorithm speciﬁed above we’ll have the following steps:45teration 1 2 3 4startingpoint A C E F endingpoint

C E F A ′ conditionsfulﬁlledduring theiteration ( ∗ ) , ( ∗∗ ) ( ∗ ) , ( ∗∗ ) list of pairsafter theiteration BB ′ DD ′ EE ′ list ofbases to bepairedafter theiteration A A, C A conditionfulﬁlledafter theiteration 1 1 2 2Where * is used the specify the structure that the algorithm begins the iteration with.The software can give as an output pairs, faces and nucleotide’s sequences for eachsubstructure.

For the ﬁrst structure, it runs over all faces, and for each face, it looks for identical onesin the second structure. If it ﬁnds an identical face, it updates the faces’ ﬁeld similaritylevel to identical. It then removes it from the list of faces of the ﬁrst and second structureand adds it to lists of similar faces, in order (identical face is also similar).Then it runs over the nonidentical faces in the ﬁrst structure and for each face, itlooks for similar faces in the second structure. In case it ﬁnds a similar face, it updatesthe faces’ ﬁeld similarity level to similar. It then removes them from the list of faces ofthe ﬁrst and second structure and adds it to the new lists. Then it generates one vectorfor the ﬁrst structure and one vector for the second structure that include the diﬀerentfaces. It then sorts them according to the face type (hairpin loop, stacking region bulge,interior loop or multiloop). Thus it enables us to see which faces are preferred over whichfaces in the two structures. 46 .3 Experimental calculation of free energy

Say we have two strands of RNA that can bind in the following reaction: [ A ] + [ B ] → [ AB ] (30)where [ A ] , [ B ] denote the concentrations of the single strands and [ AB ] denotes theconcentration of their bound product.The free energy diﬀerence is: △ G = − RT lnK (31)Where the dissociation constant K is deﬁned as follows: K , [ AB ][ A ] [ B ] (32)At T m , by deﬁnition, the concentration of the reactors and the products is equal.Say we take an initial concentration of the reactors of [ A ] = [ B ] = X , and zero initialconcentration of the product [ AB ] = 0 .According to the reaction, it can be seen that at T m (where T m denotes meltingtemperature) we have: [ A ] = [ B ] = [ AB ] = X .Substituting it in (31), we get: △ G T m = − RT lnK T m = RT ln (cid:18) X (cid:19) (33)We can use: △ G T m = △ H ( T m ) − T m △ S ( T m ) (34)and get: T m = RH ( T m ) ln (cid:18) X (cid:19) + △ S ( T m ) △ H ( T m ) (35)Now each time the initial concentration of the reactors is changed and T m is measured.Thus, a graph of T m as a function of ln (cid:0) X (cid:1) can be generated.From this graph △ H ( T m ) , △ S ( T m ) can be extracted.It is here to mention that the product consists of many faces and the values of theenthalpy and entropy are assigned by ﬁtting results to the faces from many experiments.47 .3.2 Finding enthalpy and entropy values at body temperature Even if we assume that the former experiments are performed in a small range of temper-atures, in which the enthalpy and entropy are constant, it can’t be assumed that thesetemperatures are close to body temperature.Measuring the heat capacity as a function of temperature enables the calculation ofthe enthalpy and entropy values at body temperature. C p = (cid:18) dQdT (cid:19) P (36)The enthalpy and entropy can be expressed as functionals dependent on the functionof heat capacity over temperature.Since the process is quasistatic: dQ = T dS , and we can write: (cid:18) dSdT (cid:19) P = 1 T C P (37)Regarding the enthalpy: H ( S, P, N ) since it’s a function of S, P and N and P, N are held constant we canwrite the derivative of the enthalpy with respect to the temperature as follows: (cid:18) ∂H∂T (cid:19)

P,N = ∂H∂S ∂S∂T = T T C P = C p (38)So we can write: △ H ( T body ) = △ H ( T m ) + Z T m T body C p dT (39) △ S ( T body ) = △ S ( T m ) + Z T m T body T C p dT (40)And In conclusion : △ H ( T m ) , △ S ( T m ) , C p ( T ) ⇒ △ H ( T body ) , △ S ( T body ) However, in many calculations, dependency of the heat capacity on the temperaturewas neglected, so the experiments are to be redone.

According to the detailed balance criteria, it can be written for the canonical ensemble: P ( i → j ) P ( j → i ) = P ( j ) P ( i ) = e − βH j e − βH i = e − β ( H j − H i ) = (41) min (cid:8) , e − β ( H j − H i ) (cid:9) min (cid:8) , e + β ( H j − H i ) (cid:9) (42)48o for the choice of P ( i → j ) = min n , e − β ( H j − H i ) o , P ( j → i ) = min n , e + β ( H j − H i ) o (43)detailed balance is fulﬁlled. Thermodynamic integration is a method to calculate free energy diﬀerences between twosystems (in our case they can diﬀer by of a base).In the method a parameter that transforms the ﬁrst system to the second one isintroduced.We can deﬁne A as the ﬁrst system, and B as the second system.We can write the free energy diﬀerence as follows: △ F A → B ( β ) = Z dF λ dλ dλ = (44)Where λ is a general parameter. We can substitute F = − kT lnZ and get: − kT Z dlnZdλ dλ = − kT Z Z dZdλ dλ = Substituting Z = R e − βH d Ω we get: − kT Z Z R e − β H d Ω dλ dλ = kT Z Z Z e − β H dHdλ d Ω β dλ = Z R e − β H dHdλ d Ω Z dλ = Z (cid:28) dHdλ (cid:29) dλ = For a choice of: 49 = λH B + (1 − λ ) H A (45)We have for dHdλ = H B − H A And it can be written: △ F A → B ( β ) = Z h H B − H A i dλ (46)Or in a more explicit form: △ F A → B ( β ) = Z (cid:8)R ( H B − H A ) e − β [ λH B +(1 − λ ) H A ] d Ω (cid:9)R e − β [ λH A +(1 − λ ) H B ] d Ω dλ In order to perform the integration, the following temperatures were chosen: • range 1: 310 320 329 340 350 359 370 380 390 400 410 420 429 440 470 • range 2: 500 580 670 780 890 1000 • range 3: 1000 2285 3571 4857 6142 7428 8714 • range 4: 10000 22857 35714 48571 61428 74285 87142 • range 5: 100000 200000 400000 1000000 10000000 We can estimate the number of steps it takes for two conﬁgurations to be independent, bythe average number of steps it takes for a replica of the system in a given temperature toexchange conﬁguration with a replica of the system in an adjacent temperature, multipliedby the square of the number of temperatures. Since the average number of steps fora conﬁguration exchange is the number of steps after which we attempt to exchange,divided the acceptance rate, we can estimate the number of independent measurementsin the following way: n = MC moves(steps for a conﬁguration exchange attempt / acceptancerate) · N For a temperature in the ﬁrst temperature range we can write: n = · · . · = 879 (where n is the number of independent measurements).50ince the standard deviation of the average of n independent variables with a standarddeviation σ i is: σ A = σ i √ n , we can use it to estimate the standard deviation of the internal energy.For example for the ﬁrst temperature, we can substitute the standard deviation of theenergy from the simulation and get: σ T = σ L √ n = . . = 0 .

14 [ kCal/M ole ] In the free energy calculation, we integrate numerically the average energy over functionof beta: β hR β h H X i d ˜ β i It is done by estimating the area between two points as the average energy multipliedby the diﬀerence in β . F X = β hR β h H X i d ˜ β i = β P i ( h H X ( β i ) i + h H X ( β i +1 ) i ) / · ∆ β i The standard deviation and the error range can be calculated as follows: σ F x = β s(cid:18)P i σ h H X ( β i ) i + σ h H X ( β i +1 ) i (cid:19) · ∆ β i = 0 . kCal/M ole ⇒ error range ∼ σ F x = 0 . kCal/M ole eferences [1] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences usingthermodynamics and auxiliary information. Nucleic acids research , 9(1):133, 1981.[2] K. Fejes-Toth, V. Sotirova, R. Sachidanandam, G. Assaf, G.J. Hannon, P. Kapranov,S. Foissac, A.T. Willingham, R. Duttagupta, E. Dumais, et al. Post-transcriptionalprocessing generates a diversity of 5?-modiﬁed long and short RNAs.

Nature ,457(7232):1028–1032, 2009.[3] G.V. Kryukov, S. Castellano, S.V. Novoselov, A.V. Lobanov, O. Zehtab, R. Guigo,and V.N. Gladyshev. Characterization of mammalian selenoproteomes.

Science ,300(5624):1439, 2003.[4] D. Fagegaltier, A. Lescure, R. Walczak, P. Carbon, and A. Krol. Structural analysisof new local features in SECIS RNA hairpins.

Nucleic Acids Research , 28(14):2679,2000.[5] Y. Ito, Y. Sone, and T. Mizutani. Stability of non-Watson-Crick GA/AG base pair insynthetic DNA and RNA oligonucleotides.

Molecular Biology Reports , 31(1):31–36,2004.[6] Y. Ito and T. Mizutani. Evidence of GA/AG pair, not AG/GA pair, in SECISelement. In

Nucleic Acids Symposium Series , volume 1, page 41. Oxford Univ Press,2001.[7] S.E. Morse and D.E. Draper. Purine–purine mismatches in RNA helices: evidencefor protonated G · A pairs and next-nearest neighbor eﬀects.

Nucleic acids research ,23(2):302, 1995.[8] Clark C. Barciszewski J, Frederic B.

RNA biochemistry and biotechnology. , pages73–87, 1999.[9] C. Schneider, M. Brandl, M. Meyer, and J. S"uhnel. Water-Mediated Uracil-Cytosine Base Pairs in RNA-A ComputationalStudy.[10] J.G. Kirkwood. Statistical mechanics of ﬂuid mixtures.

The Journal of ChemicalPhysics , 3:300, 1935.[11] TP Straatsma and JA McCammon. Multiconﬁguration thermodynamic integration.

The Journal of chemical physics , 95:1175, 1991.[12] D. Frenkel and B. Smit.

Understanding molecular simulation: from algorithms toapplications . Academic Press, Inc. Orlando, FL, USA, 1996.[13] W.K. Olson. Conﬁgurational statistics of polynucleotide chains. A single virtualbond treatment.

Macromolecules , 8(3):272–275, 1975.5314] K.S. Keating and A.M. Pyle. Semiautomated model building for RNA crystallogra-phy using a directed rotameric approach.

Proceedings of the National Academy ofSciences , 107(18):8177, 2010.[15] L.J.W. Murray, W.B. Arendall, D.C. Richardson, and J.S. Richardson. RNA back-bone is rotameric.

Proceedings of the National Academy of Sciences of the UnitedStates of America , 100(24):13905, 2003.[16] Wolfarm Saenger. Principles of Nucleic Acid Structure. page 125, 1984.[17] N.B. LEONTIS and E. Westhof. Geometric nomenclature and classiﬁcation of RNAbase pairs.

Rna , 7(04):499–512, 2001.[18] C. Smith P.J Struther Arnott and R. Chandrasekaran. Handbook of biochemistryand Molecular Biology. pages 411–422, 1976.[19] A. Rich, D.R. Davies, FHC Crick, and JD Watson. The molecular structure ofpolyadenylic acid.

Journal of molecular biology , 3(1):71–86, 1961.[20] Wolfarm Saenger. Principles of Nucleic Acid Structure. page 40, 1984.[21] Arthur J. Olson. Van der waals potential energy.[22] Wolfarm Saenger. Principles of Nucleic Acid Structure. page 139, 1984.[23] A. Vedani. YETI: An interactive molecular mechanics program for small-moleculeprotein complexes.

Journal of Computational Chemistry , 9(3):269–280, 1988.[24] T.A. Knotts IV, N. Rathore, D.C. Schwartz, and J.J. De Pablo. A coarse grain modelfor DNA.

The Journal of chemical physics , 126:084901, 2007.[25] J. Parsons, J.B. Holmes, J.M. Rojas, J. Tsai, and C.E.M. Strauss. Practical conver-sion from torsion space to Cartesian space for in silico protein synthesis.

Journal ofcomputational chemistry , 26(10):1063–1068, 2005.[26] U.H.E. Hansmann. Parallel tempering algorithm for conformational studies of bio-logical molecules.

Chemical Physics Letters , 281(1-3):140–150, 1997.[27] D.J. Earl and M.W. Deem. Parallel tempering: Theory, applications, and new per-spectives.

Physical Chemistry Chemical Physics , 7(23):3910–3916, 2005.[28] William Kahan. Further remarks on reducing truncation errors.

Communications ofthe ACM , 8(1):40,48, 1965.[29] Pete Isensee. C++ optimization strategies and techniques.[30] R. Malathi and N. Yathindra. Virtual bond probe to study ordered and randomcoil conformations of nucleic acids.

International Journal of Quantum Chemistry ,20(1):241–257, 1981. 5431] LR Dodd, TD Boone, and DN Theodorou. A concerted rotation algorithm foratomistic Monte Carlo simulation of polymer melts and glasses.

Molecular Physics ,78(4):961–996, 1993.[32] J.P. Ulmschneider and W.L. Jorgensen. Monte Carlo backbone sampling for nu-cleic acids using concerted rotations including variable bond angles.