Comparing secondary structures of RNA and calculating the free energy of an interior loop using a novel method for calculating free energy
aa r X i v : . [ q - b i o . B M ] D ec Thesis for the degree Master of ScienceBy Asaf FarhiComparing secondary structures of RNA andcalculating the free energy of an interior loop using anovel method for calculating free energy differences
Advisor: Prof. Eytan Domany
April 2011 cknowledgements
Firstly, I want to thank my supervisor Prof. Eytan Domany. I would like to thank himfor helping with identifying the upcoming obstacles in the project with his intuition andfor introducing me the interesting and skillful people I had the honor to work with. Iappreciate him for being flexible with the time frames as the project was highly demand-ing, and for investing time in reading the thesis and helping with his exceptional skill forexplaining things.I want to thank Dr. Michael Bon for introducing me the world of RNA (I used to callhim “Doctor of RNAs”) and sharing his vast knowledge in biology, chemistry, physics andcomputer science. Next, I want to thank Dr. Guy Hed for the fruitful collaboration andhis contribution in the Monte Carlo simulations and the physics.I also want to thank Prof. Nestor Caticha for helping with his vast experience inMonte Carlo simulations.I want to mention that I really enjoyed working at the Weizmann institute that pro-vided an excellent environment for scientific research. Specifically I want to thank toMichal Shoval, Perla Zalcberg, Rachel Goldman and Yossi Drier that are to be appreci-ated for their dedication. I want to thank my group for the social environment and myfamily for the support.There is a provisional patent pending that includes part of the content of this thesis. [email protected] bstract The thesis consists of two projects. In the first project, we present a software that anal-yses RNA secondary structures and compares them. The goal of this software is to findthe differences between two secondary structures (experimental or predicted) in order toimprove or compare algorithms for predicting secondary structures. Then, a comparisonbetween secondary structures predicted by the Vienna package to those found experimen-tally is presented and cases in which there exists a difference between the prediction andthe experimental structure are identified. As the differences originate mainly from facesand hydrogen bonds that are not allowed by the Vienna package, it is suggested thatprediction may be improved by integrating them into the software.In the second project we calculate the free energy of an interior loop using Monte-Carlo simulation. We first present a semi-coarse grained model for interior loops ofRNA, and the energy model for the different interactions. We then introduce the Monte-Carlo simulations and the method of Parallel Tempering which enables good sampling ofconfiguration space by simulating a system simultaneously at several temperatures. Nextwe present Thermodynamic Integration, which is a method for calculating free energydifferences. In this method simulations done at several values of a parameter λ are usedto yield the free energy difference in the form of an integral over λ . Hence, if simulation ateach value of λ necessitates parallel tempering, one has to simulate the system at a set of λ and T values. We introduce a method that calculates the free energy significantly fastersince we need to use only one parameter, T . To implement this method, we had to reach aregime in which the partition functions of the two systems are equal, which isn’t satisfiedfor systems that have different entropies in the high temperature limit, so a solution tothis problem had to be found. Free energy values calculated for various interior loopsare shown, and may, if verified or with a more realistic modelling, be integrated into theVienna package and supply an alternative to the experiments done. More applications ofthis method are also suggested. 3 ontents ∆ F by parallel tempering . . . . . . . . . . . . . . . 252.5.5 Passing the steric energy barriers by introducing a cutoff energy . 262.6 The Software and the simulations . . . . . . . . . . . . . . . . . . . . . . 292.6.1 General description . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.6.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292.6.3 Visualization of the simulation . . . . . . . . . . . . . . . . . . . . 302.6.4 Simulation details . . . . . . . . . . . . . . . . . . . . . . . . . . . 302.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.7.1 Consistency checks . . . . . . . . . . . . . . . . . . . . . . . . . . . 312.7.2 Energetically favorable micro-states . . . . . . . . . . . . . . . . . 322.7.3 Acceptance rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352.7.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422.8.1 Improvement proposals . . . . . . . . . . . . . . . . . . . . . . . . 43 Structural analysis and comparison of RNA secondarystructures
In this section, the components of the secondary structures of RNA will be defined. Then,a software that analyzes these components and compares secondary structures, will beintroduced. The main goal of the software is to compare two secondary structures and findthe differences between them in order to improve algorithms of prediction or to comparetwo algorithms for secondary structure prediction. The software can be further used todetect elements that are repeated in a certain group of RNAs and suggest a possiblefunctionality.
By convention, the beginning of the RNA molecule is marked by 5’ and the end is markedby 3’. The N nucleotides are referred to as vertices and the N-1 covalent bonds betweenconsecutive nucleotides are referred to as exterior edges. Base pairing via Hydrogen bondsis represented by a line segment called interior edge. The pairings can be between thefollowing pairs: G-C, A-U and G-U. The entire collection of edges and vertices is calleda graph. An admissible secondary structure is defined as a graph whose interior edgesdon’t intersect or cross each other (such crossings are called psuedo-knots).A face of a graph is defined to be a planar region, bounded on all sides by edges. Aface with a single interior edge is called hairpin loop (see example in Fig 3). Faces withtwo interior edges are classified into 3 groups. If the interior edges are separated by singleexterior edges on both sides, the face is called a stacking region. If they are separatedby a single exterior edge on one side, but by more on the other side, the face is calleda bulge loop. Otherwise the face is referred to as an interior loop. A face with three ormore interior edges is referred to as a multi-loop or bifurcation loop (see Fig. 1).Figure 1: Illustration of the different faces7he word hairpin refers to a structure or substructure, whose faces are consecutiveregion of stacking regions, bulges and interior loops, ending with a hairpin loop [1].
The software’s main tasks are to re-generate the structure from the output that representsthe hydrogen bond, to analyse it and to compare it with another structure of similarsequence. The software is suited for the output of the Vienna package in which a matchedpair is represented by parentheses (example is presented in Figure 2), and can be easilyadopted to suit the output that presents the paired bases, which is also common. It wasprogrammed in c++ in Unix environment (Unix compiler), using Eclipse.
The software uses tree representation (example is presented in Figure 2) for the structures,which is used for secondary structures and assumes no psuedo-knots. The software dividesthe structure into substructures, which include a sequence of faces with 2 interior edgesand a hairpin loop or a multiloop as their last face. Each substructure may containsons, which are substructures, that are connected to it via a multiloop. In the case of amultiloop, it contains the first pair of the multiloop, and the sons contain as their firstpair the other pairs of the multiloop.Figure 2: An example of a secondary structure ( a ) and its parenthesis ( b ) and treerepresentations ( c )( a )( b ) (((....))).((((((...))(((..))))))). ( c ) structure ւց structure hairpin structure ւ ց hairpin structure hairpin structure8n the process of identifying the tree representation the faces are detected, classifiedand stored in each sub-structure (the pairs are also stored in a different list). It is hereto mention that the software also stores in each face the unbound nucleotides. A detaileddescription of the algorithm is given in Appendix 3.1. • The functionality of the software in generating structures was verified on five RNAoutputs.
In order to compare two structures of the same RNA, the faces were compared. Theywere divided into three groups: identical , meaning that they have the same structure,nucleotide sequences and nucleotide indexes.
Similar , meaning that they have the samestructure, nucleotide sequence and not the same nucleotide indexes. And different , mean-ing that it doesn’t fall into the two categories mentioned (A detailed description of thealgorithm is given in Appendix 3.2). This division enables us, in case of different struc-tures, to identify the list of faces that have different free energy values. Thus, faces whichare suspected of having incorrect energy values may be focused on. The functionality ofthis comparison was checked on the structures of the RNAs: PDB_00030, PDB_00213,PDB_00394 and yielded the expected results.
The main uses of the structural analysis can be the following: • Comparison between structures of minimum free energy, found by different algo-rithms, and thus trying to extract in which cases they yield different results. Itcan be used to compare the mfe (minimum free energy) structures predicted by theVienna package and by other algorithms that have different free energy parameters.When the structures are different, the software can output the faces predicted byboth and the origin of the difference can be estimated. • Comparison between structures of minimum free energy of a certain algorithm andthe experimental results. One can assess which are the cases where differencesexist. Here again, the origin of difference can be identified and the parameters ofthe algorithm can be corrected accordingly. • Identification of structural motives (found by analyzing the structures predicted bythe algorithms) that are common in a group of RNAs, which may lead to classifica-tion and possible functionality. It can be also used to associate RNA to a group ofRNAs once its structural motives are known. An example for such structural simi-larity is the family of tRNA that includes a helix, which leads to a multiloop which isfollowed by three helices that end with an hairpin loop (typical secondary structureof tRNA is shown in Fig 3). A possible application, can be the PASRs (promoterassociated small RNAs) that are transcribed from the promoter’s regions. ThesePASRs bind at the promoter, and reduce the expression of the associated gene. Thesecondary structure of PASRs can be predicted using existing software. Then, using9his software, structural motives that are common can be detected and a structuralpossible classification, and even a functional mechanism can be estimated [2].Figure 3: Typical secondary structure of tRNA • Identification of certain sequences that are repeated in a group of RNAs, especiallyin unpaired nucleotides, which may provide clues about their function (can be doneusing experimental or predicted secondary structures). An example of repeatedsequences is the selenoprotein mRNA sequences. These mRNAs include the UGAcodon which has a dual function. It signals both the termination of protein synthe-sis and incorporation of the amino acid selenocysteine. Decoding of selenoproteintranslation requires the SECIS element, a stem-loop motif in the 3’-UTR of themRNA, carrying short or large apical loop. SECIS elements contain an Adenosinethat precedes the quartet of non–Watson-Crick base pairs, a UGA_GA motif in thequartet, and two adenosines in the apical loop or bulge [3][4].
The following RNA sequences: PDB 30,32,45,213,394,547,584,750,1236 were compared(experimental structures versus Vienna package results) to get initial results for the dif-ferences between the structures.It seems that two of the main sources of difference between the experimental structuresand the ones predicted by the algorithm were the following: • Bonds that appeared in the experimental structures and weren’t allowed by thealgorithms (Vienna package and others), like A-A U-C G-A G-G.10
Hairpin Loops, containing less than 4 exterior edges (3 in the tested sequences) wereincluded in the experimental structures and aren’t allowed by the algorithms.It was seen that in some cases (PDB 32,213,394,547,750,1236), the differences were local,meaning that there were identical faces in a substructure, then different faces, and thenan identical pair which terminates the local region of difference.Looking in the literature about the pairs which were detected in the experimentalstructures checked, and weren’t allowed by the algorithms, it seems that these pairsalso show up in other cases in RNA structures. Such an example is G-A/A-G pairingfound in SECIS element in the 3’ untranslated region of Se-protein mRNA and in thefunctional site of the hammerhead ribozyme. The measured melting temperature valuesof RNA oligonucleotides having this pair showed an intermediate value, between that ofthe Watson-Crick base pairs and that of non-base pairs [5][6]. Although rare, A-A andG-G pairs also occur in some cases. It was seen that a single G-G pair has a stabilizingeffect and can stabilize an internal loop [7] and that a group of Adenine bases can haveinteractions with each other in a bulge [8]. Another study shows that the water-mediatedUC pair is a structurally autonomous building block of the RNA [9]. Thus it seems thatthe structure of the RNA may be affected also by interactions that are less dominant.The tetraloops (Hairpin Loops, containing 3 exterior edges) that were seen in theexperimental data, seem to belong to the three types of tetraloops that are common inRNA: GNRA,UNCG and CUUG (R=purine, N=every nucleotide). Meaning that thealgorithms exclude these types of faces although if formed they may lead to a lower totalfree energy.Since the total number of different faces may be large, it is possible to detect regionsof difference and output local differences. Thus differences containing smaller number offaces can be detected, and tendencies may be pointed out. It can be done by detectinga sequence of identical faces followed by a sequence of different faces that ends with anidentical pair. Similar faces that belong to that region can also be detected, and thenumber of different faces can be further reduced.The local differences that were found are the following: (it should be noted here thatthe solid and dashed lines represent covalent and hydrogen bonds respectively and thatthe arrows point to the experimental data)11igure 4: local differences for PDB 32Experimental structure Vienna package prediction(a)(b)(c) Hairpin loop 12igure 5: local differences for PDB 213Experimental structure Vienna package prediction(a)(b)(c) Hairpin loop(d) Hairpin loop(e) 13igure 6: local differences for PDB 394Experimental structure Vienna package prediction(a) Hairpin loop(b) Multiloop
It can be concluded that the energy parameters of the algorithms give good predictionsfor the structures, and in most cases the general structure is well predicted. However,if we assume that the total free energy of a structure is the sum of the free energies ofthe faces, there are still faces whose free energy parameters aren’t estimated well enough.Local differences may lead to more specific cases of mismatch, and thus help us focus onthe faces that still have to be investigated.Certain differences between the structure of PDB 394 found experimentally and theone predicted by the Vienna package may suggest that the free energy of a multiloop mayvary significantly by its extension by 2 nucleotides. (the multiloop was extended ratherthan form a G-C hydrogen bond as suggested by the Vienna package). This may be dueto the entropic loss caused by its reduction by two nucleotides.Since the differences mainly originate from faces and hydrogen bonds that are notallowed by the Vienna package it is suggested that prediction may be improved by in-tegrating parameters of these faces into the software and predicting unallowed hydrogenbonds according to the relevant face respectively. This may lead to a more accurate anddetailed description of the pairs in the secondary structure prediction.14
Calculation of free energy of an interior loop
In the last decade, RNA has gained much attention, due the discovery that it fulfills aconsiderable diversity of biological tasks, including enzymatic and regulatory functions.Since RNA structure is highly related to its functionality, its of a top priority to predictand analyse secondary and tertiary RNA structures.The most common software that predict secondary structure are the Vienna pack-age/Mfold . These software assume that the total free energy of a secondary structure isthe sum of the free energies of its faces (planar regions, bounded on all sides by covalentor hydrogen bonds between nucleotides) [1]. These free energy parameters are extractedfrom experiments in which initial concentrations of reactors that are RNA segments arechanged, and the melting temperature for the reaction is measured (read Appendix 3.3.1for more details). However, in most of these experiments the heat capacity was assumednot to depend on the temperature, so they are to be redone (read Appendix 3.3.2 formore details). Here, we try to check the feasibility of a numerical calculation of the freeenergy values of faces, using a Monte-Carlo simulation. In the case that such a calculationwill be feasible, it might provide an alternative to the experiments that are done, andsupply more specific information, since in the experiments the faces’ free energy valuesare derived by fitting measured values of the larger secondary structures in which theyare included.This work is also relevant for calculation of thermodynamic properties of a completeRNA sequence and its tertiary free energy landscape, as it can be easily adopted to do so(assuming no pseudoknots). This is of importance due to the relatively small number oftheoretical works on RNAs as compared to the vast number of studies on proteins. Thesetheoretical works on RNAs are composed of mostly MD simulations in which a definitionof the coordinate of interest is introduced in advance, in order to investigate the freeenergy landscape. There are relatively fewer Monte-Carlo simulations that involve thecalculation of thermodynamic properties.In this work we present how the information gathered in a parallel tempering procedure(details are in 2.4) can be used to calculate the free energy differences. In this calculationwe had to reach equality of partition functions and since we didn’t want to be restrictedto systems that have the same high temperature description, we had to go into imaginaryregimes in which the steric constraints were relaxed. This regime was reached due tothe use of cutoff energy which is also relevant to sampling problems that appear in theThermodynamic Integration method [10][11][12].
The goals of this project are the following: • Model the RNA and the different interactions between the sites. • Perform Monte-Carlo simulation in which the space of states will be sampled cor-rectly, and that will converge and be consistent.15
Writing a software that will run in reasonable time frames. • Calculation of free energy of internal loops that can, in a more elaborate model,generate the free energy values used in the Vienna package.
In this project, we modeled and simulated an interior loop composed of 6 nucleotides- 3 in each strand. In this type of interior loop the first and last pair are bound incanonical or wobble base pairing, and the pair in the middle can have in the generalcase all combinations of bases, not belonging to the canonical or wobble base pairings.Specifically in this project we chose to simulate the following interior loop:Figure 7: Simulated Interior loop
G CX AG C where X stands for all types of nucleotides (A,C,G,U).
The model is partly coarse grained and partly in full atomistic detail. It is coarse-grained in its backbone description and in full atomistic detail in the description of thenucleobases. This choice of modelling of the structure is due to the important role thebases play in the interactions, as the base pairing and the stacking energies are thedominant ones.The model reduces the complexity of the backbone to two point interaction sites, onefor the phosphate and one for the starting point of the ribose. The phosphate interactionsite was chosen to reside where the P atom is located (all relevant atoms are presented infig 8), and the (beginning of the) ribose interaction site was chosen to reside where the C4’atom is located. Reasoning for this choice is the fact that the torsion angles about the twoC-O bonds, are preferably in the trans rotational isomeric state [13], which means thatthe P-C4’ and C4’-P bond lengths don’t vary much. It should be noted that accordingto the location of the C4’ and P atoms, the building of the RNA backbone in atomicdetail [14] and as a result the ribose can be achieved (The torsion angle δ determinesthe configuration of the ribose [15]).The bases were assumed to be rigid planar objectsas implied by [16]. The modelling of the bases included all the atoms (see Fig. 9 fordetails), what enables them to form Watson-Crick, and Hoogsteen base pairing [17]. Theatoms’ coordinates were calculated in the system of coordinates whose two base vectorsreside in the bases’ plane (based on the data form [18]). The bases were assumed to beconnected to the C4’ atom from the N1 atom in purines and the N9 in pyrimidines. The16istances between the two backbone sites, as well as the distance between the C4’ atomand the N1/N9 atom were chosen to be fixed. This was supported by an analysis of PDBfiles performed in which they showed small standard deviations (it was also reported in[19] that the backbone bonds defined have lengths of ∼ . ◦ A ). The distances betweenthe interaction sites were chosen to be the ones found in a specific pdb file (1AFX.pdb),for simplicity. It has to be mentioned here, that this modelling of the interior loopdoesn’t necessarily have less degrees of freedom than a full atomistic model. However,the treatment in terms of programming time is shorter.For an interior loop, there are two pairs of canonical/wobble hydrogen bonds. Thepairing is between the first base in one strand and the last base in the second strand, andbetween the last base in one strand and the first base in the second strand. In the model,it was assumed that the orientation of the paired bases is constant with respect to eachother. Regarding the other bases, they were assumed to have freedom of movement - theC4’-N bond was free to rotate and the base was free to rotate with respect to this bond.All the interaction sites in the backbone were also free to move with the restrictions thatthe first pair of bases is constant in space (due to rotation and translation symmetry),and the last pair of bases is treated like a rigid object (assumed to have fixed pairinggeometry) and free to move.Figure 8: Atomic numbering scheme and definition of torsion angles17igure 9: The four bases and their atomsA UG C The following energy terms were considered in the model:
One of the dominant terms is the stacking energy betweenthe bases, which originates from dispersive and repulsive interactions. The stacking energyis calculated between all the atoms in the bases and is represented by the 6-12 LennardJones potential: H stacking = X α,β α = β X i,j ǫ ij "(cid:18) r m ri ( α ) j ( β ) r i ( α ) j ( β ) (cid:19) − (cid:18) r m ri ( α ) j ( β ) r i ( α ) j ( β ) (cid:19) (1)where r i ( α ) j ( β ) and r m ri ( α ) j ( β ) are the distance between atoms and the distance atwhich the energy is minimum respectively ( i ( α ) is atom i of base α ).It can be easily seen that the minimum energy is ǫ and that we get infinite energy at r → and zero energy at r → ∞ . 18he distances of minimum energy were chosen from [20] and the energy values werechosen from [21] in order to be consistent with behavior of bases as reported in [22].The dominant energy terms have distance of minimum energy in the range ◦ A Another dominating term is the base-pairing energythat originate mainly from the hydrogen bonds.The potential is much more localized and was chosen to be: H bp = X i,j h bp ij T ij (2)where h bp ij = ε ij "(cid:18) r m ij r ij (cid:19) + tgh (cid:18) (cid:18) r ij − r m ij (cid:19)(cid:19) − − . (cid:18) r m ij r ij (cid:19) (3)and T ij = (cid:26) cos θ ij h bp < h bp ≥ (4) θ ij is the angle between the acceptor atom, the polar hydrogen and the donor atom(see Fig 10), and i, j are the indexes of the hydrogen and the acceptor respectively.The term tgh (cid:16) (cid:16) r ij − r mij (cid:17)(cid:17) − was introduced to achieve locality in the potential(keeping the energy zero at infinity), and the term cos θ was introduced to achieve direc-tionality.Figure 10: The angle between the acceptor atom, the polar hydrogen and the donor atomThe distance of min. energy is r m ij = 1 . ◦ A and the min. energies values ε ij are chosenaccording to the acceptor and the electronegativity of the donor. Explicitly, ε ij is the19roduct of the difference in electronegativity between the hydrogen atom and the donor,and a factor that accounts for the effect of the number of the lone pairs of the acceptor -1 for N and 2 for O (electrons in the outer shell can participate in the interaction).Another proposed potential is with: T ij = (cid:26) cos θ ij h bp < h bp ≥ (5)and with the factor that accounts for the effect of the number of the lone pairs ofthe acceptor - 1 for N and 1.15 for O. This choice is pretty much in accordance with theenergy parameters in [23]. Coulombic interactions are taken into account using theDebye-Huckel approximation, which is valid for the low-salt physiological conditions.The Coulombic interactions in the model are between the oxygens next to the phos-phates, and are calculated between phosphates of consecutive nucleotides (as in [24]). V qq = N X i Our goal is to find the free energy difference between two systems: two interior loops,which differ in a base. Specifically we will calculate the free energy difference between theinterior loops ˜ G , whose second base is G, and ˜ C , with the base C, as shown in Fig 12.Figure 12: ˜ C and ˜ G G C C AG C System ˜ C G CG AG C System ˜ G We denote this difference by △ F ˜ C → ˜ G ( β ) = F ˜ G ( β ) − F ˜ C ( β ) .The state of the bases G, C (the coordinates of all their atoms) for a given location ofthe N1/N9 atom of the base (see Fig 9 for details) is determined by three angles, sincethe bases are assumed to be rigid planar objects. Therefore the same degrees of freedomare associated with these different bases, and since the two systems ˜ G, ˜ C differ just in thebases, the two systems have the same degrees of freedom.The difference of free energy between any two systems that have the same degrees offreedom can, in principle, be calculated by the Thermodynamic Integration (ThI) method,as described in section 2.5.2. ThI requires running simulations of different instances(several values of a parameter λ i , i = 1 , , ...m ) of a hybrid ˜ G − ˜ C system.23hese systems have a rugged energy landscape (with many local minima), which meansthat the decorrelation time in a “naive” Monte-Carlo simulation are exponentially long.One of the widely used methods to alleviate the problem of equilibrating systems withsuch landscapes is parallel tempering (PT), which we describe below in Section 2.5.3. PTinvolves simulating the system of interest, in parallel, at several (inverse) temperatures β i , i = 1 , , ...n .The natural way to calculate the free energy difference for our two systems, △ F ˜ C → ˜ G ( β ) is, therefore, by ThI, equilibrating m systems, one for each value of λ i . For each suchequilibration PT is used, at n temperatures, hence all together nm systems are simulated.In section 2.5.4 we describe a method, Temperature Integration (TeI), which allows us tocalculate the free energy difference much more efficiently than the use of ThI with paralleltempering. Thermodynamic Integration (ThI) is one of the most commonly used methods for calcu-lating the free energy difference between two systems. It is described in detail in Appendix3.5. This method is used to calculate the free energy difference between two systems withthe same degrees of freedom [12, 10, 11], by varying a parameter λ that interpolates be-tween the two compared systems. We will use the systems ˜ C and ˜ G described above todemonstrate application of this general method in a specific context.Denote the Hamiltonians of the two systems by H ˜ C ( θ ) and H ˜ G ( θ ) , where θ denotesthe coordinates of the system. Noting that the two systems have the same coordinatespace, we define a λ - weighted hybrid system, characterized by the Hamiltonian H ( λ ) : H ( λ, θ ) = λH ˜ G ( θ ) + (1 − λ ) H ˜ C ( θ ) (16)As shown in Appendix 3.5, the free energy difference is given by △ F ˜ C → ˜ G ( β ) = Z (cid:2) h H ˜ G ( θ ) i λ − h H ˜ C ( θ ) i λ (cid:3) dλ (17)where h X i λ denotes the equilibrium average of X in the ensemble characterized by H ( λ ) .The expression for △ F ˜ C → ˜ G ( β ) , written explicitly, takes the form: △ F ˜ C → ˜ G ( β ) = Z (Z [ H ˜ G ( θ ) − H ˜ C ( θ )] e − β [ λH ˜ G ( θ )+(1 − λ ) H ˜ C ( θ ) ] d θ Z ( λ ) ) dλ (18)with Z ( λ ) = Z e − β [ λH ˜ G ( θ )+(1 − λ ) H ˜ C ( θ ) ] d θ (19)The integration is performed numerically, with the integrand evaluated at each one ofa set of values of λ by Monte Carlo simulations. As implied by (18), the two systems arerequired to have the same degrees of freedom.24 .5.3 Thermodynamic integration with parallel tempering The systems ˜ C and ˜ G , and the λ -weighted system have rugged energy landscapes withmany local minima. As the decorrelation times grow exponentially with △ E/kT , where △ E is the energy barrier between nearby valleys, equilibration times in these systemscan be rather long. In order to overcome this problem, and obtain the thermodynamicaverages h H ( θ ) i λ , one can use the Parallel Tempering procedure [26, 27]. However, imple-menting both thermodynamic integration and parallel tempering yields an unnecessaryoverhead in running time, as we will demonstrate.In order to implement thermodynamic integration we first choose a set of values λ i , i =1 , ...m , that will enable us to have a good sampling of the function h H ( θ ) i λ for theintegration in eq. (17). Thus, m is related to the desired precision of the integration.In principle parallel tempering should be performed for each λ i , simulating each of the m λ i weighted systems over a set of temperatures, given by β j , j = 1 , ...n . Finally, usingthe calculated values of h H ( θ ) i λ i at a temperature of interest β , we approximate theintegral of eq. (17) by a sum over m terms, to get the free energy difference △ F ˜ C → ˜ G ( β ) .The procedure involves running Monte-Carlo simulations over m λ -values, and n β -values, that is, over a grid of m × n instances of the hybrid system. An illustration of thisgrid is presented in figure 13.Figure 13: Illustration of the grid of values over which the Monte-Carlo simulations areperformed. The lowest values of β correspond to some very high finite temperature. B e t a [ M o l e / k C a l ] Lambda ∆ F by parallel tempering We present now a method which uses only simulations obtained in the process of paralleltempering (skipping the need for simulations at a set of λ i values), performed for the twosystems ˜ C and ˜ G , to obtain the free energy difference △ F ˜ C → ˜ G ( β ) . This method can beapplied to any two systems that have the same degrees of freedom θ .As β → , the limiting value of the partition function of a system yields the phasespace volume. In particular, if systems ˜ C and ˜ G , which have the same coordinate space25 θ } , have the same β → limit, we have Z ˜ G ( β → 0) = Z ˜ C ( β → 0) = Z d θ . (20)In this case we can use the following identity to obtain, for a finite β , the differenceof the free energies: ln Z ( β ) − ln Z ( β → 0) = Z β d ln Zdβ dβ = − Z β h H i dβ . (21)Using equations (20) and (21) we obtain △ F ˜ C → ˜ G ( β ) = 1 β [ln Z ˜ C ( β ) − ln Z ˜ G ( β )]= 1 β β Z h H ˜ G i dβ − β Z h H ˜ C i dβ . (22)For each of the systems ˜ C and ˜ G we estimate the integrals on the right hand sideby parallel tempering, sampling the system at a series of values β , . . . , β n . We choosevalues such that β − n is much larger than the internal energy of the system at β , so Z ( β n ) ∼ = Z ( β → .The condition of eq. (20) poses a problem for the particular systems ˜ C and ˜ G studiedhere. The steric excluded volume interactions are not bound in our model, and hence,the available phase space volumes for the two systems are not equal at any temperature,and equation (20) does not hold. In order to satisfy the condition stated above we hadto set a cutoff over the steric interactions, E cutoff . In the next section we show that ourresults do not depend on the choice of E cutoff and β n , as long as E cutoff is much largerthan any typical interaction energy in the system at β , and β − n ≫ E cutoff . We now tackle the problem of sampling at temperatures which are higher than all energyterms, by using a cutoff energy in the steric interaction potential. We denote the cutoffenergy and the distance from which the energy will be set to the cutoff energy by E cutoff and r cutoff respectively.The modified steric potential can be written as follows: (cid:0) σr (cid:1) → H steric ( r ) = (cid:26) (cid:0) σr (cid:1) r > r cutoff E cutoff r ≤ r cutoff 26n the following figure an example of such a steric potential is presented:Figure 14: Plot of an example of a steric interaction potential that includes a cutoff energy E n e r g y The proposed calculation of the free energy difference between the two systems atthe temperature of interest β is legitimate only if our choice of the cutoff energy has anegligible effect on the partition function of each of the two systems at β . In addition, thehighest temperature used, β , must be such that the equality of the partition functionsof the two systems is satisfied to a good accuracy.We denote the Hamiltonian with the cutoff energy by H ′ , and write the requirementsstated above explicitly as follows: lnZ ˜ G ( β , H ) ≃ lnZ ˜ G ( β , H ′ ) (23) lnZ ˜ C ( β , H ) ≃ lnZ ˜ C ( β , H ′ ) (24) lnZ ˜ G ( β , H ′ ) ≃ lnZ ˜ C ( β , H ′ ) (25)In order for the cutoff to have a negligible effect on the partition functions at thetemperature of interest it has to be set to a value that satisfies E cut off ≫ kT .As for β , if the cutoff energy satisfies E cut off ≪ kT , the systems will have almostequal probability to be in all the regions of their phase space, including ones in which themolecules entered the “volume” of one another. Thus, the partition functions of the twosystems will be almost equal.Hence if these requirements are satisfied one can write: lnZ ˜ C ( β , H ) − lnZ ˜ G ( β , H ) ≃ (26) lnZ ˜ C ( β , H ′ ) − lnZ ˜ G ( β , H ′ ) ≃ (27) lnZ ˜ G ( β , H ′ ) − lnZ ˜ G ( β , H ′ ) − [ lnZ ˜ C ( β , H ′ ) − lnZ ˜ C ( β , H ′ )] (28)27sing the identity in eq (21), we can write: △ F C → G ( β , H ) = 1 β [ lnZ ˜ C ( β , H ) − lnZ ˜ G ( β , H )] ≃ β β Z β h H ′ G i d ˜ β − β Z β h H ′ C i d ˜ β (29)So the calculation of free energy will be negligibly affected by the use of a cutoff energyas long as we fulfill the relevant conditions.It has to be mentioned that this use of cutoff energy is relevant also to ThI and similarmethods. In these methods if the systems have different β → limit, at λ = 0 , due tothe steric energy terms (represented by (cid:0) σr (cid:1) ) micro-states may have infinite energy. Asa result the internal energy will be infinite and the sampling of the internal energy willbe infeasible.It can be seen that since the free energy difference between systems that have theHamiltonian with the cutoff H ′ is almost equal to the one calculated between systemswith the Hamiltonian H △ F C → G ( β , H ) ≃ β [ lnZ ˜ C ( β , H ′ ) − lnZ ˜ G ( β , H ′ )] , ThI can be derived for systems with H ′ and yield almost the same result.In conclusion, with the use of cutoff energy in ThI, that enables us to sample theintegrand in all cases, the free energy difference between any two systems that have thesame degrees of freedom can be calculated. • Cutoff energies chosen for the simulation : E cutoff = 5 [ KCal/M ole ] , E cutoff =4 . KCal/M ole ] . These cutoff energies satisfy the conditions mentioned above.Hence, the results of the calculations of the free energy for the two cutoffs shouldbe similar. 28 .6 The Software and the simulations The project was written in c/c++ in Unix environment under Eclipse. It was mostlywritten in object oriented c++, and at places where performance was critical it waswritten in c. The model was divided into classes that represent its components (Oligonu-cleotide,Nucleotide, Base etc.). In cases where the behavior of the class varies accordingto its type, derived classes were used (classes that inherit properties of a base class andimplement more properties according to their type). For example nucleotides were dividedinto regular nucleotides, first nucleotide in first strand, last nucleotide in first strand etc.that inherited the properties of the class Nucleotide.The project is mostly written in a general form and can be applied to all kinds of faces(hairpin loops, bulges etc.) with slight software changes.In order to to have all the initial information of the interior loop, a PDB file wasread (1AFX.pdb) and the information of the nucleotides comprising the interior loop wasstored.The bases’ atoms, throughout the program, were generated using two base vectors thatspan the 2d plane in which the base resides (since the base is rigid and planar). Sincethe number of configurations over which we sum their energy values is huge, Kahan’salgorithm was used to reduce accumulative errors of the floating point numbers [28].In order to enable us to change the parameters needed for the runs without having tocompile the software each time, parameter files were written and read in run time. As the computational power needed in the simulation is big, optimization of the runningtime was taken seriously.It was done according to the instructions in [29] with the aid of Visual Studio per-formance wizard, which enabled us to identify the bottlenecks and the time consumingparts of the program. The code was revised in many places and the running time wasminimized mostly due to the following steps : • Avoidance of using library functions for calculations if their full functionality isn’tused (factor of ∼ • Avoidance of memory allocations in places where it’s not obligatory (factor of ∼ • Use of inline functions in cases in which they are used frequently. Inlined functionsare copied into place instead of being called (factor of ∼ ∼ ∼ ∼ In order to visualize the simulation, the capability to write the information of the oligonu-cleotide instance to a PDB file was added. Since we also wanted to visualize the devel-opment of the simulation in time, PYMOL which is a software that generates a videofrom PDB files, was used. PYMOL commands have been studied in order to enable us tocreate movies. Then, the commands were integrated into the software in order to generateautomatically a script file that, when double clicked, shows a movie of the PDB files. The simulation for each of the four systems consisted of parallel tempering over fivetemperature ranges (see Appendix (3.6.1)), each performed on a single core. For eachreplica there were 200,000 configuration exchange attempts with replicas at adjacenttemperatures. Since there are ∼ Monte-Carlo moves between adjacent configurationsexchange attempts, there were ∼ · Monte Carlo moves for each temperature. Overall,for the four systems for all the temperatures simulated for the two cutoff energies ∼ · Monte-Carlo micro-states were generated, from which a movie of a length of ∼ yearscan be created (30 FPS). 30 .7 Results The consistency of the results was checked in several ways: • In the canonical ensemble, the following relation is satisfied: D ( △ E ) E = − ∂U∂β .Since in the simulation we can calculate the variance of the energy and also theinternal energy as a function of beta, we were able to plot and compare the twosides of the equation. The results for the different bases are shown in figure 15:Figure 15: Variance of the energy and crude partial derivative of the energy with respectto beta as a function of temperature [(Kcal/Mole)^2] 320 340 360 380 400 420 440 460 480020406080100120 ATemperature [K] Variance of the energyCrude partial derivative of the energy w.r to beta 320 340 360 380 400 420 440 460 480020406080100120 CTemperature [K] Variance of the energyCrude partial derivative of the energy w.r to beta 320 340 360 380 400 420 440 460 480020406080100120 GTemperature [K] Variance of the energyCrude partial derivative of the energy w.r to beta 320 340 360 380 400 420 440 460 480020406080100120 UTemperature [K] Variance of the energyCrude partial derivative of the energy w.r to beta It can be seen that the agreement is good, up to deviations that originate mainlyfrom the relatively large steps in beta and from the higher standard deviations inthe internal energy near the peaks. 31 The energy distribution function at given betas β , β are: P ( E, β ) = g ( E ) exp ( − β E ) Z β , P ( E, β ) = g ( E ) exp ( − β E ) Z β . Hence, we can estimate P ( E, β ) using P ( E, β ) asfollows: P ( E, β ) est = P ( E,β ) exp ( − E ( β − β )) R P ( E,β ) exp ( − E ( β − β )) dE . The results are shown in thefollowing figure:Figure 16: Energy distribution for T=310K,T=320K, and estimation for T=320K withU as the second base -40 -30 -20 -10 0 10 20 30 4000.0020.0040.0060.0080.010.0120.014 Energy [kCal/Mole] P r obab ili t y d i s t r i bu t i on T=310KT=320KEstimation at T=320K It can be seen that there is good agreement between the energy distribution atT=320K (green line) and the one estimated at T=320K (red line), based on thedistribution at T=310K (the red and green lines practically coincide). • It had to be verified that the integral β R β h H X i d ˜ β is independent of the cutoffenergy as discussed in 2.5.5, and the results are shown later on. We ran the simulation and captured the energetically favorable micro-states in orderto asses the energy model. The selected micro-states, and their energy values for theconfiguration with different bases are presented in the following figures: (1-3 first strand,4-6 second strand. atoms by their color: white - Hydrogen, blue- Nitrogen, red- Oxygen,green - Carbon) 32igure 17: Micro-state of the system with base A with low energy. Energy values(kCal/Mole): Total -23.12, Stacking -20.46, Hydrogen -5.19, Electrical 1.99, Steric 0.54Figure 18: Micro-state of the system with base C with low energy. Energy values(kCal/Mole): Total -23.05 , Stacking -20.38, Hydrogen -5.29 , Electrical 1.9 , Steric0.72 33igure 19: Micro-state of the system with base G with low energy. Energy values(kCal/Mole): Total <23 , Stacking -20.42 , Hydrogen <-8.5, Electrical 2.09, Steric 3.29Figure 20: Micro-state of the system with base U with low energy. Energy values(kCal/Mole): Total -23.04, Stacking -16.78, Hydrogen -8.63, Electrical 1.88, Steric 0.534t can be seen that the geometries of the 2nd base with respect to the 5th basein the low energy states are similar to the ones reported in the literature. The A-Gpairing geometry is similar to the one presented in [17]. The A-A pairing geometry isalso similar to the one presented in [17] and involves Hoogsteen base pairing. The U-Apairing geometry is similar to the well known Watson-Crick Base pairing. The acceptance rates of the internal and external moves as a function of temperature areshown in the following figure:Figure 21: Acceptance rate of internal and external moves - system with U as the secondbase 300 320 340 360 380 400 420 440 460 4800.40.50.60.70.80.91 Temperature [K] Acceptance rate internal moveAcceptance rate external move As expected, these acceptance rates are larger for higher temperatures.The configuration exchange acceptance rates were high since we took small steps intemperature in order to sample well all the temperature range (values around ∼ . ).35 .7.4 Results Here we present the average of the total, stacking, Hydrogen bond, electrical, and stericenergies for the four systems:Figure 22: The different average energies as a function of temperature for system withthe second base A,C 300 320 340 360 380 400 420 440 460 480-25-20-15-10-505 ATemperature [K] E ne r g y [ k C a l / M o l e ] Total energyStacking energyHydrogen energyElectrical energySterical energy 300 320 340 360 380 400 420 440 460 480-25-20-15-10-505 CTemperature [K] E ne r g y [ k C a l / M o l e ] Total energyStacking energyHydrogen energyElectrical energySterical energy 300 320 340 360 380 400 420 440 460 480-25-20-15-10-505 GTemperature [K] E ne r g y [ k C a l / M o l e ] Total energyStacking energyHydrogen energyElectrical energySterical energy 300 320 340 360 380 400 420 440 460 480-25-20-15-10-505 UTemperature [K] E ne r g y [ k C a l / M o l e ] Total energyStacking energyHydrogen energyElectrical energySterical energy It can be seen that the dominating terms are the stacking and Hydrogen interaction(the Hydrogen interaction energy of the first and last pair aren’t taken into account). Asexpected all the energies are monotonically increasing function of the temperature. Themost significant change is in the stacking and Hydrogen energies since there’s unwindingof the pair in the middle. 37he following is a graph of the heat capacity at constant pressure C P = (cid:0) ∂H∂T (cid:1) N.P forthe four systems:Figure 24: C p as a function of temperature for the different bases 300 350 400 450 50000.10.20.30.4 A C p [ k C a l / M o l e K ] Temperature [K] 300 350 400 450 50000.10.20.30.4 C C p [ k C a l / M o l e K ] Temperature [K]300 350 400 450 50000.10.20.30.4 G C p [ k C a l / M o l e K ] Temperature [K] 300 350 400 450 50000.10.20.30.4 U C p [ k C a l / M o l e K ] Temperature [K] In all the configurations there is a maximum in the heat capacity. This maximumrepresents the transition from a system in which the dominant energy terms are thestacking and Hydrogen energies to a system which is dominated mainly by the stericenergy terms. 38ere we present the normalized energy histogram for various temperatures for a systemwith the second base U:Figure 25: Normalized energy histogram for different temperatures for a system with thesecond base U -40 -30 -20 -10 0 10 20 30 4000.0020.0040.0060.0080.010.0120.014 Energy [kCal/Mole] P r obab ili t y d i s t r i bu t i on The probability for a given energy for the continuous case is: P ( E ) = g ( E ) e − E/kT /Z and in the MC simulation it’s simply the normalized histogram of the energies.It can be concluded from the graph that there are two peaks in the density of statesat these temperatures. This is since there is an increase in the energy distribution thatcan’t be attributed to e − E/kT . As a result, the standard deviation in the energy attemperatures around 370K is high. The following graphs are the internal energies as afunction of β :Figure 26: Internal energy as a function of beta for the different bases E ne r g y [ k C a l / M o l e ] ACGU βs appear because the high temperature enablesthe molecules to cross one another so the average steric energy is high. In the next figurewe compared the energy as a function of β for configurations with bases A and C.Figure 27: Log of the energy as a function of log β - comparison between interior loopswith A and C -5 -4 -3 -2 -1 Beta [Mole/kCal] E ne r g y [ k C a l / M o l e ] AC As can be seen from the figure, there is saturation in the energy at small β s. This isdue to the fact that kT at these temperatures is much higher than the energy terms andthere is free movement. The fact that the configuration with A has higher internal energyat low β s can be explained by the higher number of atoms that A contains compared toC, which results in more atoms on average that have high energy terms.As we raise the cutoff energy, the temperature from which the behavior of the systemdiffers from the original one is higher. Together with it, the internal energy that we’ll haveat infinite temperatures will be higher as the molecules are free to enter the volume of oneanother and will experience higher energies on average (since the cutoff energy is higher).This is in agreement with the assumption that the integral of the internal energies overbeta is independent of the cutoff energies as long as we satisfy the relevant conditions. Inthe following graph we can see that for the higher cutoff we have lower internal energies at β ∼ M ole/kCal ] and higher internal energies for β < − [ M ole/kCal ] . This behavioris more discernible as the difference between the cutoff energies becomes larger.40igure 28: Log of the energy as a function of log of beta for the two cutoffs - A -5 -4 -3 -2 -1 Beta [K] E ne r g y [ k C a l / M o l e ] Ecutoff=5[kCal/Mole]Ecutoff=4.4[kCal/Mole] We calculated the values of the integral of the internal energy over β for the chosencutoff energies. The results are shown in the following figure :Figure 29: Values of the integral for the different cutoff energies (in kCal/Mole) at bodytemperature E cutoff = 5 kCal/M ole E cutoff = 4 . kCal/M ole β R β h H A i d ˜ β β R β h H C i d ˜ β β R β h H G i d ˜ β -1.72 -1.66 β R β h H U i d ˜ β E cutoff = 5 kCal/M ole E cutoff = 4 . kCal/M oleU A ( T = 310 K ) -23.29 -23.23 U C ( T = 310 K ) -20.04 -19.85 U G ( T = 310 K ) -29.08 -28.97 U U ( T = 310 K ) -24.01 -23.9841ince we have the internal energy of the different configurations (see Fig 30), we canalso calculate the entropy difference.For example, the entropy difference between A and G is calculated as follows for E cutoff = 5 [ kCal/M ole ] : ∆ F A → G = ∆ H A → G − T ∆ S A → G = − . − · ∆ S A → G = − . 45 [ kCal/M ole ] ⇒ ∆ S A → G = − . kCal/KM ole ] , − T ∆ S A → G = 3 . 34 [ kCal/M ole ] For E cutoff = 4 . kCal/M ole ] we have: ∆ F A → G = ∆ H A → G − T ∆ S A → G = − . − · ∆ S A → G = − . 49 [ kCal/M ole ] ⇒ ∆ S A → G = − . kCal/KM ole ] , − T ∆ S A → G = 3 . 25 [ kCal/M ole ] It can be seen that there is ∼ deviation in the entropy value between the systemswith the two cutoffs. This deviation can be minimized by using higher cutoffs (the cutoffssatisfy E cutoff ≈ kT body , E cutoff ≈ kT body ) and larger number of iterations in the MCsimulations.We can also calculate the entropies for higher temperatures in the same manner.Results are shown in the following figure:Figure 31: ∆ S A → G as a function of temperature 310 315 320 325 330 335 340-0.018-0.017-0.016-0.015-0.014-0.013-0.012-0.011-0.01 Temperature [K] A - G E n t r op y d i ff e r en c e [ kc a l / K M o l e ] The entropy difference in absolute value gets larger as we raise the temperature. Theentropy is a monotonically increasing function of the internal energy, which is a mono-tonically increasing function of the temperature. Hence, as we increase the temperature,the entropies are expected to get larger and it’s likely that the entropy difference betweenthe systems will also be such. In conclusion, we presented models for faces and the interactions in RNAs and introduceda procedure to calculate the free energy of interior loops, and faces in general. Thisprocedure, if verified or improved may suggest an alternative to the experiments in whichfree energy of faces are calculated. Under the assumptions that the secondary structure42s known and that faces are decoupled, simulations on several faces can also be performedtogether to generate the whole tertiary structure, and more global properties can becollected. In the calculation of free energy we showed a method in which the informationgathered in the course of parallel tempering can be used to calculate the free energydifference between systems. In this method, if the energy landscape is rugged, instead ofsampling both over the space of λ , the parameter related to the TI procedure, and thetemperature, we sample just over the temperatures of the configurations of interest (upto infinity). This method is much easier to implement than ThI and is likely to be muchfaster in rugged energy landscapes. This is crucial when the cost of the calculation ishigh, as is often the case in RNA tertiary structures simulations. In the method we hadto bring the two systems to regimes at which the partition functions are equal. Althoughthis is easily satisfied for systems that have the same β → limit, we wanted the conditionto be satisfied for all systems that have the same degrees of freedom. We introduced acutoff energy in the repulsion term which enabled the molecules to enter the “volume” ofone another and as a result the condition could be satisfied. This use cutoff energy canalso be applied to other methods such as ThI and enable the calculation of free energy inthe cases where the sampling of the integrand isn’t feasible.Regarding the optimization of software written in c/c++ languages, though the rangeof possible actions is huge, if the bottlenecks are identified and the relevant actions areperformed (as listed in 2.6.2), together with the use of the proper compiler, and thecapabilities of vectorization of the processor, a tremendous improvement in the runningtime can be achieved in a relatively short time. As reported by [30], the backbone bond angles defined in our model are in the range of ◦ − ◦ . The P-C-N bond angle also showed small standard deviation by PDB analysis.In order to take it into account we can add the following energy terms: H bond angle = P i kT (cid:20)(cid:16) θ PCPi − θ PCPavg σ θP CP (cid:17) + (cid:16) θ CPCi − θ CPCavg σ θCPC (cid:17) + (cid:16) θ PCNi − θ PCNavg σ θPCN (cid:17) (cid:21) Alternatively, we can switch to full atomistic modelling of the backbone and the ribose.As will become clear later, this modelling will include 7 degrees of freedom per nucleotidefor the seven torsion angles. A CONROT (concerted rotation) Monte-Carlo move can beused, in which 7 consecutive torsion angles are changed keeping the bond angles, distancesand the rest of the configuration fixed [31, 32]. As the backbone torsion angle δ is directlycorrelated to the ribose dihedral angle ν which determines the ribose configuration [33],it seems that the ribose doesn’t introduce a degree of freedom [15] other than the torsionangle χ that defines the orientation of the base with respect to the ribose. Since thefaces are assumed to be decoupled from one another, the configuration of the ribose andhence δ and the torsion angle χ have to be constant in the nucleotides that are sharedby two faces. Hence, the degrees of freedom that we have consist of the 7 torsion angles α, β, γ, δ, ε, ζ, χ for the nucleotides that are not shared between faces and 5 torsion angles α, β, γ, ε, ζ for the nucleotides that are shared between faces. A more natural choice ofdivision of the atoms that belong to a nucleotide would be C4’ to C4’ since the faces areassumed to be decoupled 43he ribose also has δ dependent energy landscape [33] which can be taken into account.Regarding the Monte Carlo moves, we can perform the following moves: • CONROT move in the backbone that won’t change δ of the unshared nucleotides. • Rotation of the χ torsion angle in the unshared nucleotides. • A move that will affect the location of the ribose and the base of the shared nu-cleotides and that will keep δ , χ , δ , χ constant. Such a move can be the fol-lowing: we can define virtual bond C ′ − C ′ that connects the two shared nu-cleotides. Since the bond and torsion angles between these atoms are assumed tobe constant, the length of this bond can be considered to be fixed. Since rotationaround the ε or γ torsion angle rotates the virtual bond, the virtual bond angles O ′ − C ′ − C ′ , C ′ − C ′ − C ′ can be considered fixed. Now, since the require-ments for the CONROT move are satisfied (fixed bond length and bond angle), the C ′ − C ′ can be considered as a bond in a CONROT move that includes twomore atoms and involves the changing the location of the ribose and the base ofthe shared nucleotides. This use of virtual bonds in the CONROT move is rathergeneral and can be used in other places.44 Appendixes The software reads the string of ’(’, ’.’ and ’)’ characters from left to right and pro-cesses the information iteratively. In each iteration, the software runs over the ’(’ and ’.’characters until it reaches a ’)’ character. In this process it saves the indexes of the ’(’characters in the “bases to be paired” list. Then it runs over the ’)’ characters and pairsthem with the bases from the list (saves them in the “paired bases” list) until it reachesa ’(’ character. At this point we have a list of “paired bases” and a list of “bases to bepaired” . Here we stop to define a process in which a son is generated and all the pairedbases are moved to it (which will be called “process” from now on). We also mentionhere that each structure has, as one of its fields, the number of unpaired bases after theprocess (The initial value for this field is -1). If, at this point of the iteration, there aremore unpaired bases in the list, a process is done - a son is generated, and the pairs arestored in it (we’ll call this condition 1). If, it continued the pairing task with the father (ifduring the iteration, there are more ’)’ characters after all the bases have been paired, itwill continue the pairing task with the father ( ∗ ) ) , and the father’s substructure has lessunpaired bases, than it had before, it means that it detected a multiloop (called condition2). In that case, a son which is a multiloop is created and all the structure’s sons aremoved to it. If a son was generated, in the next iteration, if a ’(’ is detected first, then anew son is generated, and the iteration is processed with this son ( ∗∗ ) .For example, for the following string: ( .... ( .... ) .... ( ... ( ... ) .... ( .... ) ... ) .... ( .... ) .... ) A..B...B ′ ...C..D.D ′ ..E..E ′ ..C ′ ..F..F ′ ..A ′ (Where the letters specify locations in the string).Using the algorithm specified above we’ll have the following steps:45teration 1 2 3 4startingpoint A C E F endingpoint C E F A ′ conditionsfulfilledduring theiteration ( ∗ ) , ( ∗∗ ) ( ∗ ) , ( ∗∗ ) list of pairsafter theiteration BB ′ DD ′ EE ′ list ofbases to bepairedafter theiteration A A, C A conditionfulfilledafter theiteration 1 1 2 2Where * is used the specify the structure that the algorithm begins the iteration with.The software can give as an output pairs, faces and nucleotide’s sequences for eachsubstructure. For the first structure, it runs over all faces, and for each face, it looks for identical onesin the second structure. If it finds an identical face, it updates the faces’ field similaritylevel to identical. It then removes it from the list of faces of the first and second structureand adds it to lists of similar faces, in order (identical face is also similar).Then it runs over the nonidentical faces in the first structure and for each face, itlooks for similar faces in the second structure. In case it finds a similar face, it updatesthe faces’ field similarity level to similar. It then removes them from the list of faces ofthe first and second structure and adds it to the new lists. Then it generates one vectorfor the first structure and one vector for the second structure that include the differentfaces. It then sorts them according to the face type (hairpin loop, stacking region bulge,interior loop or multiloop). Thus it enables us to see which faces are preferred over whichfaces in the two structures. 46 .3 Experimental calculation of free energy Say we have two strands of RNA that can bind in the following reaction: [ A ] + [ B ] → [ AB ] (30)where [ A ] , [ B ] denote the concentrations of the single strands and [ AB ] denotes theconcentration of their bound product.The free energy difference is: △ G = − RT lnK (31)Where the dissociation constant K is defined as follows: K , [ AB ][ A ] [ B ] (32)At T m , by definition, the concentration of the reactors and the products is equal.Say we take an initial concentration of the reactors of [ A ] = [ B ] = X , and zero initialconcentration of the product [ AB ] = 0 .According to the reaction, it can be seen that at T m (where T m denotes meltingtemperature) we have: [ A ] = [ B ] = [ AB ] = X .Substituting it in (31), we get: △ G T m = − RT lnK T m = RT ln (cid:18) X (cid:19) (33)We can use: △ G T m = △ H ( T m ) − T m △ S ( T m ) (34)and get: T m = RH ( T m ) ln (cid:18) X (cid:19) + △ S ( T m ) △ H ( T m ) (35)Now each time the initial concentration of the reactors is changed and T m is measured.Thus, a graph of T m as a function of ln (cid:0) X (cid:1) can be generated.From this graph △ H ( T m ) , △ S ( T m ) can be extracted.It is here to mention that the product consists of many faces and the values of theenthalpy and entropy are assigned by fitting results to the faces from many experiments.47 .3.2 Finding enthalpy and entropy values at body temperature Even if we assume that the former experiments are performed in a small range of temper-atures, in which the enthalpy and entropy are constant, it can’t be assumed that thesetemperatures are close to body temperature.Measuring the heat capacity as a function of temperature enables the calculation ofthe enthalpy and entropy values at body temperature. C p = (cid:18) dQdT (cid:19) P (36)The enthalpy and entropy can be expressed as functionals dependent on the functionof heat capacity over temperature.Since the process is quasistatic: dQ = T dS , and we can write: (cid:18) dSdT (cid:19) P = 1 T C P (37)Regarding the enthalpy: H ( S, P, N ) since it’s a function of S, P and N and P, N are held constant we canwrite the derivative of the enthalpy with respect to the temperature as follows: (cid:18) ∂H∂T (cid:19) P,N = ∂H∂S ∂S∂T = T T C P = C p (38)So we can write: △ H ( T body ) = △ H ( T m ) + Z T m T body C p dT (39) △ S ( T body ) = △ S ( T m ) + Z T m T body T C p dT (40)And In conclusion : △ H ( T m ) , △ S ( T m ) , C p ( T ) ⇒ △ H ( T body ) , △ S ( T body ) However, in many calculations, dependency of the heat capacity on the temperaturewas neglected, so the experiments are to be redone. According to the detailed balance criteria, it can be written for the canonical ensemble: P ( i → j ) P ( j → i ) = P ( j ) P ( i ) = e − βH j e − βH i = e − β ( H j − H i ) = (41) min (cid:8) , e − β ( H j − H i ) (cid:9) min (cid:8) , e + β ( H j − H i ) (cid:9) (42)48o for the choice of P ( i → j ) = min n , e − β ( H j − H i ) o , P ( j → i ) = min n , e + β ( H j − H i ) o (43)detailed balance is fulfilled. Thermodynamic integration is a method to calculate free energy differences between twosystems (in our case they can differ by of a base).In the method a parameter that transforms the first system to the second one isintroduced.We can define A as the first system, and B as the second system.We can write the free energy difference as follows: △ F A → B ( β ) = Z dF λ dλ dλ = (44)Where λ is a general parameter. We can substitute F = − kT lnZ and get: − kT Z dlnZdλ dλ = − kT Z Z dZdλ dλ = Substituting Z = R e − βH d Ω we get: − kT Z Z R e − β H d Ω dλ dλ = kT Z Z Z e − β H dHdλ d Ω β dλ = Z R e − β H dHdλ d Ω Z dλ = Z (cid:28) dHdλ (cid:29) dλ = For a choice of: 49 = λH B + (1 − λ ) H A (45)We have for dHdλ = H B − H A And it can be written: △ F A → B ( β ) = Z h H B − H A i dλ (46)Or in a more explicit form: △ F A → B ( β ) = Z (cid:8)R ( H B − H A ) e − β [ λH B +(1 − λ ) H A ] d Ω (cid:9)R e − β [ λH A +(1 − λ ) H B ] d Ω dλ In order to perform the integration, the following temperatures were chosen: • range 1: 310 320 329 340 350 359 370 380 390 400 410 420 429 440 470 • range 2: 500 580 670 780 890 1000 • range 3: 1000 2285 3571 4857 6142 7428 8714 • range 4: 10000 22857 35714 48571 61428 74285 87142 • range 5: 100000 200000 400000 1000000 10000000 We can estimate the number of steps it takes for two configurations to be independent, bythe average number of steps it takes for a replica of the system in a given temperature toexchange configuration with a replica of the system in an adjacent temperature, multipliedby the square of the number of temperatures. Since the average number of steps fora configuration exchange is the number of steps after which we attempt to exchange,divided the acceptance rate, we can estimate the number of independent measurementsin the following way: n = MC moves(steps for a configuration exchange attempt / acceptancerate) · N For a temperature in the first temperature range we can write: n = · · . · = 879 (where n is the number of independent measurements).50ince the standard deviation of the average of n independent variables with a standarddeviation σ i is: σ A = σ i √ n , we can use it to estimate the standard deviation of the internal energy.For example for the first temperature, we can substitute the standard deviation of theenergy from the simulation and get: σ T = σ L √ n = . . = 0 . 14 [ kCal/M ole ] In the free energy calculation, we integrate numerically the average energy over functionof beta: β hR β h H X i d ˜ β i It is done by estimating the area between two points as the average energy multipliedby the difference in β . F X = β hR β h H X i d ˜ β i = β P i ( h H X ( β i ) i + h H X ( β i +1 ) i ) / · ∆ β i The standard deviation and the error range can be calculated as follows: σ F x = β s(cid:18)P i σ h H X ( β i ) i + σ h H X ( β i +1 ) i (cid:19) · ∆ β i = 0 . kCal/M ole ⇒ error range ∼ σ F x = 0 . kCal/M ole eferences [1] M. Zuker and P. Stiegler. Optimal computer folding of large RNA sequences usingthermodynamics and auxiliary information. Nucleic acids research , 9(1):133, 1981.[2] K. Fejes-Toth, V. Sotirova, R. Sachidanandam, G. Assaf, G.J. Hannon, P. Kapranov,S. Foissac, A.T. Willingham, R. Duttagupta, E. Dumais, et al. Post-transcriptionalprocessing generates a diversity of 5?-modified long and short RNAs. Nature ,457(7232):1028–1032, 2009.[3] G.V. Kryukov, S. Castellano, S.V. Novoselov, A.V. Lobanov, O. Zehtab, R. Guigo,and V.N. Gladyshev. Characterization of mammalian selenoproteomes. Science ,300(5624):1439, 2003.[4] D. Fagegaltier, A. Lescure, R. Walczak, P. Carbon, and A. Krol. Structural analysisof new local features in SECIS RNA hairpins. Nucleic Acids Research , 28(14):2679,2000.[5] Y. Ito, Y. Sone, and T. Mizutani. Stability of non-Watson-Crick GA/AG base pair insynthetic DNA and RNA oligonucleotides. Molecular Biology Reports , 31(1):31–36,2004.[6] Y. Ito and T. Mizutani. Evidence of GA/AG pair, not AG/GA pair, in SECISelement. In Nucleic Acids Symposium Series , volume 1, page 41. Oxford Univ Press,2001.[7] S.E. Morse and D.E. Draper. Purine–purine mismatches in RNA helices: evidencefor protonated G · A pairs and next-nearest neighbor effects. Nucleic acids research ,23(2):302, 1995.[8] Clark C. Barciszewski J, Frederic B. RNA biochemistry and biotechnology. , pages73–87, 1999.[9] C. Schneider, M. Brandl, M. Meyer, and J. S"uhnel. Water-Mediated Uracil-Cytosine Base Pairs in RNA-A ComputationalStudy.[10] J.G. Kirkwood. Statistical mechanics of fluid mixtures. The Journal of ChemicalPhysics , 3:300, 1935.[11] TP Straatsma and JA McCammon. Multiconfiguration thermodynamic integration. The Journal of chemical physics , 95:1175, 1991.[12] D. Frenkel and B. Smit. Understanding molecular simulation: from algorithms toapplications . Academic Press, Inc. Orlando, FL, USA, 1996.[13] W.K. Olson. Configurational statistics of polynucleotide chains. A single virtualbond treatment. Macromolecules , 8(3):272–275, 1975.5314] K.S. Keating and A.M. Pyle. Semiautomated model building for RNA crystallogra-phy using a directed rotameric approach. Proceedings of the National Academy ofSciences , 107(18):8177, 2010.[15] L.J.W. Murray, W.B. Arendall, D.C. Richardson, and J.S. Richardson. RNA back-bone is rotameric. Proceedings of the National Academy of Sciences of the UnitedStates of America , 100(24):13905, 2003.[16] Wolfarm Saenger. Principles of Nucleic Acid Structure. page 125, 1984.[17] N.B. LEONTIS and E. Westhof. Geometric nomenclature and classification of RNAbase pairs. Rna , 7(04):499–512, 2001.[18] C. Smith P.J Struther Arnott and R. Chandrasekaran. Handbook of biochemistryand Molecular Biology. pages 411–422, 1976.[19] A. Rich, D.R. Davies, FHC Crick, and JD Watson. The molecular structure ofpolyadenylic acid. Journal of molecular biology , 3(1):71–86, 1961.[20] Wolfarm Saenger. Principles of Nucleic Acid Structure. page 40, 1984.[21] Arthur J. Olson. Van der waals potential energy.[22] Wolfarm Saenger. Principles of Nucleic Acid Structure. page 139, 1984.[23] A. Vedani. YETI: An interactive molecular mechanics program for small-moleculeprotein complexes. Journal of Computational Chemistry , 9(3):269–280, 1988.[24] T.A. Knotts IV, N. Rathore, D.C. Schwartz, and J.J. De Pablo. A coarse grain modelfor DNA. The Journal of chemical physics , 126:084901, 2007.[25] J. Parsons, J.B. Holmes, J.M. Rojas, J. Tsai, and C.E.M. Strauss. Practical conver-sion from torsion space to Cartesian space for in silico protein synthesis. Journal ofcomputational chemistry , 26(10):1063–1068, 2005.[26] U.H.E. Hansmann. Parallel tempering algorithm for conformational studies of bio-logical molecules. Chemical Physics Letters , 281(1-3):140–150, 1997.[27] D.J. Earl and M.W. Deem. Parallel tempering: Theory, applications, and new per-spectives. Physical Chemistry Chemical Physics , 7(23):3910–3916, 2005.[28] William Kahan. Further remarks on reducing truncation errors. Communications ofthe ACM , 8(1):40,48, 1965.[29] Pete Isensee. C++ optimization strategies and techniques.[30] R. Malathi and N. Yathindra. Virtual bond probe to study ordered and randomcoil conformations of nucleic acids. International Journal of Quantum Chemistry ,20(1):241–257, 1981. 5431] LR Dodd, TD Boone, and DN Theodorou. A concerted rotation algorithm foratomistic Monte Carlo simulation of polymer melts and glasses. Molecular Physics ,78(4):961–996, 1993.[32] J.P. Ulmschneider and W.L. Jorgensen. Monte Carlo backbone sampling for nu-cleic acids using concerted rotations including variable bond angles.