BPPart and BPMax: RNA-RNA Interaction Partition Function and Structure Prediction for the Base Pair Counting Model
Ali Ebrahimpour-Boroojeny, Sanjay Rajopadhye, Hamidreza Chitsaz
BBPPart and BPMax: RNA-RNA Interaction PartitionFunction and Structure Prediction for the Base PairCounting Model
Ali Ebrahimpour-Boroojeny Sanjay RajopadhyeHamidreza Chitsaz ∗ Department of Computer Science, Colorado State University
Abstract
RNA-RNA interaction (RRI) is ubiquitous and has complex roles in the cellular functions.In human health studies, miRNA-target and lncRNAs are among an elite class of RRIs that havebeen extensively studied and shown to play significant roles in various diseases including cancer.Bacterial ncRNA-target and RNA interference are other classes of RRIs that have receivedsignificant attention. Accordingly, RRI bioinformatics tools tailored for those elite classes havebeen proposed in the last decade.Interestingly, there are instances of mRNA-mRNA interactions where both partners appearin the same KEGG pathway without any direct link between them, or any prior knowledge inthe literature about their relationship. Those recently discovered cases suggest that RRI scope ismuch wider than those aforementioned elite classes. Hence, there is a need for high-throughput generic
RNA-RNA interaction bioinformatics tools.In this paper, we revisit our RNA-RNA interaction partition function algorithm, piRNA ,which happens to be the most comprehensive, and albeit the most computationally-intensive,thermodynamic model for RNA-RNA interaction. piRNA computes the partition function, base-pairing probabilities, and structure for the comprehensive Turner energy model using 96 differentdynamic programming tables. In this study, we embark on a journey to retreat from sophisti-cated thermodynamic models to much simpler models such as base pair counting. That mightseem counter-intuitive at the first glance; however, our idea is to benefit from the advantagesof such simple models in terms of implementation, tuning, running time, and memory footprintand compensate for the associated information loss by adding data-oriented machine learningcomponents in the future to the pipeline.In this work, we simplify the energy model and instead consider only simple weighted basepair counting to obtain
BPPart algorithm for Base-pair Partition function and
BPMax for Base-pair Maximization, which use 9 and 2 tables respectively. They are empirically 225 and 1350fold faster than piRNA due to reduction in the number of table look-ups and the fact that the 96tables of piRNA make the CPU cache much less effective. After a search-based optimization of theweights of base pairs, a correlation of 0.855 and 0.836 was achieved between piRNA and
BPPart and between piRNA and
BPMax , respectively, in 37 ◦ C on 50,500 experimentally characterizedRRIs. This correlation increases to 0.920 and 0.904 for those algorithms, respectively, in − ◦ C due to a decrease in the effect of thermodynamic entropy in lower temperatures.The results show that simplifying the model does not result in a noticeable loss of thethermodynamic information that piRNA captures. Therefore, the proposed algorithms can be ∗ Corresponding Author. Email: [email protected] a r X i v : . [ q - b i o . B M ] A p r sed along with machine learning methods in a strategic retreat from slower comprehensivephysical models such as piRNA . Since mid 1990s with the advent of RNA interference discovery, RNA-RNA interaction (RRI) hasmoved to the spotlight in modern, post-genome biology. RRI is ubiquitous and has increasinglycomplex roles in cellular functions. In human health studies, miRNA-target and lncRNAs areamong an elite class of RRIs that have been extensively studied and shown to play significantroles in various diseases including cancer. Bacterial ncRNA-target and RNA interference are otherclasses of RRIs that have received significant attention. However, new evidence suggests that otherclasses of RRI, such as mRNA-mRNA interactions, are biologically important.The RISE database [1] reports a number of biologically significant instances of mRNA-mRNAinteractions. These representative mRNA-mRNA interactions suggest that general RRIs, includ-ing mRNA-mRNA interactions, play major roles in human biology. Hence, there is a need forhigh-throughput generic
RNA-RNA interaction bioinformatics tools for all types of RNAs. As anexample of this necessity for all types of RNAs, we found 3 cliques of size 4 of interacting protein-coding RNAs in ribosome which conform to what we generally expect from the structure of theribosome. These cliques are highly entangled together to form an interaction graph as Figure 1.RPS3 which seems to be one of the genes with the highest number of connections interacts withat least 14 other genes in ribosome pathway. Another interesting clique of size 4 that we couldfind consists of 4 genes in the pathway of regulation of actin cytoskeleton, ACTB, ACTG1, PFN1,and TMSB4X. These genes are involved in vital tasks of proliferation, migration, mobility, anddifferentiation of the cell. Being able to capture all the interactions that RNAs might have helpsus to better understand the post-transcriptional regulation of the genes.Figure 1: A substructure of the genes in the ribosome pathway. Each node represents a gene andeach edge represents an experimentally observed interaction between the corresponding genes.In this paper, we revisit our RNA-RNA interaction partition function algorithm, piRNA , whichhappens to be the most comprehensive, albeit the most computationally intensive, thermodynamic2odel for RNA-RNA interaction [2]. piRNA is a dynamic programming algorithm that computes thepartition function, base-pairing probabilities, and structure for the comprehensive Turner energymodel in O ( n m + n m ) time and O ( n + m ) space. Due to intricacies of the energy model,including various loops such as hairpin loop, bulge/internal loop, and multibranch loop, piRNA involves 96 different dynamic programming tables and needs multiple table look-ups for computingtheir values.In this paper, we introduce a strategic retreat from the slower comprehensive models such as piRNA by simplifying the energy model and instead considering only simple weighted base paircounting to obtain BPPart algorithm for Base-pair Partition function and
BPMax for Base-pairMaximization, which are much faster. By the explosion of experimental data which makes us ableto use machine learning methods, such as deep learning, for detection of RNA subsequences thatinteract, this retreat is necessary if one is willing to build physics-guided models by using thefeatures that are derived by an energy model.
BPPart involves 10 dynamic programming tables,and
BPMax involves only 2 tables. Both
BPPart and
BPMax compared with piRNA are much simplerdynamic programming algorithms which are more than 225 fold and 1300 fold faster, respectively,on the 50500 RRI samples we used for our experiments. The reason for this noticeable speed-up isreducing the number of tables and the number of table look-ups for computing the new values andalso the fact that the 96 large tables of piRNA reduces the efficiency of using the cache. Moreover,from the point of view of code optimization and development/debugging for different hardwareplatforms, it is much more convenient to work with
BPPart and
BPMax because of the significantlyreduced memory footprint, and this provides room for further optimization of these methods in thefuture.The first question is, how much accuracy do we lose by simplifying the scoring model from thecomprehensive Turner model to simply weighted base pair counting? We answer that question bycomputing both the Pearson and rank correlations in different temperatures between the results of
BPPart , BPMax , and piRNA on 50,500 experimentally characterized RRIs in the RISE database [1].We find that the Pearson correlations between
BPPart and piRNA is 0.957 and
BPMax and piRNA is 0.941 at − ◦ C after optimizing the weights for base pairs. As the temperature increases, theeffect of entropy, which is not taken into account in the simple base pair counting model, increases.Completely conforming with the theoretical expectations, we find that the Pearson correlationsbetween BPPart and piRNA and also between
BPMax and piRNA is 0.883 at 37 ◦ C . We concludethat both BPPart and
BPMax capture a significant portion of the thermodynamic information thatcan possibly be complemented with machine learning techniques in the future for more accuratepredictions.
Related work
During the last few decades, several computational methods emerged to study the secondary struc-ture of single and interacting nucleic acid strands. Most use a thermodynamic model such as thewell-known Nearest Neighbor Thermodynamic model [3, 4, 5, 2, 6, 7, 8, 9, 10, 11]. Some pre-vious attempts to analyze the thermodynamics of multiple interacting strands concatenate inputsequences in some order and consider them as a single strand [12, 13, 14]. Alternatively, severalmethods avoid internal base-pairing in either strand and compute the minimum free energy sec-ondary structure for their hybridization under this constraint [15, 16, 17]. The most comprehensivesolution is computing the joint structure between two interacting strands under energy models witha growing complexity [18, 19, 20, 21, 22, 2, 23]. 3ther methods predict the secondary structure of individual RNA independently, and predictthe (most likely) hybridization between the unpaired regions of the two interacting molecules as amultistep process: 1) unfolding of the two molecules to expose bases needed for hybridization, 2) thehybridization at the binding site, and 3) restructuring of the complex to a new minimum free energyconformation [24, 25, 26, 27]. The success of such methods, including our biRNA algorithm [27],suggests that the thermodynamic information vested in subsequences and pairs of subsequences ofthe input RNAs can provide valuable information for predicting features of the entire interaction.In addition to general RNA-RNA interaction tools, many tools have been developed to predictthe secondary structure of interacting RNAs for a specific type of interest which has been shown tobe more effective in some cases due to the utilization of certain properties belonging to that type.As mentioned earlier, miRNA-target prediction is one such class of high interest for which suchspecialized tools have been created to incorporate various properties specific to miRNAs; some ofthese tools use the seed region of a miRNA which is highly conserved [28, 29, 30, 31], some considerthe free energy to compute accessibility to the binding site in 3’ UTR [32, 20, 29], some utilize theconservation level which is derived using the phylogenetic distance [33, 34, 35, 36, 28, 29], and someothers consider other target sites as well, such as the 5’ UTR, Open Reading Frames (ORF), andthe coding sequence (CDS) for mRNAs [37, 38, 39, 40].There are also several other tools developed for other specific types of RNA; IntaRNA [41, 42] isone such tool that although is used for RNA-RNA interaction in general, it is primarily designed forpredicting target sites of non-coding RNAs (ncRNAs) on mRNAs. There are many other examples,such as PLEXY [43] which is a tool designed for C/D snoRNAs, RNAsnoop [44] that is designedfor H/ACA snoRNAs, TargetRNA [45] which is a tool aimed at predicting interaction of bacterialsRNAs [46].
Here we describe how our algorithm,
BPPart , utilizes a dynamic programming approach to computethe partition function for RNA-RNA interaction when entropy is ignored and only a weighted scorefor pairing different nucleotides is considered. This algorithm guarantees to be mutualy exclusiveon the set of structures; in other words, it counts each structure exactly once.
In this paper, we mostly follow the notations and definition of the authors of piRNA [2]. We denotethe two nucleic acid strands by R and S . Strand R is indexed from 1 to L R , and S is indexedfrom 1 to L S both in 5 (cid:48) to 3 (cid:48) direction. Note that the two strands interact in opposite directions,e.g. R in 5 (cid:48) → (cid:48) with S in 3 (cid:48) ← (cid:48) direction; however, we consider the reverse of S in the figuresand equations for the sake of easier illustration and convenience. Each nucleotide is paired with atmost one nucleotide in the same or the other strand. The subsequence from the i th nucleotide tothe j th nucleotide in a strand is denoted by [ i, j ].An intramolecular base pair between the nucleotides i and j in a strand is called an arc anddenoted by a bullet i • j . We represent the score of such arc by score ( i, j ). An intermolecular basepair between the nucleotides i and i is called a bond and denoted by a circle i ◦ i . We representthe score of such bond by iscore ( i , i ). An arc i • j covers a bond k ◦ k if i < k < j . Wecall i • j an interaction arc if there is a bond k ◦ k covered by i • j . We call a base on either4trand an event if it is either the end-point of a bond or an interaction arc.Assuming i < j , two bonds i ◦ i and j ◦ j are called crossing bonds if i > j . An interactionarc i • j in a strand subsumes a subsequence [ i , j ] in the other strand if for all bonds k ◦ k ,if i ≤ k ≤ j then i < k < j . In other words, none of the bases in [ i , j ] has a bond with abase outside the i • j arc. Two interaction arcs are equivalent if they subsume one another. Twointeraction arcs i • j and i • j are part of a zigzag , if neither i • j subsumes [ i , j ] nor i • j subsumes [ i , j ].In this work, we assume there are no pseudoknots in individual secondary structures of R and S , and also there are no crossing bonds and zigzags between R and S . These constraints are beingmade to make the problem a polynomial problem rather than an NP-hard one as the general caseof considering all possible structures [19].We denote the ensemble of unpseudoknotted structures of R and S by S ( R ) and S ( S ) respec-tively. The ensemble of unpseudoknotted, crossing-free, and zigzag-free joint interaction structuresin denoted by S I ( R , S ).For a given structure s ∈ S ( R ) ∪ S ( S ), let AU ( s ) denote the number of A-U base pairs in s .Similarly, CG ( s ) and GU ( s ) denote the number of C-G and G-U base pairs in s respectively. Wedefine bpcount( s ) = c GU ( s ) + c AU ( s ) + c CG ( s ) , (1)in which c ’s are constants. In this study, we try a range of values of these constants. The detailsand results of using these different values can be found in Section 3.For a given joint interaction structure s ∈ S I ( R , S ), let AU ( s ), CG ( s ), and GU ( s ) denote thenumber of corresponding intramolecular base pairs in s as defined above. Let AU I ( s ), CG I ( s ), and GU I ( s ) denote the number of corresponding intermolecular base pairs in s . We definebpcount I ( s ) = c (cid:48) GU I ( s ) + c (cid:48) AU I ( s ) + c (cid:48) CG I ( s ) (2)and bpcount( s ) = c GU ( s ) + c AU ( s ) + c CG ( s ) + bpcount I ( s ) , (3)in which c ’s and c (cid:48) ’s are the tunable weights for base pairs. In this paper, we solve two problems:1.
Base Pair Counting Partition Function: we give a dynamic programming algorithm
BPPart to compute the partition function Q ( R , S ) = (cid:88) s ∈S I ( R , S ) e bpcount( s ) , (4)2. Base Pair Maximization: we give a dynamic programming algorithm
BPMax to find thestructure that has the maximum weighted base pair count, i.e.bpmax( R , S ) = argmax s ∈S I ( R , S ) bpcount( s ) . (5)5his problem was previously studied by D. Pervouchine [18] in an algorithm called IRIS.However, there is no publicly available functional implementation of IRIS. Moreover, wedefine a novel interaction scoreinteraction-score( R , S ) = max (cid:8) bpcount I ( s ) | bpcount( s ) = bpcount (bpmax( R , S )) (cid:9) (6)and compute it by backtracing all possible optimal structures and selecting the one that hasmaximum interaction portion. First, we start with the recursions for computing the partition function on a single strand which isgoing to occur in many cases of the double-stranded version. Let represent the partition functionof the subsequence from the i th nucleotide to the j th one, inclusive, as Q i,j . As shown in Figure2, there are two mutually exclusive cases; either there is no arc (the left case) or there is a uniqueleftmost arc (the right case) which starts at the k th position. We show the structure that starts atthe k th base in the second case by Qz .The property of the Qz i,j is that it has to have at least one arc starting at its first nucleotide, i . Therefore, due to the assumption that no pairing is allowed between two bases that are lessthan 3 bases apart, for the subsequences of a length less than 5, the value of Qz is 0. Otherwise,assuming the first nucleotide is paired with the k th base, as Figure 3 shows, we can split the Qz i,j structure into a Q structure inside i • k and a segment after k , [ k + 1 , j ], which is a Q structureagain. Therefore, the value of Qz i,j can be computed using the equation 8.According to the explanation and corresponding recursion formulas, equations 7 and 8, we needtwo 2-dimensional tables for solving the base pair counting partition function on each strand. Inthe following equations, we distinguish these tables by using superscripts (1) and (2) for the firststrand (the one that appears at the top in the figures) and the second one (the one at the bottom)respectively. = j k i j i i j QzQ
Figure 2: For computing Q , notice that either there is no pairing or there is at least one arc whichstarts at some index k and results in a case of Qz . Q i,j = j < = i j − (cid:88) k = i Qz k,j otherwise (7)6 i j i j k + 1 k Q k +1 ,j Qz Q i +1 ,k − Figure 3: Computing Qz can be achieved by considering the base k that is paired with i and thetwo Q substructures it forms, one between i and k and one after k . Qz i,j = j − i < j (cid:88) k = i +4 Q i +1 ,k − × score ( i, k ) × Q k +1 ,j otherwise (8)Now, for the partition function of a pair of RNA sequences, we consider a 4-dimensional table QI in which QI i ,j ,i ,j is the value of base pair counting partition function for the subsequences[ i , j ] on R and [ i , j ] on S . As Figure 4 shows, we can split the set of all possible structures of QI into 3 mutually exclusive subsets. The leftmost case shows the structures in which there exist nobonds. Therefore the value of QI i ,j ,i ,j can be computed using the first case of equation 9. Theother two cases occur when there is at least one bond; so, there is at least one event on both R and S which we call k and k , respectively. In the second case, these left-most events are end-pointsof a bond; hence, this case can be broken into a bond-free section on the left side of k ◦ k , anda section called QIb which contains the bond itself and a general case of QI on the right side ofthe bond, ( QI k +1 ,j ,k +1 ,j ). Therefore, we do not need a separate table for QIb . The third caseoccurs when k and k are not end-points of a bond. We call this structure QIa .For computing
QIa i ,j ,i ,j , we have to consider the property of this structure that the left-most bases on both R and S have to be events, but they cannot both be the end-points of a bond.Therefore, either one or both of them have to be end-points of an interaction arc. For the casewhere both i and i are end-points of some interaction arcs i • k and i • k such that those arcsare equivalent, QIa splits to two exclusive substructures
QIe i ,k ,i ,k and QI k +1 ,j ,k +1 ,j where QIe is a structure in which first and last bases on each strand are paired and the two arcs areequivalent.In
QIe i ,j ,i ,j , if we remove the arcs i • j and i • j , we will get the general case of QI i +1 ,j − ,i +1 ,j − for the inner-section with the constraint that there has be at least one bondin that region because the assumption is that the extracted arcs where interaction arcs. To fulfillthis constraint we can exclude all the cases where no bond exists as shown in equation 10. Sincethis case can be reduced to a special case of QI , we do not need a separate table for that and wecan directly replace it in all other equation with the formula of equation 10.7 IaIbIi j i j k k k k Figure 4: Each case of QI structure (left side of the equation) can lead to 3 cases. It is clear thateither no bonds exist (leftmost case), or at least one bond exists. If the first event on both of thesequences is a bond we have the case QIb (the middle case) which is actually QI with one bond onthe left; if not, we will have a case of QIa (the rightmost case). QI i ,j ,i ,j = Q (1) i ,j × Q (2) i ,j j < i or j < i Q (1) i ,j × Q (2) i ,j + j (cid:88) k = i j (cid:88) k = i Q (1) i ,k − × Q (2) i ,k − × iscore ( k , k ) × QI k +1 ,j ,k +1 ,j + j (cid:88) k = i j (cid:88) k = i Q (1) i ,k − × Q (2) i ,k − × QIa k ,j ,k ,j otherwise (9) QIe i ,j ,i ,j = j < i + 4 or j < i + 4( QI i +1 ,j − ,i +1 ,j − − Q (1) i +1 ,j − × Q (2) i +1 ,j − ) × score ( i , j ) × score ( i , j ) otherwise (10)Now, for the other cases of QIa , we have to consider all the structures where either exactly oneof the left-most events on R and S are end-point of a bond or both of them are end-points of somearc where the arcs are not equivalent. Then, QIa can be split into one such structure and a generalcase of QI on its right side. The set of such structure can be split into two symmetric set of casesfor which we will explain the structures covering them.Let consider QIs (1) i ,j ,i ,j as the structure in which we have arc i • j and i is either the end-point of a bond with the other end at some k where i < k < j , or is the end-point of somearc i • k that does not subsume i • j and i < = k < j . The other constraint of this set ofstructures is that j is the right-most event on S that is subsumed by i • j . Also, let considerthe symmetric case of this structure as QIs (2) i ,j ,i ,j . Hence, all other cases of QIa i ,j ,i ,j can be8epresented as QIs (1) i ,k ,i ,k and QI k +1 ,j ,k +1 ,j or QIs (2) i ,k ,i ,k and QI k +1 ,j ,k +1 ,j . = I a I s (2) I II s (1) I e Ii j i j k k k k k k Figure 5: There are 3 cases for computing the
QIa structure; either the leftmost base of only oneof the strands is an end point of an arc or both of them.
QIa i ,j ,i ,j = j (cid:88) k = i j (cid:88) k = i QIac i ,k ,i ,k × QI k +1 ,j ,k +1 ,j (11) QIac i ,j ,i ,j = QIs (1) i ,j ,i ,j + QIs (2) i ,j ,i ,j + QI i ,j ,i ,j (12)Since QIs (1) and
QIs (2) are symmetric, here we only explain the computation of
QIs (1) . Forcomputing
QIs (1) i ,j ,i ,j , if we extract i • j , there remains the subsequence [ i + 1 , j −
1] on R which is known to have at least one event since i • j is an interaction arc. Let call the left-mostevent on [ i + 1 , j −
1] as k . Therefore, QIs (1) i ,j ,i ,j can be split into Q (1) i ,k − and QIaux (1) k ,j ,i ,j which is a structure with the property of having event on k , i , and j , and i • j is not allowed. = QI aux (1) QI s (1) i j j i j k j − i + 1 i Figure 6:
QIs (1) has one arc that can be extracted and the structure derived will have the propertythat the two end bases of the bottom strand cannot be paired (the new structure inherits thisproperty from
QIs (1) ). On the top strand, we consider the leftmost event. This new structure is
QIaux (1) ). 9 Is (1) i ,j ,i ,j = j < i + 4 or j < i Q (1) i ,j × Q (2) i ,j + j (cid:88) k = i j (cid:88) k = i Q (1) i ,k − × Q (2) i ,k − × iscore ( k , k ) × QI k +1 ,j ,k +1 ,j + j (cid:88) k = i j (cid:88) k = i Q (1) i ,k − × Q (2) i ,k − × QIa k ,j ,k ,j otherwise (13)To compute QIaux i ,j ,i ,j , by considering the right-most event on R , k , we have a Q (1) k +1 ,j structure on the right side of k on R and the remaining part is a structure in which all four cornersare events, and i • j is not allowed. If there is an arc from i to k , then we will have another QIs (1) structure; if not, we call the structure
QIm . The property of this new structure is that allthe corners are events, but neither the two corners on R nor the two ones on S form an arc withone another. = QI m QI aux (1) i k + 1 k + 1 i j j i j j k k QI s (1) i i i j j Figure 7: Two cases must be considered for the
QIaux (1) structure, in which the 2 end points ofthe bottom strand are events. For the top strand, only the leftmost end point is required to be anevent. It can either be the end point of an arc (rightmost case) or not (leftmost case).
QIaux (1) i ,j ,i ,j = j (cid:88) k = i QIs (1) i ,k ,i ,j × Q (1) k +1 ,j + j (cid:88) k = i QIm i ,k ,i ,j × Q (1) k +1 ,j (14)For computing QIm i ,j ,i ,j , we have to consider 3 mutual exclusive cases in Figure 8. The firstone shows the case where i ◦ i and j ◦ j and the remaining part will be QI i +1 ,j − ,i +1 ,j − . Inthe second case, i ◦ i , but j and j do not form a bond. Since j and j are both events but donot form a bond, we can form a QIac structure on the right side the current structure which startsat index k on R and index k on S . Therefore, we will end up with QI i +1 ,k − ,i +1 ,k − in themiddle. The third case is symmetric to the second case. For the fourth case, neither i and j nor i and j can form a bond. By extracting a QIac structure from the left starting at indices k on R and k on S , we will end up with a QIa i ,k − ,i ,k − structure on the left.10 QI a QI QI ac QI m i i j i j i j i j j i j i j i j i + 1 i + 1 j − j − i +1 k k j − j − k − i +1 k − QI Figure 8: For computing
QIm , since we know the four end points are events, but none of the twoend points in one strand can form an arc, we must consider the 3 different cases shown above.
QIm i ,j ,i ,j = QI i +1 ,j − ,i +1 ,j − × iscore ( i , i ) × iscore ( j , j )+ iscore ( i , i )+ j (cid:88) k = i +1 j (cid:88) k = i +1 QI i +1 ,k − ,i +1 ,k − × QIac k ,j ,k ,j + j (cid:88) k = i +1 j (cid:88) k = i +1 QIa i ,k − ,i ,k − × QIac k ,j ,k ,j + iscore ( j , j ) × QIa i ,j − ,i ,j − i < j & i < j iscore ( i , i ) i = j & i = j otherwise (15) Here, we explain BPMax algorithm which is the first implementation (as far as we know) of thebase pair counting method explained in [18], with some small tweaks to emphasize the interactionbetween two RNAs by letting the score of bonds to be different from that of arcs. It also generatesnormalized interaction score so that it becomes independent from the length of the two interactingsequences which can directly affect the number of pairings otherwise. In the rest of this section,we explain the algorithm implemented.Here, we use the same notation as before. In addition, for a single strand of nucleotides, S i,j represents the maximum number of base pairs that we can have on subsequence [ i, j ]. For eachstrand we need to make such table; to distinguish these tables from one another, we will usesuperscripts (1) and (2) for R and S strand respectively. F i ,j ,i ,j represents the the maximumnumber of pairings (considering both intra- and inter-pairings) on subsequences [ i , j ] and [ i , j ]from R and S respectively. 11o compute S i,j , since the pairing of two bases that are less than 3 nucleotides apart are notallowed, the value of S for sequences of length less than 5 is considered as 0. Otherwise, therecursion in the second case of equation 16 can be utilized. It considers the case where we have arc i • j and recurs on [ i + 1 , j − i th and j th bases are not paired andthe [ i, j ] is split into two smaller subsequences. S i,j = max( S i +1 ,j − + score ( i, j ) , j − max k = i S i,k + S k +1 ,j ) (16)Now, to compute F i ,j ,i ,j , as you can see in Figure 9 and (17), conceptually similar to whatwe had for S , 3 cases have to be considered: i • j and F i +1 ,j − ,i ,j , arc i • j and F i ,j ,i +1 ,j − ,or none of these arcs and two smaller cases of F i ,k ,i ,k and F k +1 ,j ,k +1 ,j . = F F F SSi j i j i i j j i j i j i j j i k +1 k +1 k k i +1 i +1 j − j − i i j j Figure 9: All the 4 cases that have to be considered to compute table F . Note that in BPMax algorithm the cases do not have to be mutually exclusive. F i ,j ,i ,j = S (2) i ,j j < i S (1) i ,j j < i iscore ( i , i ) i = j and i = j max( F i +1 ,j − ,i ,j + score ( i , j ) ,F i ,j ,i +1 ,j − + score ( i , j ) , j max k = i j max k = i ( F i ,k ,i ,k + F k +1 ,j ,k +1 ,j )) otherwise (17) To investigate to what extent the scores of
BPPart and
BPMax are correlated with that of piRNA ,we used the RISE database which combines the information about interacting RNAs from multipleexperiments. For human dataset, we extracted all the interaction windows for those pairs thathave that data and eliminated the ones that contained a window with length less than 15 becausethey are too short to provide us with an unbiased comparison. Then, the remaining pairs weresorted based on the product of the lengths of the interacting windows. Finally, the first 50500 pairs12f sequences were chosen as our dataset for different experiments and analysis. Figure 10 showsthe distribution of the lengths of the sequences present in our dataset and also the product of thelengths of the RNA subsequences in each pair.First, piRNA was ran on our dataset at 8 different temperatures, 37, 25, 13, 0, − − − −
180 degrees celcius.
BPPart and
BPMax were ran on the dataset with different weights foreach base-pair combination. In general, we want to use the stack energies of the base pairs in theTurner model to tune their weights. We considered a fixed weight of 3 for CG (and GC ). Using theexperimentally computed stack energies of the Turner model, a minimum and maximum value forthe weights of AU and GU were computed. As an example, to compute the maximum weight of AU (and UA ), we consider the maximum released energy when AU (or UA ) is stacked with another pair;this happens when UA is stacked with CG and 2 . kcal/mol energy is released. Then, we consider theminimum value of released energy in an stack for CG or GC (for which we assumed a constant weightof 3), which is 1 . kcal/mol . By multiplying 2 . . the maximum weight of AU and UA , whichis 5 . AU and UA contains this maximum value (we chose 5 . AU and UA , their minimum stack energy is considered which is 0 . kcal/mol .Now, given the maximum energy of CG , which is 3 . kcal/mol , the value of interest is computed as0 . × . = 0 . − . − BPPart and
BPMax , respectively.Finally, for all the combinations of weights of AU and GU , in steps of 0 .
5, the Pearson andSpearman’s Rank correlations with the scores from piRNA at different temperatures were computed.When computing the correlations, we divide the scores from all algorithms by the sum of the lengthsof corresponding sequences to normalize them. This normalization mitigates the effect of lengthon the computed correlations. This step is necessary because, generally, as the length of the pairof sequences increases the scores of all three algorithms increases, and if unnormalized scores areused, a biased higher correlation will be derived. Notice that for the scores of partition functions, piRNA and
BPPart , we used the log of the scores; that is why we factor out the sum of the lengthsfor normalization. If the original values were used, we had to divide the scores by exp( L R + L S ).Figures 11 and 12 show the final correlation values. The optimum value of correlation for eachtemperature is presented in Tables 1 and 2. Figure 13 shows the scatter plots of the scores of BPPart and piRNA at 37 ◦ C and − ◦ C . The red line shows the regression line that is fitted to thepoints by minimizing the mean squared error (MSE). These plots conform with our expectationsand findings that there is a high Pearson and Spearman’s Rank correlation between the two andthese correlations are higher for lower temperatures.As the tables show, there is a high correlation between piRNA and BPPart as well as between piRNA and
BPMax , especially when the temperature decreases which is due to a decrease in the roleof thermodynamic entropy in the lower temperatures. Also, the Pearson and Spearman’s Rankcorrelation between
BPPart and
BPMax were computed with their optimum weights at 37 ◦ C andvalues 0.971 and 0.968 were derived, respectively. It is evident that the correlations for BPPart and
BPMax are very high which is expected because of the similar nature of them that is being basedon the principle of Minimizing Free Energy (MFE).Finally, to understand better the behavior of the surface around the higher values in the cor-relations plots of Figures 11 and 12, the Shannon entropy for the values above a threshold wascomputed. Figure 15 shows the value of the Shannon entropy for the top 30 values of Pearson andSpearman’s Rank correlation at each temperature.13igure 10: Distribution of the lenghts of all the RNA subsequences used in our experiment (left)and distribution of the product of the lengths of each pair in the samples (right)Table 1: Pearson correlation between piRNA and
BPPart and between piRNA and
BPMax in differenttemperaturesMethod \ T 37 25 13 0 -40 -80 -130 -180BPPart 0.855 0.862 0.869 0.877 0.896 0.908 0.916 0.920BPMax 0.836 0.846 0.855 0.864 0.884 0.895 0.901 0.904Table 2: Spearman rank correlation between piRNA and
BPPart and between piRNA and
BPMax indifferent temperaturesMethod \ T 37 25 13 0 -40 -80 -130 -180BPPart 0.864 0.867 0.871 0.876 0.889 0.896 0.901 0.901BPMax 0.830 0.835 0.841 0.847 0.862 0.871 0.877 0.877
The Gibbs free energy ∆ G = ∆ H − T ∆ S (18)is composed of a term ∆ H called enthalpy that does not depend on temperature and a term T ∆ S called entropy that linearly depends on temperature T . Intuitively, enthalpy is the chemical energythat is often released upon formation of chemical bonds such as base pairing. Entropy, on theother hand, captures the size of all possible spatial conformations for a fixed secondary structure.In other words, entropy captures the amount of 3D freedom of the molecule. A base pair bringsenthalpy down, hence favorable from enthalpy point of view, and decreases freedom (entropy),hence unfavorable from entropy point of view. These two opposing objectives are combined linearly14igure 11: Pearson correlation between piRNA and BPPart (vertical axis), at − ◦ C (left) and37 ◦ C (right), for different values of constant factors (weights) for AU (left axis) and GU (rightaxis). The weight of CG pair is fixed at 3.Figure 12: Pearson correlation between piRNA and BPMax (vertical axis), at − ◦ C (left) and 37 ◦ C (right), for different values of constant factors (weights) for AU (left axis) and GU (right axis).The weight of CG pair is fixed at 3.through the temperature coefficient.In the full thermodynamic model, we consider both terms. In the base pair counting, we consideronly a simplistic enthalpy term. Partition function for the full thermodynamic model is (cid:88) s ∈S I e − ∆ G/RT , (19)15igure 13: Scatter plots of the values of the corresponding axis for each sample (interaction windowsof a pair of RNAs from RISE dataset) in − ◦ C (left) and 37 ◦ C (right). In both of the plots, thered line is a straight regression line fitted to the points by minimizing MSE.Figure 14: Pearson correlation (left) and Spearman’s Rank correlation (right) between piRNA and BPPart and between piRNA and
BPMax at different temperatures.in which R is the gas constant. Note that − ∆ GT = − ∆ HT + ∆ S, (20)and as T → − ∆ H/T → ∞ and the contribution of ∆ S is diminished to 0 since it is finite.Hence in low temperatures, the effect of entropy becomes negligible, and we expect to see strongcorrelation between the base pair counting model and full thermodynamic model.16igure 15: Shannon entropy for the top 30 Pearson (left) and Spearman’s Rank (right) correlationvalues at different temperatures for BPPart and
BPMax .Figure 14 shows the Pearson correlations between
BPPart and
BPMax scores and that of piRNA for a a fixated combination of weights that results in the highest correlation at 37 ( ◦ C ). For BPPart the chosen weights are 0 .
5, 1 .
0, and 3 for AU , GU , and CG , respectively, while the correspondingweights for BPMax are 1 .
0, 1 .
5, and 3.Perfectly conforming with the theory, we see higher correlations at low temperatures. Thatsomewhat validates our implementations as piRNA was written totally independently about 10 yearsago. Moreover, as can be seen in Figures 11 and 12, the surface around the optimum value forhigher temperatures becomes flatter. Figure 15, which shows the entropy of the top 30 correlationvalues, confirms this observation. This means the correlation values are less sensitive to a change inthe weights of the base pairs as the temperature increases; this conforms with the theory because athigher temperatures, the thermodynamic entropy increases and the total score of piRNA becomesless sensitive to the energy released by pairings. It is worth mentioning that having less Shannonentropy for the top values at higher temperatures decreases the possibility of having universaloptimum values for the weights of the base pairs.Another noticeable characteristics of the plots 11 and 12 is the region in which the scores ofboth AU and GU are non-positive. This region for BPMax is flat because when both of these pairsare penalized (or not rewarded when their score is zero), the algorithm simply avoids making suchpairs because it is trying to maximize the score. Therefore, it only tries to maximize the numberof CG pairs, which is independent of the score (penalty in this case) of the other two types of basepairs. This also applies to the case where one of the base pairs has a non-positive score; in thatcase, BPMax works independently of the score of that base pair. So, as soon as any of the scoresbecomes zero or less than zero,
BPMAX remains constant along the corresponding axis. For
BPPart ,however, the story is different because it simply counts all the possible pairings and even if thescore of a base pair becomes negative, it does not ignore counting that.Moreover,
BPPart has a higher correlation than
BPMax does which comes with the price of a6 fold increase in computation time. Also, as Figure 15 shows, the Shannon entropy for the top30 values is less in
BPMax and the gap between them grows as temperature decreases; this showsthat
BPPart has a flatter region around the optimum value and its optimum value is less sensitive17o changes in the weights. Meanwhile, having a curvier surface in
BPMax which has less entropyincreases the possibility of having more stable and universal optimum values for the weights. Asmentioned earlier, the running time difference between the two is noticeable:
BPMax is about 6 foldfaster than
BPPart . Hence, we now have three choices in increasing order of computational cost:
BPMax , BPPart , and piRNA . Running time increases about 6 and 225 fold, respectively, from one tothe next.
We revisited the problems of partition function and structure prediction for interacting RNAs. Wesimplified the energy model and instead considered only simple weighted base pair counting toobtain
BPPart for the partition function and
BPMax for structure prediction. As a result,
BPPart runs about 225 fold and
BPMax runs about 1300 fold faster than piRNA does. Hence, we gainedsignificant speedup by potentially sacrificing accuracy.To evaluate practical accuracy of both new algorithms, we computed the Pearson and rankcorrelations in different temperatures between the results of
BPPart , BPMax , and piRNA on 50,500experimentally characterized RRIs in the RISE database [1].
BPPart and
BPMax results correlatewell with those of piRNA at low temperatures. At the room and body temperatures, there isconsiderable correlation and therefore, significant information in the results of
BPPart and
BPMax .We conclude that both
BPPart and
BPMax capture a significant portion of the thermodynamicinformation. Both tools can be used as filtering steps in more sophisticated RRI prediction pipelines.Also, the information captured by
BPPart and
BPMax can possibly be complemented with machinelearning techniques in the future for more accurate predictions. We now have three choices for RRIthermodynamics in increasing computational cost:
BPMax , BPPart , and piRNA . Depending on theapplication and the trade-off between time and accuracy, one can be chosen.
References [1] Gong, J. et al.
RISE: a database of RNA interactome from sequencing experiments.
NucleicAcids Res. (2017).[2] Chitsaz, H., Salari, R., Sahinalp, S. & Backofen, R. A partition function algorithm for interact-ing nucleic acid strands.
Bioinformatics , i365–i373 (2009). Also ISMB/ECCB proceedings.[3] Mathews, D., Sabina, J., Zuker, M. & Turner, D. Expanded sequence dependence of ther-modynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. ,911–940 (1999).[4] Cao, S. & Chen, S. Predicting RNA pseudoknot folding thermodynamics.
Nucleic Acids Res. , 2634–2652 (2006).[5] Dirks, R. M. & Pierce, N. A. A partition function algorithm for nucleic acid secondary structureincluding pseudoknots. Journal of Computational Chemistry , 1664–1677 (2003).[6] Nussinov, R., Piecznik, G., Grigg, J. R. & Kleitman, D. J. Algorithms for loop matchings. SIAM Journal on Applied Mathematics , 68–82 (1978).187] Waterman, M. S. & Smith, T. F. RNA secondary structure: A complete mathematical analysis. Math. Biosc. , 257–266 (1978).[8] Zuker, M. & Stiegler, P. Optimal computer folding of large RNA sequences using thermody-namics and auxiliary information. Nucleic Acids Research , 133–148 (1981).[9] Rivas, E. & Eddy, S. A dynamic programming algorithm for RNA structure prediction includ-ing pseudoknots. J. Mol. Biol. , 2053–2068 (1999).[10] McCaskill, J. The equilibrium partition function and base pair binding probabilities for RNAsecondary structure.
Biopolymers , 1105–1119 (1990).[11] Wenzel, A., Akba¸sli, E. & Gorodkin, J. Risearch: fast rna–rna interaction search using asimplified nearest-neighbor energy model. Bioinformatics , 2738–2746 (2012).[12] Andronescu, M., Zhang, Z. & Condon, A. Secondary structure prediction of interacting RNAmolecules. J. Mol. Biol. , 987–1001 (2005).[13] Bernhart, S. et al.
Partition function and base pairing probabilities of RNA heterodimers.
Algorithms Mol Biol , 3 (2006).[14] Dirks, R. M., Bois, J. S., Schaeffer, J. M., Winfree, E. & Pierce, N. A. Thermodynamic analysisof interacting nucleic acid strands. SIAM Review , 65–88 (2007).[15] Rehmsmeier, M., Steffen, P., Hochsmann, M. & Giegerich, R. Fast and effective prediction ofmicroRNA/target duplexes. RNA , 1507–1517 (2004).[16] Dimitrov, R. A. & Zuker, M. Prediction of hybridization and melting for double-strandednucleic acids. Biophysical Journal , 215–226 (2004).[17] Markham, N. & Zuker, M. UNAFold: software for nucleic acid folding and hybridization. Methods Mol. Biol. , 3–31 (2008).[18] Pervouchine, D. IRIS: intermolecular RNA interaction search.
Genome Inform , 92–101(2004).[19] Alkan, C., Karakoc, E., Nadeau, J. H., Sahinalp, S. C. & Zhang, K. RNA-RNA interactionprediction and antisense RNA target search. Journal of Computational Biology , 267–282(2006).[20] Lorenz, R. et al. Viennarna package 2.0.
Algorithms for Molecular Biology , 26 (2011).[21] DiChiacchio, L., Sloma, M. F. & Mathews, D. H. Accessfold: predicting rna–rna interactionswith consideration for competing self-structure. Bioinformatics , 1033–1039 (2015).[22] Kato, Y., Akutsu, T. & Seki, H. A grammatical approach to RNA-RNA interaction prediction. Pattern Recognition , 531–538 (2009).[23] Huang, F. W. D., Qin, J., Reidys, C. M. & Stadler, P. F. Partition function and base pairingprobabilities for RNA-RNA interaction prediction. Bioinformatics , 2646–2654 (2009).1924] M¨uckstein, U. et al. Thermodynamics of RNA-RNA binding.
Bioinformatics , 1177–1182(2006).[25] Walton, S., Stephanopoulos, G., Yarmush, M. & Roth, C. Thermodynamic and kinetic char-acterization of antisense oligodeoxynucleotide binding to a structured mRNA. Biophys. J. ,366–377 (2002).[26] Busch, A., Richter, A. S. & Backofen, R. IntaRNA: Efficient prediction of bacterial sRNAtargets incorporating target site accessibility and seed regions. Bioinformatics , 2849–2856(2008).[27] Chitsaz, H., Backofen, R. & Sahinalp, S. biRNA: Fast RNA-RNA binding sites prediction.In Salzberg, S. & Warnow, T. (eds.) Workshop on Algorithms in Bioinformatics (WABI) , vol.5724 of
LNBI , 25–36 (Springer-Verlag, Berlin, Heidelberg, 2009).[28] Krek, A. et al.
Combinatorial microrna target predictions.
Nature genetics , 495 (2005).[29] Kertesz, M., Iovino, N., Unnerstall, U., Gaul, U. & Segal, E. The role of site accessibility inmicrorna target recognition. Nature genetics , 1278 (2007).[30] Kr¨uger, J. & Rehmsmeier, M. Rnahybrid: microrna target prediction easy, fast and flexible. Nucleic acids research , W451–W454 (2006).[31] Zhang, Y. miru: an automated plant mirna target prediction server. Nucleic acids research , W701–W704 (2005).[32] Hofacker, I. L. et al. Fast folding and comparison of rna secondary structures.
Monatsheftef¨ur Chemie/Chemical Monthly , 167–188 (1994).[33] Nam, J.-W. et al.
Global analyses of the effect of different cellular contexts on micrornatargeting.
Molecular cell , 1031–1043 (2014).[34] Betel, D., Koppal, A., Agius, P., Sander, C. & Leslie, C. Comprehensive modeling of micrornatargets predicts functional non-conserved and non-canonical sites. Genome biology , R90(2010).[35] Reczko, M., Maragkakis, M., Alexiou, P., Grosse, I. & Hatzigeorgiou, A. G. Functional mi-crorna targets in protein coding sequences. Bioinformatics , 771–776 (2012).[36] Gaidatzis, D., van Nimwegen, E., Hausser, J. & Zavolan, M. Inference of mirna targets usingevolutionary conservation and pathway analysis. BMC bioinformatics , 69 (2007).[37] Riffo-Campos, ´A., Riquelme, I. & Brebi-Mieville, P. Tools for sequence-based mirna targetprediction: what to choose? International journal of molecular sciences , 1987 (2016).[38] Miranda, K. C. et al. A pattern-based method for the identification of microrna binding sitesand their corresponding heteroduplexes.
Cell , 1203–1217 (2006).[39] Hsu, J. B.-K. et al. mirtar: an integrated system for identifying mirna-target interactions inhuman.
BMC bioinformatics , 300 (2011).2040] Xu, W., San Lucas, A., Wang, Z. & Liu, Y. Identifying microrna targets in different generegions. BMC bioinformatics , S4 (2014).[41] Busch, A., Richter, A. S. & Backofen, R. Intarna: efficient prediction of bacterial srna targetsincorporating target site accessibility and seed regions. Bioinformatics , 2849–2856 (2008).[42] Mann, M., Wright, P. R. & Backofen, R. Intarna 2.0: enhanced and customizable predictionof rna–rna interactions. Nucleic acids research , W435–W439 (2017).[43] Kehr, S., Bartschat, S., Stadler, P. F. & Tafer, H. Plexy: efficient target prediction for boxc/d snornas. Bioinformatics , 279–280 (2010).[44] Tafer, H., Kehr, S., Hertel, J., Hofacker, I. L. & Stadler, P. F. Rnasnoop: efficient targetprediction for h/aca snornas. Bioinformatics , 610–616 (2010).[45] Tjaden, B. Targetrna: a tool for predicting targets of small rna action in bacteria. Nucleicacids research , W109–W113 (2008).[46] Umu, S. U. & Gardner, P. P. A comprehensive benchmark of rna–rna interaction predictiontools for all domains of life. Bioinformatics33