A Remark on Deriving Precise Upper Bounds of the Number of the RNA Secondary Structures
aa r X i v : . [ m a t h . C V ] J u l A Remark on Deriving Precise Upper Bounds ofthe Number of the RNA Secondary Structures
Alexander I. Kheyfits akheyfi[email protected]
Abstract.
An elementary, at the undergraduate level, derivation is given of preciseupper bounds of the number of the various RNA structures. The method workswhen the generating function has multiple singularities at the circle of convergence.Nor does the method require the Taylor coefficients to be positive. Examples fromthe current research literature are considered.October 24, 20182010 Mathematics Subject Classification: 92-08, 05A16.1.
Introduction
Since Waterman and Stein [10, 7] defined the RNA secondary structures in graph-theoretical terms, derivation of the upper bounds or precise asymptotic formulasfor the various structures became an important problem; the number of relevantpapers is growing, see, e.g., [1, 2, 5] and the references therein. Derivation of theseestimates in the current literature is based on the deep result of complex analysis,that can be traced back to G. Darboux; we refer the reader to the very informativesurvey of various methods [5].During the workshop
Teaching Discrete and Algebraic Mathematical Biology toUndergraduates at the Mathematical Biosciences Institute at Ohio State University,Columbus, OH, 7/29/2013 - 8/02/2013, several speakers discussed these estimates.However, the Darboux theorem and its modern analogs are well beyond the currentundergraduate curriculum.The goal of this methodical note is to show that precise upper bounds for thenumber of the secondary structures in many cases can be derived quite elementary,within the power of an undergraduate student taking an Introductory ComplexAnalysis class, and without the use of CAS. Of course, the method is not limitedto the secondary structures, in the last two examples we find the bounds for thenumber of RNA shapes [5]. The method is based on the well-known Cauchy-Hadamard formula for the radius of convergence of Taylor series, or even on itsreal-valued relative – the root-test for convergence of the power series.Moreover, unlike some other methods, it is irrelevant for the method, whetherthe generating function has more than one singularity on the circle of convergence.Nor does the method require the Taylor coefficients to be positive.For the readers convenience, in the next section we remind what are the RNAprimary and secondary structures. Then we describe our method, and in Section 3we consider examples of its application. RNA Structures and the Cauchy-Hadamard Formula
RNA Primary Structures.
Living cells contain important (macro)molecules,called RiboNucleic Acids (RNA). These acids contain essential genetic information,for example, about viruses, thus, it is important to know their structure. Biologistsdistinguish primary, secondary, and tertiary structures of RNA. Unlike the doublehelix of the DNA, each RNA is a linearly ordered strand , or just a string , consist-ing of other molecules, called ribonucleotides . This string is the backbone of anyRNA. Traditionally, it is represented by a horizontal straight segment with nodesoccupied by the nucleotides. If the RNA contains n ribonucleotides, we select n points of the segment and number them consecutively from left to the right by thenatural numbers 1 , , . . . , n ; these dots represent the ribonucleotides in the RNA.For more information the reader can consult, for example, [6] or [8].When we start studying new objects, it is often necessary to know their quantity.In particular, it is important to know the number of the primary and secondarystructures of RNA. It is not always possible to find a precise formula for the numberof the secondary structures subject to various restrictions. And even if such aformula is derived, it can be very cumbersome, and therefore useless. That iswhy, a lot of work has been done to derive different asymptotic formulas for thenumbers of various secondary structures, see, e.g., [6, 8] and the references therein.A formula is called asymptotic, if it gives better and better relative approximation of a quantity under consideration, when a certain parameter (e.g., the size of asystem or time) is approaching a crucial threshold; for example, if time tends toinfinity.There are four different ribonucleotides, called adenine (denoted hereafter A ), cytosine ( C ), guanine ( G ), and uracil ( U ). The linear ordering of these four nu-cleotides in either order, where each of them can repeat indefinitely, is called the primary structure of the RNA. Thus, the primary structure of an RNA can bedepicted by drawings like this, A G G U A C
Figure 1.
A string (primary structure) with 6 nodes occupied bythe nucleotides
A − G − G − U − A − C .In mathematical parlance, the pictures like this are called graphs . They arestudied in the graph theory . The graph in Fig. 1 is labeled – its vertices are labeledby the symbols of the nucleotides. The graph theory is a mathematical theory,and even though mathematics by itself cannot solve biological problems, it can giveuseful insights and help to solve biological problems [11].2.2.
Counting the Primary Structures.
We begin by solving an easy problemof calculating the number of primary structures of RNA of some specified length,say n . We denote the number of different linear strings, containing n nucleotides,without any restrictions on the neighboring ones, as e R ( n ). Since every string startswith one of the four nucleotides, either A , or C , or G , or U , followed by a string of length n −
1, we can immediately produce the basic equation e R ( n ) = 4 × e R ( n − . Such equations are called recurrence relations or difference equations . In the samefashion, e R ( n −
1) = 4 × e R ( n − , and we can iterate this equation, getting the equation e R ( n ) = 4 e R ( n −
1) = 4 e R ( n −
2) = 4 e R ( n −
3) = · · · = 4 n − e R (1) . Since we have an obvious initial condition e R (1) = 4, the total number of primarystructures without any restriction is e R ( n ) = 4 n . We can notice that this is just the number of permutations (or arrangements) withrepetitions of n elements of four different kinds of elements .Thus, the number of the RNA grows exponentially, as 4 n . Therefore, the numberof the secondary structures in the literature is usually compared with the exponen-tial function b n .Now we take up the RNA strings with restrictions on the neighboring nucleotides.Of course, in real molecules there are always small deviations from the basic rules,that is, certain ”forbidden” pairs can occur, even though with a small probability.We neglect these ”outliers” and consider only RNA, where all the base pairs areonly of these three types, called Watson-Crick pairing:(1) A − U ; C − G ; G − U . We compute the number of the primary structures satisfying these restrictions.The very first nucleotide can be either A , or C , or G , or U . Let us denote thenumber of strings of length n and starting with A , as R A ( n ), and similarly, R C ( n ), R G ( n ), R U ( n ). If the very first molecule is an A , then the second nucleotide canbe only U . Since this U starts a string of n − R A ( n ) = R U ( n − . Similarly, R C ( n ) = R G ( n − . However, if the first nucleotide is a G , then the second molecule must be either C or U , thus R G ( n ) = R C ( n −
1) + R U ( n − . Similarly, if the first nucleotide is a U , then the second molecule is either A or G .Since an RNA must start with some nucleotide, then R ( n ) = R A ( n ) + R C ( n ) + R G ( n ) + R U ( n ) . Collecting all these equations together, we deduce the recurrence equation R ( n ) = R U ( n −
1) + R G ( n −
1) + R C ( n −
1) + R U ( n −
1) + R A ( n −
1) + R G ( n − R A ( n −
1) + R C ( n −
1) + 2 R G ( n −
1) + 2 R U ( n − . For all basic information from combinatorics and graph theory see for example [3].
To have the unique solution, we must supply the initial conditions , which in thiscase are, obviously,(3) R A (1) = R C (1) = R G (1) = R U (1) = 1; R (1) = 4 . Using equations (2)-(3), we can easily compute the number R ( n ) for any given n ; for large n we should probably use computers. For example, if n = 2 then n = 14we get R (2) = 1 + 1 + 2 + 2 = 6. Indeed, we can list these strands explicitly, A − U ; C − G ; G − C ; G − U , U − A ; U − G . Secondary Structures.
An RNA molecule is not rigid like a metal bar, it isflexible and can be conveniently thought of as a smooth flexible string, which canbe crumpled, and then stretched again without any noticeable change.Imagine now that we attached small pieces of velcro tape at some places ofthis string. If we now fold it over, then these pieces of velcro tape can hook oneanother, and we cannot easily stretch the tape in a linear structure as before. Inreal molecules instead of velcro tape there are certain pairs of nucleotides. If theyhappen to be close enough one to another, they are capable of forming chemicalbonds. According to Watson-Crick, these are three base pairs of the nucleotides,given by equation (1) in either order. For example, the string in Fig. 1 can foldover into the one in Fig. 2, where the new ties are pairs
C − G and
G − C , but itcannot fold into the one in Fig. 3.
A G G UAC
Figure 2.
A secondary structure built on the primary string
A − G − G − U − A − C . A G G UAC
Figure 3.
This structure is forbidden by the Watson-Crick pairing.
This folding of RNA molecules in the plane is called the secondary structure of the RNA molecule. Since different nucleotides can come close to each other, aprimary structure can generate different secondary structures, which significantlycomplicates the analysis of the RNA. It is supposed that the secondary structureis a two-dimensional object, thus it can be drawn in the plane.The secondary structure of the RNA describes the ordering and location of thesebase pairs of the nucleotides. The secondary structure is responsible for manycrucial biological phenomena, and the graph theory is helpful in discovering thesestructures.The secondary structure, as defined by M. Waterman [10], ia also a graph. Weconsider hereafter only simple graphs, i.e., graphs without loops or parallel edges.To a simple graph, there corresponds a square matrix of size n × n , such thatits element ai, j = 1 if and only if the vertices v i and v j are connected with anedge and is 0 otherwise; this matrix is called the adjacency matrix of the graph.The secondary structure is a horizontal backbone of length n , i.e., just a primarystructure, enriched by several arcs in the upper half-plane, whose end-points arethe nodes of the backbone. In notation and terminology we follow [6]. The arcs ofthe secondary structure are subject to certain restrictions. The arc with ends atthe nodes i and j > i is denoted as ( i, j ), the length of the arc is j − i ≥
1. Notevery family of arcs corresponds to a secondary structure.A secondary structure is a simple graph on the backbone of the length n , suchthat the adjacency matrix A = ( a i,j ) possesses the following three properties.1) For the basic string to be a backbone, we require that a , = a , = · · · = a n − ,n = 1.2) Next, not counting the neighbors, every point can be adjacent to at most oneother point of the backbone. In terms of the incidence matrix, this means that forany i, ≤ i ≤ n, there exists at most one j with j = i ±
1, such that a i,j = 1.3) Finally, it is assumed that if the vertices a i and a j , i < j, are adjacent and i < k < j , then the vertex a k cannot be adjacent with any vertex to the left of a i or to the right of a j . In terms of the adjacency matrix this means that if a k,l = 1and i < k < j , then also i < l < j .It follows that the arcs of a secondary structure do not intersect , it is a non-crossing structure.3. Asymptotic Enumeration of the Secondary Structures. Examples
Convolution and Generating functions of Sequences.
Even if n is about10, the total listing of all the secondary structures is cumbersome, so that we wantto estimate their quantity. A convenient device for this is their generating function ,that is, the power series S ( x ) = ∞ X S n x n , where S n is the number of secondary structures with n nods, n = 0 , , , . . . , and x is an indeterminate, real or complex. If the series is divergent, it can be considered as formal power series, or we can truncate the series and consider generating poly-nomials , as discussed, e.g., in [3]. However, in the following examples all the serieshave positive radii of convergence.Given two sequences, a = { a , a , . . . } and b = { b , b , . . . } with generating func-tions f a ( x ) and f b ( x ) respectively, the generating function of their linear combina-tion αa + βb with any real or complex coefficients α and β is the linear combination f αa + βb ( x ) = αf a ( x ) + βf b ( x ) , thus, the correspondence between the sequences and their generating functions is alinear transformation. There is also an operation with sequences, which correspondsto the multiplication of their generating functions. Indeed, let f a ( x ) and f b ( x ) betwo absolutely convergent power series. Multiplying them termwise and combininglike terms, we get the series f a ( x ) × f b ( x ) = ∞ X c n x n , where c = a · b , c = a · b + a · b , . . . , c n = a · b n + a · b n − + a · b n − + · · · + a n · b , . . . . Thus defined sequence c = { c n } ∞ n =0 is called the convolution or the Cauchy product of the sequences a and b , and cor-responds to the multiplication of the generating functions.Let S λ ( x ) be the generating function of the secondary structures with the arclength at least λ ≥
2. If λ = 2, the recurrent relation(4) S λn = S λn − + n − − λ X j =0 S λn − − λ S λj was derived by Waterman [10]. To solve it, we must supply λ + 1 initial conditions.We assume S λ = S λ = · · · = S λλ = 1 . Computing the Generating Functions for the Secondary Structures.
The crucial observation is that recurrent relation (4) contains a sum of pairwiseproducts quite similar to the equation for the convolution. First, consider the case λ = 2 and to simplify writing, set S ( x ) = S λ ( x ). Multiplying (4) by x n , after somesimple algebra one derives the following quadratic equation for the generatingfunction,(5) x S ( x ) + ( x − − x ) S ( x ) + 1 = 0 . Solving it by the quadratic formula, we find S ( x ) = 12 x (cid:16) x − x + 1 ± p − x − x − x + x (cid:17) . However, the method has the broader scope. In Example 3 below, the method is applied tothe generating function satisfying a cubic equation.
From (5) we see that S (0) = 1, therefore, we have to choose the sign ” − ” above,and finally get S ( x ) = 12 x (cid:16) x − x + 1 − p − x − x − x + x (cid:17) . Estimations of the Number of the Secondary Structures.
We needthe following facts from introductory Complex Analysis course. If a power series f ( z ) = P ∞ n =0 f n z n has a positive radius of convergence R , then its sum f ( z ) is ananalytic function in the open disc | z | < R , R is the distance from the z = 0 to thenearest singular point, which must be located on the boundary | z | = R , and theCauchy-Hadamard formula 1 R = lim sup n →∞ n p | f n | is valid. We can even (with some reservations) refer, instead of the Cauchy-Hadamard formula, to the root test, which is more-or-less real valued version ofthe latter.Therefore, n p | f n | ≤ /R , and we have the required upper bound for the coeffi-cients f n ,(6) | f n | ≤ (cid:18) R (cid:19) n , n = 1 , , , . . . . By the definition of the upper limit, there is a subsequence f n k → ∞ as k → ∞ ,such that 1 R = lim sup n →∞ n p | f n | . This shows that the upper bound (6) is precise, it cannot be made smaller. More-over, the method works if there are several singularities on the circle of convergence.Nor does the method require the Taylor coefficients to be positive. However, pre-serving this elementary level, we can derive only the precise exponential upperbound for the quantities at question, and cannot find the principal term of theasymptotic formulas explicitly, as more sophisticated methods do. To demonstratethe method, we apply it to several known examples from the existing literature.
Example 1.
Consider the generating function S ( x ) above. The origin x = 0 is a removable singularity , as can be straightforwardly seen by rationalizing thenumerator, S ( x ) = 2 x − x + 1 + √ − x − x − x + x . The radicand − x − x − x + x is a symmetric polynomial of forth degree. Thecorresponding equation x − x − x − x + 1 = 0 can be explicitly solved by dividingover x and substituting t = x + 1 /x ; it has two real roots, z = −√ , z = √ ,and two complex conjugate roots, z , = − ± ı √ ; | z , | = 1 . The root, closest tothe origin, is z , thus S ( z ) is analytic in the open disc | z | < R = | z | < . Finally,from (6) we get the estimate | S n | ≤ (cid:18) − √ (cid:19) n = √ ! n , which is, up to the pre-exponential factor, the estimate derived in [6, p, 65] , [5] ,and some other places by making use of certain more sophisticated tools of complexanalysis. Remark 1.
It must be noted that the function S ( z ) is a many-valued analyticfunction of the complex argument z with the branching points at the roots z − z ofthe radicand and at infinity. To select a single-valued branch of S , we can cut theplane, for example, along the rays going from the points z , z , z to infinity. Thegenerating function S ( x ) is given by the principal branch of S ( z ) correspondingto the choice √ . In the disk | z | < | z | , this generating function S ( x ) isrepresented by the Taylor series with positive coefficients S n . The same remark isvalid for all the examples that follow. Example 2.
In general case, that is, for the secondary structures with any arclength λ ≥ , the difference equation is ( [6] ) S λ ( n ) = S λ ( n −
1) + n − − λ X j =0 S λ ( n − − j ) S λ ( j ) , leading again to a quadratic equation for the generating function, x (cid:0) S λ ( x ) (cid:1) − (1 − x + x + · · · + x λ ) S λ ( x ) + 1 = 0 , which again can be explicitly solved for S λ by the quadratic formula. For example,if λ = 3 , the radicand is the polynomial − x − x − x + 2 x + x , which can befactored, − x − x − x + 2 x + x = (1 − x − x )(1 − x ) , with the smallest root √ − . Repeating the same reasoning as above, we get theestimate S ( n ) ≤ (cid:16) √ (cid:17) n . If λ = 4 , the radicand is a polynomial of th degree, whose roots are to be evaluatednumerically; the smallest one is . , leading to the estimate S ( n ) ≤ . n . Example 3.
Many authors consider secondary structures subject to various restric-tions, see for example, [8, 9, 4] and the references therein. The presented approachworks even if the equation for the generating function is of degree higher than ,as in the next example. We consider saturated secondary structures studied in [4] .In this case the generating function S = S ( z ) = P n ≥ S n z n satisfies the system oftwo nonlinear equations (see, e.g., [4] and the references therein) S ( z ) = z + z + zT ( z ) + z T + z S + z S ,T ( z ) = z S + z T S.
Eliminating T , one derives the cubic equation for S , z S + z ( z − S + (1 − z ) S − z (1 + z ) = 0 . We solve the cubic equation for S by the classical Cardano formula. When theparameter z is within the range of interest, the equation has one real root, S r ( z ) = 2 − z z + 13 √ z ( A / + √ z − z + 1)3 A / ) , where A = − z + 30 z + 27 z + 3 z − z ) / p − z − z + 32 z + 60 z + 35 z + 6 z − z − . The equation A = 0 simplifies to ( z − z + 1) = 0 with all the roots on the unitcircle. The radicand in the A has the smallest root at z ≈ . , which isthe radius of convergence, R , of S ( z ) . It should be mentioned that we chose thebranch of the radical which is positive in a right neighborhood of z . Hence, we getthe bound S n ≤ · (1 /R ) n = const · . n in complete agreement with [4] . Example 4.
In the case of canonical secondary structures [4] , the generating func-tion S also satisfies a system of two nonlinear equations, S ( z ) = z + zS ( z ) + z Q ( z ) + z S ( z ) Q ( z ) ,Q ( z ) = z + z Q ( z ) + z S ( z ) Q ( z ) + z S ( z ) . Eliminating Q , we get the quadratic equation for S with the discriminant ∆( z ) = z − z − z + 6 z + 3 z − z − z + 4 z − z − z + 1 = 0 . Its smallest in absolute value root is R ≈ . , which is the closest to theorigin singular point of S ( z ) , and the upper bound for the number of the saturatedsecondary structures on n nucleotides is its reciprocal /R . , again in perfectagreement with [4] . Example 5.
Next, we consider an example from [2, Theor. 2, p. 352] , where theauthors derived the generating function for a certain class of secondary structures, S ( z ) = 12 z (1 + z ) (cid:16) − z − z (1 + z ) − p P ( z ) (cid:17) P ( z ) = ( z + 2 z + z + z − − z (1 + z ) . Here z = 0 is a removal singularity and z = − is beyond the disc of convergence,since the closest to the origin singular point is the branching point of the radicalat the smallest root of the polynomial P . This root is z ≈ . , thus theradius of convergence is R ≈ . and the upper bound of the number of thesecondary structures at question is · . , again in agreement with [2] . In the two examples to follow, we bound the number of various RNA shapes.The equations for the generating functions were derived in the informative paperby Lorenz, Ponty, and Clote [5], we only suggest an elementary derivation of themain exponential term of the asymptotics.
Example 6.
In the case of π − shapes, the generating function S ( z ) satisfies thesimple quadratic equation S = z S + z S + z , therefore, since S (0) = 0 , S ( z ) = 12 z (cid:16) − z − p − z − z (cid:17) . The radicand is a biquadratic equation with roots ± ı and ± / √ . Thus, the distancefrom the boundary to the nearest singularity is / √ , and the bound we sought for,is s n ≤ ( √ n .It should be emphasized that it is irrelevant for this method whether there aremultiple singularities on the circle of convergence. Example 7.
In the last example we estimate the number of π − shapes compatiblewith the RNA sequences of length n . The equation for the generating function, alsotaken from [5] , is z (1 − z ) S + ( z − z − z ) S + z = 0 . Solving it by the quadratic formula and canceling by − z , we get S ( z ) = 12 z (1 − z ) (cid:16) − z − p z − z − z + 1 (cid:17) . The singularity, closest to the origin, which gives the radius of analyticity of thegenerating function, is R ≈ ) . , hence, the required upper bound is its reciprocal /R ≈ . . Acknowledgement.
The author is thankful to the MBI for inviting him to theworkshop, and to Professors Raina Robeva and Peter Clote for useful discussions.Numerical computations in the examples above were made by making use of theWolfram Mathematica free online root finder. The author wants to acknowledgethe excellent performance of this widget.
References [1] P. Clote, E. Kranakis, D. Krizanc, Asymptotic Number of Hairpins of Saturated RNA Sec-ondary Structures, Bull. Math. Biol., Vol. 75, 2013, 24102430.[2] ´E. Fusy, P. Clote, Combinatorics of locally optimal RNA secondary structures, J. Math. Biol.Vol. 68, 2014, 341375.[3] A. Kheyfits,
A Primer in Combinatorics , Berlin, De Gruyter, 2010.[4] E. Kranakis, Combinatorics of Canonical RNA Secondary Structures. Preprint, 2011cgi.csc.liv.ac.uk/ ctag/seminars/evangelos20110309.pdf[5] W. Lorenz, Y. Ponty, P. Clote, Asymptotics of RNA shapes,
J. of Computational Biology ,Vol. 15, 2008, 31-63.[6] C. Reidys,
Combinatorial Computational Biology of RNA , Springer, 2011.[7] P. Stein, M. Waterman, On some new sequences generalizing the Catalan and Motzkin num-bers, Discr. Math., Vol. 26 (1978), 261-272.[8] Z. Wang, K. Zhang, RNA Secondary Structure Prediction, Ch. 14, in
Current Topics inComputational Molecular Biology , 345-363. Tsinghua Univ. Press and MIT, 2002.[9] W. Wang, M. Zhang, T. Wang, Asymptotic enumeration of RNA secondary structure, J.Math. Anal. Appl. Vol. 342, 2008, 514523.[10] M. Waterman, Secondary Structure of Single-Stranded Nucleic Acids, in