[PDF] Genetic robustness of let-7 miRNA sequence-structure pairs

Abstract

Genetic robustness, the preservation of evolved phenotypes against genotypic mutations, is one of the central concepts in evolution. In recent years a large body of work has focused on the origins, mechanisms, and consequences of robustness in a wide range of biological systems. In particular, research on ncRNAs studied the ability of sequences to maintain folded structures against single-point mutations. In these studies, the structure is merely a reference. However, recent work revealed evidence that structure itself contributes to the genetic robustness of ncRNAs. We follow this line of thought and consider sequence-structure pairs as the unit of evolution and introduce the spectrum of inverse folding rates (IFR-spectrum) as a measurement of genetic robustness. Our analysis of the miRNA let-7 family captures key features of structure-modulated evolution and facilitates the study of robustness against multiple-point mutations.

Full PDF

aa r X i v : . [ q - b i o . P E ] J a n “robust˙let7” — 2018/10/13 — 20:49 — page 1 — ✐ ✐ ✐✐✐ ✐ ✐✐ Published online Nucleic Acids Research, , Vol. , No. doi:10.1093/nar/gkn000

Genetic robustness of let-7 miRNA sequence-structure pairs

Qijun He , Fenix W. Huang , Christopher Barrett and Christian M. Reidys ∗ ∗ Biocomplexity Institute of Virginia Tech, Blacksburg, VA, USA. Department of Mathematics, Virginia Tech,Blacksburg, VA, USA. Thermo Fisher Scientiﬁc Fellow in Advanced Systems for Information Biology. Department ofComputer Science, Virginia Tech, Blacksburg, VA, USA.ABSTRACTGenetic robustness, the preservation of evolved phenotypesagainst genotypic mutations, is one of the central conceptsin evolution. In recent years a large body of work has focusedon the origins, mechanisms, and consequences of robustnessin a wide range of biological systems. In particular, researchon ncRNAs studied the ability of sequences to maintainfolded structures against single-point mutations. In thesestudies, the structure is merely a reference. However, recentwork revealed evidence that structure itself contributes tothe genetic robustness of ncRNAs. We follow this line ofthought and consider sequence-structure pairs as the unit ofevolution and introduce the spectrum of inverse folding rates(IFR-spectrum) as a measurement of genetic robustness.Our analysis of the miRNA let-7 family captures key featuresof structure-modulated evolution and facilitates the study ofrobustness against multiple-point mutations.INTRODUCTION

Genetic robustness can be characterized in terms of thevariation of phenotype distribution induced by genotypicchange (1, 2, 3) and concerns the insensitivity of a phenotypeto genetic changes. Mutational robustness has been studiedin the context of noncoding RNA (ncRNA) (4, 5). RNAconsists of a single strand of nucleotides (A,C,G,U) that canfold and bond to itself through base pairing. ncRNAs areknown to function in aptamer binding as riboswitches, inchemical catalysis as ribozymes and in RNA splicing such asspliceosome (6, 7, 8, 9). Most importantly, it is the foldedstructure that is underlying all these mechanisms, allowingfor the interaction with and subsequent modiﬁcation of otherbiological molecules. The self folding of RNA makes itan ideal object to study genotype-phenotype relations. Thewell-established energy based prediction of RNA secondarystructure, a -dimentional coarse grain of the real threedimensional structure, makes such studies feasible (10, 11, 12,13).Structures are an important determinant of the functionof ncRNAs, whence the robustness of an RNA can becharacterized in terms of the variation of secondary structuredistribution, i.e. the stability of a secondary structure in theface of genetic sequence changes. Structural robustness ofncRNAs is considered to be a key component of the ﬁtnessof the molecule and much research has been conducted ∗ To whom correspondence should be addressed. Tel: +01 540 231-2317; Email: [email protected] to identify the footprints of natural selection on secondarystructures of sncRNAs (4, 5). In (4), Borenstein and Ruppindeﬁne neutrality of an RNA sequence σ = a a ...a n by η ( σ ) =1 −h d i /n , where h d i denotes the average, taken over all n single-point mutants of σ , of the base pair distance d betweenthe minimum free energy (MFE) structure, S , of σ and theMFE structures of single-point mutants. The RNA sequence, σ , is then deﬁned to be robust if η ( σ ) is greater than theaverage neutrality of 1000 control sequences generated bythe program RNAinverse (14), which fold into the sametarget structure, S . The main ﬁnding of (4) is that precursormiRNAs (pre-miRNA) exhibit a signiﬁcantly higher level ofmutational robustness than random RNA sequences, havingthe same structure. Subsequently Rodrigo et al. (5) undertooka similar analysis for bacterial small RNAs. Their main ﬁndingwas, that, surprisingly, bacterial sncRNAs are not signiﬁcantlymore robust when compared with 1000 sequences having thesame structure, as computed by RNAinverse. (5) based theirﬁndings on the notion of ensemble diversity deﬁned earlier in(15).In the above mentioned papers, robustness is deﬁnedby taking the average of all n single-point mutants,the underlining assumption being, that all n single-pointmutations are equally likely to occur and, in addition, takingexclusively single-point mutations into account. Incorporationof multiple-point mutations poses obvious difﬁculties, sincethe number of sequences that need to be taken intoconsideration will grow exponentially.The neutral theory of Motoo Kimura (16), stipulatesthat evolution is achieved by neutral mutations, that is,by mutations, that are necessarily compatible with theunderlying structure. This is in accordance with recentﬁndings showing that secondary structures have a genuine inﬂuence on selection in RNA genes. In (17), Hein et al. classify stem positions into structural classes and validatethat they are under different selective constraints. In (18),the authors observe neutral evolution in Drosophila miRNAand evolution increasing the thermal dynamic stability of theRNA. (19, 20) study Human Accelerated Regions (HAR) inbrains of primates, i.e. noncoding RNAs with an acceleratedrate of nucleotide substitutions along the lineage betweenhuman and chimpanzee. (20) concludes that this increasedrate of nucleotide substitutions is of central importance forthe evolution of the human brain. The most divergent ofthese regions, HAR1, has been biochemically conﬁrmed to c (cid:13) The Author(s)This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. robust˙let7” — 2018/10/13 — 20:49 — page 2 — ✐ ✐ ✐✐✐ ✐ ✐✐ Nucleic Acids Research, , Vol. , No. fold into distinct RNA secondary structures in human andchimpanzee. Interestingly, the mutations in the human HAR1sequence, compared to the chimpanzee sequence, stabilizetheir respective RNA structure (21), suggesting a shapemodulated evolution.Recently, (22) proposed a framework consideringRNA sequences and their RNA secondary structuressimultaneously. This gives rise to an information theoreticframework for RNA sequence-structure pairs. In particular,the authors studied the “dual Boltzmann distribution”, i.e. theBoltzmann distribution of sequences with respect to a ﬁxedstructure. The authors develop a Boltzmann sampler ofsequences with respect to a given structure and study the“inverse folding rate” (IFR) of the sampled sequences,i.e. the proportion of the sampled sequences whose MFEstructure equals to the given structure. (22) reports that naturalstructures have higher IFR than random structures, suggestingthat natural structures have higher intrinsic robustness thanrandom structures.This suggests an approach to consider sequences andstructures-as pairs -as the unit of evolution, instead of justsequences in isolation. In this paper, we generalize the notionof inverse folding rate (IFR) and introduce a novel proﬁle, theIFR- spectrum , of a sequence-structure pair. By construction,this spectrum entails both sequence and structure information.The key idea is that mutations will be biased by the underlyingthermodynamic energy of the respective structure, instead ofbeing equally likely to occur.We shall conduct a detailed study on the let-7 familymiRNAs, small endogenous noncoding RNAs, that regulatethe expression of protein coding genes in animals. Theshort, mature miRNAs ( ∼

22 nt) originate from longerRNA precursor molecules that fold into a stem-loop hairpinstructure. The secondary structure of miRNA stem-loopsserves a crucial role in the miRNA gene maturation process(23). The stem-loop structure has been under evolutionarypressures to conserve its structure. Such stabilizing pressuresfavor robust conﬁgurations and may have led to the evolutionof robust structures. Furthermore, the let-7 miRNA familyhas been widely detected in metazoans, ranging from humanto fruit ﬂy (24). These features make the let-7 gene familya particularly suitable test bed for studying the evolution ofgenetic robustness.We organize our study as follows: ﬁrst we shall extendthe analysis of (4) to IFR-spectra and observe that mostnative sequence-structure pairs have a higher IFR-spectrumthan sequence-structure pairs obtained by the inverse foldingalgorithm. We shall investigate different aspects of therobustness of IFR-spectra of native sequence-structure pairsand show that these are distinctively more robust againstmultiple-point mutations. Secondly, we conduct cross-speciescomparisons of native sequence-structure pairs, observingthat higher metazoan species have higher IFR-spectra. Ouranalysis suggests that IFR-spectra are being increased in thecourse of shape-modulated evolution.

MATERIALS AND METHODS

IFR-spectrum: a sequence-structure pair proﬁle.

In this section we introduce the technical details of ourframework, starting with Borenstein and Ruppin’s deﬁnitionof neutrality (4), as the average of all n single-point mutants.By construction, all mutations are equivalent and structurehas no inﬂuence on the mutation rate. However, as suggestedin (17, 18, 21), structure genuinely affects mutation rateand mutations in turn further increase the thermal dynamicstability of the structure. We consider here mutations withrespect to a ﬁxed structure and consequently deal exclusivelywith compatible mutations. Furthermore, the idea of thefollowing is to favor energetically beneﬁcial mutations, whilepenalizing detrimental mutations. To adequately quantifythe above, we revisit some of the basic concepts of thethermodynamic model of RNA secondary structure.The free energy η can be considered as a result of pairingsequences and structures as follows: η : Q n ×S n → R ⊔ + ∞ , where Q n and S n denote the space of sequences, σ , and thespace of secondary structures, S , respectively and η ( σ,S ) isthe energy of S on σ . This mapping is computed as the sum ofthe energy contributions of individual base-pairs (25). A moreelaborate model (11, 26) evaluates the total free energy to bethe sum of from the energies of loops involving multiple base-pairs. We remark that, if a sequence, σ , is not compatible witha structure, S , then η ( σ,S ) = + ∞ . Then MFE-folding is a mapfrom sequences to distinguished structures, i.e. the minimumfree energy structures:mfe : Q n → S n , σ argmin S ∈S n η ( σ,S ) . To incorporate the thermodynamic information into theprobability of mutations, we introduce the notion of theBoltzmann distribution of k -point mutants of a sequence-structure pair. To this end, we consider the partition functionof k -point mutants of a sequence-structure pair, namely, thepartition function of all sequences that are at Hammingdistance k to the given sequence with respect to the givenstructure.D EFINITION Let σ be a sequence of n nucleotidesand let S be its associated structure, i.e. its MFE- or nativestructure. Then the partition function of k -point mutants of σ with respect to S is given by: Q Sσ,k = X σ ′ ,h ( σ,σ ′ )= k e − η ( σ ′ ,S ) RT , ( ) where h is the Hamming distance, η ( σ ′ ,S ) is the energyof S on σ ′ , R is the universal gas constant and T is thetemperature. Eq. ( ) represents the “dual” of McCaskill’s partitionfunction (27) with an additional Hamming distance ﬁltration(28). Given this partition function, we are in position torobust˙let7” — 2018/10/13 — 20:49 — page 3 — ✐ ✐ ✐✐✐ ✐ ✐✐ Nucleic Acids Research, , Vol. , No. introduce the Boltzmann distribution of k -point mutants of σ with respect to S : the probability of a speciﬁc k -point mutant σ ∗ of σ , h ( σ ∗ ,σ ) = k , with respect to S : P r

Sσ,k ( σ ∗ ) = e − η ( σ ∗ ,S ) RT Q Sσ,k . This expression allows us to consider Boltzmann weightedmutations, taking into account the free energy, when realizing S .In order to quantify mutational robustness, i.e. the ability tomaintain the structure, we consider all k -point mutants, thatfold again into S . We call the fraction of k -point mutants, σ ∗ ,that fold into S , the inverse folding rate (IFR) of the sequencestructure pair ( σ,S ) at k , that is, we haveIFR Sσ,k = X mfe ( σ ∗ )= S P r

Sσ,k ( σ ∗ ) . IFR

Sσ,k thus quantiﬁes the mutational robustness of asequence-structure pair, ( σ,S ) , with respect to Boltzmannweighted k -point mutations, taking into account the energy ofthe sequence, when assuming the structure S .IFR Sσ, can be viewed as a variation of Borenstein andRuppin’s deﬁnition of neutrality. Instead of a uniformdistribution of all n single-point mutants, a Boltzmannweighted distribution for these mutants is employed. Incontrast to using base pair distance as the metric on structurespace, we restrict ourselves to a discrete metric.The IFR

Sσ,k -spectrum of a sequence-structure pair, ( σ,S ) ,i.e. the collection of IFR Sσ,k for varying k , allows us to studymultiple-point mutations instead of conﬁning the analysisto single-point mutations. In this study, we shall analyzethe IFR Sσ,k -spectrum of sequence-structure pairs of Hammingdistances ≤ k ≤ .The IFR Sσ,k -spectrum can be viewed as the conditionalprobability of the IFR in (22), conditional to speciﬁcHamming distances. As a result, we have:IFR S = n X k =0 P r Sσ ( k ) IFR

Sσ,k , where P r Sσ ( k ) = Q Sσ,k Q S , Q S = P σ ′ e − η ( σ ′ ,S ) RT . The IFR

Sσ,k -spectrum via Boltzmann sampling.

Computing IFR

Sσ,k strictly requires folding all σ ′ that are S -compatible, such that h ( σ,σ ′ ) = k . While this task isimpractical for large k , IFR Sσ,k can be efﬁciently computedby means of Boltzmann sampling. This is conducted herevia the Hamming Distance Restricted Dual Sampler (HRDS)(28), which facilitates the approximation of IFR

Sσ,k for anygiven sequence-structure pair, ( σ,S ) , and Hamming distance, k . HDRS takes as input ( σ,S ) and k and outputs sequences, σ ∗ , having Hamming distance k with probability e − η ( σ ∗ ,S ) RT Q Sσ,k .We then have:IFR

Sσ,k ≈ of sequences folding into S of sampled sequences . In this paper, IFR

Sσ,k is calculated by sampling sequencesand computing their MFE structures. The standard errorof measuring IFR

Sσ,k by sampling times is ± . (derived by computing the IFR Sσ, of aae-let-7 sequence-structure pairs times). By computing the IFR Sσ,k for each ≤ k ≤ , we obtain the IFR Sσ,k -spectrum of the sequence-structure pairs, ( σ,S ) . miRNA data. We consider miRNA precursor sequences of the let- gene family, obtained from the miRBase database (29).These originate from different animal species. In vertebrateand urochordates multiple homologous miRNA genes arecommonly observed, while single miRNA let- genes weremore common in other animal species. We provide in thesupplemental materials (SM) the numbers of the respectivelet- genes (see Table 1 in SM). The secondary structures ofthese sequences are derived using a MFE folding algorithm,employing the Turner energy model (26). We compute theIFR Sσ,k -spectra, for ≤ k ≤ , for all let- sequence-structure pairs. RESULTS

Robustness of native let-7 sequence-structure pairs.

Borenstein and Ruppin (4) report that miRNA sequencesexhibit a high level of neutrality in comparison with randomsequences folding into a similar structure, suggesting thatnative miRNA sequence are more robust against single-pointmutations. These results motivate further analysis of nativeand random sequences, in particular whether or not thisneutrality is conﬁned to a local neighborhood. To this endwe uniformly select of the native sequence-structurepairs of the let-7 miRNA family (see Table 2 in the SM).For each such native pair, ( σ,S ) , we derive randomsequences, σ ∗ , whose MFE structure is identical to S , ascontrol set. These control sequences are computed using theinverse folding algorithm presented in (22, 30, 31). Givenstructure S , we compute the partition function of all sequenceswith respect to S . We then Boltzmann sample sequences andﬁlter (by rejection) those sequences, that fold into S . We thencompute the IFR Sσ ∗ ,k -spectra for the inverse folding solutionsand compare for each native sequence-structure pair ( σ,S ) , theIFR Sσ,k -spectrum with µ ( IFR Sσ ∗ ,k ) -spectrum, i.e. the mean ofall spectra, IFR Sσ ∗ ,k , taken at each respective k . We call ( σ,S ) k -robust, if IFR Sσ,k > µ ( IFR Sσ ∗ ,k ) . We then check thenative sequence-structure pairs for k -robustness and quantifythe signiﬁcance of k -robustness via Z -tests. In Figure 1 wedepict a sequence-structure pair, that is k -robust for ≤ k ≤ .robust˙let7” — 2018/10/13 — 20:49 — page 4 — ✐ ✐ ✐✐✐ ✐ ✐✐ Nucleic Acids Research, , Vol. , No.

Figure 1.

The IFR

Sσ,k -spectrum: the x -axis denotes the Hamming distanceand the y -axis the inverse folding rate. The IFR Sσ,k -spectrum of the api-let-7 (Acyrthosiphon pisum) gene (yellow) and the spectrum of means ofthe corresponding control set, µ ( IFR Sσ ∗ ,k ) (blue). The error bar denotes thestandard deviation of the control set of { IFR Sσ ∗ ,k } for each k . Since IFR Sσ,k >µ ( IFR Sσ ∗ ,k ) , api-let-7 is k -robust, for all ≤ k ≤ . IFR Sσ,k decreases slowerthan µ ( IFR Sσ ∗ ,k ) , as k increases. In Table 1 we summarize the data on k -robustness ofthe let-7 sequence-structure pairs included in our study.We observe that of the selected let-7 miRNAs are -robust. This observation is consistent with (4), providingfurther evidence that native let-7 sequence-structure pairsare robust against single-point mutations. However, themutational robustness of native let-7 sequence-structure pairsis not restricted to single-point mutations: for all ≤ k ≤ , we observe that k -robustness holds for over ofthe selected sequence-structure pairs. This suggests, that themutational robustness of let-7 sequence-structure pairs isnot a local phenomenon. Furthermore, the percentage ofsigniﬁcantly k -robust genes increases as k increase, see Table1. In Figure 1 we display the IFR Sσ,k -spectrum of the api-let-7 sequence-structure pair, illustrating the aforementionedphenomenon. The api-let-7 sequence-structure pair is -robustbut not signiﬁcantly -robust. However, for ≤ k ≤ , theapi-let-7 sequence-structure pairs is signiﬁcantly k -robust.We proceed by considering for each native sequence-structure pair ( σ,S ) , r k , its IFR Sσ,k -rank among the IFR Sσ ∗ ,k -values, respectively. Figure 2 presents the distribution of r k ,for k = 1 , , . Examination of the rank distribution shows Table 1. k -robustness of native sequence-structure pairsHamming distance k -robust p < . k=1 96% 8%k=2 96% 28%k=3 94% 38%k=4 92% 38%k=5 94% 38%k=10 92% 48%k=15 94% 56%k=20 96% 58% Second column: the percentage of k -robust (IFR Sσ,k >µ ( IFR Sσ ∗ ,k )) genes. Thirdcolumn: the percentage of the signiﬁcantly k -robust genes for p< . . The p valuesdenote the probability of observing IFR Sσ ∗ ,k > IFR

Sσ,k by chance and are calculated via Z -tests. that native pairs exhibit a propensity towards high ranks foreach k , supporting the observation of k -robustness. As k increases, the propensity towards high ranks becomes moreand more pronounced. In case of k = 1 , only out of nativesequence-structure pairs have r = 1 . For k = 20 , however,this quantity increases to . This is consistent with theincreasing percentage of signiﬁcantly k -robust genes as k increases, further demonstrating that native let-7 sequence-structure pairs exhibit mutational robustness against multiple-point mutations.By comparing speciﬁc native sequence-structure pairs withthe respective control pairs (obtained by inverse folding),we observe native sequence-structure pairs exhibit in generalhigher IFR Sσ,k -values. In this analysis the sequence-structurepair, ( σ,S ) , is ﬁxed and we contrast native pairs with pairsobtained via random inverse folded sequences.We can augment this analysis by considering the ensemble of spectra of native pairs versus the ensemble of all inversefolded sequence-structure pairs. In that, for any ﬁxed k , wecan integrate the information of all native pairs contrastingthis with the integrated information of the inverse foldedpairs. Figure 3 displays these two distributions: IFR Sσ,k ofnative pairs and µ ( IFR Sσ ∗ ,k ) , of the inverse folded pairs for k = 1 , , . For each k , we not only observe that the meanof the IFR Sσ,k is greater than that of the terms µ ( IFR Sσ ∗ ,k ) ,but also that the distributions of IFR Sσ,k and µ ( IFR Sσ ∗ ,k ) aredistinctively different. Furthermore, the difference betweenthe two distributions for each k is statistically signiﬁcant, seeTable 2, where the p value is calculated by two tailed Wilcoxonsigned rank tests for paired data.The increase of signiﬁcantly k -robust native sequence-structure pairs for increasing k , as well as the tendencyof native IFR Sσ,k , to assume high ranks, for increasing k ,suggest that k -robustness, ≤ k ≤ , is not a byproduct of -robustness.Clearly, the very notion of IFR Sσ,k -spectrum raises thequestion to what extend k -robustness of native sequence-structure pairs, ≤ k ≤ , is strongly correlated to -robustness. In other words, to what extent is robustness againstmultiple-point mutations induced by robustness againstsingle-point mutations.In order to quantify this, we conduct a systematiccorrelation analysis for IFR Sσ,k and µ ( IFR Sσ ∗ ,k ) . Regarding µ ( IFR Sσ ∗ ,k ) as the intrinsic k -robustness of the structure S , we Table 2.

Distinct distribution of IFR

Sσ,k and µ ( IFR Sσ ∗ ,k ) Hamming distance mean IFR

Sσ,k mean µ ( IFR Sσ ∗ ,k ) p valuek=1 0.9118 0.7904 1.09 × − k=5 0.5952 0.4220 3.57 × − k=20 0.2483 0.1387 2.23 × − The second and third columns display the mean value of IFR

Sσ,k and µ ( IFR Sσ ∗ ,k ) ,respectively, for each k . p values denote the probability of observing more extremedifferences between { IFR

Sσ,k } and { µ ( IFR Sσ ∗ ,k ) } at random, assuming two samplesare drawn from the same distribution. The p value is calculated by two tailed Wilcoxonsigned rank test for paired data. robust˙let7” — 2018/10/13 — 20:49 — page 5 — ✐ ✐ ✐✐✐ ✐ ✐✐ Nucleic Acids Research, , Vol. , No. k = 1 k = 5 k = 20 Figure 2.

The distribution of IFR

Sσ,k -ranks, r k , of the native sequence-structure pairs IFR Sσ,k . The x -axis denotes ranking and the y -axis frequency, k =1 (Left), k =5 (Center) and k =20 (Right). For each native sequence-structure pair, ( σ,S ) , the IFR Sσ,k is ranked among

IFR Sσ ∗ ,k -values, where σ ∗ is a randomsequence whose MFE structure is S . k = 1 k = 5 k = 20 Figure 3.

The distribution of IFR

Sσ,k , of native pairs (blue) and µ ( IFR Sσ ∗ ,k ) , of the control sets (yellow). The x -axis denotes the inverse folding rate and the y -axis denotes the frequency, k =1 (Left) k =5 (Center) k =20 (Right). arrive at interpreting the term IFR Sσ,k − µ ( IFR Sσ ∗ ,k ) as adaptive k -robustness.The intrinsic structural k -robustness µ ( IFR Sσ ∗ ,k ) betweendifferent k exhibits very strong correlation. Point in case:Spearman’s rank correlation coefﬁcient between µ ( IFR Sσ ∗ , ) and µ ( IFR Sσ ∗ , ) is . , where p < − . However,the correlation between IFR Sσ, − µ ( IFR Sσ ∗ , ) and IFR Sσ,k − µ ( IFR Sσ ∗ ,k ) , drops much faster as k increases. The Spearman’srank correlation coefﬁcient between IFR Sσ, − µ ( IFR Sσ ∗ , ) andIFR Sσ, − µ ( IFR Sσ ∗ , ) is . ; where p = 0 . , showingweak correlation. This indicates that there exists factorsbeyond -robustness that contribute to the k -robustness of native sequence-structure pairs.We ﬁnally study the role of the free energy for k -robustness.(32) reports that miRNA exhibits increased thermodynamicstability and furthermore that mutations in the HAR furtherstabilize the structure (21). This gives rise to the questionwhether increased k -robustness is a result of the increasedthermodynamic stability of miRNAs.To test this, we select a native sequence-structure pairfrom the miRNA let-7 family and generate two sets ofsequences: Σ high and Σ low , each consisting of sequencesthat fold into S , having higher and lower energy than thenative pair, respectively. It turns out, that sequences generatedby RNAinverse (14) tend to have higher energy than thenative pair, while sequences generated by the dual Boltzmannsampler, ﬁltered by rejection to fold into S , tend to have lower energy. We compute the mean and the standard deviation ofthe spectra of the sequences of Σ low and Σ high respectivelyand integrate our ﬁndings in Figure 4. Figure 4.

Spectra of a native sequence-structure pair versus those of inversefolding solutions having lower and higher energies. We display the spectrumof the native, let-7 miRNA pair (yellow), the spectrum of the mean IFR

Sσ,k of inverse fold solutions having lower (red) and higher (blue) energy,respectively and their standard deviations. The x -axis denotes the Hammingdistance and the y -axis the inverse folding rate. Figure 4 shows that the IFR

Sσ,k -spectrum of the nativepair is distinctively higher than the ones derived from Σ low and Σ high . We have conﬁrmed this result for additionalnative sequence-structure pairs. The ﬁndings suggest thatthermodynamic stability is not the sole factor, native pairshave been selected for. The mutational robustness of nativerobust˙let7” — 2018/10/13 — 20:49 — page 6 — ✐ ✐ ✐✐✐ ✐ ✐✐ Nucleic Acids Research, , Vol. , No. pairs is not a byproduct of evolving toward thermodynamicstability.

Robustness of metazoan species. (33) studies genetic robustness of networks of bacterial genes,observing variations among species in their level of geneticrobustness, reﬂecting adaptations to different ecologicalniches and lifestyles. The genetic robustness of a networkrefers to its ability to buffer mutations via the existenceof alternative pathways. The species-speciﬁc variations raisethe question whether such variations can also be observedfor the IFR

Sσ,k -spectra across different animal species. AreIFR

Sσ,k -spectra to some extent a reﬂection of phylogeneticrelationships?Herein, we perform an evolutionary analysis of the IFR

Sσ,k -spectra of let-7 miRNA sequence-structure pairs acrossdifferent animal species. We ﬁrst perform our analysison selected taxa: Homo sapiens, Pan paniscus, Anoliscarolinensis, Ciona, Drosophila, Chromadorae. There are , , , , and let-7 genes found in the miRBase withineach taxon, respectively. For almost all ≤ k ≤ , we observethe mean IFR Sσ,k -value within each taxa to decrease fromHomo sapiens to Chromadorae, see Figure 5 and Table. 3.The only exceptions are the IFR

Sσ,k -values of Pan paniscus,that become slightly larger than those of Homo sapiens, for ≤ k ≤ . We observe that “higher” metazoan species, asHomo sapiens, Pan paniscus, Anolis carolinensis, Ciona, allof which being Chordata, exhibit larger IFR Sσ,k -values than“lower” metazoan species, as Drosophila, Chromadorae, allof which being Ecdysozoa. In addition, the difference of

Figure 5.

Mean IFR

Sσ,k within each of the six taxa, from left to right: Homosapiens, Pan paniscus, Anolis carolinensis, Ciona, Drosophila, Chromadorae.The x -axis denotes the Hamming distance and the y -axis the inverse foldingrate. The relative ordering of IFR Sσ,k -values among taxa is consistent foreach k (from high to low: Homo sapiens, Pan paniscus, Anolis carolinensis,Ciona, Drosophila, Chromadorae, the only exception being IFR Sσ,k -valuesof Pan paniscus becoming slightly larger than those of Homo sapiens, for ≤ k ≤ ). IFR

Sσ,k -values among evolutionary closely related species, asHomo sapiens, Pan paniscus and Anolis carolinensis are small.However, comparing taxa, less related in the phylogenetic treeof life, the difference becomes signiﬁcant. Statistical tests (twotailed Wilcoxon rank sum test) are conducted to quantify thestatistical signiﬁcance of this difference within the let-7 genesIFR

Sσ,k -value distribution among these six taxa, see Figure 6,for k = 1 , , . The results of statistical tests at k = 1 , , demonstratethat IFR Sσ,k -distributions of the six taxa reﬂect the complexityof the organisms and their phylogenetic relations. Homosapiens, Pan paniscus and Anolis carolinensis almost alwaysexhibit signiﬁcantly higher IFR

Sσ,k -values than Drosophilaand Chromadorae, for k = 1 , , . On the other hand,differences among Homo sapiens, Pan paniscus and Anoliscarolinensis are statistically insigniﬁcant for k = 1 , , .However, we observe, for k = 20 insigniﬁcant differencesbetween Homo sapiens, Pan paniscus and Anolis carolinensiswith Drosophila, though the p values are very close to . .A more distinguished variation is observed comparing Cionawith Drosophila. In case of k = 1 , the difference is close ofbeing signiﬁcant with p = 0 . , but for k = 20 , the differenceis insigniﬁcant with p = 0 . .We proceed by conducting the above analysis for all let-7 sequence-structure pairs of the miRBase. Let-7 genes havebeen found in metazoan species. Multiple homologousmiRNA genes are commonly observed in vertebrates, whilesingle miRNA let-7 genes were more common in otheranimal species. The phylogenetic tree of all species isconstructed using the “Interactive Tree Of Life” (iTOL)(34, 35), based on NCBI Taxonomy (36). In addition, iTOLdetermines taxonomic classes of all internal nodes. Given thephylogenetic tree, we not only compute the mean IFR Sσ,k -values of the let-7 sequence-structure pairs within eachspeciﬁc species, but also the mean IFR

Sσ,k -values within taxa,corresponding to the internal nodes of the phylogenetic tree.Figure 7 depicts a subtree of the phylogenetic tree, thatincludes the major taxa as well as the mean IFR

Sσ,k -valueswithin each taxon, for k = 1 , , . The complete phylogenetictree is presented in the SM (Figure 3,4,5).For let-7 miRNAs, we observe that higher metazoanspecies typically exhibit higher IFR Sσ,k -values as well assigniﬁcant differences in the IFR

Sσ,k -distribution across taxa.For instance, IFR

Sσ,k -distributions of Deuterostomia andProtostomia are found to be signiﬁcantly different ( p =0 . , . , . for k = 1 , , , calculated by twotailed Wilcoxon rank sum tests). DISCUSSION

In this paper we augment the analysis of genetic sequencesby incorporating structural information. The informationrepresented by structures is distinctively different from

Table 3.

Mean IFR

Sσ,k within each taxa at each kk =1 k =5 k=20Homo sapiens 0.9288 0.6632 0.3356Pan paniscus 0.9234 0.6404 0.3402Anolis carolinensis 0.9211 0.6412 0.2949Ciona 0.8874 0.5509 0.1971Drosophila 0.8524 0.4889 0.1802Chromadorae 0.9118 0.7904 0.0642 The table displays the mean value of IFR

Sσ,k of the let-7 sequence-structure pairs withineach taxa, at k =1 , , . robust˙let7” — 2018/10/13 — 20:49 — page 7 — ✐ ✐ ✐✐✐ ✐ ✐✐ Nucleic Acids Research, , Vol. , No. Homo sapiensPan paniscus

Anolis carolinensisCionaDrosophila H o m o s ap i en s P an pan i sc u s A no li s c a r o li nen s i s C i ona D r o s oph il a C h r o m ado r ea Homo sapiensPan paniscus

Anolis carolinensisCionaDrosophila H o m o s ap i en s P an pan i sc u s A no li s c a r o li nen s i s C i ona D r o s oph il a C h r o m ado r ea k = 1 k = 5 k = 20 Figure 6.

Statistical signiﬁcance ( p value) of the difference within the let-7 genes IFR Sσ,k -distribution between taxa, for k =1 (left), k =5 (middle) and k =20 (right). p values denote the probability of observing a larger difference in IFR Sσ,k -distribution between two corresponding taxa at random, assuming they aredrawn from the same probability distribution, computed by two tailed Wilcoxon rank sum test. Differences between two taxa, that are statistically signiﬁcant ( p < . , are highlighted and phylogenetic relations among the six taxa are displayed. Bilateria

Platyhelminthes

Protostomia

Deuterostomia

Nematoda

Lophotrochozoa

Ecdysozoa

Arthropoda

Echinodermata

Hemochordata

Chordata

Tunicata

Cephalochordata

Vertebrata

Figure 7.

Phylogenetic tree of the major taxa and mean IFR

Sσ,k of the let-7 sequence-structure pairs within each taxon, for k =1 , , . Mean IFR Sσ,k -value for k =1 (Right hand side), k =5 (Center) and k =20 (Left hand side). that represented by sequences. Structures encode relationsbetween pairs of loci . Such a relation can be realized inmultiple ways, i.e., by the bond between loci i and j byAU, UA, GU, UG, CG and GC. In that, one can expect ameaningful enhancement of the sequence information.Incorporating structural information, i.e. consideringsequences and structures combined, provides new ways toanalyze genetic material. We investigate mutational robustnessof let-7 miRNA, making use of this perspective and introducethe IFR Sσ,k -spectrum, a novel observable, that quantiﬁesmutational robustness. The IFR

Sσ,k -spectrum depends onboth: the reference sequence and structure, respectively andallows us to delocalize the study of mutational robustnessbeyond single-point mutations. Our study provides evidence for direct evolution ofincreased robustness in let-7 miRNAs, by comparing nativelet-7 miRNA sequence-structure pairs with the control pairsobtained by inverse folding algorithms. It provides evidencethat mutational robustness is not local, i.e. it cannot bededuced from or restricted to single-point mutations. Onthe contrary, robustness effects become more pronounced inhigher Hamming distances: the percentage of signiﬁcantly k -robust, native let-7 sequence-structure pairs increases as theHamming distance k increases. By conducting a correlationanalysis between IFR Sσ,k -values, we provide evidence thatthere exist additional factors that contribute to mutationalrobustness against multiple-point mutations. The spectrumitself contains information that cannot be inferred from a localanalysis. robust˙let7” — 2018/10/13 — 20:49 — page 8 — ✐ ✐ ✐✐✐ ✐ ✐✐ Nucleic Acids Research, , Vol. , No.

The pronounced mutational robustness of native let-7sequence-structure pairs against multiple-point mutationsmight play a role in genetic robustness on a population level.The presence of native sequences in a population allows forsigniﬁcant sequence variation within a species while stillpreserving the phenotype. This would suggest that evolutionis not identifying sequences that are locally robust but robustwithin entire regions of sequence space.An evolutionary analysis of the IFR

Sσ,k -spectrum oflet-7 miRNAs shows that the IFR

Sσ,k -spectrum resemblesphylogenetic relationships. Statistical tests exhibit, thatclosely related species tend to have similar IFR

Sσ,k -spectrawhile distant species tend to exhibit distinct IFR

Sσ,k -spectra. Ingeneral, higher level animal species have larger IFR

Sσ,k -valuesthan lower level animal species. The increased mutationalrobustness in higher level animals appears to reﬂect thecomplexity of the organisms.The role of structure in the context of the IFR

Sσ,k -spectrum,differs substantially from the role of neutrality in (4): structureis not merely used to measure the effect of the mutation, butalso affects how sequences mutate. This constitutes effectivelya feed-back loop between sequence and structure and isarguably the driving factor enhancing the signal in nativesequence-structure pairs.We use a discrete metric for measuring the structural changeinduced by mutation in the deﬁnition of IFR

Sσ,k -spectrum,instead of the base pair distance. This metric produces stabledata: the standard error of IFR

Sσ,k by sampling times issmall, ± . , compared to the difference of IFR Sσ,k -spectrabetween native and control sequence-structure pairs and thedifference of IFR

Sσ,k -spectra across species, respectively. Thisallows us to analyze efﬁciently mutational robustness againstmultiple-point mutations by Boltzmann sampling, obsoletingthe need for large sample sizes.In our analysis, we use the MFE structures, predicted byRNA minimum free-energy folding algorithm, as phenotypes.The study can be enhanced by passing from MFE structuresto partition functions of structures with respect to a ﬁxedsequence (27). One can envision an analysis that considersthe both: the dual partition function, considered here and inaddition the partition function of structures.Traditionally, the “information” of a sequence is beingidentiﬁed with its actual sequence of nucleotides. Thisperspective results in employing sequence alignments toquantify sequence similarities, the underline assumptionbeing, that similar sequences should have close biologicalrelevant. This however is not entirely correct. Fontana etal. (37) study the ruggedness of genotype to phenotypemaps and show that similar sequences can exhibit distinctlydifferent phenotype. Furthermore, in the context of sequencedesign and detection of new functional genes, sequencesthat fold into the same structure are considered to beequal (14, 38). The presence of neutral networks (39, 40,41) of RNA secondary structures, shows, that there arecomplimentary sequences that fold into the same structure.In other words, neither does sequence similarity necessarilyimply phenotypic similarity nor does sequence dissimilarity imply phenotypic dissimilarity. This motivates to augment thesequence information by incorporating additional factors.The analysis of IFR

Sσ,k -spectra represents such anaugmentation: the let-7 sequences-structure pairs exhibit adistinctive difference between native and random pairs. As aresult, integrating the information of sequences and structuresfacilitates at minimum the extension of the local analysisconducted in (4) and may as well, as a novel paradigm alone,lead to further biological insights.

FUNDING

This research is partially funded by Thermo Fisher and thelast author is a Thermo Fisher Scientiﬁc Fellow in AdvancedSystems for Information Biology.

ACKNOWLEDGEMENTS

Special thanks to Stanley Hefta for his input on thismanuscript. We gratefully acknowledge the help of KevinShinpaugh and the computational support team at BI, MiaShu, Thomas Li, Henning Mortveit, Madhav Marathe andReza Rezazadegan for discussions. The fourth author is aThermo Fisher Scientiﬁc Fellow in Advanced Systems forInformation Biology and acknowledges their support of thiswork.

Conﬂict of interest statement.

None declared.

REFERENCES

1. Gu, Z., Steinmetz, L. M., Gu, X., Scharfe, C., et al. (2003) Role ofduplicate genes in genetic robustness against null mutations.

Nature, (6918), 63.2. de Visser, J. A. G. M., Hermisson, J., Wagner, G. P., Meyers, L. A.,Bagheri-Chaichian, H., Blanchard, J. L., Chao, L., Cheverud, J. M., Elena,S. F., Fontana, W., et al. (2003) Perspective: evolution and detection ofgenetic robustness.

Evolution, (9), 1959–1972.3. Schlichting, C. D., Pigliucci, M., et al. (1998) Phenotypic evolution: areaction norm perspective., Sinauer Associates Incorporated, .4. Borenstein, E. and Ruppin, E. (2006) Direct evolution of geneticrobustness in microRNA. Proceedings of the National Academy ofSciences, (17), 6593–6598.5. Rodrigo, G. and Fares, M. A. (2012) Describing the structural robustnesslandscape of bacterial small RNAs.

BMC evolutionary biology, (1), 52.6. Darnell, J. E. (2011) RNA: life’s indispensable molecule, Cold SpringHarbor Laboratory Press, .7. Breaker, R. R. (1996) Are engineered proteins getting competition fromRNA?. Current Opinion in Biotechnology, (4), 442–448.8. Serganov, A. and Patel, D. J. (2007) Ribozymes, riboswitches and beyond:regulation of gene expression without proteins. Nature reviews. Genetics, (10), 776.9. Breaker, R. R. and Joyce, G. F. (1994) Inventing and improving ribozymefunction: rational design versus iterative selection methods. Trends inbiotechnology, (7), 268–275.10. Waterman, M. S. (1978) Secondary structure of single-stranded nucleicacids. Adv. Math. (Suppl. Studies), , 167–212.11. Mathews, D., Sabina, J., Zuker, M., and Turner, D. (1999) Expandedsequence dependence of thermodynamic parameters improves predictionof RNA secondary structure. J. Mol. Biol., , 911–940.12. Zuker, M. and Stiegler, P. (1981) Optimal computer folding of larger RNAsequences using thermodynamics and auxiliary information.

NucleicAcids Res., , 133–148.13. Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, L. S., Tacker, M.,and Schuster, P. (1994) Fast Folding and Comparison of RNA SecondaryStructures. Monatsh. Chem., , 167–188. robust˙let7” — 2018/10/13 — 20:49 — page 9 — ✐ ✐ ✐✐✐ ✐ ✐✐

Nucleic Acids Research, , Vol. , No.

14. Lorenz, R., Bernhart, S. H., Zu Siederdissen, C. H., Tafer, H., Flamm,C., Stadler, P. F., and Hofacker, I. L. (2011) ViennaRNA Package 2.0.

Algorithms for Molecular Biology, (1), 26.15. Gruber, A. R., Bernhart, S. H., Hofacker, I. L., and Washietl, S. (2008)Strategies for measuring evolutionary conservation of RNA secondarystructures. BMC bioinformatics, (1), 122.16. Kimura, M. (1983) The neutral theory of molecular evolution, CambridgeUniversity Press, .17. Mimouni, N. K., Lyngsø, R. B., Grifﬁths-Jones, S., and Hein, J. (2008)An analysis of structural inﬂuences on selection in RNA genes. Molecularbiology and evolution, (1), 209–216.18. Price, N., Cartwright, R. A., Sabath, N., Graur, D., and Azevedo,R. B. (2011) Neutral evolution of robustness in Drosophila microRNAprecursors. Molecular biology and evolution, (7), 2115–2123.19. Pollard, K. S., Salama, S. R., Lambert, N., Lambot, M.-A., Coppens,S., Pedersen, J. S., Katzman, S., King, B., Onodera, C., Siepel, A., etal. (2006) An RNA gene expressed during cortical development evolvedrapidly in humans. Nature, (7108), 167–172.20. Beniaminov, A., Westhof, E., and Krol, A. (2008) Distinctive structuresbetween chimpanzee and humanin a brain noncoding RNA.

RNA, (7),1270–1275.21. Pollard, K. S., Salama, S. R., King, B., Kern, A. D., Dreszer, T., Katzman,S., Siepel, A., Pedersen, J. S., Bejerano, G., Baertsch, R., et al. (2006)Forces shaping the fastest evolving regions in the human genome. PLoSgenetics, (10), e168.22. Barrett, C., Huang, F. W., and Reidys, C. M. (2017) Sequence–structurerelations of biopolymers. Bioinformatics, (3), 382–389.23. Lee, R. C., Feinbaum, R. L., and Ambros, V. (1993) The C.elegans heterochronic gene lin-4 encodes small RNAs with antisensecomplementarity to lin-14. Cell, (5), 843–854.24. Tanzer, A. and Stadler, P. F. (2004) Molecular evolution of a microRNAcluster. Journal of molecular biology, (2), 327–335.25. Nussinov, R., Piecznik, G., Griggs, J. R., and Kleitman, D. J. (1978)Algorithms for Loop Matching.

SIAM J. Appl. Math., (1), 68–82.26. Turner, D. and Mathews, D. H. (2010) NNDB: the nearest neighborparameter database for predicting stability of nucleic acid secondarystructure. Nucl. Acids Res., , 280–282.27. McCaskill, J. S. (1990) The equilibrium partition function and base pairbinding probabilities for RNA secondary structure.

Biopolymers, (6-7),1105–1119.28. Huang, F. W., He, Q., Barrett, C., and Reidys, C. M. (2017) An efﬁcientdual sampling algorithm with Hamming distance ﬁltration. arXiv preprintarXiv:1711.10549, .29. Kozomara, A. and Grifﬁths-Jones, S. (2013) miRBase: annotating highconﬁdence microRNAs using deep sequencing data. Nucleic acidsresearch, (D1), D68–D73.30. Garcia-Martin, J. A., Bayegan, A. H., Dotu, I., and Clote, P. (2016)RNAdualPF: software to compute the dual partition function with sampleapplications in molecular evolution theory. BMC bioinformatics, (1),424.31. Levin, A., Lis, M., Ponty, Y., O’Donnell, C. W., Devadas, S., Berger, B.,and Waldisp¨uhl, J. (2012) A global sampling approach to designing andreengineering RNA secondary structures. Nucleic acids research, (20),10041–10052.32. Bonnet, E., Wuyts, J., Rouz´e, P., and Van de Peer, Y. (2004) Evidence thatmicroRNA precursors, unlike other non-coding RNAs, have lower foldingfree energies than random sequences. Bioinformatics, (17), 2911–2917.33. Freilich, S., Kreimer, A., Borenstein, E., Gophna, U., Sharan, R., andRuppin, E. (2010) Decoupling environment-dependent and independentgenetic robustness across bacterial species. PLoS computational biology, (2), e1000690.34. Letunic, I. and Bork, P. (2006) Interactive Tree Of Life (iTOL): an onlinetool for phylogenetic tree display and annotation. Bioinformatics, (1),127–128.35. Letunic, I. and Bork, P. (2016) Interactive tree of life (iTOL) v3: anonline tool for the display and annotation of phylogenetic and other trees. Nucleic acids research, (W1), W242–W245.36. Federhen, S. (2011) The NCBI taxonomy database. Nucleic acidsresearch, (D1), D136–D143.37. Huynen, M. A., Stadler, P. F., and Fontana, W. (1996) Smoothnesswithin ruggedness: the role of neutrality in adaptation. Proceedings ofthe National Academy of Sciences, (1), 397–401.38. Busch, A. and Backofen, R. (2006) INFO-RNA?a fast approach to inverseRNA folding. Bioinformatics, (15), 1823–1831. 39. Reidys, C. M., Stadler, P. F., and Schuster, P. (1997) Generic propertiesof combinatory maps and neutral networks of RNA secondary structures. Bull. Math. Biol., (2), 339–397.40. Gr¨uner, R., Giegerich, R., Strothmann, D., Reidys, C. M., Weber, J.,Hofacker, I. L., Stadler, P. F., and Schuster, P. (1996) Analysis of RNAsequence structure maps by exhaustive enumeration I. structures ofneutral networks and shape space covering.. Chem. Mon., , 355–374.41. Gr¨uner, R., Giegerich, R., Strothmann, D., Reidys, C. M., Weber, J.,Hofacker, I. L., Stadler, P. F., and Schuster, P. (1996) Analysis of RNAsequence structure maps by exhaustive enumeration II. structures ofneutral networks and shape space covering..

Chem. Mon.,127