LinearDesign: Efficient Algorithms for Optimized mRNA Sequence Design
He Zhang, Liang Zhang, Ziyu Li, Kaibo Liu, Boxiang Liu, David H. Mathews, Liang Huang
He Zhang a,b,♠, Liang Zhang b,♠, Ziyu Li a, Kaibo Liu a, Boxiang Liu a, David H. Mathews c,d,e, and Liang Huang a,b,♣
a Baidu Research USA, Sunnyvale, CA 94089, USA; b School of Electrical Engineering & Computer Science, Oregon State University, Corvallis, OR 97330, USA; c Dept. of Biochemistry & Biophysics; d Center for RNA Biology; e Dept. of Biostatistics & Computational Biology, University of Rochester Medical Center, Rochester, NY 14642, USA
A messenger RNA (mRNA) vaccine has emerged as a promising direction to combat the current COVID-19 pandemic. This requires an mRNA sequence that is stable and highly productive in protein expression, features which have been shown to benefit from greater mRNA secondary structure folding stability and optimal codon usage. However, sequence design remains a hard problem due to the exponentially many synonymous mRNA sequences that encode the same protein. We show that this design problem can be reduced to a classical problem in formal language theory and computational linguistics that can be solved in $O(n^3)$ time, where $n$ is the mRNA sequence length. This algorithm could still be too slow for large $n$ (e.g., $n = 3{,}822$ nucleotides for the spike protein of SARS-CoV-2), so we further developed a linear-time approximate version, LinearDesign, inspired by our recent work, LinearFold. LinearDesign can compute the approximate minimum free energy mRNA sequence for this spike protein in just 11 minutes using beam size $b = 1{,}000$, with only 0.6% loss in free energy change compared to exact search (i.e., $b = +\infty$, which costs 1 hour). We also develop two algorithms for incorporating codon optimality into the design, one based on $k$-best parsing to find alternative sequences and one directly incorporating codon optimality into the dynamic programming. Our work provides efficient computational tools to speed up and improve mRNA vaccine development.
Availability: server: http://rna.baidu.com/lineardesign ; code: to be released on GitHub.
1. Introduction
To defeat the current COVID-19 pandemic, which has already claimed 100,000+ deaths as of early April, a messenger RNA (mRNA) vaccine has emerged as a promising approach thanks to its rapid and scalable production and its non-infectious and non-integrating properties. However, designing an mRNA sequence to achieve high stability and protein production remains a challenging problem. Recently, it was discovered that greater secondary structure folding stability and optimal codon usage synergize to increase protein expression. The design problem can therefore be formulated as finding the mRNA sequence(s) that are good in both secondary structure stability and codon optimality among the exponentially many synonymous sequences that encode the same protein.

Each amino acid is translated by a codon, which is 3 adjacent mRNA nucleotides. For example, the start codon AUG translates into methionine, the first amino acid in any protein sequence. But due to redundancies in the genetic code ($4^3 = 64$ triplet codons for 21 amino acids), most amino acids can be translated from multiple codons. This fact makes the search space of mRNA design increase exponentially with protein length, e.g., for the spike protein of SARS-CoV-2 (the virus that causes the COVID-19 pandemic), which contains 1,273 amino acids (plus the stop codon, which is part of the mRNA but not part of a protein), there are about mRNA candidates. The mRNA design problem, therefore, aims to exploit the redundancies in the genetic code to find more stable and productive mRNA sequences than the wild type in nature.

Our key idea is to show that this design problem can be reduced to a classical notion in formal language theory and computational linguistics, namely the intersection between a Stochastic Context-Free Grammar (SCFG) and a Deterministic Finite Automaton (DFA), which dates back to 1961. Here the SCFG represents the folding free energy model and the DFA represents the set of all possible synonymous mRNA sequences that code for a given protein. While the use of SCFGs in RNA folding is well known, the use of a DFA to encode the mRNA search space and solving the design problem via the intersection of SCFG and DFA are novel. This intersection can be done in $O(n^3)$ time, where $n$ is the mRNA sequence length, but it could still be too slow for large $n$ (e.g., $n = 3 \times (1273 + 1) = 3{,}822$ nucleotides for the spike protein of SARS-CoV-2), so we further developed a linear-time approximate version, LinearDesign, inspired by our recent work, LinearFold. We also develop two algorithms for incorporating codon optimality into the design, one based on $k$-best parsing to find alternative sequences and one directly incorporating codon optimality into the dynamic programming.

After the completion of our algorithm design and the implementation of our Python prototype and C++ code, we became aware of two earlier, independent papers that tackled the same problem of “most stable RNA design” using dynamic programming. The second paper, “CDSfold”, published in 2016, did not cite the first one, published in 2003. Our work is different in three aspects. First, we reduced the mRNA design problem to “CFG-DFA intersection”, a classical problem in formal language theory and computational linguistics, which is more general and can be applied to other scenarios with alternative inputs, whereas the previous algorithms were ad hoc. Second, we integrated codon optimality in the optimization. Third, we further developed a linear-time approximate version with greatly reduced runtime for long sequences and small sacrifices in search quality.
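To make the size of the search space concrete, the following minimal Python sketch counts the synonymous mRNA sequences of a short peptide by multiplying the number of codons available for each amino acid and for the stop codon. The table is the standard genetic code; the function name `count_synonymous` and the one-letter protein strings are illustrative choices and are not part of LinearDesign.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; '*' marks stop.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}
for triplet, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(triplet))


def count_synonymous(protein):
    """Number of mRNA sequences (including a stop codon) encoding `protein`."""
    total = 1
    for aa in protein + "*":
        total *= len(CODONS[aa])
    return total


print(count_synonymous("M"))   # 1 codon for methionine times 3 stop codons
print(count_synonymous("ML"))  # 1 * 6 * 3
```

Because the count is a product over residues, it grows exponentially with protein length, which is why explicit enumeration is infeasible for a protein the size of the spike protein.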
2. LinearDesign Algorithms
The mRNA design problem can be formulated as follows: given a protein sequence $p = p_1 \ldots p_m$ where each $p_i$ is an amino acid, we search, among all possible mRNA sequences $r$ that translate into protein $p$, for the best mRNA sequence $r^\star(p)$, defined as the sequence that has the structure with minimum folding free energy change:

$$r^\star(p) = \operatorname*{argmin}_{r \in \mathrm{RNA}(p)} \mathrm{MFE}(r) \qquad [1]$$

$$\mathrm{MFE}(r) = \min_{s \in \mathrm{structures}(r)} \Delta G^\circ(r, s) \qquad [2]$$

where $\mathrm{RNA}(p) = \{ r \mid \mathrm{protein}(r) = p \}$ is the search space, $\mathrm{structures}(r)$ is the set of all possible secondary structures for RNA sequence $r$, and $\Delta G^\circ(r, s)$ is the free energy change of structure $s$ for RNA $r$ according to an energy model. Note that the mRNA sequence length is $n = 3(m + 1)$ due to the final stop codon, which is not translated into the protein sequence.

Next we first show how to represent the exponentially large search space $\mathrm{RNA}(p)$ compactly via DFAs, and then discuss how to do this argmin search (over the product of two exponentially large spaces) efficiently via dynamic programming, which can be reduced to the CFG intersection with DFA.

♠ These two authors contributed equally. ♣ Corresponding author: [email protected].
arXiv Submission, [q-bio.BM], May

[Figure 1: four panels A to D, each a codon DFA with nodes labeled (position, index) from (0,0) to (3,0) and edges labeled by nucleotides.]
Fig. 1. Four examples of Deterministic Finite Automaton (DFA) representations for amino acids. Each example represents one amino acid. A: DFA for methionine, with only one path (tryptophan is the only other amino acid with a single codon path in the standard genetic code); B: DFA for valine, with alternatives only at the third nucleotide (15 of 21 amino acids are like this, with 2–4 paths); C: DFA for serine, which branches at the start node and has a total of 6 paths (leucine and arginine are also like this); D: DFA for the stop codon.

[Figure 2: the concatenated DFA D(methionine) ∘ D(leucine) ∘ D(STOP).]
Fig. 2. The protein DFA for “methionine leucine” obtained by concatenating small DFAs from individual amino acids, i.e., $D(\text{methionine}) \circ D(\text{leucine}) \circ D(\text{STOP})$. The thick arrows indicate the best mRNA sequence after intersecting this DFA with the context-free grammar (see Fig. 3).

A. DFA representation for amino acid codons and mRNA search space.
In the fields of formal language theory and computational linguistics, the DFA graph is typically used to encode ambiguities in languages. We notice that the ambiguity of codon choice for an amino acid is similar to the language ambiguity problem, and can be represented with a DFA graph too. We first illustrate how to represent the amino acid codons using DFAs.

Informally, a DFA is a directed graph with labeled edges and distinct start and end nodes. For our purpose each edge is labeled by a nucleotide, so that each start-to-end path represents a triplet codon. Formally, a DFA is a 5-tuple $\langle Q, \Sigma, \delta, q_0, F \rangle$ where $Q$ is the set of nodes, $\Sigma$ is the alphabet (here $\Sigma = \{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{U}\}$), $q_0$ is the start node, $F$ is the set of end nodes (in this work the end node is unique, i.e., $|F| = 1$), and $\delta$ is the transition function that takes a node $q$ and a symbol $a \in \Sigma$ and returns the next node $q'$, i.e., $\delta(q, a) = q'$ encodes a labeled edge $q \xrightarrow{a} q'$.

Fig. 1 illustrates how DFAs represent 4 different types of amino acids and their codons. All these DFAs have $(0, 0)$ and $(3, 0)$ as their start and end nodes, respectively. Fig. 1A is the DFA representation for methionine, which has only one path. Fig. 1B is the DFA for the amino acid valine, whose codon has a choice at the third nucleotide (most amino acids are of this type). Fig. 1C represents the most complex case, shared by serine, leucine, and arginine, which have 6 codons each and branch at the start node. Fig. 1D is the DFA for the stop codon. It is special because branching happens at the second step, at node $(1, 0)$.

After building DFAs for each amino acid, we can concatenate them into a single DFA $D(p)$ for a protein sequence $p$, which represents all possible mRNA sequences that translate into that protein:

$$D(p) = D(p_1) \circ D(p_2) \circ \cdots \circ D(p_m) \circ D(\mathrm{STOP})$$

by stitching the end node of each DFA to the start node of the next. The new end node of the protein DFA is $(3(m + 1), 0)$. Fig. 2 gives the DFA of the protein sequence “methionine leucine”.

We also define $\mathrm{out\_edges}(q)$ to be the set of outgoing edges from node $q$, and $\mathrm{in\_edges}(q)$ to be the set of incoming edges:

$$\mathrm{out\_edges}(q) = \{ (q', a) \mid \delta(q, a) = q' \}$$

$$\mathrm{in\_edges}(q) = \{ (q', a) \mid \delta(q', a) = q \}$$

For the DFA in Fig. 2, $\mathrm{out\_edges}((3, 0)) = \{ ((4, 0), \mathrm{U}), ((4, 1), \mathrm{C}) \}$ and $\mathrm{in\_edges}((9, 0)) = \{ ((8, 0), \mathrm{A}), ((8, 0), \mathrm{G}), ((8, 1), \mathrm{A}) \}$.

B. CFG intersection with DFA.
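Before turning to the grammar, the DFA construction of subsection A can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: nodes are `(position, index)` tuples as in Fig. 1, two codon prefixes share a node when the same suffixes can still follow them, and the helper names (`codon_dfa`, `protein_dfa`, `count_paths`) as well as the order in which indices are assigned are choices made here, so node indices may differ from those drawn in Fig. 2.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}
for triplet, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(triplet))


def codon_dfa(codons, offset):
    """Edges of one codon DFA: node -> list of (next_node, nucleotide)."""
    edges = {}
    # A state is identified by the set of suffixes that may still follow it.
    level = {frozenset(codons): (offset, 0)}
    for depth in range(3):
        nxt = {}
        for suffixes, node in level.items():
            for nuc in sorted({s[0] for s in suffixes}):
                rest = frozenset(s[1:] for s in suffixes if s[0] == nuc)
                child = nxt.setdefault(rest, (offset + depth + 1, len(nxt)))
                edges.setdefault(node, []).append((child, nuc))
        level = nxt
    return edges


def protein_dfa(protein):
    """Concatenate codon DFAs; the end node of one is the start of the next."""
    edges = {}
    for i, aa in enumerate(protein + "*"):
        edges.update(codon_dfa(CODONS[aa], 3 * i))
    return edges


def out_edges(edges, q):
    return edges.get(q, [])


def in_edges(edges, q):
    return [(p, nuc) for p, outs in edges.items() for r, nuc in outs if r == q]


def count_paths(edges, q, end):
    """Number of distinct start-to-end paths, i.e. synonymous sequences."""
    if q == end:
        return 1
    return sum(count_paths(edges, r, end) for r, _ in out_edges(edges, q))


dfa = protein_dfa("ML")  # methionine leucine, plus the stop codon
print(out_edges(dfa, (3, 0)))
print(in_edges(dfa, (9, 0)))
print(count_paths(dfa, (0, 0), (9, 0)))
```

The graph stays linear in the protein length while the number of paths it encodes grows exponentially, which is the property the dynamic programs below rely on.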
A stochastic context-free grammar (SCFG) is a context-free grammar in which each rule is augmented with a weight. More formally, an SCFG is a 4-tuple $\langle N, \Sigma, P, S \rangle$ where $N$ is the set of non-terminals, $\Sigma$ is the set of terminals (identical to the alphabet in the DFA, in this case $\{\mathrm{A}, \mathrm{C}, \mathrm{G}, \mathrm{U}\}$), $P$ is the set of weight-associated context-free rewriting rules, and $S \in N$ is the start symbol. Each rule in $P$ has the form $A \xrightarrow{w} (N \cup \Sigma)^*$ where $A \in N$ is a non-terminal that can be rewritten according to this rule into a sequence of non-terminals and terminals ($*$ means repeating zero or more times) and $w \in \mathbb{R}$ is the weight associated with this rule.

[Figure 3: a derivation tree of the intersected grammar drawn over the DFA of Fig. 2, with the dot-bracket structure below the selected path.]
Fig. 3. One of the best derivations of the intersected grammar, demonstrating the path through the SCFG and the DFA (there are multiple best trees due to the simple energy model). The corresponding secondary structure (in dot-bracket format) is shown below the solid arrows. The rest of the DFA (from Fig. 2) is shown in light gray.

SCFGs are used to represent RNA folding. The weight of a derivation (parse tree, or a secondary structure in this case) is the sum of the weights of the productions used in that derivation. For example, for the very simple Nussinov-Jacobson model (which simplifies the energy model to the number of base pairs), we can define this SCFG $G$:

$$S \rightarrow S\ S$$
$$S \xrightarrow{-1} \mathrm{A}\ S\ \mathrm{U} \mid \mathrm{U}\ S\ \mathrm{A} \mid \mathrm{C}\ S\ \mathrm{G} \mid \mathrm{G}\ S\ \mathrm{C} \mid \mathrm{G}\ S\ \mathrm{U} \mid \mathrm{U}\ S\ \mathrm{G}$$
$$S \rightarrow N\ S \mid S\ N \mid N\ N\ N$$
$$N \rightarrow \mathrm{A} \mid \mathrm{C} \mid \mathrm{G} \mid \mathrm{U}$$

Here the first line is the bifurcating case, the second line is the base-pairing case (with weight $-1$; the negative score mirrors the free energy minimization problem), and the third line is the unpaired cases (note that $S \rightarrow N\ N\ N$ makes sure the minimum hairpin length is 3, i.e., no sharp turn is allowed).

The standard RNA secondary structure prediction problem under a Nussinov model can be cast as a parsing problem: given the above SCFG $G$ and an input RNA sequence, what is the minimum-weight derivation in $G$ that can generate the sequence? For example, for input CCAAAGG, the best derivation is shown in Fig. 4.

[Figure 4: a parse tree over the sequence CCAAAGG.]
Fig. 4. The best derivation of sequence CCAAAGG using the Nussinov-Jacobson grammar.

The mRNA design problem is now a simple extension of the above single-sequence folding problem to the case of multiple inputs: instead of finding the minimum energy structure (minimum weight derivation) for a given sequence, we find the minimum energy structure (and its corresponding sequence) among all possible structures for all possible sequences. This can be solved by intersecting the SCFG $G$ with the protein DFA $D$, which results in a bigger SCFG $G' = G \cap D$, and finding the best derivation in $G'$.

In the intersected grammar $G'$, each nonterminal has the form ${}_{q_1}A_{q_2}$, where $A \in N$ is an original nonterminal in $G$ and $q_1$ and $q_2$ are two nodes in $D$; the new start symbol is ${}_{q_0}S_{q_n}$, where $S$ is the original start symbol in $G$ and $q_0$ and $q_n$ are the start and end nodes in $D$. The bifurcation rule $S \rightarrow S\ S$ will become ${}_{q_1}S_{q_3} \rightarrow {}_{q_1}S_{q_2}\ {}_{q_2}S_{q_3}$ for all $(q_1, q_2, q_3)$ node triplets in $D$. We can see that this intersection construction generalizes the widely used CKY algorithm, where the triple of states $(q_1, q_2, q_3)$ generalizes the triple of string indices $(i, k, j)$. The CKY algorithm is a special case of intersection when the DFA only encodes one string, e.g., when the protein is made of amino acids that have only one codon (methionine and tryptophan). Similarly, in computational linguistics, this intersection construction is widely used for word-lattice parsing in speech recognition, where the word lattice is a DFA that accounts for ambiguity in word identity. The terminal rule $N \rightarrow \mathrm{A}$ will become ${}_{q_1}N_{q_2} \rightarrow {}_{q_1}\mathrm{A}_{q_2}$ if and only if there is a labeled edge $q_1 \xrightarrow{\mathrm{A}} q_2$ in $D$, i.e., $\delta(q_1, \mathrm{A}) = q_2$.

This intersected grammar $G'$ will have $|N||Q|^2$ nonterminals and $|P||Q|^3$ rules in the worst case ($|Q|$ is the number of nodes in $D$). This resembles the space and time complexities of the CKY algorithm (where $|Q| = n$). Indeed, intersection is a generalization of CKY from fixed input (RNA folding) to lattice input (mRNA design).

Now we just need to find the best (minimum weight) derivation in $G'$, from which we can read off the best mRNA sequence and its corresponding structure. Fig. 3 shows one of the best derivations for the DFA in Fig. 2.

function BOTTOMUPDESIGN(p)
    n ← 3 · (|p| + 1)    ▷ mRNA length; +1 for the stop codon
    D ← D(p_1) ∘ D(p_2) ∘ ... ∘ D(stop)    ▷ build protein DFA
    best ← hash()    ▷ hash table: from [q_i, q_j] to score
    back ← hash()    ▷ hash table: from [q_i, q_j] to backpointer
    for i = 0 ... (n − 1) do
        for each q_i ∈ nodes(D, i) do
            for each (q_{i+1}, nuc_i) ∈ out_edges(D, q_i) do
                best[q_i, q_{i+1}] ← 0    ▷ singleton: N → A | C | G | U
                back[q_i, q_{i+1}] ← nuc_i
    for l = 2 ... n do
        for i = 0 ... (n − l) do
            j ← i + l
            for each q_i ∈ nodes(D, i) do
                for each q_j ∈ nodes(D, j) do
                    if j − i > 4 then    ▷ pairing (no sharp turn)
                        for each (q_{i+1}, nuc_i) ∈ out_edges(D, q_i) do
                            for each (q_{j−1}, nuc_{j−1}) ∈ in_edges(D, q_j) do
                                if match(nuc_i, nuc_{j−1}) then
                                    score ← best[q_{i+1}, q_{j−1}] − 1    ▷ S → A S U | ... (weight −1)
                                    UPDATE(q_i, q_j, score, (nuc_i, q_{i+1}, q_{j−1}, nuc_{j−1}))
                    for k = (i + 1) ... (j − 1) do    ▷ bifurcation
                        for each q_k ∈ nodes(D, k) do
                            score ← best[q_i, q_k] + best[q_k, q_j]    ▷ S → S S
                            UPDATE(q_i, q_j, score, q_k)
    return best[q_0, q_n], BACKTRACE(q_0, q_n)

Fig. 5. The pseudocode of a simplified bottom-up version of the mRNA design algorithm. See Fig. SI 1 for the UPDATE and BACKTRACE functions.

C. Bottom-up dynamic programming on Nussinov model.
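The bottom-up procedure of Fig. 5 can be sketched as a small runnable Python program. This is a hedged illustration for tiny proteins only, not the authors' code: the DFA builder, the `update` and `backtrace` helpers (which stand in for the functions of Fig. SI 1, not reproduced here), and the tie-breaking order are assumptions of this sketch, and missing table entries are skipped because some node pairs inside a branching codon DFA are not connected.

```python
import math
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}
for triplet, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(triplet))
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def protein_dfa(protein):
    """Edges node -> [(next_node, nuc)]; nodes are (position, index)."""
    edges = {}
    for i, aa in enumerate(protein + "*"):
        level = {frozenset(CODONS[aa]): (3 * i, 0)}
        for depth in range(3):
            nxt = {}
            for suffixes, node in level.items():
                for nuc in sorted({s[0] for s in suffixes}):
                    rest = frozenset(s[1:] for s in suffixes if s[0] == nuc)
                    child = nxt.setdefault(rest, (3 * i + depth + 1, len(nxt)))
                    edges.setdefault(node, []).append((child, nuc))
            level = nxt
    return edges


def bottom_up_design(protein):
    edges = protein_dfa(protein)
    n = 3 * (len(protein) + 1)
    nodes, ins = {n: [(n, 0)]}, {}
    for p, outs in edges.items():
        nodes.setdefault(p[0], []).append(p)
        for q, nuc in outs:
            ins.setdefault(q, []).append((p, nuc))
    best, back = {}, {}

    def update(qi, qj, score, pointer):
        if score < best.get((qi, qj), math.inf):
            best[qi, qj], back[qi, qj] = score, pointer

    for p, outs in edges.items():            # singleton: N -> A | C | G | U
        for q, nuc in outs:
            update(p, q, 0, nuc)
    for l in range(2, n + 1):
        for i in range(n - l + 1):
            j = i + l
            for qi in nodes[i]:
                for qj in nodes[j]:
                    if l > 4:                 # pairing, no sharp turn
                        for qi1, a in edges[qi]:
                            for qj1, b in ins[qj]:
                                if (a, b) in PAIRS and (qi1, qj1) in best:
                                    update(qi, qj, best[qi1, qj1] - 1,
                                           (a, qi1, qj1, b))
                    for k in range(i + 1, j):  # bifurcation
                        for qk in nodes[k]:
                            if (qi, qk) in best and (qk, qj) in best:
                                update(qi, qj, best[qi, qk] + best[qk, qj], qk)

    def backtrace(qi, qj):
        pointer = back[qi, qj]
        if isinstance(pointer, str):          # single unpaired nucleotide
            return pointer, "."
        if len(pointer) == 4:                 # base pair around an inner span
            a, qi1, qj1, b = pointer
            seq, struct = backtrace(qi1, qj1)
            return a + seq + b, "(" + struct + ")"
        left, right = backtrace(qi, pointer), backtrace(pointer, qj)
        return left[0] + right[0], left[1] + right[1]

    return (best[(0, 0), (n, 0)],) + backtrace((0, 0), (n, 0))


score, seq, struct = bottom_up_design("ML")
print(score, seq, struct)
```

The returned score is minus the number of base pairs in the chosen structure, and the returned sequence is always a valid path through the protein DFA, so it translates back to the input protein.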
We describe how to do dynamic programming based on CFG intersection with DFA. First, we use bottom-up dynamic programming on the Nussinov-Jacobson energy model as an introduction.

Fig. 5 gives the pseudocode for this simplified version. We first build the given protein's DFA graph and initialize two hash tables, best to store the best score for each state and back to store the best backpointer for each state. The base cases (singleton rule) are $\mathit{best}[q_i, q_{i+1}] \leftarrow 0$ and $\mathit{back}[q_i, q_{i+1}] \leftarrow \mathit{nuc}_i$ for each state $(q_i, q_{i+1})$, where $\mathit{nuc}_i$ is the label of the edge between $q_i$ and $q_{i+1}$. Next, for each state $(q_i, q_j)$ the algorithm goes through the pairing rule and the bifurcation rules, and updates the state if a better score is found. After filling out the hash tables bottom-up, we can backtrace the best mRNA sequence stored with the backpointers. See Fig. SI 1 for details of the UPDATE and BACKTRACE functions.

D. Left-to-right dynamic programming with beam pruning.
The algorithm based on bottom-up dynamic programming runs in cubic time; however, it is still slow for long sequences. Inspired by our previous work, LinearFold, we further developed a linear-time approximation algorithm for mRNA design. We apply beam pruning, a classical pruning technique, to significantly narrow down the search space without sacrificing too much search quality.

function LINEARDESIGN(p, b)    ▷ b is beam size
    for j = 1 ... n do
        for each q_{j−1} ∈ nodes(D, j − 1) do
            for each q_i such that [q_i, q_{j−1}] ∈ best do
                for each (q_j, nuc_{j−1}) ∈ out_edges(D, q_{j−1}) do
                    backpointer ← (q_{j−1}, nuc_{j−1})
                    UPDATE(q_i, q_j, best[q_i, q_{j−1}], backpointer)    ▷ S → S N
                    if j − (i − 1) > 4 then
                        for each (q_{i−1}, nuc_{i−1}) ∈ in_edges(D, q_i) do
                            if match(nuc_{i−1}, nuc_{j−1}) then    ▷ S → S A S U | ... (weight −1)
                                for each q_k such that [q_k, q_{i−1}] ∈ best do
                                    score ← best[q_k, q_{i−1}] + best[q_i, q_{j−1}] − 1
                                    backpointer ← (q_{i−1}, nuc_{i−1}, q_i, q_{j−1}, nuc_{j−1})
                                    UPDATE(q_k, q_j, score, backpointer)
        BEAMPRUNE(best, j, b)    ▷ choose top-b among all (q_i, q_j)'s
    return best[q_0, q_n], BACKTRACE(q_0, q_n)

Fig. 6. The pseudocode of the (simplified) LinearDesign algorithm. The first 10 lines are the same as in BOTTOMUPDESIGN (see Fig. 5). See Fig. SI 1 for the UPDATE, BACKTRACE2, and BEAMPRUNE functions.

Fig. 6 gives the pseudocode of the simplified LinearDesign algorithm for the Nussinov model, based on left-to-right dynamic programming and beam pruning. LinearDesign replaces bottom-up dynamic programming with left-to-right parsing. At each step $j$ (the $j$th position of the mRNA sequence), we only keep the top $b$ states with the lowest free energy and prune the less promising states, since they are unlikely to lead to the optimal sequence. Here $b$, the beam size, is a user-adjustable parameter that balances runtime and search quality. This approximation leads to a significant runtime reduction from $O(n^3)$ to $O(nb^2)$. Notice that we use $b = 100$ as the default in LinearFold, but in LinearDesign we usually use a larger $b$ because the search space is much larger.

E. Implementation on Turner model.
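Before moving on to the Turner model, the pruning step of subsection D can be illustrated with a short Python sketch. The actual BEAMPRUNE function is given in Fig. SI 1, which is not reproduced here, so the version below is only an assumption about its simplest form: among all states `(q_i, q_j)` whose right node sits at position `j`, keep the `b` lowest scores and delete the rest. The layout of `best` (a dict keyed by pairs of `(position, index)` nodes) is likewise an illustrative choice.

```python
import heapq


def beam_prune(best, j, b):
    """Keep only the b lowest-scoring states (q_i, q_j) ending at position j."""
    at_j = [(score, key) for key, score in best.items() if key[1][0] == j]
    if len(at_j) <= b:
        return
    keep = {key for _, key in heapq.nsmallest(b, at_j)}
    for _, key in at_j:
        if key not in keep:
            del best[key]


# Toy table: three states end at position 2, one ends at position 1.
best = {
    ((0, 0), (2, 0)): -1,
    ((1, 0), (2, 0)): 0,
    ((1, 1), (2, 1)): -2,
    ((0, 0), (1, 0)): 0,
}
beam_prune(best, 2, 2)
print(sorted(best.items()))
```

Since at most `b` states survive at each of the `n` steps, the work per step no longer depends on the sequence length, which is where the linear dependence on `n` comes from.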
Our real system uses left-to-right dynamic programming with beam pruning on the Turner nearest neighbor free energy model. We implement the thermodynamic parameters following Vienna RNAfold, except for the dangling ends and special hairpins. Dangling ends refer to stabilizing interactions for multiloops and external loops, which require knowledge of the nucleotide sequence outside of the state $(q_i, q_j)$. Though they could be integrated in LinearDesign, the implementation becomes more involved. Special hairpins are hairpin loop sequences of 3, 4, or 6 unpaired nucleotides whose folding free energies are stored in lookup tables, rather than estimated from features as for other sequences. Special hairpins can also be integrated in LinearDesign with some preprocessing. We will include both dangling end and special hairpin stabilities in future versions.

F. MFE and CAI joint optimization.
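As a numeric preview of the edge costs developed in this subsection, the sketch below computes $\log w$ for the serine codons of Fig. 7 (natural logarithm of each codon frequency divided by the best frequency) and then moves each cost onto the earliest possible edge, in the spirit of Fig. 7C. The helper name `early_edge_costs` and the keying of edges by codon prefix are illustrative assumptions, and the AGU frequency of 0.15 is inferred here from the −0.47 edge weight in Fig. 7B rather than read from the codon table.

```python
import math

# Serine codon frequencies from Fig. 7A; the AGU value is an assumption
# inferred from its edge weight of -0.47 in Fig. 7B.
FREQ = {"UCA": 0.15, "UCC": 0.22, "UCG": 0.06, "UCU": 0.18,
        "AGC": 0.24, "AGU": 0.15}
best_freq = max(FREQ.values())
log_w = {codon: math.log(f / best_freq) for codon, f in FREQ.items()}


def early_edge_costs(log_w):
    """Edge cost per codon prefix, charged as early in the path as possible.

    The cost of an edge is the best log w reachable through it, minus what
    the earlier edges on the same path have already charged.
    """
    costs = {}
    for codon in log_w:
        charged = 0.0
        for k in (1, 2, 3):
            prefix = codon[:k]
            reachable = max(v for c, v in log_w.items() if c.startswith(prefix))
            costs[prefix] = reachable - charged
            charged = reachable
    return costs


costs = early_edge_costs(log_w)
for codon in sorted(log_w):
    path = [round(costs[codon[:k]], 2) for k in (1, 2, 3)]
    print(codon, round(log_w[codon], 2), path)
```

Along every path the edge costs telescope to the codon's $\log w$, so exact search is unaffected, while under beam pruning a poor codon choice is penalized as soon as its first distinguishing edge is taken.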
Since the codon adaptation index (CAI) is also important for mRNA functional half-life, we consider optimizing MFE and CAI jointly. We add CAI as an additive regularization term in the objective function:

$$r^\star(p) = \operatorname*{argmin}_{r \in \mathrm{RNA}(p)} \Big( \mathrm{MFE}(r) - (m + 1)\,\lambda \log \mathrm{CAI}(r) \Big)$$

$$= \operatorname*{argmin}_{r \in \mathrm{RNA}(p)} \Big( \mathrm{MFE}(r) - (m + 1)\,\lambda \log \Big( \prod_{i=1}^{m+1} w_i(r) \Big)^{\frac{1}{m+1}} \Big)$$

$$= \operatorname*{argmin}_{r \in \mathrm{RNA}(p)} \Big( \mathrm{MFE}(r) - \lambda \sum_{i=1}^{m+1} \log w_i(r) \Big)$$

where $m$ is the protein length, $\log w_i(r)$ is the measurement of deviation from the optimal codon (0 is optimal) for the $i$th amino acid (given an mRNA candidate), and $\lambda$ is a hyperparameter that balances MFE and CAI.

[Figure 7, panel A: codon table of serine]
Codon   freq.   log CAI
UCA     0.15    -0.47
UCC     0.22    -0.09
UCG     0.06    -1.39
UCU     0.18    -0.29
AGC     0.24     0
[Figure 7, panel B: serine DFA with edge weights. From (0,0): U: 0, A: 0. Second edges: C: 0, G: 0. Last edges from (2,0): A: -0.47, C: -0.09, G: -1.39, U: -0.29; from (2,1): C: 0, U: -0.47.]
[Figure 7, panel C: serine DFA with improved edge weights. From (0,0): U: -0.09, A: 0. Second edges: C: 0, G: 0. Last edges from (2,0): A: -0.38, C: 0, G: -1.30, U: -0.20; from (2,1): C: 0, U: -0.47.]
Fig. 7. The DFA representation integrating CAI as edge weight. A: Codon table of "serine". B: DFA graph of "serine" with CAI as edge weight; weights only differ at the last edges. C: Improved edge weights that differ earlier.

We integrate this equation into the LinearDesign dynamic programming, i.e., each edge of the DFA graph has a cost such that the combined cost of traversing a local path (choosing a codon) equals $\log w_i$. Each edge cost is the “best” over the paths (i.e., codons) that use this edge.

Fig. 7 uses serine as an example, showing how to integrate CAI as edge cost in the DFA graph. Each of the 6 codons of serine, listed in Fig. 7A with their CAI, has a corresponding path in the DFA graph (see Fig. 7B). For example, codon UCU has a CAI of 0.18, while the best codon AGC has a CAI of 0.24. The edge costs on the "UCU path" are 0, 0, and $\log(0.18 / 0.24) = -0.29$; therefore, $\log w_i$ of UCU is $-0.29$. The best codon AGC has a $\log w_i$ of 0, meaning that choosing AGC does not incur a cost.

Because LinearDesign does left-to-right dynamic programming with beam pruning, and at each step $j$ the states with worse scores are pruned, it is better to incur edge costs as early as possible in a path, which ensures that the states with better CAI paths are more likely to survive each step. Fig. 7C rearranges the edge costs to fit LinearDesign better. Note that the different edge costs in Fig. 7B and C do not affect exact search.

G. The top k best mRNA candidates.
An alternative solution for finding mRNA candidates with both stable secondary structure and high CAI is to provide the top $k$ mRNA candidates with the lowest MFE, and post-process them by features such as CAI. Although this is not as principled as the algorithm described in subsection 2F, this solution is easier to implement, and is more flexible in the sense that users are free to add other customized filters.

Inspired by the $k$-best parsing algorithm, we developed an efficient algorithm to find suboptimal candidates in sorted order. During the dynamic programming process (forward phase), instead of saving only the single best prestate as the backpointer for each state, we store the alternative prestates that all transit to this state. Then in the backtrace process (backward phase), starting from the last state, we query the second best. The answer is one of two cases: (1) the second best is from another prestate $S'$; or (2) the second best is from the same prestate $S$. In the former case we can find the second best by backtracing the best path going through prestate $S'$, while in the latter case we keep querying for the second best from the prestate $S$. Recursively, we can compute as many solutions as needed.

To our knowledge, our algorithm is the first one to output suboptimal results in the mRNA design problem.
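The lazy querying described above can be illustrated on its smallest building block. The sketch below is not the authors' algorithm; it shows one standard ingredient of $k$-best parsing under the assumption that a state combines two sub-results (as in the bifurcation rule) whose candidate scores are already sorted: a priority queue returns the $k$ best sums in order without enumerating all combinations. The function name `k_best_sums` is hypothetical.

```python
import heapq


def k_best_sums(left, right, k):
    """Return the k smallest left[i] + right[j] in ascending order.

    `left` and `right` must be sorted ascending (lower score is better).
    """
    if not left or not right:
        return []
    heap = [(left[0] + right[0], 0, 0)]   # the best pair is always (0, 0)
    seen = {(0, 0)}
    result = []
    while heap and len(result) < k:
        score, i, j = heapq.heappop(heap)
        result.append(score)
        # The next-best candidates differ from (i, j) in exactly one index.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(left) and nj < len(right) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (left[ni] + right[nj], ni, nj))
    return result


print(k_best_sums([-5, -3, 0], [-4, -1], 4))
```

Only the neighbors of already returned combinations are ever scored, so asking for a few suboptimal candidates costs little more than finding the single best one.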
Two previous algorithms explored searching for suboptimal secondary structures in the RNA folding problem. Zuker's algorithm finds diverse suboptimal secondary structures, and Wuchty's algorithm finds all secondary structures within a given free energy gap. Our algorithm is different from these two in the sense that: (1) ours is for the mRNA design problem; and (2) ours can output all top $k$ best candidates in sorted order.

Combining the $k$-best algorithm and linearization, LinearDesign is able to quickly design a large set of mRNA candidates, which provides a set of alternative designs for follow-up wet lab experiments.

H. Less secondary structure at 5'-end leader region.
Some studies have shown that the protein translation level drops if the 5'-end leader region has more secondary structure. Considering this practical issue, LinearDesign can be used to design an ORF with an absence of base pairing in the 5'-end leader region by utilizing a simple strategy.

Instead of designing the most stable sequence for the whole coding region, we leave the 5'-end leader region (e.g., the first 15 nt) unchanged from the wildtype, since the wildtype usually has less structure in this region. Then we use LinearDesign for the rest of the coding region. Because the designed region will be composed of strong base pairs (generally maximizing GC content), a global structure change is unlikely when it is concatenated with the wildtype 5'-end leader region, which we observe is often depleted of GC content. Refolding the concatenated sequence with secondary structure prediction tools, we get its corresponding secondary structure and observe that the first 15 nucleotides are unpaired.

If a wildtype sequence is not available, we can alternatively enumerate all possible sequences in the 5'-end leader region. Because each amino acid has 3 codons on average and the start codon is fixed, the enumeration space of the first 15 nt in the 5'-end region is small ($3^4 = 81$), which makes the enumeration feasible.
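The enumeration of leader sequences can be sketched as follows. This is a minimal illustration, assuming the standard genetic code and a hypothetical helper `leader_candidates`; the peptide MFVFL is used purely as an example input. The figure of $3^4 = 81$ in the text is an average-based estimate, so the exact count for a particular leader depends on its residues.

```python
from itertools import product

BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}
for triplet, aa in zip(product(BASES, repeat=3), AMINO):
    CODONS.setdefault(aa, []).append("".join(triplet))


def leader_candidates(protein, n_codons=5):
    """All synonymous codings of the first `n_codons` residues (15 nt)."""
    choices = [CODONS[aa] for aa in protein[:n_codons]]
    return ["".join(codons) for codons in product(*choices)]


candidates = leader_candidates("MFVFL")
print(len(candidates))  # 1 * 2 * 4 * 2 * 6 = 96
print(candidates[:3])
```

Each candidate could then be concatenated with the designed remainder and refolded with a structure prediction tool to pick a leader that stays unpaired; that refolding step is outside this sketch.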
3. Results
A. Efficiency and scalability.
To estimate the runtime complexity of LinearDesign, we use 100 protein sequences from Uniprot following CDSfold, with lengths from 78 to 2,828 amino acids (not including the stop codon). We found that there are three sequences whose lengths reported in CDSfold do not match the ones currently in Uniprot, so we removed these sequences, resulting in a dataset with 97 diverse protein sequences.

We compared LinearDesign in exact (infinite beam size) and approximate modes (beam size $b = 1{,}000$ and $b = 100$). Because the CDSfold code is not currently available, we directly use the runtime reported in the CDSfold paper as a comparison, which allows us to compare the computational complexity. Note that the CDSfold results and our results were run on different machines. We run LinearDesign on a machine with 2 Intel Xeon E5-2660 v3 CPUs (2.60 GHz), while CDSfold was run on the Chimera cluster system at AIST, which was reported in the paper as 176 Intel Xeon E5550 CPUs (2.53 GHz).

[Figure 8: two panels plotting runtime in hours against mRNA (CDS) sequence length. A: CDSfold. B: LinearDesign with b = +∞, b = 1000, and b = 100.]
Fig. 8. Runtime comparison between CDSfold and LinearDesign on the Uniprot dataset. A: the runtime of CDSfold as reported by Terai et al. B: the runtime of LinearDesign (exact mode $b = +\infty$, and approximation mode $b = 1{,}000$ and $b = 100$) run by us on an Intel Xeon E5-2660 v3 CPU.

[Figure 9: three panels. A: free energy of LinearDesign with b = 1,000 (kcal/mol) against LinearDesign with b = +∞ (kcal/mol). B: free energy gap % against mRNA (CDS) sequence length for b = 100 and b = 1000. C: free energy gap % against beam size for Spike and EGFP.]
Fig. 9. Search quality of LinearDesign with beam pruning is good. A: LinearDesign with $b = 1{,}000$ has a small free energy gap compared with exact search for 97 Uniprot sequences. B: The percentage of the LinearDesign free energy gap changes linearly with mRNA sequence length for both $b = 100$ and $b = 1{,}000$ on the Uniprot dataset. C: The percentage of the LinearDesign free energy gap changes with beam size for the spike protein of SARS-CoV-2 (in purple) and EGFP (in green).

Fig. 8 shows the runtime plots. We observe that CDSfold has an estimated runtime complexity of O ( n . ), while LinearDesign (exact mode $b = +\infty$) runs in a complexity of O ( n . ). Both CDSfold and LinearDesign ($b = +\infty$) have nearly cubic runtime, but LinearDesign ($b = +\infty$) is under the exact $O(n^3)$ because we use the "jump" trick as in LinearFold, i.e., we jump to the next possible nucleotide $\mathit{nuc}_j$ that can pair with $\mathit{nuc}_i$ (with the help of preprocessing), instead of checking all positions one by one. In terms of time cost, CDSfold takes 31 hours for the longest sequence in the dataset (8,484 nt), while LinearDesign ($b = +\infty$) takes 11 hours. With beam pruning, the runtime of LinearDesign reduces to linear complexity as expected. With beam size $b = 1{,}000$, the runtime is O ( n . ), and it further reduces to O ( n . ) with $b = 100$. LinearDesign finishes the longest sequence design in 35 minutes with $b = 1{,}000$, and in only 2 minutes with $b = 100$, which is × speed up compared with the LinearDesign exact search.

B. Search quality of linear-time approximation.
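As a side note to the runtime analysis of subsection A, an empirical complexity estimate of the form $O(n^x)$ is commonly obtained as the slope of a least-squares line through the points $(\log n, \log t)$. The sketch below shows that computation in Python. It is an assumption that the exponents in the text were fitted this way, and the timings used are synthetic values generated inside the example, not measurements from the paper.

```python
import math


def fit_exponent(lengths, runtimes):
    """Least-squares slope of log(runtime) against log(length)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(t) for t in runtimes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic timings (not from the paper): exactly cubic and exactly linear.
lengths = [300, 600, 1200, 2400, 4800]
cubic = [2e-9 * n ** 3 for n in lengths]
linear = [5e-4 * n for n in lengths]
print(round(fit_exponent(lengths, cubic), 2))
print(round(fit_exponent(lengths, linear), 2))
```

On real measurements the fitted slope is only an estimate over the tested length range, which is why an exact-mode exponent slightly below 3 is compatible with a cubic worst case.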
Since the LinearDesign algorithm is significantly faster than its exact counterpart, we envision that the LinearDesign algorithm will be used for long sequences. To ensure the quality of the approximation used in LinearDesign (with beam pruning mode), we compared the energy gap between the exact ($b = +\infty$) and approximate ($b = 1{,}000$) algorithms. For this analysis, we used the same dataset as in subsection 3A.

Fig. 9 shows the folding free energy differences between exact search and approximation. Fig. 9A compares the free energy changes of the mRNA sequences designed with exact search and with the $b = 1{,}000$ approximation. The x-axis is the free energy of exact search, while the y-axis corresponds to the free energy of the approximation. We see that all points are on or close to the diagonal, which confirms that the folding free energy differences are 0 or small. Fig. 9B shows that the free energy difference increases linearly with mRNA sequence length. The y-axis is the percentage of free energy change gap, which is the free energy change gap ($\Delta\Delta G^\circ$) divided by the total free energy change of the MFE structure ($\Delta G^\circ$). The percentage of free energy change gap is small for all sequences in the dataset. For $b = 1{,}000$, all sequences have gaps within 1%. Even for the longest sequence (8,484 nt), the gap is 0.8%. For $b = 100$, most gaps are within 5%, and the largest gap is 7%. We also investigate the percentage of free energy change gap against beam size for two specific protein sequences, the spike protein of SARS-CoV-2 and EGFP (GenBank KM042177.1), in Fig. 9C. The purple curve shows the result of the spike protein. Starting with a small beam size, $b = 20$, the gap is 10.6%. With increasing beam size, the gap shrinks quickly to less than 6% at $b = 100$. Further increasing $b$ up to 500, the gap drops to 1%. With a beam size of $b = 2{,}000$, the gap drops to 0, which indicates that the approximate result is the same as the exact search.
The EGFP result (the green curve) has the same shape as the spike protein, but the gap drops faster, down to less than 2% at $b = 100$. This is because EGFP is shorter (239 amino acids), thus the approximation is close to the exact result even with a smaller beam size. At $b = 200$, the gap increases by 0.25% compared with $b = 100$. This happens because more states are kept when enlarging the beam size, among which states with higher scores at length $j$ survive. Their extensions to longer lengths (like offspring) beat other states and fill the beams, but all these states become worse at the end, resulting in a small drop in search quality. The fluctuation happens in this sequence, but the jump is small and the gap quickly decreases to 0 at $b = 400$ and above.

[Figure 10: scatter plot of CAI against MFE (kcal/mol). Legend: codon-biased random, pure random, wildtype, CAI-greedy design, b = +∞ (exact), b = 1,000, b = 100, b = 20; points along the two joint-optimization curves are labeled with their λ values.]
Fig. 10. Two-dimensional comparisons (MFE and CAI) between the wildtype mRNA sequence (the light-red circle), random sequences (the blue cloud and the orange cloud) and our designed sequences (the blue spots, the light-blue curve and the magenta curve) on the spike protein of SARS-CoV-2. Better performance is toward the upper left of the graph, with low MFE and high CAI. Our designed sequences include three parts: (1) the sequences with the lowest MFE in exact mode and in approximation mode ($b = 1{,}000$, $b = 100$ and $b = 20$, respectively), which are optimized by MFE only (shown as dark-blue spots and labeled by beam size); (2) sequences that are jointly optimized by MFE and CAI (shown as the light-blue curve for exact mode and the magenta curve for $b = 1{,}000$); and (3) top , , best sequences (dark-red cloud next to the $b = +\infty$ blue spot). We also show the CAI-greedy design result as a grey point.

C. Example for the coding mRNA of the spike protein.
The spike protein of SARS-CoV-2, which has 1,273 amino acid residues, is the target of mRNA vaccines (https://clinicaltrials.gov/ct2/show/NCT04283461). Therefore, we use the spike protein of SARS-CoV-2 as an example, and compare our designed sequences with the wildtype sequence and randomly generated sequences. We use the mRNA sequence from the reference genome of SARS-CoV-2 ( ) as the wildtype sequence, which contains 3,822 nucleotides (including the stop codon). Additionally, we use two different strategies to generate random sequences (5,000 sequences for each strategy) as another baseline. The first strategy, named "pure random", chooses a codon uniformly at random (with equal probabilities) for each amino acid in the spike protein and forms an mRNA sequence by concatenation. The other strategy, called "codon-biased random", chooses each codon with probability proportional to its usage frequency. We also run LinearDesign (in both exact mode and approximation mode) to evaluate the best sequences we can achieve. Since previous studies show that both the folding free energy change of the mRNA secondary structure and codon optimality influence vaccine effectiveness, we perform a two-dimensional comparison between mRNA sequences, using MFE and the codon adaptation index (CAI).

Fig. 10 shows the results. The wildtype sequence, denoted by a red circle, folds into a structure with a minimum free energy change of -967.80 kcal/mol and has a low CAI of 0.655. Most "pure random" sequences, denoted by the blue cloud, have free energy changes similar to that of the wildtype (-987.90 kcal/mol on average), but with higher CAI (0.671 on average). This may be because SARS-CoV-2 only recently began infecting human cells and has not accumulated enough mutations to optimize for human codons. Compared with the wildtype and "pure random" sequences, "codon-biased random" sequences, denoted by the orange cloud, have much higher CAI (0.768 on average) and a slight improvement in MFE (-1063.23 kcal/mol on average). We also notice that both "pure random" and "codon-biased random" sequences are packed in small cloud-shaped regions. This is because the search space of possible mRNAs is extremely large and most of the random sequences have similar MFE and CAI.

[Figure 11: secondary-structure drawings annotated with nucleotide sequences, panels A to F. Panel labels: A, Wildtype, -967.8 kcal/mol; B, LinearDesign (b = 1,000), -2463.8 kcal/mol; C, LinearDesign (b = +∞), -2477.7 kcal/mol; E, LinearDesign (b = +∞) with less structure at the 5’ leader region, -2473.7 kcal/mol. Panels D and F are zoom-ins of C and E.]

Fig. 11. The secondary structures of the wildtype and designed mRNA sequences that translate into the spike protein of SARS-CoV-2. A: Wildtype sequence and its secondary structure. B: The best mRNA sequence (with lowest free energy change) designed by LinearDesign with b = 1,000. C: The best mRNA sequence designed by LinearDesign with b = +∞, i.e., an exact search. D: A zoom-in of C showing the 5’ end, which is base paired with the 3’ end. E: The best mRNA sequence designed by LinearDesign with b = +∞, while using the wildtype sequence for the first 15 nucleotides at the 5’ end. These 15 nucleotides do not base pair with the designed sequence and therefore remain unstructured. F: A zoom-in of E showing just the 5’ end and the 3’ end.

On the left of Fig. 10 (with much lower MFE) we plot our designed sequences. The blue points are optimized by MFE only. The leftmost one is the sequence designed in exact search mode, which has the lowest MFE of -2,477.70 kcal/mol and a CAI of 0.726. The MFE gap between our best designed sequence and the wildtype, as well as the random sequences, is large (more than 1,300 kcal/mol). With only 0.56% MFE loss relative to the exact search sequence, the sequence designed with beam size b = 1,000 achieves an MFE of -2,463.8 kcal/mol and a higher CAI of 0.751. Compared to the exact mode, which takes 1 hour to design the sequence, the approximation with b = 1,000 takes only 11 minutes, a roughly fivefold speed-up. For b = 100 and b = 20, the MFEs are still lower than -2,200 kcal/mol, with CAI at around 0.735 and 0.725, respectively.
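The two random baselines and the CAI axis used in this comparison can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the CODON_FREQ table covers only three amino acids with placeholder frequencies (not a real human codon-usage table), the stop codon is omitted, and the function names are hypothetical. CAI follows the usual definition as the geometric mean of each codon's frequency relative to the most frequent synonymous codon.

```python
import math
import random

# Hypothetical, illustrative codon frequencies for three amino acids only.
CODON_FREQ = {
    "M": {"AUG": 1.00},
    "F": {"UUU": 0.45, "UUC": 0.55},
    "V": {"GUU": 0.18, "GUC": 0.24, "GUA": 0.11, "GUG": 0.47},
}


def pure_random(protein, rng):
    """Pick each codon uniformly among the synonymous codons."""
    return "".join(rng.choice(sorted(CODON_FREQ[aa])) for aa in protein)


def codon_biased_random(protein, rng):
    """Pick each codon with probability proportional to its usage frequency."""
    out = []
    for aa in protein:
        codons = sorted(CODON_FREQ[aa])
        weights = [CODON_FREQ[aa][c] for c in codons]
        out.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(out)


def cai_greedy(protein):
    """Pick the most frequent codon for every amino acid."""
    return "".join(max(CODON_FREQ[aa], key=CODON_FREQ[aa].get) for aa in protein)


def cai(protein, mrna):
    """Geometric mean of the relative adaptiveness of each codon."""
    logs = []
    for i, aa in enumerate(protein):
        codon = mrna[3 * i:3 * i + 3]
        table = CODON_FREQ[aa]
        logs.append(math.log(table[codon] / max(table.values())))
    return math.exp(sum(logs) / len(logs))


rng = random.Random(0)      # fixed seed so the sketch is reproducible
peptide = "MFVFV"           # toy peptide, not the spike protein
sample = codon_biased_random(peptide, rng)
print(sample, round(cai(peptide, sample), 3))
```

By construction, the CAI-greedy sequence scores exactly 1 under this definition, and any other synonymous choice scores lower.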
Our designed sequences, from both exact and approximate search, are much better than the random ones and the wildtype in terms of MFE.

We also show the top , , suboptimal sequences for exact mode (the dark-red cloud to the right of the exact design sequence). The MFEs of these sequences are very close to the optimal one, with free energy gaps within 20 kcal/mol. Some of the sequences have higher CAI, e.g., some have a CAI higher than 0.730. This shows that our k-best algorithm can be used to select sequences with low MFE and relatively higher CAI as vaccine candidates.

Further, we show the results of joint optimization of MFE and CAI. The light-blue curve is the joint optimization design using exact mode. Each point on the curve corresponds to a different λ, which balances the importance of MFE and CAI. The curve lies at the top-left of the figure, indicating that the sequences on it have both stable secondary structures and higher CAI. In fact, this curve is the accessible boundary of all possible sequences, i.e., no sequence can reach the region beyond (to the top-left of) the curve. The points on the curve are good candidates for an mRNA vaccine. For example, the point with λ = 100 has a free energy change of -2,414.6 kcal/mol and a CAI of 0.823, which is only 2.5% away from the optimal-MFE sequence but with a 0.097 increase in CAI. We observe that with λ > , the sequences on the curve have better CAI than codon-biased random sequences. Shifted slightly to the right of the light-blue curve, the magenta curve shows the results of joint optimization using b = 1,000. This curve shows that the approximation quality is good with b = 1,000. We also designed a CAI-greedy sequence, which greedily chooses the best codon for each amino acid, leading to a special sequence with CAI = 1. We see that both curves point toward the CAI-greedy design, confirming that the designed sequences achieve better CAI but sacrifice MFE with increasing λ, and reach the CAI-greedy design with a large λ (e.g., λ = 3,000).

Note that the MFEs of the wildtype, the CAI-greedy design and the random sequences are calculated by Vienna RNAfold with "-d0" (disable stabilizing interactions for multiloops and external loops), to make fair comparisons with our designed sequences. All the sequences can be refolded using RNAfold without "-d0", by which the points will shift to the left.

Figure 11 shows the secondary structures of the wildtype sequence, our designed sequences with b = 1,000 and b = +∞, as well as a designed sequence with less structure at the 5’-end leader region. We can see that the secondary structure of the wildtype (Fig. 11A) has a large number of loops, whereas our designed sequences (Fig. 11B and C) have longer helices and fewer loops, which makes their structures more stable. The designed sequence with b = 1,000 (Fig. 11B) has a free energy change similar to that of the one with b = +∞ (Fig. 11C), but it has a multiloop in the middle.

Additionally, we investigate the effectiveness of our strategy for designing less structure at the 5’-end leader region. Fig. 11E is the whole secondary structure of the sequence designed with the goal of leaving the 5’ end unpaired (the first 15 nucleotides kept identical to the wildtype and the remaining nucleotides designed by LinearDesign), and we zoom in on the 5’ end in Fig. 11F. As a comparison, we also zoom in on the 5’ end of the sequence designed without this constraint in Fig. 11D. This demonstrates that our strategy can keep the 5’ end unstructured, whereas designing the complete sequence results in base pairing at the 5’ end.
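The MFE/CAI trade-off discussed in this subsection can be checked directly on the values quoted in the text. The Python sketch below is illustrative only: the Design type and the dominates and pareto_front helpers are hypothetical names, not part of LinearDesign, and only four reported (MFE, CAI) points are used. One design dominates another if it is at least as good on both axes (lower MFE, higher CAI) and strictly better on one.

```python
from typing import List, NamedTuple


class Design(NamedTuple):
    name: str
    mfe: float  # kcal/mol; lower means a more stable structure
    cai: float  # codon adaptation index; higher is better


def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on both axes and better on at least one."""
    no_worse = a.mfe <= b.mfe and a.cai >= b.cai
    better = a.mfe < b.mfe or a.cai > b.cai
    return no_worse and better


def pareto_front(designs: List[Design]) -> List[Design]:
    """Keep the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


# (MFE, CAI) values as reported in the text for the spike protein.
DESIGNS = [
    Design("wildtype", -967.8, 0.655),
    Design("exact, MFE only", -2477.7, 0.726),
    Design("b=1,000, MFE only", -2463.8, 0.751),
    Design("exact, lambda=100", -2414.6, 0.823),
]

front = pareto_front(DESIGNS)
print([d.name for d in front])
```

On these four points the wildtype is dominated by every designed sequence, while the three designed sequences are mutually non-dominated: each step toward higher CAI costs some folding stability.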
4. Discussion
The mRNA design problem is of utmost importance, especially for mRNA vaccines during the current COVID-19 pandemic. We reduced this problem to a classical problem in formal language theory and computational linguistics, namely the intersection of a CFG (encoding the energy model) with a DFA (encoding the mRNA search space). This reduction provides a natural O(n^3)-time CKY-style bottom-up algorithm, where n is the mRNA sequence length, but this algorithm might still be too slow for long proteins such as the spike protein of SARS-CoV-2, a promising candidate for an mRNA vaccine. Inspired by our recent work on LinearFold, we then developed a left-to-right algorithm, LinearDesign, which employs beam search to reduce the runtime to O(n), at the cost of exact search. LinearDesign is substantially faster than exact search (i.e., b = +∞) and suffers only a small loss in folding free energy. For example, for this spike protein, LinearDesign finishes in 11 minutes while exact search takes 1 hour, and the free energy difference is only 0.6%. We also developed two algorithms for incorporating codon optimality (CAI) into the design, one using k-best algorithms to compute suboptimal sequences and one directly integrating CAI into the dynamic programming. Our work provides efficient computational tools to speed up and improve mRNA vaccine development.

ACKNOWLEDGMENTS.

We thank Rhiju Das for introducing the mRNA design problem to us. D.H.M. is supported by National Institutes of Health grant R01GM076485.
References

David M Mauger, B Joseph Cabral, Vladimir Presnyak, Stephen V Su, David W Reid, Brooke Goodman, Kristian Link, Nikhil Khatwani, John Reynders, Melissa J Moore, et al. mRNA structure regulates protein expression through changes in functional half-life. Proceedings of the National Academy of Sciences U.S.A., 116(48):24075–24083, 2019.

Yehoshua Bar-Hillel, Micha Perles, and Eli Shamir. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 14(2):143–172, 1961.

Elena Rivas. The four ingredients of single-sequence RNA secondary structure prediction. A unifying perspective. RNA Biology, 10(7):1185–1196, 2013.

Liang Huang, He Zhang, Dezhong Deng, Kai Zhao, Kaibo Liu, David A Hendrix, and David H Mathews. LinearFold: linear-time approximate RNA folding by 5’-to-3’ dynamic programming and beam search. Bioinformatics, 35(14):i295–i304, 07 2019.

Barry Cohen and Steven Skiena. Natural selection and algorithmic design of mRNA. Journal of Computational Biology, 10(3-4):419–432, 2003.

Goro Terai, Satoshi Kamegai, and Kiyoshi Asai. CDSfold: an algorithm for designing a protein-coding sequence with the most stable secondary structure. Bioinformatics, 32(6):828–834, 2016.

John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., USA, 2006.

Ruth Nussinov and Ann B Jacobson. Fast algorithm for predicting the secondary structure of single-stranded RNA. Proceedings of the National Academy of Sciences U.S.A., 77(11):6309–6313, 1980.

Mark-Jan Nederhof and Giorgio Satta. Probabilistic parsing as intersection. In Proceedings of the Eighth International Conference on Parsing Technologies, pages 137–148, Nancy, France, April 2003.

Tadao Kasami. An efficient recognition and syntax-analysis algorithm for context-free languages. Coordinated Science Laboratory Report no. R-257, 1966.

Daniel H. Younger. Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2):189–208, 1967.

Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1998.

Masaru Tomita. An efficient word lattice parsing algorithm for continuous speech recognition. In ICASSP’86. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 11, pages 1569–1572. IEEE, 1986.

Liang Huang, Suphan Fayong, and Yang Guo. Structured perceptron with inexact search. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–151, Montréal, Canada, June 2012. Association for Computational Linguistics.

David H Mathews, Jeffrey Sabina, Michael Zuker, and Douglas H Turner. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology, 288(5):911–940, 1999.

David H. Mathews, Matthew D. Disney, Jessica L. Childs, Susan J. Schroeder, Michael Zuker, and Douglas H. Turner. Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proceedings of the National Academy of Sciences U.S.A., 101(19):7287–7292, 2004.

Ronny Lorenz, Stephan H Bernhart, Christian Hoener Zu Siederdissen, Hakim Tafer, Christoph Flamm, Peter F Stadler, and Ivo L Hofacker. ViennaRNA package 2.0. Algorithms for Molecular Biology, 6(1):1, 2011.

Liang Huang and David Chiang. Better k-best parsing. Proceedings of the Ninth International Workshop on Parsing Technologies, pages 53–64, 2005.

Michael Zuker. On finding all suboptimal foldings of an RNA molecule. Science, 244(4900):48–52, 1989.

Stefan Wuchty, Walter Fontana, Ivo L. Hofacker, and Peter Schuster. Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers, 49(2):145–65, 1999.

David H Mathews. Revolutions in RNA secondary structure prediction. Journal of Molecular Biology, 359(3):526–532, 2006.

Yiliang Ding, Yin Tang, Chun Kit Kwok, Yu Zhang, Philip Bevilacqua, and Sarah Assmann. In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features. Nature, 505, 11 2013.

Yue Wan, Kun Qu, Qiangfeng Cliff Zhang, Ryan A. Flynn, Ohad Manor, Zhengqing Ouyang, Jiajing Zhang, Robert C. Spitale, Michael P. Snyder, Eran Segal, and Howard Y. Chang. Landscape and variation of RNA secondary structure across the human transcriptome. Nature, 505:706–709, 2014.

Premal Shah, Yang Ding, Malwina Niemczyk, Grzegorz Kudla, and Joshua B. Plotkin. Rate-limiting steps in yeast protein translation. Cell, 153:1589–601, 2013.

Tamir Tuller and Hadas Zur. Multiple roles of the coding sequence 5’ end in gene expression regulation. Nucleic Acids Research, 43(1):13–28, 12 2014.

UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Research, 42:D204–D12, 2005.

Paul M Sharp and Wen-Hsiung Li. The codon adaptation index-a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Research, 15(3):1281–1295, 1987.
Supporting Information
LinearDesign: Efficient Algorithms for Optimized mRNA Sequence Design
He Zhang, Liang Zhang, Ziyu Li, Kaibo Liu, Boxiang Liu, David H. Mathews and Liang Huang

function UPDATE(q_i, q_j, score, backpointer)
    if score < best[q_i, q_j] then                          ▷ minimizing weight
        best[q_i, q_j] ← score
        back[q_i, q_j] ← backpointer

function BACKTRACE(q_i, q_j)                                ▷ variant used in BOTTOMUPDESIGN
    backpointer ← back[q_i, q_j]
    if type(backpointer) is string then                     ▷ singleton
        nuc_i ← backpointer
        return nuc_i, "."
    if length(backpointer) = 4 then                         ▷ pairing: S → A S U | ...
        nuc_i, q_{i+1}, q_{j-1}, nuc_{j-1} ← backpointer
        seq, struct ← BACKTRACE(q_{i+1}, q_{j-1})
        return nuc_i + seq + nuc_{j-1}, "(" + struct + ")"
    else                                                    ▷ bifurcation: S → S S
        q_k ← backpointer
        seq_1, struct_1 ← BACKTRACE(q_i, q_k)
        seq_2, struct_2 ← BACKTRACE(q_k, q_j)
        return seq_1 + seq_2, struct_1 + struct_2

function BACKTRACE(q_i, q_j)                                ▷ variant used in LINEARDESIGN
    backpointer ← back[q_i, q_j]
    if length(backpointer) = 5 then                         ▷ pairing: S → S A S U | ...
        q_{k-1}, nuc_{i-1}, q_k, q_{j-1}, nuc_{j-1} ← backpointer
        seq_1, struct_1 ← BACKTRACE(q_i, q_{k-1})
        seq_2, struct_2 ← BACKTRACE(q_k, q_{j-1})
        return seq_1 + nuc_{i-1} + seq_2 + nuc_{j-1}, struct_1 + "(" + struct_2 + ")"
    else                                                    ▷ unpaired: S → S N
        q_{j-1}, nuc_{j-1} ← backpointer
        seq, struct ← BACKTRACE(q_i, q_{j-1})
        return seq + nuc_{j-1}, struct + "."

function BEAMPRUNE(Q, j, b)
    cands ← hash()                                          ▷ hash table: from node q_i to combined score best[q_0, q_i] + best[q_i, q_j]
    for each q_j ∈ nodes(j) do
        for each key (q_i, q_j) ∈ best do
            cands[q_i] ← best[q_0, q_i] + best[q_i, q_j]    ▷ best[q_0, q_i] as prefix score
    cands ← SELECTTOPB(cands, b)                            ▷ select top-b by score
    for each key (q_i, q_j) ∈ best do
        if key q_i not in cands then
            delete (q_i, q_j) in best                       ▷ prune out low-scoring states

Fig. SI 1. The pseudocode for the UPDATE, BACKTRACE (used in BOTTOMUPDESIGN), and BACKTRACE and BEAMPRUNE (used in LINEARDESIGN) functions.
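To make the helper routines of Fig. SI 1 concrete, the following Python sketch mirrors UPDATE, the bottom-up BACKTRACE, and BEAMPRUNE. It is an illustration under stated assumptions, not the authors' implementation: nodes are taken to be hashable identifiers (plain integers here), `best` and `back` are dictionaries keyed by `(q_i, q_j)`, a bifurcation backpointer is stored as a 1-tuple `(q_k,)` so that it can be told apart from the 4-tuple used for pairing, the prefix score of the start node `q0` is taken to be zero, and "top-b" means the b lowest combined scores because weights are minimized.

```python
import heapq
import math


def update(best, back, qi, qj, score, backpointer):
    """Record (score, backpointer) for span (qi, qj) if it beats the current best."""
    if score < best.get((qi, qj), math.inf):   # minimizing weight
        best[(qi, qj)] = score
        back[(qi, qj)] = backpointer


def backtrace(back, qi, qj):
    """Bottom-up variant: rebuild (sequence, dot-bracket structure) for span (qi, qj)."""
    bp = back[(qi, qj)]
    if isinstance(bp, str):                    # singleton: one unpaired nucleotide
        return bp, "."
    if len(bp) == 4:                           # pairing around an inner span
        nuc_left, q_inner_start, q_inner_end, nuc_right = bp
        seq, struct = backtrace(back, q_inner_start, q_inner_end)
        return nuc_left + seq + nuc_right, "(" + struct + ")"
    (qk,) = bp                                 # bifurcation (assumed 1-tuple encoding)
    seq1, struct1 = backtrace(back, qi, qk)
    seq2, struct2 = backtrace(back, qk, qj)
    return seq1 + seq2, struct1 + struct2


def beam_prune(best, nodes_j, q0, b):
    """Keep only spans ending at position j whose left node is among the b best."""
    nodes_j = set(nodes_j)
    cands = {}
    for (qi, qj), score in best.items():
        if qj not in nodes_j:
            continue
        # Prefix score best[q0, qi]; assumed to be 0 when qi is the start node.
        prefix = 0.0 if qi == q0 else best.get((q0, qi), math.inf)
        combined = prefix + score
        # If several end nodes share qi, keep the lower value (an assumption;
        # the pseudocode simply assigns).
        cands[qi] = min(cands.get(qi, math.inf), combined)
    keep = set(heapq.nsmallest(b, cands, key=cands.get))
    doomed = [k for k in best if k[1] in nodes_j and k[0] not in keep]
    for key in doomed:                         # prune out low-scoring states
        del best[key]
    return best
```

As a usage example, a hand-built `back` table with `back[(0, 5)] = ("G", 1, 4, "C")`, bifurcations `back[(1, 4)] = (2,)` and `back[(2, 4)] = (3,)`, and singleton entries `"A"` for spans `(1, 2)`, `(2, 3)`, and `(3, 4)` backtraces to the sequence `GAAAC` with structure `(...)`. The LinearDesign variant of BACKTRACE differs only in its two backpointer shapes (a 5-tuple for pairing, a 2-tuple for an unpaired nucleotide) and could be sketched the same way.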