Structural asymmetry along protein sequences and co-translational folding

Abstract

Proteins are translated from the N- to the C-terminus, raising the basic question of how this innate directionality affects their evolution. To explore this question, we analyze 16,200 structures from the protein data bank (PDB). We find remarkable enrichment of α-helices at the C terminus and β-strands at the N terminus. Furthermore, this α-β asymmetry correlates with sequence length and contact order, both determinants of folding rate, hinting at possible links to co-translational folding (CTF). Hence, we propose the 'slowest-first' scheme, whereby protein sequences evolved structural asymmetry to accelerate CTF: the slowest of the cooperatively-folding segments are positioned near the N terminus so they have more time to fold during translation. A phenomenological model predicts that CTF can be accelerated by asymmetry, up to double the rate, when folding time is commensurate with translation time; analysis of the PDB reveals that structural asymmetry is indeed maximal in this regime. This correspondence is greater in prokaryotes, which generally require faster protein production. Altogether, this indicates that accelerating CTF is a substantial evolutionary force whose interplay with stability and functionality is encoded in sequence asymmetry.


Structural asymmetry along protein sequences and co-translational folding

John M. McBride and Tsvi Tlusty

Center for Soft and Living Matter, Institute for Basic Science, Ulsan 44919, South Korea
Departments of Physics and Chemistry, Ulsan National Institute of Science and Technology, Ulsan 44919, South Korea
* [email protected], [email protected]
October 23, 2020

Proteins are translated from the N- to the C-terminal, raising the basic question of how this innate directionality affects their evolution. To explore this question, we analyze , structures from the protein data bank (PDB). We find remarkable enrichment of α-helices at the C terminal and β-sheets at the N terminal. Furthermore, this α-β asymmetry correlates with sequence length and contact order, both determinants of folding rate, hinting at possible links to co-translational folding (CTF). Hence, we propose the 'slowest-first' scheme, whereby protein sequences evolved structural asymmetry to accelerate CTF: the slowest-folding elements (e.g. β-sheets) are positioned near the N terminal so they have more time to fold during translation. Our model predicts that CTF can be accelerated, up to double the rate, when folding time is commensurate with translation time; analysis of the PDB reveals that structural asymmetry is indeed maximal in this regime. This correspondence is greater in prokaryotes, which generally require faster protein production. Altogether, this indicates that accelerating CTF is a substantial evolutionary force whose interplay with stability and functionality is encoded in sequence asymmetry.

All proteins are translated sequentially from the N- to the C-terminal, and are thus inherently asymmetric [1]. One example of such N-to-C asymmetry is signal peptides, which enable translocation across membranes, and are located at the N terminal [2]. This raises the question of whether and how the universal unidirectionality of protein production is leveraged to gain evolutionary advantage. Here we examine structural data from the protein data bank (PDB) in search of traces of such adaptation.
We analyzed the distribution of secondary structure along the sequence for , PDB proteins, finding two striking patterns of asymmetry. First, disordered residues are principally located at the ends of sequences, and depleted towards the middle. Second, β-sheets are enriched by 55 % near the N terminal, while α-helices are enriched by 22 % at the C terminal. These findings agree qualitatively with previous reports [3–11]. This α-β asymmetry peaks at intermediate values of sequence length and contact order, both of which correlate negatively with folding rate, indicating a possible link between secondary structure asymmetry and folding.

Hence, we further explore the possibility that α-β asymmetry may accelerate protein production, and is therefore a signature of evolutionary adaptation. Production of functional proteins from mRNA comprises two concerted processes: translation and folding. The rate of translation is limited by trade-offs between speed, accuracy and dissipation [12–15]. Folding quickly has certain advantages: unfolded proteins lead to aggregation, putting a significant burden on the cell [16–18]; faster folding allows quicker responses to environmental changes [19, 20]. Moreover, organisms whose fitness depends on fast self-reproduction would benefit from accelerated protein production that can shorten division time [21, 22]. Proteins begin folding during translation [23–28]. Thus, in principle, faster production times may be achieved if proteins finish folding and translation at around the same time. This co-translational folding (CTF) enables adaptations that increase yield and kinetics of protein production [26–30]. For example, nascent peptides interact with ribosomes and chaperones to reduce aggregation and misfolding [31–35], while translation rates can be tuned to facilitate correct folding [36–39]. Specifically, we ask if structural asymmetry may have evolved for fast and efficient production via CTF.

We show that the structural asymmetry observed in proteins is consistent with a scheme for accelerating CTF based on the sequential nature of translation and the heterogeneity of folding rates along the sequence [40, 41]; e.g., β-sheets fold much more slowly than α-helices [42].
In the proposed slowest-first scheme, protein sequences take advantage of this heterogeneity by evolving structural asymmetry: the slowest-folding structures are enriched at the N-terminal [5–11], so that they are translated first and have more time to fold. A simple model predicts that, under this scheme, production rate can be almost doubled when folding time is equivalent to translation time. To examine this hypothesis, we estimate the ratio of folding to translation time of the PDB proteins and compare it with their α-β asymmetry, finding that asymmetry peaks when folding time is commensurate with translation time. In this region, proteins are twice as likely to exhibit α-β asymmetry that favours the slowest-first scheme. We see more evidence for this scheme in prokaryotic proteins, which is consistent with prokaryotes' greater need for fast protein production due to more frequent cell division. Taken together, these findings suggest that protein sequences have been adapted for accelerated CTF via structural asymmetry.

Results

Protein secondary structure is asymmetric

Given the vectorial nature of protein translation, one may expect corresponding asymmetries in protein structure. To probe this, we study a non-redundant set of , proteins from the Protein Data Bank (PDB) [43]. We find that these PDB proteins exhibit significant asymmetry in secondary structure (Fig. 1A-B): the first residues at the N terminal are 55 % more likely to form sheets, and the first residues at the C terminal are 22 % more likely to form helices. This asymmetry is stronger for prokaryotic proteins (72 %; 20 %) than for eukaryotic proteins (20 %; 28 %). The substantial α-β asymmetry points to an evolutionary driving force which we further investigate.

Figure 1: (A) Distribution of secondary structure (helix, sheet, coil, disorder) along the sequence as a function of distance from the N- and C-terminal, and (B) the structural asymmetry, i.e. the ratio of the N and C distributions (in log scale; ± are 2:1 and 1:2 N/C ratios), for all , proteins (left), , eukaryotic proteins (middle), and , prokaryotic proteins (right). Shading indicates bootstrapped confidence intervals. C-D: Mean asym_α (C) and asym_β (D) as a function of sequence length and contact order (Eq. 1). The data is split into deciles and the bin edges are indicated on the axes. E: Distribution of proteins according to contact order and sequence length.

In both N- and C-terminals, the α-helix and β-sheet distributions exhibit periodicity in the positioning of these elements along the sequence. This periodicity is matched by several αβ-type protein folds where α-helices and β-sheets are arranged in alternating order (SI Fig. 1). These folds tend to be more abundant in prokaryotic proteins (SI Table 1); for example, ferredoxin-like folds exhibit high α-β asymmetry, significant periodicity at the N-terminal, and are ∼ times more common in prokaryotes.

Disordered regions are more abundant at the ends

The distribution of disordered regions exhibits a different pattern of asymmetry: disordered residues are enriched at both ends of proteins compared to the middle [3, 4]. Eukaryotic proteins are significantly more disordered: their probability of disorder is well approximated by ∼ D − . , where D is the distance from the end, while in prokaryotic proteins the probability of disorder decays as ∼ D − . Proteins also tend to be more disordered at the N terminals [3]: eukaryotic proteins are 30 % more likely to be disordered within the first residues of the N terminal compared to the C terminal (prokaryotes: 17 %). Although prokaryotic proteins are less disordered than eukaryotic ones, the ratio of the numbers of residues in β-sheets and α-helices is the same.

Structural asymmetry correlates with sequence length and contact order

To better understand the α-β asymmetry, we examined correlations with sequence length, L, and contact order, CO. CO is the average sequence distance between intra-protein contacts,

$\mathrm{CO} = \langle |j - i| \rangle$ , (1)

where i and j are pairs of residue indices for each contact [44]. High CO means that native contacts require large-scale movements to form, thus increasing folding time.

Figure 2: Co-translational folding and the slowest-first mechanism. A: A section of mRNA (red) is translated to a protein segment (left), which translocates through the ribosome channel (middle), and undergoes folding once the full segment has been translated and is free from steric constraints (right). B: Distribution of R = τ_fold/τ_trans, the ratio of folding to translation time (Eq. 3), for our entire sample, prokaryotic proteins, and eukaryotic proteins. Solid lines are kernel density estimation fits to histograms; dotted line indicates R = 1; shading indicates bootstrapped 95 % confidence intervals. C: Production timeline: proteins gain functionality after translation and folding, proceeding from the N-terminal (top) to the C-terminal (bottom), where each segment begins folding after translation (τ_seg) with some delay (τ_delay). Two sequences (blue/orange blocks) consist of a set of structural elements with the same folding times (blocks of equal length). Production time is shortened in the blue sequence by asymmetric ordering of the folding time along the protein sequence: slow-folding sections are at the N terminal and fast-folding sections at the C terminal. Blocks for τ_seg and τ_delay are drawn the same length for simplicity. D: Theoretical maximum speedup of production rate as a function of R and τ_ribo (Eq. 4).

To quantify secondary structure asymmetry, we calculate the magnitude of asymmetry normalized by length,

$\mathrm{asym}_\alpha = (N_\alpha - C_\alpha)/L$ , (2)
$\mathrm{asym}_\beta = (N_\beta - C_\beta)/L$ ,

where N_α (N_β) and C_α (C_β) are the number of residues in α-helices (β-sheets) in the N and C halves of a protein sequence. We find that α-β asymmetry is a non-monotonic function of both L and CO (Fig. 1C-D). In particular, there is a region of intermediate length ( − ) and intermediate CO ( − ) where structural asymmetry is most apparent. The fact that both quantities correlate negatively with folding rate (L, r = − . ; CO, r = − . ; SI Fig. 2) [44–46], taken together with proteins' inherent asymmetry due to vectorial translation, leads us to suspect that the origins of this α-β asymmetry may be related to co-translational folding.

Co-translational folding appears to be widespread

During protein production, the ribosome advances along the mRNA from the N to the C terminal (Fig. 2A). Each mRNA segment encoding a structural element (red dashed segment), in turn, enters the ribosome where it is translated and passes through the ribosome channel. The time it takes this segment to clear the ∼ nm long ribosome tunnel, τ_ribo, is the sum of the segment's translation time τ_seg and the potential delay τ_delay until the onset of co-translational folding (CTF) once the segment exits the ribosome and is free of steric constraints [26, 47, 48].

In principle, one way to maximise the rate of production and to minimise aggregation is by making proteins fold faster than they are translated, or at a similar rate. We can obtain a rough approximation of how often this occurs by estimating folding rates and translation rates of proteins. We estimate the folding rate k_fold using a power law scaling with length fitted to data from the protein folding kinetics database (PFDB) [46] (Methods). We assume an average translation rate k_trans that depends on the organism. Thus we can estimate the ratio R of folding time τ_fold to translation time τ_trans,

$R = \frac{\tau_{\mathrm{fold}}}{\tau_{\mathrm{trans}}} = \frac{1/k_{\mathrm{fold}}}{L/k_{\mathrm{trans}}}$ . (3)

The estimated R distribution exhibits a peak in the region of commensurate time R ≈ 1 (Fig. 2B). For the 68 % of proteins (CI − 88 %, SI Fig. 3) that lie in the region R ≤ 1, folding may be quicker than translation, indicating that CTF is common. In comparison, a more rigorous method estimated that in 37 % of proteins in E. coli, at least one domain will fully fold before translation finishes [49]. Examining prokaryotic proteins and eukaryotic proteins separately reveals a sharper peak in the R distribution for prokaryotic proteins in the region of commensurate folding and translation times, / < R < . Notably, a greater fraction of prokaryotic proteins (56 %) are in this regime compared to eukaryotic proteins (41 %).

Folding rate asymmetry can speed up co-translational folding

Fig. 2C shows the production timeline of a protein whose folding time τ_fold is determined by a rate-limiting fold [41, 50–52]. This bottleneck may represent slow kinetics in secondary/tertiary structure formation, or formation of misfolded intermediates. If the bottleneck is located at the N terminal (blue blocks in Fig. 2C), then the production time is minimal, τ_min = max(τ_fold + τ_ribo, τ_trans). In the other extreme, if the rate-limiting fold is located at the C terminal (orange blocks in Fig. 2C), production time is maximized [53], τ_max = τ_fold + τ_trans. In this case, the last element can escape the ribosome quickly after being translated (τ_delay ≈ 0) since it is not delayed by downstream translation [54]. Thus, production rate can be accelerated by a factor,

$\mathrm{speedup} = \tau_{\max} / \tau_{\min}$ . (4)

In the limit τ_ribo ≪ τ_trans, one finds from Eqs. 3-4 that the speedup as a function of R = τ_fold/τ_trans (Fig. 2D) is

$\mathrm{speedup} = 1 + e^{-|\ln R|}$ . (5)

A maximal, twofold speedup is achieved when translation time equals folding time, R = 1, and taking τ_ribo > 0 shifts this maximum towards R < 1.

Figure 3: A: α-β asymmetry distributions (helix, sheet) as a function of R, the folding/translation time ratio (Eq. 3). Proteins are divided into deciles according to R; bin edges are shown on the y-axis. B: N terminal enrichment, the degree to which sheets/helices are enriched in the N over the C terminal (Eq. 6), is shown for the deciles given in A. C: N terminal enrichment as a function of R for , eukaryotic proteins and , prokaryotic proteins. Proteins are divided into bins according to R; bin edges, shown on the x-axis, are the same as in A-B. Whiskers indicate bootstrapped confidence intervals.

Structural asymmetry is maximum for commensurate folding and translation times

The speedup curve (Fig. 2D) implies that proteins can benefit the most from structural asymmetry when R = τ_fold/τ_trans ≈ 1. Hence, we estimate the magnitude of α-β asymmetry as a function of R, and plot the distributions in Fig. 3A. At intermediate R, the means of the distributions shift away from zero, indicative of strong bias.

To capture the magnitude of these shifts we calculate the N terminal enrichment, E, defined as the fraction of proteins with positive asymmetry (i.e. enriched at the N terminal) minus the fraction of proteins with negative asymmetry (enriched at the C terminal), for both helices and sheets:

$E_\alpha = P(\mathrm{asym}_\alpha > 0) - P(\mathrm{asym}_\alpha < 0)$ , (6)
$E_\beta = P(\mathrm{asym}_\beta > 0) - P(\mathrm{asym}_\beta < 0)$ .

Fig. 3B shows that in the R decile with maximum asymmetry, proteins in the PDB are . times as likely to be enriched in β-sheets in the N terminal, while α-helices are . times more likely to be found in the C-terminal half. This maximum is found when − . ≤ log R ≤ . (the 95 % confidence intervals for the k_fold estimate give − . ≤ log R ≤ . ; SI Fig. 4). This region of maximal asymmetry overlaps with the region of maximal speedup (Fig. 2D, Eq. 5), suggesting that asymmetry evolves because it enhances CTF.

Prokaryotes exhibit greater asymmetry than Eukaryotes

We looked at α-β asymmetry for prokaryotic and eukaryotic proteins separately, finding that when asymmetry is maximum, prokaryotes exhibit more asymmetry than eukaryotes: sheets are 36 % more likely to be enriched at the N terminal in prokaryotes compared to eukaryotes (Fig. 3C). Typically, prokaryotic cells divide more frequently than eukaryotic cells [22], and thus have a greater need for fast production of functional proteins. The analysis is therefore consistent with the slowest-first scheme, which implies that the stronger pressure on prokaryotes should lead to greater asymmetry.
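The asymmetry measure (Eq. 2) and the N terminal enrichment (Eq. 6) used in these comparisons can be sketched in a few lines of Python. This is a minimal illustration and not the authors' analysis code: the one-letter alphabet ('H' for helix, 'E' for sheet), the handling of the middle residue when L is odd, and the toy strings are all assumptions made here.

```python
from typing import Iterable, Tuple


def asymmetry(ss: str) -> Tuple[float, float]:
    """Return (asym_alpha, asym_beta) for one protein, following Eq. 2.

    `ss` is a per-residue secondary-structure string written N to C.
    Assumed alphabet: 'H' = helix, 'E' = sheet, anything else = other.
    """
    L = len(ss)
    half = L // 2
    # Assumption: for odd L the middle residue belongs to neither half.
    n_half, c_half = ss[:half], ss[L - half:]
    asym_a = (n_half.count("H") - c_half.count("H")) / L
    asym_b = (n_half.count("E") - c_half.count("E")) / L
    return asym_a, asym_b


def n_terminal_enrichment(values: Iterable[float]) -> float:
    """E = P(asym > 0) - P(asym < 0) over a set of proteins (Eq. 6)."""
    values = list(values)
    pos = sum(v > 0 for v in values)
    neg = sum(v < 0 for v in values)
    return (pos - neg) / len(values)


# Toy, made-up secondary-structure strings (not PDB data).
toy = ["EEEEEECCHHHH", "EEEECCCCHHHH", "HHHHCCCCEEEE", "CCCCCCCCCCCC"]
asym = [asymmetry(s) for s in toy]
E_alpha = n_terminal_enrichment(a for a, _ in asym)
E_beta = n_terminal_enrichment(b for _, b in asym)
print(f"E_alpha = {E_alpha:+.2f}, E_beta = {E_beta:+.2f}")
```

In this toy set two of four strings have sheets enriched in the N half and one in the C half, so E_β is positive and E_α is negative, the sign pattern that the slowest-first scheme favours.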

Multi-domain proteins are optimized for CTF via distinct mechanisms

Multi-domain proteins can be potentially adapted at two levels: within domains, and between domains (Fig. 4A). To test this, we isolated individual domains in the PDB (using Pfam) [55], and calculated CO and α-β asymmetry for each domain as in Fig. 3. While intra-domain optimization of secondary structure clearly occurs within single-domain proteins, it is much weaker within multi-domain proteins (Fig. 4B-C). Inter-domain optimization entails ordering the slowest-folding domains at the N terminal, for which we find no significant bias (SI Fig. 5). Instead, we find that as the number of domains increases, the CO of individual domains decreases (Fig. 4D). Thus CTF is maintained in multi-domain proteins mostly by using faster-folding domains throughout.

Figure 4: A: Multi-domain proteins can be optimized via asymmetry between domains, and/or within domains. B-C: N terminal enrichment within domains as a function of R for single-domain proteins (B: , domains) and multi-domain proteins (C: , domains). Domains are split into deciles based on R, and the bin edges are shown on the x-axis; whiskers indicate bootstrapped confidence intervals. D: Domain contact order distributions for proteins with different numbers of domains.

Discussion

Selection pressures vary

We examined the hypothesis that proteins are selected for CTF to hasten protein production and reduce aggregation/misfolding, but this may not be equally true for all proteins. As an example, we showed that in prokaryotes, which have a greater burden of cell growth, proteins tend to have more asymmetry than in eukaryotes. More generally, CTF may be hindered in some proteins by interactions with the ribosome [56]. Long-lived proteins [57] may derive little benefit from an increase in production speed. On the other hand, proteins produced in large quantities need to fold quickly as aggregation can increase non-linearly with concentration [58]. These predictions can be tested when sufficient data for protein lifespan [59], expression levels [60], and structure become available. While we showed that α-β asymmetry is apparent in a broad set of proteins, further analysis of an extended data set may be able to detect the sub-classes of proteins that will benefit most from α-β asymmetry.

CTF for multi-domain proteins is more complex

Multi-domain proteins exhibit less asymmetry than single-domain proteins. Due to interactions between domains [35, 38, 47, 48, 61], optimization via asymmetry may not be feasible; instead, a safe strategy is to fold each domain before translating subsequent domains. To explain the lack of intra-domain α-β asymmetry (Fig. 4C), we propose a simple mechanical argument. When a β-sheet forms, the protein chain contracts. This results in a pulling force both on the ribosome [62, 63] and on any upstream domains. This extra resistance to β-sheet formation may preclude the early formation of β-sheets at the N terminal side of a domain. If this is true, then the domain in position 1 should still exhibit α-β asymmetry; we currently lack sufficient statistical power to conclusively test this (SI Fig. 6). Further tests could look at CTF of a β-rich domain in the presence or absence of an upstream domain [64, 65].

Suggested experiments for circular permutants

To experimentally test the slowest-first mechanism, we suggest studying CTF of multiple proteins with R ≈ 1, which differ in asym_α and asym_β. In particular, we propose to use proteins whose sequences are related by circular permutation, while having identical structures [66–69]. Circular permutants with opposite structural asymmetry, as the example in Fig. 5, should fold at significantly different rates. Additional experimental control of R is possible via synonymous codon mutations [70] or in vitro expression systems [25]. Thus, one can test whether asymmetry in secondary structure can lead to acceleration of CTF, and how this depends on R.

Figure 5: Secondary structure for nuclear transport factor 2 H66A mutant (PDB: 1ASK [71]) and a circular permutant, 1ASK-CP67, which may fold faster during translation.
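At the level of the sequence, a circular permutation amounts to rotating the residue order, which can flip the sign of the sheet asymmetry while leaving the set of structural elements unchanged. The Python sketch below illustrates this on a made-up secondary-structure string; it is not based on 1ASK, it ignores any linker joining the original termini, and the function names and the 'H'/'E' alphabet are assumptions introduced here.

```python
def circular_permutant(seq: str, cut: int) -> str:
    """Move the first `cut` residues to the C-terminal end of the sequence."""
    cut %= len(seq)
    return seq[cut:] + seq[:cut]


def sheet_asymmetry(ss: str) -> float:
    """asym_beta as in Eq. 2: (sheet residues in N half - C half) / L.

    Assumption: 'E' marks sheet residues; for odd L the middle residue
    is counted in neither half.
    """
    L = len(ss)
    half = L // 2
    return (ss[:half].count("E") - ss[L - half:].count("E")) / L


# Hypothetical protein: helix first, sheet last (written N to C).
original = "HHHHHHCCEEEEEE"
permuted = circular_permutant(original, cut=8)

print(original, f"{sheet_asymmetry(original):+.2f}")
print(permuted, f"{sheet_asymmetry(permuted):+.2f}")
```

Here the permutant places the sheet residues at the N terminal, so its asym_β is positive while the original's is negative; under the slowest-first scheme the permutant would be the candidate expected to fold faster during translation.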

Disorder is enriched at both sequence ends

The N and C terminals share a notable tendency for disorder near the end, which suggests that they are affected by the same physical end effect. The amino acid at the end is linked to the chain by only one peptide bond, leaving it more configurational freedom than an amino acid in the centre of the protein, which is constrained by two bonds. This entropic contribution to the free energy of the loose ends, of order k_B T, can induce disorder in marginally stable structures.

Since disordered regions do not need time to fold, placing them towards the C-terminal gives the other residues more time to fold. Yet, we find a similar, slightly stronger, tendency for disorder near the N-terminal (green curves in Fig. 1B), particularly in eukaryotes. This may result from other determinants of protein evolution; e.g., disordered regions tend to interact with some ribosome-associating chaperones [72, 73]. If disorder at the N terminal is related to chaperones, we expect that asymmetry will be higher for slow-folding proteins as they are more prone to aggregation. We find that bias for disorder at the N-terminal is strongest for slow-folding proteins (high R, L and CO; SI Fig. 7), but only for prokaryotes, not eukaryotes. Given the absence of a correlation between R, L and CO and disorder asymmetry in eukaryotic proteins, the question of why eukaryotic proteins are more disordered at the N terminal remains open.

Considering tertiary structure

We used secondary structure as a proxy for folding rate, but there are also contributions from tertiary structure. To test this assumption, we ran coarse-grained simulations of CTF of three structurally-asymmetric proteins while varying R, for both the original sequence and the reversed sequence. We find that these proteins fold faster when β-sheets are translated first, in the relevant region of R ∼ 1 (SI Fig. 8).

We also studied the effect of tertiary structure by looking at asymmetry in surface accessibility. β-sheets at the N terminal are less likely to be exposed to solvent than β-sheets at the C terminal; this bias is stronger for prokaryotic proteins (first 20 residues: 41 %) compared to eukaryotic proteins (13 %) (SI Fig. 9). Since solvent-exposed β-sheets are less likely to form part of a folding nucleus [74], this suggests that β-sheets at the N terminal are more likely to nucleate folding compared to those at the C terminal.

Correlations support the 'slowest-first' hypothesis

The data used to fit Eq. 7 are sparse ( proteins), biased towards small, single-domain proteins, and typically obtained from in vitro refolding experiments [46]. To test whether our conclusions are robust to sampling, we estimate confidence intervals using bootstrapping with sample sizes equal to the original sample size, and half that amount; we perform this test on both the reduced version of the PFDB data set used in the main figures, and on a second version of the PFDB data set (Methods; SI Fig. 4). In addition, we calculate the main results using a different protein folding data set, ACPro [75], which partially overlaps with PFDB, but includes larger proteins (SI Fig. 10). In all of the above analyses, the point of maximum asymmetry is found to be / < R < , which corresponds to the region where CTF speed-up is possible. However, to fully overcome the aforementioned limitations, further experiments are needed.

Analysis is consistent with hypothesis that proteins are selected for CTF via secondary structure

To sum up, in the proposed slowest-first mechanism, CTF can be accelerated by positioning the slowest-folding parts of a protein near the N terminal so that they have more time to fold. A survey of the PDB shows that the estimated acceleration correlates with asymmetry in secondary structure. In particular, the rate of production can be almost doubled when translation time is similar to folding time, and indeed these proteins exhibit the maximal asymmetry in secondary structure distribution. Altogether, there appears to be substantial evolutionary selection, manifested in sequence asymmetry, for proteins that can fold during translation.
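The near-doubling of production rate follows directly from the two limiting production times of the model (Eq. 4) and its closed form (Eq. 5). The short Python sketch below evaluates both and checks that they agree when τ_ribo = 0; the function names and the choice of time units (τ_trans = 1) are assumptions made for illustration only.

```python
import math


def speedup(tau_fold: float, tau_trans: float, tau_ribo: float = 0.0) -> float:
    """Eq. 4: production time with the bottleneck last over bottleneck first."""
    tau_min = max(tau_fold + tau_ribo, tau_trans)  # slowest fold at the N terminal
    tau_max = tau_fold + tau_trans                 # slowest fold at the C terminal
    return tau_max / tau_min


def speedup_closed_form(R: float) -> float:
    """Eq. 5: limit tau_ribo << tau_trans, with R = tau_fold / tau_trans."""
    return 1.0 + math.exp(-abs(math.log(R)))


# Measure time in units of tau_trans, so that tau_fold equals R.
for R in (0.1, 0.5, 1.0, 2.0, 10.0):
    direct = speedup(tau_fold=R, tau_trans=1.0)
    assert math.isclose(direct, speedup_closed_form(R))
    print(f"R = {R:>4}: speedup = {direct:.2f}")
```

With a finite tunnel time the same function reproduces the shift of the optimum towards R < 1: for example, `speedup(0.8, 1.0, tau_ribo=0.2)` gives 1.8, which is the largest value attainable for that τ_ribo.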

Methods

Data

We extracted a set of , proteins from the Protein Data Bank (PDB) [43]. We only include proteins that exactly match their Uniprot sequence (not mutated, spliced, or truncated) [76]. For each unique protein sequence, we only include the most recent structure. We used SIFTS to map PDB and Uniprot entries [77]. We exclude proteins with predicted signal peptides as little is known about whether such proteins undergo CTF; we used SignalP 5.0 to identify signal peptides [78]. Using the above criteria we extracted a set of , domains by matching PDB entries to Pfam domains [55]. α-helices and β-sheets are identified through annotations in the PDB; disorder is inferred from residues with missing coordinates. To calculate contact order, we only consider contacts between residues where α carbons are within Å; we confirm that the correlation in Fig. 1D is robust to choice of this cutoff (SI Fig. 11).

We use the protein folding kinetics database (PFDB) for estimating folding rates [46]. For our main results we only used entries with realistic physical conditions ( < pH < , and °C < T < °C) and ignored folding rates which had been extrapolated to T = 25 °C; in total, proteins. We test a second version of the PFDB data set without excluding proteins, and using folding rates which were extrapolated to T = 25 °C; proteins. We also use the ACPro data set [75] to test the robustness of our conclusions; proteins.

Predicting folding and translation rate

The folding rate, k_fold (in units of 1/sec), is estimated by a power-law fit as a function of the protein's length:

$\log k_{\mathrm{fold}} = A + B \log L$ , (7)

where L is sequence length in residues; A and B are free parameters. We fit these parameters using data from the PFDB [46] to get 95 % confidence intervals of A = 13 . ± . and B = − . ± . (with correlation coefficient r = − . , and p-value p < . ). The estimate from Eq. 7 is limited for the following reasons: (i) It is extracted from a small set of 122 proteins. (ii) It disregards the effects of secondary structure, contact order, and other important determinants. (iii) The data is from in vitro measurements. (iv) The data is biased towards small, single-domain proteins. Thus, it is only a rough predictor for the folding rates of individual proteins in the set, as the standard deviation between estimated and empirical folding rates is . . For all these reasons, we use Eq. 7 as an estimator of the average folding rate of sets of proteins of similar length L, where the large sample size N of each bin is expected to reduce the errors as ∼ N^(−1/2).

We tested whether the predicted folding rates of proteins in the PDB are within certain approximate bounds on realistic folding rates. A lower bound to folding time has been estimated at ∼ L/ µs [79], while we take the doubling time of E. coli, roughly minutes, as an approximate upper bound. Of course, many proteins rely on chaperones, so their bare estimated folding time may be longer than the upper bound, while others come from organisms with much longer doubling times. Even so, according to Eq. 7 only of proteins are estimated to have a folding time greater than minutes, while only of proteins are estimated to fold faster than the lower bound. Given the magnitude of the error in estimating the folding time of individual proteins, Eq. 7 appears to yield estimates that are mostly within the biologically reasonable regime.
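The chain from sequence length to the time ratio R (Eq. 7 followed by Eq. 3) can be sketched as below. This is an illustrative Python outline under explicit assumptions: the logarithms in Eq. 7 are taken to be base 10, and the parameter values are placeholders chosen only to make the arithmetic easy to follow. They are not the fitted values of A and B or the translation rates used in this work.

```python
import math


def folding_rate(L: int, A: float, B: float) -> float:
    """Eq. 7: log k_fold = A + B log L, with k_fold in 1/s.

    Assumption: logarithms are base 10.
    """
    return 10.0 ** (A + B * math.log10(L))


def time_ratio(L: int, A: float, B: float, k_trans: float) -> float:
    """Eq. 3: R = tau_fold / tau_trans = (1 / k_fold) / (L / k_trans)."""
    tau_fold = 1.0 / folding_rate(L, A, B)
    tau_trans = L / k_trans
    return tau_fold / tau_trans


# Placeholder values for illustration only (NOT the paper's fit).
A_DEMO, B_DEMO = 10.0, -5.0
K_TRANS_DEMO = 10.0  # residues per second, hypothetical

for L in (50, 100, 200, 400):
    R = time_ratio(L, A_DEMO, B_DEMO, K_TRANS_DEMO)
    print(f"L = {L:>3}: log10 R = {math.log10(R):+.2f}")
```

Because folding time grows as a power of L while translation time grows only linearly, R increases with length whenever B < −1, so binning proteins by R largely orders them by length, with an organism-dependent offset set by k_trans.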
Furthermore, in estimating the folding rate of large proteins, a common assumption is that they consist of multiple independently-folding domains [80], which considerably reduces the estimated folding time of the slowest proteins; we do not make this assumption.

In principle, we could have used structural/topological measures (such as contact order, long-range order, etc. [81]) to slightly improve the fit to Eq. 7. However, these typically involve numerous methodological choices and additional parameters [82], and the scaling relations are entirely empirical. In contrast, scaling of folding time with length has a robust theoretical background [45, 83–89]; the exact form of the scaling is debated, but a power law is favoured slightly [84, 90].

We assume the translation rate, k_trans, depends on the organism (host organism for viral proteins), such that k_trans is amino acids per second for eukaryotes and for prokaryotes.

Data Availability

The non-redundant sets of proteins and domains, along with the data used in the figures and Supplementary Information, will be made available on Zenodo.

Code Availability

Simulation and analysis code, along with code used to make all figures, are accessible at https://github.com/jomimc/FoldAsymCode .

Acknowledgements

We acknowledge Albert J. Libchaber for stimulating discussions and comments on the manuscript. This work was supported by the Institute for Basic Science, Project Code IBS-R020-D1.

Competing Interests

The authors declare that they have no competing financial interests.

Correspondence

Correspondence and requests for materials should be addressed to J.M. (email: [email protected]) and T.T. ([email protected]).

Author Contributions

J.M. and T.T. designed research; J.M. performed research; J.M. analyzed data; J.M. and T.T. wrote the paper.

References

[1] M. Salas, M. A. Smith, W. M. Stanley, A. J. Wahba, and S. Ochoa. Direction of reading of the genetic message. J. Biol. Chem., 3995:3988–3995, 1965.
[2] G. von Heijne. The signal peptide. J. Membrane Biol., 115(3):195–201, 1990. doi: 10.1007/BF01868635.
[3] M. Y. Lobanov, E. I. Furletova, N. S. Bogatyreva, M. A. Roytberg, and O. V. Galzitskaya. Library of disordered patterns in 3D protein structures. PLoS Comput. Biol., 6(10):1–10, 2010. doi: 10.1371/journal.pcbi.1000958.
[4] M. Y. Lobanov, I. V. Likhachev, and O. V. Galzitskaya. Disordered residues and patterns in the protein data bank. Molecules, 25(7):1522, 2020. doi: 10.3390/molecules25071522.
[5] J. M. Thornton and B. L. Chakauya. Conformation of terminal regions in proteins. Nature, 298(5871):296–297, 1982. doi: 10.1038/298296a0.
[6] R. Bhattacharyya, D. Pal, and P. Chakrabarti. Secondary structures at polypeptide-chain termini and their features. Acta Crystallogr. D, 58(10 Part 2):1793–1802, 2002. doi: 10.1107/S0907444902013069.
[7] D. Marenduzzo, T. X. Hoang, F. Seno, M. Vendruscolo, and A. Maritan. Form of growing strings. Phys. Rev. Lett., 95:098103, 2005. doi: 10.1103/PhysRevLett.95.098103.
[8] M. M. G. Krishna and S. W. Englander. The N-terminal to C-terminal motif in protein folding and function. P. Natl. Acad. Sci. USA, 102(4):1053–1058, 2005. doi: 10.1073/pnas.0409114102.
[9] A. Laio and C. Micheletti. Are structural biases at protein termini a signature of vectorial folding? Proteins, 62(1):17–23, 2006. doi: 10.1002/prot.20712.
[10] R. Saunders, M. Mann, and C. M. Deane. Signatures of co-translational folding. Biotechnol. J., 6(6):742–751, 2011. doi: 10.1002/biot.201000330.
[11] M. Baiesi, E. Orlandini, F. Seno, and A. Trovato. Sequence and structural patterns detected in entangled proteins reveal the importance of co-translational folding. Sci. Rep., 9(1):8426, 2019. doi: 10.1038/s41598-019-44928-3.
[12] J. J. Hopfield. Kinetic proofreading: A new mechanism for reducing errors in biosynthetic processes requiring high specificity. P. Natl. Acad. Sci. USA, 71(10):4135–4139, 1974. doi: 10.1073/pnas.71.10.4135.
[13] J. Ninio. Kinetic amplification of enzyme discrimination. Biochimie, 57(5):587–595, 1975. doi: 10.1016/S0300-9084(75)80139-8.
[14] D. A. Drummond and C. O. Wilke. The evolutionary consequences of erroneous protein synthesis. Nat. Rev. Genet., 10(10):715–724, 2009. doi: 10.1038/nrg2662.
[15] W. D. Piñeros and T. Tlusty. Kinetic proofreading and the limits of thermodynamic uncertainty. Phys. Rev. E, 101:022415, 2020. doi: 10.1103/PhysRevE.101.022415.
[16] C. M. Dobson. Protein misfolding, evolution and disease. Trends Biochem. Sci., 24(9):329–332, 1999. doi: 10.1016/S0968-0004(99)01445-0.
[17] C. López-Otín, M. A. Blasco, L. Partridge, M. Serrano, and G. Kroemer. The hallmarks of aging. Cell, 153(6):1194–1217, 2013. doi: 10.1016/j.cell.2013.05.039.
[18] M. Santra, K. A. Dill, and A. M. R. de Graff. Proteostasis collapse is a driver of cell aging and death. P. Natl. Acad. Sci. USA, 116(44):22173–22178, 2019. doi: 10.1073/pnas.1906592116.
[19] K. A. Spriggs, M. Bushell, and A. E. Willis. Translational regulation of gene expression during conditions of cell stress. Mol. Cell, 40(2):228–237, 2010. doi: 10.1016/j.molcel.2010.09.028.
[20] E. de Nadal, G. Ammerer, and F. Posas. Controlling gene expression in response to stress. Nat. Rev. Genet., 12(12):833–845, 2011. doi: 10.1038/nrg3055.
[21] K. A. Dill, K. Ghosh, and J. D. Schmit. Physical limits of cells and proteomes. P. Natl. Acad. Sci. USA, 108(44):17876–17882, 2011. doi: 10.1073/pnas.1114477108.
[22] M. V. Zubkov. Faster growth of the major prokaryotic versus eukaryotic CO2 fixers in the oligotrophic ocean. Nat. Commun., 5(1):3776, 2014. doi: 10.1038/ncomms4776.
[23] N. Alexandrov. Structural argument for N-terminal initiation of protein folding. Protein Sci., 2(11):1989–1991, 1993. doi: 10.1002/pro.5560021121.
[24] O. B. Nilsson, A. A. Nickson, J. J. Hollins, S. Wickles, A. Steward, R. Beckmann, G. von Heijne, and J. Clarke. Cotranslational folding of spectrin domains via partially structured states. Nat. Struct. & Mol. Bio., 24(3):221–225, 2017. doi: 10.1038/nsmb.3355.
[25] A. J. Samelson, E. Bolin, S. M. Costello, A. K. Sharma, E. P. O'Brien, and S. Marqusee. Kinetic and structural comparison of a protein's cotranslational folding and refolding pathways. Sci. Adv., 4(5), 2018. doi: 10.1126/sciadv.aas9098.
[26] M. Liutkute, E. Samatova, and M. V. Rodnina. Cotranslational folding of proteins on the ribosome. Biomolecules, 10(1):97, 2020. doi: 10.3390/biom10010097.
[27] G. Zhang and Z. Ignatova. Folding at the birth of the nascent chain: Coordinating translation with co-translational folding. Curr. Opin. Struc. Biol., 21(1):25–31, 2011. doi: 10.1016/j.sbi.2010.10.008.
[28] F. Trovato and E. P. O'Brien. Insights into cotranslational nascent protein behavior from computer simulations. Annu. Rev. Biophys., 45(1):345–369, 2016. doi: 10.1146/annurev-biophys-070915-094153.
[29] G. Kramer, A. Shiber, and B. Bukau. Mechanisms of cotranslational maturation of newly synthesized proteins. Annu. Rev. Biochem., 88(1):337–364, 2019. doi: 10.1146/annurev-biochem-013118-111717.
[30] C. A. Waudby, C. M. Dobson, and J. J. Christodoulou. Nature and regulation of protein folding on the ribosome. Trends Biochem. Sci., 44(11):914–926, 2019. doi: 10.1016/j.tibs.2019.06.008.
[31] C. M. Kaiser, H. C. Chang, V. R. Agashe, S. K. Lakshmipathy, S. A. Etchells, M. Hayer-Hartl, F. U. Hartl, and J. M. Barral. Real-time observation of trigger factor function on translating ribosomes. Nature, 444(7118):455–460, 2006. doi: 10.1038/nature05225.
[32] C. M. Kaiser, D. H. Goldman, J. D. Chodera, I. Tinoco, and C. Bustamante. The ribosome modulates nascent protein folding. Science, 334(6063):1723–1727, 2011. doi: 10.1126/science.1209740.
[33] E. P. O'Brien, J. Christodoulou, M. Vendruscolo, and C. M. Dobson. Trigger factor slows co-translational folding through kinetic trapping while sterically protecting the nascent chain from aberrant cytosolic interactions. J. Am. Chem. Soc., 134(26):10920–10932, 2012. doi: 10.1021/ja302305u.
[34] E. P. O'Brien, M. Vendruscolo, and C. M. Dobson. Kinetic modelling indicates that fast-translating codons can coordinate cotranslational protein folding by avoiding misfolded intermediates. Nat. Commun., 5(1):2988, 2014. doi: 10.1038/ncomms3988.
[35] K. Liu, K. Maciuba, and C. M. Kaiser. The ribosome cooperates with a chaperone to guide multi-domain protein folding. Mol. Cell, 74(2):310–319.e7, 2019. doi: 10.1016/j.molcel.2019.01.043.
[36] S. J. Kim, J. S. Yoon, H. Shishido, Z. Yang, L. A. A. Rooney, J. M. Barral, and W. R. Skach. Translational tuning optimizes nascent protein folding in cells. Science, 348(6233):444–448, 2015. doi: 10.1126/science.aaa3974.
[37] W. M. Jacobs and E. I. Shakhnovich. Evidence of evolutionary selection for cotranslational folding. P. Natl. Acad. Sci. USA, 114(43):11434–11439, 2017. doi: 10.1073/pnas.1705772114.
[38] A. Bitran, W. M. Jacobs, X. Zhai, and E. Shakhnovich. Cotranslational folding allows misfolding-prone proteins to circumvent deep kinetic traps. P. Natl. Acad. Sci. USA, 117(3):1485–1495, 2020. doi: 10.1073/pnas.1913207117.
[39] I. M. Walsh, M. A. Bowman, I. F. Soto Santarriaga, A. Rodriguez, and P. L. Clark. Synonymous codon substitutions perturb cotranslational protein folding in vivo and impair cell fitness. P. Natl. Acad. Sci. USA, 117(7):3528–3534, 2020. doi: 10.1073/pnas.1907126117.
[40] K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw. How fast-folding proteins fold. Science, 334(6055):517–520, 2011. doi: 10.1126/science.1208351.
[41] W. M. Jacobs and E. I. Shakhnovich. Structure-based prediction of protein-folding transition paths. Biophys. J., 111(5):925–936, 2016. doi: 10.1016/j.bpj.2016.06.031.
[42] M. P. Morrissey, Z. Ahmed, and E. I. Shakhnovich. The role of cotranslation in protein folding: A lattice model study. Polymer, 45(2):557–571, 2004. doi: 10.1016/j.polymer.2003.10.090.
[43] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Res., 28(1):235–242, 2000. doi: 10.1093/nar/28.1.235.
[44] D. N. Ivankov, S. O. Garbuzynskiy, E. Alm, K. W. Plaxco, D. Baker, and A. V. Finkelstein. Contact order revisited: Influence of protein size on the folding rate. Protein Sci., 12(9):2057–2062, 2003. doi: 10.1110/ps.0302503.
[45] A. N. Naganathan and V. Muñoz. Scaling of folding times with protein size. J. Am. Chem. Soc., 127(2):480–481, 2005. doi: 10.1021/ja044449u.
[46] B. Manavalan, K. Kuwajima, and J. Lee. PFDB: A standardized protein folding database with temperature correction. Sci. Rep., 9(1):1588, 2019. doi: 10.1038/s41598-018-36992-y.
[47] L. Notari, M. Martínez-Carranza, J. A. Farías-Rico, P. Stenmark, and G. von Heijne. Cotranslational folding of a pentarepeat β-helix protein. J. Mol. Biol., 430(24):5196–5206, 2018. doi: 10.1016/j.jmb.2018.10.016.
[48] K. Liu, X. Chen, and C. M. Kaiser. Energetic dependencies dictate folding mechanism in a complex protein. P. Natl. Acad. Sci. USA, 116(51):25641–25648, 2019. doi: 10.1073/pnas.1914366116.
[49] P. Ciryam, R. I. Morimoto, M. Vendruscolo, C. M. Dobson, and E. P. O'Brien. In vivo translation rates can substantially delay the cotranslational folding of the Escherichia coli cytosolic proteome. P. Natl. Acad. Sci. USA, 110(2):E132–E140, 2013. doi: 10.1073/pnas.1213624110.
[50] C. T. Friel, A. P. Capaldi, and S. E. Radford. Structural analysis of the rate-limiting transition states in the folding of Im7 and Im9: Similarities and differences in the folding of homologous proteins. J. Mol. Biol., 326(1):293–305, 2003. doi: 10.1016/S0022-2836(02)01249-4.
[51] Y. Hanazono, K. Takeda, and K. Miki. Structural studies of the N-terminal fragments of the WW domain: Insights into co-translational folding of a beta-sheet protein. Sci. Rep., 6(1):34654, 2016. doi: 10.1038/srep34654.
[52] P. Tian, A. Steward, R. Kudva, T. Su, P. J. Shilling, A. A. Nickson, J. J. Hollins, R. Beckmann, G. von Heijne, J. Clarke, and R. B. Best. Folding pathway of an Ig domain is conserved on and off the ribosome. P. Natl. Acad. Sci. USA, 115(48):E11284–E11293, 2018. doi: 10.1073/pnas.1810523115.
[53] X. Chen, N. Rajasekaran, K. Liu, and C. M. Kaiser. Synthesis runs counter to the directional folding pathway of a nascent protein domain. bioRxiv, 2020. doi: 10.1101/2020.04.29.068593.
[54] D. A. Nissley, Q. V. Vu, F. Trovato, N. Ahmed, Y. Jiang, M. S. Li, and E. P. O'Brien. Electrostatic interactions govern extreme nascent protein ejection times from ribosomes and can delay ribosome recycling. J. Am. Chem. Soc., 142(13):6103–6110, 2020. doi: 10.1021/jacs.9b12264.
[55] S. El-Gebali, J. Mistry, A. Bateman, S. R. Eddy, A. Luciani, S. C. Potter, M. Qureshi, L. J. Richardson, G. A. Salazar, A. Smart, E. L. L. Sonnhammer, L. Hirsh, L. Paladin, D. Piovesan, S. C. E. Tosatto, and R. D. Finn. The Pfam protein families database in 2019. Nucleic Acids Res., 47(D1):D427–D432, 2019. doi: 10.1093/nar/gky995.
[56] G. Kemp, R. Kudva, A. de la Rosa, and G. von Heijne. Force-profile analysis of the cotranslational folding of HemK and filamin domains: Comparison of biochemical and biophysical folding assays. J. Mol. Biol., 431(6):1308–1314, 2019. doi: 10.1016/j.jmb.2019.01.043.
[57] B. H. Toyama and M. W. Hetzer. Protein homeostasis: Live long, won't prosper. Nat. Rev. Mol. Cell Bio., 14(1):55–61, 2013. doi: 10.1038/nrm3496.
[58] W. Wang. Protein aggregation and its inhibition in biopharmaceutics. Int. J. Pharmaceut., 289(1):1–30, 2005. doi: 10.1016/j.ijpharm.2004.11.014.
[59] E. F. Fornasiero, S. Mandad, H. Wildhagen, M. Alevra, B. Rammner, S. Keihani, F. Opazo, I. Urban, T. Ischebeck, M. S. Sakib, M. K. Fard, K. Kirli, T. P. Centeno, R. O. Vidal, R. U. Rahman, E. Benito, A. Fischer, S. Dennerlein, P. Rehling, I. Feussner, S. Bonn, M. Simons, H. Urlaub, and S. O. Rizzoli. Precisely measured protein lifetimes in the mouse brain reveal differences across tissues and subcellular fractions. Nat. Commun., 9(1):4230, 2018. doi: 10.1038/s41467-018-06519-0.
[60] H. Yun, J. W. Lee, J. Jeong, J. Chung, J. M. Park, H. N. Myoung, and S. Y. Lee. EcoProDB: The Escherichia coli protein database. Method. Biochem. Anal., 23(18):2501–2503, 2007. doi: 10.1093/bioinformatics/btm351.
[61] J. H. Han, S. Batey, A. A. Nickson, S. A. Teichmann, and J. Clarke. The folding and evolution of multidomain proteins. Nat. Rev. Mol. Cell Bio., 8(4):319–330, 2007. doi: 10.1038/nrm2144.
[62] D. H. Goldman, C. M. Kaiser, A. Milin, M. Righini, I. Tinoco, and C. Bustamante. Mechanical force releases nascent chain–mediated ribosome arrest in vitro and in vivo. Science, 348(6233):457–460, 2015. doi: 10.1126/science.1261909.
[63] S. E. Leininger, F. Trovato, D. A. Nissley, and E. P. O'Brien. Domain topology, stability, and translation speed determine mechanical force generation on the ribosome. P. Natl. Acad. Sci. USA, 116(12):5523–5532, 2019. doi: 10.1073/pnas.1813003116.
[64] S. Batey, L. G. Randles, A. Steward, and J. Clarke. Cooperative folding in a multi-domain protein. J. Mol. Biol., 349(5):1045–1059, 2005. doi: 10.1016/j.jmb.2005.04.028.
[65] G. Kemp, O. B. Nilsson, P. Tian, R. B. Best, and G. von Heijne. Cotranslational folding cooperativity of contiguous domains of α-spectrin. P. Natl. Acad. Sci. USA, 2020. doi: 10.1073/pnas.1909683117.
[66] E. J. Miller, K. F. Fischer, and S. Marqusee. Experimental evaluation of topological parameters determining protein-folding rates. P. Natl. Acad. Sci. USA, 99(16):10359–10363, 2002. doi: 10.1073/pnas.162219099.
[67] W. C. Lo, C. C. Lee, C. Y. Lee, and P. C. Lyu. CPDB: A database of circular permutation in proteins. Nucleic Acids Res., 37:D328–D332, 2008. doi: 10.1093/nar/gkn679.
[68] K. R. Kemplen, D. De Sancho, and J. Clarke. The response of Greek key proteins to changes in connectivity depends on the nature of their secondary structure. J. Mol. Biol., 427(12):2159–2165, 2015. doi: 10.1016/j.jmb.2015.03.020.
[69] A. P. Marsden, J. J. Hollins, C. O'Neill, P. Ryzhov, S. Higson, C. A. T. F. Mendonça, T. O. Kwan, G. K. Lee, A. Steward, and J. Clarke. Investigating the effect of chain connectivity on the folding of a beta-sheet protein on and off the ribosome. J. Mol. Biol., 430(24):5207–5216, 2018. doi: 10.1016/j.jmb.2018.10.011.
[70] A. A. Komar, T. Lesnik, and C. Reiss. Synonymous codon substitutions affect ribosome traffic and protein folding during in vitro translation. FEBS Lett., 462(3):387–391, 1999. doi: 10.1016/S0014-5793(99)01566-5.
[71] W. D. Clarkson, A. H. Corbett, B. M. Paschal, H. M. Kent, A. J. McCoy, L. Gerace, P. A. Silver, and M. Stewart. Nuclear protein import is decreased by engineered mutants of nuclear transport factor 2 (NTF2) that do not bind GDP-Ran. J. Mol. Biol., 272(5):716–730, 1997. doi: 10.1006/jmbi.1997.1255.
[72] M. Alamo, D. J. Hogan, S. Pechmann, V. Albanese, P. O. Brown, and J. Frydman. Defining the specificity of cotranslationally acting chaperones by systematic analysis of mRNAs associated with ribosome-nascent chain complexes. PLoS Biol., 9(7):1–23, 2011. doi: 10.1371/journal.pbio.1001100.
[73] F. Willmund, M. del Alamo, S. Pechmann, T. Chen, V. Albanèse, E. B. Dammer, J. Peng, and J. Frydman. The cotranslational function of ribosome-associated Hsp70 in eukaryotic protein homeostasis. Cell, 152(1):196–209, 2013. doi: 10.1016/j.cell.2012.12.001.
[74] B. Nölting and D. A. Agard. How general is the nucleation-condensation mechanism? Proteins, 73(3):754–764, 2008. doi: 10.1002/prot.22099.
[75] A. S. Wagaman, A. Coburn, I. Brand-Thomas, B. Dash, and S. S. Jaswal. A comprehensive database of verified experimental data on protein folding kinetics. Protein Sci., 23(12):1808–1812, 2014. doi: 10.1002/pro.2551.
[76] The UniProt Consortium. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res., 47(D1):D506–D515, 2018. doi: 10.1093/nar/gky1049.
[77] J. M. Dana, A. Gutmanas, N. Tyagi, G. Qi, C. O'Donovan, M. Martin, and S. Velankar. SIFTS: Updated structure integration with function, taxonomy and sequences resource allows 40-fold increase in coverage of structure-based annotations for proteins. Nucleic Acids Res., 47(D1):D482–D489, 2018. doi: 10.1093/nar/gky1114.
[78] J. J. Almagro Armenteros, K. D. Tsirigos, C. K. Sønderby, T. N. Petersen, O. Winther, S. Brunak, G. von Heijne, and H. Nielsen. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat. Biotechnol., 37(4):420–423, 2019. doi: 10.1038/s41587-019-0036-z.
[79] J. Kubelka, J. Hofrichter, and W. A. Eaton. The protein folding 'speed limit'. Curr. Opin. Struc. Biol., 14(1):76–88, 2004. doi: 10.1016/j.sbi.2004.01.013.
[80] G. C. Rollins and K. A. Dill. General mechanism of two-state protein folding kinetics. J. Am. Chem. Soc., 136(32):11420–11427, 2014. doi: 10.1021/ja5049434.
[81] J. Song, K. Takemoto, H. Shen, H. Tan, M. M. Gromiha, and T. Akutsu. Prediction of protein folding rates from structural topology and complex network properties. IPSJ Transactions on Bioinformatics, 3:40–53, 2010. doi: 10.2197/ipsjtbio.3.40.
[82] A. S. Wagaman and S. S. Jaswal. Capturing protein folding-relevant topology via absolute contact order variants. Journal of Theoretical and Computational Chemistry, 13(01):1450005, 2014. doi: 10.1142/S0219633614500059.
[83] D. Thirumalai. From minimal models to real proteins: Time scales for protein folding kinetics. J. Phys. I France, 5(11):1457–1467, 1995. doi: 10.1051/jp1:1995209.
[84] A. M. Gutin, V. I. Abkevich, and E. I. Shakhnovich. Chain length scaling of protein folding time. Phys. Rev. Lett., 77:5433–5436, 1996. doi: 10.1103/PhysRevLett.77.5433.
[85] M. Cieplak, T. X. Hoang, and M. S. Li. Scaling of folding properties in simple models of proteins. Phys. Rev. Lett., 83:1684–1687, 1999. doi: 10.1103/PhysRevLett.83.1684.
[86] N. Koga and S. Takada. Roles of native topology and chain-length scaling in protein folding: A simulation study with a Gō-like model. J. Mol. Biol., 313(1):171–180, 2001. doi: 10.1006/jmbi.2001.5037.
[87] M. S. Li, D. K. Klimov, and D. Thirumalai. Dependence of folding rates on protein length. J. Phys. Chem. B, 106(33):8302–8305, 2002. doi: 10.1021/jp025837q.
[88] T. J. Lane and V. S. Pande. A simple model predicts experimental folding rates and a hub-like topology. J. Phys. Chem. B, 116(23):6764–6774, 2012. doi: 10.1021/jp212332c.
[89] S. O. Garbuzynskiy, D. N. Ivankov, N. S. Bogatyreva, and A. V. Finkelstein. Golden triangle for folding rates of globular proteins. P. Natl. Acad. Sci. USA, 110(1):147–150, 2013. doi: 10.1073/pnas.1210180110.
[90] T. J. Lane and V. S. Pande. Inferring the rate-length law of protein folding. PLoS One, 8(12):1–5, 2013. doi: 10.1371/journal.pone.0078606.

In this work, we assume that β-sheets fold slower than α-helices in general, and hence our model predicts that co-translational folding (CTF) will finish faster if β-sheets are located at the N terminal. Here we further examine this phenomenon using coarse-grained models of proteins, such that we also take into account the effect of tertiary structure. We study three proteins that exhibit significant asymmetry in secondary structure, with native structures taken from the Protein Data Bank (PDB): 1ILO, 2OT2, 3BID.

To investigate whether these proteins benefit from having β-sheets at the N terminal, we first calculate the time it takes to fold after starting from an unfolded configuration, $\tau_{\mathrm{fold}}$. Then, for a range of translation times ($\tau_{\mathrm{trans}} = L\tau_{\mathrm{AA}}$, where $\tau_{\mathrm{AA}}$ is the time it takes to translate one amino acid) we calculate the time it takes to undergo translation and folding, $\tau_{\mathrm{CTF}}$, for both translation from the N- to the C-terminal ($\tau_{\mathrm{CTF,N}}$), and from the C- to the N-terminal ($\tau_{\mathrm{CTF,C}}$). We then plot the speed-up, the ratio of the CTF time in the backward direction to that in the forward direction ($\tau_{\mathrm{CTF,C}}/\tau_{\mathrm{CTF,N}}$), against the ratio of folding to translation time, $R = \tau_{\mathrm{fold}}/\tau_{\mathrm{trans}}$ (Fig. 8). All three proteins folded faster (up to 21%) when β-sheets were translated first, and the maximum speed-up occurred when 1/ < R < […] $\tau_{\mathrm{fold}}$, we run simulations from the unfolded state at $T^* = 100$, with a time step of $dt^* = 0.001$ for 2 × steps; units are reported in reduced units according to the definitions given by Gromacs. We record folding time, $\tau_{\mathrm{fold}}$, as the time it takes for 90% of native contacts to be within 20% of the distance given in the PDB structure. We run the simulation 1,000 times to get an average.

To calculate co-translational folding time, $\tau_{\mathrm{CTF,N}}$ ($\tau_{\mathrm{CTF,C}}$), we initialise the protein in an elongated configuration and restrain all beads except the first bead at the N (C) terminal. We run the simulation for $\tau_{\mathrm{AA}}$, the time it takes to translate a single residue, and then remove the restraint from the next bead in the chain; we repeat until all restraints are removed. We also create a wall out of purely repulsive spheres packed on a square lattice, positioned at the boundary between free and restrained beads, which prevents interaction between the 'translated' and 'untranslated' parts of the protein. As each bead has its restraint removed, we move the wall along the chain. When the restraint is removed from the final bead, we remove the wall and run for a further 5 × or 1 × time steps. We measure $\tau_{\mathrm{CTF,N}}$ ($\tau_{\mathrm{CTF,C}}$) in the same way as $\tau_{\mathrm{fold}}$. We run the simulation in each direction 4,000 times to get an average. A full set of simulation parameters is available in the supplementary materials in the form of Gromacs input files and Python scripts for configuration set-up and analysis.

Table 1: The probability that a prokaryotic protein has a specific fold, P_Prok, compared with the probability that a eukaryotic protein has a specific fold, P_Euk. Results are shown for the 9 most common SCOP folds [3] identified in the PDB data used in this study.

| SCOP Class | SCOP ID | Count | Description | P_Prok | P_Euk | Probability ratio |
|---|---|---|---|---|---|---|
| a/b | 2000031 | 171 | TIM beta/alpha-barrel | 0.06 | 0.03 | 1.94 |
| a/b | 2000148 | 171 | SDR-type extended Rossmann fold | 0.06 | 0.04 | 1.33 |
| a+b | 2000014 | 115 | Ferredoxin-like | 0.04 | 0.01 | 2.85 |
| a/b | 2000088 | 76 | Methyltransferase-like | 0.02 | 0.02 | 1.10 |
| a+b | 2000303 | 65 | Acyl-CoA N-acyltransferases (Nat) | 0.02 | 0.01 | 4.14 |
| a | 2000002 | 62 | Globin-like | 0.00 | 0.07 | 0.07 |
| a/b | 2000016 | 58 | Rossmann(2x3)oid (Flavodoxin-like) | 0.02 | 0.01 | 1.98 |
| a/b | 2000607 | 58 | PLP-dependent transferase-like | 0.02 | 0.02 | 1.04 |
| a/b | 2000027 | 56 | ClpP-type beta-alpha superhelix | 0.02 | 0.00 | nan |
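The folding criterion used in the simulations described before Table 1 (90% of native contacts within 20% of their native distance) can be illustrated with a short sketch. This is not the analysis script distributed with the paper: the function names and the data layout (a list of coordinate frames and a dictionary of native contact distances) are hypothetical, and "within 20%" is interpreted here as a symmetric relative deviation, which is an assumption.

```python
import math


def pair_distance(p, q):
    """Euclidean distance between two 3D points given as (x, y, z) tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def fraction_native_contacts(coords, native_contacts, tolerance=0.2):
    """Fraction of native contacts whose current distance deviates from the
    native distance by at most `tolerance` (relative). ASSUMED interpretation."""
    formed = 0
    for (i, j), d_native in native_contacts.items():
        d = pair_distance(coords[i], coords[j])
        if abs(d - d_native) <= tolerance * d_native:
            formed += 1
    return formed / len(native_contacts)


def first_folded_time(trajectory, native_contacts, dt, threshold=0.9, tolerance=0.2):
    """Time of the first frame in which at least `threshold` of the native
    contacts are formed, or None if the chain never folds in the trajectory."""
    for step, coords in enumerate(trajectory):
        if fraction_native_contacts(coords, native_contacts, tolerance) >= threshold:
            return step * dt
    return None


# Toy three-bead example with made-up coordinates (reduced units).
native = {(0, 1): 1.0, (0, 2): 2.0}
frames = [
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (5.0, 0.0, 0.0)],  # extended
    [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (2.1, 0.0, 0.0)],  # near-native
]
print(first_folded_time(frames, native, dt=0.001))
```

Averaging `first_folded_time` over many independent runs would give an estimate of the mean folding time; the same function applies unchanged to co-translational runs once all restraints are released.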

[Figure 1 plots: secondary structure probability against sequence distance from ends, with curves for Helix, Sheet, Coil and Disorder at the N and C terminals. Panels (total sequences): SDR-type extended Rossmann fold (171); TIM beta/alpha-barrel (171); Ferredoxin-like (115); Methyltransferase-like (76); Acyl-CoA N-acyltransferases (Nat) (65); Globin-like (62); Rossmann(2x3)oid (Flavodoxin-like) (58); PLP-dependent transferase-like (58); ClpP-type beta-alpha superhelix (56).]

Figure 1: Secondary structure probability as a function of sequence distance from the N and C terminal respectively, for α-helices, β-sheets, random coil, and disorder. Separate plots are shown for the 9 most common SCOP folds in our sample of PDB proteins [3]. The folds that exhibit α-β periodicity in their secondary structure are highlighted in bold.

[Figure 2 plots: log k_f against sequence length (A) and contact order (B).]

Figure 2: Correlations between sequence length (A), contact order (B) and empirically-determined folding rate k_f [4]. Two-tailed Pearson's correlation gives: sequence length, r = − . , p < . ; contact order, r = − . , p < . 

Figure 3: Distribution of R, the ratio of folding to translation time, estimated by using the main fit to Eq. 7 (main text), and the 95% confidence intervals to the fit.

[Figure 4 plots: panels A and B show asymmetry distributions against log R for helices and sheets, for the fit and its 95% CI; panel C shows R_max in the plane of fit parameters A and B.]

Figure 4: A-B: α-β asymmetry distributions and N terminal enrichment for the full sample according to the 95% confidence intervals on the empirical fit for estimating k trans. Proteins are divided into deciles according to R; bin edges are shown on the x-axis. C: The median of the R decile that exhibits maximum asymmetry, R_max, as a function of fitting parameters A and B. The pink star corresponds to the fit used in the main text. The pink circles correspond to the fits used in A and B. The rings show the bootstrapped 95% confidence interval for values of A and B. Black rings are for the PFDB data set used in the main figures; green rings are for the less conservative PFDB data set. Solid rings indicate bootstrapped samples with the same sample size as the original sample; dotted rings indicate bootstrapped samples with half the sample size of the original sample.
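As a reading aid for panel C of Figure 4, the dependence of R on the fit parameters can be written out explicitly. This is a sketch under two assumptions that are not restated in this section: that Eq. 7 has the log-log form $\log_{10} k_f = A + B \log_{10} L$ (consistent with a power law in length), and that the folding time is the inverse of the folding rate, $\tau_{\mathrm{fold}} = 1/k_f$. With $\tau_{\mathrm{trans}} = L/k_{\mathrm{trans}}$:

$$
R = \frac{\tau_{\mathrm{fold}}}{\tau_{\mathrm{trans}}} = \frac{k_{\mathrm{trans}}}{L\,k_f},
\qquad
\log_{10} R = \log_{10} k_{\mathrm{trans}} - A - (1 + B)\,\log_{10} L .
$$

Under these assumptions, changing A shifts log R by the same amount for every protein, whereas changing B alters how strongly log R depends on sequence length.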

[Figure 5 plots: density against sequence length (top) and contact order (bottom), for domains in position 1 and position 2.]

Figure 5: Sequence length (top) and contact order (bottom) distributions for domains in position 1 (near the N terminal) and position 2 in two-domain proteins. Shaded area indicates bootstrapped 95% confidence intervals. The distributions do not differ based on domain position, which suggests that folding time does not depend on domain position.

[Figure 6 plots: N terminal enrichment against log R for domains in position 1 and position 2.]

Figure 6: N terminal enrichment of α helices (E_α) and β sheets (E_β) for individual domains within two-domain proteins, for domains in position 1 (left; closest to the N terminal) and position 2 (right). Proteins are divided into deciles according to R; bin edges are shown on the x-axis; whiskers indicate bootstrapped 95% confidence intervals. The data shows that domains in position 1 exhibit maximum β asymmetry when − . < R < . 4, in agreement with the proposed 'slowest-first' scheme. However, the confidence intervals are almost as large as the effect size, so we lack sufficient data to show a significant effect. Domains in position 2 do not appear to exhibit asymmetry in agreement with the 'slowest-first' scheme.
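The decile binning and bootstrapped confidence intervals used in these figures can be sketched as follows. This is a generic percentile bootstrap written for illustration, not the authors' implementation: the function names are hypothetical, the mean stands in for the enrichment statistic (whose definition is given in the main text, not here), and the data are synthetic.

```python
import random


def mean(values):
    return sum(values) / len(values)


def decile_bins(scores):
    """Indices of the items, grouped into 10 nearly equal-sized bins by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    return [order[k * n // 10:(k + 1) * n // 10] for k in range(10)]


def bootstrap_ci(sample, statistic=mean, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `statistic` (resampling with replacement)."""
    rng = random.Random(seed)
    n = len(sample)
    stats = sorted(
        statistic([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[max(int((1 - alpha / 2) * n_boot) - 1, 0)]
    return lower, upper


# Synthetic stand-in data: one log R value and one per-protein score each.
rng = random.Random(42)
log_r = [rng.uniform(-6.0, 3.0) for _ in range(500)]
score = [rng.gauss(0.0, 1.0) for _ in range(500)]

for k, members in enumerate(decile_bins(log_r)):
    values = [score[i] for i in members]
    low, high = bootstrap_ci(values, n_boot=200)
    print(k, round(mean(values), 3), round(low, 3), round(high, 3))
```

When an interval is as wide as the difference between neighbouring bins, as noted for the two-domain data above, the binned trend cannot be distinguished from sampling noise.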

[Figure 7 plots: N terminal enrichment against log R (top), sequence length (middle) and contact order (bottom), for the full sample, eukaryotes and prokaryotes.]

Figure 7: N terminal enrichment of disordered residues as a function of log R, sequence length, and contact order; data is shown for the full sample (left), eukaryotic proteins (middle), and prokaryotic proteins (right). Proteins are divided into deciles according to R (top), sequence length (middle), and contact order (bottom); bin edges are shown on the x-axis. There is a clear association between N terminal enrichment and slow-folding proteins in prokaryotes, but not in eukaryotes.

[Figure 8 plot: speed-up against R, with lines for τ_ribo = 0 and τ_ribo = 0.3 τ_trans.]

Figure 8: (Top) Secondary structure of three proteins that exhibit α-β asymmetry in line with the 'slowest-first' scheme (PDB IDs: 3BID, 1ILD, 2OT2). (Bottom) Maximum theoretical speed-up achievable as a function of R and τ_ribo (lines). Speed-up achieved in a simulation by translating proteins from the N- to the C-terminal (with slow-folding parts near the N terminal) compared to translating them from the C- to the N-terminal.

[Figure 9 plots: solvent accessibility probability and solvent accessibility asymmetry, log(N/C), against sequence distance from ends, for Buried, Middle and Exposed residues at the N and C terminals.]

Figure 9: Relative solvent accessibility (SA) and SA asymmetry as a function of sequence distance from the terminal residues. SA is calculated using the freesasa C library [5, 6], and residues are divided equally among three categories: Buried, Middle, or Exposed. It is impossible to calculate SA for disordered residues, and we assume that these are Exposed [7]. (Top) Data is shown for all proteins (left), eukaryotic proteins (middle) and prokaryotic proteins (right). (Bottom) Data is shown for helix, sheet and coil residues from eukaryotic proteins (left), for sheet residues from eukaryotic proteins (middle), and for sheet residues from prokaryotic proteins (right).

For eukaryotic proteins overall, the N terminal has a tendency to be more exposed than the C terminal. However, when only considering helices, sheets and coils there is no significant asymmetry; i.e., the bias for being exposed at the N terminal is explained by the bias for being disordered at the N terminal. When only considering sheets, the N terminal is slightly more buried, while the C terminal is more likely to contain 'Middle' values of SA. For prokaryotic proteins overall, the N terminal is more likely to be buried compared to the C terminal, which is mainly due to β sheets.

Sheet D E E l o g R E Prokaryote E Eukaryote E Prokaryote E Eukaryote E R D e n s i t y A All Prokaryotes Eukaryotes 1.0 1.5 2.0 2.5log L l o g k f B ACPro dataPFDB data

Figure 10: Repeat of main text Figures 2-3 after calculating R, the folding/translation time ratio, using the ACPro database [8] instead of the PFDB [4]. A: R distribution for the full PDB sample, prokaryotic proteins, and eukaryotic proteins. B: Scatter plot and correlation between sequence length L and folding rate k_f for both the ACPro and PFDB databases. Shaded area indicates 95% confidence interval of the linear fit. C: α-β asymmetry distributions as a function of R. Proteins are divided into deciles according to R; bin edges are shown on the y-axis. D: N terminal enrichment, the degree to which sheets/helices are enriched in the N over the C terminal, is shown for the deciles given in C. E: N terminal enrichment as a function of R for 4,633 eukaryotic proteins and 10,400 prokaryotic proteins. Proteins are divided into bins according to R; bin edges, shown on the x-axis, are the same as in C-D. Whiskers indicate 95% confidence intervals.

In contrast with the results obtained using the PFDB, estimating R using the ACPro database results in the prediction that for most proteins translation time is considerably longer than folding time. The principal reason for this appears to be a few proteins with either few residues (< 34) or many residues ( > − . < R < − . 2, where non-negligible acceleration of CTF is still possible.
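The decile binning used in panels C-E can be sketched as follows. This is only an illustration: the names `decile_bins` and `log_r` are hypothetical, and the quantile convention (the default "exclusive" method of Python's `statistics.quantiles`) is an assumption, since the caption does not state how bin edges were computed. Because the logarithm is monotonic, binning on log R gives the same groups as binning on R.

```python
from bisect import bisect_left
from statistics import quantiles


def decile_bins(log_r):
    """Return (edges, bin index per protein) for decile binning on log R."""
    # Nine interior edges that split the sample into ten equal-sized bins.
    edges = quantiles(log_r, n=10)
    # bisect_left places a value equal to an edge into the lower bin.
    idx = [bisect_left(edges, v) for v in log_r]
    return edges, idx


# Toy input: twenty evenly spaced values standing in for per-protein log R.
edges, idx = decile_bins([float(i) for i in range(20)])
print(edges)
print([idx.count(b) for b in range(10)])
```

Any per-protein quantity, such as an asymmetry score, could then be grouped by `idx` to produce one distribution per decile.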

Figure 11: Correlation between sequence length, contact order and α-β asymmetry. Separate plots are shown for different values of the cutoff used to calculate contact order: A, 6 Å; B, 8 Å; C, 10 Å; D, 12 Å.

[Figure 11 panels A-D, each with a Helix and a Sheet plot; axis labels: Sequence Length, Contact Order, asym]

References

[1] M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess, and E. Lindahl. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1-2:19–25, 2015. doi: 10.1016/j.softx.2015.06.001.

[2] S. Neelamraju, D. J. Wales, and S. Gosavi. Go-kit: A tool to enable energy landscape exploration of proteins. J. Chem. Inf. Model., 59(5):1703–1708, 2019. doi: 10.1021/acs.jcim.9b00007.

[3] A. Andreeva, E. Kulesha, J. Gough, and A. G. Murzin. The SCOP database in 2020: Expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res., 48(D1):D376–D382, 2019. doi: 10.1093/nar/gkz1064.

[4] B. Manavalan, K. Kuwajima, and J. Lee. PFDB: A standardized protein folding database with temperature correction. Sci. Rep., 9(1):1588, 2019. doi: 10.1038/s41598-018-36992-y.

[5] B. Lee and F. M. Richards. The interpretation of protein structures: Estimation of static accessibility. J. Mol. Biol., 55(3):379–IN4, 1971. doi: 10.1016/0022-2836(71)90324-X.

[6] S. Mitternacht. FreeSASA: An open source C library for solvent accessible surface area calculations [version 1; peer review: 2 approved]. F1000Research, 5(189), 2016. doi: 10.12688/f1000research.7931.1.

[7] J. A. Marsh. Buried and accessible surface area control intrinsic protein flexibility. J. Mol. Biol., 425(17):3250–3263, 2013. doi: 10.1016/j.jmb.2013.06.019.

[8] A. S. Wagaman, A. Coburn, I. Brand-Thomas, B. Dash, and S. S. Jaswal. A comprehensive database of verified experimental data on protein folding kinetics.