[PDF] Disordered peptide chains in an α-C-based coarse-grained model

Abstract

We construct a one-bead-per-residue coarse-grained dynamical model to describe intrinsically disordered proteins at significantly longer timescales than in the all-atom models. In this model, inter-residue contacts form and disappear during the course of the time evolution. The contacts may arise between the sidechains, the backbones or the sidechains and backbones of the interacting residues. The model yields results that are consistent with many all-atom and experimental data on these systems. We demonstrate that the geometrical properties of various homopeptides differ substantially in this model. In particular, the average radius of gyration scales with the sequence length in a residue-dependent manner.

Full PDF

aa r X i v : . [ q - b i o . B M ] J u l Disordered peptide chains in an α -C-based coarse-grained model Lukasz Mioduszewski and Marek Cieplak ∗ Institute of Physics, Polish Academy of Sciences,Al. Lotnik´ow 32/46, 02-668 Warsaw,Poland (Dated: July 23, 2018)We construct a one-bead-per-residue coarse-grained dynamical model to describe intrinsically dis-ordered proteins at signiﬁcantly longer timescales than in the all-atom models. In this model, theinter-residue contacts form and disappear during the course of the time evolution. The contactsmay arise between the sidechains, the backbones and the sidechains and backbones of the interactingresidues. The model yields results that are consistent with many all-atom and experimental dataon these systems. We demonstrate that the geometrical properties of various homopeptides diﬀersubstantially in this model. In particular, the average radius of gyration scales with the sequencelength in a residue-dependent manner.

I. INTRODUCTION

Coarse-grained models have provided valuable computational tools to gain insights into the temperature- orforce-induced dynamics of large conformational transformations of proteins. These models have been applied mostlyto proteins that are natively structured, usually globular, and they often consider the solvent to be implicit. Oneconvenient way to implement coarse-graining is to adopt the structure-based approach in which the dynamics aregoverned by contact interactions and the parameters involved are derived so that the native state coincides with theground state of the system. This idea embodies the principle of minimal frustration and leads to an optimal foldingfunnel.

The issues of conformational dynamics become center-stage when describing intrinsically disordered proteins(IDPs).

This is because such systems evolve through various competing basins of attraction instead of stayingmostly within one deep valley in the free energy landscape, corresponding to the native state. Some IDP systems,for instance α -synuclein, polyvaline (polyV) and polyglutamine (polyQ), have been studied by all-atomsimulations, despite persistent questions about the validity of the standard force ﬁelds in this case. However,the all-atom studies come with limited time-scales of the simulations and thus with restricted probing of the possiblebasins (this problem can be alleviated somewhat by eliminating the explicit solvent, by using GPU clusters and byshortening folding trajectories for structured proteins by involving eﬀective polarizable force ﬁelds. In addition, itis likely that on very long time scales many of the all-atom details of description cease to be relevant. Thus there isa need to develop coarse-grained models for molecular dynamics of the IDPs.The existing approaches are based on the Monte Carlo method, Brownian dynamics, discontinuous moleculardynamics and molecular dynamics with an explicit solvent. All of them use more than one pseudo-atom peramino acid. In addition, with the exception of the last approach, they cannot be easily applied in situations inwhich a protein is partially structured and partially disordered. Also, the existing hybrid Go-models with nonnativecontacts do not take into account directionality and distance dependence of the contacts.Here, we construct a novel one-bead-per-residue top-down approach formulated in the spirit of the structure-based models – speciﬁcally they involve contact interactions. The diﬀerence, however, is that the parametersused are not determined for one speciﬁc protein, but by a statistical method, based on a non-redundant setof structures from Protein Data Bank as collected in the CATH database. The parameters, especially thevalues of characteristic lengths in the contact potentials, incorporate amino acid speciﬁcity. Another diﬀerenceis that the establishment of contacts between the α -C atoms is governed not only by their mutual distance butalso by the expected directionality of the side chains as derived from the positions of the α -C atoms. In thisrespect, we borrow from several Monte Carlo studies pertaining to the natively structured proteins and ﬁbrilformation but provide a molecular dynamics implementation of this concept. It should be pointed out that, unlikemost other coarse-grained approaches, the nature and characteristics of a given contact may be time-dependent ∗ E-mail: [email protected] because an interaction between, say, two sidechains may switch to that between one sidechain and the other backbone.We show that, despite the rather approximate nature of our model, the approach yields results which are consistentwith many previous computational and experimental results pertaining to the average geometry of the polypeptidechains of various lengths, n . We also predict values of the geometrical parameters for 20 homopeptides of the samechain length ( n = 30) and demonstrate existence of substantial diﬀerences between the homopeptides made ofdiﬀerent residues. We have found that in the UniProt Knowledge Database, the naturally occurring homopolymerictracts range in the maximal sequential length, n max , between 6 (Ile, Tyr, Trp) and 79 (Gln). The full list ofthese maximal lengths is given as one of the entries in Table I. Nevertheless, arbitrary lengths can be consideredtheoretically which allows us to determine the scaling exponents for the variations of the average radii of gyrationwith n and show that they depend on the nature of the residue. It is expected that a chemical synthesis can lead tothe tract lengths that exceed n max .In this work, we consider only the case of monomers. However, the true strength of this coarse-grained approachis expected to show in studies of many-chain disordered systems such as gluten and of large protein aggregatesand complexes with disordered linkers such as arising in cellulosomes. These large systems are inaccessible tomeaningful all-atom modelling and assessing, say, the viscoelastic properties of gluten requires very long simulationtimes.

II. METHODOLOGY

Our coarse-grained approach is, in many ways, as outlined in our structure-based model except that now thereis no single native structure that could be used to derive parameters in the potentials. The peptide chains are repre-sented by harmonically connected beads that also participate in transient contact interactions. The beads are locatedat the α -C atoms that are at the distance, r b = 3 . i and j , separatedby the distance of r i,j , are described by the Lennard-Jones (LJ) potential V L − J ( r i,j ) = 4 ǫ (cid:20)(cid:16) σ i,j r i,j (cid:17) − (cid:16) σ i,j r i,j (cid:17) (cid:21) ,where the energy parameter ǫ is assumed to be of the same order as for the structured proteins: about 110pN ˚A obtained by matching to the experimental data on stretching. This value is close to 1.5 kcal/mol obtainedindependently by matching all-atom energies to the coarse-grained expressions. We determine the length parameter σ i,j = r min · (0 . , (where r min is the distance where the potential acquires its minimal value) based on thechemical identity of the residues involved. Its choice and the criteria for the contact formation will be discussedbelow. The elastic constant in the tethering harmonic potential, mimicking the peptide bonds, is set to k = 100 ˚A − · ǫ . A. Contact interactions

In structure-based coarse-grained models, the contact map is ﬁxed and is determined by the native structure. Allof these contacts are occupied only in the native state, otherwise the occupation is only partial. In the case of theIDPs, there is no a priori ﬁxed contact map. However, at each instant, there is a set of contacts that are active, i.e.they provide attraction between residues.In order to learn how to determine which contacts are functional, we ﬁrst make a survey of 21 090 structuredproteins from the database provided by the CATH database (the set of proteins with the sequence similarity notexceeding 40%: cath-dataset-nonredundant-S40.pdb) and determine the nature of the native contacts as deﬁned bythe overlap (OV) criterion that is discussed in more details in ref. Brieﬂy, the heavy atoms are represented byspheres. The radii of the spheres are equal to the context-dependent values as listed in ref. and multiplied by 1.24to account for attraction (the factor corresponds to the inﬂection point in the LJ potential). If at least one spherebelonging to amino acid i overlaps with at least one sphere belonging to amino acid j then we declare existence ofthe native contact between i and j . Unlike a procedure in which a contact is declared to be present based on a cutoﬀdistance, our criterion directly relates to the conformation of the side groups and to the sizes of the atoms.We distinguish three classes of the contacts depending on the residue fragment from which the OV-making spheresoriginate: side chain - side chain (ss), backbone - backbone (bb) and backbone - side chain (bs). For a given pairof amino acids there could be simultaneous overlaps belonging to diﬀerent classes. Independently of their number,only one eﬀective contact is used in the dynamical description. It corresponds to a given α -C – α -C distance. Wedetermine the distribution of these distances in the three classes: bb, bs and ss as illustrated in Fig. 1. However, inthese distributions, the presence of, say, an ss contact does not preclude a simultaneous OV of the bs or bb kind in agiven pair of residues. In our model, though, for a given distance, only one kind of contact can be operational for apair of residues. It should be noted that our eﬀective description involves only the α -C atoms instead of introducing(up to) two kinds of beads to represent a residue, one for the backbone atom and another for a side chain, as proposed,for instance, by Gu et al . In order to associate a characteristic distance with a unique type of contact, we derive subdistributions corre-sponding to situations in which the OVs arise only in one class (e.g., only ss). In the case of the ss contacts, we alsodivide them into pairs of speciﬁc residues as listed in Table II. In our model, like-charged residues do not form any sscontacts. Neither do glycine (Gly) and proline (Pro) as they can form only the backbone-involving contacts. Thus,the table has 165 entries (instead of 210).The numbers listed in the Table II correspond to the values of r min used. They range between 6.42 ˚A (Ala–Ala)and 10.85 ˚A (Trp–Trp). In principle, they are equal to the average values in the distribution. However, they arecalculated by also taking into account additional restrictions that relate to the directionality, as discussed later on,though the corrections due to these restrictions are minor. The large value of r min for the Trp-Trp pairs requires aspecial consideration as it would lead to unphysical extended conformations in polyW tracts that are not found inthe PDB. The longest polyW tract with a known structure is found in PDB:3N85. It has the length of 4 and it cor-responds to a turn. The extended conformations do not arise if the i, i +3 ss contacts in the Trp-Trp pairs are disabled.For the bb and bs cases, we take r min of 5.0 and 6.8 ˚A, respectively, without incorporating any speciﬁcity.Following previous works, the bb contacts are treated in a special way: their well energy is set to 2 ǫ (energyof every other contact is ǫ ) and they cannot be formed between the i, i + 4 pairs of residues. This approach takesinto account ,,renumbering” of the backbone hydrogen bonds in a coarse-grained representation of an α -helix: the i, i + 4 bb contacts that have been identiﬁed by the OV criterion, applied to globular structures, have the α -C - α -Cdistances that are close to 6 ˚A (data not shown) and over 98% of them are accompanied by a simultaneously arisingbs contact. Therefore, it is suﬃcient to attribute the eﬀective i, i + 4 interaction to the bs-based contact. In all otherbb contacts, the typical distance is about 5 ˚A and an association with a bs contact is much less common. Doublingthe strength of the bb contacts compensates for the strict criteria for forming them, agrees with the earlier literatureﬁndings and conserves the balance between the number of each type of contacts by making the bb contacts harderto break by thermal ﬂuctuations (in structured proteins there are about twice as many ss contacts as the bb contacts). The time evolution is based on the equations of motion for the α -C atoms. During its course, an eﬀective contactmay appear or disappear based on the distance between the α -C atoms (as illustrated in Fig. 2) and on the shape ofthe conformation. Switching the contacts on and oﬀ is done quasi-adiabatically.At any instant, a contact between a given pair of residues is allowed to arise only as a result of one mechanism,corresponding to its speciﬁc value of r min . A new type of mechanism is allowed to act only after the potentialassociated with the previous one drops to zero. Thus before a new type of contact can be turned on, the previous typeis completely turned oﬀ. In practice, dynamical switching between various contact mechanisms occurs infrequently.The electrostatic interactions are not considered to be contacts in the sense used in our molecular dynamics model. Itshould be noted that if one uses the OV criterion, then the various kinds of the overlap may act together but – in thestructure-based model – this results just in single standard values of the energy and length parameters of the contact,independent of the number of the contributing OVs. Thus, in our dynamical model two mechanisms cannot acttogether so that the contact does not get described by a sum of, say, two potentials. However, it might be interestingto consider a model in which the eﬀective r min is a weighted average of the length parameters corresponding todiﬀerent successive mechanisms.All beads are endowed with the repulsive potential to ensure that the chain never passes through itself. Thispotential is always turned on, regardless of the possible additional presence of the attractive potentials. The repulsionis described by the LJ potential truncated at r min of 5 ˚A. For globular proteins the commonly used value is 4 ˚A,but the less constraining potentials appearing in our description of the IDPs, and hence an enhanced ﬂexibility ofthe chain, require boosting the eﬀective backbone stiﬀness by making the excluded volume larger.There are several conditions that need to be satisﬁed in order to establish a contact. The ﬁrst of these pertainsto the distance, r on , at which the LJ potential between two beeds is switched on in addition to the repulsion. Weset r on to be equal to r min for any given type of a contact. The other conditions will be speciﬁed in the next twosubsections. An established contact is switched oﬀ if the α -C i – α -C j distance exceeds 1.5 σ i,j . B. Eﬀects of the directionality

A bb contact may arise if the N-atom on the backbone part of the i th residue can form the hydrogen bond withthe O-atom on the backbone part of residue j or vice versa. This is allowed only if the two atoms point towardseach other. This means that the vector from α -C to O in one residue should be approximately antiparallel to thevector from α -C to N in another. In addition, the lines set by these two vectors should nearly coincide (insteadof being far away). In order to capture this physics by involving only the α -C atoms, we use the h i introduced inreferences that starts from the α -C atom on residue i and is perpendicular to the plane that is set by sites i − i and i + 1, i.e. parallel to v i × v i +1 , where v i = r i − r i − and r i,j = r j − r i . Examples of these vectors are shownon the bottom panel of Fig. 2. It can be seen that they should be lying along nearly parallel (or antiparallel, e.g. inantiparallel β -sheets) directions. This is captured by the three directional conditions imposed on the formation of anbb contact : a) | cos ( h i , r i,j ) | > .

92 (the threshold angle of 23 o ), b) | cos ( h j , r i,j ) | > .

92 and c) | cos ( h i , h j ) | > . o ). The values of the thresholds have been obtained from statistical distributions of theseangles. Deﬁning the directionality associated with the ss contacts can be accomplished by introducing the normal vector n i = r i − + r i +1 − r i | r i − + r i +1 − r i | . (1)The negative normal direction ( − n ) approximately points from the α -C to β -C atoms (see the top panel of Fig.2). A more accurate, but also more complicated, statistical expression for the location of the β -C atom can befound in the references. However, this more accurate description is valid for 82% of the ss contacts determinedby the OV criterion, while the simpler description is valid for an even higher value of 97%. For an ss contactto form, it is necessary that the side chains point at each other. There are two conditions for this to happen :1) cos( n i , r i,j ) < . n j , r j,i ) < . o ). The conditions for the ss contactare qualitatively distinct from those for the bb contact because the nature of the ss contacts is much more di-verse (resulting also in less restrictive thresholds), whereas the bb contacts have a speciﬁc hydrogen bonding pattern. Similar considerations lead to the following two conditions for a bs contact to form (the sidechain of the i thresidue interacts with the backbone of the j th residue, see the middle of Fig. 2): 1) cos( n i , r i,j ) < . | cos( h j , r j,i ) | > . r min of the bs contacts from 6.8 ˚Ato 5.4 ˚A, but this does not change the results in any signiﬁcant manner.Acceptable values of cos( n i , r i,j ) seem to be largely independent from the r i,j distance itself. Examples ofdistributions on the plane corresponding to cos( n i , r i,j ) and r i,j are shown in Fig. S4 in the Electronic SupplementaryInformation (ESI). We do not introduce any potentials that would favor the directionalities to stay within theappropriate bounds – they are checked only at the instant when the contact is established. C. The types and speciﬁty of eﬀective residues

We group the residues into six classes: 1) Gly, 2) Pro, 3) hydrophobic: Ala, Cys, Val, Ile, Leu, Met, Phe, Tyr,Trp, 4) polar: Gln, Ser, Thr, Asn, His 5) negatively charged: Asp, Glu, and 6) positively charged: Arg, Lys (seeTable I). The last two classes count as polar but they also carry charge. The distinction between the polar andhydrophobic residues matches (with the exception of Ala) the clustering (based on the eigenvalue decomposition)of the Miyazawa-Jernigan interaction matrix into two clusters. Contacts of the ss kind can be made between thehydrophobic residues, but also between the polar ones (crucial for modeling of polyQ) and between the polar andhydrophobic residues. The unlike-charged polar residues may form salt bridges but the like-charged ones merelyrepel. The energy parameter for each type of the ss contacts is the same. This, together with allowing for theformation of hydrophobic-polar ss contacts, distinguishes our model from the simple ,,HP” models, where suchcontacts either have reduced strength or are not allowed to arise. Thus, even though our model cannot capture thedetails of all possible interactions (for instance, between the π electrons on the faces of aromatic rings and cations),our models allows for them statistically.Real cysteine residues may form disulﬁde bonds, but this process is not covered in this paper, as none of theconsidered systems seems to form such bonds. They can be implemented in a coarse-grained model as discussed inreferences. Each residue is allowed to form up to z s contacts with other residues. The number z s is residue-dependent. Wetake z s = n b + min( s, n H + n P ), where n b is the permitted number of backbone contacts, s is the maximum allowednumber of the sidechain contacts and n H , n P are the maximal numbers of sidechain contacts with hydrophobicand polar sidechains of other residues, respectively. The values of these parameters are listed in Table I. For a bscontact, one backbone ”slot” assigned to one residue must be combined with one sidechain ”slot” assigned to theother residue. The backbone does not count as polar in bs contacts, so it aﬀects only the total s limit for a givensidechain. Treating the backbone as a polar entity merely changes the proportions between the types of contactsarising without improving the results. n b is equal to 2, corresponding to two possible hydrogen bonds. In the case ofPro, n b = 1.The numbers n H , n P and s depend on the type of the residue and were derived from the structure database bycalculating, for each residue of a given type, the distribution of its observed numbers, s c , of the ss contacts, i.e. thecoordination numbers. s was operationally deﬁned in relation to s max - the values of s c corresponding to the maximain such distributions. For the hydrophobic residues, the distributions are broad and we take s = s max . For thepolar residues, the distributions are narrow and then we take s = s max + 1. The polar residues usually stay nearthe surface of a protein and thus some of their possible ’contacts’ are with the molecules of water instead of withthe other residues and hence the under-count in the distribution. This is also the reason why we do not determine s by counting the possible hydrogen bonds that could arise. The values of n H , n P were determined similarly, bycalculating two-dimensional distributions of pairs of numbers of hydrophobic and polar contacts (with maxima atpoints s Hmax , s

P max ). Then, n H corresponds to the maximum s Hmax of such distribution (the most common numberof hydrophobic ss contacts for a given type of amino acid), n P to s P max + 1. The values of s , n H , n P for the diﬀerenttypes of residues are listed in Table I.It has been suggested that electrostatics, speciﬁcally the net charge per residue, may inﬂuence the nature ofconformational ensembles of the IDPs substantially. Thus, in addition to the ss, bb, and bs contacs, the chargedresidues interact with each other via the modiﬁed Debye-Hueckel potential V D − H = e exp ( − r/λ )4 π κκ r (2)with the screening length λ = 10 ˚A. Following Tozzini et al. , the relative permittivity is taken to be κ = 4 ˚A − r (generally, κ changes with the distance sigmoidally so that it is around 4 well within the protein and around 80 at largedistances; κ is the permittivity constant). This leads to the eﬀective electrostatic potential which is approximatelyequal to V el ( r ) =

85 exp ( − r/λ ) ǫ ˚A r . This long-range interaction does not count into the z s limit. When two oppositelycharged residues form an ss contact, their interaction becomes described by the standard LJ contact potential with r min taken from Table II, in an analogy to the way the salt bridges are described in the structure-based models of thestructured proteins. When a salt bridge is adiabatically turned on, other electrostatic interactions made by these tworesidues are turned oﬀ in the same manner, so that the local geometry of the bridge is stable. However, we have foundthat this precaution does not aﬀect the results noticeably. Like-charged residues cannot form the ss contacts but theymay participate in the bb contacts. This is possible for certain conformational geometries or for situations in whichat least one of the charges is eﬀectively removed due to the formation of the ionic bridge with another residue. Theﬁrst and last residues in the chain are not allowed to form attractive inter-terminal LJ contacts so that the proteindoes not form a closed loop (in Enciso and Rey’s model the inter-terminal contacts may exist, but their strengthwas reduced). In an all-atom model, the termini do come with opposite electric charges but this is usually not so incoarse-grained models in which the terminal residua have the same properties as the non-terminal ones. D. The backbone stiﬀness

The backbone stiﬀness is usually deﬁned in terms of bond and dihedral angles. In structure-based models ofproteins, these angles can be measured as deviations from their local native values. In the IDP case there are no suchreference values and we resort to a statistical approach as proposed by Ghavani et al. In this approach, and when determining the distribution of bond angles, one divides the residues into threetypes: Gly, Pro and X, where X denotes all other 18 residues. There are 27 possible combinations of these residuesand the bond angle associated with the middle of the three consecutive residues is governed by a distributionspeciﬁc to the triplet. The distributions have been obtained by matching to the experimental Ramachandranangles and are found to be fairly broad: they typically span about 60 o whereas the structure-based potentialsare signiﬁcantly narrower. We use the potentials that distinguish only the middle residue and whether it precedesproline, which results in dealing with only 6 possible combinations. The sequential order of the residues mattersbecause it deﬁnes the local chirality. Thus we can only distinguish cases in which a proline is preceded but notwhen it is succeeded. The examples of statistical potentials are shown in Fig. S5 in ESI. In our molecular dynamicssimulations, we ﬁt each statistical potential to a sixth degree polynomial with the coeﬃcients listed in Table S1 in ESI.The dihedral angle requires a consideration of four consecutive residues that can be viewed as a superposition oftwo triplets that overlap at two central residues. In the three-alphabet case, there are 9 possible overlaps and thus9 distributions. We show examples of the statistical potentials for these distributions in Fig. S6 in ESI. We ﬁt thestatistical potentials to the formula: a sin( x ) + b cos( x ) + c sin ( x ) + d cos ( x ) + e sin( x ) cos( x ), where the values ofthe coeﬃcients are also listed in Table S2 in ESI.Both of the bond-angle and the dihedral potentials were obtained by the inverse Boltzmann method at the roomtemperature applied to the random coil database statistics. These potentials correctly describe properties of denat-urated proteins, that can be assumed to have no contacts. A priori, the energy scale involved may not exactlymatch the one associated with the contact energy. We have checked, however, that doubling the amplitudes of theangle-involving potentials does not aﬀect the geometrical parameters of the test systems in any signiﬁcant manner. E. The molecular dynamics simulations

The solvent is implicit. The time evolution is deﬁned in terms of molecular dynamics with the velocity-dependentdamping and the Langevin noise, both representing the inﬂuence of the solvent. The noise corresponds to temperature T . The characteristic timescale, τ is of order 1 ns since the system is considered to be overdamped so that the motionof the beads is diﬀusional instead of ballistic. The damping constant, γ , is taken to be 2 m/τ , where m is the typicalmass of a residue. More realistic values of γ should be about 25 times larger but adopting them would lead to muchlonger conformational dynamic.As described earlier, there are three types of requirements for a contact to be legitimate: 1) on the distancebetween the residues, 2) on the proper directional placement of the implicit side groups and backbone hydrogenbonds, 3) on the number of contacts that the residues are permitted to form. If all of these requirements arefulﬁlled, a contact is made. A sudden creation of a contact may lead to instabilities and hence we switch thecorresponding contact potential on adiabatically: the depth of the potential well increases linearly form 0 to ǫ within the timescale of 10 τ . This timescale is suﬃciently long for the system to thermalize so the process isquasi-adiabatic. We have observed that shorter switching-on times lead to heating of the system (see the toppanel of Fig. S7 in ESI). Longer times, up to 40 τ , do not aﬀect the statistical properties that we measure.Still longer times make the chains more mobile which shows as a slight expansion of the system (bottom panelof Fig. S7 in ESI). Increasing γ is expected to lead to more stable dynamics and longer permissible switching-on times.We integrate the equations of motion by the ﬁfth order predictor-corrector algorithm and each τ is discretizedinto 200 steps. Thus the adiabatic switching-on process is accomplished in 2 000 steps. Switching oﬀ of the contacts,when the contact distance crosses 1.5 σ i,j , is implemented with the same time scale. If a pair of residues is connectedby a contact of one type, it cannot be simultaneously connected by a contact of another type.As explained in reference the kinetic and thermodynamic behavior of a coarse-grained model depends on thetype of the backbone stiﬀness used. The T -range in which folding of structured proteins is optimal is around 0.3 –0.35 ǫ/k B if the backbone stiﬀness is described by a chirality potential. In this case then 0.35 ǫ/k B corresponds tothe room temperature, T r , which is consistent with the callibrated value of ǫ . However, for the bond and dihedralpotential T r shifts to about 0.7 ǫ/k B , because these terms contribute more substantially to the energy.In order to determine T r for our IDP model, we perform a similar folding test for several globular proteins with thePDB structual codes of 1GB1, 1TIT, 1UBQ, and 2M7D. We take the native contacts map but adopt the statisticalangular potentials for the backbone stiﬀness. We ﬁnd that the kinetic optimality is reached around 0.3 – 0.35 ǫ/k B .In addition, Fig. S8 in ESI shows that for 0.3 ǫ/k B , the average end-to-end distance matches the room-temperatureall-atom results for polyQ for n up to 60 and it does not for higher temperatures. The matching at 0.2 ǫ/k B iscomparable, but folding is signiﬁcantly worse than at 0.3 ǫ/k B due to the emergence of more potent kinetic traps.Moreover, Fig. S8 demonstrates a similar matching to the experimental results for the ﬂanked versions of polyQ. Wehave also found that best agreement with experimental data for polyproline occurs for temperatures in range 0.3 –0.4 ǫ/k B . (results for T = 0 . ǫ/k B will be discussed later). Thus, in our simulations we take 0.3 ǫ/k B as the T forwhich most simulations were made.The results on the geometry of the systems were obtained based on 100 independent trajectories, each lasting for100 000 τ . The time evolution in the ﬁrst 5 000 τ was excluded from the data acquisition to allow for equilibration.The conformations were saved every 100 τ , unless stated otherwise. For structured proteins, we started from thenative structure whereas for the IDPs – from a self-avoiding random walk. III. RESULTS

The main purpose of the performed simulations is to provide a validation of our model. We use three parametersto characterize the geometrical properties of an IDP. One is the time-averaged radius of gyration R g = q h r g i ,where r g is the instantaneous value of this radius. Another is l – the time averaged end-to-end instantaneousdistance d ee ( l = h d ee i ). The third is σ = p l − h d ee i , which is the dispersion in d ee . As described by R´o˙zycki et al. σ can be used to deﬁne the eﬀective elastic stiﬀness, κ , of the system, since energy equipartitionimplies that κσ = k B T . The nature of our coarse-grained approximation does not allow us to make predictionsabout local structural parameters such as the Ramachandran angles and thus to make the corresponding comparisons.Generally, we test the predictions of our model against the results obtained through all-atom simulations, in whichthe geometrical parameters were calculated, and against the experimental results. We have chosen the NMR andSAXS experiments in which few interpretational assumptions about the parameters were made. The comparisonsare shown in Figs. 4, 5, and 6. The ﬁrst two of these present comparisons to the all-atom results and the last – tothe experimental ones. A. PolyQ and polyV

Fig. 4 pertains to polyQ and polyV for n of up to 60. We use the notation Q n , V n , and similarly for otheramino-acid homopolymers. Cossio et al. have generated 30 063 statistically independent conformations for V byusing all-atom simulations in a meta-dynamics approach involving the replica-exchange method. The solvent wasimplicit and treated within the generalized Born surface area approach. They concluded that only a fraction of theseconformations is structurally similar to the proteins listed in the CATH database. They took it as an evidence thatthere are evolutionary pressures that favor only selected classes of conformations. Here, we calculate l , σ , and R g for7 077 confromations of V , that were made available to us by the authors, by assuming that these conformationsapproximately represent an equilibrium trajectory. Our model is seen to generate parameters that are larger by afactor of around 2 (depending on which quantity is considered) for n =60, but the diﬀerences are expected to getsmaller for lower values of n .It should be noted that V n does not exist in nature whereas long tracts of polyQ do arise in various proteins, forinstance in the context of Huntington disease. G´omez-Sicilia et al. have applied the approach of Cossio et al. toQ n with n between 16 and 80. The statistics of the independent conformations were 308, 491, 330, 422, 479, 322,269, 246, and 108 for n =16, 20, 25, 30, 33, 38, 40, 60, and 80 respectively. We do not consider n =80 here due to thesmaller statistics.Fig. 4 demonstrates that the geometrical parameters, calculated based on the statisticaly independent confor-mations obtained by G´omez-Sicilia et al. agree fairly well with the results of our model, especially in the case ofthe parameter l . Generally, the smaller the n , the closer the agreement, especially in the case of l . We consider theagreement to be adequate enough to model gluten (under study) which consists of many Q-rich chains. We haveobserved that a better agreement with the all-atom data on polyQ can be obtained by increasing the value of theparameter s (the number of possible sidechain contacts) from 2 to 3. However, we keep s = 2 in order to achieveoptimal consistency for all systems considered here. Increasing the limit for backbone contacts, n b , from 2 to 3,results in a similar improvement for R g (though not for σ ) for this system.One of the ﬁndings of G´omez-Sicilia et al. was that a substantial fraction (9.3%) of the statistically independentconformations of Q are knotted and that the sequential extension of the knotted segments starts at about 36residues. Such knotted conformations may jam the proteasome and thus lead to toxicity. In experimental systems, the toxicity arises above a threshold of about 35.Our coarse-grained model has produced several knotted conformationsfor Q (all with the shallow knots) and none for V . In the all-atom model, V had a factor of 3 smaller propensityto form knots than Q so there is a qualitative agreement. An example of a knotted conformation obtained forQ is shown in the rightmost panel of Fig. 7. Increasing s from 2 to 3 has been found to enhance the prob-ability of knotting (but still below 9%). (We ﬁnd that more substantial knotting propensities are exhibited by polyW).Since the IDP systems ﬂow from one conformation to another, it is interesting to ask for how long do they stay inparticular conformations. One measure to assess this is to probe the system every 1 τ and determine the time neededto deviate in RMSD (root mean square deviation) from the considered conformations by 5 ˚A. For Q the distributionof such times extends up to 65 τ and has the mean of 24 τ and the dispersion of 10 τ . A typical duration time obtainedin the all-atom simulations is up to of order 20 ns, i.e. comparable to that obtained in the coarse-grained simulations.Another way to characterize conformational transformations is to use the fraction, f cc , of the contacts thatare common with a reference structure. The RMSD is more related to the overall shape whereas f cc to thenature of the local cohesion. For instance, bending of the chain segments results in an increase in the RMSDbut may leave f cc largely unchanged, indicating persistence of the local structure. The time dependence of thetwo measures for a trajectory obtained for Q is shown in Fig. 3. We consider three reference structures: one(the lines in green; denoted by A) obtained at 10 τ , another (the lines in black; denoted by B) at 2 10 τ and still another (the lines in blue) at 3 10 τ . The red vertical line indicates time at which, based on f cc , theconformation stops being more alike to conformation A and becomes more alike to conformation C. This happensat about 1 .

53 10 τ indicating that the changes in the contact map take place in the scale of microseconds.Interestingly, a similar transition to conformations similar to B (the black vertical line), which is closer in time,takes place at the later instant of 1 .

67 10 τ . In contrast, the RMSD crosses the 5-˚A treshold much more rapidly –within 0 .

05 10 τ . A more detailed characterization of the time evolution in terms of f cc can be found in ESI (Fig. S9).Experimental FRET data for polyglutamine chains of length m equal to 8, 12, 16, 20 and 24 have been obtained byWalters and Murphy for two values of pH: 7 and 12. These chains, however, are ﬂanked by the sequence KKW onthe N-terminus and by AKK on C-terminus so the backbone length n = m + 6. Thus the model-based results requirerecalculations, especially because the presence of the lysines brings in electrostatic eﬀects, providing another kind oftest of the model. Fig. 5 shows our results compared to the experimental ones for pH=7. The systems are denoted asKQmpH7 and the data points are shown in green. There is a close agreement for l but a discrepancy for σ . The ex-perimental results on σ apper unreasonably small though, particularly at n of 30 (the case shown in the ﬁgure), wherethe Coulomb eﬀects should be too weak to keep the system stiﬀ and barely ﬂuctuating. All theoretical data points aredistinct from the range of values found in systems without the ﬂanking lysine (Fig. 4). If we assume that changingpH from 7 to 12 aﬀects mostly the protonation level of the lysine (according to Enciso et al. ), then this should beequivalent to replacing the charged lysine by the neutral asparagine. Our theoretical results with the asparagine showcorrelations with the experimental data at pH=12 (see Fig. S10 in ESI) of a similar quality as with the lysine at pH=7. B. PolyA and polyP

Another interesting system is polyA. The all-atom results, which used the TIP3P model for the molecules ofwater and the CHARMM36 protein force ﬁeld, have been derived for l and σ and for a set of values of n . Theyare compared to our coarse-grained results in Fig. 5. We observe that there is a good agreement for small valuesof n , up to 15, when the conformations are helix-like (see also ref.) as illustrated in the left panel of Fig. 7. Atlarger values of n , the coarse-grained data fall down relative to the all-atom ones. We attribute the discrepancyto the diﬀerences in the timescales. In the all-atom simulations it is 100 ns whereas in the coarse-grained model –of order 5 µ s, i.e.

50 times longer and repeated in 100 trajectories. In the latter situation, the chain can exploresigniﬁcantly more phase space and adopt bent conformations such as the one shown in the middle panel of Fig. 7.Such bent conformations allow for a closer approach of the termini to each other than in the case of a helix. Theyhave been observed in simulations with an implicit solvent for n exceeding 40. The threshold n above which thebent conformations may appear must depend on the description of the backbone stiﬀness – in our case the backboneis rather soft (the constraints on the bond angle are weak) and hence the lowering of the threshold. A better modelingof the bond-angle potential should increase the treshold.The results for polyP are shown in Fig. 6. The agreement with the FRET-based data on l and σ is prettygood. All three geometrical parameters exhibit a relative steep growth with n , reﬂecting the fact that the shapeof the conformations is governed primarily by the backbone-stiﬀness potentials so that the details of any non-localpotentials have only a minor impact. C. Other systems

In studies on glycine-rich systems the FRET eﬃciency is not used to calculate l directly. Instead, it servesas a constraint that is employed to reconstruct the distance distributions by using simple models. These systemsincorporate repetitive sequences containing glycine (like GGS or GGGGS, collectively named ,,GS”). They areexpected to be prone to a substantial ﬂexibility. This expectation is conﬁrmed (see the plot of σ in Fig. 5).Furthermore, there is a good agreement of the model results with the FRET-based data.We now consider the disordered cellulosomal linkers that are listed in Table 1 of reference. In order to avoidcrowding in Fig. 5, we show the results only for the linkers with n =10 and n ≥

17 (the symbols in blue). The datapoints are scattered – however, they are comparable to the all-atom explicit-solvent simulations for these systems(there are no all-atom results for R g ).IDP’s may require using diﬀerent force ﬁelds than the structured proteins. Huang et al. have recently proposeda new force ﬁeld, CHARMM36m, which should show an improved performance both in the IDP and structuredcases. Several proteins were considered as test cases. Among them, the FG-nucleoporin and the RS peptide, bothdisordered. Fig. 5 shows that our model gives results that are comparable to those obtained by the novel force ﬁeld( l and σ are available for the RS peptides and R g for the FG-nucleoporin peptide). The agreement gets worse forHistatin-5 (His5, used to parameterize the Amber and Gromos force ﬁelds) probably because we consider His to beuncharged. An admixture of the charged states of His should introduce electrostatic repulsion and thus larger valuesof R g .The Protein Ensemble Databank (PE-DB) contains a number of structural data which were obtained throughsimulations which are considered to be consistent with the ensembles obtained in the NMR and SAXS experimentsdata on IDP’s. These proteins have n considerably larger than the systems considered so far. We ﬁnd (see Fig.6) that our results on R g , though not coinciding with the PE-DB results, are of comparable size. However, thediscrepancies are larger for 6AAA and 9AAA (not shown). It should be noted, however, that the PE-DB results wecompare to were obtained with the use of particular all-atom and/or coarse-grained models. In addition they werederived for a given timescale which may or may not be adequate. D. Comparisons of homopeptides

Table I lists the known maximal lengths of tracts formed by amino acids of the same type, as obtained from theUniProt database. The longest of them correspond to Gln (79), Ser (58), and Asn (46). For Ala the length is 24 butfor the sligthly larger Val – only 9. Nevertheless, it is instructive to simulate homopeptides of unrestricted length. Inparticular, Cossio et al. have studied the chains of Val with n =60. Here, we simulate all homopeptides of length 30to bring out the diﬀerences in behavior between them, as derived using our model. The results on the geometricalparameters are shown in Fig. 8 and on the properties of the contacts in Fig. 9. The properties of the contacts areassessed by providing the average coordination number, < z > , and the average sequential distance in the contacs,0 < | j − i | > . The former includes the contacts made by the dipeptide bonds along the backbone. The latter, whendivided by n , deﬁnes a parameter known as the contact order.All of the homopeptides of length 30 appear to be IDP’s since σ is always substantial: it exceeds 5 ˚A. In agreementwith Uversky and Mao et al. , there is a clear diﬀerence between the charged polymers (polyK, polyE, polyR,and polyD) and the collapsed non-charged systems. The charged homopetides have the largest values of R g and l asthey do not form any contacts. PolyP comes next – these chains also do not form contacts. Due to the restrictedbending capabilities, polyP is the stiﬀest of the homopolymers considered here (the lowest vakues of σ ). However,the properties of polyW appear to be the most extreme: it shows the lowest values of R g and l and the highest valuesof < z > and < | j − i | > . These features result from the largest value of r min and hence from the dominance ofnon-local ss contacts (every W residue can form up to 4 of them). There are almost no bb contacts in the simulatedpolyW systems. PolyW is also the most likely to generate knotted structures. The knots appear in 16% of thetrajectories and last for about 5% of total simulation time.Other hydrophobic homopolymers are generally more compact and less ﬂexible than the polar ones, with theexception of polyG and polyA that are endowed with short side chains. This can explain the fact that the longesthydrophobic residue tracks seen in the UniProt database are shorter than those made from the polar residues (againwith the exception of polyG and polyA that are more ﬂexible and thus more accomodating.)Another way to compare the homopeptides is to study the scaling behavior of radius of gyration in a broad rangeof the values of n : between 10 and 100. For homopolymers with n = 30 each MD trajectory was 50 000 τ long. Weobserve a good agreement with the power law for R g ∼ n ν (3)The values of the exponent ν are listed in Table I. We ﬁnd that ν ’s for the large hydrophobic amino acids are smaller(which corresponds to the poor solvent regime) than for the polar and small hydrophobic amino acids (for whichwater should be better solvent). Thus our model is seen to capture the substantial diﬀerences between the residuetypes in the correct manner. IV. CONCLUSIONS

We have constructed an α -C based molecular dynamics model of IDP which, despite its simplicity, yields structuralresults that are in a reasonable agreement with all-atom and experimental data and allows for an access to signiﬁcantlylonger timescales than in the all-atom simulations.There is a category of IDPs that acquire structure on binding with a substrate. Our model is likely to fail whenusing it to predict the speciﬁc structure that arises on binding. The model requires a more stringent ﬁne tuning forthis purpose. We have also found that the potentials used in our model cannot keep any structured proteins in theirnative conformation, as equilibrium inter-residue distances are diﬀerent. Thus applying our model to study, say,folding of the structured proteins leads to a fairly chaotic behavior. In this case, using the structure-based models ismuch more eﬀective.An alternative way to introduce speciﬁcity into the ss contacts is to use the sums of statistically determined radiiassociated with the residues instead of the pair-wise characteristics summarized in Table II. This approach wouldintroduce merely 20 residue-related parameters. It remains to be explored whether it provides an improvement overthe methodology presented here. Acknowledgements

We appreciate fruitful discussions with B. R´o˙zycki as well as receiving help of G. Matyszczakand K. Wo lek with the calculations. MC has received funding from the National Science Centre (NCN), Poland,under grant No. 2014/15/B/ST3/01905 and by the EU Joint Programme in Neurodegenerative Diseases project(JPND CD FP-688-059) through the NCN grant 2014/15/Z/NZ1/00037. He was also supported by the NCN grantNo. 2016/21/B/NZ1/00006. The computer resources were supported by the PL-GRID infrastructure and also ﬁnancedby the European Regional Development Fund under the Operational Programme Innovative Economy NanoFun1POIG.02.02.00-00-025/09. M. Levitt and A. Warshel, Computer simulations of protein folding.

Nature , 1975, , 694-698. M. Levitt, A simpliﬁed represntation of protein conformations for rapid simulation of protein folding.

J. Mol. Biol. , 1976, , 59-107. A. Kolinski, ed.,

Multiscale approaches to protein modeling: structure prediction, dynamics, thermodynamics and macro-molecular assemblies , Springer, New York, 2010. A. Liwo, ed.,

Computational methods to study the structure and dynamics of biomolecules and biomolecular processes - frombioinformatics to molecular quantum mechanics , Springer, Heidelberg, 2014. V. Tozzini, J. Trylska, C. Chang and J. A. McCammon, Flap Opening Dynamics in HIV-1 Protease Explored with aCoarse-Grained Model.

J. Struct. Biol. , 2007, , 60615. Y. C. Kim and G. Hummer, Coarse-grained models for simulations of multiprotein complexes: application to ubiquitinbinding.

J. Mol. Biol. , 2008, , 1416-1433. Y. Ueda, H. Taketomi and N. Go, Studies on protein folding, unfolding and ﬂuctuations by computer simulations.

Biopoly-mers , 1978, , 1531-1548. I. Shrivastava, S. Vishveshwara, M. Cieplak, A. Maritan and J. R. Banavar, Lattice model for rapidly folding protein-likeheteropolymers.

Proc. Natl. Acad. Sci. , 1995, (USA) , 9206-9209. N. Koga and S. Takada, Roles of native topology and chain-length scaling in protein folding: a simulation study with aGo-like model.

J. Mol. Biol. , 2001, , 171-180. C. Clementi, H. Nymeyer, and J. N. Onuchic, Topological and energetic factors: what determines the structural details ofthe transition state ensemble and ”en-route” intermediates for protein folding? An investigation of small globular proteins.

J. Mol. Biol. , 2000, , 937-953. T. X. Hoang and M. Cieplak, Molecular dynamics of folding of secondary structures in Go-like models of proteins.

J. Chem.Phys. , 2000, , 6851-6862. J. D. Bryngelson and P. G. Wolynes, Spin glasses and the statistical mechanics of protein folding.

Proc. Natl. Acad. Sci. ,1987, USA , 7524-7528. D. Baker, A surprising simplicity to protein folding.

Nature , 2000, , 39-42. P. E. Wright and H. J. Dyson, Intrinsically unstructured proteins: Re-assessing the protein structure-function paradigm.

J.Mol. Biol. , 1999, , 321-331. A. L. Fink, Natively unfolded proteins.

Curr. Opin. Struct. Biol. , 2005, , 35-41. V. N. Uversky and A. K. Dunker, Understanding proteins non-folding.

Biochem. Biophys. Acta , 2010, , 1231-1264. A. C. M. Ferreon, C. R. Moran, Y. Gambin and A. A. Deniz, Single-molecule ﬂuorescence studies of intrinsically disorderedproteins.

Methods Enzymol, , 2010, , 179-204. V. N. Uversky, Unusual biophysics of intrinsically disordered proteins.

Biochim. Biophys. Acta , 2013, , 932-951. A. Sethi, J. Tian, D. M. Vu and S. Gnanakaran, Identiﬁcation of minimally interacting modules in an intrinsically disorderedprotein.

Biophys. J. , 2012, , 748-757. P. Cossio, A. Trovato, F. Pietrucci, F. Seno, A. Maritan and A. Laio, Exploring the universe of protein structures beyondthe Protein Data Bank.

PLoS Comp. Biol. , 2010, , e1000957. A. Vitalis, X. Wang and R. V. Pappu, Quantitative characterization of intrinsic disorder in polygultamine: Insights frmanalysis based on polymer theories.

Biophys. J. , 2007, , 1923-1937. L. Esposito, A. Paladino, C. Pedone and L. Vitagliano, Insights into structure, stability, and toxicity of monomeric andaggregated polyglutamine models from molecular dynamics simulations.

Biophys. J. , 2008, , 4031-40. H. Ogawa, M. Nakano, H. Watanabe, E. B. Starikov, S. M. Rothstein and S. Tanaka, Molecular dynamics simulation studyon the structural stabilities of polyglutamine peptides.

Comp. Biol. Chem. , 2008, , 102-110. ´A. G´omez-Sicilia, M. Sikora, M. Cieplak and M. Carri´on-V´azquez, An exploration of the universe of polyglutamine structures PLoS Comp. Biol. , 2015, , e1004541. S. Rauscher, V. Gapsys, M. J. Gajda, M. Zweckstetter, B. L. de Groot and H. Grubmueller, Structural ensembles ofintrinsically disordered proteins depend strongly on force ﬁeld: A comparison to experiment.

J. Chem. Theor. Comp. , 2015, , 5513-5524. W. Wang, W. Ye, C. Jiang, R. Luo and H.-F. Chen, New force ﬁeld on modeling intrinsically disordered proteins.

Chem.Biol. Drug Des. , 2014, , 253-269. J. Huang, S. Rauscher, G. Nawrocki, T. Ran, M. Feig, B. L. de Groot, H. Grubmueller and A. D. MacKerell Jr,CHARMM36m: and improved force ﬁeld for folded and intrinsically disordered proteins.

Nature Methods , 2017, , 71-73. L. Duan, T. Zhu, C. Ji, Q. Zhanga and J. Z. H. Zhang, Direct folding simulation of helical proteins using an eﬀectivepolarizable bond force ﬁeld.

Phys. Chem. Chem. Phys. , 2017, , 15273-15284. T. Frembgen-Kesner, C. T. Andrews, S. Li, N. A. Ngo, S. A. Shubert, A. Jain, O. J. Olayiwola, M. R. Weishaar and A. H.Elcock, Parametrization of Backbone Flexibility in a Coarse-Grained Force Field for Proteins (COFFDROP) Derived fromAll-Atom Explicit-Solvent Molecular Dynamics Simulations of All Possible Two-Residue Reptides.

J. Chem. Theor. Comp. ,2015, , 234154. M. Cheon, I. Chang and C. K. Hall, Extending the PRIME model for protein aggregation to all 20 amino acids.

Proteins:Str. Func. Bioinf. , 2010, , 29502960. V. A. Wagoner, M. Cheon, I. Chang, C. K. Hall, Computer simulation study of amyloid ﬁbril formation by palindromicsequences in prion peptides.

Proteins: Str. Func. Bioinf. , 2011, , 2132-2145. A. B. Poma, M. Cieplak and P. E. Theodorakis, Combining the MARTINI and Structure-Based Coarse-Grained Approachesfor the Molecular Dynamics Studies of Conformational Transitions in Proteins.

J. Chem. Theory Comput. , 2017, ,1366-1374. D. De Sancho and R. B. Best, Modulation Of An IDP Binding Mechanism And Rates By Helix Propensity And Non-NativeInteractions: Association Of Hif1 α With CBP.

Mol. BioSyst. , 2012, , 256-267. D. Ganguly, W. Zhang and J. Chen, Electrostatically Accelerated Encounter And Folding For Facile Recognition Of Intrin-sically Disordered Proteins.

PLoS Comp. Biol. , 2013, , e1003363. M. Enciso and A. Rey, Improvement of Structure-Based Potentials for Protein Folding by Native and Nonnative HydrogenBonds.

Biophys. J. , 2011, . 147482. N. L. Dawson, T. E. Lewis, S. Das, J. G. Lees, D. Lee, P. Ashford, C. A. Orengo and I. Sillitoe, CATH: an expanded resourceto predict protein function through structure and sequence.

Nucl. Acids Res. , 2016, , D289-D295. T. X. Hoang, A. Trovato, F. Seno, J. R. Banavar and A. Maritan, Geometry and symmetry presculpt the free-energylandscape of proteins.

Proc. Natl. Acad. Sci. , 2004, (USA) , 7960-7964. N.-V. Buchete, J. E. Straub and D. Thirumalai, in

Coarse-Graining of Condensed Phase and Biomolecular System , ed. G.A. Voth, CRC Press, Boca Raton, 2009. ch. 10, 141-156. M. Enciso and A. Rey, A reﬁned hydrogen bond potential for ﬂexible protein models.

J. Chem. Phys. , 2010, , 235102. N. B. Hung, D.-M. Le and T. X. Hoang, Sequence dependent aggregation of peptides and ﬁbril formation.

J. Chem. Phys. ,2017, , 105102. The UniProt Consortium, UniProt: the universal protein knowledgebase.

Nucleic Acids Res. , 2017, , D158-D169. P. R. Shewry, N. G. Halford, P. S. Belton and A. S. Tatham, The structure and properties of gluten: an elastic protein fromwheat grain.

Phil. Trans. Roy. Soc. B , 2002, , 133-142. H. Wieser, Chemistry of gluten proteins.

Food Microbiol. , 2007, , 115-119. E. A. Bayer, J. P. Belaich, Y. Shoham ancd R. Lamed,

Annu. Rev. Microbiol. , 521-554. B. R´o˙zycki, M. Cieplak M. Czjzek, Large conformational ﬂuctuations of the multi-domain Xylanase Z of

Clostridium ther-mocellum . J. Struct. Biol. , 2015, , 68-75. B. R´o˙zycki, P.-A. Cazade, S. O’Mahony, D. Thompson and M. Cieplak, The length but not the sequence of peptide linkermodules exerts the primary inﬂuence on the conformations of protein domains in cellulosome multi-enzyme complexes.

Phys.Chem. Chem. Phys. , 2017, , 21414-21425. M. Sikora, J. I. Su lkowska and M. Cieplak, Mechanical strength of 17 134 model proteins and cysteine spliknots.

PLoSComp. Biol. , 2009, , e1000547. J. I. Su lkowska and M. Cieplak, Mechanical stretching of proteins a theoretical survey of the Protein Data Bank.

J. Phys.:Condens. Matter , 2007, J. I. Su lkowska and M. Cieplak, Selection of optimal variants of Go-like models of proteins through studies of stretching.

Biophys. J. , 2008, , 3174-3191. A. B. Poma, M. Chwastyk, and M. Cieplak, Polysaccharide-protein complexes in a coarse-grained model.

J. Phys.Chem. B. ,2015, , 12028-12041. K. Wo lek, ´A. G´omez-Sicilia, and M. Cieplak, Determination of contact maps in proteins: a combination of structural andchemical approaches.

J. Chem. Phys. , 2015, , 243105. J. Tsai, R. Taylor, C. Chothia and M. Gerstein, The packing density in proteins: Standard radii and volumes.

J. Mol. Biol. ,1999, , 253-266. G. Settanni, T. X. Hoang, C. Micheletti and A. Maritan, Folding pathways of prion and doppel.

Biophys. J. , 2002, ,3533-3541. J. Gu, F. Bai, H. Li and X. Wang, A generic force ﬁeld for protein coarse-grained molecular dynamics simulations.

Int. J.Mol. Sci. , 2012, , 14451-14469. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov and P. E. Bourne. The ProteinData Bank.

Nucl. Acids Res. , 2000, A. Kolinski, Protein modeling and structure prediction with a reduced representation.

Acta Biochim Pol. , 2004, .349-71. D. G. Covell, and R. L. Jernigan, Conformation of folded proteins in restricted spaces.

Biochem. , 1990, , 3287-94. C. Micheletti, F. Seno, J. R. Banavar and A. Maritan, Learning eﬀective amino acid interactions through iterative stochastictechniques.

Proteins Struct. Funct. Genet. , 2001, , 422-31. M. Cieplak, N.S. Holter, A. Maritan and J. R. Banavar, Amino acid classes and the protein folding problem.

J. Chem. Phys. ,2001, , 1420. A. Korkut and W. A. Hendrickson, A Force Field For Virtual Atom Molecular Mechanics Of Proteins.

Proc. Natl. Acad.Sci. (USA), 2009, , 15667-15672. M. Qin, W. Wang and D. Thirumalai. Protein Folding Guides Disulﬁde Bond Formation.

Proc. Natl. Acad. Sci. (USA),2015, , 11241-11246. A. H. Mao, S. L. Crick, A. Vitalis, C. L. Chicoine and R. V. Pappu,

Proc. Natl. Acad. Sci. (USA), 2010, , 8183-8188. P. Debye and E. Hueckel, Zur Theorie der Elektrolyte. I. Gefrierpunktserniedrigung und verwandte Erscheinungen,

Phys.Zeitschrift , 1923, , 185-206. A. Ghavani, E. van der Giessen and P. R. Onck, Coarse-grained potentials for local interactions in unfolded proteins.

J.Chem. Theor. Comp. , 2013, , 432-440. T. Veitshans, D. Klimov and D. Thirumalai, Protein folding kinetics: time scales, pathways and energy landscapes in termsof sequence-dependent properties.

Folding Des. , 1997, , 1-22. M. P. Allen D. J. Tildesley,

Computer simulation of liquids , Oxford University Press, New York, 1987. K. Wo lek and M. Cieplak, Criteria for folding in structure-based models of proteins.

J. Chem. Phys. , 2016, , 185102. B. R´o˙zycki and M. Cieplak, Stiﬀness of the C-terminal disordered linker aﬀects the geometry of the active site in endoglu-canase Cel8A.

Mol. BioSyst. , 2016, , 3589-3599. M. Wojciechowski, ´A. G´omez-Sicilia, M. Carri´on-V´azquez and M. Cieplak, Unfolding knots by proteasome-like systems:simulations of the behavior of folded and neurotoxic proteins.

Mol. BioSyst. , 2016, , 2700-2712. J. Petruska, M. J. Hartenstine and M. F. Goodman, Analysis of strand slippage in DNA polymerase expansions of CAG/CTGtriplet repeats associated with neurodegenerative disease.

J. Biol. Chem. , 1998, , 5204-5210. R. H. Walters and R. M. Murphy, Examining Polyglutamine Peptide Length: A Connection between Collapsed Conforma-tions and Increased Aggregation.

J. Mol. Biol. , 2009, , 978992. M. Enciso, C. Sch¨utte and L. Delle Site. pH-Dependent Coarse-Grained Model of Peptides.

Soft Matter , 2013, ,6118-6127. W. L. Jorgenson, J. Chandrasekhar, J. D. Madura, R. W. Impey and M. L. Klein, Comparison of simple potential functionsfor simulating liquid water.

J. Chem. Phys. , 1983, , 926-935. P. Palencar T. Bleha, Molecular dynamics simulations of the folding of poly(alanine) peptides.

J. Mol. Model. , 2011, ,2367-23745. B. Schuler, E. A. Lipman, P. J. Steinbach, M. Kumke and W. A. Eaton, Polyproline and the ,,spectroscopic ruler” revisitedwith single-molecule ﬂuorescence.

Proc. Natl. Acad. Sci. (USA), 2005, , 27542759. A. M¨oglich, K. Joder and T. Kiefhaber, End-to-end distance distributions and intrachain diﬀusion constants in un-folded polypeptide chains indicate intramolecular hydrogen bond formation.

Proc. Natl. Acad. Sci. (USA), 2006, ,1239412399. S. Sanyal, D. F. Coker and D. MacKernan, How ﬂexible is a protein: simple estimates using FRET microscopy.

Mol. BioSyst. ,2016, , 29882991. J. Henriques, C. Cragnell and M. Skepo, Molecular Dynamics Simulations of Intrinsically Disordered Proteins: Force FieldEvaluation and Comparison with Experiment.

J. Chem. Theor. Comp. , 2015, , 34203431. M. Varadi, S. Kosol, P. Lebrun, E. Valentini, M. Blackledge, A. K. Dunker, ... and P. Tompa, PE-DB: A database ofstructural ensembles of intrinsically disordered and of unfolded proteins.

Nucl. Acids Res. , 2014, , 326335. P. J. Flory, Spatial Conﬁguration Of Macromolecular Chains.

Brit. Polym. J. , 1976, , 1-10. L. X. Peterson, A. Roy, C. Christoﬀer, G. Terashi and D. Kihara, Modeling disordered protein interactions from biophysicalprinciples.

PLoS Comp. Biol. , 2017, , e1005485. M. Chwastyk, A. Poma Bernaola and M. Cieplak, Statistical radii associated with amino acids to determine the contactmap: Fixing the structure of a type I cohesin domain in the

Clostridium thermocellum cellulosome.

Phys. Biol. , 2015, ,046002. R. B. Best, X. Zhu, J. Shim, P. E. M. Lopes, J. Mittal, M. Feig and A. D. MacKerell Jr, Optimization of the additiveCHARMM all-atom protein force ﬁeld targeting improved sampling of the backbone φ , ψ and side-chain χ and χ dihedralangles. J. Chem. Theory Comput. , 2012, , 3257-3273. name Gly Pro Gln Cys Ala Ser Val Thr Ile Leu type - - P P H P H P H H s n H n P n max

23 27 79 11 24 58 9 24 6 13ID

UniProt

P10275 Q9NP73 Q156A1 Q03751 Q8R089 O15417 P32867 Q869S5 Q95US5 Q80YA8 ν Asn Asp Lys Glu Met His Phe Arg Tyr Trp type P P- P+ P- H P H P+ H H s n H n P n max

46 45 11 33 7 23 9 14 6 6ID

UniProt

Q54XG7 Q08438 Q8CI03 Q6PCN3 Q01668 Q6ZQ93 Q3S2U2 P38835 Q9C5D3 P86690 ν s ) of the sidechain contacts thatthe amino acid can make with all other residues. The third and fourth lines lines show the numbers of sidechain contacts thatthe amino acids can make with the hydrophobic ( n H ) and polar ( n P ) residues. The ﬁfth line list the sequential extensionsof the longest homopolymeric tracts made with the residue as found in the reviewed part of UniProt database (SProt), withthe highest level of certainty (PE=1, see). The sixth line gives the corresponding accession code. The seventh line lists theexponent ( ν ) for the sequence-length dependence ( n ) of R g . The lines further down are for the remaining 10 residues. His isconsidered to be uncharged. Gln Cys Ala Ser Val Thr Ile Leu Asn Asp Lys Glu Met His Phe Arg Tyr TrpGln

Cys

Ala

Ser

Val

Thr

Ile

Leu

Asn

Asp

Lys

Glu

Met

His

Phe

Arg

Tyr

Trp FIG. 1: Examples of the distributions of the α -C – α -C distances in contacts. The distributions are for situations in whichnot more than one contact is made between the atoms. The thin lines are for all of these situations whereas the thick linescorrespond to eliminating contacts which do not satisfy the required directional conditions. The vertical dotted lines indicate thevalues adopted in our deﬁnition of the model. They correspond to the average value calculated based on the thick histograms.The top left panel is for the bb contacts regardless the speciﬁcity. Most of them occur in the i, i + 3 contacts so only those weretaken into account. The average distance is about 5.0 ˚A. However, the average distances in other contacts are close in value:4.7 ˚A for the i, i + 4 contacts and 4.8 ˚A for contacts at all larger sequential distances. The middle panel is for the bs contacts,again regardless the speciﬁcity. Almost all of such contacts occur in contacts i, i + k , where k ≥ FIG. 2: Examples of contact formation between Ala (the residue on the left) and Met. The distance between the α -C atoms(shown as spheres in black) are 7.91, 6.03 and 5.4 ˚A top to bottom respectively. The top, middle and bottom panels illustrateformation of an ss, bs and bb contacts respectively. The ﬁrst two panels are obtained in one trajectory of an all-atom simulation(performed using NAMD). The bottom panel shows a fragment of an α -helix in the structure PDB:14GS. The α -C atoms areshown as black spheres. The other atoms are shown using the CPK coloring scheme. The hydrogen atoms are not shown. Thesidechain atoms are represented by semi-transparent spheres. The arrows indicate either vectors h (for the backbone-involvinginteractions) or n (for the sidechain-involving interactions, as indicated. Vector r i,j connecting α -C atoms is drawn as a dottedline. The threshold values of r i,j , at which a given kind of contact is considered to be setting in, are written on the right. v i isdeﬁned as r i − r i − . FIG. 3: The time-dependence of f cc (top) and RMSD (bottom) for Q . It is calculated every 1000 τ with respect to threereference structures obtained at 10 τ (A), 2 · τ (B, dotted line), and 3 · τ (C) in an example of a trajectory. Thereference structures are depicted at the top. The red vertical line indicates a transition from a similarity in f cc between A andC. The black vertical line: between A and B.FIG. 4: The results obtained by our IDP coarse-grained model for Q n (solid blue circles) and V n (solid red squares) as afunction of the number of residues, n . The top panel is for l , the middle for σ , and the bottom one for R g . The open circlescorrespond to the results obtained by all-atom simulations for Q n . The open red squares – only for n =60 – correspond tothe results obtained by all-atom simulations for V .The lines are guides to the eye. The sizes of the symbols are of the orderof the error of the mean. FIG. 5: Similar to Fig. 4 but for other indicated systems. The results obtained by our IDP coarse-grained model are shownusing the full circles and those obtained by all-atom simulations – by the open circles. The black data points are for polyalanine.The blue data points – for selected linkers. The selected linkers are those with n = 10 and n >

15. The red data points arefor the systems denoted as FG and RS. Note that in the case of FG, the all-atom results are available only for R g . The greendata points are for G15 and GS8. The magenta data points are for His5.FIG. 6: Similar to Figs. 4 and 5 but now the comparison is to the results obtained experimentally (the open circles). The bluedata points are for polyproline. The black data points are for ”GS”. The red data points are for the three indicated systemsfrom PEDb. 6AAA and 9AAA disagree with the theoretical ﬁndings more signiﬁcantly and are not shown. The green datapoints are for KQn at pH=7. FIG. 7: Examples of conformations of three homopeptides: A (left), A (middle) and Q . The latter conformation isknotted. The N-terminal end of Q is shown in black. It crosses the loop inwards, forming a shallow knot.FIG. 8: The geometrical parameters R g , l and σ for the homopeptides of length 30. The error of the mean does not exceed1, 3 and 2% for R g , l and σ respectively. The types of the amino acids are indicated in the single letter convention. If weconsidered the histidine residues to be charged, the parameters would be the same as for K,E,R and D. FIG. 9: Similar to Fig. 8 but for the average coordination number and the average sequential distance in the contacts. Contactorder is this distance divided by n . The error of the mean does not exceed 0.5 and 3.3% for < z > and < | j − i | >>