[PDF] Geometric constraints in protein folding

Abstract

The intricate three-dimensional geometries of protein tertiary structures underlie protein function and emerge through a folding process from one-dimensional chains of amino acids. The exact spatial sequence and configuration of amino acids, the biochemical environment and the temporal sequence of distinct interactions yield a complex folding process that cannot yet be easily tracked for all proteins. To gain qualitative insights into the fundamental mechanisms behind the folding dynamics and generic features of the folded structure, we propose a simple model of structure formation that takes into account only fundamental geometric constraints and otherwise assumes randomly paired connections. We find that despite its simplicity, the model results in a network ensemble consistent with key overall features of the ensemble of Protein Residue Networks we obtained from more than 1000 biological protein geometries as available through the Protein Data Base. Specifically, the distribution of the number of interaction neighbors a unit (amino acid) has, the scaling of the structure's spatial extent with chain length, the eigenvalue spectrum and the scaling of the smallest relaxation time with chain length are all consistent between model and real proteins. These results indicate that geometric constraints alone may already account for a number of generic features of protein tertiary structures.

Full PDF

DDecember 27, 2018

Geometric constraints in protein folding

Nora Molkenthin , Steﬀen Mühle , Antonia S. J. S. Mey , Marc Timme Network Dynamics, Max Planck Institute for Dynamics and Self-Organization (MPIDS), 37077 Göttingen, Germany; University of Göttingen, Third Institute of Physics – Biophysics, Friedrich-Hund-Platz 1, 37077 Göttingen, Germany. ; EaStCHEM School of Chemistry, University of Edinburgh, Edinburgh, United Kingdom; Chair for Network Dynamics,Institute for Theoretical Physics and Center for Advancing Electronics Dresden (cfaed), Technical University of Dresden,01069 Dresden; Network Dynamics, Max Planck Institute for Dynamics and Self-Organization (MPIDS), 37077 Göttingen,Germany *For correspondence: [email protected] (MT)

Abstract

The intricate three-dimensional geometries of protein tertiary structures underlie proteinfunction and emerge through a folding process from one-dimensional chains of amino acids. The exactspatial sequence and conﬁguration of amino acids, the biochemical environment and the temporal sequenceof distinct interactions yield a complex folding process that cannot yet be easily tracked for all proteins.To gain qualitative insights into the fundamental mechanisms behind the folding dynamics and genericfeatures of the folded structure, we propose a simple model of structure formation that takes into accountonly fundamental geometric constraints and otherwise assumes randomly paired connections. We ﬁnd thatdespite its simplicity, the model results in a network ensemble consistent with key overall features of theensemble of Protein Residue Networks we obtained from more than 1000 biological protein geometries asavailable through the Protein Data Base. Speciﬁcally, the distribution of the number of interaction neighborsa unit (amino acid) has, the scaling of the structure’s spatial extent with chain length, the eigenvalue spectrumand the scaling of the smallest relaxation time with chain length are all consistent between model and realproteins. These results indicate that geometric constraints alone may already account for a number ofgeneric features of protein tertiary structures.

Author summary

How proteins fold constitutes one of the most persistent, broad, and exciting open research questions at theintersection of biology, chemistry, and physics. Which mechanisms induce a one-dimensional sequence ofamino acids to form into a complex three-dimensional (3D) structure? Proteins in their active 3D structureimpact most of the basic processes inside cells, including gene regulation, cell metabolism, and the creationof protein structures themselves. Yet, a general rule about which conditions lead to which speciﬁc 3D proteinstructures remains unknown to date.Here, we demonstrate how a simple model that takes only fundamental geometric constraints intoaccount and otherwise assumes randomly paired connections, naturally generates an ensemble of foldedstructures that exhibits many of its coarse scale features consistent with those of protein residue networksresulting from tertiary structures of biological proteins. Speciﬁcally, we tested a set of more than 1000biological proteins and model structures and extracted a range of ensemble properties, including the spatialextension with chain size, the distribution of the number of interacting neighbors in the folded structure, thespectrum of Laplacian eigenvalues, and the distribution of the dominant non-trivial eigenvalue. We foundthat all of those properties are consistent between the ensemble of biological protein residue networks andthe networks emerging in a self-organized way from the simple model.These results indicate that coarse ensemble properties of 3D protein structures are already induced bygeometric constraints alone such that only ﬁner scales of the folded structures of individual proteins are a r X i v : . [ q - b i o . B M ] D ec ecember 27, 2018speciﬁcally controlled by the details of their amino acid sequences. Such simple models provide a new angleof analyzing protein structures at the coarse scale of ensembles and may help understand core mechanismsunderlying the complex folding process. Introduction

Proteins consist of sequences of amino acids. The resulting primary structure of a protein, is expected toprovide constraints for the folded three-dimensional (3D) structure of a globular protein, its tertiary structure .The problem of predicting the 3D structure of an amino acid sequence in an aqueous solution is known asthe protein folding problem consisting of three sub-problems: First, to ﬁnd the chemically active folded state;second, to uncover the pathway to get to that state; and third, to develop computational tools capable ofaccurately predicting the folded state [10, 18, 20, 31–33]. Many diﬀerent avenues have been taken to exploresolutions towards this problem, ranging from atomistic models using molecular dynamics approaches [36],to coarse grained models e.g [6], and to machine learning-based and heuristic physical models that disregardthe atomistic details of the amino acid sequence [5, 9]. While much progress has been made improvingmolecular dynamics simulations using atomistic detail, the folding process of long chains is computationallyhighly expensive or even infeasible, and still requires access to purpose build massively parallel computerssuch as Anton [34], or distributed computing projects such as folding@home in order to generate quantitativedata [35]. The other avenue often explored for structure models is tested in community-wide challengessuch as the ’Critical Assessment of Protein Structure Prediction’ (CASP) [24–26]. CASP is run every otheryear to see if a protein’s tertiary structure can be predicted based on its primary sequence of proteinstructures unresolved at the time of the challenge [Ogorzalek et al.]. Predictions have improved drasticallyover previous CASP challenges [10], however, often rely on existing structural information in the proteindata base (PDB) and homology modelling, comparing new proteins based on existing insights from knowntemplate proteins using computational models such as HHPred [4] or I-TASSER [41]. These approachessupport accurate prediction of 3D structures, yet by construction limit insights into fundamental physicalmechanisms and constraints underlying the folding processes and ﬁnal structures observed in the manyand various proteins observed in nature.Here, we propose a complementary approach to further understand geometry and formation processesof 3D tertiary structures from chain-like primary protein structures without comparing to speciﬁcally chosenprotein structures available on the PDB, and without using complex molecular dynamics simulations. First, weanalyze 1122 protein structures from the PDB, consider them as an ensemble of protein tertiary structures,and quantify overall properties of this ensemble. In particular we (i) uncover the scaling of the diameterof proteins with their chain length, (ii) reveal the distribution of the number of other amino acids anygiven amino acid closely interacts with and (iii) ﬁnd the distribution of second largest eigenvalues of theirassociated graph Laplacians, characterizing the most persistent time scales on which proteins are dynamicallyresponding to perturbations. Second, we propose and analyze a simple stochastic process modeling thefolding of chains of units. The minimal model takes into account geometric constraints only and does notconsider any other protein property. The model process keeps connected units connected, forbids geometricoverlap of units (volume exclusion) and connects randomly chosen units if geometrically permitted. Basedonly on such random monomer interactions and geometric constraints, akin to those in Lennard-Jonesclusters and sticky hard spheres [37, 38], the 3D structures self-organizing through the simple model processare consistent with those of real protein ensembles in all of the above-mentioned features simultaneously.These results suggest that beyond the details of pairwise interaction of amino acids, from intermediatescales of a few amino acids to the full spatial extent of proteins, geometric constraints play an important rolein structure formation and strongly impact the ﬁnal protein tertiary structure. Our insights may put intoperspective the inﬂuence of the speciﬁc details of sequences of amino acids relative to simpler geometricconstraints on structure forming processes of proteins. ecember 27, 2018

Results

Ensemble analysis of Protein Residue Networks

With their modular polymer structure and their complex interaction patterns, proteins lend themselvesnaturally to a description as ensembles of complex networks. The mathematical object of a graph, simplytermed network, represents a structure of nodes (units) and links, each describing an interaction betweentwo units [1, 3, 27]. Networks and graphs have been used to describe the structure of a wide variety ofsystems, as diﬀerent as social networks [15, 40] and the global climate system [12, 22]. In this article, weanalyze an ensemble of 1122 protein tertiary structures of chain lengths ranging from 𝑁 = 8 to 𝑁 = 1500 amino acids. Detailed structures have been experimentally determined to great accuracy and stored in theprotein data bank (PDB) [2]. Part of the information stored in the PDB are the coordinates 𝑥 𝑖 ∈ ℝ of theindividual amino acid’s central carbon atoms 𝐶 𝛼 , where 𝑖 indexes the amino acid’s position along the chain.Given such geometric data, the structures resulting from protein folding are commonly expressed asprotein residue networks (PRN’s) [8, 11, 13, 39], in which the central carbon atom of each amino acid is takento be a node and a link represents the interaction of two nodes if their spatial distance is small, i.e. less thana distance 𝑑 𝑐 apart.Here, the distance between the amino acids indexed 𝑖 and 𝑗 is given by the Euclidean distance metric 𝑑 𝑖,𝑗 = ‖ 𝑥 𝑖 − 𝑥 𝑗 ‖ . An adjacency matrix 𝐴 𝑖𝑗 encodes the topology of a network, its entries are if 𝑑 𝑖,𝑗 ≤ 𝑑 𝑐 , i.e. theunits are considered connected, and otherwise. The distance matrix resulting from PDB data thus deﬁnesthe adjacency matrix as 𝐴 PDB 𝑖𝑗 = ⎧⎪⎨⎪⎩ , if 𝑑 𝑖,𝑗 > 𝑑 𝑐 or 𝑖 = 𝑗 , if 𝑑 𝑖,𝑗 ≤ 𝑑 𝑐 . (1)The threshold of the PRN is commonly chosen between 𝑑 𝑐 = 4 Å, the typical length of a covalent bond, and 𝑑 𝑐 = 15 Å, reﬂecting an upper bound for a signiﬁcant interaction to occur between two units. Here, wecreated the PRNs of 1122 proteins selected from the PDB list in [17], covering a range of chain lengths 𝑁 for comparison to simulations. Their geometric structures have been determined previously via NMR andx-ray studies. We choose a threshold value of 𝑑 𝑐 = 6 . Å to calibrate the average degree of nodes in thePRNs to the average degree found in the model simulations in the range of large 𝑁 ∈ [200 , , Fig. 1a.The average degree ⟨ 𝑘 ⟩ grows with 𝑁 and appears to saturate at a value determined by 𝑑 𝑐 . The ratio ofthis cutoﬀ threshold and the unit size in the model, which we take half their mean distance, constitutesthe only free structural parameter we employ in the current study. The degree distribution of the resultingnetwork ensemble, displayed in Fig. 1b, is unimodal and covers eﬀective degrees between 𝑘 = 2 and 𝑘 = 11 .Interestingly, the degree distribution resulting from simulations of the model ensemble we are about tointroduce below is statistically indistinguishable from those of the network of real PRNs (no additional ﬁtparameter), Fig. 1b. Equally, other quantiﬁers obtained from the simple, geometry-only model ensembleagree surprisingly well with those obtained from our data analysis of the experimentally obtained proteinstructures. Simple model focused on geometric constraints

To better understand the impact of geometric constraints on the topology of protein tertiary structures, weintroduce a random network formation model that takes into account geometric constraints and leavesout almost all other properties of real proteins, including heterogeneous sequences of amino acids, theamino acids’ speciﬁcs molecular properties, diﬀerent forms of electrochemical interactions, conformationaldetails of interactions between nearby amino-acids, and the inﬂuence of the ﬂuid environment on proteinfolding. We ﬁnd that the simple, geometry-centered model already reproduces a range of overall topologicalproperties of real protein residue networks well.The model is built on the simple observation that proteins consist of a chain of close-to identical unitsthat interact in complex ways when folding, yet can not intersect, giving rise to geometric constraints. Theindividual units of the chain interact when they come into contact; typically there is an attraction that is the the degree 𝑘 𝑖 of node 𝑖 counts the number of nodes it is connected to ecember 27, 2018 Figure 1. Degree distributions of simple model ensemble and real proteins are statistically indistinguishable. a)The average degree of real protein ensemble (red dots) asymptotically saturates to ⟨ 𝑘 ⟩ ≈ 6 . as the chain length 𝑁 becomeslarge. The average degree of the nodes resulting from 30 model simulations for each chain length 𝑁 , ranging from 𝑁 = 3 to 𝑁 = 398 . b) The degree distribution of the model simulation within the error margin is indistinguishable from that ofreal proteins (error bars indicate standard deviation of the distribution at each 𝑘 ). stronger the closer they are but repelling once they overlap. Depending on the speciﬁc amino acid, size,shape, and electromagnetic properties vary. In our model, however, all amino acids are represented as unitspheres and the interactions between each pair become very simple and identical across all pairs.The model’s initial state consists of a chain of 𝑁 connected spheres, each of diameter and bond lengthof unity (later rescaled to match the mean distance between neighboring amino acids 𝑑 mean ). A foldingproceeds by sequentially picking random pairs of spheres (not connected with each other) and connectingthem if possible, given the geometric constraints of volume exclusion. Here, volume exclusion also applies toco-moving other spheres connected either initially along the chain or through a previous steps (see methodssection for details). The process repeats until all pairs are either connected or geometrically incapable ofconnecting. The adjacency matrix 𝐴 sim of the simulated chain keeps track of which spheres are linked toeach other. Initially, it contains only zeros except for its secondary diagonal elements which equal 1 since ecember 27, 2018neighboring spheres are connected via the backbone chain. The model is motivated by a two-dimensionalmodel of network-based formation of aggregates where link constraints due to geometry in space havebeen approximately mapped to purely graph-theoretic constraints during network formation [23].As described in the method section, the process of moving spheres towards each other is realized in asimple consistent way to satisfy all geometric constraints continuously in time. The forces and potentialsemployed, however, are not intended to reﬂect any physical forces or potentials created by amino acids.They plainly help to realize to attempt the joining of two randomly selected spheres.Snapshots of the folding process are illustrated in Fig. 2, three examples of the ﬁnal aggregates in Fig. 3.The aggregates are highly compact compared to the straight initial conditions. They are also much morecompact than aggregates generated from self-avoiding random walks and close to, yet not quite maximallydensely packed (see below), consistent with previous suggestions based on 2D aggregates [23]. Spatial scaling of protein structures

The ensemble of protein tertiary structures exhibits an algebraic scaling law indicating that their radii ofgyration 𝑅 𝑔 depend on their chain length 𝑁 such that: 𝑅 𝑔 ∼ 𝑁 𝜈 , (2)as expected from a number of previous studies [7, 17, 21, 23]. As the overall geometry of a folded protein isoften characterized by the locations of the central carbon atoms ( 𝐶 𝛼 -atoms, one for each amino acid) of itsbackbone chain, its spatial extension is commonly measured by the radius of gyration 𝑅 𝗀 = ( 𝑁 −1 ∑ 𝑖 ( 𝑥 𝑖 − ̄𝑥 ) ) , (3)quantifying the average distance of units from the center of mass ̄𝑥 , where 𝑥 𝑖 is the location of unit 𝑖 ∈{1 , … , 𝑁 } . Our previous study [23] revealed that the scaling law indeed is algebraic and that the exponent 𝜈 is (slightly) larger than for space ﬁlling aggregates (where 𝜈 SF = = 0 . in 3D) yet (far) smaller thanfor aggregates created through a self-avoiding random walk (where 𝜈 RW = = 0 . in 3D). That study found 𝜈 = 0 . . for 37162 proteins. For our smaller data set of 1122 proteins, we ﬁnd 𝜈 𝑒𝑥𝑝 = 0 .

374 ± 0 . ,see Fig. 4 for illustration.To compare the spatial extent of model aggregates, i.e. graph-theoretically deﬁned networks of spheres,to biological proteins on the same footing, we ﬁrst study how the network diameter 𝐷 compares to theradius of gyration deﬁned through Eq. (3). The graph diameter is deﬁned as the maximum number of linksto be taken on the shortest link sequence (also referred to as shortest simple paths) between any pair ofunits in the PRN. We ﬁnd that 𝐷 is strongly linearly correlated with the spatial extent 𝑅 𝑔 of the PRN, Fig. 4.Both the ensemble of biological proteins and the model ensembles studied exhibit a roughly proportionaldependence of 𝐷 + 1 on 𝑅 𝑔 , with the slope obtained from the model data ( 𝜕𝜕𝑅 𝑔 𝐷 = 0 . Å −1 ) being lower andmore precisely determined than that obtained for the PRNs ( 𝜕𝜕𝑅 𝑔 𝐷 = 0 . Å −1 ). As proportionality factors donot aﬀect the scaling, we thus also ﬁnd ( 𝐷 + 1) ∼ 𝑁 𝜈 , (4) Figure 2. Model folding process at diﬀerent times.

Starting from an initial chain with 𝑁 = 60 , randomly picked unitsconnect if geometrically possible. Shown here are examples after 𝑙 =

0, 2, 7, 14 and 140 successful connection attempts.5 of 13 ecember 27, 2018

Figure 3. Final model aggregates.

The ﬁnal aggregates of the simulation for 𝑁 = {5 , , display the expectedcompactness. The corresponding networks are non-planar. for both the PDB proteins and geometric-constraint model.With the cutoﬀ distance for the creation of networks chosen to be 𝑑 𝑐 = 6 . Å the resulting average linklength in the biological proteins becomes 𝑑 𝑚𝑒𝑎𝑛 ≈ 5 . Å, which in Fig. 4 we substituted for the unit lengthof our model simulations. In the PRNs the network diameters are more dispersed. The lower bound ofthe experimental data ﬁts well with the simulated structures, suggesting geometric constraints as a majordriving mechanism inﬂuencing the spatial density during network formation.Both ensembles show power-law scaling of the diameter. The exponent of 𝜈 𝑠𝑖𝑚 = 0 .

345 ± 0 . of thesimulation is very close to the value of 𝜈 𝑒𝑥𝑝 = 0 .

374 ± 0 . , measured in the PDB data. The plots are shownin Fig. 5. Simulations for heterogeneous systems where the radii of individual units are drawn randomlyfrom the uniform distribution on [1 − 𝑎, 𝑎 ] for 𝑎 ∈ {0 . , . , . , . , . , . increased the variance of themeasurements for the radius of gyration, as expected. We did not observe any signiﬁcant bias in the averagessuch that the scaling relations stay the same also for heterogeneous systems. The simulated results arefound to align very well with the lower bound of folded protein diameters, suggesting that much of thediscrepancy (constant factor shifting the measured results up in Fig. 5) can be explained by the fact that thesimulation only ceases to make new links when this is no longer geometrically possible. In real proteinson the other hand interactions range from Van-der-Waals interactions to hydrogen bonds and individualmonomers vary in size and chemical properties and are subject to thermodynamic ﬂuctuations. All this leadsto larger gaps within the folded molecule and hence larger diameters of the PRN’s. Distribution and scaling of Laplacian eigenvalues

Lastly we explore the scaling of the second largest eigenvalue of the graph Laplacian with 𝑁 in Fig. 6 andﬁnd that it grows with 𝑁 , approaching a saturation point of ≈ 15 for large 𝑁 .As two additional features roughly characterizing the dynamic properties of protein residue networks, weconsider the distribution and scaling of Laplacian eigenvalues. The Laplacian of a network captures both its ecember 27, 2018 R g / Å D + Protein DataSimulationData fitSimulation fit

Figure 4. The network diameter scales linearly with the radius of gyration.

This holds for both biological proteinresidue networks and simulated model networks. Scaling the model link length to the average link length of the PRN (seetext for details), yields a scaling of the graph diameter of model networks within the experimentally observed range. Thebest ﬁtting proportionality constant, however, diﬀers, with 𝜕𝐷𝜕𝑅 𝑔 = 0 . Å −1 for experimental data and 𝜕𝐷𝜕𝑅 𝑔 = 0 . Å −1 forthe model data interaction topology and its relaxation and vibration properties [14, 29]. If the PRN were made only of thecentral 𝐶 𝛼 atoms, the Laplacian would exactly quantify the networks vibrational and relaxational modes. Asreal PRNs are more complex, the Laplacian spectrum can be taken as a proxy for oscillatory and relaxationdynamics.Because the eigenvalue spectra intrinsically scale with graph size (here: chain length), we have evaluatedthe spectra of simulated structures and PRNs of lengths of 𝑁 = 400 ± 30 . Fig. 6a shows the histogram ofeigenvalues for the 18 PRNs (red) in that length range, accumulating all 𝑁 eigenvalues for each of the 18PRNs. For comparison, we computed 28 simulated structures (black), that fall in the same length range.Both eigenvalue spectra exhibit a characteristic unimodal shape. The simulated structures have a moresymmetric, slightly broader spectrum with a peak at 𝜆 ≈ 7 , while the PRN’s have a slightly sharper peak at 𝜆 ≈ 8 and higher probabilities for very small eigenvalues. Similarly, the second largest Laplacian eigenvalueexhibits the same qualitative scaling with chain length 𝑁 for PRNs and geometric-constraint model. Thesecond largest eigenvalue of a network’s Laplacian quantiﬁes the time scale of its slowest relaxing mode;as such, its scaling with chain length 𝑁 indicates how intrinsic relaxation time scales change due to theaggregates becoming larger.The spectra and equally the scaling of the second largest eigenvalues are not indistinguishable betweenmodel and biological protein data yet overall exhibit similar properties. Whether or not spectra of modelensemble and PRN ensemble actually agree or disagree cannot be concluded without doubt from the dataavailable, both because at (exactly) ﬁxed chain length 𝑁 there typically is no, one, or only very few proteinsavailable in the real protein data set and because the model realizations at ﬁxed 𝑁 yield very similar spectra ecember 27, 2018 a) b) Figure 5. Diameter scaling with chain length. (a) The diameter of simulated and measured PRN’s scales according toEq. 4 with the chain length. The model results coincide with the lower bound of measured results, which we attribute tothe fact that we fold maximally. (b) Matching the proportional scaling relation between graph diameter 𝐷 and radius ofgyration (Fig. 4) yields scaling relations between aggregate extent and chain length to be statistically indistinguishablebetween model and real proteins. For both panels, we simulated 30 random dynamic realizations each for aggregatelengths 𝑁 with logarithmically spaced between 𝑁 = 3 to 𝑁 = 398 . The data displayed shows the network diameteraveraged across realizations as a function of chain length. due to chain homogeneity. There is no unbiased way we know of to account for uncertainties in 𝑁 andsimultaneously inhomogeneities in the chain units such that a unambiguous conclusion can be drawn. Discussion

In this article we have proposed a simple model of spatial network formation taking into account geometricconstraints only. Decoupling the constraints, that drive the folding process (geometry, sequence and solution)and focusing on the geometry allows us insights into the folding mechanisms behind the ensemble features.While this approach does not yield direct predictive power to ﬁnd the native state of a speciﬁc sequence itmay narrow down the landscape of possibilities.We ﬁnd that geometrically constrained random linking already leads to strong similarities of the resultingstructures with protein residue networks in biology. Generalizing a 2D model of purely graph-theoreticalnetwork formation presented in [23] to 3D, the model is based upon random link additions with geometricconstraints. As the topological shortcut is no longer possible, the geometric constraints are simulated directly.The simulation results were then compared to protein residue networks (PRN’s), choosing the threshold suchthat the mean degrees of simulation results and PRN’s matched. As a result, the degree distributions arewithin the error margins of each other.The network diameter is linearly related to the radius of gyration in both simulation and data and matcheswhen the simulation results are correctly scaled with the mean connection lengths. The network diameterscales with the chain length as a shifted power law with an exponent of 𝜈 𝑠𝑖𝑚 = 0 .

345 ± 0 . , which is inagreement with value of 𝜈 𝑒𝑥𝑝 = 0 .

374 ± 0 . , measured in PRN’s. As in 2D, this is close to space ﬁlling.Furthermore, we have studied the Laplacian eigenvalue spectrum and the scaling of the second largesteigenvalue with system size, ﬁnding that the two systems are compatible. Using the ﬁndings from [14, 29]we can infer that the structure of vibrational modes and relaxation properties produced by the model aresimilar to those found in biological proteins.These results can be taken as an indication that geometric constraints may be a mechanism behind thescaling behaviour of real protein structures, generating an ensemble also compatible on degree distributionand Laplacian spectrum. Further research, however, is necessary to determine how far the structuralsimilarity reaches. For example by comparing further topological characteristics of PRN’s vs. model simu- ecember 27, 2018 Figure 6. Model eigenvalue spectra are similar to those of the PRN spectra. a) Histograms of eigenvalue spectra ofPRN’s with 𝑁 ≈ 400 and 𝑟 𝑐 = 6 . Å compared to model output at 𝑁 = 400 . b) Second largest eigenvalues grow in similarways for simulation and data. lations. If the analogy persists, the model could be extended to allow simple sequence features, such ashydrophobicity to attempt to get a simpler predictive model. This may give insights into the folding process,that are otherwise lost in simulation complexity.Taken together, the above results indicate that coarse ensemble properties of protein tertiary structuresare already induced by geometric constraints alone such that only ﬁner scales of the folded structures ofindividual proteins may be controlled by the details of their amino acid sequences. Such simple models pro-vide a new angle of analyzing protein structures at the coarse scale of ensembles and may help understandcore mechanisms underlying the complex folding processes. ecember 27, 2018 Methods

Simulation method of the geometric constraint protein model

We have simulated the process modifying the chain geometry in 3D and tested the geometric constraintsaccording to an algorithm consisting of repeated cycles of:1. A pair ( 𝑖 ∗ , 𝑗 ∗ ) of non-adjacent spheres is randomly chosen from the uniform distribution among the setof untried pairs.2. The two spheres are attempted to be connected by switching on a force of unit strength pointingtowards each other (see Fig. 2), under the geometric constraints:(i) the backbone spheres stay together(ii) no spheres overlap(iii) spheres connected previously stay together.3. If the selected spheres touch, a new link between them forms and we update the adjacency matrix bysetting the elements 𝐴 sim 𝑖 ∗ 𝑗 ∗ = 𝐴 sim 𝑗 ∗ 𝑖 ∗ = 1 . Alternatively, if the spheres move less than a velocity threshold Δ 𝑅 ∕Δ 𝑡 , the link is discarded and marked geometrically impossible (see below for details).This process is repeated until no further link remains untried. During each cycle, to emulate the directmotion of spheres towards each other and to continuously match all geometric constraints, we change thespheres’ positions 𝑥 𝑖 , 𝑖 ∈ {1 , … , 𝑁 } , according to simple overdamped dynamicsd 𝑥 𝑖 ∕ d 𝑡 = 𝜁𝐹 𝑖 ( 𝑥 ) , where 𝑥 = ( 𝑥 , … , 𝑥 𝑁 ) T is the collection of all positions and 𝐹 𝑖 ( 𝑥 ) is the sum of all constraint forces acting onsphere 𝑖 and, if 𝑖 ∈ { 𝑖 ∗ , 𝑗 ∗ } , the unit force of magnitude 1. The space and time scale were chosen such thatall quantities are dimensionless, the single-sphere friction coeﬃcient 𝜁 is set to 1 and a distance of 𝑥 = 1 corresponds to a bond length, whose mean for real proteins equals . Å.The constraints are approximated by taking the total force 𝐹 𝑖 ( 𝑥 ) = −∇ 𝑖 𝑉 ( 𝑥 ) + 𝐹 𝖼𝗈𝗇𝗇𝖾𝖼𝗍 𝑖 ( 𝑥 ) , as the sum of the forces inducing the connection attempt as 𝐹 𝖼𝗈𝗇𝗇𝖾𝖼𝗍 𝑖 ( 𝑥 ) = ( 𝛿 𝑖,𝑖 ∗ − 𝛿 𝑖,𝑗 ∗ ) 𝑥 𝑗 ∗ − 𝑥 𝑖 ∗ ‖ 𝑥 𝑗 ∗ − 𝑥 𝑖 ∗ ‖ . (5)and the constraint forces that are gradients of summed potentials 𝑉 ( 𝑥 ) = 𝐾 𝑁 ∑ 𝑛,𝑚 =1

12 ( 𝑑 𝑛,𝑚 − 1) ( 𝐴 sim 𝑛𝑚 + Θ(1 − 𝑑 𝑛,𝑚 ) ) (6)quadratic in the distances 𝑑 𝑛,𝑚 = ‖ 𝑥 𝑛 − 𝑥 𝑚 ‖ . Here the Heaviside step function is deﬁned as Θ( 𝑦 ) = 0 if 𝑦 < and Θ( 𝑦 ) = 1 if 𝑦 ≥ . The ﬁrst term in the ﬁnal parenthesis in (6) ensures keeping neighboring units along thechain in contact as well as all other pairs of spheres linked so far during the process. The second term causesoverlapping spheres to repel each other. 𝐾 is an elastic constant chosen large enough for the constraints tobe virtually fulﬁlled and the ﬁnal chain statistics being invariant of choosing larger values for 𝐾 , but smallenough in order not to limit the allowed numerical time steps unnecessarily. The value 𝐾 = 50 has turnedout to meet these conditions.The initial conﬁguration of the chain was drawn from a Boltzmann distribution with probability 𝑝 = 𝑍 −1 exp(− 𝐸 Bend ∕ 𝑘 𝐵 𝑇 ) with 𝑘 𝐵 𝑇 = 1 and energy 𝐸 Bend = − 𝜅 𝑁 −1 ∑ 𝑛 =2 cos( 𝜃 𝑛 ) , (7)where 𝑍 is a normalization constant and 𝜃 𝑛 is the bending angle at the 𝑛 th unit of the chain, deﬁned as theangle between the adjacent tangential vectors through the scalar (dot) product cos 𝜃 𝑛 = ( 𝑥 𝑛 − 𝑥 𝑛 −1 ) ⋅ ( 𝑥 𝑛 +1 − 𝑥 𝑛 ) ,noting that the sphere diameter equals unity. Initially, generated chains were rejected if any constraint was

10 of 13 ecember 27, 2018violated. The prefactor 𝜅 can be interpreted as bending stiﬀness and determines the persistence length ofthe initial chains. It was set to 𝜅 = 5 such that initial chains are slightly bent (See Fig. 2 for an example).During a cycle started by selecting the spheres 𝑖 ∗ and 𝑗 ∗ to be pulled together, we monitored theirdecreasing distance 𝑑 𝑖 ∗ ,𝑗 ∗ . As soon as 𝑑 𝑖 ∗ ,𝑗 ∗ ≤ , the cycle is considered successful and a new link is formed.We have also periodically checked at intervals Δ 𝑇 whether 𝑑 𝑖 ∗ ,𝑗 ∗ has shrunk by less than a threshold distance Δ 𝑅 = Δ 𝑇 × 𝜒 × 2∕( 𝑁 ∕2) . If this is the case, the cycle is discarded as unsuccessful, because the pair ofunits cannot make contact due to geometric constraints. The conﬁguration at the beginning of that cycleis then restored. The last factor in Δ 𝑅 is the relative velocity of the spheres 𝑖 ∗ and 𝑗 ∗ in case both – inorder to move – have to drag half the other spheres ( 𝑁 ∕2 ) along. This lower velocity threshold was furtherdecreased by introducing the factor 𝜒 = 0 . because the ﬁnal chain statistics weakly varied for larger valuesbut remained the same for smaller values. We have found Δ 𝑇 = 0 . to be small enough in order not towaste computational time on unsuccessful cycles, but large enough to not abort cycles in which 𝑑 𝑖 ∗ ,𝑗 ∗ shrinksslowly only temporarily.The excluded volume forces are nonzero only for pairs of spheres whose distance is less than one. Tospeed up the simulation, they were only evaluated for spheres that are elements of each others neighbor list listing all spheres within a distance 𝜖 . We initially generated these lists, then integrated the maximumvelocity of all spheres over time and updated the neighbor lists whenever the resulting value exceeded 𝜖 .The value 𝜖 = 0 . provided the best speed-up. At each integer multiple of 100 cycles, all untried links to asphere 𝑖 with ∑ 𝑁𝑗 =1 𝐴 sim 𝑖𝑗 =∶ 𝑘 𝑖 ≥ were discarded. This measure was taken to accelerate the simulations asfurther bonding trials including this sphere are geometrically impossible. Protein Database protein structure preparation

From reference [17] we obtained the list of PDB ﬁles used for their analysis. We split the list into NMRstructures and X-ray crystal structures, as the NMR structures would contain multiple protein conﬁgurationsin their PDB entry. Each PDB was then processed with a custom python script that would count the numberof C- 𝛼 atoms found in the structure and order the PDB IDs according to the length of the protein chain.Then from this ordered list every 10th protein was picked, ensuring a good spread of length distributions,a good sample size while also keeping computations easily doable on a workstation. For NMR structuresthe ﬁrst structure in the PDB entry was chosen. C- 𝛼 coordinates were then extracted for each protein usingMDAnalysis [19, 30], from which protein residue contact networks were computed using a cutoﬀ distance of 𝑑 𝑐 = 6 . Å. The adjacency matrix 𝐴 PDB 𝑖𝑗 was populated according to equation 1. This allows the comparison ofthe computationally generated adjacency matrix to the PRN generated one. For the network measures andmanipulations NetworkX [16] was used.All simulation details, including the code for reproducing the geometric constraint simulations, aswell as the preparation and analysis of PDB ﬁles can be found in the following github repository: https://github.com/ppxasjsm/Geometric-constraints-protein-folding Acknowledgments

We acknowledge partial support by the Max Planck Society [NM, MT], the Deutsche Forschungsgemeinschaft(DFG) through SFB 755 (project A05) [SM], the Engineering and Physical Sciences Research Council (EPSRC)UK under grant no. EP/P022138/1 [AM], and the DFG through funding the Clusters of Excellence Center forAdvancing Electronics Dresden (cfaed) and Physics of Life (PoL) [MT].

References [1] Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks.

Reviews of Modern Physics , 74:47.[2] Berman, H., Henrick, K., and Nakamura, H. (2003). Announcing the worldwide protein data bank.

Nature StructuralBiology , 10(12).[3] Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., and Hwang, D.-U. (2006). Complex networks: Structure and dynamics.

Physics Reports , 424(4-5):175–308. 11 of 13 ecember 27, 2018 [4] Bradley, P., Misura, K. M. S., and Baker, D. (2005). Toward high-resolution de novo structure prediction for smallproteins.

Science , 309(5742):1868–1871.[5] Cheng, J. and Baldi, P. (2006). A machine learning information retrieval approach to protein fold recognition.

Bioinfor-matics , 22(12):1456–1463.[6] Clementi, C. (2008). Coarse-grained models of protein folding: toy models or predictive tools?

Current Opinion inStructural Biology , 18(1):10 – 15. Folding and Binding / Protein-nucleic acid interactions.[7] Danielsson, U. H., Lundgren, M., and Niemi, A. J. (2010). Gauge ﬁeld theory of chirally folded homopolymers withapplications to folded proteins.

Physical Review E , 82(2):1–5.[8] Di Paola, L., De Ruvo, M., Paci, P., Santoni, D., and Giuliani, a. (2013). Protein contact networks: An emerging paradigmin chemistry.

Chemical Reviews , 113:1598–1613.[9] Dill, K. A., Bromberg, S., Yue, K., Chan, H. S., Ftebig, K. M., Yee, D. P., and Thomas, P. D. (1995). Principles of proteinfolding — a perspective from simple exact models.

Protein Science , 4(4):561–602.[10] Dill, K. A. and MacCallum, J. L. (2012). The protein-folding problem, 50 years on.

Science , 338(6110):1042–1046.[11] Dokholyan, N. V., Li, L., Ding, F., and Shakhnovich, E. (2002). Topological determinants of protein folding.

Proceedingsof the National Academy of Sciences of the United States of America , 99(13):8637 – 8641.[12] Donges, J. F., Zou, Y., Marwan, N., and Kurths, J. (2009). Complex networks in climate dynamics.

The European PhysicalJournal Special Topics , 174(1):157–179.[13] Estrada, E. (2011).

The structure of complex networks - Theory and Applications . Oxford University Press.[14] Estrada, E. and Hatano, N. (2010). Resistance distance, information centrality, node vulnerability and vibrations incomplex networks. In

Network Science , pages 13–29. Springer.[15] Girvan, M. and Newman, M. E. (2002). Community structure in social and biological networks.

Proceedings of theNational Academy of Sciences , 99(12):7821–7826.[16] Hagberg, A. A., Schult, D. A., and Swart, P. J. (2008). Exploring network structure, dynamics, and function usingnetworkx. In Varoquaux, G., Vaught, T., and Millman, J., editors,

Proceedings of the 7th Python in Science Conference , pages11 – 15, Pasadena, CA USA.[17] Hong, L. and Lei, J. (2009). Scaling law for the radius of gyration of proteins and its dependence on hydrophobicity.

Journal of Polymer Science, Part B: Polymer Physics , 47(2):207–214.[18] Mey, A. S., Geissler, P. L., and Garrahan, J. P. (2014). Rare-event trajectory ensemble analysis reveals metastabledynamical phases in lattice proteins.

Physical Review E , 89(3):032109.[19] Michaud-Agrawal, N., Denning, E. J., Woolf, T. B., and Beckstein, O. (2011). Mdanalysis: A toolkit for the analysis ofmolecular dynamics simulations.

Journal of Computational Chemistry , 32(10):2319–2327.[20] Mirny, L. and Shakhnovich, E. (2001). Protein folding theory: from lattice to all-atom models.

Annual Review ofBiophysics and Biomolecular Structure , 30:361–396.[21] Molkenthin, N., Hu, S., and Niemi, A. (2011). Discrete nonlinear Schrödinger equation and polygonal solitons withapplications to collapsed proteins.

Physical Review Letters , 106(7):078102.[22] Molkenthin, N., Rehfeld, K., Marwan, N., and Kurths, J. (2014). Networks from ﬂows-from dynamics to topology.

Scientiﬁc reports , 4.[23] Molkenthin, N. and Timme, M. (2016). Scaling laws in spatial network formation.

Physical Review Letters ,117(16):168301.[24] Monastyrskyy, B., D’Andrea, D., Fidelis, K., Tramontano, A., and Kryshtafovych, A. (2016). New encouraging devel-opments in contact prediction: Assessment of the CASP 11 results.

Proteins: Structure, Function, and Bioinformatics ,84:131–144.[25] Moult, J., Fidelis, K., Kryshtafovych, A., Rost, B., Hubbard, T., and Tramontano, A. (2007). Critical assessment ofmethods of protein structure prediction — round vii.

Proteins: Structure, Function, and Bioinformatics , 69(S8):3–9.[26] Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T., and Tramontano, A. (2018). Critical assessment of methods ofprotein structure prediction (CASP)—round xii.

Proteins: Structure, Function, and Bioinformatics , 86:7–15. 12 of 13 ecember 27, 2018 [27] Newman, M. E. (2003). The structure and function of complex networks.

SIAM Review , 45(2):167–256.[Ogorzalek et al.] Ogorzalek, T. L., Hura, G. L., Belsom, A., Burnett, K. H., Kryshtafovych, A., Tainer, J. A., Rappsilber, J.,Tsutakawa, S. E., and Fidelis, K. Small angle x-ray scattering and cross-linking for data assisted protein structureprediction in CASP 12 with prospects for improved accuracy.

Proteins: Structure, Function, and Bioinformatics , 86(S1):202–214.[29] Ren, J. and Li, B. (2009). Thermodynamic stability of small-world oscillator networks: A case study of proteins.

PhysicalReview E , 79(5):051922.[30] Richard J. Gowers, Max Linke, Jonathan Barnoud, Tyler J. E. Reddy, Manuel N. Melo, Sean L. Seyler, Jan Domański,David L. Dotson, Sébastien Buchoux, Ian M. Kenney, and Oliver Beckstein (2016). MDAnalysis: A Python Package for theRapid Analysis of Molecular Dynamics Simulations. In Sebastian Benthall and Scott Rostrup, editors,

Proceedings of the15th Python in Science Conference , pages 98 – 105.[31] Saunders, M. G. and Voth, G. A. (2013). Coarse-graining methods for computational biology.

Annual Review ofBiophysics , 42:73–93.[32] Scheraga, H. A., Khalili, M., and Liwo, A. (2007). Protein-folding dynamics: overview of molecular simulation techniques.

Annual Review of Physical Chemistry , 58:57–83.[33] Shakhnovich, E. (2006). Protein Folding Thermodynamics and Dynamics : Where Physics, Chemistry, and BiologyMeet Fundamental Model of Protein Folding.

Chemical Reviews , 106:1559–1588.[34] Shaw, D. E., Deneroﬀ, M. M., Dror, R. O., Kuskin, J. S., Larson, R. H., Salmon, J. K., Young, C., Batson, B., Bowers, K. J.,Chao, J. C., et al. (2008). Anton, a special-purpose machine for molecular dynamics simulation.

Communications of theACM , 51(7):91–97.[35] Shirts, M. and Pande, V. S. (2000). Screen savers of the world unite!

Science , 290(5498):1903–1904.[36] Snow, C. D., Nguyen, H., Pande, V. S., and Gruebele, M. (2002). Absolute comparison of simulated and experimentalprotein-folding dynamics.

Nature , 420(6911):102–106.[37] Trombach, L., Hoy, R. S., Wales, D. J., and Schwerdtfeger, P. (2018). From sticky-hard-sphere to lennard-jones-typeclusters.

Physical Review E , 97(4):043309.[38] Uppenbrink, J. and Wales, D. J. (1991). Packing schemes for lennard-jones clusters of 13 to 150 atoms: minima,transition states and rearrangement mechanisms.

Journal of the Chemical Society, Faraday Transactions , 87(2):215–222.[39] Vendruscolo, M., Dokholyan, N. V., Paci, E., and Karplus, M. (2002). Small-world view of the amino acids that play akey role in protein folding.

Physical Review E , 65:1–4.[40] Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of ‘small-world’networks.

Nature , 393(6684):440–442.[41] Yang, J., Yan, R., Roy, A., Xu, D., Poisson, J., and Zhang, Y. (2015). The I-TASSER suite: protein structure and functionprediction.