Building blocks of protein structures -- Physics meets Biology

Abstract

The native state structures of globular proteins are stable and well-packed indicating that self-interactions are favored over protein-solvent interactions under folding conditions. We use this as a guiding principle to derive the geometry of the building blocks of protein structures, alpha-helices and strands assembled into beta-sheets, with no adjustable parameters, no amino acid sequence information, and no chemistry. There is an almost perfect fit between the dictates of mathematics and physics and the rules of quantum chemistry. Our theory establishes an energy landscape that channels protein evolution by providing sequence-independent platforms for elaborating sequence-dependent functional diversity. Our work highlights the vital role of discreteness in life and has implications for the creation of artificial life and on the nature of life elsewhere in the cosmos.


Tatjana Škrbić, Amos Maritan, Achille Giacometti, George D. Rose, Jayanth R. Banavar

1. Department of Physics and Institute for Fundamental Science, University of Oregon, Eugene, OR 97403, USA

2. Dipartimento di Scienze Molecolari e Nanosistemi, Università Ca’ Foscari Venezia, Campus Scientifico, Edificio Alfa, via Torino 155, 30170 Venezia Mestre, Italy

3. Dipartimento di Fisica e Astronomia, Università di Padova and INFN, via Marzolo 8, 35131 Padova, Italy

4. T. C. Jenkins Department of Biophysics, Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD 21218-2683, USA

Proteins (1-40) [we apologize that we have only included a limited selection of papers], the molecular machines of life, are formidably complex (41). They have myriad degrees of freedom, an astronomical number of possible sequences for even a moderate length chain, and are stabilized by thousands of interactions, both intra-molecular and with solvent. Yet, many proteins adopt their native conformation spontaneously under physiological conditions (5). The native state structures of globular proteins are space-filling and maximize self-interaction (6,7,9). The folded structures (21,26,32,35) are modular and built on scaffolds of α-helices (2) and strands of β-sheet (3), the only two conformers that can be extended indefinitely without steric interference while providing hydrogen-bonding partners for their own backbone polar groups (4,10,28).
Proteins are digital molecules: nature’s exclusion of α-β hybrid segments (27), part α-helix and part β-strand, is built into proteins at the covalent level and restricts the topology of single domain proteins to a few thousand distinct folds at most (8,14,20).

Helices are ubiquitous in biomolecular structures. They are also found in everyday life, e.g., a garden hose (or a flexible tube) is often wound into a helix. Figure 1a is a sketch of a segment of a protein helix shown with a tube envelope. A uniform, flexible, self-avoiding solid tube, whose axis is a line, is a geometrical generalization of a sphere. A sphere is a region carving out space around a point, its center. Analogously, all points within the tube are at a distance from the tube axis smaller than or equal to the tube thickness, which is measured by the tube radius, Δ. A flexible tube is an extended object with uniaxial symmetry and is not plagued by symmetry conflicts, unlike the simple model of a chain of tethered spheres for which the uniaxial symmetry inherent to a chain clashes with the spherical symmetry of the constituent objects. Here we model a protein as a discretized tube with a set of equally spaced points, analogous to the Cα atoms along the protein backbone, defining its axis. The coordinates of these points are described using two angles: θ and µ (see Figure 2).

The simplest repeating geometry of the axis of a tube of radius Δ is a helix of pitch P, wrapped around a straight cylinder of radius R, taken to be the helix radius. The helix is parameterized by a variable t and is defined by

$$\mathbf{r}(t) = \left(R\cos t,\; R\sin t,\; \frac{P\,t}{2\pi}\right). \tag{1}$$

Each time t advances by 2π, the helix repeats along the z-axis, with an increment equal to the pitch. The helical tube geometry is characterized by three dimensionless quantities: Δ/R, η = P/(2πR), and ε, the rotation angle between successive points along the axis.
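To make the (θ, µ) description of the discretized tube axis concrete, here is a minimal Python sketch that grows a chain point by point: given the three previous points, a bond length, and a pair (θ, µ), it places the next point. The function names (`place_next`, `build_chain`), the seed points, and the sign convention chosen for µ are assumptions made for this illustration and are not taken from the paper.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    length = math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)
    return (a[0] / length, a[1] / length, a[2] / length)

def place_next(p_im2, p_im1, p_i, bond, theta_deg, mu_deg):
    """Return point i+1 from points i-2, i-1, i (assumed not collinear).

    theta is the angle at i between points i-1 and i+1; mu is the dihedral
    angle between the planes (i-2, i-1, i) and (i-1, i, i+1).
    """
    theta = math.radians(theta_deg)
    mu = math.radians(mu_deg)
    e1 = _unit(_sub(p_i, p_im1))               # direction of the last bond
    n = _unit(_cross(_sub(p_im1, p_im2), e1))  # normal to plane (i-2, i-1, i)
    m = _cross(n, e1)                          # completes the local frame
    # Components of the new bond in the orthonormal frame (e1, m, n).
    c1 = -bond * math.cos(theta)
    c2 = bond * math.sin(theta) * math.cos(mu)
    c3 = bond * math.sin(theta) * math.sin(mu)
    return tuple(p_i[k] + c1 * e1[k] + c2 * m[k] + c3 * n[k] for k in range(3))

def build_chain(seed, bond, theta_deg, mu_deg, n_points):
    """Extend three seed points to n_points using a repeated (theta, mu)."""
    chain = list(seed)
    while len(chain) < n_points:
        chain.append(place_next(chain[-3], chain[-2], chain[-1],
                                bond, theta_deg, mu_deg))
    return chain

# Hypothetical seed: three points of a zig-zag with bond length sqrt(2).
seed = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
# A repeat of mu = 180 degrees keeps the chain planar (a zig-zag strand).
strand = build_chain(seed, math.sqrt(2.0), 90.0, 180.0, 6)
print(strand)
```

Repeating any other fixed µ with the same θ winds the chain out of the plane into a discrete helix, which is the sense in which a repeat of (θ, µ)-values defines a helix.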
Our initial goal parallels the seminal work of Pauling et al. (2), who sought rotation angles that allowed for the optimal placement of hydrogen bonds in a helix. The crucial difference here is that we do not need to invoke quantum chemistry, covalent bonds, the planarity of peptide bonds or hydrogen bonds. We seek to maximize the self-interaction of a continuum tube (42-47) by winding the tube as tightly as possible, subject to the excluded volume constraint that the tube cannot penetrate itself. We ensure local space-filling of the helix by equating the tube radius to the local radius of curvature (Fig. 1c), which, in turn, is equal to R(1+η²) (46), yielding

$$\Delta = R\,(1+\eta^{2}). \tag{2}$$

The successive turns of a space-filling helix need to be parallel and alongside each other (Figure 1e). The square of the distance between a reference point in the continuum helix (denoted by t = 0) and an arbitrary point t is given by

$$d^{2} = R^{2}\left[2(1-\cos t) + \eta^{2} t^{2}\right]. \tag{3}$$

We determine the parameter value t_min for which d² is a minimum and set this minimum equal to the square of the tube diameter, 4Δ², thereby ensuring non-local space-filling (Figure 1f). The minimization condition is

$$\sin t_{\min} + \eta^{2}\, t_{\min} = 0, \tag{4}$$

and the distance constraint is

$$4\Delta^{2} = R^{2}\left[2(1-\cos t_{\min}) + \eta^{2} t_{\min}^{2}\right]. \tag{5}$$

We solve Equations (2), (4), and (5) simultaneously to obtain the unique geometry of the continuum space-filling helix (Figure 1c,e,f): η ~ 0.4, Δ/R ~ 1.16, and t_min ~ 302°.

The idealized continuum tube does not take into account discreteness, a common ingredient to all matter, which is crucial at small length scales. A unique benefit of discreteness is the emergence of a second building block (besides the space-filling helix): a two-dimensional strand with a zig-zag tube axis (Figure 3a), a rotation angle ε of 180°, and µ = 180°. The existence of two building blocks is required for the rich diversity of topologically distinct folds, necessary for the versatile functioning of the molecular machines.
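The simultaneous solution of Equations (2), (4), and (5) can be sketched numerically. The sketch below assumes that η enters through η², so that R(1+η²) is the local radius of curvature of the helix of Equation (1); it eliminates η² and Δ/R in favor of t_min and finds the remaining root by bisection. The helper names and the bracket [4, 6] rad are assumptions of this sketch, and bisection is used here only for simplicity.

```python
import math

def residual(t):
    """Distance constraint, Eq. (5), with lengths in units of R.

    eta^2 is eliminated with Eq. (4) and Delta/R with Eq. (2), so the
    only unknown left is t = t_min (in radians).
    """
    eta_sq = -math.sin(t) / t            # Eq. (4): sin t + eta^2 t = 0
    delta_over_r = 1.0 + eta_sq          # Eq. (2): Delta/R = 1 + eta^2
    d_sq = 2.0 * (1.0 - math.cos(t)) + eta_sq * t * t   # Eq. (3) at t_min
    return d_sq - 4.0 * delta_over_r ** 2

def bisect(f, lo, hi, tol=1e-12):
    """Plain bisection on a bracket [lo, hi] with a sign change."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo * f_hi > 0.0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Assumed bracket inside (pi, 2*pi), where sin t < 0 and hence eta^2 > 0.
# t = pi with eta = 0 also satisfies the equations but is a degenerate
# zero-pitch solution, so the bracket is chosen to exclude it.
T_MIN = bisect(residual, 4.0, 6.0)
ETA = math.sqrt(-math.sin(T_MIN) / T_MIN)
DELTA_OVER_R = 1.0 + ETA ** 2

print(f"eta = {ETA:.3f}, Delta/R = {DELTA_OVER_R:.3f}, "
      f"t_min = {math.degrees(T_MIN):.1f} deg")
```

Under these assumptions the root lands close to the values quoted in the text (η ~ 0.4, Δ/R ~ 1.16, t_min ~ 302°).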
A helix is defined by a repeat of (θ, µ)-values and a planar strand by a repeat of µ = 180°. For repeat µ-values close to 180°, one obtains a twisted planar strand, a geometrical feature often observed in protein structures.

Figure 1g shows the space-filling discrete helix with η ~ 0.4 and Δ/R ~ 1.16, the geometrical characteristics of the continuum space-filling helix. The discretization requires the specification of the rotation angle ε between successive points that retains the space-filling conditions for the discrete case. This choice of ε is made (in direct analogy with the continuum case) by requiring that the distance between points i (analogous to t = 0) and i+m with integer m (analogous to t_min) is equal to the tube diameter and that the angles (i-1, i, i+m) and (i, i+m, i+m+1) are both equal to 90° (analogous to the minimization condition). The smallest value of m for which these conditions are satisfied is m = 3 and ε ∼ […]° (the ratio of the distance to the tube diameter is found to be 1.00… and both the angles are 90.0…° for this value of ε). Upon defining the length scale to match the mean Cα-Cα distance along the protein backbone of 3.81 Å, the tube radius is found to be Δ ∼ […] α-helix building block of protein structures (see Figures 4-5, Table 1).

A space-filling helix maximizes self-interaction through local interactions, whereas the non-local interactions of strands assembled into sheets lead to space-filling. We build on the insights gained from the helix analysis to make predictions of the geometrical arrangements for strand pairing (Figure 3b-c). First, the strands need to be in phase with each other, mimicking the behavior of adjoining turns in the continuum helix, placed parallel to and alongside each other.
Second, there are two distinct ways (Figures 3b-c) of accomplishing space-filling of assembled strands, corresponding to anti-parallel and parallel β-sheet hydrogen bonding patterns, first predicted by Pauling and Corey (3) based on hydrogen bonding. The space-filling packing requires that the distances (i, j) in Figure 3b (anti-parallel arrangement) and (i, M_j) in Figure 3c (parallel arrangement), which are measures of the closest approach of two parallel tube segments, both be 2Δ ~ 5.26 Å (see Figure 3d-e and Table 1). It is important to note that, for both helices and sheets, the side-chains do not clash sterically, unlike in a well-packed compact arrangement of parallel strands in a hexagonal array.

In addition to helices and strands, chain turns are needed to inter-connect these building blocks. In proteins, the most abundant turns are β-turns: tight, four-residue segments that approximately reverse the overall chain direction (13). β-turns are tightly wound like an α-helix, and therefore are predicted to have θ-angles similar to those in the α-helix (Figure 4).

Figure 4b shows the (θ, µ) coordinates for 4 classes of residues: those that participate in α-helices, parallel β-sheets, anti-parallel β-sheets, and β-turns. The black X marks the coordinates of the predicted space-filling helix. Unsurprisingly, α-helix µ-values, (49.7 ± 3.9)°, are a bit lower than the theoretical prediction of 52.4° because the distance between a hydrogen-bonded donor and acceptor (N-H···O=C) can be less than their summed van der Waals radii. Of course, an ideal tube is unaffected by such chemical particulars. Nevertheless, the predicted µ value for an ideal tube is remarkably close to 50°, the average µ value for Pauling's α-helix (2), with 3.6 residues per turn. As predicted, the tight turns predominantly have a θ value close to that of the α-helix. The β-strands are twisted with a µ angle around 180° and have a spread of θ angles.
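The helix numbers quoted above (the predicted µ of 52.4° and the tube diameter 2Δ ~ 5.26 Å) can be cross-checked with a short numerical sketch. It is one plausible reading of the discretization conditions, not the authors' code: it takes η = 0.4 and Δ/R = 1.16 from the continuum solution, samples points from Equation (1) at a rotation angle ε, solves the 90° condition on the angle (i-1, i, i+3) for ε by bisection, and then evaluates the distance ratio, θ, µ, and the tube diameter for a bond length of 3.81 Å. The bracket for ε, the helper names, and the sign convention of the dihedral are assumptions.

```python
import math

ETA = 0.4             # continuum value quoted in the text
DELTA_OVER_R = 1.16   # continuum value quoted in the text
BOND = 3.81           # mean C-alpha to C-alpha distance, in angstroms
M = 3                 # smallest m satisfying the conditions, per the text

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def point(k, eps):
    """k-th axis point of Eq. (1) sampled at t = k * eps, in units of R."""
    t = k * eps
    return (math.cos(t), math.sin(t), ETA * t)

def angle(a, b, c):
    """Angle at b, in degrees, subtended by points a and c."""
    u, v = sub(a, b), sub(c, b)
    return math.degrees(math.acos(dot(u, v) / (norm(u) * norm(v))))

def dihedral(p0, p1, p2, p3):
    """Dihedral angle, in degrees, between planes (p0,p1,p2) and (p1,p2,p3)."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.degrees(math.atan2(norm(b2) * dot(b1, n2), dot(n1, n2)))

def right_angle_defect(eps):
    """Deviation from 90 degrees of the angle (i-1, i, i+M)."""
    return angle(point(-1, eps), point(0, eps), point(M, eps)) - 90.0

# Assumed bracket for the rotation angle; bisection on the 90-degree condition.
lo, hi = math.radians(95.0), math.radians(105.0)
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if right_angle_defect(lo) * right_angle_defect(mid) <= 0.0:
        hi = mid
    else:
        lo = mid
eps = 0.5 * (lo + hi)

# Distance (i, i+M) relative to the tube diameter, both in units of R.
ratio = norm(sub(point(M, eps), point(0, eps))) / (2.0 * DELTA_OVER_R)
# Fix the length scale with the bond length, then convert Delta to angstroms.
radius = BOND / norm(sub(point(1, eps), point(0, eps)))
two_delta = 2.0 * DELTA_OVER_R * radius
theta = angle(point(-1, eps), point(0, eps), point(1, eps))
mu = dihedral(point(-2, eps), point(-1, eps), point(0, eps), point(1, eps))

print(f"eps = {math.degrees(eps):.2f} deg, distance ratio = {ratio:.4f}")
print(f"theta = {theta:.2f} deg, mu = {abs(mu):.2f} deg, "
      f"2*Delta = {two_delta:.3f} A")
```

Under these assumptions, the distance ratio comes out very close to 1, µ close to the 52.4° prediction, and the tube diameter close to 5.26 Å, which is consistent with the statement that the distance and angle conditions are met together at m = 3.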
The accord between our prediction and structural data from the Protein Data Bank underscores the consilience (48) between mathematics and physics on one hand and quantum chemistry on the other, and shows how self-interaction is maximized through a space-filling arrangement of individual helices and sheets (Figure 6). The large but finite number of protein native state folds (8,14,20) sculpted by geometry and symmetry (24,25) is reminiscent of the restriction of the number of space groups of three-dimensional crystals to exactly 230 due to periodicity and space-filling requirements (49).

Our theory shows convincingly that structure-space and sequence-space of proteins are separable, yielding sequence-independent forms (22) that are Platonic and immutable, and not subject to Darwinian evolution. Sequences can then populate these forms, resulting in the evolution of the functional diversity of life. The evolution (40,50,51) of biological macromolecules can be framed as a random walk in an inordinately vast sequence space, with selection guided by “fitness”. Our formalism imposes an important constraint on protein evolution. A consequence is that the repertoire of possible folds is generated from pre-sculpted α-helices and β-strands, and, of necessity, accessible folds are mix-and-match constructs of these fundamental forms. This diversity of structural scaffolds provides a platform for elaborating functional diversity.

In seminal work, Anfinsen (5) demonstrated that proteins fold rapidly and reproducibly into their native state structures. This naturally led to the textbook wisdom (35) that the amino acid sequence of a protein determines its three-dimensional structure, leading to much effort in finding the energy minimum of a many-body complex system of a protein in its solvent with a huge number of degrees of freedom and with myriad interactions.
Subsequent work by Matthews (16) and others showed that protein structure is nevertheless very tolerant of amino acid replacement. […] a menu of putative native state structures is created without regard to amino acid sequence and chemistry. In the second step, a given protein selects its native state from this menu. Thus the horrendous problem of working out the native state structure of a given protein from knowledge of its sequence by finding, from scratch, the conformation that minimizes the net energy of myriad imperfectly known microscopic interactions is replaced by the much simpler task of finding the best fit of the sequence to one among the library of geometrically sculpted folds determined in a sequence-independent and chemistry-independent manner. This best-fit process, also exploited in the threading algorithm (15), is where the role of the amino acid sequence becomes paramount. Indeed, in an influential series of papers (12,17-19), it has been highlighted that the amino acid side chains must be able to fit into the native state fold with minimal frustration, thereby creating a landscape akin to a folding funnel.

Some 80 years ago, Bernal (1) wrote: “Any effective picture of protein structure must provide at the same time for the common character of all proteins as exemplified by their many chemical and physical similarities, and for the highly specific nature of each protein type. It is reasonable to believe, though impossible to prove, that the first of these depends on some common arrangement of the amino acids.” Indeed, our work here shows that the common character of all proteins originates from an appropriate tube-like geometrical description of just the backbone Cα atoms, which are common to all proteins, and results in the library of native state folds sculpted by geometry and symmetry, without a need for sequence specificity or chemistry.
The highly specific nature of each protein type then arises from its distinctive amino acid side-chains and their fit to one of the folds from the library. For a protein, the folded structure is central to its functionality.

The situation is loosely analogous to a restaurant in which the chef (geometry and symmetry) creates a menu of items (the library of putative native state folds) that customers (protein sequences) can order from (fold into). The chef does not cater to the individual tastes of the customers. Rather, all patrons of the restaurant are satisfied picking an item from the menu. As in proteins, the total number of patrons can vastly exceed the number of menu items. If, in fact, the menu of protein structures itself evolved, then one would be confronted by an almost impossible situation for evolution and natural selection in which a protein and its interacting partners would have to co-evolve their structures synergistically in order to maintain function. This situation is deftly avoided by the geometrically determined native state folds providing a fixed backdrop for evolution to shape protein sequences and functionalities.

Richard Feynman, in a lecture entitled “There’s Plenty of Room at the Bottom: An Invitation to Enter a New Field of Physics” at the annual American Physical Society Meeting at Caltech on December 29, 1959, suggested that tiny, nanoscale machines could be constructed by manipulating individual atoms. Proteins are precisely such machines (21,26,32,35). Indeed, proteins as well as macroscopic machines establish a stable framework that can accommodate moving parts, which perform a function. Proteins are nature’s implementation of the abstract forms presented here, a diversity of stable forms deduced entirely from mathematical considerations. These predictions, independent of any chemistry, have implications for life elsewhere in our cosmos (52), suggesting that there is no absolute need for carbon chemistry for life to exist. We look forward to other implementations in the lab, raising the prospect of powerful interacting machines, potentially leading to artificial life (53).

In summary, underlying life’s evolving complexity (41) is a sequence-independent energy landscape with thousands of stable minima, a landscape formed from nature’s scaffold building blocks: a protein grammar. In both natural and artificial languages, a grammar is a finite set of rules that can generate a large number of syntactically correct sentences or strings. The discretized tube model establishes an immutable grammar of life and “from so simple a beginning, endless” (protein sequences and functionalities) “most beautiful and most wonderful have been, and are being, evolved” (54).

PDB analysis:

We have carried out a quantitative comparison between our predictions and protein structure. To develop a working set for comparison, the Richardsons’ Top 8000 set of high-resolution, quality-filtered protein chains (resolution < 2 Å, 70% PDB homology level) [see the web site: http://kinemage.biochem.duke.edu/databases/top8000.php ] was further filtered to exclude all structures with missing backbone atoms, yielding a working set of 4416 structures (listed in Table 2). The working set was cross-checked against 478 proteins having a more stringent homology cutoff of 20%, taken from the PISCES database (23); 205 entries are common to both sets. Almost all bond lengths (Cα(i)-Cα(i+1) distances) in the working set (~99.7%) are clustered around 3.81 Å, as expected for a trans peptide. Those remaining have shorter bonds, ~2.95 Å, predominantly from cis residues. For purposes of comparison, a fixed bond length of 3.81 Å is used.

Hydrogen bonds were identified using DSSP (11). Hydrogen-bonded conformers extracted from the working set include 3595 helices, 8473 antiparallel pairs, 4639 parallel pairs, and 58,820 turns. Helices were identified as 12-residue segments with intra-helical hydrogen bonds (N(i)-H···O(i-4) and O(i)···H-N(i+4)) at each residue. Antiparallel strand pairs were identified by three inter-pair hydrogen bonds at (i,j), (i+2,j-2), and (i-2,j+2), with i ∈ strand 1 and j ∈ strand 2. To avoid possible end effects, only (i,j) residue pairs were used. Parallel strand pairs were identified by four inter-pair hydrogen bonds between (i,j-1), (i,j+1), (i+2,j+1), and (i-2,j-1), with i ∈ strand 1 and j ∈ strand 2, and again only the i-th residue was retained. Double-counting was assiduously avoided. β-turns were identified by hydrogen bonds between (i,i+3) with no helical residues among the 4. The (θ, µ)-values were then recorded for points i+1 and i+2 in the turns.

References

1. Bernal, J. D. Structure of Proteins. Nature, 663-667 (1939).
2. Pauling, L., Corey, R. B. & Branson, H. R. The structure of proteins: two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. USA, 205-210 (1951).
3. Pauling, L. & Corey, R. B. The pleated sheet, a new layer configuration of polypeptide chains. Proc. Natl. Acad. Sci. USA, 251-256 (1951).
4. Ramachandran, G. N. & Sasisekharan, V. Conformation of polypeptides and proteins. Adv. Prot. Chem., 283-438 (1968).
5. Anfinsen, C. B. Principles that govern the folding of protein chains. Science, 223-230 (1973).
6. Richards, F. M. The Interpretation of Protein Structures: Total Volume, Group Volume Distributions and Packing Density. J. Mol. Biol., 1-14 (1974).
7. Finney, J. L. Volume Occupation, Environment and Accessibility in Proteins. The Problem of the Protein Surface. J. Mol. Biol., 721-732 (1975).
8. Levitt, M. & Chothia, C. Structural patterns in globular proteins. Nature, 552-558 (1976).
9. Richards, F. M. Areas, volumes, packing, and protein structure. Annu. Rev. Biophys. Bioeng., 151-176 (1977).
10. Kim, P. S. & Baldwin, R. L. Specific intermediates in the folding reactions of small proteins and the mechanism of protein folding. Annu. Rev. Biochem., 459-489 (1982).
11. Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 2577-2637 (1983).
12. Gō, N. The consistency principle in protein structure and pathways of folding. Adv. Biophys., 149-164 (1984).
13. Rose, G. D., Gierasch, L. M. & Smith, J. A. Turns in peptides and proteins. Adv. Protein Chem., 1-109 (1985).
14. Chothia, C. One thousand families for the molecular biologist. Nature, 543-544 (1992).
15. Jones, D. T., Taylor, W. R. & Thornton, J. M. A new approach to protein fold recognition. Nature, 86-89 (1992).
16. Matthews, B. W. Structural and genetic analysis of protein stability. Annu. Rev. Biochem., 139-160 (1993).
17. Bryngelson, J. D., Onuchic, J. N., Socci, N. D. & Wolynes, P. G. Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins, 167-195 (1995).
18. Wolynes, P. G., Onuchic, J. N. & Thirumalai, D. Navigating the folding routes. Science, 1619-1620 (1995).
19. Dill, K. A. & Chan, H. S. From Levinthal to pathways to funnels. Nat. Struct. Biol., 10-19 (1997).
20. Przytycka, T., Aurora, R. & Rose, G. D. A protein taxonomy based on secondary structure. Nat. Struct. Biol., 672-682 (1999).
21. Tanford, C. & Reynolds, J. Nature’s Robots: A History of Proteins (Oxford University Press, 2001).
22. Denton, M. & Marshall, C. Laws of form revisited. Nature, 417 (2001).
23. Wang, G. & Dunbrack, R. L., Jr. PISCES: a protein sequence culling server. Bioinformatics, 1589-1591 (2003).
24. Banavar, J. R., Hoang, T. X., Maritan, A., Seno, F. & Trovato, A. Unified perspective on proteins: A physics approach. Phys. Rev. E, 041905 (2004).
25. Hoang, T. X., Trovato, A., Seno, F., Banavar, J. R. & Maritan, A. Geometry and symmetry presculpt the free-energy landscape of proteins. Proc. Natl. Acad. Sci. USA, 7960-7964 (2004).
26. Lesk, A. M. Introduction to Protein Science: Architecture, function and genomics (Oxford University Press, 2004).
27. Fitzkee, N. C. & Rose, G. D. Steric restrictions in protein folding: an alpha-helix cannot be followed by a contiguous beta-strand. Protein Sci., 633-639 (2004).
28. Rose, G. D., Fleming, P. J., Banavar, J. R. & Maritan, A. A backbone-based theory of protein folding. Proc. Natl. Acad. Sci. USA, 16623-16633 (2006).
29. Dill, K. A., Ozkan, S. B., Shell, M. S. & Weikl, T. R. The protein folding problem. Annu. Rev. Biophys., 289-316 (2008).
30. Shaw, D. E., Maragakis, P., Lindorff-Larsen, K., Piana, S., Dror, R. O., Eastwood, M. P., Bank, J. A., Jumper, J. M., Salmon, J. K., Shan, Y. & Wriggers, W. Atomic-Level Characterization of the Structural Dynamics of Proteins. Science, 341-346 (2010).
31. Bitbol, A.-F., Dwyer, R. S., Colwell, L. J. & Wingreen, N. S. Inferring interaction patterns from protein sequences. Proc. Natl. Acad. Sci. USA, 12180-12185 (2016).
32. Bahar, I., Jernigan, R. L. & Dill, K. A. Protein Actions (Garland Science, Taylor & Francis Group, 2017).
33. Rocks, J. W., Pashine, N., Bischofberger, I., Goodrich, C. P., Liu, A. J. & Nagel, S. R. Designing allostery-inspired response in mechanical networks. Proc. Natl. Acad. Sci. USA, 2520-2525 (2017).
34. Runnels, C. M., Lanier, K. A., Williams, J. K., Bowman, J. C., Petrov, A. S., Hud, N. V. & Williams, L. D. Folding, assembly, and persistence: The essential nature and origins of biopolymers. J. Mol. Evol., 598-610 (2018).
35. Berg, J. M., Tymoczko, J. L., Gatto, G. J., Jr. & Stryer, L. Biochemistry, Ninth edition (Macmillan Learning, 2019).
36. Leman, J. K. et al. Macromolecular modeling and design in Rosetta: recent methods and frameworks. Nat. Methods, 665-680 (2020).
37. Dobson, C. M., Knowles, T. P. J. & Vendruscolo, M. The Amyloid Phenomenon and Its Significance in Biology and Medicine. Cold Spring Harb. Perspect. Biol., a033878 (2020).
38. Fantini, M., Lisi, S., De Los Rios, P., Cattaneo, A. & Pastore, A. Protein Structural Information and Evolutionary Landscape by In Vitro Evolution. Mol. Biol. Evol., 1179-1192 (2020).
39. Merritt, H. I., Sawyer, N. & Arora, P. S. Bent into shape: Folded peptides to mimic protein structure and modulate protein function. Peptide Sci., e24145 (2020).
40. Bowman, J. C., Petrov, A. S., Frenkel-Pinter, M., Penev, P. I. & Williams, L. D. Root of the Tree: The Significance, Evolution, and Origins of the Ribosome. Chem. Rev., 4848 − […] Science, 87-89 (1999).
42. Maritan, A., Micheletti, C., Trovato, A. & Banavar, J. R. Optimal shapes of compact strings. Nature, 287-290 (2000).
43. Stasiak, A. & Maddocks, J. H. Best packing in proteins and DNA. Nature, 251-252 (2000).
44. Przybyl, S. & Pieranski, P. Helical close packings of ideal ropes. Eur. Phys. J. E, 445-449 (2001).
45. Snir, Y. & Kamien, R. D. Entropically Driven Helix Formation. Science, 1067 (2005).
46. Snir, Y. & Kamien, R. D. Helical tubes in crowded environments. Phys. Rev. E, 051114 (2007).
47. Olsen, K. & Bohr, J. The generic geometry of helices and their close-packed structures. Theor. Chem. Acc., 207-215 (2010).
48. Wigner, E. P. Unreasonable Effectiveness of Mathematics in the Natural Sciences. Commun. Pure Appl. Math., 1-14 (1960).
49. Chaikin, P. & Lubensky, T. Principles of Condensed Matter Physics (Cambridge University Press, 2000).
50. Dawkins, R. The Blind Watchmaker (W. W. Norton & Company, London, 1986).
51. Goldenfeld, N. & Woese, C. Biology’s next revolution. Nature, 369 (2007).
52. Davies, P. The Eerie Silence: Renewing Our Search for Alien Intelligence (Mariner Books, 2011).
53. Levy, S. Artificial Life: The Quest for a New Creation (Penguin Books, 1993).
54. Darwin, C. On the Origin of Species (John Murray, London, 1859).

Acknowledgements:

We are indebted to Pete von Hippel for his warm hospitality and to him, Jeremy Berg, and Brian Matthews for stimulating comments.

Funding:

This project received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie Grant Agreement No 894784. The contents reflect only the authors’ view and not the views of the European Commission. Support from the University of Oregon (through a Knight Chair to JRB), NSF (GDR), University of Padova through “Excellence Project 2018” of the Cariparo foundation (AM), MIUR PRIN-COFIN2017 Soft Adaptive Networks grant 2017Z55KCW and COST action CA17139 (AG) is gratefully acknowledged. The computer calculations were performed on the Talapas cluster at the University of Oregon.

Author contributions:

JRB, AM and TŠ developed the ideas for the calculations, and all the authors participated in exploring the consequences of the theory. TŠ carried out the calculations under the guidance of JRB. JRB and GDR wrote the paper. All authors reviewed the manuscript.

Figure 1: Optimal geometry of space-filling helix. (a) A segment of ten residues of a helix from phage T4 lysozyme protein 1L56 (residues 61-70). The green ribbon represents the helical trace formed by the Cα atoms, the spheres denote the heavy backbone and side-chain atoms in the helix, and the transparent tube is a guide to the eye. (b-c) show top views of two continuum helices, both with a helix pitch P to helix radius R ratio η = P/(2πR) ~ 0.4 and a local radius of curvature of the helix, R_local = R(1+η²) ~ 1.16R. The tube radii Δ in the two cases are different: Δ/R_local = 1/2 and 1, respectively. (b) When Δ is less than R_local, there is empty space in the interior. When Δ is bigger than R_local, the turn is too tight, leading to a kink, as is sometimes observed in a garden hose (not shown). (c) The sweet spot occurs when Δ = R_local, leading to maximization of the local self-interaction. (d-e) show side views of two helices with η-values of 0.8 and ~0.4, respectively. In both cases, Δ has been chosen to be the local radius of curvature of the latter helix, ~1.16R. (d) When η is larger than ~0.4, there is empty space between successive turns and the non-local self-interaction is not maximized. In the other limit of small η (not shown), successive turns of the tube overlap and this is forbidden sterically. (e) A Goldilocks situation here is when η is tuned just right to ~0.4, yielding Δ/R ~ 1.16 for a continuum space-filling helix maximizing both local and non-local self-interaction. The top and side views of the optimal continuum helix are shown in (c) and (e), respectively. (f) and (g) show how these results can be captured analytically (see text) for a continuum and a discrete tube, respectively.

Figure 2: Coordinate system at discrete location i along tube axis.

The bond length b, assumed here to be a constant, is the distance between successive points. The angle θ i is the angle subtended at i by points (i-1) and (i+1) along the tube axis. µ i is the dihedral angle between the planes π and π formed by [(i-2),(i-1),i] and [(i-1),i,(i+1)] respectively or equivalently the angle between the binormals in a Frenet reference frame at points (i-1) and i. Knowledge of the coordinates of the previous three points (i-2,i-1,i) and the variables ( θ i , µ i ) are sufficient to uniquely specify the coordinates of the point (i+1). Figure 3: Optimal packing of strands. (a) A single two dimensional zig-zag strand (with a rotation angle of 180 ° ) lying in the plane of the paper. This planarity can only occur for a F r e qu e n c y d(i,j) [Å] a) b) ! = 180° c) " " " " " " i j i M j d) e) i+1 i-1 j-1 j+1 i+1 i-1 j+1 j-1 j F r e qu e n c y d(i,M j ) [Å]

20 discrete tube and is forbidden for a tube in the continuum. Alternate points along a strand are colored red and blue. There are two equivalent choices for a straight tube axis, one lying along the line of blue points (the blue axis) or the line of red points (red axis). Two distinct space-filling arrangements for strand packing are shown corresponding to (b) red axis-red axis (or equivalently blue axis-blue-axis – not shown) packing and (c) red axis-blue axis (or equivalently blue axis-red axis – not shown) packing. The two cases correspond to anti-parallel and parallel β -sheets with distinct distance constraints. The yellow point M j lies midway between the blue points j-1 and j+1. The maximization of self-interaction dictates that the distances (i,j) in (b) and (i,M j ) in (c) ought to be 2 Δ ~5.26Å to ensure space filling. (d) and (e) show the histograms of the distances (i,j) and (i,M j ) in the interior of anti-parallel and parallel β -sheets in protein structures. The black vertical lines show the theoretical prediction of 2 Δ ~5.26Å. The mean values of both histograms are the same as the theoretical prediction (see Table 1). Figure 4: Two views of the local structure representation of proteins. a) ( θ , µ ) plot of the PDB data set (see Table 2) comprising 4416 proteins and 972,519 residues. Here, the local conformations of residues are shown in the ( θ , µ ) plane. For strands, a µ -value that deviates from ~180 ° is the signature of a twisted strand, which is still locally planar. The plot shows chiral symmetry breaking, i.e., the points are not symmetrically placed around µ =180 ° . Our simplified analysis does not attempt to account for this. b) ( θ , µ ) coordinates of random samples of 12000 points each from the interior of α -helices (orange); anti-parallel (green) and parallel (red) β -sheets; and β -turns (the two interior sites of (i,i+3) hydrogen-bonded residues with no helical residues) (blue). 
The tight turns have θ-values similar to those of helices. Unlike for helices and turns, the θ-values of strands are not constrained. The black X in both panels shows our prediction of the geometry of space-filling helices.

Figure 5: Distribution of α-helix characteristics. (a) Distribution of the experimentally determined bond lengths (consecutive Cα-Cα distances). The bond length in the theory was chosen to be the mean bond length of 3.81 Å and sets the characteristic length scale. The other panels show the distributions of (b) the rotation angle, (c) the rise per residue, (d) the helix radius, (e) θ, (f) µ, (g) the local radius of curvature, and (h) the dihedral angle between the planes defined by the points (i-1,i,i+3) and (i,i+3,i+4) in Figure 1g. The triangles formed by the two triplets ought to be congruent but they are not co-planar. The black line in each of the panels (except the first) shows the zero parameter theoretical prediction. Overall, there is excellent accord between theory and observations from protein structures.
[Figure 5 panels a) to h): Frequency vs. bond length [Å], rotation angle ε [°], rise per residue p [Å], helix radius R [Å], θ [°], µ [°], local radius of curvature [Å], and the dihedral angle of panel (h) [°].]

Figure 6: Consilience between mathematics and biochemistry.

The figure shows three views each of two short proteins. (a-c) is the 56-residue long protein 3GB1 comprising 4 strands assembled into sheets along with a single helix. (d-f) is a protein of the same length, 2KDL, comprised of a three-helix bundle. Each panel shows a uniform tube, with the theoretically predicted radius of 2.6 Å, whose axis passes through the Cα atoms. The sole exception is the β-sheet (for which hydrogen bonding was identified using DSSP (11)), where every other Cα atom is considered (as explained in Figures 3b and c). The tube color varies continuously from red to blue (via grey) as its axis moves from the N-terminal to the C-terminal. The heavy atoms of the side chains sticking outside the tube are shown. The maximization of the self-interaction through space-filling is evident.
[Figure 6 panels: a) to c) 3GB1; d) to f) 2KDL.]

Continuum tube diameter from theory: 2Δ = 5.26… Å

Quantity                                   Theory      PDB data
HELIX
Rotation angle ε [°]                       99.8        99.1 ± 3.4
Number of residues per turn                3.61        3.63 ± 0.13
Helix radius R [Å]                         2.27        2.30 ± 0.07
Rise per residue p [Å]                     1.58        1.51 ± 0.08
Helix pitch P [Å]                          5.69        5.47 ± 0.49
Pitch to radius ratio η = P/(2πR)          0.400       0.377 ± 0.046
∠(π(i-1,i,i+3), π(i,i+3,i+4)) [°]          69.1        70.0 ± 4.4
Local radius of curvature [Å]              2.74        2.73 ± 0.05
θ [°]                                      91.8        91.3 ± 2.2
µ [°]                                      52.4        49.7 ± 3.9
SHEET
Type I β-sheet: parallel
θ [°]                                      flexible    121 ± 10
µ [°]                                      ~180        191 ± 17
d(i,M_j) [Å]                               2Δ = 5.26   5.26 ± 0.16
Type II β-sheet: antiparallel
θ [°]                                      flexible    127 ± 10
µ [°]                                      ~180        186 ± 20
d(i,j) [Å]                                 2Δ = 5.26   5.26 ± 0.20

Table 1: Quantitative comparison between theory and data from the Protein Data Bank (PDB).

We choose the bond length to match the experimentally determined mean distance between successive Cα atoms of 3.81 ± 0.02 Å. The chain is defined by discrete points denoted by 1,2,3,…,i,…; d(i,j) is the distance between the points i and j. The angle ∠(π(i,j,k), π(l,m,n)) is the dihedral angle between the two planes formed by the sites (i,j,k) and (l,m,n). M_j is defined to be the geometrical center of the points j-1 and j+1. The agreement between theory and data is striking considering that the theory is parameter-free.

Table 2: PDB codes of the 4416 proteins used in our analysis.
1ihj_B 1iib_B 1ijb_A 1ijt_A 1ijx_C 1ijy_B 1ikt_A 1io0_A 1iom_A 1ioo_B 1iq6_B 1pam_B 1pcf_C 1pdo_A 1pe9_B 1pfb_A 1pgv_A 1pj5_A 1pk3_B 1pkh_A 1pl3_A 1pl8_D 1vph_E 1vps_A 1vq3_B 1vqe_A 1vsr_A 1vyf_A 1vyo_A 1vzi_B 1vzy_B 1w0d_A 1w0n_A 2bo9_C 2bo9_D 2boo_A 2bpd_B 2bpq_A 2bqx_A 2br9_A 2bsj_A 2bt6_A 2bt9_A 2buu_A 2hra_A 2hrv_B 2hsa_A 2ht9_B 2hta_A 2hu9_A 2hur_B 2hv8_A 2hv8_E 2hvm_A 2hvw_C 2r6j_B 2r75_1 2r8e_E 2r8o_A 2r8q_A 2r99_A 2r9f_A 2ra3_B 2ra4_A 2ra6_B 2rbk_A 2zex_A 2zez_B 2zfc_B 2zfd_A 2zfz_D 2zgq_A 2zhj_A 2zhn_A 2zhz_C 2zib_A 2zjd_C 3enu_A 3eoi_A 3epr_A 3eqn_B 3er6_A 3era_B 3erj_A 3erx_B 3esg_B 3esl_B 3eu9_C 3kef_B 3keo_B 3kfa_A 3kff_A 3kg0_C 3kgr_A 3kgz_B 3kh7_A 3kij_C 3kki_A 3kkq_A 4vub_A 5pal_A 6cel_A 6rxn_A 7fd1_A 7rsa_A 8abp_A
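
The helix entries in the Theory column of Table 1 are not independent: for points placed on an ideal helix, the rotation angle ε, radius R and rise per residue p fix the bond length, θ, µ, the pitch and the local radius of curvature. The following Python sketch is a consistency check on the tabulated numbers, not the derivation used in the theory. It assumes a right-handed helix along the z axis, takes ε, R and p from Table 1 as inputs, and uses the standard atan2 form of the dihedral angle; the function names are illustrative.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def helix_point(i, eps_deg, radius, rise):
    """Point i on an ideal right-handed helix whose axis is the z axis."""
    phi = math.radians(eps_deg) * i
    return (radius * math.cos(phi), radius * math.sin(phi), rise * i)

def bond_angle(p_prev, p_mid, p_next):
    """Angle theta (degrees) subtended at p_mid by its two neighbours."""
    u, v = sub(p_prev, p_mid), sub(p_next, p_mid)
    return math.degrees(math.acos(dot(u, v) / (norm(u) * norm(v))))

def dihedral(p0, p1, p2, p3):
    """Signed dihedral mu (degrees) between planes (p0,p1,p2) and (p1,p2,p3)."""
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.degrees(math.atan2(norm(b2) * dot(b1, n2), dot(n1, n2)))

def helix_summary(eps_deg=99.8, radius=2.27, rise=1.58):
    """Quantities implied by (eps, R, p); defaults are the Table 1 theory values."""
    pts = [helix_point(i, eps_deg, radius, rise) for i in range(4)]
    b = norm(sub(pts[1], pts[0]))
    theta = bond_angle(pts[0], pts[1], pts[2])
    per_turn = 360.0 / eps_deg
    pitch = per_turn * rise
    return {
        "bond_length": b,
        "theta": theta,
        "mu": dihedral(*pts),
        "residues_per_turn": per_turn,
        "pitch": pitch,
        "eta": pitch / (2.0 * math.pi * radius),
        # Circumradius of the isosceles triangle formed by three successive points.
        "curvature_radius": b / (2.0 * math.cos(math.radians(theta) / 2.0)),
    }

summary = helix_summary()
for name, value in summary.items():
    print(f"{name:>18s}: {value:.3f}")
```

Because the inputs are rounded to three significant figures, the derived values should be expected to match the Theory column only to within that rounding.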