[PDF] Local sequence-structure relationships in proteins

Abstract

We seek to understand the interplay between amino acid sequence and local structure in proteins. Are some amino acids unique in their ability to fit harmoniously into certain local structures? What is the role of sequence in sculpting the putative native state folds from myriad possible conformations? In order to address these questions, we represent the local structure of each C-alpha atom of a protein by just two angles, theta and mu, and we analyze a set of more than 4000 protein structures from the PDB. We use a hierarchical clustering scheme to divide the 20 amino acids into six distinct groups based on their similarity to each other in fitting local structural space. We present the results of a detailed analysis of patterns of amino acid specificity in adopting local structural conformations and show that the sequence-structure correlation is not very strong compared to a random assignment of sequence to structure. Yet, our analysis may be useful to determine an effective scoring rubric for quantifying the match of an amino acid to its putative local structure.

Full PDF

1 Local sequence-‐structure relationships in proteins

Tatjana Škrbić , Amos Maritan , Achille Giacometti and Jayanth R. Banavar Department of Physics and Institute for Fundamental Science, University of Oregon, Eugene, OR 97403, USA 2.

Dipartimento di Scienze Molecolari e Nanosistemi, Università Ca’ Foscari Venezia, Campus Scientifico, Edificio Alfa, via Torino 155, 30170 Venezia Mestre, Italy 3.

Dipartimento di Fisica e Astronomia, Università di Padova and INFN, via Marzolo 8, 35131 Padova, Italy

Abstract We seek to understand the interplay between amino acid sequence and local structure in proteins. Are some amino acids unique in their ability to fit harmoniously into certain local structures? What is the role of sequence in sculpting the putative native state folds from myriad possible conformations? In order to address these questions, we represent the local structure of each C α atom of a protein by just two angles, θ and µ , and we analyze a set of more than 4000 protein structures from the PDB. We use a hierarchical clustering scheme to divide the 20 amino acids into six distinct groups based on their similarity to each other in fitting local structural space. We present the results of a detailed analysis of patterns of amino acid specificity in adopting local structural conformations and show that the sequence-‐structure correlation is not very strong compared to a random assignment of sequence to structure. Yet, our analysis may be useful to determine an effective scoring rubric for quantifying the match of an amino acid to its putative local structure. Keywords: sequence-‐structure relationship, local structure, amino acid groupings, amino acid propensity Significance statement:

We present a quantitative study of the emergent constraints of sterics, the chain topology, and the quantum chemistry on local protein native state structures measured in a simple representation. We present two main classes of results: the propensity of amino acids to occupy certain local structures and a grouping of amino acids based on their similarity in hosting local structures.

It is known that there are just a few important principles (1-‐6) that drive the folding process of a protein: the requirement of avoiding steric overlaps in both the folded and unfolded states, the lower conformational entropy in the folded state than in the unfolded state, the hydrophobic effect favoring a compact conformation that is able to expel water from the core of the folded state and the delicate balance of hydrogen bonds with the solvent and within the protein backbone that can tip the energetic balance between the unfolded and folded state. The fundamental issue is how nature has effectively explored the astronomically large sequence space through evolution to make proteins the molecular target of natural selection. 3 Here we characterize the native state folds within a simple coarse-‐grained representation and elucidate the role, if any, played by the repertoire of amino acids in fitting into one of these local geometries. We model a chain by just its C α atoms and follow the coordinate representation shown in Figure 1. With the knowledge of the preceding C α locations, we specify the position of a given C α atom by three coordinates (7), the bond length, b, and two angles, θ and µ . θ is the bending angle at the given C α location, whereas µ is the angle between successive binormals (Figure 1). The binormal associated with a specific consecutive triplet of C α atoms is the unit vector perpendicular to the plane of the triplets. The tangent, the normal, and the binormal, all at the middle C α atom, form a right-‐handed Cartesian coordinate system. This coordinate system was introduced by Rubin and Richardson in a paper describing the Byron bender that allowed for a simple construction of protein C α models (8,9). Our analysis is carried out with a set of more than 4000 experimentally determined protein native state structures. Starting from the Top 8000 set proteins of the Richardson laboratory (10,11) with 70 % homology level, we excluded all structures with missing atoms in the protein backbone, yielding a set of 4416 protein native state structures that we used for our analysis (the same set was employed in Ref. 7) (see Table S1 in Supplementary Information). We successfully validated our analysis using 478 proteins from the Dunbrack data set (12), this time with a maximum sequence homology level of 20 % . There were 205 proteins in common between the Richardson and Dunbrack sets that we employed. We carried out the ( θ , µ ) analysis for both the Richardson and Dunbrack data sets and obtained virtually identical results with the Dunbrack data being understandably more sparse. We present here the detailed analysis for just the much larger Richardson data set. 4 Figure 1: Definition of coordinate system.

The bond length b at location i, b i, is the distance between the points i and (i+1). The angle θ i is the angle subtended at i by points (i-‐1) and (i+1) along the chain. The third coordinate µ i is the dihedral angle between the planes π and π formed by [(i-‐2), (i-‐1), i] and [(i-‐1),i,(i+1)] respectively and is the angle between the binormals at (i-‐1) and i. Knowledge of the coordinates of the previous three points (i-‐2,i-‐1,i) and the three variables (b i , θ i , µ i ) are sufficient to uniquely specify the coordinates of the point (i+1). A simplification arises because the vast majority of bond lengths is nearly constant (Figure 2). Figures 2a and a blown up version, Figure 2b, depict histograms of bond lengths with two peaks centered around 3.81Å and 2.95Å. The shorter bonds are associated with a Ramachandran 5 angle ω (1) around 0 degrees (13) (Figure 2c). Because the fraction of short bonds is relatively small (0.3 % ), our analysis here is carried out with all C α positions, each characterized by a bond length, the θ and µ angles and the amino acid identity. An analysis of the amino acids associated with just the short bonds shows the preponderance of glycine in the first position and proline in the second position (because of the low barrier for transitioning between its cis and trans conformations). Figure 2: Distribution of bond lengths.

Figure 2a shows a histogram of bond lengths in our data set. A blown up version in Fig 2b shows that the distribution is bimodal with short bonds (centered around 2.95Å) and long bonds (centered around 3.81Å). The red arrow is the length 6 we use for partitioning the bonds into the short and long categories. Figure 2c shows the link between the Ramachandran ω angle (1,13) and the bond length. For a non-‐interacting phantom chain, one obtains a uniform distribution of points in the ( θ , µ ) plane (not shown as a figure). As a benchmark, we studied, using Wang-‐Landau Monte Carlo simulations (14), a simple self-‐avoiding polymer chain model comprised of 40 unit diameter tangent spheres (tethered hard spheres) subject to a self-‐attraction between sphere centers located within a distance of 2 units of each other. Figure 3a and 3b show a cross plot in the ( θ , µ ) plane of 17 conformations in the coil phase adopted by the chain at high temperatures and for 17 low energy conformations, respectively. The situation is dramatically different for proteins compared to a standard self-‐avoiding polymer model. Figure 3c is the ( θ , µ ) cross plot for the protein data set with a highly selective occupancy of ( θ , µ ) space (a version of this graph was presented earlier in Ref. 7). We binned the data in Figure 3c into squares of width 5 ° along θ (24 bins in the range 60 ° -‐180 ° ) and 5 ° along µ (72 bins spanning the range from 0 ° to 360 ° ) to determine the three highest density regions. These density peaks are shown in the figure as black X’s along with three larger squares of size 10 ° ×10 ° around them. They are identified as helices (blue region with black X at θ = 92.5 ° and µ = 47.5 ° ), β-‐strands (red region with black X at θ = 122.5 ° and µ = 192.5 ° ), and loops (green region with black X at θ = 92.5 ° and µ = 242.5 ° ) with 184382, 16372, and 10974 points respectively. The density of points in the α -‐helix peak is approximately 20 times that of loops and β-‐strands but the loop and β-‐strand regions are more spread out than the helical 7 region. The other populated regions in the ( θ , µ ) plane correspond to variants of helices and β-‐strands and the loops that link them together in the native state structure. Figure 3: Local structure representation. a) ( θ , µ ) cross plot for the high temperature coil phase of tethered hard spheres. The only constraint here is the requirement of self-‐avoidance of the spheres. The points are scattered across the plane with no θ angle less than 60 ° (a steric constraint) and few almost straight line triplets with a θ near 180 ° . b) ( θ , µ ) cross plot for low energy states of tethered hard spheres. Here again one observes no θ angles below 60 ° and favored θ angles of 60 ° , 90 ° , 120 ° , and 150 ° showing that the order favors a face-‐centered-‐cubic packing locally, which would be appropriate for the close packing of untethered spheres. c) ( θ , µ ) 8 plot for the Richardson data set comprising 4416 proteins and 972519 residues (purple points). The three highlighted regions correspond to density peaks related to α -‐helices (blue region θ = 92.5 ° and µ = 47.5 ° ), β-‐strands (red region θ = 122.5 ° and µ = 192.5 ° ), and loops (green region θ = 92.5 ° and µ = 242.5 ° ) (d) Plot of the Ramachandran ( ϕ , ψ ) angles for the highlighted regions in Figure (c) . It is important to note that the angles θ and µ are distinct from the Ramachandran (1) angles, which require the knowledge of the locations of backbone atoms besides those of the C α atoms. The ( θ , µ ) pair is a coarse grained representation of the Ramachandran angles and can be useful to describe a generic chain conformation and employed in models of statistical mechanics (15). In fact, knowing a sequence of Ramachandran angles, one can derive the values of θ and µ . The inverse process of determining the Ramachandran angles from the ( θ , µ ) values does not have a unique solution. For the C α atoms in the interior of all 4416 proteins, we measured the ( θ , µ ) as well as the Ramachandran ( ϕ , ψ ) angles. We illustrate the relationship between the two coordinate systems in Figure 3d. We plot the three colored regions (blue, red and green) of dense points in Figure 3c, but this time expressed as the ( ϕ , ψ ) Ramachandran angles color coded in the same manner as in the ( θ , µ ) plot. Note that the closely packed points in the ( θ , µ ) plot are more dispersed in the Ramachandran plot sometimes occupying non-‐contiguous regions. This is because θ and µ depend on more than one set of Ramachandran angles and the relationship is complicated and non-‐linear. θ , µ ) plot similar to ours except that the definition of mu was shifted by one C α position in the backward direction compared to our definition. Our own definition was motivated by defining θ and µ at a given site i that would determine the coordinates of the (i+1)-‐th C α coordinate. Importantly, Levitt determined an approximate empirical relationship between his θ and µ to elucidate approximate potentials for folding simulations. Oldfield and Hubbard (18) considered two successive θ angles and one µ angle (defined for a bond joining the two C α atoms) for a set of 83 protein structures and carried out a comprehensive study of local conformational space (but not amino acid preferences) recognizing that the two major building blocks of protein native state structures, helices and strands, are repetitive conformations. DeWitte and Shakhnovich (19) considered 87 protein structures with a goal of deducing the pairwise potentials, in the spirit of Miyazawa and Jernigan, for the formation of secondary structures in protein simulations based on a cross-‐plot of two successive µ angles (this time again defined as bond variables rather than at a site) and employing Levitt’s empirical relationship. Finally, the approach of Bahar, Kaplan, and Jernigan (20) is most similar to ours. They do have a ( θ , µ ) plot just like ours except that their µ definition is shifted by one position compared to ours. They employed 302 protein structures for their 10 analysis, they carried out an amino acid propensity estimate like we do, and they successfully developed short-‐ranged (along the sequence) rotational potentials for single amino acids. In essence, our work here builds on these earlier advances. The principal distinctions are the definition of µ -‐ our µ is defined at a site not at a bond, it is shifted with respect to other definitions, and the number of protein structures we employ, many decades after the earliest work, is understandably larger and comprises over 4000 experimentally determined and curated protein structures. Our goal in this paper is not to extract effective potentials but rather analyze, more generally, sequence-‐local structure relationships. Furthermore, we seek to group the 20 amino acids into distinct groups in terms of their similarity to substitute for each other in local conformational space. Figure 4 shows histograms of θ and µ values and evidence for a clear correlation between the average values of θ and the average value of µ among all proteins. Table 1 presents data on the amino acid occurrence probability and the degree of localization in ( θ , µ ) space. For each amino acid, we measured the inverse participation ratio (IPR) defined as !" = ( ! ! ! ! ! ! ! ) ! ! ! ! ! ! ! ! (1) where x i denotes the normalized density of occupancy of the i-‐th bin in ( θ , µ ) space and the total number of bins N=1728. An IPR value of 1 indicates perfect localization in just one bin whereas the largest possible value of the IPR is N=1728 for a uniform occupancy of all 1728 bins. 11 A perfect localization (IPR=1) is indicative of an amino acid that is always associated with the same local structure leading to a perfect sequence-‐structure relationship. The most localized amino acid is LEU (IPR=2.70) while the least localized is PRO (IPR=83.28). Figure 5 shows the occupancies of the ( θ , µ ) space of amino acids LEU and PRO. Interestingly, even the most localized amino acid, while being largely concentrated in just a few squares, is yet spread out over many squares indicating that there is no strong selection of local structure by amino acid identity. We carried out an analysis of triplet amino acids identities of all the 324 tight bends with θ angles less than 80 ° . The smallest θ angle in the data set has a value of 59.98 ° and the corresponding amino acid triplet is GLY-‐GLN-‐ASP. These tight turns (i-‐1,i,i+1) have no selectivity in µ angles. However, there is indeed a sequence-‐structure relationship with (GLY or SER) accounting for a total of 34 % occupancy in the i-‐1 position, (PRO or SER) having 31 % residency in site i, and (ALA or SER) accounting for 21 % in site (i+1). 12 Figure 4: a) and b). Histograms of θ and µ values showing a multi-‐peaked structure. c). A plot of the average value of θ versus the average value of µ for all 4416 proteins showing a tight correlation with a Pearson correlation coefficient of 0.97. This may be readily understood by noting that a protein structure is primarily composed of helices and sheets with varying fractions depending on the protein being considered. The θ -‐ µ values for an α -‐helix are both smaller than those of a β-‐strand leading to the correlation. Note that the standard deviations (not shown) are large because of the relatively large width in angle space of the regions. 13

Amino acid type Fraction [%] Inverse Participation Ratio (IPR)

ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL 8.53 4.84 4.42 5.96 1.36 6.48 3.61 7.90 2.32 5.62 8.79 5.70 2.02 4.04 4.59 5.88 5.58 1.52 3.61 7.23 3.28 3.24 4.53 4.60 3.69 3.25 3.30 11.61 4.31 2.93 2.70 3.43 2.95 4.06 83.28 5.14 4.75 3.99 4.25 3.77

Table 1: Frequency of 20 amino acids in the set of 4416 proteins (second column) and a measure of the localization of each amino acid in ( θ , µ ) space (third column). Figure 5: Occupancy pattern of amino acids LEU and PRO in ( θ , µ ) space. a) and b) depict the locations of the two amino acids. LEU is the most localized amino acid (IPR=2.70) whereas PRO has the largest IPR=83.28 value among the amino acids and is spread out the most. A rank ordered normalized occupancy fraction of the two amino acids is shown in c), and d). The number of bins needed to account for 50% and 90% occupancy for the two amino acids are LEU – 33 and 356, and PRO – 66 and 248, respectively. We studied histograms of the θ and µ values associated with each of the twenty amino acids. The distributions are roughly equally wide and substantially independent of amino acid identity (see Figures 6 and 7). 15 Figure 6: Histograms of the θ values for each of the twenty amino acids. While the shapes of the histograms vary from amino acid to amino acid, the ranges are mostly independent of amino acid identity. PRO is a bit of an outlier with a somewhat lower upper cut-‐off value of θ . 16 Figure 7: Histograms of the µ values for each of the twenty amino acids. Even though the shapes of the histograms vary from amino acid to amino acid, the ranges are mostly independent of amino acid identity. Unlike the α -‐helix region associated with tight local packing and hence a relatively small variation in the θ angle, there is a range of θ values associated with the β-‐strand region. We carried out sequence analyses of the β-‐strands to understand whether there is an amino acid selection principle for θ . We selected the ( θ , µ ) subspace consisting of µ values in the range from 175 ° to 185 ° (±5 ° degree interval around the ideal value of 180 ° ) and of θ angles in the range from 105 ° to 145 ° . We divided up the relevant range of θ angles into 40 bins of width 1 ° . Again, we measure the IPR defined in Eq.(1) with N=40 in this case. The extreme values of the IPR are 16.08 for the most localized amino acid, PRO, and 31.46 for the most spread out amino 17 acid, ASP (see Figure 8). The average θ value and its standard deviation for all amino acids in the β-‐region is 128.0 ° and 9.5 ° respectively. Figure 8: Distribution of θ angles in the β-‐ region for PRO (a) and ASP (b). PRO is the most localized amino acid, yet exhibits some spread of θ angles. We also studied the identities of the 210 pairs of amino acids (and their associated side chain sizes) located at sites i-‐1 and i+1 (these side chains stick out in roughly the same direction with a possibility of steric clashes) flanking site i in the β-‐region. We considered only those statistically significant pairs (i-‐1,i+1) which occurred at least 162 times (estimated as the total number of pairs divided by 210) with beads i-‐1, i, i+1 all lying in the β-‐strand region and divided the θ range again into 40 equally spaced bins. The number of amino acid pairs that met the 162 threshold was 52 out of the 210 pairs. We find that all pairs are spread out in θ values. The most localized pair among these was ALA-‐THR with an IPR of 10.51 and the most spread out pair was PHE-‐PRO with an IPR of 22.77 (see Figure 9 for histograms of θ values associated with these pairs). A cross plot of the mean van der Waals diameter of a pair and its average θ value (not shown) results in a weak correlation and an overall negative trend. All these results indicate that 18 the sequence does not play a significant role in determining the θ angle associated with a β-‐strand. Figure 9: Distribution of θ angles in the β-‐region for (ALA-‐THR) and (PHE-‐PRO) amino acid pairs in positions (i-‐1,i+1) respectively. ALA-‐THR is the most localized pair in θ space, yet is spread out. PHE-‐PRO is the most spread out pair. We carried out simple sequence analyses of the loop region as well, to understand whether there is a selection principle for the value of the µ angle. We select the ( θ , µ ) subspace consisting of θ angles in the range from 87.5 ° to 97.5 ° (±5 ° interval around the value 92.5 ° , identified as the peak density green region in Figure 3c) and µ values in the range from 90 ° to 360 ° to ensure that there is no overlap with the α -‐helix region. We divided up the range of µ angles into 54 bins of width 5 ° . We measured the IPR value for the 20 amino acids and we find that the most localized amino acid is GLY with a value of 8.49, whereas the most delocalized amino acid is PHE with an IPR equal to 28.42 (see Figure 10). Note that µ =180 ° and 360 ° correspond to planar configurations of 4 consecutive C α atoms, with the former corresponding to zig-‐zagging and the latter to rotation in the same sense. 19 Figure 10: Distribution of µ angles in the loop region for GLY and PHE. GLY is the most localized amino acid, yet exhibits a spread of angles. Based on the normalized density of occupancy of the amino acids in ( θ , µ ) space, one can assess the mutual similarity of the 20 amino acids by measuring the Cartesian distance between the 190 pairs of amino acids, which serves as a proxy of similarity. We have employed the Bhattacharyya coefficient (21) in order to calculate the degree of closeness of the ( θ , µ ) distributions of amino acids. We carried out hierarchical clustering by rank-‐ordering the closeness – the two closest amino acids were placed into a single group thereby now having effectively 19 groups of amino acids. This procedure was repeated recursively to reduce the effective groups of amino acids by one each time. A natural stopping point for this hierarchical clustering is when there is a relatively large jump in the measure of closeness of the remaining groups. The result of this analysis is shown in Figure 11 and yields 6 different groups comprising 7, 7, 2, 2, 1, and 1 amino acids each. Figure 12 shows the occupancy in ( θ , µ ) space of the six amino acid groups. 20 Figure 11: Clustering of amino acids into groups.

The 6 amino acid groups obtained based on their similarity in occupying the local structural ( θ , µ ) space are shown. 6 is a natural choice because the closeness for the next collapse into five groups is approximately twice as large as the previous closeness measure. A 5 member group would result in the merger of the two largest groups, Group A and Group B. If one were to retain seven groups, SER would detach from Group B and remain isolated as its own group. The sequences of hierarchical clustering for the first four groups A (blue), B (red), C (purple) and D (green) is shown with the link thickness quantitatively representing the closeness measure. 21 Figure 12: Occupancy of the six amino acid groups in ( θ , µ ) space. Groups A and B are somewhat similar with the main difference being the relative weights of the α -‐helix and β-‐strand regions. The most distinctive groups are E and F corresponding to GLY and PRO respectively. We remind the reader (see Figure 3c) that the density peaks occur at ( θ = 92.5 ° and µ = 47.5 ° ) for α -‐helices, ( θ = 122.5 ° and µ = 192.5 ° ) for β-‐strands, and ( θ = 92.5 ° and µ = 242.5 ° ) for loops. We alert the reader that this grouping is distinct from the more familiar groupings of amino acids based on their non-‐local interactions (22-‐29). Here, instead, it is entirely based on the similarity of their propensity to adopt specific local conformations. 22 We defined three significantly occupied regions of ( θ , µ ) space corresponding to α -‐helix ( θ∈ [90 ° ,95 ° ], µ ∈ [45 ° ,50 ° ]), β-‐strand ( θ∈ [105 ° ,145 ° ], µ ∈ [175 ° ,185 ° ]), and loop ( θ∈ [87.5 ° ,97.5 ° ], µ ∈ [90 ° ,360 ° ]). The amino acid occupancies of the three regions are normalized by their frequencies in the entire ( θ , µ ) space of all 4416 proteins and they are shown in Table 2. Amino acids having a normalized occupancy greater than 1 are over-‐represented in a given region and vice versa compared to the expectation from random considerations. The over-‐represented amino acids in the α -‐helix region (second column of Table 2) are all members of Groups A and C of amino acids with the top four being LEU (1.56), MET (1.46) and ALA/GLU both having 1.42 normalized occupancy. The amino acids over-‐represented in the β-‐strand region (third column of Table 2) are all members of amino acid Groups B and C, the top three being VAL (1.93), ILE (1.55) and TYR (1.51). Finally, the most over-‐represented amino acids in the loop region correspond to those that are the most under-‐represented in both the α -‐helix and β-‐strand regions: PRO (2.49), GLY (1.76), ASP (1.33) and ASN (1.31). These four amino acids are members of the amino acid groups D (ASN and ASP), E (GLY), and F (PRO) – see amino acid grouping analysis and Figure 11. The strong correlation observed between the values of normalized occupancies of amino acids in the three regions and the results of the amino acid groupings suggests that amino acid Group A can be interpreted as the “ α -‐helical” group, amino acid Group B as the “β-‐strand” group, while Group C is over-‐represented in both α -‐helix and β-‐strand regions. Finally, amino acid Groups D, E, and F can be described as “loop” groups, since they are strongly over-‐represented in loops and under-‐represented in both α -‐helix and β-‐strand regions. These findings are in a good accord with the observed amino acid propensities in protens previously reported in the literature (3,5,31-‐33). 23 Amino acid type Normalized occupancy in the α -‐helix region Normalized occupancy in the β-‐strand region Normalized occupancy in the loop region ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL 1.42 1.29 0.77 0.77 0.87 1.42 1.38 0.34 0.80 1.26 1.56 1.21 1.46 0.88 0.14 0.61 0.67 0.89 0.80 1.00 0.85 1.02 0.60 0.48 1.41 0.63 0.78 0.60 0.98 1.55 0.94 0.75 1.21 1.38 0.40 1.05 1.43 1.23 1.51 1.93 0.89 0.89 1.31 1.33 0.67 0.97 0.88 1.76 0.81 0.51 0.70 1.08 0.69 0.64 2.49 1.04 0.76 0.89 0.63 0.50

Table 2: Propensity of the 20 amino acids to occupy the α -‐helix, β-‐strand, and loop regions in ( θ , µ ) space. The numbers shown have been normalized by the amino acid occurrences in all of the ( θ , µ ) space. With the identification of just six groups, we proceeded to an analysis of correlating the local structure ( θ , µ ) at bead i to the identity of the triplet of amino acid groups at positions (i-‐1,i,i+1). The simplicity now is that the total number of distinct triplets is 216 instead of 8000. We considered each of these triplets and studied the number of times these occurred. Obviously, 24 one would expect that triplets containing the amino acids in groups C, D, E and F would be fewer than those occurring in Groups A and B. Indeed, the number of triplets which occurred more than 4461 times (deduced by dividing the total number of triplets = 963681 and the total number of types of triplets = 216) was just 57 and we used these for our analysis because of their statistical significance. The results are summarized in Figure 13. Figure 13: The six panels show the distributions of the 6 most localized triplets in the ( θ , µ ) plane. They all occupy the α -‐helix region predominantly. But they are spread out considerably underscoring the weak role of the amino acid sequence in matching with the local structure. We remind the reader (see Figure 3c) that the density peaks occur at ( θ = 92.5 ° and µ = 47.5 ° ) for α -‐helices, ( θ = 122.5 ° and µ = 192.5 ° ) for β-‐strands, and ( θ = 92.5 ° and µ = 242.5 ° ) for loops. 25 We conclude with the lessons learned from our analysis. Our goal here was to characterize the local structures associated with protein native state folds using the simple representation of just two angles ( θ and µ ) for each C α position (Figure 1). This simplification is made possible because the vast majority of bond lengths is substantially constant (Figure 2). The ( θ , µ ) variables are a coarse-‐grained representation of successive Ramachandran angles. The local structures adopted by proteins are captured by simple patterns of points in the ( θ , µ ) plane. This reveals that protein native state structures (even at the local level) are highly structured unlike the behavior of a generic chain. Even though there is a great deal of spread in the θ and µ values, there is a tight correlation in the plot of the mean θ versus mean µ for the 4416 proteins (Figure 4). Armed with insights on the local structural pattern, we explored a potential sequence-‐structure relationship in multiple ways. We considered the propensity of the 20 amino acids to occupy certain regions of local structural space. We also divided the 20 amino acids into 6 groups based on their similarity to each other in being associated with regions in the ( θ , µ ) space. We explored singlets and triplets based on grouping. The basic result of our analysis is that any sequence-‐local structure relationship is not very strong and there is flexibility in the ability of the amino acids to adapt to the local structure. This is consistent with the prevalence of neutral evolution where neither the native state fold nor the ability to function changes under many amino acid substitutions. It serves to underscore the pioneering results of Brian Matthews (34,35) and his team who “used the lysozyme from bacteriophage T4 to define the contributions that different types of interaction make to the stability of proteins”. One of their key findings was that “the protein is, in general, very tolerant of amino acid replacement”. Our findings also 26 are in accord with more recent experimental studies on proteins (36,37) which showed that, while protein structures are highly tolerant of amino acid substitutions, a small number of key alterations can yield distinct structure and function. An interesting challenge is to be able to predict, in a transparent and reliable manner, the identity of these key amino acids. We conclude by revisiting a seminal paper by Levitt (38) more than four decades ago in which he very carefully measured the Chou-‐Fasman propensity (39) of the twenty amino acids to be housed in three secondary structures. He noted that, generally, the preferences of the individual amino acids for secondary structure are rather weak. He provided a physical interpretation of his results by noting that “the chemical structure and sterochemistry of the amino acid plays a major part in determining its preference and dislike for secondary structure….. Bulky amino acids, namely, those that are branched at the β-‐carbon or have a large aromatic side chain, prefer β-‐sheet. The shorter polar side chains prefer reverse turns, as do Gly and Pro, the special side chains. All other side chains prefer α -‐helix, except Arg which has no preference.” Table 3 shows a side-‐by-‐side comparison of the results of Levitt obtained with less than a hundred protein structures and our findings with entirely different methods and more than 4000 protein structures. Our results match those of Levitt (38) confirming the adage – old is gold . 27 α -‐helix propensity β-‐sheet propensity Loop propensity Our study Levitt [38] Our study Levitt [38] Our study Levitt [38] LEU (1.56) MET (1.47) VAL (1.93) VAL (1.49) PRO (2.49) PRO (1.91) MET (1.46) GLU (1.44) ILE (1.55) ILE (1.45) GLY (1.76) GLY (1.64) GLU (1.42) LEU (1.30) TYR (1.51) PHE (1.32) ASP (1.33) ASP (1.41)

Table 3: Identities of three amino acids with the highest propensities to occupy the α -helix, β -strand, and loop regions in ( θ , µ ) space (taken from Table 2). The Table also shows the winning amino acids from Levitt’s analysis of 1978 (38). There is excellent accord between our results and those of Levitt. The key difference is the identity of one of the top three amino acids in the β-‐sheet propensity group. PHE scores third in Levitt’s analysis with a normalized probability of 1.32 whereas PHE scores fifth in our analysis with a similar probability score of 1.38. TYR scores third in our study and fourth in Levitt’s analysis.

Acknowledgements:

We are indebted to George Rose for his collaboration and inspiration.

We are grateful to Pete von Hippel for his warm hospitality and to him and Brian Matthews for stimulating conversations. We are very thankful to two anonymous reviewers for their constructive suggestions.

Funding:

This project received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-‐Curie Grant Agreement No 894784. The contents reflect only the authors’ view and not the views of the European Commission. Support from the University of Oregon (through a Knight Chair to JRB), University of Padova through “Excellence Project 2018” of the Cariparo foundation (AM), MIUR PRIN-‐COFIN2017

Soft Adaptive Networks grant 2017Z55KCW and COST action CA17139 (AG) is gratefully acknowledged. The computer calculations were performed on the Talapas cluster at 28 the University of Oregon.

Conflict of interest:

The authors declare they have no conflict of interest.

Author Contributions:

TŠ and JRB conceived the ideas for the calculations. TŠ carried out the calculations. JRB wrote the paper. All authors participated in understanding the results and reviewing the manuscript.

Emails: [email protected], [email protected], [email protected], [email protected]

References

Adv. Prot. Chem. , 283-‐438 (1968). 2. Fitzkee, N. C. & Rose, G. D. Steric restrictions in protein folding: an α -‐helix cannot be followed by a contiguous β-‐strand. Protein Sci. , 633-‐639 (2004). 3. Lesk, A. M. Introduction to Protein Science: Architecture, function and genomics (Oxford University Press, 2004). 4. Rose, G. D., Fleming, P. J., Banavar, J. R. & Maritan, A. A backbone-‐based theory of protein folding.

Proc. Natl. Acad. Sci. USA , 16623-‐16633 (2006). 5. Bahar, I., Jernigan R. L. & Dill, K. A.

Protein Actions (Garland Science, Taylor & Francis Group, 2017). 29 6. Rose, G. D. Protein Folding – Seeing is Deceiving, preprint. , bioRxiv doi: 10.1101/2020.11.10.375105 8.

Rubin B., & Richardson J. S. The simple construction of protein alpha-‐carbon models.

Biopolymers , 2381-‐2385 (1972). 9. https://proteopedia.org/wiki/index.php/Byron%27s_Bender 10. Lovell, S. C., Word, J.M., Richardson, J. S. & Richardson, D. C. The penultimate rotamer library. Proteins , 389-‐408 (2000). 11. http://kinemage.biochem.duke.edu/databases/top8000.php. 12. Wang, G. & Dunbrack, R. L. Jr. PISCES: a protein sequence culling server. Bioinformatics , 1589-‐1591 (2003). 13. Matthews, B. W. How planar are peptide bonds? Protein Sci. , 776-‐777 (2016). 14. Wang, F.& Landau, D. Efficient, Multiple-‐Range Random Walk Algorithm to Calculate the Density of States. Phys. Rev. Lett . , 2050–2053 (2001). 15. Flory, P. J. Statistical Mechanics of Chain Molecules (Wiley & Sons, 1968). 16. Rackovsky, S. & Scheraga H. A. Differential Geometry and Polymer Conformation.

Macromolecules , 1259-‐1269 (1981). 17. Levitt, M. A Simplified Representation of Protein Conformations for Rapid Simulation of Protein Folding. J. Mol. Biol. , 59-‐107 (1976). 18. Oldfield, T. J. & Hubbard, R. E. Analysis of C α Geometry in Protein Structures.

Proteins , 324-‐337 (1994). 30 19. DeWitte, R. S. & Shakhnovich, E. I. Pseudodihedrals: Simplified protein backbone representation with knowledge-‐based anergy. Prot. Sci. , 1570-‐1581 (1994). 20. Bahar, I., Kaplan, M. & Jernigan, R. L. Short-‐Range Conformational Energies, Secondary Structure Propensities, and Recognition of Correct Sequence-‐Structure Matches. Proteins , 292-‐308 (1997). 21. Bhattacharyya, A . On a measure of divergence between two statistical populations defined by their probability distributions.

Bulletin of the Calcutta Mathematical Society , 99–109 (1943). 22. Regan, L. & DeGrado, W. E. Characterization of a helical protein designed from first principles. Science , 976-‐978 (1988). 23. Miyazawa S. & Jernigan, R. L. Residue-‐residue potentials with favorable contact pair-‐term and an unfavorable high packing density term, for simulation and threading . J. Mol. Biol. , 623-‐644 (1996). 24. Riddle, D. S., Santiago, J. V., Bray-‐Hall, S. T., Doshi, N., Grantcharova, V. P., Yi, Q. & Baker D. Functional rapidly folding proteins from simplified amino acid sequences.

Nature Struct. Biol. , 805-‐809 (1997). Nat. Struct. Biol. , 871-‐874 (1997). Phys. Rev. Lett. , 765-‐768 (1997). Nat. Struct. Biol. , 1033-‐1038 (1999). Nat. Struct. Biol. , 994-‐996 (1999). J. Chem. Phys. , 1420-‐1423 (2001).

Adv. Protein Chem. , 1-‐109 (1985). Science , 1632-‐1641(1988). α -‐helical propensities III: Positional dependence at several positions of C-‐terminus. Prot. Sci. , 766-‐777 (2002). BMC Struct. Biol. (2010). Science , 631-‐635 (1988).

Annu. Rev. Biochem. , 139–160 (1993). Proc. Natl. Acad. Sci. USA , 11963–11968 (2007).

Proc. Natl. Acad. Sci. USA , 21149–21154 (2009). 32 38. Levitt, M. Conformational Preferences of Amino Acids in Globular Proteins.

Biochemistry , 4277-‐4285 (1978). 39. Chou, P. Y. & Fasman, G. D. Prediction of Protein Conformation. Biochemistry , 222-‐245 (1974). Supplementary Information for the manuscript “Local sequence-‐structure relationships in proteins”

Tatjana Škrbić, Amos Maritan, Achille Giacometti and Jayanth R. Banavar

Table S1: PDB codes of the 4416 proteins used in our analysis.1hzo_A 1hzt_A 1i0l_A 1i0r_A 1i0v_A 1i1n_A 1i24_A 1i27_A 1i2t_A 1i4u_A 1i6m_A 1i77_A 1i7h_A 1i7k_A 1i8a_A 1i8f_F 1i8k_B 1i8o_A 1i9c_A 1iap_A 1iby_B 1idp_A 1ig3_A 1igq_A 1ihj_B 1iib_B 1ijb_A 1ijt_A 1ijx_C 1ijy_B 1ikt_A 1io0_A 1iom_A 1ioo_B 1iq6_B 1osy_B 1oth_A 1ou8_B 1ouw_C 1ov3_A 1ow3_A 1owf_A 1ox0_A 1oxj_A 1oxs_C 1oyg_A 1oz9_A 1ozn_A 1ozw_B 1p0f_B 1p1j_A 1p1m_A 1p1x_B 1p28_B 1p3c_A 1p6o_B 1p71_A 1p99_A 1pa2_A 1pam_B 1pcf_C 1pdo_A 1pe9_B 1pfb_A 1pgv_A 1pj5_A 1pk3_B 1pkh_A 1pl3_A 1pl8_D 1vg8_C 1vh5_A 1vht_B 1vhw_A 1vi6_B 1vim_A 1vj2_A 1vjk_A 1vjw_A 1vk5_A 1vkc_A 1vke_F 1vki_A 1vkk_A 1vl1_A 1vl7_A 1vlc_A 1vlj_A 1vm9_A 1vmb_A 1vmf_C 1vmj_A 1vp2_A 1vp6_C 1vph_E 1vps_A 1vq3_B 1vqe_A 1vsr_A 1vyf_A 1vyo_A 1vzi_B 1vzy_B 1w0d_A 1w0n_A 2bjf_A 2bji_A 2bjq_A 2bk9_A 2bka_A 2bkf_A 2bkl_B 2bkm_B 2bko_A 2bkr_A 2bkx_A 2bky_B 2bky_Y 2bl0_A 2bl0_B 2bl8_B 2blf_A 2blf_B 2bme_B 2bmo_A 2bmo_B 2bnm_B 2bo1_A 2bo4_F 2bo9_C 2bo9_D 2boo_A 2bpd_B 2bpq_A 2bqx_A 2br9_A 2bsj_A 2bt6_A 2bt9_A 2buu_A 2he0_A 2he2_A 2he4_A 2hek_B 2heu_B 2hew_F 2hf9_A 2hfn_H 2hgx_B 2hhv_A 2hin_B 2hjv_A 2hke_B 2hl7_A 2hlc_A 2hls_A 2hlv_A 2hmq_D 2hor_A 2hos_B 2hq6_A 2hqh_C 2hqs_H 2hqy_A 2hra_A 2hrv_B 2hsa_A 2ht9_B 2hta_A 2hu9_A 2hur_B 2hv8_A 2hv8_E 2hvm_A 2hvw_C 2qud_A 2qul_C 2quo_A 2quy_H 2qv5_A 2qvb_A 2qvo_A 2qvu_B 2qwc_A 2qwl_A 2qwo_B 2qx8_B 2qxi_A 2qy1_B 2qy9_A 2qzt_B 2r0b_A 2r0h_C 2r16_A 2r1j_R 2r2y_A 2r31_A 2r37_A 2r5o_B 2r6j_B 2r75_1 2r8e_E 2r8o_A 2r8q_A 2r99_A 2r9f_A 2ra3_B 2ra4_A 2ra6_B 2rbk_A 2z66_B 2z6n_A 2z6o_A 2z6r_A 2z6w_A 2z72_A 2z79_B 2z7f_E 2z7f_I 2z84_A 2z8f_A 2z8l_A 2z8q_A 2z8u_B 2z8x_A 2z9v_B 2za0_A 2zbo_A 2zbt_B 2zc8_A 2zd1_A 2zdh_A 2zdo_B 2zdr_A 2zex_A 2zez_B 2zfc_B 2zfd_A 2zfz_D 2zgq_A 2zhj_A 2zhn_A 2zhz_C 2zib_A 2zjd_C 3ef6_A 3efy_A 3eg4_A 3egg_D 3ego_B 3egw_C 3ehg_A 3ehw_B 3ei9_B 3eif_A 3ej9_B 3ej9_C 3eja_A 3ejf_A 3ejg_A 3eju_A 3eki_A 3elw_A 3elx_A 3em1_A 3emi_A 3emw_A 3enb_A 3enk_B 3enu_A 3eoi_A 3epr_A 3eqn_B 3er6_A 3era_B 3erj_A 3erx_B 3esg_B 3esl_B 3eu9_C 3k3c_D 3k3k_A 3k3v_A 3k62_A 3k6f_A 3k6i_A 3k6v_A 3k6y_A 3k7f_B 3k7i_B 3k7p_B 3k89_A 3k8d_A 3k8u_A 3k8w_A 3k9o_A 3k9w_A 3kbf_A 3kcc_A 3kcg_H 3kci_A 3kcp_A 3kda_A 3ke4_A 3kef_B 3keo_B 3kfa_A 3kff_A 3kg0_C 3kgr_A 3kgz_B 3kh7_A 3kij_C 3kki_A 3kkq_A 3qgz_A 3qh4_A 3qhz_M 3qk8_C 3qki_B 3qmd_A 3qp4_A 3qqi_B 3qry_A 3qu1_B 3qug_A 3qxc_A 3qy1_B 3qyj_A 3qzb_A 3r0p_B 3r1i_A 3r1w_C 3r3r_A 3r3s_C 3r6f_A 3sil_A 4ubp_A 4ubp_B 4vub_A 5pal_A 6cel_A 6rxn_A 7fd1_A 7rsa_A 8abp_A