WHAT IS LIFE - Sub-cellular Physics of Live Matter
WWHAT IS LIFE?
Sub-cellular Physics of Live Matter
Antti J. Niemi
Department of Physics and Astronomy, Uppsala University, P.O. Box 803, S-75108, Uppsala,SwedenLaboratoire de Mathematiques et Physique Theorique CNRS UMR 6083, F´ed´eration DenisPoisson, Universit´e de Tours, Parc de Grandmont, F37200, Tours, FranceDepartment of Physics, Beijing Institute of Technology, Haidian District, Beijing 100081,China a r X i v : . [ c ond - m a t . s o f t ] D ec edicated to the 70 th anniversary of Schr¨odinger’s 1944 Lectures in Dublin. reface ” There are at present fundamental problems in theoretical physics awaiting solution, e.g. therelativistic formulation of quantum mechanics and the nature of atomic nuclei (to be followedby more difficult ones such as the problem of life). ” (P.A.M. Dirac 1931)
70 years ago Erwin Schr¨odinger presented a series of lectures at the Dublin Institutefor Advanced Studies with the title
What is Life? The Physical Aspect of the LivingCell.
The lectures were subsequently published in the form of a small book [1]. Accord-ing to Google Scholar this book has been cited even more frequently than Schr¨odinger’sarticles on quantum mechanics. In particular, it was credited by Crick and Watson asthe source of inspiration that led them to reveal the double helix structure of DNA.My lectures at les Houches were a celebration of the anniversary of Schr¨odinger’slectures, and for that reason I decided to share a title.Besides Crick and Watson, Schr¨odinger’s book has been, and continues to be, asource of inspiration to generations of physicists. But it seems to me that its full dimen-sionality might not yet have been fully comprehended, nor appreciated. In particular,the book has many parallels to Philip Anderson’s highly inspiring 1972 article
More isdifferent. Broken symmetry and the nature of the hierarchical structure of science. [2].Schr¨odinger clearly realised that
Bol’she ( bol b xe ) makes a difference when he wrotethat ... living matter, while not eluding the ’laws of physics’ as established up to date, is likelyto involve ’other laws of physics’ hitherto unknown, which, however, once they have beenrevealed, will form just as integral a part of this science as the former. Time should be ripe to accept a universal formal definition of life in terms of pro-teins and their dynamics: Proteins are the workhorses of all living organisms, proteinsare true nano-machines that participate in all the metabolic activities which consti-tutes life as we know it. From this perspective life is indeed something that can bemodelled and understood using both known and still to be revealed laws that governsub-cellular physics of live matter. For a physicist like me this is an exciting way totry and answer Schr¨odinger’s question.The underlying theme in my lectures is to view proteins as an exciting example ofa physical system where much of Anderson’s
Bol’she can be found. Indeed, proteinsseem to bring together most of the contemporary lines of research in modern theo-retical physics: Geometry of string-like structures, topological solitons, spin chains,integrable models, equilibrium and non-equilibrium statistical physics, quest for en-tropy, emergent phenomena, ... and much, much more. Moreover, the tools that areneeded to fully understand proteins range from highly formal to extensively numerical,and for a theorist there are almost endless opportunities to address questions with di- iii
Preface rect and important experimental relevance; physical, chemical, biological, medical, ...The amount of data is overwhelming and it is readily accessible. Experiments can bedone and directly compared with theoretical calculations and numerical computations;the subject continues to grow at a highly exponential rate.These lectures were prepared for students in condensed matter physics, both theoreti-cal and experimental. I assumed the students did not really have any prior knowledgeof proteins, that they did not even know how protein looks like at the atomic level.Thus I begin with a protein minimum . It explains the basic facts that I think one needsto know, to get started in research on physics of proteins. The rest of the lectures ad-dress proteins from the point of view of a physicist, from a perspective that I hopeappeals to the way a physicist thinks. As such the presentation could be somewhatintimidating to chemists and biologists who might find the concepts and techniquesthat I introduce as foreign, something they have not seen and are not accustomed to,in the context of a problem which is traditionally viewed as theirs. However, I assureyou that everything I describe is very simple. A good command of basic algebra is allthat it really takes to follow these lectures. Indeed, proteins brings physics together ina unique fashion with biology, chemistry, applied mathematics, even medical research.Theory and experiments. With the goal to understand matter that is alive. I hope you catch the bug , too!These lectures are available on-line at http : // topo − houches . pks . mpg . de / ? page i d = cknowledgements I wish to thank my many collaborators, postdocs and students from whom I havelearned so much on the subject of these lectures. They include Si Chen, AlirezaChenani, Maxim Chernodub, Jin Dai, Ulf Danielsson, Jan Davidsson, Thomas Garan-del, Ivan Gordeliy, Valentin Gordeliy, Jianfeng He, Yanzhen Hou, Konrad Hinsen,Shuangwei Hu, Theodora Ioannidou, Ying Jiang, Gerald Kneller, Andrei Krokhotine,Nevena Litova, Jiaojia Liu, Adam Liwo, Martin Lundgren, Gia Maisuradze, NoraMolkenthin, Alexander Molochov, Alexander Nasedkin, Daniel Neiss, Xuan Nguyen,Stam Nicolis, Xubiao Peng, Fan Sha, Harold Scheraga, Adam Sieradzan, Ann Sinel-nikova, Maxim Ulybushev, Yifan Zhou, and many many others (to whom I apologiseif name is not mentioned above). I also thank Frank Wilczek for making me curiousabout Anderson’s bol b xe .My research has been supported by CNRS PEPS grant, Region Centre Recherched’Initiative Academique grant, Sino-French Cai Yuanpei Exchange Program (Parte-nariat Hubert Curien), Vetenskapsr˚adet, Carl Trygger’s Stiftelse f¨or vetenskaplig forsk-ning, and Qian Ren Grant. I thank the organisers of the Topological Aspects in Con-densed Matter Physics
Claudio Chamon, Mark Goerbig and Roderich Moessner forgiving me the opportunity to present these lectures at Les Houches. I thank Interna-tional Institute of Physics in Natal, Brazil for hospitality during the writing of thesenotes. Last but not least, thanks to the great audience at Les Houches! ontents α trace reconstrucion 444.2 Universal discretised energy 454.3 Discretized solitons 484.4 Proteins out of thermal equilibrium 494.5 Temperature renormalisation 50 λ -repressor as a multi-soliton 545.2 Structure of myoglobin 57 ii Contents α References A Protein Minimum
Proteins are nano-scale machines that control and operate all metabolic processes inall living organisms. Proteins often have to function with extreme precision: Like mostmachines, those made of proteins need to have their parts and pieces in the rightplace, in a good shape and finely tuned. How else could these self-producing nano-machines work in such a great harmony, cooperate over an enormous range of scalesand uphold something as complex as life? Indeed, it is widely understood that thebiological function of a protein depends critically on its three dimensional geometry.From this perspective the so called protein folding problem that aims to explain andderive the shape of a biologically active protein using laws of physics, addresses theorigin of life itself [3, 4].Furthermore, a wrong fold is a common cause for a protein to lose its function.A wrongly folded protein can be dangerous, even fatal, to a biological organism. Itis now widely understood that diverse neurodegenerative diseases including variousforms of dementia such as Alzheimer’s and Parkinson’s, type-II diabetes, and abouthalf of all cancers, are caused by wrong folds in certain proteins [5]. At the same time,bacteria are on the rampage and emergent resistance through evolutionary processesrenders existing antibiotics ineffective at a rapid pace [6, 7, 8]. No effective methodsand treatments have been found to prevent or cure viral maladies like HIV, Ebola,or respiratory syndromes such as SARS and MERS. Our future protection againstthese and various other harmful and deadly pathogens depends on our skills andknowledge to develop conceptually new, protein-level approaches to fight and eliminateour enemies. Research on proteins is really about ’Saving the Planet’ as much as inany video game or movie ever made. But it is for real: By doing research on proteinsyou have a change to become a real-life
Gordon Freeman .For all these and many other reasons the ability to accurately describe the physicsof proteins, their structure and dynamics, would have an enormous impact on biology,pharmacy, and health sciences. It would provide huge benefits to the society by pavingways to prevent and cure many tormenting diseases. In particular, it would provide usan answer to what is life along the lines foreseen by Schr¨odinger.In the following I will give a short introduction to proteins, what you need to getyour research started as a physicist. For those who are really seriously interested inthe biological aspects, I recommend the textbook
Molecular Biology of the Cell [9].
A Protein Minimum
Proteins are one dimensional linear polymers. Proteins are composed of twenty differ-ent amino acids that share a number of structural properties: There are the backbone atoms which are common to all amino acids. There are the residues or side-chains which are different for each of the twenty amino acid. Fig. 1.1
Amino acids have a common backbone with heavy atom pattern − N − C − C − O − but there are also twenty different residues (side-chains) which we denote R. When two aminoacids combine together we obtain a di-peptide, in addition of a water molecule. In Figure 1.1 we show the chemical composition of a generic amino acid. Whentwo amino acids meet, a chemical process can take place that joins them together intoa di-peptide plus water, as shown in the Figure. When this process repeats itself weeventually arrive at a long polypeptide chain a.k.a. protein as shown in Figure 1.2.Once the protein attains the correct shape, it becomes ready for biological action. + H N % C α% C% N% C α% C% N% C α% C% C%C α% N% R% R% R% R%
O% O%O%O%H% H% H% H%H%H%H%
Fig. 1.2
Proteins are long linear chains of amino acids, each with a self-similar backbonestructure but twenty different residue structures (R).
Note the carbon atoms that are denoted C α in Figures 1.1 and 1.2. These arecalled the α -carbon, and they have a central rˆole in protein structure. As shown inthe Figures the C α carbons connect the residues to the backbone; the C α forms thecenter of a sp α ata banks and experiments carbons largely determine the shape of the protein. Thus much of our subsequentanalysis of protein structure and dynamics is based on the central rˆole of the C α , forreasons that become increasingly apparent as we proceed.In a living organism like you and me, the instructions for making proteins arestored in our genome . At the level of DNA the genetic code consists of a sequence ofnucleobases, that connect the two strands of DNA. A group of three nucleobases corre-sponds to a single amino acid; there is a segment of DNA, for each protein. The geneticcode is copied from DNA to RNA in a process called transcription . This process, likeall other processed in our body is driven by various proteins. Particular proteins calledenzymes act as catalysts to help and control complex biological reactions.Our DNA consists of four different nucleobases. Hence there are 4 × × stop the process of tran-scription. Thus we have a total of 62 combinations of nucleobases that encode the 20amino acids, the genetic code is degenerate. Research project: From the point of view of physics, we have an appetising similarity betweengenetic code where a group of three nucleobases corresponds to an amino acid, and theStandard Model where baryons are made of three quarks. Can you find a symmetry principleakin the Eightfold Way that relates the codons to the 20 amino acids? Hint: A good wayto start trying is to follow [10]. Once formed, the RNA has the mission to carry the genetic code to a ribosome. Aribosome is essentially a nano-scale 3D printer. It is made of proteins, and it has theduty to produce new proteins according to the instructions given to it by RNA. Theprocess where ribosome combines amino acids into a protein chain is called translation . The amount of data and information which is available in internet is abundant, bothon the genetic code and on proteins. Ther are various open-access libraries both onthe sequences and on the structures of proteins; the amount of data is already morethan any single person can possibly ever analyse, and it continues to increase at anexponential rate.For those who are mainly interested in biology and related bioinformatics, an ex-cellent resource on protein sequences and their biological function is http : // . uniprot . org / (1.1)This data bank contains presently almost 90 million different protein sequences. Asshown in Figure 1.3 (left) the number of known sequences grows at a very high,exponential rate.For those who are mainly interested in physics of proteins, Protein Data Bank (PDB) is an excellent resource http : // . pdb . org / (1.2)As shown in Figure 1.3 (right) the number of known protein structures in PDB isaround 100.000 and growing. But not at all as fast as the number of known sequences,only about 0 .
1% of known protein sequences have a known structure.
A Protein Minimum % % % % % % % % % % UniProt( PDB(
Fig. 1.3
Left : The increase in the number of sequences in
Uniprot , as a function of year,taken from (1.1).
Right : The increase in the number of structures in PDB as a function ofyear. Both annual increase and accumulated total are shown. Taken from (1.2).
Numerous other good sources of information exist and can be found in internet. Forexample
PSI-Nature Structural Biology Knowledgebase is a comprehensive databasefor various structural aspects of proteins. It can be found at http : // sbkb . org / (1.3)Most of the 100.000 structures in PDB have been resolved using x-ray crystallog-raphy. But other techniques are also being used. In particular, the number of NMRstructures is increasing. Until now it has been very difficult to resolve long proteinsequences using NMR techniques. Most NMR structures are quite short, those withmore than 100 amino acids are rare. The advantage of NMR where no crystallisationis needed over x-ray crystallography is, that NMR can more easily provide dynamicalinformation. It is possible to follow proteins in motion using NMR, while crystallisedstructures have problems moving. But even x-ray techniques such as small angle x-rayscattering (SAXS) and wide angle x-ray scattering (WAXS) are now being developed,to observe proteins in motion. In the near future, equipments including free electronlasers can provide detailed structural and dynamical information, at very short timescales. Various other methods are also in use and under development: The experi-mental study of protein structure and dynamics is still very much in its infancy. Thismakes the study of proteins into an exciting field to enter, also for those who areexperimentally minded. New techniques are developed and introduced, all the time.The protein crystals in PDB are ordered , they are commonly presumed to displaya crystallised conformation which is close to the biologically active one. But most pro-teins are apparently intrinsically unstructured. Such proteins can not be crystallisedinto any kind of biologically unique conformation. When these proteins are biologicallyactive, they do not have any single conformation. Instead they change their form, per-petually. Most proteins in our body are like this, intricate nano-machines that are in aconstant action. Very little, in fact next-to-nothing, is known on the structural aspectsof intrinsically unstructured proteins. In these lectures we shall look at examples ofboth ordered and intrinsically disordered proteins.Our experimental considerations will mainly make use of a subset of crystallo-graphic PDB structures which have been measured with a ultra-high precision, with ata banks and experiments a resolution better than 1.0 ˚A. The reason why we prefer to use these ultra-high res-olution structures is due to a process called refinement that commonly takes placeduring experimental data validation and model building [11]. During refinement andvalidation, one iteratively improves the parameters of an approximate trial structureof the experimental observations, until one obtains some kind of a best fit between thetrial structure and the observed diffraction pattern. As an Ansatz , and as reference,the process utilises known experimental crystallographic structures. Widely used ex-perimentally determined, highly accurate template libraries of small molecules includethe one by Engh and Huber [12]. Thus, the process of refinement might introduce abias towards structures that are already known. In particular, it is not clear to whatextent the structure of a small molecule persists in the complex, highly interactiveenvironment of a large protein.Indeed, it is important to recognise and keep in mind that the PDB data filesare prone to all kinds of errors [11, 13, 14, 15]. The data should be used with care.
Molprobity is an example of a web-server that can be used to analyse the quality ofan experimental protein structure. It can be found at http : // molprobity . biochem . duke . edu / (1.4)One might presume that in the case of ultra-high resolution structures, those thathave been measured with better than 1.0 ˚A resolution, the quality of observed diffrac-tion pattern should be very good. These structures should have much less need forrefinement, during the model building. Thus, they should be much less biased towardsknown structures. The number of misplaced atoms should be relatively low. Fluctua'on*distance*(Å)* N u m b e r * o f * e n t r i e s * Fig. 1.4
The distribution of the Debye-Waller fluctuation distance for the C α atoms, amongthose crystallographic PDB structures that have been measured with better than 1.0 ˚A res-olution. In order to minimise radiation damage, crystallographic structures are often mea-sured at temperatures which are near that of liquid nitrogen i.e. around 80-90 Kelvin.Thus the thermal fluctuations in the atomic coordinates should be small. In the PDBdata, the experimental uncertainty in the atomic coordinates is estimated by the (tem-perature) B-factors. Besides the thermal fluctuations, these B-factors summarise also
A Protein Minimum all the other uncertainties that the experimentalist thinks affects the precision. In Fig-ure 1.4 we show the distribution of the Debye-Waller fluctuation distance in our subsetof the ultra-high precision structures, for the C α atom coordinates. The fluctuationdistance can be estimated using the Debye-Waller relation, √ < x > ≈ (cid:114) B π (1.5)This corresponds roughly to the one standard deviation uncertainty in the experimen-tally measured coordinate values. In Figure 1.5 is an example of a generic PDB entrythat shows how the B-factors are listed together with the atomic coordinates. Fig. 1.5
An example of PDB file, in this case the amino acid histidine (HIS). The secondcolumn lists the atom number (98-107) along the backbone. The third column lists the typeof the atom; CA stands for C α , and the entries 98-101 are the backbone N-C α -C-O atomsof HIS. The entries 102-107 are side chain atoms. In this case, HIS is the 12 th amino acidalong the backbone. The ( x, y, z ) coordinates are listed in the following three columns. TheB-factors are listed in the last column, before a list of the chemical symbols. Note that inthis list, the hydrogen atoms are absent. Hydrogens can be difficult to observe. According to Figure 1.4, among our ultra-high resolution PDB structures the onestandard deviation error distance in the C α atomic positions peaks in the range of 0.3- 0.5 ˚Angstr¨om. The lower bound is around 0.15 – 0.2 ˚Angstr¨om and in these lectureswe shall adopt ∼ .
15 ˚A i.e. around 20% of the radius of the carbon atom, as thelower bound estimate for the size of quantum mechanical zero point fluctuations inthe C α positions. Note that historically ∼ γ -rays. Like most linear polymers, proteins have a highly complex phase structure that candepend on a multitude of factors, including the chemical structure of a polymer andits solvent, temperature, pressure, changes in solvent’s acidity and many other envi-ronmental factors [16, 17]. In a good solvent environment, the interactions betweena polymer segment and the solvent molecules usually cause the polymer to expand,and the polymer behaves like a self-avoiding random walk. In a poor solvent envi-ronment, such as water that surrounds proteins in our cells, the polymer-polymerself-interactions dominate and the polymer tends to collapse into a space filling con-formation. These two phases are separated by a θ -regime, where the repulsive and hases of proteins attractive interactions cancel each other and the polymer has the geometric characterof a random walk (Brownian motion).In the limit where the number N of monomers is very large, many aspects of thephase structure become universal [18, 19, 20]. An example of a universal quantity inthe case of a linear polymer such as a protein, is the compactness index ν . It is definedin terms of the radius of gyration R g [16, 17, 21, 22, 23] R g = 12 N (cid:88) i,j ( r i − r j ) (1.6)Here r i are the coordinates of all the atoms in the polymer. In the case of a protein,with no loss of generality we may restrict r i to the coordinates of the backbone C α atoms only. The compactness index ν governs the large- N asymptotic form of (1.6).When the number N of monomers becomes very large, we have [23] R g N large −→ R N ν (1 + R N − δ + ... ) (1.7)It should be obvious that ν coincides with the inverse Hausdorff dimension of thestructure. Besides the compactness index ν , the critical exponents δ etc. are alsouniversal quantities. But the form factor R that characterises the effective distancebetween the monomers in the large N limit, and the subsequent amplitudes R etc. that parametrise the finite size corrections, are not universal [23]. These parameterscan in principle be computed from the chemical structure of the polymer and solvent,in terms of environmental factors such as temperature and pressure.As a universal quantity, ν is independent of the detailed atomic structure. Differentvalues of ν correspond to different phases of polymer. The four commonly acceptedmean-field values of ν are ν = / / /
51 (1.8)Under poor solvent conditions such as in the case of proteins in our cells, a linearsingle chain polymer collapses into the space filling conformation and we have themean field exponent ν = 1 /
3. For a fully flexible ideal chain the mean field value is ν = 1 /
2. This phase takes place at the θ -regime that separates the collapsed phasefrom the high temperature self-avoiding random walk phase, for which we have themean field Flory value ν = 3 /
5. Finally, when ν = 1 the polymer is like a rigid stick.Some proteins are like this, some collagens for example. Research project: Three dimensional dynamical systems such as the Lorenz equation providenumerous examples of space curves with attractors that have all kind of Hausdorff dimensions.Can you find physical examples of polymers (proteins) where ν takes values that correspondto phases which are different from the four listed in (1.8)? The mean-field values of the critical exponents ν , δ etc. in equation (1.7) may becorrected by fluctuations. In particular, in the universality class of the self-avoidingrandom walk the improved values are [24, 25] A Protein Minimum ν = 0 . ± . δ = 0 . ± .
03 (1.9)The computation of (1.9) in [24, 25] utilises the concept of universality to argue thatthe three dimensional self-avoiding random walk is in same universality class with the O ( n ) symmetric scalar field theory with a quartic self-interaction, in the limit wherethe number of components n →
0. A subsequent numerical Monte Carlo evaluation ofthe critical exponents (1.9), computed using a self-avoiding random walk model on asquare lattice [23] gave very similar values ν = 0 . ± . δ = 0 . ± .
03 (1.10)In the case of crystallographic PDB protein structures, we may evaluate the de-pendence of the radius of gyration on the number of residues using the coordinates ofthe C α atoms. The result is shown in Figure 1.6. A least-square fit to the data gives N ) Å ( g R a =0.370 ν Å =2.286 R = / ν N ) Å ( g R b =0.940 ν Å =1.020 R =0.973 ν Å =0.480 R = / ν Fig. 1.6
The C α radius of gyration as a function of residues, in the case of those monomericcrystallographic PDB proteins that have been measured with better than 2.0 ˚A resolution.The red line is the least-square linear fit, and the blue line denotes the Flory value ν = 3 / R g ≈ R N ν ≈ . N . ˚A (1.11)Note that proteins are not homopolymers. But when N increases, the detailed aminoacid structure of a protein should become increasingly irrelevant in determining therelation between the radius of gyration and the number of residues. For long proteinchains, the inhomogeneity due to amino acids should be treated as a finite size correc-tion in (1.7), when N becomes very large. Indeed, in the limit of very large numberof residues a generic protein is like a chain along which the 20 residues have beenquite randomly distributed. It should be like a spin-chain embedded in R where eachresidue is a spin variable, with 20 different (random) values. Thus, when the ratio ackbone geometry /N becomes very small the effect of an individual residue becomes small in an av-erage, statistical sense. The protein approaches a homopolymer that is equipped withan ”averaged” residue. Research project: Develop theory of spin chains embedded in R . According to Figure 1.6, those proteins that can be crystallised are in the collapsed ν ≈ / order parameters in the sense of Landau, Ginzburgand Wilson; the concept of order parameter is described in numerous textbooks. A local order parameter is a systematically constructed effective dynamical variable,that describes collectively a set of elemental degrees of freedom such as atoms andmolecules, in a system that is subject to the laws of statistical physics. Examples ofan order parameter include magnetisation in the case of a ferromagnet, director in anematic crystal, condensate wave function in superfluid He , and Cooper pair in aBCS superconductor. The concept of an order parameter is often intimately relatedto the concept of symmetry breaking and emergent phenomena. For example, in eachof the cases that we mentioned, we have a symmetry that becomes broken, and thissymmetry breaking gives rise to emergent structures. In particular, the breaking of thesymmetry is described in terms of the properties of the pertinent order parameter, ineach case.In the case of a protein, we have already concluded that the phase structure relatesto the aspects of protein geometry. The different phases of a protein are characterisedby different Hausdorff dimensions (1.8). Moreover, proteins have an apparent symme-try that has become broken:Amino acids are chiral molecules. An amino acid can be either left-handed ( L ) orright-handed ( D ). The only exception is glycine which has no chirality. For two aminoacids to form a di-peptide as shown in Figure 1.1, they must have the same chirality;you can’t easily shake someones left hand with your own right. For some reason thesymmetry between L and D is broken in Nature, practically all amino acids that appearin proteins of living organisms from prokaryotes such as bacteria to eukaryotes like us,are left-handed chiral. This symmetry breaking apparently reflects itself to higher levelgeometric structures of proteins: As a polymer chain, the proteins that are found inliving organisms are more often twisted in a right-handed manner, that the opposite.Thus, any local order parameter that describes the phase properties of proteins inliving organism, should somehow capture the helical aspects of protein geometry.Figure 1.7 details the local geometry of a protein. In this Figure we identify a C α atom together with its covalently bonded N, C, H atoms, and the residue R that startswith a covalently bonded carbon atom called C β . The covalent bonds between thesefive atoms form a sp α at the centre. Take the C atomto be the top of the tetrahedron, and N, H, R as the three bottom vertices. Considerthe axis of the tetrahedron that runs along the covalent bond from the C to C α and In the context of protein research, order parameters are sometimes called reaction coordinates . A Protein Minimum
C"""""O""""N"""""H""""R"
Fig. 1.7
The atoms along the protein backbone form peptide planes; the two covalent bondsC=O and N-H are anti-parallel. The figure also shows the definition of the various bond andtorsion angles, between the covalent bonds. The two torsion angles ( φ, ψ ) are the Ramachan-dran angles. look down this axis from C towards C α . If the H atom is in the clockwise directionfrom residue R, the amino acid is left-handed; this is the case in Figure 1.7. Otherwise,the amino acid is right-handed.In Figure 1.7 we have also two peptide planes , one which is prior to the C α atomand another which is after the C α atom. The C α atom is located at the joint vertexof the two adjacent peptide planes. We proceed to analyse in detail the properties ofthe peptide plane geometry.It turns out that the geometry of the peptide planes is indeed very rigidly planar.The planarity is measured by the angle ω which is shown in Figure 1.7; it is the anglebetween the C=O covalent bond, and the N-C α covalent bond (or N-H covalent bond)The values of ω are found to fluctuate very little around ω = π , which correspondsto the trans -conformation of the backbone and is shown in the Figure; there are afew entries, mainly involving the amino acid proline , where the backbone is in the cis -conformation where ω vanishes. The cis -conformation is equally planar, the fluc-tuations around ω = 0 are minimal. Figure 1.8 shows the distribution of the ω anglesin our ultra high resolution subset of crystallographic protein structures, those thathave been measured with better than 1.0 ˚A resolution. In addition, the values ofthe three covalent bond angles ( κ , κ , κ ) that are defined in Figure 1.7 are shown inthis Figure. Their values are likewise subject to relatively small variations.The various covalent bond lengths along the protein backbone have also valuesthat fluctuate very little around their average values. Figure 1.9 shows the variousbond lengths between the heavy atoms along the backbone. We also show the distance amachandran angles Fig. 1.8
Left: Distribution of the torsion angle ω between the covalent bonds C=O and N-Hin Figure 1.6, in our ultra high 1.0 ˚A resolution PDB subset. The result shows that thegeometry of the peptide planes is indeed planar, with very high precision. For trans we have ω ≈ π and for cis we have ω ≈
0. Right: The distribution of the three bond angles ( κ , κ , κ )defined in Figure 1.7 in our data set. The variation around the average values is relativelysmall. between two consecutive C α atoms, which is also subject to very small fluctuations. Fig. 1.9
Left: Distribution of the three covalent bond lengths C α -C, C-N and N-C α shown inFigure 1.7, along the protein backbone. Right: Distribution of the length between neighboringC α atoms, along the protein backbone. The smaller value ∼ . cis , andthe larger value ∼ . trans . The Figures 1.8 and 1.9 show that the three covalent bond angles ( κ , κ , κ ), thetorsion angle ω and the various covalent bond lengths reveal no dependence on localgeometry. Each of these variables have fairly uniform distributions, which is appar- A Protein Minimum ently quite insensitive to variations in local backbone geometry. Thus we are left withonly the two Ramachandran angles ( φ, ψ ) in Figure 1.7 as the potential local orderparameters, to characterise local geometry along the backbone. Indeed, it turns outthat the variations in their values are not small, and in particular appears to dependon backbone geometry. This is shown by the Ramachandran map in Figure 1.10, thatdisplays the ( φ, ψ ) distribution in PDB structures which have been measured with bet-ter than 2.0 ˚A resolution. Note the asymmetry, both in φ and in ψ ; this asymmetry (a) (b) Figure 2.6: a) Stereographic DF-plot showing the angles between two residues de-noted in the PDB database as having no secondary structure. b) Ramachandran plotof residues with no secondary structure. φ"ψ" Fig. 1.10
Distribution of Ramachandran angles ( φ, ψ ) defined in Figure 1.7 in radians. translates into helicity of the protein backbone. Regular right-handed helical struc-tures (right-handed α -helices) are quite common, while left-handed helical structuresare very rare.Since the phase structure of a protein relates to its geometry, we may expect thatthe set of the two Ramachandran angles could be utilised as the local order parametersto describe the phase structure of proteins. However, it turns out that this is not thecase [26]: The Ramachandran angles form an incomplete set of local order parameters.To show this, we consider all the PDB structures in our data set of ultra-high resolutionstructures, those that have been measured with better than 1.0 ˚A resolution. We dothe following reconstruction :From the PDB coordinates of the atoms we first compute the numerical values of allthe Ramachandran angles, for each and every peptide plane. Then we continue and dothe inverse: We start from the Ramachandran angles that we have computed, and weassume that all the other angles and bond lengths have their average values. We thentry to reconstruct the original protein structure, by computing the C α coordinates.It turns out that this reconstruction of the coordinates, going from C α to Ra-machandran and the back, usually fails. It is not possible to do such a reconstruction,in the case of a generic protein.In Figure 1.11 we show how the reconstructed proteins fail to reproduce eventhe correct fractal geometry of folded proteins [26]. Instead of (1.11), we obtain theasymptotic relation omology modelling Fig. 1.11
Comparison of R g between the PDB structures, and the structures obtained bya reconstruction which is based on Ramachandran angles, with all other backbone anglesand bond lengths set to their ideal values. Red circles are the original PDB structures, bluecrosses are the reconstructed structures. R g ≈ . N . ˚Afor the reconstructed protein structures: The inverse Hausdorff dimension ν = 0 .
45 isincorrect. In order to reconstruct the correct fractal geometry, it turns out that weneed to include all the angular variables in Figure 1.7, as variables. Only for the bondlengths, can the average PDB values be used.Eventually, in a subsequent lecture, we shall construct a different set of local orderparameters and we show their completeness. However, we first address the modellingof proteins in terms of their (primary) full atom level description.
As shown in Figure 1.3 (left), the number of sequences in
Uniprot grows at a ratewhich is much higher than that of structures in PDB, shown in Figure 1.3 (right).The gap is enormous and keep on growing: Sequencing is now fast, cheap and routinewhile crystallographic protein structure determination is difficult, time consuming,and often very expensive; apparently the average cost of a PDB structure is around100 kUSD. Thus it is impossible to close the gap between sequences and structuresby purely experimental methods. To close this gap we need to develop efficient andaccurate computational methods.Homology modelling [27, 28, 29] together with other comparative modelling tech-niques, is presently the most reliable and effective approach to generate a three-dimensional model of a protein structure from the knowledge of its amino acid se-quence. These methods are consistently the top performers in the bi-annual CriticalAssessment of protein Structure Prediction (CASP) tests, see Homology between two proteins is commonly measured on the basis of amino acid sequencesimilarity. High sequence similarity is a sign of shared ancestry, but for short proteins it can alsooccur due to chance. A Protein Minimum http : // predictioncenter . org / Homology modelling techniques aim to construct the atomic level structure of a givenprotein (target), by comparing its amino acid sequence to various libraries of known,homologically related protein structures (templates). Apparently the reason why thiskind of methods work is as follows: It seems to be the case that the number of possibleprotein folds in nature is limited, and that the three-dimensional protein structuresseem to be better conserved than the amino acid sequences [30]. However, the qualityof a model obtained for the target is largely dictated by the evolutionary distancebetween the target and the available template structures. If no closely homologoustemplate can be found, these approaches typically fail.From the point of view of physics, any template based approach has the disad-vantage that there is no energy function. Thus, no energetic analyses of structureand dynamics can be performed. For this, other techniques need to be introduced: Tounderstand the physical properties of a protein we need to know the energy function.
Research project: Try to develop a structure prediction approach that uses the best of bothworlds: One that finds an initial Ansatz structure using templates, and then develops it usingtechniques of physics. For hints, continue reading ... ... if we were organisms so sensitive that a single atom, or even a few atoms, could make aperceptible impression on our senses -Heavens, what would life be like! (E. Schr¨odinger)
We present a short overview of all-atom molecular dynamics, where impressiveprogress is being made. The aim of an all-atom approach is to model the way a proteinfolds, by following the time evolution of each and every atom involved including those ofthe surrounding solvent (water). But despite impressive progress, the subject remainsvery much under development and provides enormous challenges to those brave enoughto face them: Both conceptual level and technical level problems remain to be resolved,involving issues relevant to physics, chemistry, computer science, and also problemsfor those interested in optimisation and efficient algorithm development.There are many examples of all-atom energy functions, called force fields in thiscontext. The most widely used are
Charmm [31] and
Amber [32]; we also mention
Gromacs [33] which is a popular platform for performing molecular dynamics (MD)simulations using different force fields.A typical all atom force field that describes a protein chain with N atoms has thefollowing generic form E ( r N ) = (cid:88) bonds k b l − l ) + (cid:88) angles k a κ − κ ) + (cid:88) torsions V n nω − γ ))] (1.12)+ N − (cid:88) j =1 N (cid:88) i = j +1 (cid:40) (cid:15) i,j (cid:34)(cid:18) r ij r ij (cid:19) − (cid:18) r ij r ij (cid:19) (cid:35) + q i q j π(cid:15) r ij (cid:41) (1.13)The first two contributions in (1.12) describe harmonic oscillations of the bond lengthsand bond angles around certain ideal values ( l , κ ). The third contribution involves ll-atom models torsion angles and evokes a mathematical pendulum, similarly with ideal value groundstate(s) given by γ . Torsion angles ω are often found to be much more flexible thanbond angles κ , thus the numerical values of V n are commonly orders of magnitudesmaller than those of k a . Moreover, a mathematical pendulum which is used for thetorsion angles in lieu of a harmonic oscillator, allows for larger amplitude motions andmultiple equilibrium states, which is consistent with their more flexible character.The second contribution (1.13) involves the 6-12 Lennard-Jones potential thatapproximates the interaction between a pair of neutral atoms; the form is chosenfor computational efficiency. Finally, we have the Coulomb interaction. In practice,long-range interactions (1.13) are cut off beyond a distance around 10 ˚Angstr¨om,again for computational efficiency. The ”ideal” values of the various parameters areusually determined by a process of optimisation, using comparative simulations. Forparameter fine tuning, one may use very short peptides that have accurately knownexperimental structures. Such structures can be found for example in the Engh-Huberlibrary [12].In a full all-atom MD simulation one solves the Newton’s equation of motion with(1.12), (1.13) in an environment of water which in a full all-atom approach is alsotreated explicitly. We note that e.g. the r − term in (1.13) models short range Paulirepulsion due to overlapping electron orbitals, and the r − term emerges from longrange van der Waals interactions. Thus these terms have a quantum mechanical origin,and the proper interpretation of the ensuing Newton’s equation is not in terms of aclassical equation but in terms of a semi-classical one.A MD simulation solves the Newton’s equation iteratively, with a time-step ∆ τ .This time-step should be short in comparison to the shortest time scale ∆ t min , thatcharacterises the fastest atomic level oscillations. The ratio of the two defines a di-mensionless expansion parameter. For good convergence of the iterative, discretisedNewton’s equation we should have ∆ τ ∆ t min << τ should be no more than a few femto-seconds: The modelling ofprotein folding in an all-atom MD is conceptually a weak-coupling expansion in termsof the dimensionless ratio (1.14). Research project: The Newton’s equation for mathematical pendulum ¨ ω = V sin ω is integrable. Its naive discretisation ¨ ω → (cid:15) ( ω n +1 − ω n + ω n − ) yields the so-called standard map [34] ω n +1 − ω n + ω n − = (cid:15) V sin ω which is not integrable. However, an integrable discretisation of the mathematical pendulumis known. See for example Chapter VIII of ref. [35]. Find which one describes protein foldingmore accurately. A Protein Minimum
Enormous computational resources have been developed and dedicated to solve the protein folding problem [3, 4]. From the point of view of molecular dynamics thisamounts to a numerical simulation of the all-atom time evolution in a protein, froma random chain initial condition to the natively folded conformation using a forcefield such as (1.12), (1.13). For example, the
Blue Gene family of supercomputerswas originally designed by IBM to address the problem of protein folding and genedevelopment, which explains the name. Subsequently special purpose computers havebeen constructed to address the folding problem, at all-atom level of scrutiny. Thus farthe most powerful is the
Anton supercomputer, built by D.E. Shaw Research [36,37,38].In the case of relatively short proteins,
Anton can produce a few microseconds of invitro folding trajectory per day in silico [37]; this is about 3-4 orders of magnitude morethan a Blue Gene can achieve. In a series of MD simulations of 12 fast-folding proteins,from chignon with 10 residues to genetically modified λ -repressor with 80 residues,and with each protein capable of folding within a number of micro seconds in vitro , Anton was able to produce dynamical trajectories that reproduced the experimentallyobserved folded structures, in some examples with an impressive precision [38]. Atthe moment, these results set the benchmark for all-atom protein folding simulations.But despite the encouraging results obtained by
Anton , several issues remain to beovercome before proteins can be routinely folded at all-atom level, starting from aknowledge of the amino acid sequence only. We name a few, as a challenge for futureresearch: • For the majority of proteins it takes much longer than a millisecond or so tofold. For example myoglobin, which is probably the most widely studied protein andthat we shall fold in the sequel, needs about 2.5 seconds in vitro to reach its nativestate when it starts from a random chain initial condition. Thus, it would take atleast 1.000 years to simulate the folding of myoglobin at the level of all-atoms, usingpresently available computers. • In an all-atom MD simulation with explicit water, an increase in the number ofwater molecules quickly exhausts the capacity of presently available computers. Forexample, in the case of the 80 residue λ -repressor mutant simulation using Anton , onlyaround 11.000 explicit water molecules could be included. This should be contrastedto the physiological conditions: The normal pH of blood plasma is around 7.4. SincepH is defined as the log of the reciprocal of the hydrogen ion activity in a solution,this translates to an average of one proton per 10 . water molecules. Protein foldingis strongly affected by pH; proteins have different natively folded states, at differentpH values. Thus, it remains a formidable task to describe physiologically relevant pHenvironments in a truly all-atom manner. • All-atom force fields utilise a quadratic, harmonic oscillator approximation aroundthe ideal values of the bond lengths and angles (1.12); mathematical pendulum in thecase of torsional angles. As long as the atomic fluctuations around the ideal valuesremain very small, this is a decent approximation. But whenever the atoms deviatefrom their ideal positions more than ”just a little”, higher order non-linear corrections hermostats are inevitably present and can not be ignored. The existing all-atom force fields arenot designed to account for this. The force fields are not built to describe protein con-formations in a realistic manner, whenever the lengths and angles are not very closeto their ideal values. This section describes a technical aspect, that is not needed in the rest of the lectures. Weadd this section as we feel it addresses a highly important yet still open theoretical issue thatneeds to be addressed by any all-atom modelling approach.
Finally, we have the theoretically highly interesting problem of thermostatting :An all-atom simulation solves the Newton’s equation. As a consequence energy isconserved and we have a microcanonical ensemble. But in a living cell the energy isnot conserved, instead temperature is fixed. Accordingly proteins in living organismsshould be described in terms of a canonical ensemble. One needs to convert the mi-crocanonical ensemble which is described by the all-atom Newton’s equation, into acanonical one. To achieve a conversion, one adds thermostats to the system. A ther-mostat models an environment which maintains its own temperature constant whileinteracting with the system of interest: We refer to [39] for a detailed description ofthermostats.The Langevin equation is an example of a thermostat which is well grounded inphysical principles. In the case of uniform damping we write it as follows,¨ x i = −∇ i E ( x ) − λ ˙ x i + F i ( t ) (1.15) < F i ( t ) > = 0 & < F i ( t ) · F j ( s ) > = λk B T δ ij ( t − s ) (1.16)The derivation of the Langevin equation assumes two sets of variables, a.k.a. particles:There are light, small particles that are fast. And there are heavy particles that areslow. The Langevin equation describes the dynamics of the latter, in the limit wherethe ensuing particles are much slower and heavier than the small and fast ones. Thederivation is based on standard arguments of statistical physics. Thus, the Langevinequation is a priori a well grounded and rigorous method to introduce temperatureinto a Newtonian system.In the case of all-atom MD the Langevin equation can not be used. There are nosmall and fast variables around. The oscillating atoms are themselves the small andfast variables. Moreover, from a conceptual point of view the presence of a white noise(1.16) de facto converts the ensuing numerical algorithm into a Monte Carlo process,albeit an elaborated one.Many alternatives to Langevin thermostat have been introduced. An example isthe Gaußian thermostat [39]. Unlike Langevin’s, it is deterministic . Instead of smalland fast background variables, one couples the variables x i of interest to explicitthermostat variables X k , with equations of motion m i ¨ x i = −∇ i E ( x ) − ∇ i E ( x , X ) (1.17) M k ¨ X k = −∇ k U ( X ) − ∇ k E ( x , X ) − α k ˙ X k (1.18) A Protein Minimum
The thermostatting effect is modelled by the last term in (1.18); the α k is determinedby subjecting the auxiliary variables to the non-holonomic constraint M k X k = 32 k B T ⇒ α k = ( 32 k B T ) − (cid:20) M k X k + ˙ X k [ ∇ k U ( X ) + ∇ k E ( x , X )] (cid:21) for each of the thermostat variable. The disadvantage of a Gaußian thermostat is inthe lack of a Hamiltonian character in the equations of motion (1.17), (1.18).A Hamiltonian, albeit singular, thermostatting has been proposed by Nos´e andHoover [40,41,42,43]. Their thermostat is probably the most widely used in the contextof all atom protein folding simulations. In the simplest variant the all-atom phasespace is extended by a single ghost particle with a singular, logarithmically divergentpotential that provides the temperature for all the rest . The singular character of thepotential introduces an inexhaustible heat reservoir, in essence.Following [44] we consider the toy-model case of a single canonical degree of freedom( p, q ) in the presence of a single Nos´e-Hoover thermostat degree of freedom ( P, Q ). Theclassical action is [40, 41, 42, 43] S = T (cid:90) − T dt (cid:26) p ˙ q + P ˙ Q − p mQ − V ( q ) − P M − β ln Q (cid:27) (1.19)We may assume that V [ q ] has the double-well profile V [ q ] = λ ( q − m ) (1.20)We are interested in the tunnelling amplitude between the two minimum energy con-figurations q = ± m , < + m, T |− m, − T > = (cid:90) q (+T)=+ m (cid:90) q ( − T)= − m [ dp ][ dq ] e iS (1.21)The integration over p is Gaußian but yields a Jacobian factor that depends on Q .This Jacobian has the same functional form as the last term in (1.19), thus it canbe absorbed by a redefinition (renormalisation) of β → β . As usual, we evaluate thetransition amplitude using Euclidean (imaginary) time formalism, obtained by sending t → it . The Euclidean action is S = T (cid:90) − T dt (cid:26) Q q t + V [ q ] + 12 Q t + 1 β ln Q (cid:27) (1.22) We remark that this ghost particle is a little like a Higgs particle that gives the mass for otherparticles in sub-atomic physics. Except that instead of mass it gives temperature, and it can not beobserved. hermostats and we search for a finite action instanton trajectory that connects the two states q = ± m ; note that the continuation to imaginary time ”inverts” the potential term,as shown in Figure 1.12. The instanton is a solution to the equations of motion !m t t Fig. 1.12
Analytic continuation to Euclidean time has the effect of inverting the potential V ( q ). The instanton is a trajectory which starts at (Euclidean) time − T from the localmaximum at q = − m and reaches the local maximum at q = − m at time +T, as shown inthe Figure on right. Q q tt = V q − Q Q t q t (cid:39) V q − γq t (1.23) Q Q tt = Q q t + 1 β (1.24)Note how the coupling between q and the thermostat variable Q gives rise to aneffective friction-like coefficient γ ( t ).We first consider the case where the thermostat field Q is absent. This amounts tosetting Q ≡ q tt = V q = 4 λq ( q − m )By adjusting the initial velocity we conclude that a solution exist which starts from q = − m at time − T, and ends at q = + m at time +T. This solution is the instantonthat gives rise to a finite tunneling amplitude between ± m ; in the limit T → ∞ theinstanton has the familiar double-well topological soliton profile q ( t ) = m tanh (cid:16) √ λ m ( t − a ) (cid:17) (1.25)Now suppose the thermostat field is present. We first consider a scenario, whereat ± T the system reaches a stationary state where q = ± q (cid:54) = ± m . Since the action(1.22) should remain finite as T → ∞ , we arrive at the Gibbsian relation Q (T) T →±∞ −→ Q ± = e − βV [ q ± ] (1.26)This proposes that β is indeed the Bolzmannian temperature factor, when positive. A Protein Minimum
Next, we integrate (1.24) and then take the limit T → ∞ ; since the Euclidean actionshould be finite, we may assume that Q t should vanish as T → ∞ which removes thesurface term. We find ∞ (cid:90) −∞ dt (cid:8) Q t + Q q t (cid:9) = − ∞ (cid:90) −∞ dt β (1.27)For a non-trivial tunnelling configuration and with a finite Euclidean action (1.22)the integral on the left-hand-side of (1.27) should be finite and non-vanishing. Butsince the l.h.s. is manifestly non-negative, the Bolzmann temperature factor β can notbe positive as it should. Thus we conclude that the presence of the thermostat fieldsuppresses tunnelling. We note that a suppression of tunneling amplitude by Nos´e-Hoover thermostat in the case of double-well potential has been observed in numericalsimulations [43].Proteins regularly need to tunnel over various different potential barriers in theirpresumably highly rugged energy landscape, when proceeding from a random initialconfiguration to the natively folded state. Thus we suspect that simulations usingNos´e-Hoover thermostats can cause complications whenever we have a protein forwhich we can expect that the folding pathway goes thru various intermediates andmolten globules. Research project: Investigate how the suppression of tunneling amplitudes in the case of aproperly modified Nos´e-Hoover thermostat could be avoided.
Other kind of thermostats have also been introduced, in particular we mention theBerendsen thermostat [45]. These thermostats that are often convenient in numericalsimulations, are designed to approach canonical ensembles in a limit. But they com-monly lack a Hamiltonian interpretation i.e. are non-physical, and thus the physicalinterpretation of the results is not there. While this might not be an issue when thegoal is simply to find a local energy minimum of an all-atom force field such as (1.12),(1.13), a non-physical thermostat can not be used to model the dynamics of proteins.
The all-atom approaches where the discretised Newton’s equation is solved iterately,are conceptually weak coupling expansions in the dimensionless parameter (1.14). Todescribe the folding of most proteins, one needs to be able to extend this expansionand ensure its convergence, over some fifteen orders of magnitude or even more. Fromthe perspective of sub-atomic physics, this is like extending perturbative StandardModel calculations all the way to Planck scale.Several approaches have been developed, with the goal to introduce an expansionparameter that corresponds to a time scale which is clearly larger than the femtosecondscale. Such coarse-grained force fields average over the very short time scale atomicfluctuations: If the denominator in (1.14) can be increased, so can the numerator, andit becomes possible to develop expansions which reach longer in vivo time scales withno increase in the in silico time. In practice, a carefully crafted coarse grained forcefield can cover up to around three orders of magnitude longer folding trajectories than ther physics-based approaches: all-atom approaches, while still maintaining a good overall quality. Here we mentionin particular UNRES as an example of such a carefully crafted coarse grained forcefield [46, 47, 48]. See the homepage http : // . unres . pl / Finally, we comment on the various versions of the G˜o model [49] and the closelyrelated elastic (Gaußian) network models [50]. These approaches have played a veryimportant rˆole, to gain insight to protein folding in particular when the power ofcomputers is insufficient for any kind of serious all-atom folding simulations. In thesemodels the folded configuration is presumed to be known; the individual atomic co-ordinates of the folded protein chain appear as an input. A simple energy function isthen introduced, tailored to ensure that the known folded configuration is a minimumenergy ground state. In the G˜o model the energy could be as simple as a square wellpotential which is centered at the native conformation. In elastic network models theatoms are connected by harmonic oscillators, with energy minima that correspond tothe natively folded state.Since the positions of all the relevant atoms appear as parameters in these mod-els, they contain more parameters than unknown and thus no predictions can bemade. Only a description is possible. From the point of view of a system of equations,these models are over-determined. In any predictive energy function the number ofadjustable parameters must remain smaller than the number of independent atomiccoordinates. Otherwise, no predictions can be made, no physical principles can betested. Even though a description is possible.
Bol’she
In 1972 Anderson wrote an article [2] entitled
More is different that has been inspira-tional to many. In particular, he argued for the importance of emergent phenomena.But the call for
Bol’she is already present in Schr¨odinger’s 1944 book: ... living matter, while not eluding the ’laws of physics’ as established up to date, is likelyto involve ’other laws of physics’ hitherto unknown, which, however, once they have beenrevealed, will form just as integral a part of this science as the former.
However, Anderson makes a crucially important point that can not be found inSchr¨odinger’s book. Anderson’s article has this point even as a subtitle:
Broken sym-metry and the nature of the hierarchical structure of science . Anderson realised thatin order for emergent phenomena to give rise to structural self-organisation, one needsa symmetry which becomes broken.We have already pointed out that in the case of proteins a broken symmetry ispresent. Individual amino acids can be either left-handed chiral or right-handed chiral,equally. But for some reason, living matter is built almost exclusively from amino acidsthat are left-handed chiral. We note that, apparently as a consequence, protein chainsare predominantly chiral with right-handed helicity.
Consider a fluid dynamical system such as water, the atmosphere, or any other sce-nario that can be described by the Navier-Stokes or Euler equation or their manydescendants. These are fundamentally atomic systems, but with an enormous numberof constituents. Their macroscopic properties are governed and often with a very highprecision, by the properties of a local order parameter which computes the fluid ve-locity. Structures such as vortices and tornadoes, and solitonic waves like the one thatemerges from the Korteweg - de Vries equation, are all described by a solution thatbreaks an underlying symmetry.A fluid dynamical vortex line is a familiar example of a highly regular collectivestate of individual atoms, with a topological character. It is an example of an emergentstructure: At the atomic level of scrutiny, the individual atoms and molecules thatconstitute the vortex are subject to random, brownian thermal motion. By followinga single individual water molecule you can not really conclude whether a vortex ispresent. The vortex materialises only at the macroscopic level, when the individualhaphazard atomic motions become collectively self-organised into a regular pattern.It would be incomprehensible to construct a macroscopic vortex line such as a tor-nado in atmosphere from purely atomic level considerations, even in principle: A vortex n optical illusion is an example of a soliton, and solitons can not be constructed simply by adiabati-cally building up individual atomic level interactions as small perturbations around aground state which consists of free individual atoms. A (topological) soliton emergeswhen non-linear interactions combine elementary constituents into a localised collec-tive excitation that is stable against small perturbations and cannot decay, unwrap ordisentangle. We start to describe the importance of symmetry breaking at an intuitive level. Weconsider a simple optical illusion, not a physical example. But it nevertheless demon-strates how
Bol’she gives rise to a symmetry that becomes broken, and the illusion ofbreaking symmetry leads to the formation of structure, in our eyes.We then proceed to consider two physically relevant examples, where a very similarbroken symmetry is present. But now in a physically relevant fashion. Most impor-tantly, each of the two examples we present describes a simple physical scenario thatshares many features with proteins.In Figure 2.1 we have two sets of cubes. On the left we have two individual cubes,and on the right there are more cubes. If one looks at the cubes on the right, oneshould be able to visualise some order. For example light grey steps that come down,from the left. Now rotate the Figure, slowly. When one keeps ones eyes focused on thetwo individual cubes on the left, nothing really happens besides an overall rotationof the two cubes. But if one instead focuses on the set of cubes on right, one shouldobserve a rapid transition: There is a point where the direction of the steps changes,abruptly, so that after a rotation by 180 degrees we still have the same light grey steps,still coming down from the left.The system on the right is
Bol’she , with a discrete Z symmetry under a rotationby 180 degrees. There are two ground states, and our mind chooses one, thus breakingthe symmetry. In fact, the scenario is very much like that in Figure 1.12, there are twoidentical ground states. In this sense Anderson’s Bol’she is present in Figure 2.1. Nosimilar optical illusory effect is observed when the two individual cubes on the left arerotated.We now proceed to describe two physical examples where a similar kind of Z sym-metry becomes broken, with equally dramatic physical - not illusory - consequences. Polyacetylene in trans conformation is like a simplified protein. Figure 2.2 a) shows thestructure. There is a ”backbone” that consists entirely of carbon atoms, and at eachcarbon atom there is a ”side chain” with a single hydrogen. Much like in a protein, butsimpler. In Figure 2.2 b) we depict the ( trans ) polyacetylene chain by combining each(CH) unit into a single vertex. The double line describes the σ -bond and the singleline is the π -bond, between two consecutive C atoms. Due to a Peierls instability theasymmetry of the chain causes the ground state to be doubly degenerate. The freeenergy acquires the double-well profile that we have depicted in Figure 1.12. The twoground states are related to each other by a Z reflection (parity) symmetry of thepolyacetylene chain about a (CH) site, which exchanges the σ -bonds and the π -bonds. Bol’she
Fig. 2.1
Rotate the figure, slowly, by 180 degrees. When you focus your eyes on the twoindividual cubes on the left, nothing much happens. However, if you focus on the set ofcubes on the right, you see an abrupt effect like a phase transition between two different butidentical ground states; we have a Z symmetry that has become broken by visual perception. When we choose one of the ground states, we break the symmetry. But we mayintroduce domain walls, that interpolate between two different ground states. In Fig-ure 2.2 c) we show an example. In this Figure we have two domain walls. Each inter-polates between two different ground states; between the two domain walls we have aregion where the the σ -bonds and π -bonds have become interchanged. Quantitatively,in terms of the double well potential shown in Figure 1.12, each of the two domainwalls correspond to a topological soliton (instanton) profile (1.25).We now argue that we have Bol’she which makes things different: The chain inFigure 2.2 c) is obtained from the chain in Figure 2.2 b) by removal of a single electron.There are 15 bonds in the Figure 2.2 b) and 14 in the Figure 2.2 c). The removal of asingle electron converts a double bond into a single bond, and we have simply movedthe bonds around to make the two identical domain walls. Since the structures inFigure 2.2 b) and in Figure 2.2 c) differ from each other only by the removal of asingle bond and since the two domain walls are identical, related to each other byparity, the two domain walls must share equally the quantum numbers of the missingbond: The two domain walls have electric charge half, each.This phenomenon of fermion number (charge) fractionalisation demonstrates howAnderson’s
Bol’she really makes a difference: Such exotic quantum number assign-ments could never be obtained simply by linearly superposing an integer number ofinitially non-interacting electrons and holes adiabatically, in a continuous manner, intoa weakly interacting system:
A charge half state simply can not be made by combin-ing together any finite number of particles with an integer charge.
Unless something
Bol’she is involved. pin-charge separation C H C C C C C C C C H H H H H H H H a) b) c) Fig. 2.2 a) The trans polyacetylene. b) One of the two degenerate ground states in a trans -polyacetylene chain. The other ground state is obtained by a reflection that inter-changes the double bonds and the single bonds. c) A state of a trans -polyacetylene chainwith one of the double bonds converted into a single bond and then transported to form twodomain walls carrying fractional electric charges.
Fine points:
A bond line in a polyacetylene corresponds to two electrons, one withspin up and the other with spin down. Thus an isolated domain wall must have a netelectric charge which is equal in magnitude but opposite in sign to that of a singleelectron. But since the spins of the electrons that have been removed are paired, thedomain wall carries no spin. This unusual spin-charge assignment has been observedexperimentally and it constitutes the essence of fermion number fractionalisation [51,52,53] that gives rise to electric conductivity along the polyacetylene. In the absence ofthe spin doubling we would observe a domain wall that carries half the electric chargeof one electron. Note that if we add a single electron to a domain wall, we obtain astate which is charge neutral but carries the spin of the electron. Alternatively, if weremove a single electron from a domain wall we obtain a state with spin one-half anda charge that equals (minus) twice that of the electron. Neither of these states arepossible, without
Bol’she . We shall eventually argue that proteins are very much like one dimensional spin-chains; the side chains are akin spin variables along the backbone. Thus, our secondexample is a one dimensional spin chain. For conceptual clarity we may assume thatthe spin variables are single electrons (or maybe protons like H+). We assume thatthe background lattice prefers a ground state which is an antiferromagnetic N´eel statewhere all the spins point into an opposite direction from their nearest neighbours. Bol’she
Furthermore, we assume that in the ground state all the sites have single occupancy,and that there is a very strong repulsive force between the electrons which prevents adouble occupancy. Such a ground state is a Mott insulator, and we have depicted thestructure in Figure 2.3. As in the case of a polyacetylene, the ground state is doubly
Fig. 2.3
One the two degenerate ground states in a N´eel antiferromagnet, with alternatingspin directions along a one dimensional lattice of electrons. The Z symmetric ground stateis obtained by reversing the direction of every spin. degenerate: The Z symmetry transformation operates by reversing the direction ofthe spin at every single lattice site. By choosing one of the two ground states, we breakthe Z symmetry.If we reverse the direction of a single spin along the chain, we form a localizedconfiguration with three parallel neighboring spins; see Figure 2.4 a). By successively a) Fig. 2.4 a) The same as in Figure 2.3 but with the spin of only one electron reversed. b)The state of Figure a) is decomposed into two domain walls representing the spinons. reversing the direction of neighboring spins but without changing the total spin we candecompose this configuration into two separate domain walls, each consisting of twoparallel spins. These domain walls interpolate between the two ground states of thespin chain, as shown in Figure 2.4 b). Since the lattices in Figure 2.4 a) and b) differfrom each other only by the flip of a single spin, the total change in the spin betweenthe two lattices is one. The two domain walls in Figure 2.4 b) are also identical. Thuseach must have a spin equal to one-half.Since we have made the two domain walls without adding or removing any elec-trons, each of them must be charge neutral. We conclude that the domain walls are spinons , they describe the same spin degree of freedom as a single electron but withno charge [54]. The two domain walls interpolate between the two different ground pin-charge separation states of the antiferromagnetic chain, very much like the domain walls in the case ofpolyacetylene.Now we consider the same antiferromagnetic lattice but with one of the electronsremoved as shown in Figure 2.5. This corresponds to an underdoped case,we have a a) Fig. 2.5 a) Spinon (left) and chargon (right) states in the underdoped state. b) The stateof Figure a) is decomposed into two domain walls representing the spinons. hole in the spin chain. When we move the hole to the right we arrive at the situationdepicted in Figure 2.5 b). In addition of the hole we have here another domain wallwhich is similar to the two domain walls that we have in Figure 2.4 b). This domainwall is a spinon, it has spin one-half but carries no charge. Since we have removed oneelectron and since the spinor does not carry charge, the hole must carry a charge whichis opposite to that of one electron. But no spin is available for this hole, it describes aspinless holon .The ARPES experiment at Lawrence Berkeley Laboratory has confirmed that suchspinons and holons do exist, they have been observed in a one dimensional copper oxide(SrCuO ) wire [55].Note that in Figure 2.5 a) the two spins on the left and on the right of the holeare parallel with each other but in Figure 2.5 b) the two spins are opposite to eachother. This confirms that like the spinon, the spinless holon is a domain wall thatinterpolates between the two different ground states of the chain.Finally, in Figure 2.6 a) we have added a single electron to the lattice. This exampleis of particular conceptual interest since it allows us to directly address what happens toa (pointlike) electron when it enters the antiferromagnetic environment: The presenceof the electron introduces a single site with a double bond ( ↑↓ ). The chain is nowover-doped. As before we transport the double filled state, e.g. to the right so that wearrive at the situation depicted in Figure 2.6 b). Note that due to Pauli exclusion thetransport occurs so that we move alternatively either a spin-up or a spin-down stateone step to the right. The final configuration shown in Figure 2.6 b) describes twoseparate domain walls that both interpolate between the two distinct ground statesof the spin chain. One of these domain walls is again a spinon. The other one is a doublon . Since one electron has been added and since spinor has spin but no charge, Bol’she a)
Fig. 2.6 a) The N´eel state with one electron added (overdoped case). b) Spinon (left) anddoublon (right) states in the overdoped state. we conclude that the doublon does not have any spin but it carries the entire chargeof one electron. The charge of the doublon is opposite to that of the holon.The two examples we have discussed, fermion number fractionalisation and spincharge separation, make it plain and clear how much difference
Bol’she can do: It wouldbe impossible to make states with the spin of an electron but no charge, or states withthe charge of an electron but no spin, simply by superposing an integer number of non-interacting electrons and then adiabatically switching on their mutual interactions.For states with such exotic quantum numbers we need to have an environment with asymmetry that has become broken.
The Landau (Fermi) liquid is a paradigm on which much of our understanding of many-body systems like metals is based. This paradigm states, that in a physical system witha large number of atoms, each atom retains its individual integrity. The propertiesof a material system which is described by a Landau liquid, can be understood bysuperposing its individual constituents in a weak coupling expansion around a givenground state. In particular a Landau liquid which is made of electrons and protonscan only have spin and charge assignments which are obtained by superposing theindividual spins and charges. This is the case when the properties of the system canbe understood using the notion of adiabaticity: We imagine that we start from aninitial condition where the elemental constituents have no mutual interactions. Theinteractions are then turned on, adiabatically, in a continuous manner. Accordinglythe ground state of the original non-interacting system becomes continuously deformedinto the ground state of the interacting system.The two examples that we have described, polyacetylene and antiferromagnetic spinchain, show that the Landau liquid paradigm is not a universal one. In a Landau liquidsystem it would be impossible to have states with exotic quantum numbers such as anelectric charge which is half of that of a single electron. The Landau liquid paradigmcan fail whenever we have emergent structures that display symmetries which becomebroken. In such scenarios there are often collective excitations like topological solitons ll-atom is Landau liquid which can not be built simply by adding together small adiabatic perturbations arounda ground state of non-interacting elemental constituents.The all-atom description (1.12), (1.13) of protein force field implicitly assumes theLandau liquid paradigm. According to (1.12), (1.13) the individual atoms oscillatearound their ideal values, under the influence of a potential which is either a harmonicoscillator or a mathematical pendulum. The Lennard-Jones and Coulomb potentialsintroduce continuously evolving deformations around the ideal atomic positions, in amanner that can be modelled by a weak coupling expansion of the iterative Newton’sequation in powers of (1.14); in practical simulations these long range interactions aretuned off, beyond a distance around 10 ˚Angstr¨om.It remains to be seen whether an all-atom Landau liquid description of proteinsbreaks down. But the basal ingredient, that of a broken Z symmetry which alsoappears in our examples of polyacetylene and antiferromagnetic spin chain, is certainlypresent: The amino acids are left-handed chiral, and as a consequence proteins thatconstitute live matter prefer right-handed helicity along their backbone. Strings in three space dimensions
Thus we have come to the conclusion that an organism and all the biologically relevantprocesses that it experiences must have an extremely ’many-atomic’ structure and must besafeguarded against haphazard, ’single-atomic’ events attaining too great importance. (E.Schr¨odinger)
We start our search of broken symmetry and the ensuing
Bol’she that makes us alive,by considering differentiable (class C ) strings in R . The Abelian Higgs Model (AHM) is the paradigm framework to describe vortices assolitons. Solitonic vortices are important to many physical phenomena, from cosmicstrings in Early Universe to type-II superconductors. In particular, the Weinberg-Salam model of electroweak interactions with its Higgs boson is a non-Abelian exten-sion of AHM.The AHM involves a single complex scalar (Higgs) field φ and a vector field A i .These fields are subject to the U(1) gauge transformation φ → e ie ϑ φA i → A i + ∂ i ϑ (3.1)where ϑ is a function, and e is a parameter. The standard AHM Hamiltonian is H = 14 G ij + | ( ∂ i − ieA i ) φ | + λ (cid:0) | φ | − v (cid:1) (3.2)where G ij = ∂ i A j − ∂ j A i When the space dimension D is odd, a Chern-Simons term ( ChS ) can be added to(3.2). Explicitly, D = 1 : ChS ∼ AD = 3 : ChS ∼ AdAD = 5 :
ChS ∼ AdAdA etc. (3.3)The Chern-Simons term is the paradigm way to break parity.In a material system (3.2), (3.3) is the Kadanoff-Wilson energy in the limit wherethe fields have slow spatial variations [18,19]. To describe this limit, we start from the belian Higgs Model and the limit of slow spatial variations full free energy of a material system which is based on the AHM field multiplet; wedenote it F ( φ, A i )This free energy is in general a non-local functional of the field variables. But it must begauge invariant. Thus, it can only depend on manifestly gauge invariant combinationsof the fields such as | φ | , | ( ∂ i − ieA i ) φ | , . . . Consider the limit where the length scale that is associated to spatial variations of thefield variables becomes very large in comparison to other characteristic length scalesof the system. In this limit we can expand the free energy in powers of the gaugecovariant derivatives of the fields. The expansion looks like this [56]: F ( φ, A i ) = V ( | φ | ) + Z ( | φ | ) | ( ∂ i − ieA i ) φ | + ChS ( A ) + W ( | φ | ) G ij + . . . (3.4)The leading term is called the effective potential. The higher derivative terms aremultiplied by functions Z ( | φ | ), W ( | φ | ) etc .The AHM energy (3.2), (3.3) constitutes the leading order non-trivial contribution to(3.4), in powers of fields and their covariant derivatives.We introduce a set of new variables ( J i , ρ, θ ), obtained from ( A i , φ ) by the followingchange of variables φ = ρ · e iθ A i → J i = i e | φ | [ φ ∗ ( ∂ i − ieA i ) φ − c.c. ] (3.5)We can introduce these new variables whenever ρ (cid:54) = 0. Note that both ρ and J i aregauge invariant under the gauge transformation (3.1). But θ → θ + ϑ When we write (3.2), (3.3) in terms of these new variables (3.5), we have H = 14 (cid:18) J ij + 2 πe σ ij (cid:19) + ( ∂ i ρ ) + ρ J i + λ (cid:0) ρ − η (cid:1) + ChS (3.6)where J ij = ∂ i J j − ∂ j J i and σ ij = 12 π [ ∂ i , ∂ j ] θ (3.7)We observe that (3.6) involves only variables that are manifestly U (1) gauge invariant.In particular, unlike in the case of (3.2), in (3.6) the local gauge invariance is entirelyremoved, by a change of variables [57].The term σ ij is a string current. It has a Dirac δ -like support which coincides withthe world-sheet of the cores of vortices. When (3.6) describes a finite energy vortex,(3.7) subtracts a singular string contribution that appears in J ij . Since J i is singular in Strings in three space dimensions the presence of a vortex line, it makes a divergent contribution to the third term in the r.h.s. of (3.6). But the divergence becomes removed, provided the density ρ vanisheson the world-sheet of the vortex core. Thus the vanishing of ρ along a string-like linein space is a necessary condition for the presence of finite energy vortex lines. Proteins are string-like objects. Thus, to understand proteins we need to develop theformalism of strings in R , at least to some extent.The geometry of a class C differentiable string x ( z ) in R is governed by theFrenet equation, described widely in elementary courses of differential geometry. Theparameter z ∈ [0 , L ] where L is the length of the string in R . We can compute thelength from L = L (cid:90) dz || x z || = L (cid:90) dz √ x z · x z ≡ L (cid:90) dz √ g. (3.8)Here we recognise the static version of the standard Nambu-Goto action, with genericparameter z ∈ [0 , L ]. We re-parametrize the string to express it in terms of the arc-length s ∈ [0 , L ] in the ambient R , by the change of variables s ( z ) = z (cid:90) || x z ( z (cid:48) ) || dz (cid:48) In the following we use the arc-length parametrisation, exclusively. We consider asingle, open string that does not self-cross. We introduce the unit length tangentvector t = d x ( s ) ds ≡ x s (3.9)the unit length bi-normal vector b = x s × x ss || x s × x ss || (3.10)and the unit length normal vector, n = b × t (3.11)The three vectors ( n , b , t ) defines the right-handed, orthonormal Frenet frame. Wemay introduce this framing at every point along the string, whenever x s × x ss (cid:54) = 0 (3.12)We proceed, momentarily, by assuming this to be the case. The Frenet equation thencomputes the frames along the string as follows, dds nbt = τ − κ − τ κ nbt (3.13) rame rotation and abelian Higgs multiplet Here κ ( s ) = || x s × x ss |||| x s || (3.14)is the curvature of the string on the osculating plane that is spanned by t and n , and τ ( s ) = ( x s × x ss ) · x sss || x s × x ss || (3.15)is the torsion that measures how the string deviates from its osculating plane. Both κ ( s ) and τ ( s ) are extrinsic geometric quantities i.e. they depend only on the shape ofthe string in the ambient space R . Conversely, if we know the curvature and torsionwe can construct the string. For this we first solve for t ( s ) from the Frenet equation.We then solve for the string by integration of (3.9). The solution is unique, modulo aglobal translation and rotation of the string.Finally, we note that both the curvature (3.14) and the torsion (3.15) transform asscalars, under reparametrisations of the string. For this we introduce an infinitesimallocal diffeomorphism along the string, by deforming s as follows s → s + (cid:15) ( s ) (3.16)Here (cid:15) ( s ) is an arbitrary infinitesimally small function such that (cid:15) (0) = (cid:15) ( L ) = 0 = (cid:15) s (0) = (cid:15) s ( L )Under this reparametrization of the string, the curvature and torsion transform asfollows δκ ( s ) = − (cid:15) ( s ) κ s δτ ( s ) = − (cid:15) ( s ) τ s (3.17)which is the way how scalars transform. The Lie algebra of diffeomorphisms (3.16) isthe classical Virasoro (Witt) algebra. In order to relate the Abelian Higgs Multiplet with extrinsic string geometry, weobserve that the normal and bi-normal vectors do not appear in (3.9). As a consequencea SO(2) rotation around t ( s ), (cid:18) nb (cid:19) → (cid:18) e e (cid:19) = (cid:18) cos η ( s ) sin η ( s ) − sin η ( s ) cos η ( s ) (cid:19) (cid:18) nb (cid:19) . (3.18)has no effect on the string. For the Frenet equation this rotation gives dds e e t = τ + η s ) − κ cos η − ( τ + η s ) 0 κ sin ηκ cos η − κ sin η e e t . (3.19) Strings in three space dimensions
Fig. 3.1
Rotation between the Frenet frames and a generic frame, on the normal plane ofthe string.
We may utilise the κ dependent terms in (3.19) to promote κ into a complex quan-tity, with modulus that coincides with the manifestly frame independent geometriccurvature (3.14). κ η −→ κ (cos η + i sin η ) ≡ κe iη (3.20)This enables us to interpret the transformation of ( κ, τ ) in (3.19) in terms of a one-dimensional version of the U(1) gauge transformation (3.1): We identify the curvatureas the Higgs field and the torsion as the U(1) gauge field [58], κ → κe − iη ≡ φτ → τ + η s ≡ A i (3.21)Note that when we choose η ( s ) → η B ( s ) = − (cid:90) s τ (˜ s ) d ˜ s (3.22)we arrive at the unitary gauge in terms of the abelian Higgs multiplet. This definesBishop’s parallel transport framing [59]. The Bishop’s frame vectors e B , do not rotatearound the tangent vector, dds (cid:0) e B + i e B (cid:1) = − κ e − iη B t Thus, unlike the Frenet framing that can not be introduced when the curvature κ ( s )vanishes, the Bishop framing can be introduced and defined in an unambiguous andcontinuous manner in that case. However, it turns out that in the case of proteins whichis the subject we are interested in, the Bishop frames do not work very well [60]. he unique string Hamiltonian The curvature and torsion are the only available geometric quantities for constructingenergy functions of strings, while (3.2), (3.3) is the unique energy of the Abelian Higgsmultiplet in the Kadanoff-Wilson sense of universality:Consider a string, in the limit where the curvature and torsion are slowly varyingfunctions along it. The shape of the string can not not depend on the framing, thusits energy can only involve combinations of the curvature and torsion in a manifestlyframe independent fashion.On the other hand, (3.2), (3.3) is the unique SO(2) ∼ U(1) invariant energy func-tion that involves a complex Higgs field and a gauge field. It emerges from generalarguments and symmetry principles alone, in the limit where the length scale that isassociated to spatial variations of the field variables becomes very large in comparisonto other characteristic length scales of the system.Thus the only
Hamiltonian that can describe a string and its dynamics in the limitof fields with slow spatial variations is [58] H = L (cid:90) ds (cid:8) | ( ∂ s + ie τ ) κ | + λ ( | κ | − m ) (cid:9) + a L (cid:90) ds τ (3.23)We have here included the one dimensional version of the Chern-Simons term (3.3).It introduces net helicity along the string, breaking the Z symmetry between stringsthat are twisted in the right-handed and left-handed sense.In (3.23) both κ and τ are expressed in a generic, arbitrary framing of the string.The corresponding gauge invariant variables (3.5) are the curvature (3.14) and torsion(3.15) that characterise the extrinsic string geometry. In terms of these gauge invariantvariables, which we from now on denote by ( κ, τ ) the Hamiltonian (3.23) is H = L (cid:90) ds (cid:8) ( ∂ s κ ) + e κ τ + λ ( κ − m ) (cid:9) + a L (cid:90) ds τ (3.24)where we have simply followed the steps that gave us (3.6). Thus (3.24) is the uniqueenergy of a string, in terms of geometrically defined curvature and torsion, and in thelimit where the spatial variations of curvature and torsion along the string are small. Relations exist between (3.24), the integrable hierarchy of the nonlinear Schr¨odinger(NLS) equation, and the Heisenberg spin chain of ferromagnetism. For this we intro-duce the following Hasimoto variable ψ ( s ) = κ ( s ) e ie s (cid:82) ds (cid:48) τ ( s (cid:48) ) (3.25)that combines the curvature and torsion into a single gauge invariant complex variable.In terms of (3.25), we obtain the Hamiltonian of the integrable nonlinear Schr¨odingerequation [61, 62, 65] as follows, Strings in three space dimensions κ s + e κ τ + λκ = ¯ ψ s ψ s + λ ( ¯ ψψ ) = H (3.26)With the Poisson bracket { ψ ( s ) , ¯ ψ ( s (cid:48) ) } = iδ ( s − s (cid:48) )the lower level conserved densities in the integrable NLS hierarchy are the helicity H − , length ( i.e. Nambu-Goto) H − , number operator H and momentum H H − = τ H − = L H = κ ∼ ¯ ψψ H = iκ τ ∼ ¯ ψψ s (3.27)The energy (3.24) is a combination of H − , H and H . From the perspective of theNLS hierarchy, the momentum H should also be included for completeness. If higherorder corrections are desired the natural candidate is the mKdV density H = iκκ sss + 2 κκ ss τ + κ τ − e κ s τ + 34 λκ τ ∼ i ¯ ψψ sss + i λ ψψ ( ¯ ψ s ψ + 4 ¯ ψψ s )with appears as the next level conserved density in the NLS hierarchy.We note that the Heisenberg spin chain is obtained from H using the Frenetequation. L (cid:90) ds H = L (cid:90) ds κ = L (cid:90) ds | t s | Furthermore, the combination of H − and H yields Polyakov’s rigid string action [63].In the context of polymers, it relates to the Kratky-Porod model [64]. Solitons are the paradigm structural self-organisers in Nature. Solitons become mate-rialised in diverse scenarios [61, 62, 65, 66]; we have already seen that solitons conductelectricity in organic polymers. But solitons can also transmit data in transoceaniccables, and solitons can transport chemical energy along proteins. Solitons explainthe Meißner effect in superconductivity and solitons model dislocations in liquid crys-tals. Solitons are used to describe hadronic particles, cosmic strings and magneticmonopoles in high energy physics.We argue that solitons describe even life. We argue that each of us has some 10 solitons in our body. These solitons are the building blocks of folded proteins, they arethe ingredients of all metabolic processes that make us alive. trings from solitons The NLS equation that we obtain from (3.26), is the paradigm equation thatsupports solitons [61, 62, 66, 65]. Depending on the sign of λ , the soliton is either dark( λ >
0) or bright ( λ < H = ∞ (cid:90) −∞ ds (cid:8) κ s + λ ( κ − m ) (cid:9) (3.28)reproduces our previous instanton equation (1.23) with Q = √
2; the Hamiltonian(3.28) supports the double well soliton a.k.a. the paradigm topological soliton: When m is positive and when κ can take both positive and negative values, the equation ofmotion κ ss = 2 λκ ( κ − m )is solved by - see (1.25) κ ( s ) = m tanh (cid:104) m √ λ ( s − s ) (cid:105) (3.29)We have concluded that the energy function E = (cid:90) ds (cid:26) κ s + λ ( κ − m ) + d κ τ − bκ τ − aτ + c τ (cid:27) (3.30)is the most general one, with a solid geometric foundation; it is a linear combination of all the densities (3.26), (3.27). Note that in (3.30) we have also included the Proca mass;this is the last term. Even though the Proca mass does not appear in the integrableNLS hierarchy, it does have a claim to be gauge invariant [67,68]. Eventually, we shallpresent a topological argument why the Proca mass should be included.The energy (3.30) is quadratic in the torsion. Thus we can eliminate τ using itsequation of motion, δ E δτ = dκ τ − bκ − a + cτ = 0 (3.31)This gives τ [ κ ] = a + bκ c + dκ ≡ ac b/a ) κ d/c ) κ (3.32)and we obtain the following effective energy for the curvature, E κ = (cid:90) ds (cid:26) κ s + V [ κ ] (cid:27) (3.33)with the equation of motion δ E κ δκ = − κ ss + V κ = 0 (3.34)where V [ κ ] = − (cid:18) bc − add (cid:19) c + dκ − (cid:18) b + 8 λm b (cid:19) κ + λ κ (3.35)This is a deformation of the potential in (3.28), the two share the same large- κ asymp-totics. When we select the parameters properly, we can expect that (3.31), (3.34), Strings in three space dimensions (3.35) continue to support topological solitons. But we do not know their explicit pro-file, in terms of elementary functions. In the sequel we shall construct these solitonsnumerically.Once we have constructed the soliton of (3.34), we evaluate τ ( s ) from (3.32). Wesubstitute these profiles in the Frenet equation (3.13) and solve for t ( s ). We thenintegrate (3.9) to obtain the string x ( s ) that corresponds to the soliton. When the curvature of a string vanishes, there is an anomaly in the Frenet framing.It turns out that the origin of this anomaly is also the raison d’edre for a topologicalsoliton to reside on a string.Until now, we have assumed (3.12) so that the curvature (3.14) does not vanish. Butwe have also observed that in the case of the Abelian Higgs model it is natural for thedensity ρ to vanish, on the world-sheet of a vortex core. Thus, in the context of AHMthe vanishing of ρ relates to important, physically significant effects. Furthermore,the explicit soliton profile (3.29) displays both positive and negative values, and inparticular (3.29) vanishes when s = s . Consequently we should consider the possibilitythat κ may vanish, and even become negative, along a string.We start by extending the curvature (3.14) so that it has both positive and negativevalues. According to (3.20) the negative values of κ are related to the positive ones bya η = ± π frame rotation, κ η = ± π −→ e ± iπ κ = − κ (3.36)We compensate for an extension of (3.14) to negative values, by introducing the dis-crete Z symmetry [60] κ ↔ − κη ↔ η ± π ⇔ κe iη ↔ κe iη (3.37)An (isolated) point where κ ( s ) vanishes is called an inflection point. Figure 3.2shows an example of an inflection point. As shown in this Figure, in the limit ofa plane curve we obtain a discontinuity in the Frenet frames, when the string goesthrough the inflection point: The zweibein ( n , b ) becomes reflected according to( n + i b ) −→ − ( n + i b ) = e ± iπ ( n + i b ) (3.38)when we have a simple inflection point along the string. At the inflection point itself,the Frenet frames can not defined. Thus we can not unambiguously deduce whetherwe have a jump by η = + π or by η = − π at the inflection point. We can not concludewhether the Frenet frame vectors ( n , b ) become rotated clockwise or counterclockwiseby an angle π along the tangent vector. There is a Z anomaly in the definition ofFrenet framing, due to inflection points.To analyse the anomaly, consider a string x ( s ) that has a simple inflection pointwhen s = s . Thus κ ( s ) = 0 but κ s ( s ) (cid:54) = 0, as shown in Figure 3.2. To simplify thenotation we re-define the parameter s so that the inflection point occurs at s = 0.We can always remove the inflection point by a generic deformation of the string: Adeformation which is restricted to the plane as in Figure 3.2 only slides the inflection nomaly in the Frenet frames Fig. 3.2
A string with inflection point (ball). At each point the Frenet frame normal vectorpoints towards the center of the osculating circle. At the inflection point we have a disconti-nuity in the direction of the normal vectors: The radius of the osculating circle diverges andthe normal vector are reflected in the osculating plane, from one side to the other side of thestring. point along the string without removing it. In order to remove the inflection point weneed to deform the string off its instantaneous tangent plane: The co-dimension of theinflection point in R is two, the inflection point is not generic along a space string.Consider two different generic deformations, x ( s ) → x ( s ) + δ x , ( s ) = x , ( s ) (3.39)In the case shown in Figure 3.2, these two deformations amount to moving the stringeither slightly up from the plane, or slightly down from the plane, around the inflectionpoint. We assume that the deformations are very small, and compactly supported sothat δ x , ( ± ε ± ) = 0Here ε ± > x , ( s ) bifurcate.Imagine now a closed string denoted γ , that starts from x ( − ε − ), follows along x to x (+ ε + ) and then returns along x back to x ( − ε − ). Introduce the Frenet frame normalvectors of γ ; by shifting γ slightly into the direction of its Frenet frame normals, weobtain a second closed string which we call ˜ γ . Let t and ˜ t be the corresponding tangentvectors. The Gauß linking number of γ and ˜ γ is Lk = 14 π (cid:73) γ (cid:73) ˜ γ dsd ˜ s x − ˜ x | x − ˜ x | · ( t × ˜ t ) (3.40)Proceeding along x , ( s ) the respective Frenet frames are continuously rotated by η , ≈ ± π ; in the limit where δ x , → Strings in three space dimensions linking number has values Lk = ± η , change in the same direction; weremind that γ proceeds ”backwards” along x . But if the framing along x ( s ) and x ( s ) rotate in the opposite directions, we have Lk = 0.Accordingly, the relative sign of η , depends on the way how the inflection pointis circumvented: We have a frame anomaly in the Frenet framing as δ x , →
0, thevalue of Lk depends on the way how we define δ x , ( s ).Finally, we comment on the Bishop framing (3.22) which from the point of view ofthe AHM is like the unitary gauge. Consider our initially planar string and introducethe local deformation (3.39). Then introduce any framing that is capable of rotatingaround the tangent vector t ( s ) when we move along the string e.g. the Frenet framing.Then rotate the string any integer number of times within the locally deformed regionso that the framing becomes self-linked and the Gauß linking number (3.40) is non-vanishing. In the limit where the support of the local deformation vanishes the Bishopframing is not capable of observing any residual effect. Indeed, it is well known that inthe context of the AHM model, unitary gauge can not detect the presence of topologicaldefects such as vortices. Research project: Analyse in detail the framing of a string in the presence of an inflectionpoint, using the Bishop’s frames (3.22).
An inflection point together with the corresponding Frenet frame anomaly can be givenan interpretation in terms of a string specific bifurcation, which is called inflection pointperestroika [69,70,71,72,73]. It explains why a uniquely defined Frenet framing acrossthe inflection point, or any other framing that rotates around the tangent vector, isnot possible:Consider a segment of a string, along which the torsion τ ( s ) vanishes. Accordinglythe string segment is constrained on a plane, as in figure 3.2. When a string is con-strained on a plane, a simple isolated inflection point is generic. This follows since fora string on plane the inflection point has co-dimension one. Moreover, in the case of astring on plane a single simple inflection point is a topological invariant. It can onlybe moved around the plane but not made to disappear; unless it escapes the planewhich we assume is not the case. If we have two simple inflection points along a stringon plane, we can bring them together to deform the string so that no inflection pointremains. Thus the inflection point is a mod (2) topological invariant of a string onplane.Consider now a generic string in R ; a generic string is not constrained on a plane.The co-dimension of a single simple inflection point is two, thus a generic string doesnot have any inflection points. But along a string which moves freely in R , an isolatedsimple inflection point appears generically, at some point, at some moment, during themotion. When this infection point perestroika takes place along the moving string, itleaves a trail behind: The inflection point perestroika changes the number of flatteningpoints , which are points along the string where the torsion τ ( s ) vanishes [72, 73].At a simple flattening point the curvature κ ( s ) is generically non-vanishing, butthe torsion τ ( s ) changes its sign. Accordingly the inflection point perestroika can erestroika only change the number of simple flattening points by two. Apparently, it alwaysdoes [72, 73].Unlike the inflection point, a flattening point where τ ( s ) = 0 is generic alonga static space string. Furthermore, unlike a simple inflection point, a single simpleflattening point that occurs in a one parameter family of strings in R is a topologicalinvariant. It can not disappear on its own, under local deformations that leave theends of the string intact. A pair of flattening points can be combined together, intoa single bi-flattening point, which can then dissolve. When this happens, a secondstring-specific bifurcation which is called bi-flattening perestroika takes place.Apparently, inflection point perestroika and bi-flattening perestroika are the onlytwo bifurcations where the number of flattening points can change [73]. The number ofsimple flattening points is a local invariant of the string. Besides the flattening number,and the self-linking number in case of a framed string, a generic smooth string does notpossess any other essential local invariants [72]. The two are also mutually independent,even though they often appear together.For example, one can deduce that the self-linking number of a string increases byone if the torsion is positive when the string approaches its simple inflection point,and if two simple flattening points disappear after the passage of the inflection point.Moreover, if the torsion is negative, the self-linking number decreases by one whentwo flattening points disappear after the passage [72]. But when two simple flatteningpoints dissolve in a bi-flattening perestroika, the self-linking number in general doesnot change.A bifurcation is the paradigm cause for structural transitions, including phasetransitions, in any dynamical system. Inflection point and bi-flattening perestroika’sare the only bifurcations that are string specific. Accordingly these two perestroika’smust have a profound rˆole in determining the physical behaviour and phase structureof string-like configurations. In particular, these perestroika’s must be responsible forany string-specific structural re-organisation that takes place when the value of thecompactness index ν in (1.8) changes. Since perestroika’s relate to the creation anddisappearance of topological solitons such as (3.29) along a string, it is clear that per-estroika’s and topological solitons, with the ensuing physical effects, are commonplacewhenever we have a string with an energy function of the form (3.30). A good example of the interplay between inflection points and flattening points isgiven by (3.29) or more generally by a soliton solution of (3.34), and (3.32). For aregular string, the denominator of (3.32) should not vanish. Thus, in the case of aninflection point the ration ( d/c ) should be positive. When ( b/a ) is negative, we havea symmetric pair of inflection points around the inflection point. Thus starting froma one-parameter family of strings κ ( s, v ) with v the parameter, if initially κ ( s, v ) issufficient large and e.g. positive, and we are not close to an inflection point, there areno flattening points either. When v is varied so that the inflection point is approached,a pair of flattening points emerges and remains whenever the curvature has the profile(3.29). In particular, we conclude that it is important to retain the Proca mass term,even a very small one, as a regulator. Discrete Frenet Frames
Proteins are not like continuous differentiable strings. Proteins are like piecewise linearpolygonal strings. Thus, to understand the physical properties of proteins we need todevelop the formalism of discrete strings. Accordingly, we proceed to generalise theFrenet frame formalism to the case of polygonal strings that are piecewise linear [60].Let r i with i = 1 , ..., N be the vertices of a piecewise linear discrete string. At eachvertex we introduce the unit tangent vector t i = r i +1 − r i | r i +1 − r i | (4.1)the unit binormal vector b i = t i − − t i | t i − − t i | (4.2)and the unit normal vector n i = b i × t i (4.3)The orthonormal triplet ( n i , b i , t i ) defines a discrete version of the Frenet frames(3.9)-(3.11) at each position r i along the string, as shown in Figure (4.1) In lieu of Fig. 4.1
Discrete Frenet frames along a piecewise linear discrete string. the curvature and torsion, we have the bond angles and torsion angles, defined as inFigure 4.2. iscrete Frenet Frames Fig. 4.2
Definition of bond ( κ i ) and torsion ( τ i ) angles, along the piecewise linear discretestring. When we know the Frenet frames at each vertex, we can compute the values ofthese angles: The bond angles are κ i ≡ κ i +1 ,i = arccos ( t i +1 · t i ) (4.4)and the torsion angles are τ i ≡ τ i +1 ,i = sign { b i − × b i · t i } · arccos ( b i +1 · b i ) (4.5)Conversely, when the values of the bond and torsion angles are all known, we canuse the discrete Frenet equation n i +1 b i +1 t i +1 = cos κ cos τ cos κ sin τ − sin κ − sin τ cos τ κ cos τ sin κ sin τ cos κ i +1 ,i n i b i t i (4.6)to compute the frame at position i + i from the frame at position i . Once all the frameshave been constructed, the entire string is given by r k = k − (cid:88) i =0 | r i +1 − r i | · t i (4.7)Without any loss of generality we may choose r = 0, make t to point into thedirection of the positive z -axis, and let t lie on the y - z plane.The vectors n i and b i do not appear in (4.7). Thus, as in the case of continuumstrings, a discrete string remains intact under frame rotations of the ( n i , b i ) zweibeinaround t i . This local SO(2) rotation acts on the frames as follows nbt i → e ∆ i T nbt i = cos ∆ i sin ∆ i − sin ∆ i cos ∆ i
00 0 1 nbt i (4.8) Discrete Frenet Frames
Here ∆ i is the rotation angle at vertex i and T is one of the SO(3) generators T = −
10 1 0 T = − T = − that satisfy the Lie algebra [ T a , T b ] = (cid:15) abc T c Using these matrices we can write the effect of frame rotation on the bond and torsionangles as follows κ i T → e ∆ i T ( κ i T ) e − ∆ i T (4.9) τ i → τ i + ∆ i − − ∆ i (4.10)From the point of view of lattice gauge theories, the transformation of bond anglesis like an adjoint SO(2) ∈ SO(3) gauge rotation of a Higgs triplet around the Cartangenerator T , when the Higgs triplet is in the direction of T . The transformation oftorsion angle coincides with that of the SO(2) lattice gauge field. Since the t i remainintact under (4.8), the gauge transformation of ( κ i , τ i ) has no effect on the geometryof the discrete string. A priori , the fundamental range of the bond angle is κ i ∈ [0 , π ] while for the torsionangle the range is τ i ∈ [ − π, π ). Thus we identify ( κ i , τ i ) as the canonical latitude andlongitude angles of a two-sphere S . In parallel with the continuum case we find ituseful to extend the range of κ i into negative values κ i ∈ [ − π, π ] mod (2 π ). As in(3.36) we compensate for this two-fold covering of S by a Z symmetry which nowtakes the following form: κ k → − κ k for all k ≥ iτ i → τ i − π (4.11)This is a special case of (4.9), (4.10), with∆ k = π for k ≥ i + 1∆ k = 0 for k < i + 1 α trace reconstrucion We have already concluded that the Ramachandran angles are not sufficient for recon-structing the protein backbones: As shown in Figure (1.11) the reconstructed back-bones are not in the same universality class with folded proteins. The value of thecompactness index ν is different. For a correct reconstruction, we need to utilise allthe bond and torsion angles that we have defined in Figure 1.7. Only for the bondlengths can the average values be used.We now consider the protein backbone reconstruction, in terms of virtual C α back-bone. We identify the vertices in Figure 4.2 with the C α atoms, so that ( κ i , τ i ) are the niversal discretised energy virtual C α backbone bond and torsion angles. For the virtual C α -C α bond lengths weuse the average PDB value | r i +1 − r i | ∼ . unlike in the case of the Ramachandran angles, the ensuing recon-structed C α backbones reproduce the original crystallographic structures, with a veryhigh precision; the difference is mostly within the range of experimental errors, as mea-sured by the B-factor (1.5). In Figure 4.3 we compare the radius of gyration values
10 10 number of residues N g y r a t i o n r a d i u s R g ( ˚ A ) Fig. 4.3
Comparison of the C α -C α radius of gyration data between the original PDB struc-tures and those reconstructed in terms of variable virtual bond and torsion angles in com-bination with optimal C α -C α virtual bond lengths. The (blue) crosses are the original PDBstructures, and the (red) circles are the reconstructed ones. The line shows the fits of theradius of gyration. We have no visual difference, between the two cases. in our ultra-high resolution protein structures, for the original PDB structures andthose that have been reconstructed using the virtual C α backbone bond and torsionangles when we use the constant virtual bond length value (4.12). Unlike in the Figure(1.11), now we observe no visual difference. For the reconstructed data, we obtain therelation [26] R g ≈ . N . ˚AThis is remarkably close to (1.11), the difference is immaterial. Thus we conclude thatin the case of crystallographic protein structures the virtual C α trace bond and torsionangles ( κ i , τ i ) form a complete set of geometrical local order parameters. The goal is to describe the structure and dynamics of proteins beyond the limitationsof an expansion in a small coupling like (1.14). For this we propose to start with anenergy function where the virtual C α backbone bond and torsion angles appear as Discrete Frenet Frames local order parameters; recall that these variables form a complete set of local orderparameters, for backbone reconstruction.Let F be the thermodynamical Helmholtz free energy of a protein. Its minimumenergy configuration describes a folded protein, under thermodynamical equilibriumconditions. The free energy is the sum of the internal energy U and the entropy S , attemperature T F = U − T S (4.13)It is a function of all the inter-atomic distances F = F ( r αβ ) ; r αβ = | r α − r β | (4.14)where the indices α, β, ... extend over all the atoms in the protein system, includingthose of the solvent environment.We assume that the characteristic length scales over which spatial deformationsalong the protein backbone around its thermal equilibrium configuration take place,are large in comparison to the covalent bond lengths; there are no abrupt wrenchesand buckles, only gradual bends and twists. We also assume that the C α virtual bondlength oscillations have a characteristic time scale which is short in comparison to thetime scale we consider. The mean value (4.12) can then adopted as the universal valueof the virtual C α bond length, at least in a time averaged sense. The completenessof the C α bond and torsion angles proposes that we consider the response of theinteratomic distances to variations in these angles, r αβ = r αβ ( κ i , τ i )Suppose that at a local extremum of the free energy, the C α bond and torsionangles have the values ( κ i , τ i ) = ( κ i , τ i )Consider a conformation where the ( κ i , τ i ) deviate from these extremum values. Thedeviations are ∆ κ i = κ i − κ i ∆ τ i = τ i − τ i (4.15)Taylor expand the infrared limit Helmholtz free energy (4.13) around the extremum, F [ r αβ = r αβ ( κ i , τ i )] ≡ F ( κ i , τ i ) = F ( κ i , τ i ) + (cid:88) k { ∂F∂κ k | ∆ κ k + ∂F∂τ k | ∆ τ k } + (cid:88) k,l { ∂ F∂κ k ∂κ l | ∆ κ k ∆ κ l + ∂ F∂κ k τ l | ∆ κ k ∆ τ l + 12 ∂ F∂τ k ∂τ l | ∆ τ k ∆ τ l } + O (∆ )The first term in the expansion evaluates the free energy at the extremum. Since( κ i , τ i ) correspond to the extremum, the second term vanishes and we are left withthe following expansion of the averaged free energy, F ( κ i , τ i ) = F ( κ i , τ i ) niversal discretised energy + (cid:88) k,l { ∂ F∂κ k ∂κ l | ∆ κ k ∆ κ l + ∂ F∂κ k τ l | ∆ κ k ∆ τ l + 12 ∂ F∂τ k ∂τ l | ∆ τ k ∆ τ l } + . . . (4.16)In the limit where the characteristic scale of the extent of spatial deformations arounda minimum energy configuration is large in comparison to a covalent bond length, andthe amplitude of these deformations remains small, we may re-arrange the expansion(4.16) in terms of of the differences in the angles as follows: First comes local terms.Then comes terms that connect the nearest-neighbors. Then comes terms that connectthe next-to-nearest neighbours. And so forth ... This re-ordering of the expansionensures that we recover the derivative expansion (3.4) in leading order, when we takethe continuum limit where the virtual bond length vanishes. Moreover, since the freeenergy must remain invariant under the local frame rotations (4.9), (4.10) we conclude[58,74,75,76,77,78,79,80,81] that to the leading order the expansion of the free energy must coincide with a discretisation of the full AHM energy (3.30): F = − N − (cid:88) i =1 κ i +1 κ i + N (cid:88) i =1 (cid:26) κ i + λ ( κ i − m ) + d κ i τ i − b κ i τ i − a τ i + c τ i (cid:27) + . . . (4.17)The corrections include next-to-nearest neighbours couplings and so forth, which arehigher order terms from the point of view of our systematic expansion: The approxima-tion (4.17) is valid, as long as there are no abrupt wrenches and buckles but only longwave length gradual bends and twists along the backbone. In particular, long rangeinteractions are accounted for, as long as they don’t introduce any localised bucklingof the backbone.In (4.17) λ , a , b , c , d and m depend on the atomic level physical properties andthe chemical microstructure of the protein and its environment. In principle, theseparameters can be computed from this knowledge.We note the following: The free energy (4.17) is a deformation of the standardenergy function of the discrete nonlinear Schr¨odinger equation (DNLS) [61, 62]. Thefirst sum together with the three first terms in the second sum is the energy of thestandard DNLS equation, in terms of the discretized Hasimoto variable (3.25). Thefourth term ( b ) is the conserved momentum, and the fifth term ( a ) is the helicity; thesetwo break the Z parity symmetry and are responsible for right-handed helicity of theC α backbone. The last ( c ) term is the Proca mass that we again add, as a ”regulator”.Observe in particular the explicit presence of the non-linear, quartic contribution to the(virtual) bond angle energy. This is the familiar double-well potential, shown in Figure1.12. Its Z symmetry becomes eventually broken. The breaking of this symmetry isessential for the emergence of structure, in the case of proteins. It is the source of Bol’she that makes us alive. We note that this kind of explicit non-linearity is absentin (1.12).
We summarise:
The expression (4.17) of the free energy describes the small am-plitude fluctuations around the local extremum ( κ i , τ i ) in the space of bond andtorsion angles. It can be identified as the long wavelength (infrared) limit of the fullfree energy, in the sense of Kadanoff and Wilson. To the present order of the expansionin powers of ( κ i , τ i ), the functional form (4.17) is the most general long wavelength Discrete Frenet Frames free energy that is compatible with the principle gauge invariance. This fundamen-tal symmetry principle ensures that no physical quantity depends on our choice ofcoordinates (framing) along the backbone.
Research project: Develop a method to compute the parameters in (4.17) from an all-atomenergy function.
The energy (4.17) is a deformation of the integrable energy of the discrete nonlinearSchr¨odinger (DNLS) equation [61, 62, 65]: The first term together with the λ and d dependent terms constitute the (naively) discretized Hamiltonian of the NLS model inthe Hasimoto variable. The conventional DNLS equation is known to support solitons.Thus we can try and find soliton solutions of (4.17).As in (3.32) we first eliminate the torsion angle, τ i [ κ ] = a + bκ i c + dκ i = a bκ i c + dκ i ≡ a ˆ τ i [ κ ] (4.18)For bond angles we then have κ i +1 = 2 κ i − κ i − + dV [ κ ] dκ i κ i ( i = 1 , ..., N ) (4.19)where we set κ = κ N +1 = 0, and V [ κ ] is given by (3.35). This equation is a defor-mation of the conventional DNLS equation, and it is not integrable, a priori . For anumerical solution, we extend (4.19) into the following iterative equation [76] κ ( n +1) i = κ ( n ) i − (cid:15) (cid:110) κ ( n ) i V (cid:48) [ κ ( n ) i ] − ( κ ( n ) i +1 − κ ( n ) i + κ ( n ) i − ) (cid:111) (4.20)Here { κ ( n ) i } i ∈ N denotes the n th iteration of an initial configuration { κ (0) i } i ∈ N and (cid:15) is some sufficiently small but otherwise arbitrary numerical constant; we often choose (cid:15) = 0 .
01 in practical computations. The fixed point of (4.20) is clearly independentof the value chosen. But the radius and rate of numerical convergence in a simulationtowards the fixed point, depends on the value of (cid:15) : The fixed point is clearly a solutionof (4.19).Once the numerically constructed fixed point is available, we calculate the corre-sponding torsion angles from (4.18). Then, we obtain the frames from (4.6) and canproceed to the construction of the discrete string, using (4.7).At the moment we do not know of an analytical expression of the soliton solutionto the equation (4.19). But we have found [75, 77, 79] that an excellent approximativesolution can be obtained by discretizing the topological soliton (3.29). κ i ≈ m · e c ( i − s ) − m · e − c ( i − s ) e c ( i − s ) + e − c ( i − s ) (4.21)Here ( c , c , , m , m , s ) are parameters. The m and m specify the asymptotic κ i -values of the soliton. Thus, these parameters are entirely determined by the character roteins out of thermal equilibrium of the regular, constant bond and torsion angle structures that are adjacent to thesoliton. In particular, these parameters are not specific to the soliton per se , but tothe adjoining regular structures. The parameter s defines the location of the solitonalong the string. This leaves us with only two loop specific parameter, the c and c . These parameters quantify the length of the bond angle profile that describes thesoliton.For the torsion angle, (4.18) involves one parameter ( a ) that we have factored outas the overall relative scale between the bond angle and torsion angle contributions tothe energy; this parameter determines the relative flexibility of the torsion angles, withrespect to the bond angles. Then, there are three additional parameters ( b/a, c/a, d/a )in the remainder ˆ τ [ κ ]. Two of these are again determined by the character of the regularstructures that are adjacent to the soliton. As such, these parameters are not specificto the soliton. The remaining single parameter specifies the size of the regime wherethe torsion angle fluctuates.On the regions adjacent to a soliton, we have constant values of ( κ i , τ i ). In the caseof a protein, these are the regions that correspond to the standard regular secondarystructures: In a rough sense, proteins are made of right-handed α -helices, β -strandsand loops; we recommend the reader familiarises herself with these structures, e.g. using the resources available at (1.1)-(1.3), and in particular at [82, 83]For example, the standard right-handed α -helix is α − helix : (cid:26) κ ≈ π τ ≈ β -strand is β − strand : (cid:26) κ ≈ τ ≈ π (4.23)All the other standard regular secondary structures of proteins such as 3/10 helices,left-handed helices etc. are similarly described by definite constant values of κ i and τ i .Protein loops correspond to regions where the values of ( κ i , τ i ) are variable, proteinloops are the soliton proper: A soliton is a configuration that interpolates between tworegular structures, with predetermined constant values of ( κ i , τ i ). Like in Figure (1.12). When a protein folds towards its native state, it is out of thermal equilibrium. Severalstudies propose, that in the case of a small protein the folding takes place in a mannerwhich is consistent with Arrhenius’ law; we recall that Arrhenius’ law states thatthe reaction rate depends exponentially on the ratio of activation energy E A andtemperature, r ∝ exp {− E A k B T } On the other hand, in the case of simple spin chains it has been found that the
Glauberdynamics [84,85] describes the approach to thermal equilibrium, following Arrhenius’slaw. Since we have argued that proteins can be viewed as spin chains, with residues Discrete Frenet Frames corresponding to the spin variables, it is natural to try and model the way how a proteinfolds towards its native state in terms of Glauber dynamics. According to Glauber, anon-equilibrium system approaches thermal equilibrium in a manner which is dictatedby a Markovian time evolution, defined by the probability distribution [84, 85] P = x x with x = exp {− ∆ Ek T } (4.24)Here ∆ E is the activation energy, which in a Monte Carlo (MC) simulation is theenergy difference between consecutive MC time steps, that we compute from (4.17)in the case of a protein. In addition, in the case of a protein we need to account forsteric constraints: Analysis of PDB structures reveals, that the distance between twoC α atoms which are not nearest neighbours along the backbone, is always larger than(4.12) | r i − r k | > . | i − k | ≥ k T which appears in (4.24) as a temperature factor,should not be directly identified with the Bolzmannian temperature factor k B T . Thescale of units depends on the overall scale of the energy function (4.17), and in partic-ular by our choice of the normalisation factor in the first, nearest neighbor interactionterm. To determine the unit, we need a renormalisation condition. For this we needto perform a proper experimental measurement(s), and compare the predictions ofour model to those of the protein that it describes, at that temperature. One suit-able renormalisation point could be, to try and identify the experimentally measured θ -transition temperature by comparison with the properties of our model. This section is somewhat technical. The details are not needed in the rest of the lectures. Weinclude this section because we feel that good understanding of temperature renormalisationof parameters is relevant to physics of proteins [81]. For example, thus far this has not beenreally addressed in any other approach we are aware of. You might find the subject describedhere to be an inspiration for your future research; the exposition is preliminary and indicative.
In the probability distribution (4.24) the nearest-neighbor coupling contribution in(4.17) becomes normalised as follows, − k T N (cid:88) i =1 κ i +1 κ i (4.26)Thus the temperature factor k T depends on the physical temperature factor k B T ina non-trivial fashion. That is, we should really write2 k T → J ( T ) k B T (4.27) emperature renormalisation where J ( T ) is the strength of the nearest neighbour coupling at Bolzmannian k B T .Its numerical value depends on the temperature in a manner that is governed by thestandard renormalisation group equation T dJdT = β J ( J ; λ, m, a, b, c, d ) ∼ β J ( J ) + . . . (4.28)For simplicity we may assume that to leading order the dependence of β J on the othercouplings can be ignored. Note that the parameters and thus their β -functions, dependon the properties of the environmental factors such as properties of solvent, the pH ofsolvent, pressure etc .In the low temperature limit we can expand the nearest neighbour coupling asfollows, J ( T ) ≈ J − J T α + . . . as T → J is non-vanishing, and the critical exponent α controls the lowtemperature behavior of J ( T ). The asymptotic expansion (4.29) corresponds to a β -function (4.28) that in the T → β J ( J ) = α ( J − J ) + . . . Consequently, at low temperatures we have k T ≈ k B J T (4.30)In terms of the temperature factor, (4.28) translates into T ddT (cid:18) k T (cid:19) = − k T + 12 k B T β J (cid:18) k B Tk T (cid:19) (4.31)We try to find an approximative solution in the collapsed phase, when the temper-ature T is below the critical θ -point temperature T θ where the transition betweenthe collapsed phase and the random walk phase takes place. This coincides with thephysical temperature value that corresponds to the unfolding transition temperaturefactor value k T Θ . Let β J (cid:18) k B Tk T (cid:19) = 2 k B Tk T + F (cid:18) k B Tk T (cid:19) and define y = 1 k T x = 12 k B T The equation (4.31) then translates into dydx = − F ( yx ) Discrete Frenet Frames with the solution ln( c x ) = − (cid:90) yx duF ( u ) + u Here c is an integration constant. We shall assume that the leading non-linear correc-tions are logarithmic; this is often the case. To the leading order we then have F ( u ) = ( η − u + α u ln u + . . . Note, that in general there are higher order corrections. When we re-introduce theoriginal variables and set η = − α ln J we get for the temperature factor k T ≈ J k B T exp { J J T α } (4.32) ≈ J k B tT + 2 J J k B T α +1 + ... as T → α > J ( T ) ≈ J exp {− J J T α } Thus the coupling between bond angles becomes weak, at an exponential rate, whenthe temperature approaches the transition temperature T θ between the collapsed phaseand the random walk phase.Similarly, all the other couplings that are present in (4.17) are temperature de-pendent. Each of them has its own renormalisation group equation. For example, thequartic κ i self-coupling λ in (4.17) flows according to a renormalization group equationof the form T dλdT = β λ ( λ ) (4.33)For simplicity, we again assume that to the leading order the β λ depends only on λ .It is natural to interpret λ as a measure of the strength of hydrogen bonds: Struc-tures such as α -helices and β strands become stable, due to hydrogen bonds. At thesame time, the value of λ controls the affinity of κ i towards the ground state value ofthe quartic potential in (4.17). The hydrogen bonds are presumed to become vanish-ingly weak, when the protein unfolds. This can take place when the protein approachesthe transition temperature T θ which separates the collapsed phase from the randomwalk phases. This proposes that asymptotically, λ ( T ) → λ θ | T − T θ | γ λ as T → T θ from belowHere γ λ is a critical exponent that controls the way how the strength of (effective)hydrogen bonds vanish. More generally, we may send T θ → T H , which is the temper-ature value where hydrogen bonds disappear even between the solvent molecules; in emperature renormalisation the case of water at atmospheric conditions T H ≈ o C. In general, we expect thatthe value of T H is higher than that of T θ .Above T > T θ , when the hydrogen bonds become vanishingly weak, we expectthat effectively λ ≈ m ≈
0) in (4.17). On the other hand, we expect that asthe temperature decreases the value of λ ( T ) increases, so that in the low temperaturelimit we have λ ( T ) → λ − λ T γ + . . . as T → β λ ( λ ) ≈ γ ( λ − λ ) + O [( λ − λ ) ]Here λ should be close to the value we obtain from PDB, when we compute theparameters in (4.17) from the crystallographic low temperature structure. Research project: Try to numerically evaluate the β -function (4.28) in the case of a simpleprotein, such as villin headpiece (PDB code 1YRF), using results from detailed experimentalmeasurements.Research project: Find out how the parameters in (1.12), (1.13) depend on temperature. Solitons and ordered proteins
Various taxonomy schemes such as CATH and SCOP [82,83] have revealed that foldedproteins are build in a modular fashion, from a relatively small number of buildingblocks. Despite essentially exponential increase in the number of new crystallographicprotein structures, novel fold topologies are now rarely found and some authors thinkthat most modular building blocks of proteins are already known [30, 86]. This con-vergence in protein architecture demonstrates that protein folding should be a processwhich is driven by some kind of universal structural self-organisation principle.We know that solitons are the paradigm structural self-organisers. Thus we arguethat the soliton solution of the DNLS equation (4.19), (4.18) must be the universalmodular building block from which folded proteins can be constructed. Indeed, weknow that the energy function (4.17) is unique, in the limit where it becomes appli-cable. Moreover, it has already been shown that over 92% of all C α -traces of PDBproteins can be described in terms of no more than 200 different parametrizations ofthe discretised NLS soliton (4.21), with a root-mean-square-distance (RMSD) precisionwhich is better than 0.5 ˚A [78].Accordingly, we set up to describe the modular building blocks of proteins in termsof various parametrizations of the DNLS soliton profile, which is described by theequations (4.20), (4.18), (4.6) and (4.7). λ -repressor as a multi-soliton In order to identify the soliton structure of a given protein, we start by computingthe C α virtual backbone bond and torsion angles from the PDB data. We initially fixthe Z gauge in (4.11) so that all the bond angles take positive values κ i ∈ [0 , π ]. Ageneric protein profile consists of a set of κ i with values that are typically between κ i ≈ κ i ≈ π/
2; the upper bound can be estimated using steric constraints. Thetorsion angle values τ i are commonly much more unsettled, and their values extendmore widely over the entire range τ ∈ [ − π, + π ].As an example we consider the λ -repressor which is a protein that controls thelysogenic-to-lytic transition in bacteriophage λ infected E. coli cell. The transition be-tween the lysogenic and lytic phases in bacteriophage λ infected E. coli is the paradigmexample of genetic switch mechanism, which has been described in numerous molec-ular biology textbooks and review articles. See for example [87, 88]. The interplaybetween the lysogeny maintaining λ -repressor protein and the regulator protein thatcontrols the transition to the lytic state is a simple model for more complex regulatorynetworks, including those that can lead to cancer in humans. -repressor as a multi-soliton The λ -repressor structure we consider has PDB code 1LMB. It is a homo-dimerwith 92 residues in each of the two monomers. It maintains the lysogenic state bybinding to DNA with a helix-turn-helix motif which is located between the residuesites 33-51. The λ -repressor is a fast-folding protein: In [38] an 80-residue long mutantof the λ -repressor was studied in all-atom simulation.In figure 5.1 (left column) we show the ( κ i , τ i ) spectrum of 1LMB, with the con-vention ( i.e. Z gauge fixing) that κ i is positive. We display the segments betweenresidues 19-82. We note that this spectrum is fairly typical, for the PDB structureswe have analysed. Index
20 22 24 26 28 30 32 34 36 A n g l e s ( r ad i a n s ) -3-2-10123 θ Index
20 22 24 26 28 30 32 34 36 A n g l e s ( r ad i a n s ) -3-2-10123 θ Index
36 38 40 42 44 46 A n g l e s ( r ad i a n s ) -3-2-10123 Index
36 38 40 42 44 46 A n g l e s ( r ad i a n s ) -3-2-10123 Index
45 50 55 60 65 A n g l e s ( r ad i a n s ) -3-2-10123 Index
45 50 55 60 65 A n g l e s ( r ad i a n s ) -3-2-10123 Index
66 68 70 72 74 76 78 80 A n g l e s ( r ad i a n s ) -3-2-10123 Index
66 68 70 72 74 76 78 80 A n g l e s ( r ad i a n s ) -3-2-10123 Le
Fig. 5.1
The bond angle ( κ ) and torsion angle ( τ ) spectrum of λ -repressor 1LMB, withindexing that follows PDB. The left column shows the spectrum in the Z gauge where all κ i > Z gaugetransformations that identify the solitons. The arguments in Section 3.8 suggest that in analysing the PDB data, one shouldinitially focus to flattening points i.e. points where τ i changes its sign. The flatteningpoints should be located near a putative inflection point where a soliton is situatedand perestroika takes place. Accordingly, we perform in the spectrum of Figure 5.1 leftcolumn the Z gauge transformations (4.11) in the vicinity of the apparent flatteningpoints, to identify the putative multi-soliton profile of κ i . We observe four regions with Solitons and ordered proteins an irregular τ i profile; these are shown in the Figures. By a judicious choice of Z gaugetransformations in these regions, we identify seven different soliton profiles in κ i . Theseprofiles are shown in the right column of Figure 5.1. Each of them is remarkably similarto (4.21), like a soliton that interpolates between two regular structures such as (4.22),(4.23); see also Figure 1.12. Moreover, these soliton profiles are clearly accompaniedby putative flattening points, as expected from the structure of (4.18); Note that τ i is multivalued, mod (2 π ). Thus the large fluctuations in the values of τ i are deceitful.Once we account for the multivaluedness, the profile of τ i is actually quite regular.The flexible multi-valuedness of τ i is in full accordance with the observed, much higherflexibility of the torsion angles in relation to the bond angles, which is known to occurin proteins.On the basis of the general considerations in Section 3.8, we argue that at least inthe case of 1LMB the process of folding from a regular unfolded configuration withno solitons to the biologically active natively folded configuration with its solitons isdriven by inflection and flattening point perestroika’s. The initial configuration withno solitons can be chosen to coincide with the minimum of the second sum in (4.17).It could also be e.g. a uniform right-handed α -helix (4.22), or β -strand (4.23), or apolyproline-II type conformation. When the protein folds it proceeds from this initialconfiguration towards the final configuration, thru successive perestroika’s i.e. bifurca-tions. These perestroika’s deform the C α backbone, creating DNLS-like solitons alongit and thus causing the backbone to enter the space-filling ν ∼ / α backbone structure as a soliton solution,with a prescribed precision.We have developed a program GaugeIT that implements the Z gauge transfor-mations to identify the background, and we have developed a program PropoUI thattrains the energy so that it has an extremum which models the background in termsof solitons. These programs are described at http : // . folding − protein . org (5.1)In the case of a protein for which the PDB structure is determined with an ultra-highresolution, typically below 1.0 ˚Angstr¨om, PropoUI routinely constructs a solution thatdescribes a soliton along the C α backbone with a precision which is comparable to theaccuracy of the experimentally measured crystallographic structure; recall that theaccuracy of the experimental PDB structure is estimated by the B-factors using theDebye-Waller relation (1.5).In figure 5.2 we compare the distance between the C α backbone, and the sevenindividual soliton solutions for 1LMB. The B-factor fluctuation distance in the exper-imental structure 1LMB, evacuated from the Debye-Waller relation, is also displayed.As shown in the Figure, the solitons describe the loops with a precision that is fullycomparable with the experimental uncertainties. The grey zone around the solitonprofile denotes our best estimate for the extent of quantum mechanical zero-point tructure of myoglobin Site Index
10 15 20 25 30 A n g s t r o m s y q Site Index
26 28 30 32 34 36 38 A n g s t r o m s y q Site Index
36 38 40 42 44 46 A n g s t r o m s y q Site Index
46 48 50 52 54 56 58 A n g s t r o m s y q Site Index
56 57 58 59 60 61 62 63 64 65 A n g s t r o m s y q Site Index
62 64 66 68 70 72 74 A n g s t r o m s y q Site Index
74 76 78 80 82 84 86 A n g s t r o m s y q Fig. 5.2
The distance between the PDB backbone of the first 1LMB chain and its seven soli-tons. The black line denotes the distance between the PDB structure and the correspondingsoliton. The grey area around the black line describes the lower bound estimate of 15 picometer (quantum mechanical) zero point fluctuation distance around each soliton, obtainedfrom Figure (1.4). The red line denotes the Debye-Waller fluctuation distance (1.5). fluctuations; according to Figure (1.4) there are practically no Debye-Waller valuesless than 15 pico meters, which we have chosen here as the zero point fluctuationdistance, in the Figure.The next step is to combine the individual solitons into a single multi-solitonsolution of the pertinent energy function (4.17). This can be performed using theprogram
ProproUI . But instead of proceeding with 1LMB, we prefer to consider asomewhat more complex example, the myoglobin where techniques such as all-atomMD have thus far failed to produce results.
Myoglobin [9] is the primary oxygen-carrying protein in the muscle cells of mammals.Myoglobin is closely related to haemoglobin, which is the oxygen binding protein inblood. Myoglobin gives meat its red color; the more red, the more myoglobin. Myo-globin also allows organisms to hold their breath, for a period of time. Diving mammalssuch as whales and seals have a high myoglobin concentration in their muscles.Myoglobin is the first protein to have its three-dimensional structure determined byx-ray crystallography. Subsequently myoglobin has remained among the most actively Solitons and ordered proteins studied proteins. But theoretically, the investigation of myoglobin e.g. in an all-atommolecular dynamics simulation remains a formidable task: The experimentally mea-sured folding time from a random chain to the natively folded state is around 2.5seconds [89]. At the same time, the fastest ever built special purpose MD supercom-puter
Anton can produce at most a few microseconds of in vitro folding trajectory perday in silico , in the case of proteins that are much shorter and simpler than myoglobin.Accordingly it would take something like a million or so days to reproduce a single myoglobin folding trajectory in silico , at all-atom level, even with
Anton.
A good con-vergence of Newton’s iteration with the energy function (1.12), (1.13) over such a longtime scale would be truly amazing.We have already noted that all-atom MD simulation is conceptually a weak cou-pling expansion. As such, it is appropriate for describing phenomena over very shorttime periods only: The time ratio (1.14) is the small, iterative expansion parameter. Inthe case of long time trajectories, such a weak coupling expansion can not be expectedto be very effective, not even convergent. Alternative approximative methods need tobe introduced, to model myoglobin and the large majority of proteins that fold muchslower than the microsecond-to-millisecond scale.From the perspective of quantum field theory this means that we need a non-perturbative approach. For example, in QCD we do not expect that standard pertur-bation theory is capable of describing hadrons. On the other hand, lattice QCD isdesigned for modelling hadrons. But it can hardly describe the scattering of quarksand gluons.Indeed, we have argued that in the case of proteins, the energy function (4.17) is anexample of a non-perturbative approach. It avoids altogether the need to introduce aweak coupling expansion parameter, such as the time step ratio (1.14). Instead the C α geometry is modelled in terms of small variations in the angular variables, around theirequilibrium conformations. Since the approximation does not engage time directly, itbecomes in principle possible to describe folding phenomena that can be very difficult,even impossible, to model in terms of conventional and presently available all-atomMD.We proceed to apply the energy function (4.17) to investigate detailed propertiesof myoglobin. In this Section we analyse the static structure and in the next Sectionwe consider out-of-equilibrium dynamics.We first construct the multi-soliton solution that models the myoglobin backbone.We use the crystallographic PDB structure 1ABS as a decoy, to train the energyfunction. This structure has been measured at very low liquid helium temperatures ∼
20 K. Thus the thermal B-factor fluctuations (1.5) are small.
We propose that at this point, the reader downloads the PDB structure 1ABS, to make iteasier to follow details of our analysis. We propose to use the Java interface provided at thePDB site (1.2) for visualisation of the backbone and side-chain atoms. We also recommendthe analysis tools available on the website of
Molprobity (1.4) which we shall refer to in thefollowing. We shall provide the values for all the parameters in the energy function (4.17)which the reader can use as input in the programs
ProproUI and
GaugeIT , that are describedat our website (5.1). This enables a detailed analysis of the multi-soliton that describes 1ABS,and should help the reader to start his/her independent research.
We proceed to the construction of the multi-soliton: 1ABS has154 amino acids, and tructure of myoglobin the PDB index runs over i = 0 , ... , α -helices (A,B,C,...,H) which are separated by 7 loops as shownin Figure 5.3. A B C D E F G H
Fig. 5.3
Myoglobin has 8 α -helices that are named A,B,C, ... , H Here we shall limit the construction of the multi-soliton to the sites with PDB indexbetween N=8-149. That is, we include all the named helices but we do not include theflexible tails at the ends of the backbone. These tails could be included, but withoutmuch additional insight to the issues that we address.In Figure (5.4) (top) we show the backbone bond and torsion angle spectrum withthe convention that all κ i are positive. In Figure (5.4) (bottom) we show the spectrumafter we have implemented the Z transformations (4.11) to putatively identify themulti-soliton profile; we use the program packages ProproUI and
GaugeIT that aredescribed at our website (5.1) in our analysis.We remind that both the top and bottom of Figure (5.4) correspond to the sameintrinsic backbone geometry. The Z transformation is a symmetry of the discretestring which is obtained by solving the discrete Frenet equation.We conclude from the κ i profile of the bottom Figure (5.4) that the myoglobinbackbone has eleven helices that are separated by ten single soliton loops. The num-ber of loops and helices is more or less unambiguously determined by the number ofinflection points which we identify visually in Figure (5.4), in the manner explainedin Section 3.8. The PDB sites of the ten individual soliton profiles that we use forour construction, are identified in Table 5.1. We emphasise that our geometry basedidentification of the loops and helices along the 1ABS backbone does not necessarilyneed to coincide with the conventional one used e.g. in crystallography, which is basedon inspection of hydrogen bonds. In particular, according to the conventional classifi-cation, the soliton pair 3 and 4, the soliton pair 6 and 7, and the soliton pair 9 and 10,are all interpreted as a single loop. From our geometric point of view, the PDB datareveals that in 1ABS, we have four different types of solitons. Those that connect two α helices, those that connect an α -helix with a 3/10-helix or vice versa, and finally,those that connect two 3/10-helices; see Table 5.1.In Table 5.2 we give our parameter values for the multi-soliton solution. It describes Solitons and ordered proteins i
20 40 60 80 100 120 140-3-2-10123 y q i
20 40 60 80 100 120 140-3-2-10123 y q
Fig. 5.4
Top: The κ i (black) and τ i (red) profiles of 1ABS using the standard differentialgeometric convention that bond angles are positive. Bottom: The soliton structure becomesvisible in the κ i profile once we implement the transformations (4.11). the 1ABS backbone with 0.78 ˚A RMSD accuracy. Note that in those terms in (4.17)which engage the torsion angles, the numerical parameter values are consistently muchsmaller than in terms that contain only the bond angles. This is in line with the knownfact, that in proteins the torsion angles i.e. dihedrals are usually quite flexible whilethe bond angles are relatively stiff.Note also that our energy function has 80 parameters, while there are 153 aminoacids in the entire myoglobin backbone. Thus the energy function (4.17) is highlypredictive : The number of free parameters is even less than the number of aminoacids . This shows that myoglobin displays structural redundancy, in its amino acids.The predictive power of (4.17) can alternatively be characterised as follows: Whenwe assume that all the bond lengths have the constant value (4.12), we are left with 282C α angular coordinate values in our truncated backbone. These we need to determinefrom the properties of the energy function (4.17), in order to construct the backbonefrom (4.6), (4.7). We have a total of 80 parameters in Table 5.2, thus a total of 202coordinates remain to be determined by the multi-soliton solution that minimises theenergy function. Therefore, we have 202 unknowns which are to be predicted by the tructure of myoglobin Table 5.1
The solitons along the 1ABS C α -backbone, with indexing starting from the Nterminus. We have left out the end sites that correspond to monotonous helices, and the Nand C termini segments. The type identifies whether the soliton corresponds to a loop thatconnects α -helices and (or) 3 / soliton 1 2 3 4 5sites 15-27 30-41 39-49 47-57 54-66type α - α α -3/10 3/10-3/10 3/10-3/10 3/10- α soliton 6 7 8 9 10site 72-87 83-92 94-106 110-123 121-135type α - α α - α α - α α - α α - α model. These predictions then directly probe the physical principles on which (4.17)has been built.We remind that approaches such as the G¯o model and its variants and variouselastic network models are descriptive, and lack this kind of predictive power. In thosemodels the positions of all the atoms are assumed to be known a priori . In addition oneneeds a description how the atoms interact. Thus there are always more parameters than degrees of freedom, and only a description is possible. Fig. 5.5
Comparison between the PDB structure 1ABS (dark purple) and the correspondingmulti-soliton solution (light blue).
In Figure 5.5 we interlace the 1ABS backbone with the multi-soliton; the differenceis very small.In Figure 5.6 we analyse site-wise the precision of the multi-soliton configurationwith the PDB structure 1ABS. The 15 pico-meter gray-scaled region around the multi- Solitons and ordered proteins
Table 5.2
Parameter values in energy (4.17) for the multi-soliton solution that describes1ABS. soliton λ λ m m a b c d single myoglobinstructure in the limit of vanishing temperature. In particular, as such the multi-solitondoes not account for any kind of conformational fluctuations that are due to thermaleffects, lattice imperfections, or any other kind of conformational sub-state effects;we model thermal effects using the Glauber heath bath (4.24). On the other hand,the experimentally measured 1ABS crystal structure should not be interpreted as asingle static low temperature structure. Instead, it is an average over a large number ofclosely packed crystallographic structures. A comparison between Figures 5.2 and 5.6shows that in the latter, the distance between the PDB backbone and the multi-solitonprofile is larger than that between the PDB backbone and the individual solitons inFigure 5.2. The Figure 5.6 describes the single multi-soliton solution to the equations(4.19), (4.18) while in Figure 5.2 we have constructed the individual solitons by solving(4.19), (4.18) independently for each loop. A similar individual soliton construction inthe case of 1ABS gives profiles that are comparable, even slightly more precise, thanthose in Figure 5.2. But for energetic studies we need the full multi-soliton with itsenergy function, we need the local energy minimum of (4.17) for the entire backbone. tructure of myoglobin Fig. 5.6
Comparison of the RMSD distance between the 1ABS configuration and the mul-ti-soliton solution, with the Debye-Waller B-factor fluctuation distance around the 1ABSbackbone. The blue marking at top, along 2.0 ˚A line, denotes sites where
Molprobity (1.4)detects imperfections.
We presume that a multi-soliton solution could be constructed with a precisioneven better than 0.78 ˚A in C α RMSD. But the convergence of (4.20) becomes slowwhen we use a laptop like MacAir. Thus we have simply terminated the process whenwe reach the value 0.78 ˚A which is in any case much better than that obtained in anyother approach, using any other computer, to our knowledge.In Figure 5.6 we observe that the distance between the multi-soliton solution andthe C α carbon backbone of 1ABS has its largest values mainly in two regimes. Theseare located roughly between the sites 35-45, and between the sites 79-98; we proposethe reader inspects the structure of 1ABS using the visualisation interface of PDB.The first regime corresponds to the single soliton that models the loop between helixB and helix C in Figure 5.3. The second regime corresponds to the location of thehelix F. This helix is part of the ”V”-shaped pocket of helices E and F, where theheme group is located. In particular, the reader can observe that helix F includes theproximal histidine at site 93, which is bonded to the iron ion of the heme. Note thatin addition, our analysis detects an anomaly at around site 121.In order to understand the origin of the observed deviations from an ideal multi-soliton crystal, we check for the presence of potential structural disorders in 1ABS using Molprobity (1.4); we recommend the reader performs this analysis on the
Molprobity web-site. In Figure 5.5, along the top, at the level of the 2.0 ˚A line, we have markedwith blue those regions where according to
Molprobity we have potential clashes.The
Molprobity clash score of 1ABS is 20.32 which puts it in the 10th percentileamong structures with comparable resolution, 100 being the best score. The regionsof potential structural clashes correlate with those regions, where our multi-solitonprofile has the largest deviations from the 1ABS backbone. Except the vicinity of thesite 121, which is unproblematic according to
Molprobity .We first consider the difference between the multi-soliton and the 1ABS backbonearound the sites 79-98, that was also identified by the
Ansatz as a potentially trouble-some one. The difference appears to be largely due to a deformation of the helix F. Itcould be caused by a bond between the proximal histidine at site 93 and the oxygenbinding heme iron. This might introduce a strain which modifies the backbone. The Solitons and ordered proteins effect of the heme is not accounted for by our energy function, in the present form.Consequently we propose the histidine-heme interaction to be the likely explanationfor the relatively large deviation between our multi-soliton profile and the 1ABS back-bone, at this position.We proceed to consider the difference between the multi-soliton and the 1ABSbackbone around the sites 35-45. These sites are also located very close to the heme.For example, the distance between the C α carbon at site 45 (Arg) and the hem oxygen154 is 4.84 ˚A, and the C α of Phe at site 43 is even closer to the heme. This proximitybetween the backbone and the heme is reflected in the Molprobity clash at site 45(between C δ and 154 HEM). We conclude that there could be strain in the backbonestructure which is due to the heme, and this could explain the difference between1ABS backbone and the multi-soliton configuration in this regime.Finally, we note that in Figure 5.6, we also have our previously observed anomalyat site 121 (glycine). At this point, we have no explanation for the anomaly except thatglycine is flexible and that this region is on the exterior of the protein. This leaves thehydrophobic phenylalanine at nearby site 123 exposed to the solvent. Consequentlyrelatively strong fluctuations between several locally conformationally different butenergetically degenerated sub-states are possible. The myoglobin stores O by binding it to the iron atom, which is inside the myoglobin.The oxidisation causes a conversion from ferrous ion (Fe2+) to ferric ion (Fe3+); theoxidised molecule is called oxymyoglobin. When the oxygen is absent, the molecule iscalled deoxymyoglobin. We propose the reader finds examples of each from PDB, andinspects the structures using the 3D Java interface.Numerous detailed experimental investigations have been made both of oxymyo-globin and deoxymyoglobin. But to our knowledge, the understanding of the oxygentransport mechanism in myoglobin remains incomplete: We do not yet know exactly,how small non-polar ligands such as O , CO and NO move between the external sol-vent and the iron containing heme group, which is located inside the myoglobin. Fromthe available static crystallographic PDB structures one can not identify any obviousligand pathway. It has been proposed that the process involves thermally driven largescale conformational motions. Collective thermal fluctuations could open and closegates through which the ligands migrate. Such gates are not necessarily visible in thecrystallised low temperature structures. Computational investigations that model thedynamics of myoglobin might provide a glue how these gates operate. We start our investigation how ligand gates might open and close, by performingheating and cooling simulations of myoglobin with the energy function (4.17) in com-bination with Glauber dynamics (4.24), (4.25) [80]. We use the 1ABS multi-soliton wehave constructed. The structure is a carbon-monoxy-myoglobin (MbCO), but with thecovalent bond between the CO and iron broken by photodissociation. In any case, itsdynamical properties should serve as a good first-approximation model how myoglobinbehaves, more generally. ynamical myoglobin We start our simulations at a low temperature value, from the multi-soliton config-uration that we have constructed: A classical soliton solution is commonly interpretedas a structure that describes the limit of vanishing temperature where thermal fluctu-ations become very small.We take k T L = 10 − for the numerical value of the low temperature factor, in terms of the dimensionlessunit that is determined by our choice of the overall energy scale in (4.17). At this valueof the temperature factor thermal fluctuations are absent in our multi-soliton. For thenumerical high temperature value we choose k T H = 10 − By applying the renormalisation group arguments in Section 4.5 we have related thetwo temperature factors k T and k B T . Our conversion relation that we shall justify inthe sequel, is k T = 1 . · − k B T exp { . T } (5.2)We use CGS units so that k B = 1 . × − erg/K .Under in vivo conditions myoglobin always interacts with water. This interactionis essential for maintaining the collapsed phase. In our approach we account for thesolvent (water) implicitly, in terms of the parameter values in (4.17). In particular,as such our model can only effectively take into account the highly complex phaseproperties of water at sub-freezing temperatures. We do not even try to address theobvious complications that appear when the temperature raises above the boiling pointof water.We start the simulation at k T L . The heating takes place with an exponential rate ofincrease during 5 million Monte Carlo steps. We model the non-equilibrium responseusing the Glauber protocol (4.24). According to (5.2) in terms of physical temperaturefactor k B T , the heating process should correspond to an adiabatically slow nearly-linear rate of increase.When we arrive at the high temperature k T H = 10 − we fully thermalise thesystem by keeping it at this temperature value during 5 million steps. We then proceedto cool it back down to k T L . We use the same rate of cooling as we use for heating, i.e. we cool exponentially in k T during 5 million steps. Each complete heating-coolingcycle takes about 3 minutes of wall-clock time when we use a single processor ina standard laptop (MacAir). Consequently time is no constraint for us and we cancollect very, very good statistics. In particular, we have checked that our results andconclusions are quite insensitive to the rate of heating and cooling.For statistical purposes, we have performed 100 repeated heating and cooling cyclesthat we have then analysed in detail; an increase in the number of cycles does notchange our conclusions. Note that in an all-atom approach a comparable simulationwould take over 100.000 years even with Anton [36] while for us a few minutes isenough.During the simulations, we follow the evolution of both the radius of gyration R g and the RMSD distance R rmsd between the simulated configuration and the folded Solitons and ordered proteins
260 295 330 365 380 380 380 365 330 295 2600510152025 R M S D ( ˚A ) T(K)
Number of Steps ×
260 355 450 545 580 580 580 545 450 355 2600102030 R M S D ( ˚A ) T(K)
Number of Steps ×
260 295 330 365 380 380 380 365 330 295 26010152025 R a d i u s o f G y r a t i o n ( ˚A ) T(K)
Number of Steps × a) Fig. 5.7 a) The evolution of the radius of gyration during 100 repeated heating and coolingcycles. b) The evolution of the RMSD distance between the 1ABS backbone and the simulatedconfiguration during 100 repeated heating and cooling cycles. c) The evolution of the RMSDdistance between the 1ABS backbone and the simulated configuration during 100 repeatedheating and cooling cycles, to very high temperature values. In each Figure the (blue) linedenotes the average, and the shaded area around it is the extent of one standard deviationfluctuations. Along the top axis, we show the temperature in the Kelvin scale, using theconversion relation (5.2).
We make the following observations: At low temperatures, with temperature factors k T < − ynamical myoglobin the radius of gyration is essentially constant R g ≈ . − < k T < − we have a regime when the radius of gyration increases at an accelerating rate in thenumber of steps. The increase in R g continues until we reach a temperature near k T H .But for temperatures where the temperature factor is in the range10 − < k T < k T H = 10 − the rate of increase decelerates so that when we reach the temperature k T H we observeno increase in R g . This proposes that we have reached the random walk θ -regime.Furthermore, the radius of gyration value R g ≈
22 ˚Ais extremely close to the experimentally measured value ∼ . k T H is indeed a θ -transition betweencollapsed phase and random walk phase, by heating the configuration to substantiallyhigher temperature factor values. We have found that above this putative θ -transition,the radius of gyration remains essentially intact under temperature variations all theway to k T = 10 − Around this temperature value another transition takes place, presumably to the ν ∼ / k T L = 10 − and k T = 10 − . We observetwo clear transitions that are consistent with the transitions between collapsed andrandom walk phases, and between random walk and self-avoiding random walk phasesaccording to (1.8).When we decrease the temperature, the evolution of R g becomes inverted. At theend of the cycle, when the temperature reaches k T L , the configuration returns back toa very close proximity of the original folded state; see Figure 5.7 b). This demonstratesthe stability of the multi-soliton solution that describes the natively folded myoglobinas a local minimum of the energy (4.17).Figure 5.8 a) shows the average values of R g . These averages are evaluated atseveral different temperature values, over 100 runs. Both during the heating periodwhen 0 < x < .
5, and during the cooling period when 7 . < x <
15 where x is thenumber of MC steps in millions. The data can be approximated by a fitting functionof the form R g ( x ) ≈ a · tanh { b ( x − c ) } + d (5.3)The parameter values are listed in Table 5.3. Solitons and ordered proteins
T(K)T(K)
Fig. 5.8
The red line is the fitting of (5.3) to the average values of blue dots which are thenumerically computed values of R g , over the heating and cooling periods. The shaded areaaround the red fitting line is one standard deviation estimate. Note: The difference betweenthese three is so small that it is barely observable in the Figure. Also shown is the derivativeof (5.3) (light blue line). Along the top axis, we have converted the temperature into Kelvinscale. Table 5.3
Parameter values in the fits (5.3), (5.5) for the two ranges 0 - 7.5 and 7.5 - 15 (inmillion) of iteration steps R g R rmsd range a b c d a b c d0 - 7.5 3.519 0.9047 3.6855 18.29 7.9 0.8318 3.5715 9.2917.5 - 15 -3.486 0.9193 11.2965 18.28 -7.872 0.8327 11.4255 9.298In Figure 5.8 a) we display the derivative of (5.3). We can try and use the maximumof the derivative to identify the θ -transition temperature in our model. For this, weassume that the experimentally measured [90,91] transition temperature at T c ≈ K is the one that corresponds to the θ -transition. We identify it with the maximum of thederivative of R g , to conclude that during the heating cycle the θ -transition temperaturerelates to our dimensionless temperature values as follows, ynamical myoglobin T hg ≈ . · − ≈ K (5.4)We use this value to determine one of the two parameters in (5.2).During the cooling cycle, we find the slightly different T cg ≈ . · − ≈ K We note that an asymmetry between heating and cooling has been observed experi-mentally [90].The RMSD distance between the simulated configuration and the 1ABS backbonedepends on temperature in a very similar manner. In Figure 5.8 b) we show thecomparison between simulation, and the corresponding approximation (5.3), R rmsd ≈ a · tanh { b ( x − c ) } + d (5.5)These parameter values are also listed in Table 5.3 separately for the heating period0 < x < . . < x <
15. The Figure 5.8 b) also shows thederivative of R rmsd ( x ). Like in the case of radius of gyration, we use the maximumof the derivative to estimate the peak rate of change in the transition temperature.During the heating period the increase in R rmsd peaks at T hrmsd ≈ . · − ≈ K During the cooling period we find that the peak corresponds to a slightly highertemperature value, T hrmsd ≈ .
45 10 − ≈ K These valus are very close to those we observe in the case of R g . We proceed to try and identify potential thermally driven backbone ligand gates. Weare interested in studying how the gates open and close as the myoglobin is heatedand cooled. Moreover, thus far we have fixed only one of the two parameters in (5.4)using the θ -transition temperature. We shall fix the second parameter in the sequel,by considering the dynamics of the backbone ligand gates.We investigate the shape of the backbone visually, during the heating and cooling.We find that qualitatively the thermal fluctuations follow a very similar pattern. Thebackbone becomes unfolded in more or less the same manner, again and again, as thetemperature becomes increased. The inverse pattern is observed during the cooling.During heating we observe a clear onset of the unfolding transition, which wecharacterise in terms of backbone ligand gates. We have identified three major gatesthat we call Gate 1,2 and 3, and we define them as follows:The Gate 1 shown in Figure 5.9 a) is defined as the area between the segment thatstarts from PDB site 37 (Pro) and ends at PDB site 44 (Asp), and the segment thatstarts at 96 (Lys) and ends at 103 (Tyr). The opening of this gate takes place as thedistance between the two segments increases. The open gate exposes the heme to thesolvent. Figure 5.9 a) shows the location of this gate along the 1ABS backbone. Solitons and ordered proteins a)
Fig. 5.9
Stereographic cross-eyed view of the ligand Gate 1, 2 and 3 as defined in the text.Figure a) is Gate 1 formed by segment located between 37 (Pro) and 44 (Asp), and segmentthat starts at 96 (Lys) and ends at 103 (Tyr). Figure b) is Gate 2, between segment thatstarts from PDB site 61 (Leu) and ends at PDB site 68 (Val), and segment that starts at89 (Leu) and ends at 96 (Lys). Figure c) is Gate 3, between segment starting from site 25(Gly) and ending at 32 (Leu), and segment that starts at 106 (Phe) and ends at 113 (His).We also show the location of the heme (orange), the proximal histidine (93), the valine (68),the distal histidine (64) (all green) and the CO (black ellipsoid) of 1ABS.
The Gate 2 is located between the helical structures E and F, as shown in 5.9 b).This gate extends over the entire length of both helices E and F. Thus, in order tocompare it with Gate 1 that is composed of segments with only eight residues, we selecttwo opposing segments along helices E and F, each with eight amino acids. The firstsegment, located in the helical structure E, starts with site 61 (Leu) and ends with site68 (Val). The second segment, located in the helical structure F opposite to the firstsegment, starts with site 89 (Leu) and ends with site 96 (Lys). We have intentionallyselected these two segments to be far from the loop that connects the helices E andF. This is because in our simulations, we have observed that the amplitudes of the ynamical myoglobin thermal fluctuations in the segment distances tend to increase, the further away thesegment is located from the connecting loop: The opening and closing of the gateresembles the opening and closing of scissors, with blades formed by helices E and Fthat are connected by the loop between these two helices. Note that the first segmentalong helix E, includes both the distal histidine at site 64 and the valine at the endsite 68. This valine is also inside the heme pocket, and it is presumed to have animportant role in CO vs. O discrimination. Similarly, the opposite segment in thehelical structure F includes the proximal histidine at site 93.Finally, the Gate 3 which is shown in 5.9 c) is located between the helical structuresB and G. Again, in order to compare this relatively long gate with Gate 1 we select twoopposing segments, each with eight amino acids. The segment in the helical structureB starts at site 25 (Gly) and ends at site 32 (Leu). The segment in helix G starts atsite 106 (Phe) and ends at site 113 (His).During the heating and cooling cycle of the myoglobin, we follow the size of thethree gates. We do this by computing the distance d i ( i = 1 , ,
3) between the respectivesegments, as a functions of temperature. We define the distance d i between the twosegments for each of the three gates, as follows: d i = (cid:118)(cid:117)(cid:117)(cid:116) (cid:88) n =1 ( x n − y n ) (5.6)Here x n are the eight coordinates in the first segment, and y n are the correspondingcoordinates in the second segment, along Gate i = 1 , ,
3. Note that the two segmentsin each of the three gates, are spatially oriented in an anti-parallel manner with respectto PDB indexing. Consequently, in computing (5.6), we invert the indexing in one ofthe two segments with respect to the PDB indexing.We start by investigating the temperature dependence of the three gates, using theexperimental data which is available in PDB. For this we compute the following threegate ratios, from PDB data:Gate1Gate2 = d d & Gate3Gate2 = d d & Gate3Gate1 = d d (5.7)We use all the presently available myoglobin structures in PDB, that have been mea-sured with resolution 2.0 ˚A or better. The results are shown Figures 5.10. In eachFigure, we observe substantial fluctuations in the gate ratios, in case of PDB data thathas been taken at around 100 K . But this reflects only the fact that the majority ofPDB data has been collected at this temperature value. Overall, we conclude that thegate ratios show no temperature dependence for T < K .We proceed to the computation of the temperature dependence of the gate ratiosusing our 1ABS multi-soliton with the energy function (4.17). The results are shownin Figure 5.11 a), b) and c). We have found that the Gate 3 is the first gate to openas the temperature increases, and the last one to close as temperature decreases. TheGate 2 is the one to open last, and the one to close first. In the low temperature limitthe Gate 3 is about half the size of the Gate 2. But its size exceeds that of the Gate2 in the segment separation distance (5.6) at temperature Solitons and ordered proteins
T(K) G a t e R a t i o Gate1/Gate2
T(K) G a t e R a t i o Gate3/Gate2
T(K) G a t e R a t i o Gate3/Gate1
Fig. 5.10
The three Gate ratio (5.7). The dotted (red) lines are intended to guide eye only. k T c ≈ − ∼ K The transition is very rapid, in line with the general results of reference [92]: Whenthe temperature reaches the θ -transition value ∼ K , the Gate 3 is about twice aslarge as the Gate 2. ynamical myoglobin The Gate 1 also opens much faster than Gate 2, but slower than Gate 3. It alsocloses slower than Gate 2, but faster than Gate 3. In the low temperature limit theGate 1 is about half as wide as Gate 2. But it becomes wider than Gate 2 when thetemperature reaches a value k T c ≈ − However, the Gate 1 does not become quite as wide as Gate 3. This is shown in Figure5.11 a).
260 295 330 365 380 380 380 365 330 295 26000.511.522.5
T(K) G a t e R a t i o Gate1/Gate2
Number of Steps ° C × ° C
260 295 330 365 380 380 380 365 330 295 26000.511.522.5
T(K) G a t e R a t i o Gate3/Gate2
Number of Steps × ° C54 ° C
260 295 330 365 380 380 380 365 330 295 26000.511.522.5
T(K) G a t e R a t i o Gate3/Gate1
Number of Steps × Fig. 5.11
The temperature dependence of the three Gate ratios (5.7) during our heatingand cooling cycles. Solitons and ordered proteins
We are now in a position to determine the second parameter in (4.32) to arriveat (5.2); we remind that one of the two parameters is already determined, in (5.4).For this we proceed as follows: When we compare Figures 5.10 we conclude thatexperimentally, the gate ratios do not display any observable temperature dependencewhenever
T < K . Consequently, the lowest possible value of the temperature factor k T where Figures 5.11 can display any change in the gate ratios, should correspond toa temperature which is above 300 K . When we read off the lowest possible k T valuewhere we have an observable effect in Figures 5.11, we conclude that, necessarily, k T ≈ − > k B K This gives a lower bound. When we adopt this lower bound value as our estimate forthe gate opening temperature we obtain the second parameter value in (5.2).In reality the actual gate opening temperature can be higher, but at the momentexperimental basis for choosing a higher value is lacking: The single presently existingNMR data of myoglobin in PDB is 1MYF. It has been measured at the slightly highertemperature value of 308 K. But the quality of data does not enable us to improve ourestimate.We note that a higher gate opening temperature has no qualitative effect to ourconclusions, quantitatively the differences are minor. The only effect would be a sharp-ening of the θ -transition onset.We conclude with a summary of the consequences that results in Figures 5.11might have to ligand migration: We have found that to the extent backbone thermalfluctuations play a rˆole in ligand migration, the Gate 3 between the helical structuresB and G can be very important. This gate opens very much like a baseball glove, as weincrease temperature. The Gate 1 might also play a rˆole, but probably a lesser one thanGate 3. On the other hand, the V-shaped Gate 2 between helices E and F seems to bequite sturdy, it does not seem to open as much as the other two gates. The presenceof the distal and proximal histidines in Gate 2 and their attractive interactions withheme, might have an additional stabilising effect that is not accounted by our model.Consequently we do not see how the thermal backbone fluctuations that take place inthe Gate 2, could play a major rˆole in ligand migration. At least, to the extent thatbackbone fluctuations are relevant.A recent THz time-scale spectroscopy experiment has detected collective thermalfluctuations in the protein, that might relate to our theoretical proposals of ligandgate dynamics [93]. However, more detailed experiments need to be performed. Intrinsically disordered proteins
The crystallographic protein structures in PDB are ordered proteins. An ordered pro-tein has an essentially unique native fold which can be determined by x-ray crys-tallography. But most proteins are not ordered, most proteins do not seem to havean essentially unique native fold. Instead, the low energy landscape of most proteinsseems to comprise several states which are energetically degenerate but conformation-ally disparate. With local energy minima that are separated from each other by verylow free energy barriers. We call them intrinsically disordered proteins. Normallythese proteins can not be crystallised, and structural data is in short supply. Our aimis to extend the methods that we have developed, to describe the properties of suchproteins. For this we start by developing some formalism.But please keep in mind that there is a grey zone between ordered and intrinsicallydisordered proteins: A skilfull crystallographer might be able to produce a crystal outof a protein that others consider hopeless.
When an ordered protein becomes cooled down to low temperatures it should assumean essentially unique native fold, that corresponds to a minimum of the low tem-perature thermodynamic (Helmholtz) free energy. More specifically, in the case of aprotein with an ordered native fold, a cooling should produce a highly localized statis-tical distribution of structurally closely related conformational substates. When takentogether, this ensemble constitutes the folded native state at low temperatures. Butif a protein is intrinsically disordered, instead we expect that the low temperaturelimit produces a scattered statistical distribution of structurally disparate but ener-getically comparable ensembles of conformational substates. Moreover, these differentsubstates should be separated from each other only by relatively low energy barriers.The unstructured, disordered character of the protein is a consequence of a motionaround this scattered landscape: The protein swings and sways back and forth, quitefreely, over the low energy barriers that separate the various energetically degeneratebut structurally disparate conformations.We may think that the state space of a ordered protein with an essentially uniquenative fold, consists of a set of snapshot states | s > that form a tightly localizedand essentially Gaußian distribution around an average state | s > ave . When takentogether, the set of these snapshots determine the low temperature folded native stateas an ensemble of conformational substates. In fact, we expect that for an ordered Intrinsically disordered proteins protein, the extend of conformational variations around the average state | s > ave canbe estimated from the crystallographic Debye-Waller B-factor.However, in the case of an intrinsically disordered protein the low temperatureset of snapshot states | s > exhibits a disperse statistical distribution. We have nosingle tightly localised peak, with a clearly identifiable average value, in the statisticaldistribution of snapshots. Instead, the statistical distribution of the snapshot statesis scattered . The snapshots become apportioned between several structurally dispersebut energetically degenerate conformational substates.Of particular interest are those energetically degenerate substates that are sepa-rated from each other by relatively low energy barriers. The unstructured and unsettledcharacter of an intrinsically disordered protein is a consequence of thermally inducedfluctuations that move the configuration around the structurally diverse and energeti-cally degenerate landscape: The protein swings and sways back and forth between thedisparate snapshot states | s > . Due to very low energy barriers, this dynamics per-sists even at very low temperatures, where quantum mechanical tunneling transitionseventually take over the thermally induced ones. Thus the unstructured character canpersists even at very low temperatures.We interpret the ensemble of snapshot states in terms of a Hartree state | Φ > whichis a linear combination of the form | Φ > (cid:39) (cid:88)(cid:90) i p i | s i > (6.1)Here the index set i can have both discrete and continuum portions, including varioussmall and large amplitude collective coordinates. We envision that the state spacewhich is spanned by | s i > can be endowed with a norm (metric) which enables us toselect and orthonormalize the set of eigen-conformations (snapshot structures) | s i > .The detailed construction of a norm in the space of string-like structures will not beaddressed here.When | s i > are normalised, the coefficient p i determines the probability weight forthe ensuing eigen-conformation | s i > to contribute in the Hartree state. In particular,the Hartree state (6.1) is a mixed state and not a pure state, when it describes anintrinsically disordered protein; the conformational entropy is non-vanishing, S = − (cid:88)(cid:90) i p i log p i > p k ≈
1, and the other p i with i (cid:54) = k arevanishingly small p i ≈
0, the Hartree state | Φ > reduces to a pure state and describesan ordered native fold.The eigen-conformations | s i > are time independent, akin states in the quantummechanical Heisenberg picture. But the p i can be time dependent quantities, they thendescribe the time evolution of | Φ > . For sufficiently long time scales, larger than thecharacteristic thermal tunneling time between different eigen-conformations | s i > , thetime dependence of the p i is governed by a Liouville equation d ˆ ρdt = ∂ ˆ ρ∂t + { ˆ ρ, H } ≡ L ˆ ρ (6.2) IAPP and type-II diabetes where ˆ ρ = (cid:88) i p i | s i >< s i | is the density matrix. The second term in (6.2) is the Poisson bracket with the (total)semiclassical Hamiltonian H ; in a quantum mechanical version the density matrix andthe Hamiltonian are replaced by the corresponding Heisenberg operators. We notethat for a non-equilibrium system, the total operator L becomes the (semiclassical)Lindblad superoperator.For a system at or very near a thermodynamical equilibrium, the | s i > concur withthe extrema configurations of the low temperature limit of Helmholz free energy. Theprobabilities p i are evaluated from corresponding values of the free energy, and (6.1)acquires the form | Φ > (cid:39) Z (cid:88)(cid:90) { s } e − βE ( s ) | s > (6.3)where Z = (cid:88)(cid:90) { s } e − βE ( s ) and β = 1 /kT is the inverse temperature. For completeness, we note that for anextremum conformation | s i > that is not a local minimum of the free energy, a Maslovindex contribution needs to be included. We now proceed to consider an example of an intrinsically disordered protein withvery extensive and important biological, medical and pharmaceutical ramifications.The human islet amyloid polypeptide (hIAPP), also known as amylin, is a widelystudied 37 amino acid polypeptide hormone; for extensive reviews we refer to [94, 95].It is processed in pancreatic β -cells from an 89 residue precursor protein, by a proteasecleavage in combination of post-translational modifications. The secretion of hIAPPresponses to meals, and the peptide co-operates with insulin to regulate blood glucoselevels. But hIAPP can also form pancreatic amyloid deposits, and their formation andbuildup correlates strongly with the depletion of islet β -cells. The hIAPP amyloidosisis present in over 90 per cent of the type-II diabetes patients, and the deposits areconsidered as the hallmark of the disease in progression.It still remains to be fully clarified whether the hIAPP amyloid aggregation is thedirect cause of apoptosis in the islet β -cells. Instead, the amyloid fibers might onlybe a consequence of the disease which is ultimately caused by some other and yet-to-be-identified agent. Among the arguments that support the existence of a first-handcausal relationship between hIAPP fibrillation and the onset of type-II diabetes, is theobservation that wild-type mice do not develop the disease while transgenic mice thatexpress hIAPP can fall ill with the disease. It has also been observed that a directcontact between the hIAPP amyloid fibrils and the surfaces of pancreatic islet β -cellshas a toxic effect on the latter. Intrinsically disordered proteins
There is a real possibility that by understanding the structural landscape of hIAPP,in particular how the amyloidosis transition takes place, we could gain a major steptowards the identification of therapeutic targets and the development of strategiesto combat a potentially deadly disease, that presently plagues around 5 per cent ofthe world’s adult population. Indeed, type-II diabetes is arguably among the mostdevastating diseases to curse mankind. Its annual economic cost has been globallyestimated to be in excess of 425 billion Euro’s, and the number of sufferers is estimatedto almost double during the next 20 years.The structure of aggregated hIAPP fibrils have been studied very extensively[94, 95]. The fibrils consist of an ordered parallel arrangement of monomers, with azipper-like packing. Apparently the fibril formation proceeds by nucleation, with onemonomeric hIAPP molecule first assuming a hairpin-like structure. This is followedby a piling-up of several monomers, which eventually leads to the buildup of amyloidfibrils as the hallmark of the disease in progression. But the structure of a full-lengthmonomeric hIAPP, its potentially disease causing dynamical conformational state inour pancreatic cells, and why it occasionally may form the disease causing hairpin-likestructure, all these remain unknown. Moreover, despite the highly ordered nature ofamyloid fibril aggregates, only very recently have experimental advances made it pos-sible to obtain high-resolution models. However, we still largely lack detailed atomiclevel knowledge, needed for drug development.The monomeric form of hIAPP is presumed to be an example of an intrinsically dis-ordered protein. When biologically active and healthy, it is in an unsettled and highlydynamic state. As such it lacks an ordered three dimensional folded state that could bestudied by conventional x-ray crystallography approaches. Several experimental meth-ods which are based e.g. on solution and solid-state NMR and other techniques, arecurrently under development to try and characterise the conformation of monomerichIAPP. But the existing techniques do not yet permit a direct examination of theatomic level structure. Detergents such as Sodium dodecyl sulfate (SDS) micelles arecommonly introduced as stabilising agents.The detailed atomic level information could in principle be extracted by theoreticalmeans with molecular dynamics simulations. However, with explicit water presentlyavailable computer power can at best cover a dynamical in vitro/in vivo trajectory upto around a microsecond per a day in silico . At the same time amyloid aggregation takeshours, even days. Thus the present all-atom computational investigations are largelydependent on our ability to determine an initial conformation for the simulations.Otherwise, one might only end up simulating the initial condition.Due to its intrinsically disordered character, the structural data of hIAPP in isola-tion remains sparse in PDB. The only presently available PDB data of hIAPP consistsof two NMR structures and one crystallographic structure. Both NMR structures de-scribe hIAPP in a complex with SDS micelles, the PDB access codes are 2L86 and2KB8. The sole available crystallographic structure describes hIAPP that has beenfused with a maltose-binding protein, the PDB access code is 3G7V. We note thatthese three structures are all very different from each other.The NMR structure 2L86 has been measured at pH of 7.4 i.e. around the pH valuein the extracellular domain where the hIAPP amyloid deposits appear. On the other
IAPP as a three-soliton hand, the NMR structure 2KB8 has been measured with pH value 4.6 which is closerto the pH value around 5.5 inside the β -cell granules of pancreas. We point out thateven though hIAPP amyloidosis is apparently an extracellular process, some evidencesuggests that the aggregation might have an intracellular origin. Thus, a thoroughinvestigation of the rˆole of hIAPP in the onset of type-II diabetes, should account both for the extracellular and the intracellular structural properties of the peptide. Inaddition, a detailed analysis how hIAPP interacts with cell membranes is needed; wenote that to some extent, micelles might mimic membrane effects.We conclude, that it has been pointed out, that hIAPP affects also several otherorgans besides pancreas. For example, hIAPP is known to have binding sites in thebrain, where it apparently has a regulatory effect on gastric emptying. A delayed gastricemptying is commonly diagnosed in patients with diabetes. But gastroparesis is alsoa component in a number of other disorders. Certainly, the ability of hIAPP to crossthe blood-brain barrier and affect the central nervous system, relates to its structure.Amyloid fibers can hardly cross the barrier. Thus, besides apparently contributingdirectly to type-II diabetes, aggregation should also have a wider influence on theregulatory activity of hIAPP. In this Section we shall investigate in detail the physical properties of a 28 segmentmonomer of hIAPP; we propose that the reader gets access to the entry 2L86 in PDB.The segment we are interested in, consists of the residues 9-36 where several studieshave either observed or predicted that the amyloid fibril formation starts [94, 95]. Thephysical properties of the short N-terminal segment that comprises the residues 1-8 is not addressed here. The structure of this segment is more involved, due to thedisulfide bond that connects the cysteines which are located at the residues 2 and 7.Moreover, it remains to be understood what is the rˆole, if any, of the residues 1-8 inhIAPP aggregation. These residues appear to have a tendency towards forming longand stable non- β -sheet fibers in solution, under the same conditions in which hIAPPaggregates into amyloid fibers.We use the NMR structure 2L86 in PDB as a decoy to train the energy function. Weconstruct a multi-soliton configuration as an extremum of the energy function (4.17),that accurately describes 2L86. Since 2L86 is a composite of hIAPP with SDS micelles,we propose the following biological set-up: We consider the structural evolution of anisolated hIAPP in the extracellular domain, in a scenario where the polypeptide isinitially in a direct but residual interaction with the cell membrane. The effect of aninitial cell membrane interaction is modelled by the effect of SDS micelles in 2L86.Following the construction of the multi-soliton configuration, we study the presumeddisordered structural landscape of an isolated hIAPP; we try and model hIAPP asit enters the extracellular domain. For this we subject the multi-soliton to a series ofheating and cooling simulations as in the case of myoglobin, using the Glauber dy-namics. During the heating, we increase the temperature until we detect a structuralchange in the multi-soliton, so that the configuration behaves like a random walker.We fully thermalise the configuration at the random walk phase. We then reduce theambient temperature, to cool down the configuration to very low temperature values Intrinsically disordered proteins until it freezes into a conformation where no thermal motion prevails. Since an iso-lated hIAPP is intrinsically disordered, instead of a single native fold as in the caseof myoglobin we expect that the low temperature limit produces a scattered statis-tical distribution of structurally disparate but energetically comparable ensembles ofconformational substates (6.1). Moreover, the individual different substates should beseparated from each other by relatively low energy barriers. The unstructured, disor-dered character of hIAPP is then a consequence of a motion around this landscape: Itswings and sways back and forth, quite freely, over the low energy barriers that sepa-rate the various energetically degenerate but structurally disparate conformations. Weshall find that in the case of hIAPP, the heating and cooling procedure which in thecase of myoglobin yields a single low energy state, now produces exactly this kind ofstructurally scattered ensembles of conformations.In Table 6.3 we have the parameter values, that we find by training the energyfunction (4.17) to describe 2L86. These parameters values are taken from [96].
Table 6.1
Parameter values for the three-soliton configuration that describes 2L86; thesoliton 1 cover the PDB segment THR 9 – ASN 21, the soliton 2 covers the segment ASN 22– ALA 25 and the third soliton covers the segment ILE 26 – THR 36. The value of a is fixedto a = − − . soliton q q m m d/a c/a b/a · − -1.402 · − -2.5682 2.927 2.441 1.667 1.534 -4.894 · − -1.067 · − -19.483 1.119 8.086 1.522 1.514 -3.578 · − -5.907 · − - 1.908In Figure 6.1 a) we show the spectrum of bond and torsion angles for the firstNMR structure of 2L86, with the convention that the bond angle takes values between κ ∈ [0 , π ]. In Figure 6.1 b) we have introduced the Z symmetry (4.11) to disclosethree individual solitons along the backbone. The first soliton from the N-terminalis centered at the site 17. The third soliton is centered at the site 27. Both of thesetwo solitons correspond to clearly visible loops in the three dimensional structure, inthe PDB entry 2L86 (see PDB). The second soliton, centered at site 23, is much lesspalpable in the three dimensional NMR structure. This soliton appears more like abend in an α -helical structure, extending from the first soliton to the third soliton.The Z transformed ( κ, τ ) profile shown in Figure 6.1 b) is the background that wehave used in training the energy function (4.17).In Figure 6.2 we compare the bond and torsion angle spectrum of our three-solitonsolution with the first NMR structure of 2L86; the solution is obtained using theprogram ProPro in (5.1). The quality of our three-soliton solution is clearly verygood, at the level of the bond and torsion angles.Figure 6.3 shows our three-soliton solution, interlaced with the first NMR structureof 2L86. The RMSD distance between the experimental structure and the three-soliton configuration is 1.17 ˚A. This is somewhat large, when compared to the multi-
IAPP as a three-soliton Figure 1: Experiment bond angles(red) and torsion angles(black). κκκ" a)" (cid:1) (cid:2) Figure 2: Experiment bond angles(red) and torsion angles(black) after Z gauge transformation at 18, 22 and 29. b)" R a d i a n s " R a d i a n s " PDB"index"PDB"index"κκκ" (cid:1) (cid:2)
Fig. 6.1 a) The spectrum of the bond and torsion angles of 2L86 (first entry) with theconvention that bond angle takes values in κ ∈ [0 , π ). b) The spectrum of the bond andtorsion angles that identifies the soliton structures. soliton structures that we have found previously. But the resolution of the presentexperimental NMR structure is not that good, and this is reflected by the somewhatlower quality of the three-soliton solution, in comparison to the case of high resolutioncrystallographic structures.Figure 6.4 compares the residue-wise C α distances, between the 20 different NMRstructures in the PDB entry 2L86, and our three-soliton solution. For those residuesthat precede the bend-like second soliton which is centred at site 23, the distancebetween the experimental structures and the numerically constructed solution is rel-atively small. We observe a quantitative change in the precision of the three-solitonsolution, that takes place after site 23. The distance between the experimental struc-tures and the three-soliton solution clearly increases after this residue. It could be thatthis change is due to the SDS micelles, used in the experimental set-up to stabilisehIAPP/2L86: SDS is widely used as a detergent, to enable NMR structure determina-tion in the case of proteins with high hydrophobicity. The mechanism of SDS-proteininteraction is apparently not yet fully understood. But it is known that the hydropho-bic tails of SDS molecules interact in particular with the hydrophobic core of a protein.These interactions are known to disrupt the native structure to the effect, that the pro-tein displays an increase in its α -helical posture; these additional α -helical structures Intrinsically disordered proteins
Figure 4: Residue Number v.s. Bond angles. Blue: fitting value; Red: experiment value. X-axis: angle index; Y-axis: angle’svalue in Radians. Figure5: ResidueNumberv.s. Torsionangles. Blue: fittingvalue;Red: experimentvalue. X-axis: angleindex;Y-axis: angle’svalue in Radians. Bond%angle%Torsion%angle% a)%b)% R a d i a n s % R a d i a n s % PDB%index%PDB%index%
Fig. 6.2
Top: Comparison of the three-soliton bond angle (blue) with the experimental 2L86bond angle spectrum (red). Bottom: Comparison of the three-soliton torsion angle (blue) withthe experimental 2L86 torsion angle spectrum (red). tend to be surrounded by SDS micelles.The residue site 23 of hIAPP is the highly hydrophobic phenylalanine. It is followedby the very flexible glycine at site 24. Thus, the apparently abnormal bend which islocated at the site 23 and affects the quality of our three-soliton configuration, couldbe due to an interaction between the phenylalanine and the surrounding SDS micelles.A high sensitivity of the hIAPP conformation to the phenylalanine at site 23 is welldocumented.An analysis of 2L86 structure using
Molprobity (1.4) suggests a propensity towardspoor rotamers between the sites 23-36, i.e. the region where the quality of our three-soliton solution decreases.A comparison with the statistically determined radius of gyration relation (1.11)reveals that for 2L86 the value of R g ≈ . N = 9 , ..., R g ≈ . N = 28. The structure of 2L86 should be more compact.We conclude that most likely the SDS-hIAPP interaction has deformed a loopwhich, in the absence of micelles, should be located in the vicinity of the residuenumber 23. Probably, the interaction with micelles has converted this loop into astructure resembling a bend in an α -helix. This interaction between hIAPP and SDSinterferes with our construction of the three-soliton configuration, adversely affecting eating and cooling hIAPP Figure 5: 2L86 PDB entry(red) interlaced with the fitting segment(blue) from [THR]9 to [THR]36. Fig. 6.3
The three-soliton solution (blue) interlaced with the 2L86 experimental structure(red). D i s t a n c e ) ( Å ) ) PDB)index)
Figure 6: The black line denotes the C ↵ atom distance between fitting configuration and corresponding NMR model con-figuration. The red line denotes the Debye-Waller distance that is computed from the temperature factors in PDB. The greyarea describes the estimated . Å zero point fluctuation distance of the fitting configuration. Blue points denote the average C ↵ atom distance between NMR model configuration and the other configurations, whose errorbar corresponds to themaximal and minimal atom distances. X-axis: residue index; Y-axis: C ↵ atom distance in Angstroms. Fig. 6.4
The black line denotes the C α atom distance between the 3-soliton configurationand the model 1 NMR configuration 2L86; the grey region is an estimated 0.15 ˚A zero-pointfluctuation distance from the three-soliton configuration. The red line denotes the B-factorDebye-Waller fluctuation distance from model 1 of 2L86. The blue-colored points denote theaverage C α distance between the model 1 NMR structure from the average of the remaining19 models on 2L86; the error-bars denote the maximal and minimal C α distances. its precision. Following our myoglobin analysis, we proceed to investigate the properties of thethree-soliton model of 2L86 under repeated heating and cooling, using the Glauberalgorithm.The Figures 6.5 describe the evolution of the three-soliton configuration duringrepeated heating and cooling. The Figure 6.5 (top) shows the evolution of the radius ofgyration, and the Figure 6.5 (bottom) shows the RMSD distance to the PDB structure2L86. Both the average value and the one standard deviation from the average valueare shown. During the cooling period we observe only one transition, in both the radiusof gyration and the RMSD. Thus, based on our previous experience, we are confident Intrinsically disordered proteins
Monte Carlo steps
Figure 10: Radius of gyration durring Heating-Cooling. X-axis: Monte Carlo simulation step; Y-axis: RMSD in the unitof angstroms. Red line: average; Grey region: standard deviation. X-axis: simulation step; Y-axis: Radius of gyration inAngstroms. R a d i u s o f g y r a o n ( Å ) Figure 9: RMSD during Heating-Cooling. X-axis: Monte Carlo simulation step; Y-axis: RMSD in the unit of angstroms. Redline: average; Grey region: standard deviation. X-axis: simulation step; Y-axis: RMSD in Angstroms. R M S D ( Å ) Monte Carlo steps
Fig. 6.5
The top figure shows how the radius of gyration of the three-soliton configurationevolves during heating and cooling cycle. The bottom figure shows the same for the RMSDdistance from the initial configuration. The red line is the average value over all configurations,and the grey zone marks the extent of one standard deviation from the average value. TheMonte Carlo steps are displayed in multiplets of 10 . that at high temperatures we are in the random walk regime. The profile of each curvein Figure 6.5 also shows that the structures are fully thermalised, both in the hightemperature and in the low temperature regimes.We observe that the average final value of the radius of gyration R g ≈ . . | s > in the expansion (6.1),(6.3). Five of the clusters, denoted 2-6 in the Figure, have an apparent spread. Thisimplies that the energy has a flat conformational direction around its extremum. Theclusters 3 and 5 are also somewhat more scattered than the clusters 2, 4 and 6. Finally,the cluster number 1 is a localized one: This cluster corresponds to a sharply localizedsnapshot state | s > . Note that the initial conformation, marked with red triangle in eating and cooling hIAPP Figure 4: Illustration of 6 clusters. X-axis: radius of gyration; Y-axis: end-to-end distance. Red triangle denotes the positionof initial structure on Rg-Ree plane. Radius of gyra-on (Å) E nd -‐ t o -‐ e nd d i s t a n c e Fig. 6.6
The distribution of all final configurations, in a run with 1.500 full heating andcooling cycles, classified in terms of the radius of gyration and end-to-end distance of the finalconfiguration. Each blue dot represents a single final configuration. The six major clustersare encircled with a black ellipse; a wider grey ellipse around the clusters 3 and 5 includessome nearby scattered states. The red triangle identifies the initial configuration, the entry 1in 2L86. Note also the presence of a cluster encircled with yellow between clusters 1 and 2. the Figure 6.6, does not appear among the final configurations. It is apparently anunstable extremum of the energy, stabilised by the micelles.In Figure 6.7 we display the average conformations in each of the six clusters,interlaced with each other and the initial 2L86 configuration. In this Figure, the firsttwo C α atoms from the N-terminus are made to coincide. We have maximised thealignment of the subsequent C α atoms, to the extent it is possible. The Figure revealsthe presence of substantial conformational difference between the clusters. The totalityof the conformations shown in Figure 6.7 can be given an interpretation in terms ofthe dynamical hIAPP. It is a long time period average picture of the Hartree state(6.1), (6.3) that is a linear combination of the various snapshot conformations | s i > ( i = 1 , ..., α backbone of the human islet amyloid polypeptide, is quite unsettled: Its lowtemperature limit comes endowed with six different conformational clusters. This is amarked contrast with the properties of a multi-soliton configuration which models aprotein that is known to possess a unique folded native state, such as myoglobin. Thelow temperature clustering of hIAPP is in full accord with the intrinsically disordered Intrinsically disordered proteins
Figure 11: Superposition of all typical structures and the initial structure(blue). Cluster 1, gray; Cluster 2, green; Cluster 3,red; Cluster 4, brown; Cluster 5, yellow; Cluster 6, purple. Fig. 6.7
Superposition of all the six major clusters in Figure 6.6, interlaced with each otherand with the PDB entry 2L86. character of the protein: The different clusters can be viewed as instantaneous snapshotconformations, between which the dynamic hIAPP swings and sways in an apparentlyunsettled manner which is characteristic to an intrinsically disordered protein.Only the cluster number one appears different. This cluster has a much morelocalised conformational distribution than the other five clusters, and the posturecomprises of two anti-parallel helices. This proposes to us, that the cluster numberone is a good candidate to trigger the formation of hIAPP fibrils and amyloidosiswhich correlates with diabetes-II.
My advice to all my students is, be careful with your life-style to keep your hIAPPfolds under good control. eating and cooling hIAPP Figure5: Typicalstructures(red)inCluster1andtheinitialstructure(blue). Figure6:Typicalstructures(red)inCluster2andtheinitialstructure(blue). Figure7: Typicalstructures(red)inCluster3andtheinitialstructure(blue). Figure8:Typicalstructures(red)inCluster4andtheinitialstructure(blue). Figure9: Typicalstructures(red)inCluster5andtheinitialstructure(blue). Figure10:Typicalstructures(red)inCluster6andtheinitialstructure(blue). Fig. 6.8
Superposition of ten representative conformations (red) in each of the six clusters,as marked, together with the PDB entry 2L86 (blue).
Beyond C α Thus far we have analysed the protein structure and dynamics in terms of the C α atomsonly, we have argued that the virtual C α backbone bond and torsion angles form acomplete set of local order parameters to describe the protein backbone conformation.The C α atoms are in a central rˆole in x-ray crystallography, where the experimentaldetermination of protein structure often starts with a skeletonisation of the electrondensity map: From Figures 1.1, 1.7 we observe that the C α atoms are located centrally.They form the vertices that connect the peptide planes. They coincide with the branchpoints between the backbone and the side chain. Thus the C α atoms are subject tostringent stereochemical constraints. Accordingly, the first step in experimental modelbuilding is the initial identification of the skeletal C α trace.The central rˆole of the C α atoms is widely exploited in structural classificationschemes like CATH and SCOP [82, 83], in various homology modeling techniques[27, 28, 29] in de novo approaches [3], and in the development of coarse grained en-ergy functions for folding prediction [47,48]. The so-called C α -trace problem has beenformalised, and it is the subject of extensive investigations [97, 98, 99]. The aim is toconstruct an accurate main chain and/or all-atom model of the folded protein fromthe knowledge of the positions of the central C α atoms only. Both knowledge-basedapproaches such as Maxsprout [100] and de novo methods like
Pulchra [101] have beendeveloped for this purpose. In the case of the backbone atoms, various geometric al-gorithms can be utilised. For the side chain atoms, most approaches rely either on astatistical or on a conformer rotamer library in combination with steric constraints,complemented by an analysis which is based on diverse scoring functions. For the finalfine-tuning of the model, all-atom molecular dynamics simulations can be utilised.The Ramachandran map shown in Figure 1.10 is used widely both in various anal-yses of the protein structures, and as a tool in protein visualisation. It describes thestatistical distribution of the two dihedral angles φ and ψ that are adjacent to theC α carbons along the protein backbone. In the case of side chain atoms, visual anal-ysis methods similar to the Ramachandran map have been introduced. For example,there is the Janin map that can be used to compare observed side chain dihedralsin a given protein against their statistical distribution in a manner which is analo-gous to the Ramachandran map. Crystallographic refinement and validation programslike Phenix [102],
Refmac [103] and many others, utilise the statistical data obtainedfrom libraries such as Engh and Huber library [12], that are built using small molec-ular structures which have been determined with a very high resolution. At the levelof entire proteins, side chain restraints are commonly derived from analysis of highresolution PDB crystallographic structures [104]. Backbone independent rotamer li-
What-you-see-is-what-you-have” braries that make makes no reference to backbone conformation, and both secondarystructure and backbone dependent rotamer libraries have been developed. Accordingto [104] the information content in the secondary structure dependent libraries and thebackbone independent libraries essentially coincide, and both are often used duringcrystallographic protein structure model building and refinement. But for the predic-tion of side-chain conformations for example in the case of homology modelling andprotein design, it is often an advantage to use the more revealing backbone dependentrotamer libraries. We shall present a short introductory outline, how the C α Frenet frames can be utilisedto develop new generation visualisation techniques for protein structure analysis, re-finement and validation. Our outline is based on [105, 106, 107].Despite the availability of various 3D visualisation tools such as the Java-basedviewer on PDB website, thus far the visualisation of proteins has not yet taken full advantage of modern visualisation techniques. The commonly available 3D viewerspresent the protein in the ’laboratory’ frame and as such they provide mainly an ex-ternal geometry based characterisation of the protein structure. On the other hand,the method that we describe is based on internal, co-moving framing of the proteinbackbone: To watch a roller-coaster is not the same as taking the ride. As such ourapproach provides complementary visual information. Starting from the positions ofthe C α atoms, we aim for a 3D what-you-see-is-what-you-have type visual map of theall-atom structure. Indeed, the visualization of a three dimensional discrete framedcurve is an important and widely studied topic in computer graphics, from the as-sociation of ribbons and tubes to the determination of camera gaze directions alongtrajectories.In lieu of the backbone dihedral angles that appear as coordinates in the Ra-machandran map and correspond to a toroidal topology, we use the geometry of virtualtwo-spheres that surround each heavy atom. For this we employ the geometric inter-pretation of the virtual C α backbone bond and torsion angles in terms of latitude andlongitude on the surface of a sphere S . We shall outline how the approach works inthe case of the backbone N and C atoms, and the side chain C β atoms. The approachcan be easily extended to visually describe all the higher level side chain atoms on thesurface of the sphere, level-by-level along the backbone and side chains [105, 106, 107].The outcome is a 3D visual map that describes the backbone and side-chain atomsexactly in the manner how they are seen by an imaginary, geometrically determinedand C α based miniature observer who roller-coasts along the backbone: At each C α atom the observer orients herself consistently according to the purely geometricallydetermined C α based discrete Frenet frames. Thus the visualisation of all the otheratoms depends only on the C α geometry, there is no reference to the other atomsin the initialisation of the construction. The other atoms - including subsequent C α atoms along the backbone chain - are all mapped on the surface of a sphere that sur-rounds the observer, as if these atoms were stars in the sky; the construction proceedsalong the ensuing side chain, until the position of all heavy atoms have been deter-mined. This provides a purely geometric and equitable, direct visual information on Beyond C α the statistically expected all-atom structure in a given protein, based entirely on theC α trace.We start with the bond and torsion angles (4.4), (4.5) and we choose each bondangle to take values κ ∈ [0 , π ]. We identify the bond angle with the latitude angle ofa two-sphere which is centered at the C α carbon. We orient the sphere so that thenorth-pole where κ = 0 is in the direction of t . The torsion angle τ ∈ [ − π, π ) is thelongitudinal angle of the sphere. It is defined so that τ = 0 on the great circle thatpasses both through the north pole and through the tip of the normal vector n . Thelongitude angle increases towards the counterclockwise direction around the vector t .We find it also useful, to introduce the stereographic projection of the sphere onto theplane. The standard stereographic projection from the south-pole of the sphere to theplane with coordinates ( x, y ) is given by x + iy ≡ r e iτ = tan ( κ/ e iτ (7.1)This maps the north-pole where κ = 0 to the origin ( x, y )=(0 , κ = π is sent to infinity; see Figure 7.1 If need be, the visual effects of the ( (cid:2) , (cid:1) ) (cid:2) = (cid:3) (cid:2) = Fig. 7.1
Stereographic projection of two-sphere S on the plane R , from the southpole. projection can be enhanced by sending κ → f ( κ )where f ( κ ) is a properly chosen function of the latitude angle κ . α map We first explain how to visually describe the C α trace in terms of the C α Frenet frames(4.1)-(4.3). Consider the virtual miniature observer who roller-coasts the backboneby moving between the C α atoms. At the location of each C α the observer has anorientation that is determined by the Frenet frames (4.1)-(4.3). The base of the i th tangent vector t i is at the position r i . The tip of t i is a point on the surface of the sphere( κ, τ ) that surrounds the observer and points towards the north-pole. The vectors n i and b i determine the orientation of the sphere. These vectors define a frame on thenormal plane to the backbone trajectory, as shown in Figure 4.1. The observer mapsthe various atoms in the protein chain on the surface of the surrounding two-sphere,as if the atoms were stars in the sky. What-you-see-is-what-you-have” The map of the C α backbone is constructed as follows. The observer first translatesthe center of the sphere from the location of the i th C α , all the way to the location ofthe ( i + 1) th C α with no rotation of the sphere with respect to the i th Frenet frames.The observer then identifies the direction of t i +1 , i.e. the direction towards the site r i +2 to which she proceeds from the next C α carbon, as a point on the surface of thesphere. This determines the corresponding coordinates ( κ i , τ i ). After this, the observerre-defines her orientation so that it matches the Frenet framing at the ( i + 1) th centralcarbon, and then proceeds by repeating the construction, exactly in the same manner.The ensuing map, over the entire backbone, gives an instruction to the observer ateach point r i , how to turn at site r i +1 , to reach the ( i + 2) th C α carbon at the point r i +2 . Fig. 7.2
The stereographically projected Frenet frame map of backbone C α atoms, withmajor secondary structures identified. Also shown is the directions of the Frenet frame normalvector n ; the vector t corresponds to the red circle at the center, and it points away fromthe viewer. The map is constructed using all PDB structures that have been measured withbetter than 2.0 ˚A resolution. In figure 7.2 we show the C α Frenet frame backbone map. It describes the statisticaldistribution that we obtain when we plot all PDB structures which have been measuredwith better than 2.0 ˚A resolution, and using the stereographic projection (7.1). For ourobserver, who always fixes her gaze position towards the north-pole of the surroundingtwo-sphere at each C α i.e. towards the red dot at the center of the annulus, the colorintensity in this map reveals the probability of the direction at position r i , wherethe observer will turns at the next C α carbon, when she moves from r i +1 to r i +2 . Inthis way, the map is in a direct visual correspondence with the way how the Frenetframe observer perceives the backbone geometry. Note how the probability distributionconcentrates within an annulus, roughly between the latitude angle values κ ∼ κ ∼ /
2. The exterior of the annulus is a sterically excluded region while the entireinterior is in principle sterically allowed but very rarely occupied in the case of folded Beyond C α proteins. In the figure we identify the four major secondary structure regions, accordingto the PDB classification; the α -helices, β -strands, left-handed α -helices and loops. β maps Consider our imaginary miniature observer, located at the position of a C α atom andoriented according to the discrete Frenet frames. She proceeds to observe and recordthe backbone heavy atoms N, C and the side-chain C β atoms that are covalentlybonded to C α . These atoms form the covalently bonded heavy-atom corners of the C α centered sp α centered sphere, the way how they are seen bythe miniature observer. These figures are constructed from all the PDB entries thathave been measured using diffraction data with better than 1.0 ˚A resolution. !" !" L- α %" $" %" α L- α &" C C β N F i g . . a ) D i s t r i bu t i o n o f C β a t o m s i n t h e C α ce n t e r e d F r e n e t f r a m e s i n P D B s t r u c t u r e s m e a s u r e d w i t hb e tt e r t h a n . ˚A r e s o l u t i o n . T h e t h r ee m a j o r d o m a i n s α - h e li ce s , β - s t r a nd s a nd l e f t - h a nd e d α - h e li ce s h a v e b ee n m a r k e d . b ) S a m e a s a ) bu t f o r b a c k b o n e C a t o m s . N o t e t h a t c i s - p r o li n e s a r e a l s o c l e a r l y i d e n t i fi a b l e . c ) S a m e a s a ) bu t f o r b a c k b o n e N a t o m s . d ) A c o m b i n a t i o n o f d i s t r i bu t i o n s i n a ) - c ) . I U C r m a c r o s v e r s i o n . . : Fig. 7.3 a) Distribution of C β atoms in the C α centered Frenet frames in PDB structuresmeasured with better than 1.0 ˚A resolution. The three major domains α -helices, β -strandsand left-handed α -helices have been marked. b) Same as a) but for backbone C atoms. Notethat cis -prolines are also clearly identifiable. c) Same as a) but for backbone N atoms. d) Acombination of distributions in a)-c). As visible in Figures 7.3 in the C α entered Frenet frames the C β , C and N atomseach oscillate in a manner that depends on the local secondary structure. Note thatboth in the case of C β and N, the left-handed α region (L- α ) is distinctly detached fromthe rest. But in the case of C the L- α region is connected with the other regions. Onthe other hand, in the case of N the cis -prolines form a clearly detached and localizedregion, which is not similarly visible in the case of C and C β .We now consider the three bond angles What-you-see-is-what-you-have” ϑ NC (cid:39) N − C α − C ϑ N β (cid:39) N − C α − C βϑ β C (cid:39) C β − C α − C (7.2)The ϑ NC angle relates to the backbone only, while the definition of the other twoinvolves the side chain C β . In experimental protein structure validation these threeangles are often presumed to have their ideal values. For example, the deviation of theC β atom from its ideal position is among the validation criteria used in Molprobity (1.4) that uses it in identifying potential backbone distortions around C α .In Figure 7.4 we show the distribution of the three tetrahedral bond angles in our1.0 ˚A PDB data set. We find that in the case of the two side chain related angles ϑ N β and ϑ β C the distribution displays a single peak which is compatible with the idealvalues; the isolated small peak in Figure 7.4 b) is due to cis -prolines. But in the caseof the backbone specific angle ϑ NC we find that this is not the case. The PDB datashows a clear correlation between the ϑ NC distribution and the backbone secondarystructure. We remind that ϑ NC pertains to the relative orientation of the two peptideplanes that are connected by the C α . Fig. 7.4
Distribution of the three angles (7.2) according to secondary structures. Blue are α -helices, red are β -strands and yellow are loops; the small (yellow) peak in N-C α -C β withangle around 103 o is due to prolines. The two Ramachandran angles ( φ, ψ ) in Figure 1.10 are directly adjacent to thegiven C α , and each is specific to a single peptide plane; see Figure 1.7. But this isnot the case of ϑ NC which connects two peptide planes. This angle contributes to thebending of the backbone. Consequently a systematic secondary structure dependence should be present in this angle. Due to the very rigid structure of the C α centered Beyond C α sp ϑ N β and ϑ β C . Thelack of any observable secondary structure dependence in the experimental data ofthese two angles proposes, that existing validation methods distribute the refinementtension entirely to ϑ NC . Research project: Can you explain in detail, whysecondary structure dependence is absent inPDB data of ϑ N β and ϑ β C . The construction we have presented, can be continued to visualise all the higherlevel side chain atoms in a protein, beyond C β [105, 106, 107]. It turns out that thesehigher level atoms also display a highly organised, systematic pattern akin those shownin Figure 7.3. In particular, the underlying geometry of the C α backbone is clearlyvisible in these side chain atoms; there is a very strong coupling between the C α backbone geometry and the side chain geometry. Accordingly it becomes possible, inprinciple, to highly accurately determine the all-atom structure of the protein fromthe knowledge of the C α atom positions only. This enables one to extend our C α basedapproach that builds on (4.17), to construct the full all-atom structure of the entireprotein: The C α positions can be computed from the energy function (4.17), and theremaining atoms are located by s statistical analysis.According to Figure 7.3 a) in the intrinsic, purely geometric Frenet frame coordi-nate system the directions of the C β atoms are only subject to small fluctuations. Atthe level of C β , the side chains all point to essentially the same direction. Thus thecrystallographic protein structures are like a spin chain where the side chains are thespin variables, and the collapsed phase bears resemblance to a ferromagnetic phase. Research project: See how far you get using spin chain analogy of folded proteins, with sidechains as the spin variables. T h a t ’ s a l l , f o l k s eferences [1] E. Schr¨odinger, ”What is Life?” The Physical Aspect of the Living Cell Cambridge UniversityPress (1948)[2] P. W. Anderson,
Science
393 (1972)[3] K. Dill, S.B. Ozkan, T.R. Weikl, J.D. Chodera, V.A. Voelz,
Current Opinions in Structural Bi-olology
342 (2007)[4] K.A. Dill and J.K. MacCallum,
Science
Annual Reviews in Biochemistry
333 (2006)[6] J. Davies,
Nature
219 (1996)[7] C.T. Walsh,
Nature
775 (2000)[8] M.A. Fischbach, C.T. Walsh,
Science th Edition
Garland Science (2014)[10] L. Frappat, P. Sorba, A. Sciarrino,
Physics Letters
B250
214 (1998)[11] R.J. Read et.al. Structure International Tables for Crystallography , Vol. F , 382; edited by M. G.Rossmann and E. Arnold (Kluwer Academic Publishers, Dordrecht, 2001)[13] R. Joosten et.al. Journal of Applied Crystallography Nature
Journal of the International Unionof Crystallography IUCrJ Cornell University Press, Ithaca (1979)[17] L. Sch¨afer, Excluded Volume Effects in Polymer So- lutions, as Explained by the RenormalizationGroup
Springer Verlag, Berlin (1999)[18] L.P. Kadanoff,
Physics
263 (1966)[19] K.G. Wilson,
Physical Review B4 Physics Reports The Journal of Chemical Physics
440 (1941)[22] P.J. Flory,
The Journal of Chemical Physics Journal of Statistical Physics Physics Letters
Physical Review
B21
The Journal of Chemical Physics
Annual Review ofBiophysics and Biomolecular Structure Nucleic Acids Research Current Opinions in Structural Biology
145 (2009)[30] C. Chothia,
Nature [32] http://ambermd.org/ [33] [34] E. Ott, Chaos in Dynamical Systems.
Cambridge University Press, New York (2002)[35] V.E. Korepin, N.M. Bogoliubov, A.G. Izergin, Quantum Inverse Scattering Method and Corre-lation Functions,
Cambridge university press, Cambridge (1997)[36] D. E. Shaw, et al. , Communications of the ACM et al. , High Performance Computing Networking, Storage and Analysis, Proceedingsof the Conference
IEEE (2009)[38] K. Lindorff-Larsen, S. Piana, R. O. Dror, D.E. Shaw.
Science
The Journal of Chemical Physics
511 (1984)[41] W.G. Hoover,
Physical Review
A31
The Journal of Chemical Physics References [43] Y. Liu, M.E. Tuckerman,
The Journal of Chemical Physics
Theoretical and Mathematical Physics
TheJournal of Chemical Physics Journal of Com-putational Chemistry Annual Reviews in Physical Chemistry
57 (2007)[48] A. Liwo, A.J. Niemi, X. Peng, A.K. Sieradzan, A novel coarse-grained description of proteinstructure and folding by UNRES force field and Discrete Nonlinear Schr¨odinger Equation, in
Frontiers in Computational Chemistry (Bentham Science Publishers) (2014)[49] N. N. G¯o,
Annual Review of Biophysics and Bioengineering Physical Review Letters Physical Review Letters Nuclear Physics
B190
Physics Reports
Solid state communications Nature Physics Physical Review D7 Journal of High Energy Physics
014 (2008)[58] A.J. Niemi,
Physical Review
D67
The American Mathematical Monthly
246 (1975)[60] S. Hu, M. Lundgren, A.J. Niemi
Physical Review
E83 (Springer Verlag,Berlin, 1987) [62] M.J. Ablowitz, B. Prinari, A.D. Trubatch, Discrete and Continuous Nonlinear Schr¨odinger Sys-tems (London Mat. Soc. Lect. Note Series , London, 2003) [63] A.M. Polyakov,
Nuclear Physics
B268
406 (1986)[64] O. Kratky, G. Porod,
Recueil des Travaux Chimiques des PaysBas (Springer-Verlag, Berlin, 2009) [66] N. Manton, P. Sutcliffe, Topological Solitons (Cambridge University Press, Cambridge, 2004) [67] S. Hu, Y. Jiang, A.J. Niemi, Physical Review
Physical Review
D90 (Kluwer Academic Publishers, Dordrechts,1990) [70] V.I. Arnold,
Russian Mathematical Surveys American Mathematical Society Translations
11 (1996)[72] F. Aicardi,
Functional Analysis and Applications L’Enseignement Math´ematique Physical Review
E82
Physical Review
E82
Physical Review Letters
Physical Review
E83
Physical Review
E85
Physical Review
E86
The Journal of chemical physics
Journal of Physics: Condensed Matter [83] http://scop.mrc-lmb.cam.ac.uk/scop/ [84] R.J. Glauber, Journal of Mathematical Physics
294 (1963)[85] A.B. Bortz, M.H. Kalos, J.L. Lebowitz,
Journal of Computational Physics
10 (1975)[86] J. Skolnick, A.K. Arakaki, Y.L. Seung, M. Brylinski,
Proceeding of National Academy of SciencesUSA
Cold Spring HarborLaboratory Press, Cold Spring Harbor (2004)[88] M.E. Gottesman, R.A. Weisberg,
Microbiology and Molecular Biology Reviews
796 (2004)[89] P.A. Jennings, P.E. Wright,
Science
892 (1993) eferences [90] Y. Moriyama, K. Takeda, Journal of Physical Chemistry
B114
Bioscience, Biotechnology,and Biochemistry Physical Review
E83
Soft Matter Physiological Reviews
795 (2011)[95] K. Pillay, P. Govender,
BioMed Research International
The Journal of Chemical Physics (to appear)[97] L. Holm, C. Sander,
Journal of Molecular Biolology
183 (1991)[98] S.C. Lovell, I.W. Davis, W.B. Arendall III, P.I.W. de Bakker, J.M. Word, M.G. Prisant, J.S.Richardson, D.C. Richardson, Proteins
437 (2003)[99] P. Rotkiewicz, J. Skolnick,
Journal of Computational Chemistry [101] http://cssb.biology.gatech.edu/PULCHRA [102] [103] [104] R.L. Dunbrack Jr., Current Opinions in Structural Biology
431 (2002)[105] M. Lundgren, A.J. Niemi, F. Sha,
Physical Review
E85
Physical Review
E86