Structural Learning for Template-free Protein Folding

By Feng Zhao
Submitted to: Toyota Technological Institute at Chicago

Jinbo Xu (Thesis Supervisor)    Signature:    Date: 10/25/2013
David McAllester    Signature:    Date: 10/25/2013
Jie Liang    Signature:    Date: 10/25/2013
Tobin Sosnick    Signature:    Date: 10/25/2013

Abstract
This thesis aims to advance template-free protein folding by tackling two key components: efficient sampling of the vast conformation space, and the design of accurate knowledge-based potentials. We propose the first-order and second-order CRF-Sampler, which samples structures from the continuous space of local dihedral angles by modeling lower- and higher-order conditional dependencies between neighboring dihedral angles given the primary sequence. A framework combining Conditional Random Fields with an energy function is introduced to guide local conformation sampling with long-range constraints. Because the relationship between the sequence profile and the local dihedral angle distribution is nonlinear, we further propose CNF-Folder, which models this complex relationship with Conditional Neural Fields, a novel machine learning model that couples a structured graphical model with a neural network. CRF-Sampler and CNF-Folder performed very well in CASP8 and CASP9. Finally, a novel pairwise distance statistical potential (EPAD) is designed to capture the dependency of the energy profile on the positions of the interacting amino acids as well as their types, in contrast to the common assumption that this profile depends only on amino acid types. EPAD was also applied successfully, together with CNF-Folder, in the CASP10 Free Modeling experiment, performing especially well on some targets with uncommon structures.
Acknowledgements
It was when I started to write this acknowledgement that I realized I owe my gratitude to many people who have made this thesis possible. My deepest gratitude is to my advisor, Jinbo Xu, whose guidance, support, advice and long discussions have enabled me to gain a profound understanding throughout my PhD studies. It was from him that I learned the spirit of research, which I believe will benefit my whole life, and which I will cherish forever. I would like to thank my committee members David McAllester, Tobin Sosnick and Jie Liang for all their support and suggestions. My heartfelt thanks also go to all the professors from TTI-C and the University of Chicago who have taught me: David McAllester, Nati Srebro, Tamir Hazan, Raquel Urtasun, Julia Chuzhoy, Mattias Blume, Karen Livescu, Greg Shakhnarovich, Anastasios Sidiropoulos, Yury Makarychev, Yang Shen, Laszlo Babai, Tobin Sosnick, Paul J. Sally, and Gregory F. Lawler; thank you for your encouragement and for sharing your wealth of knowledge across many subjects and courses. It has been a pleasure to study at TTI-C over the years. My fellow students are cordially thanked for their friendship: Jian Peng, Karthik Sridharan, Andrew Cotter, Zhiyong Wang, Avleen Bijral, Payman Yadollahpour, Jianzhu Ma, Sheng Wang, Joe DeBartolo, Aashish Adhikari, Raman Arora, Qingming Tang, Taewhan Kim, Jian Yao, Hao Tang, Somaye Hashemifar and Behnam Tavakoli Neyshabur. Thank you for your good company and support. Research reported in this thesis was supported by National Science Foundation grant DBI-0960390 and National Institutes of Health grant R01GM089753. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF or NIH. This work was made possible by the computing resources and facilities of TeraGrid, Beagle, SHARCNET and the Open Science Grid (OSG). Finally, special thanks to my wife Wentao, my parents and my beloved Ayla.
It was because of them that my work became meaningful.

Contents

Abstract
Acknowledgements
Chapter 1. Introduction
  1.1 Conformation Sampling
  1.2 Energy function
  1.3 Contribution
  1.4 Organization
Part 1. Probabilistic and Continuous Models of Protein Conformational Space for Fragment-free and Template-Free Modeling
Chapter 2. Continuous representation of protein conformations
  2.1 Cα-trace representation
  2.2 Distribution of bond angles
  2.3 Building backbone atoms
  2.4 Mathematical Symbols
Chapter 3. First-order CRF-Sampler
  3.1 Introduction
  3.2 CRF model for sequence-structure relationship
  3.3 Conformation sampling algorithm
  3.4 Experiments and Results
    Data set
    Label assignment and distribution parameters
    Parameter tuning
    Comparison with the HMM model
    Comparison with Xia et al.
    Comparison with Rosetta
    Computational efficiency
  Cα distance distribution using probabilistic neural network (PNN)
  Model parameter training
  Training and validation data
  Estimating inter-atom distance distribution for non-Cα main chain atoms
  Estimating the reference state
  Window size and the number of neurons in the hidden layers
  6.3 Results and Discussion
    Distance dependence of the statistical potentials
    Performance on decoy discrimination
    Performance on the 2007 Rosetta dataset
    Performance on template-based models
    Performance on the CASP9 models
    Performance on the Decoy 'R' Us dataset
    Performance on the I-TASSER dataset
    Performance on the CASP5-8 dataset
    Is our probabilistic neural network (PNN) model over-trained?
    Folding Results with EPAD in the blind CASP10 experiment
  6.4 Conclusion
Chapter 7. Conclusion and Future works
  7.1 Conclusions
  7.2 Future Works
References

List of Figures
Figure 1 Yearly growth of total structures in RCSB Protein Data Bank (PDB).
Figure 2 The growth of the SwissProt database.
Figure 3 The 6 Ramachandran basins: β (blue), poly-proline II, PPII (green), αR (red), αL (magenta) and ε (grey). Color intensity reflects the (ϕ, ψ) occupancy. Adapted from (Colubri et al., 2006)
Figure 4 Backbone (ϕ, ψ)-space subdivided into 36 alphabetically labeled, 60° × 60° grid regions. Adapted from (Gong et al., 2005)
Figure 5 The ϕ, ψ, ω dihedral angles in one residue of the protein backbone. ω can be assumed to be fixed at 180° (trans) or 0° (cis).
Figure 6 Three point sets sampled from the FB5 distribution on the unit sphere.
Figure 7 An example CRF model for protein conformation sampling.
Figure 8 Native structures (in orange) and the best decoys of 1FC2, 1ENH, 2GB1, 2CRO, 1CTF, and 4ICB.
Figure 9 A second-order Conditional Random Fields (CRF) model of protein conformation space. Each backbone angle state depends on a window (size 9) of sequence profiles and secondary structure and also the states in its local neighborhood.
Figure 10 A 1st-order CNF model consists of three layers: input, output and hidden layer.
Figure 11 The relationship between RMSD (y-axis) and energy (x-axis) for T0397_D1, T0416_D2, T0476, T0482, T0496_D1 and T0510_D3.
Figure 12 Two typical mirror images generated by the CRF method for T0416_D2.
Figure 13 Ranking of our CNF predictions for T0416_D2, T0476, T0496_D1 and T0510_D3.
Figure 14 Distribution of pairwise Cα distances in a representative set of protein structures.
Figure 15 An example probabilistic neural network, in which X_i and X_j are the sequence profile contexts centered at the i-th and j-th residues, respectively. H1 and H2 are the neurons in the 1st and 2nd hidden layers.
Figure 16 Inter-residue contact prediction accuracy of our PNN and the top 12 (human and server) groups in CASP8 and CASP9.
Figure 17 Distance dependence of DOPE and our potential EPAD.
Figure 18 RaptorX-roll's results on targets T0734 (A) and T0740 (B).
List of Tables

Table 1 Some Mathematical Symbols Used in the CRF Models
Table 2 F1-Values (%) of CRF-Sampler with Respect to Window Size
Table 3 Decoy Quality Comparison between the HMM Model and CRF-Sampler
Table 4 Decoy Quality Comparison between the HMM Model and CRF-Sampler
Table 5 Secondary Structure Content of Good Decoys
Table 6 Decoy Quality Comparison between Xia et al. (Xia et al., 2000) and CRF-Sampler
Table 7 Percentage of correct secondary structure (Q3-value) and secondary structure content of all the decoys generated by CRF-Sampler, compared with PSIPRED predictions
Table 8 Decoy Quality Comparison between ROSETTA and CRF-Sampler
Table 9 Approximate Running Time in Minutes Spent by CRF-Sampler in Generating 100 Decoys
Table 10 Quality Comparison of the Decoys Generated by the 1st-Order and 2nd-Order CRF Models
Table 11 Quality Comparison of the Decoys Generated by FB5-HMM, the 1st-Order CRF, the 2nd-Order CRF, and Rosetta
Table 12 Performance Comparison between CRFFolder and Skolnick's TOUCHSTONE-II
Table 13 Performance Comparison (TM-Score) between CRFFolder and Robetta on Some CASP8 Hard Targets
Table 14 Summarized results of RAPTOR++ predictions of free modeling targets in CASP8
Table 15 Secondary structure prediction accuracy comparison with the current top predictors
Table 16 Performance of the CNF and CRF methods on the old test set
Table 17 Performance of our CNF and CRF methods on the CASP8 test set
Table 18 Clustering result of the 12 CASP8 free-modeling targets
Table 19 Cα–Cα distance prediction accuracy with respect to the window size
Table 20 Performance of EPAD and several popular full-atom statistical potentials on the Rosetta decoy set
Table 21 Performance of EPAD and several popular statistical potentials on the Rosetta decoy sets when only Cα atoms are considered
Table 22 Performance of EPAD and several popular statistical potentials on the template-based models
Table 23 Performance of statistical potentials with respect to the hardness of the CASP9 targets
Table 24 Performance of several statistical potentials on the Decoy 'R' Us dataset
Table 25 Performance of several statistical potentials on the I-TASSER dataset
Table 26 Performance of a variety of potentials on the selected CASP5-8 models
Table 27 Folding comparison with Rosetta (GDT) on CASP9 FM targets
Table 28 Comparison with threading method RaptorX-ZY on CASP10 FM targets
Chapter 1. Introduction
As shown in Figure 1, the number of solved 3D protein structures in the RCSB Protein Data Bank (PDB) had grown to 84,223 as of August 28, 2012. Meanwhile, genome sequencing projects have led to the identification of tens of millions of protein sequences publicly available in the NCBI non-redundant database; the high-quality, annotated, non-redundant protein sequence database SwissProt contains 536,789 sequences at Release 2012_07 (Figure 2). That so many protein sequences are publicly known while so many of their structures remain unsolved means we still have a long way to go to understand how proteins work and function within the body. Many computational methods have been developed to predict the structure of a protein from its primary sequence, based on the famous Anfinsen dogma that all of the information necessary for a protein to fold to its native structure resides in the protein's amino acid sequence (Anfinsen, 1973). These methods can be roughly classified into two categories: template-based and template-free modeling. Despite the considerable progress of the past decade, template-free modeling remains one of the unsolved problems of computational structural biology. The primary challenges lie in two areas: the vast conformation space to be searched and the limited accuracy of current energy functions. The purpose of this thesis is to explore these two topics: first, protein conformation sampling, that is, the exploration of the conformational space corresponding to a particular protein sequence; second, the design of an energy function, that is, an accurate physics-based or knowledge-based potential to quantify interactions among residues or atoms.

Figure 1 Yearly growth of total structures in RCSB Protein Data Bank (PDB). Adapted from
Figure 2 The growth of the SwissProt database.
Adapted from http://web.expasy.org/docs/relnotes/relstat.html
Template-free modeling methods based on fragment assembly (Bowie and Eisenberg, 1994; Claessens et al., 1989; Jones and Thirup, 1986; Levitt, 1992; Simon et al., 1991; Sippl, 1993; Unger et al., 1989) and lattice models (Kihara et al., 2001; Wendoloski and Salemme, 1992; Xia et al., 2000; Zhang et al., 2003) have been studied extensively. These two common approaches, and their combined use, have produced impressive results in the CASP (Critical Assessment of Structure Prediction) competitions (Moult, 2005; Moult et al., 2007; Moult et al., 2005; Moult et al., 2003). The popular fragment assembly program Robetta (Misura et al., 2006; Simons et al., 1997) has the best track record of all the template-free modeling programs. Both TASSER (Zhang and Skolnick, 2005a) and its derivative Zhang-Server (Wu et al., 2007a) have garnered recognition for stellar results in CASP7 and CASP8 through the combination of threading-generated fragments, distance restraints, and lattice modeling. Fragment-based protein structure prediction takes place in two distinct steps. First, the protein sequence is cut into short segments, and dozens of candidate fragments are selected for each segment. Then, a simulation or search algorithm assembles these fragments into a full protein structure. While the results attained by existing template-free modeling methods are exciting, several concerns remain unresolved. One stems from the small pool of proteins with solved experimental structures in the PDB, which makes it almost impossible to assemble a fragment collection covering all the ways a protein can locally conform, particularly in the loop regions, where the possibilities multiply. Indeed, a new fold may consist entirely of rarely occurring super-secondary structure motifs found nowhere else in the Protein Data Bank.
Second, because the conformation space that a lattice model or fragment library defines is by nature discrete, it may prevent the native fold from ever being searched, since even a slight change in backbone angles can result in a totally different fold. It is possible to make the conformation space continuous, as Bystroff et al. have shown with HMMSTR (Bystroff et al., 2000), a Hidden Markov Model (HMM) trained from a fragment library to generate local conformations for a given sequence. Both Colubri et al. (Colubri et al., 2006) and Gong et al. (Gong et al., 2005) have analyzed the backbone structure of proteins to see whether that structure can be reconstructed under the assumption that the backbone angles are restricted to their native Ramachandran basins (a native Ramachandran basin is the region that holds the values of the native backbone (ϕ, ψ) torsion angle pairs). Sosnick and coworkers (Colubri et al., 2006) divided the torsion angle space into six distinct basins (see Figure 3), while Gong et al. (Gong et al., 2005) split the space into 36 distinct regions (see Figure 4). The upshot of both studies is that the backbone structures of many small proteins can be reconstructed with good accuracy if the angles are limited to their native basins. These studies are not pure ab initio folding; rather, they show that if researchers can come up with a reasonably close guess of each angle's native basin, then it should also be possible to predict a protein's backbone structure accurately. Fragment-HMM (Li et al., 2008), a variant of Robetta, can sample conformations within a continuous space. However, because its Hidden Markov Model is constructed from 9-mer fragments, it still suffers from the coverage issue that plagues many other methods.
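As a concrete illustration of such a discretization, the sketch below maps a (ϕ, ψ) pair to one of 36 grid cells of 60° × 60°. The row-major alphabetical labeling is an assumption made here for illustration; the published convention of Gong et al. may differ.

```python
def grid_region(phi, psi, cell=60.0):
    """Map a backbone (phi, psi) pair, in degrees within [-180, 180),
    to one of 36 alphabetically labeled 60 x 60 degree grid regions.
    The row-major A..Z,a..j labeling here is illustrative only."""
    col = int((phi + 180.0) // cell)   # 0..5 along phi
    row = int((psi + 180.0) // cell)   # 0..5 along psi
    labels = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghij"
    return labels[row * 6 + col]

# A typical alpha-helical pair (phi, psi) = (-60, -45) falls in one cell;
# nudging psi by a few degrees can already cross into a neighboring cell,
# which is exactly the discreteness problem discussed above.
print(grid_region(-60.0, -45.0), grid_region(-60.0, -62.0))
```

The second call shows how a small perturbation hops between discrete regions, whereas a continuous representation would move smoothly.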
The TOUCHSTONE programs (Kihara et al., 2001; Zhang et al., 2003), which use the lattice model, do not have the coverage issue, but they sample protein conformations from a three-dimensional lattice of only finite resolution. More importantly, the sampled conformations might not even have local structures resembling those of native proteins, because TOUCHSTONE does not sample a conformation based upon the primary sequence of a protein. Rather, it relies on several short-range statistical potentials in the energy function to steer sampling toward native-like local structure. Some other methods sample protein conformations in a continuous space probabilistically, where the probability of each conformation approximates its overall stability and is estimated from sequence information. Feldman and Hogue developed the FOLDTRAJ program (Feldman and Hogue, 2002), which implements a probabilistic all-atom conformation sampling algorithm. FOLDTRAJ was tested on three small proteins, 1VII, 1ENH, and 1PMC, with 100,000 decoys generated for each; the best models achieve 3.95, 5.12 and 5.95 Angstroms in terms of RMSD from the native structures. However, because FOLDTRAJ uses neither nearest-neighbor effects nor the sequence profile in modeling the relationship between sequence and structure, it cannot generate models matching the quality of those from the popular fragment assembly program Rosetta.
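The RMSD figures quoted above are computed after optimal superposition of a decoy onto the native structure. A minimal sketch of that computation, using the standard Kabsch algorithm with NumPy, might look like:

```python
import numpy as np

def rmsd(P, Q):
    """RMSD between two N x 3 coordinate arrays after optimal
    superposition (Kabsch algorithm)."""
    P = P - P.mean(axis=0)              # remove translation
    Q = Q - Q.mean(axis=0)
    V, S, Wt = np.linalg.svd(P.T @ Q)   # SVD of the covariance matrix
    d = np.sign(np.linalg.det(V @ Wt))  # guard against a reflection
    R = V @ np.diag([1.0, 1.0, d]) @ Wt # optimal rotation
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

Because the superposition removes rigid-body motion, a decoy that differs from the native structure only by rotation and translation scores an RMSD of zero.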
Figure 3 The 6 Ramachandran basins: β (blue), poly-proline II, PPII (green), αR (red), αL (magenta) and ε (grey). Color intensity reflects the (ϕ, ψ) occupancy. Adapted from (Colubri et al., 2006)

Figure 4 Backbone (ϕ, ψ)-space subdivided into 36 alphabetically labeled, 60° × 60° grid regions. Adapted from (Gong et al., 2005)
Hamelryck et al. have developed two HMM models, FB5-HMM (Hamelryck et al., 2006) and TorusDBN (Boomsma et al., 2008), to sample protein conformations from a continuous space. These models not only capture the relationship between backbone angles and the primary sequence, but also consider the angle dependency between two adjacent residues. FB5-HMM uses a Hidden Markov Model to discover the native basins of the local (θ, τ) pseudo backbone dihedral angles of a protein sequence by learning the local sequence-structure relationship, while TorusDBN learns the local (ϕ, ψ) backbone dihedral angle dependency using a dynamic Bayesian network, a generalization of the Hidden Markov Model. These methods have two particular advantages: first, they model the backbone angles with directional statistics distributions (Kent, 1982; Mardia et al., 2007; Singh et al., 2002), so that angles can be sampled from a continuous space; second, they capture how two adjacent backbone angles depend on one another. Experiments show that, by modeling the dependency between two adjacent positions, FB5-HMM can generate more conformations close to the native structure than methods that ignore the correlation between neighboring residue positions. Quite a few fragment assembly methods (Kent, 1982; Mardia et al., 2007; Singh et al., 2002; Tuffery and Derreumaux, 2005) also exploit the dependency between two adjacent fragments, and this dependency has proven to be one of the most important factors in conformation sampling. Boomsma et al. also demonstrated that TorusDBN can generate local conformations as accurately as the fragment assembly method (Boomsma et al., 2008). However, these models only consider angle dependency between two neighboring residues.
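The key idea of these directional models, emitting angles from a continuous distribution rather than picking them from a fragment library, can be sketched with NumPy's von Mises sampler. This is only the circular analogue (the real FB5-HMM uses the FB5 distribution on the unit sphere), and the state parameters below are invented for illustration, not fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hidden-state emission parameters: each state emits a
# (theta, tau) pseudo-angle pair from von Mises distributions.
# Means and concentrations here are made up, not trained.
STATES = {
    "helix-like":  {"mu": (1.57, 0.87),  "kappa": 50.0},
    "strand-like": {"mu": (2.09, -2.97), "kappa": 20.0},
}

def sample_angles(state, n):
    """Draw n (theta, tau) pairs, in radians, from the given state."""
    p = STATES[state]
    theta = rng.vonmises(p["mu"][0], p["kappa"], n)
    tau = rng.vonmises(p["mu"][1], p["kappa"], n)
    return np.column_stack([theta, tau])

angles = sample_angles("helix-like", 5)
print(angles.shape)  # (5, 2): five continuous (theta, tau) samples
```

Because the samples are continuous, every draw differs slightly from every other, in contrast to a fragment library where the same discrete angle values recur.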
Because of this restriction, these models cannot exploit richer forms of sequence information, such as PSI-BLAST sequence profiles or threading-generated constraints, so their sampling efficiency is quite limited. Furthermore, these models had not yet been tested in real-world template-free modeling. Due to the limited expressive power of the HMM, FB5-HMM also assumes that each residue is independent of its secondary structure type, and that each backbone angle depends only on the residue (monomer, or 1-mer) and secondary structure at the same position, although it actually depends on at least three neighboring residues (3-mer). Zhao et al. implemented an extensible protein conformation sampler, CRF-Sampler (Zhao et al., 2008; Zhao et al., 2009), based on Conditional Random Fields, a probabilistic graphical model. CRF-Sampler incorporates into the backbone angle model the dependence among up to 9 neighboring residues, together with sequence profile information, using continuous distributions. CRF-Sampler is built with a discriminative learning method: it models the dependence of the unobserved variables on the observed variables through the conditional probability distribution P(backbone angles | protein sequence), and predicts (θ, τ) from the sequence information. CRF-Sampler can learn over 10,000 parameters that quantify the relationships among the backbone angles, the primary sequence, and the secondary structure. Using just self-avoidance and compactness constraints, CRF-Sampler can sample conformations from the primary sequence and the predicted secondary structure. CRF-Sampler is also highly flexible: a variety of model topologies and feature sets can be defined to model the sequence-structure relationship without worrying about parameter estimation. Experimental results show that the first-order CRF-Sampler, using a small set of features, generates decoys of much higher quality than the FB5-HMM model. The second-order CRF model can capture more complicated dependencies among the dihedral angles of the local sequence. When the second-order CRF-Sampler is combined with a simple energy function of three terms, a distance potential, a hydrogen-bonding potential and a hydrophobic potential, it compares favorably with the fragment assembly method in the blind CASP8 evaluation, especially on alpha or small beta proteins. This is the first probabilistic method that searches conformations in a continuous space and achieves such favorable performance. In addition, for several of the hard targets in CASP8, this method produced three-dimensional models more accurate than those of template-based methods.
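The sampling machinery behind such a model can be illustrated with a toy discrete first-order linear-chain CRF: forward filtering followed by backward sampling draws whole label paths from P(y | x). The real CRF-Sampler works over continuous angle distributions with thousands of learned features; the potentials below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

L, K = 8, 4                   # residues, discrete angle states
node = rng.random((L, K))     # node potentials (stand-ins for learned features)
edge = rng.random((K, K))     # transition potentials between adjacent states

def sample_path(node, edge, rng):
    """Sample one state path from the chain CRF by forward filtering,
    backward sampling."""
    L, K = node.shape
    alpha = np.empty((L, K))
    alpha[0] = node[0]
    for i in range(1, L):                    # forward pass
        alpha[i] = node[i] * (alpha[i - 1] @ edge)
    y = np.empty(L, dtype=int)
    p = alpha[-1] / alpha[-1].sum()
    y[-1] = rng.choice(K, p=p)
    for i in range(L - 2, -1, -1):           # backward sampling
        w = alpha[i] * edge[:, y[i + 1]]
        y[i] = rng.choice(K, p=w / w.sum())
    return y

print(sample_path(node, edge, rng))  # one sampled state sequence
```

Each call draws an entire length-L path in O(LK²) time, which is why chain-structured CRFs permit efficient conformation sampling despite the exponential number of possible paths.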
The second-order CRF-Sampler can also be applied to protein loop modeling, model refinement, and even RNA tertiary structure prediction. In CASP8, the second-order CRF-Sampler was used within RAPTOR++ to predict the 3D structures of the hard targets. RAPTOR++ has four modules: threading, model quality assessment, multiple protein alignment, and template-free modeling (i.e., the second-order CRF-Sampler). RAPTOR++ first tries to align a target protein to the templates using the first three modules. It then predicts the quality of the 3D model implied by each alignment using a model quality assessment method. If all the alignments are of insufficient quality, RAPTOR++ falls back on the second-order CRF-Sampler. CRF-Sampler is also used to refine template-based models. The CRF models describe the relationships between input features and outputs using linear potential functions. The difficulty is that this relationship is frequently nonlinear and too complex for a linear model. To take advantage of both structured graphical models and nonlinear classifiers such as SVMs and neural networks, Zhao et al. proposed CNF-Folder, a fragment-free approach to protein folding based on Conditional Neural Fields (CNF) (Peng et al., 2009; Zhao et al., 2010). CNF-Folder extends CRF-Sampler by adding a middle layer between input features and outputs. The middle layer consists of a series of hidden gates, each with its own parameters, acting as local neurons (i.e., feature extractors) that capture the nonlinear relationship between input features and outputs. Experimental results show that CNF-Folder creates superior decoys on a variety of test proteins, including the CASP8 free-modeling targets, when used with replica exchange simulation and the same energy function as CRF-Sampler. Most impressively, CNF-Folder predicted the correct fold for T0496_D1, one of the two CASP8 targets with a truly new fold.
For T0496, the predicted model is also considerably superior to all of the CASP8 models.
To characterize the function and structure of a protein, one often needs a quantitative value for evaluating the interactions among residues or atoms. This value is called a potential, and it is either derived from physical principles or extracted from a knowledge base of solved structures. Several studies (Bradley et al., 2005; Skolnick, 2006) indicate that knowledge-based statistical potentials (Li and Liang, 2007; Lu et al., 2008; Miyazawa and Jernigan, 1985; Shen and Sali, 2006; Simons et al., 1997; Sippl, 1990; Tanaka and Scheraga, 1976; Zhang and Zhang, 2010; Zhou and Zhou, 2002) compare favorably to physics-based potentials (Brooks et al., 2009; Bryngelson et al., 1995; Case et al., 2005; Dill, 1985, 1997; Dobson et al., 1998; Schuler et al., 2001; Shakhnovich, 2006) in many applications, including ab initio folding (Jones and Thirup, 1986; Kachalo et al., 2006; Kihara et al., 2001; Levitt, 1992; Simons et al., 1997; Wu et al., 2007a; Zhao et al., 2010), docking (Zhang et al., 1997), binding (Kortemme and Baker, 2002; Laurie and Jackson, 2005), mutation studies (Gilis and Rooman, 1996, 1997), decoy ranking (Bauer and Beyer, 1994; Casari and Sippl, 1992; Gatchell et al., 2000; Hendlich et al., 1990; Samudrala and Moult, 1998; Simons et al., 1999; Vendruscolo et al., 2000), and protein model quality assessment (Jones and Thornton, 1996; Panchenko et al., 2000; Peng and Xu, 2010; Reva et al., 1997; Sippl, 1993). Knowledge-based statistical potentials extract interaction statistics from the solved protein structures in the Protein Data Bank (PDB) (Berman et al., 2000). They are also easier to use than physics-based potentials in terms of implementation and computational cost. Many knowledge-based statistical potentials have been created, including the popular DOPE (Shen and Sali, 2006) and DFIRE (Zhou and Zhou, 2002).
Some statistical potentials quantify local atomic interactions (e.g., torsion angle potentials) while others capture non-local atomic interactions (e.g., distance-dependent potentials). Despite extensive study, designing a protein potential with high accuracy remains challenging. Many knowledge-based statistical potentials are derived from the inverse of the Boltzmann law. They have two primary components: the reference state and the observed atomic interacting probability. The observed atomic interacting probability is usually assumed to depend only on atom types, and is estimated through a simple counting method. The reference state can be derived theoretically from physical rules or estimated statistically from an empirical dataset; many potentials differ mainly in the construction of the reference state. Zhao et al. (2012) designed the statistical potential EPAD (Evolutionary PAirwise Distance potential) by making use of protein evolutionary information, which is not used by currently popular statistical potentials (e.g., DOPE and DFIRE). EPAD is unique in that it has different energy profiles for the same type of atom pairs, depending on their sequence positions. Experiments confirm that this position-specific statistical potential outperforms several popular statistical potentials in both decoy discrimination and ab initio folding. Overall, the results suggest several implications: (1) statistical potentials that are protein-specific and position-specific work more effectively; (2) evolutionary information makes energy potentials more effective; (3) both the observed probability and the reference state differentiate energy potentials; (4) context-specific information improves the estimation of the observed probability.
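As a toy illustration of the inverse-Boltzmann construction described above, the sketch below turns per-bin observed and reference counts into energies via E = -kT ln(P_obs / P_ref). The counts, bins, and pseudocount handling are hypothetical; this is not the EPAD method itself.

```python
import math

def inverse_boltzmann_potential(observed_counts, reference_counts,
                                kT=1.0, pseudocount=1.0):
    """Toy inverse-Boltzmann potential: E(bin) = -kT * ln(P_obs / P_ref).

    observed_counts / reference_counts: counts per distance bin for one
    atom-pair type.  Pseudocounts guard against empty bins.
    """
    obs_total = sum(observed_counts) + pseudocount * len(observed_counts)
    ref_total = sum(reference_counts) + pseudocount * len(reference_counts)
    energies = []
    for n_obs, n_ref in zip(observed_counts, reference_counts):
        p_obs = (n_obs + pseudocount) / obs_total
        p_ref = (n_ref + pseudocount) / ref_total
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies

# A distance bin seen more often than the reference expects gets a negative
# (favorable) energy; an under-represented bin gets a positive one.
energies = inverse_boltzmann_potential([5, 40, 20, 5], [10, 20, 20, 20])
```

A position-specific potential such as EPAD replaces the single count table per atom-pair type with position-dependent probability estimates, but the inverse-Boltzmann step is the same.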
This thesis makes several scientific contributions, as follows. First, in protein folding, the local dihedral angles come from a continuous space. A first-order CRF-Sampler is proposed to model the conditional dependency between two neighboring dihedral angles given the primary sequence information. Second, the 3D structure of a protein depends on higher-order relationships among the constituent amino acids (residues) in every neighborhood, as well as on long-range interactions among residues far apart in the primary sequence. A framework combining the second-order Conditional Random Fields with an energy function is introduced to guide local conformation sampling under long-range constraints. Third, the relationship between the sequence profile and the local dihedral angle distribution is nonlinear. A novel machine learning model, Conditional Neural Fields, which combines a structured graphical model with a neural network, is applied to model this complex relationship in CNF-Folder, a powerful fragment-free protein conformation sampler. Fourth, the energy profiles of the pairwise distance potential depend on the positions of the interacting amino acids in the primary sequence as well as on their types, contrary to the common assumption that the energy profile depends only on the amino acid types. A new probabilistic neural network model is proposed to estimate the observed probability of pairwise distances, and a novel statistical distance potential (EPAD) is designed for protein structure evaluation and functional studies. The demand for these tools already exists: software packages for CNF-Folder and EPAD are available for users to download. This thesis is essentially a collection of original published papers in the area of template-free protein folding and ranking.
Each chapter is written in such a way that it can be read independently. Chapter 2 introduces the basic concepts of the continuous representation of protein conformations; its notation is used in the subsequent chapters. Chapter 3 presents the first-order Conditional Random Fields model for continuous protein conformation sampling. With the help of the continuous representations defined in Chapter 2, Chapter 3 presents the CRF model that learns the relationship between the sequence information and the local backbone dihedral angles. An efficient forward back-track conformation sampling algorithm is also presented, and the performance of the first-order CRF-Sampler is evaluated against several protein simulation approaches, including the popular Rosetta method. Chapter 4 describes the higher-order dependency among the dihedral angles in local sequences, as well as the importance of long-range interactions in protein folding. Hence we propose a framework that combines the second-order Conditional Random Fields, which model the local interactions, with an energy function that adds long-range constraints in a simulated annealing sampling algorithm. The second-order CRF-Sampler is built upon this framework and was applied in CASP8. Chapter 5 introduces the nonlinear relationship between the sequence information and the local conformations. The Conditional Neural Fields model applied to learn this complex relationship is explained in detail in this chapter. Moreover, a different simulation method, parallel tempering, is implemented in CNF-Folder to produce more high-quality decoys than the CRF-Samplers and other popular free-modeling methods. Chapter 6 briefly introduces statistical potentials and then presents in detail the novel Evolutionary PAirwise Distance potential (EPAD).
A complete comparison of EPAD against other knowledge-based distance potentials on all the previously used datasets is provided, demonstrating the superior performance of EPAD. A new large-scale dataset is also proposed to avoid over-tuning of statistical potentials on small datasets. In Appendix A, we propose a contact capacity potential (CCP). The experimental results imply that the contact capacity potential significantly improves decoy discrimination when combined with distance-dependent potentials. The results also support the hypothesis that the wrapping of each residue by surrounding amino acids is guided by maximizing the local static electric field, so that the core of the residue is protected against water molecules.

Part 1. Probabilistic and Continuous Models of Protein Conformational Space for Fragment-free and Template-Free Modeling

Chapter 2. Continuous representation of protein conformations
Evaluating a full-atom energy function takes a great deal of time, while a residue-level energy function lacks the accuracy of an atom-level one. In this thesis, we use a continuous yet simplified representation of a protein model: in the folding simulations of the subsequent chapters, we consider only the main chain and C-beta atoms.

Figure 5 The ϕ, ψ, ω dihedral angles in one residue of the protein backbone. ω can be assumed to be fixed at 180° (trans) or 0° (cis).

Cα-trace representation

A protein backbone conformation can be described by angle triples (ϕ, ψ, ω) together with a set of bond lengths. ϕ is the dihedral angle around the N−Cα bond and ψ is the dihedral angle around the Cα−C bond. Because ω and the bond lengths can be approximated as constants, we can represent a protein backbone as a set of (ϕ, ψ) angles. With the exception of the two terminal residues of a protein chain, each residue has a pair of ϕ and ψ angles. These (ϕ, ψ) angles suffice to calculate the coordinates of all the non-hydrogen atoms of a protein backbone. However, for some proteins, even given all of their native ϕ and ψ angles, we cannot accurately rebuild their backbone conformations because of slight variations in the other angles. This thesis therefore utilizes a different backbone representation instead of the (ϕ, ψ) representation. Because the virtual bond length between two adjacent Cα atoms can be approximated as a constant, i.e., 3.8 Å, the protein backbone can be represented as a set of pseudo angles (θ, τ) (Levitt, 1976). One rare exception is when the second residue is a cis proline, in which case the virtual bond length is approximately 3.2 Å. In this representation, all atoms other than the Cα atoms are ignored.
For any position i in a backbone, θ is defined as the pseudo bond angle formed by the Cα atoms at positions i−1, i, and i+1; τ is the pseudo dihedral angle formed by the Cα atoms at positions i−2, i−1, i, and i+1. Given the coordinates of the Cα atoms at positions i−2, i−1, and i, the coordinates of the Cα atom at position i+1 can be calculated from (θ, τ) at position i. Therefore, for a protein with N amino acid residues, given the positions of the first three Cα atoms and the N−2 (θ, τ) pairs, we can build the Cα trace of the protein. The relative positions of the first three Cα atoms are determined by the θ angle of the second residue. Using the (θ, τ) representation, only the coordinates of the Cα atoms can be recovered. The coordinates of the other backbone atoms and the Cβ atom can be built using programs such as MaxSprout (Holm and Sander, 1991), BBQ (Gront et al., 2007), and SABBAC (Maupetit et al., 2006), which can build backbone positions with RMSD less than 0.5 Å. The preferred conformations of an amino acid in the protein backbone can be described as a probabilistic distribution over the (θ, τ) angle pair. Each (θ, τ) corresponds to a unit vector in three-dimensional space (i.e., a point on the unit sphere). We can use the 5-parameter Fisher-Bingham (FB5) distribution (Hamelryck et al., 2006; Kent, 1982) to model probability distributions over unit vectors. FB5 is the analogue on the unit sphere of the bivariate normal distribution with an unconstrained covariance matrix. The probability density function of the FB5 distribution is

f(u) = c(κ, β)⁻¹ exp{ κ (γ₁ · u) + β [ (γ₂ · u)² − (γ₃ · u)² ] }

where u is a unit vector variable and c(κ, β) is a normalizing constant (Kent, 1982).
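The step of placing the next Cα atom from (θ, τ) can be sketched with standard internal-to-Cartesian geometry (the NeRF-style construction). This is a minimal illustration under the stated 3.8 Å virtual bond length, not the thesis code; the function and helper names are ours.

```python
import math

def _sub(u, v): return [a - b for a, b in zip(u, v)]
def _add(u, v): return [a + b for a, b in zip(u, v)]
def _scale(u, s): return [a * s for a in u]
def _cross(u, v):
    return [u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0]]
def _unit(u):
    n = math.sqrt(sum(a * a for a in u))
    return [a / n for a in u]

def place_next_ca(a, b, c, theta, tau, bond=3.8):
    """Place the next C-alpha atom given the three preceding ones.

    theta: pseudo bond angle (radians) formed by b, c, and the new atom;
    tau:   pseudo dihedral angle (radians) formed by a, b, c, and the new atom;
    bond:  virtual C-alpha/C-alpha bond length, approximated as 3.8 A.
    """
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))   # normal of the a-b-c plane
    m = _cross(n, bc)                   # completes a right-handed frame
    d = [-bond * math.cos(theta),
         bond * math.sin(theta) * math.cos(tau),
         bond * math.sin(theta) * math.sin(tau)]
    return _add(c, _add(_scale(bc, d[0]), _add(_scale(m, d[1]), _scale(n, d[2]))))
```

Iterating this placement over the N−2 (θ, τ) pairs, starting from the first three Cα positions, rebuilds the full Cα trace.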
The parameters κ (> 0) and β (0 ≤ 2β < κ) determine the concentration of the distribution and the ellipticity of the contours of equal probability, respectively: the higher κ and β, the more concentrated and elliptical the distribution. The three vectors γ₁, γ₂, and γ₃ are the mean direction, the major axis, and the minor axis, respectively. The latter two vectors determine the orientation of the equal-probability contours on the sphere, while the first determines the common center of the contours. We cluster the whole space of (θ, τ) into 100 groups, each of which can be described by an FB5 distribution, and calculate the (θ, τ) distribution of each group from a set of training proteins. Given the FB5 distribution at one residue, we can sample a pair of real-valued (θ, τ) angles in a probabilistic way and thus explore protein conformations in a continuous space. Using the (θ, τ) representation, only the coordinates of the Cα atoms can be built. To use an atom-level energy function, we also need to build the coordinates of the other atoms. Given a Cα trace, many methods can build the coordinates of the main chain and Cβ atoms (Gront et al., 2007; Holm and Sander, 1991; Maupetit et al., 2006). To save computing time, we want a method that is both accurate and efficient, so we use a method similar to BBQ (Gront et al., 2007). The original BBQ method can only build coordinates for the backbone N, C, and O atoms; we extend it to build coordinates for the Cβ atom. Experimental results (data not shown) indicate that the RMSD of this method is approximately 0.5 Å when the native Cα trace is available. This level of accuracy is good enough for our folding simulation. To employ the KMB hydrogen-bonding energy (Morozov et al., 2004) for β-containing proteins, we also need to build the backbone hydrogen atoms. We use a quick and simple method to build coordinates for the hydrogen atom HN (Tooze and Branden, 1999).
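Sampling a unit vector from an FB5 distribution can be sketched by rejection sampling against the uniform distribution on the sphere, using the density above: the unnormalized log-density is bounded by κ + β, which gives the acceptance ratio. This is an illustrative sketch only; exact FB5 samplers (e.g., Kent's construction) are far more efficient for large κ.

```python
import math
import random

def sample_fb5(kappa, beta, g1, g2, g3, rng=None):
    """Draw one unit vector from the FB5 (Kent) distribution by rejection
    sampling against the uniform distribution on the sphere.

    g1, g2, g3: orthonormal mean direction, major axis, minor axis.
    Assumes 0 <= 2*beta < kappa; the unnormalized log-density
    kappa*(g1.u) + beta*((g2.u)**2 - (g3.u)**2) is then bounded by kappa + beta.
    """
    rng = rng or random.Random()
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    while True:
        # Uniform point on the sphere: normalize a 3D Gaussian sample.
        u = [rng.gauss(0.0, 1.0) for _ in range(3)]
        n = math.sqrt(sum(a * a for a in u))
        u = [a / n for a in u]
        log_f = kappa * dot(g1, u) + beta * (dot(g2, u) ** 2 - dot(g3, u) ** 2)
        # Accept with probability f(u) / max(f): exact but slow for large kappa.
        if rng.random() < math.exp(log_f - (kappa + beta)):
            return u
```

Each accepted unit vector maps back to one real-valued (θ, τ) pair, which is how a continuous conformation space can be explored.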
Let N_i denote the position of the main chain N atom in the same residue as the HN atom. Let u denote the normalized bond vector from the N atom to the C atom of the previous residue, and v the normalized bond vector from the N atom to the Cα atom of the same residue. Then the position of the hydrogen atom HN can be estimated as

HN = N_i − (u + v) / |u + v|.

The average RMSD of this method is approximately 0.2 Å (data not shown) when the native coordinates of the other main chain atoms are available. Unless otherwise specified, the mathematical symbols listed in Table 1 will be used throughout the remaining chapters of Part 1. In the Conditional Random Fields (CRFs) and Conditional Neural Fields (CNFs) models we use, the primary sequence (or sequence profile) and predicted secondary structure are viewed as observations; the backbone angles and their FB5 distributions are treated as hidden states or labels.

Figure 6 Three point sets sampled from the FB5 distribution on the unit sphere. The three node values are typical representatives of coil (blue), α-helix (red), and β-strand (green). The arrows point to the mean directions of the three point sets. Adapted from (Hamelryck et al., 2006).

Table 1 Some Mathematical Symbols Used in the CRF Models
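The HN construction above is a few lines of vector arithmetic. The sketch below assumes the formula HN = N − (u + v)/|u + v| with an N−H bond length of about 1 Å; the function name and the explicit bond parameter are ours.

```python
import math

def place_hn(n_pos, c_prev_pos, ca_pos, bond=1.0):
    """Estimate the backbone amide hydrogen (HN) position.

    The N-H bond points opposite the bisector of the two normalized bond
    vectors N->C(previous residue) and N->CA (same residue):
        HN = N - bond * (u + v) / |u + v|
    """
    def unit(p, q):
        d = [b - a for a, b in zip(p, q)]
        n = math.sqrt(sum(x * x for x in d))
        return [x / n for x in d]
    u = unit(n_pos, c_prev_pos)   # N -> C of previous residue
    v = unit(n_pos, ca_pos)       # N -> CA of same residue
    s = [a + b for a, b in zip(u, v)]
    ns = math.sqrt(sum(x * x for x in s))
    return [p - bond * x / ns for p, x in zip(n_pos, s)]
```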
Symbols	Annotations
X	The PSIPRED-predicted secondary structure likelihood scores: a matrix with 3 × N elements, where N is the number of residues in a protein.
X_i	The predicted likelihoods of the three secondary structure types at position i: a vector of three values, indicating the likelihood of helix, beta, and loop, respectively.
X_i(x)	The predicted likelihood of secondary structure type x at position i.
M	The position-specific frequency matrix with 20 × N entries, each being the occurring frequency of one amino acid at a given position.
M_i	A vector of 20 elements, denoting the occurring frequencies of the 20 amino acids at position i.
M_i(aa)	The occurring frequency of amino acid aa at position i.
H	H = {h₁, h₂, …, h₁₀₀}, the set of 100 backbone angle states, each representing an FB5 distribution.
Λ	Λ = {λ₁, λ₂, …, λ_p}, the set of model parameters used by the CRF/CNF model.

Chapter 3. First-order CRF-Sampler
The FB5-HMM and TorusDBN models developed by Hamelryck et al. (Boomsma et al., 2008; Hamelryck et al., 2006) sample protein conformations from a continuous space. These models capture the relationship between backbone angles and the primary sequence, and the dependency between the angles of two adjacent residues. FB5-HMM uses a Hidden Markov Model to discover the native basins of the local (θ, τ) pseudo backbone dihedral angles of a protein sequence by learning the local sequence-structure relationship; TorusDBN learns the local (ϕ, ψ) backbone dihedral angle dependency using a dynamic Bayesian network, a generalization of the Hidden Markov Model. These methods have two particular advantages: first, they model the backbone angles using a directional statistics distribution (Kent, 1982; Mardia et al., 2007; Singh et al., 2002), so the angles can be sampled from a continuous space; second, they can capture how two adjacent backbone angles depend on one another. Although experimental results show that the sampler based on the Hidden Markov Model (HMM) is promising in generating decoys of good quality, the model makes two assumptions that may be relaxed. The first is that the residue at any position is independent of its secondary structure type. This appears to contradict the fact that one type of amino acid may prefer some types of secondary structure to others. The second assumption is that the hidden state (i.e., the distribution of backbone angles) at position i depends only on the type of amino acid residue and the type of secondary structure at that particular position. This contradicts the finding in (Jha et al., 2005) that the angles at any position i depend on at least the three residues at positions i−1, i, and i+1.
These assumptions are required in (Hamelryck et al., 2006) because the parameters of the HMM model are estimated by maximizing the joint probability P(A, X, S) of a primary sequence A, secondary structure X, and hidden states S (i.e., angles). For a reasonable estimate of the joint probability P(A, X, S), these assumptions are necessary because of the sparsity of the training data, as well as to avoid overfitting. In FB5-HMM (Hamelryck et al., 2006), P(A, X, S) is estimated using the law of total probability as

P(A, X, S) = Σ_H P(A|H) P(S|H) P(X|H) P(H)

where H is a possible hidden node sequence. Due to the sparsity of the training data, it is also very difficult to incorporate more complexity into the HMM model without elevating the risk of overfitting, which restricts the model's expressive power. This problem is not limited to the HMM model; it is common to all generative learning methods (e.g., HMM) that rely on optimizing the joint probability of states and observations, where the primary sequence and secondary structure are both observations. As an alternative, discriminative learning methods such as Conditional Random Fields (CRFs) can be more expressive while keeping the risk of overfitting under control. Discriminative learning differs from generative learning in that the former optimizes the conditional probability of the states given the observations, while the latter optimizes the joint probability of states and observations. This chapter presents the first-order CRF-Sampler, an extensible and fully automatic framework for effective protein conformation sampling, based on the probabilistic graphical model Conditional Random Fields (Lafferty et al., 2001; Sha and Pereira, 2003).
Similar to the HMM model, CRF-Sampler samples backbone angles from a continuous space using sequence and secondary structure information, and models the dependency between the angles at two adjacent positions. CRF-Sampler differs from the HMM model in the following aspects. First, CRF-Sampler is more expressive than the HMM model: the backbone angles at position i can depend on the residues and secondary structures at many positions instead of only one. In CRF-Sampler, a sophisticated model topology and feature set can be defined to describe the dependency between sequence and structure without worrying about the learning of model parameters. Unlike the HMM model, in which the model complexity (hence the risk of overfitting) roughly equals the number of parameters in the model, the effective complexity of CRF-Sampler is regularized by a Gaussian prior on its parameters, allowing the user to achieve a balance between model complexity and expressivity. Second, CRF-Sampler does not assume that the primary sequence is independent of the secondary structure in determining backbone angles; instead, it automatically learns the relative importance of the primary sequence and the secondary structure. Finally, CRF-Sampler can easily incorporate the sequence profile (i.e., position-specific frequency matrix) and predicted secondary structure likelihood scores into the model to further improve sampling performance. Our experimental results demonstrate that, using only compactness and self-avoiding constraints, CRF-Sampler can quickly generate more native-like conformations than the HMM model, and its best decoys are closer to the native structures. Conditional random fields (CRFs) are probabilistic graphical models that have been extensively used in modeling sequence data; please refer to (Lafferty et al., 2001) and (Sha and Pereira, 2003) for a complete description of CRFs.
Here, we describe how to predict the backbone angles of a protein from its primary sequence and secondary structure using CRFs. In this context, the primary sequence and secondary structure of a protein are called observations and its backbone angles are hidden states or labels. Let
o = {o₁, o₂, …, o_N} denote an observation sequence of length N, where o_i is an observed object. Each observed object can be a residue, a secondary structure type, or their combination. Let H = {h₁, h₂, …, h_M} be a finite set of labels (also called states), each representing a distribution of backbone angles. Let S = {S₁, S₂, …, S_N | S_i ∈ H} be a sequence of labels corresponding to the observation o. As opposed to the HMM model, which defines a joint probability of the label sequence S and the observation o, our CRF model defines the conditional probability of S given o as follows:

P_Λ(S|o) = exp( Σ_i F(S, o, i) ) / Z(o)    (3.1)

where Λ = (λ₁, λ₂, …, λ_p) is the model parameter and Z(o) = Σ_{S'} exp( Σ_i F(S', o, i) ) is a normalization factor summing over all possible label sequences for a given observation sequence. F(S, o, i) is the sum of the CRF features at sequence position i:

F(S, o, i) = Σ_k λ_k e_k(S_{i−1}, S_i) + Σ_l λ_l ν_l(o, S_i)    (3.2)

where e_k(S_{i−1}, S_i) and ν_l(o, S_i) are called edge and label feature functions, respectively. The edge and label functions are defined as

e_k(S_{i−1}, S_i) = [S_{i−1} = h₁][S_i = h₂]    (3.3)

and

ν_l(o, S_i) = [b_l(o, i)][S_i = h]    (3.4)

where S_i = h indicates that the label (or state) at position i is h, and b_l(o, i) is a logical context predicate indicating whether or not the context of the observation sequence o at position i holds a particular property or fact of the empirical data. [f] equals 1 if the logical expression f is true, and 0 otherwise.
Note that we can also define the edge feature function as e_k(S_{i−1}, S_i) = [b_l(o, i)][S_{i−1} = h₁][S_i = h₂] to capture the relationship between two adjacent labels and the observations. By expanding Equation 3.1 using Equations 3.2, 3.3, and 3.4 and merging identical items, the conditional probability can be reformulated as

P_Λ(S|o) = exp( Σ_k λ_k C_k(S, o) ) / Σ_{S'} exp( Σ_k λ_k C_k(S', o) )    (3.5)

where C_k(S, o) is the number of occurrences of the k-th feature in the pair of label sequence S and observation sequence o, and the model parameter λ_k is the weight of this feature. Here the parameter λ_k does not correspond to the log probability of an event (as in the HMM model); instead, it is a real-valued weight that either raises or lowers the "probability mass" of S relative to other possible label sequences. The parameter λ_k can be negative, positive, or zero. The CRF model is more expressive than the HMM model. First, we do not have to interpret the parameter λ_k as the log probability of an event. Second, CRFs do not have to assume that the observed object at one position is independent of the other objects. That is, for any ν_l(o, S_i), the label at position i can depend on many observed objects in the observation sequence, or even the whole observation sequence. In addition, S_i can also depend on any nonlinear combination of several observed objects. Therefore, the CRF model can accommodate complex feature sets that may be difficult to incorporate within a generative HMM model.
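The conditional probability of a linear-chain CRF can be evaluated directly for a tiny chain by enumerating all label sequences to compute the normalization factor. This brute-force sketch (feasible only for short toy examples; the states, observations, and weights are illustrative) makes the definition concrete:

```python
import itertools
import math

def crf_prob(labels, obs, edge_w, node_w, s):
    """P(s | o) for a toy linear-chain CRF, computing the normalization
    factor Z(o) by brute-force enumeration of all |labels|**N label
    sequences.  edge_w[(h1, h2)] and node_w[(o_i, h)] play the roles of
    the lambda-weighted edge and label features."""
    def score(seq):
        total = sum(node_w.get((o, h), 0.0) for o, h in zip(obs, seq))
        total += sum(edge_w.get((a, b), 0.0) for a, b in zip(seq, seq[1:]))
        return total
    z = sum(math.exp(score(seq))
            for seq in itertools.product(labels, repeat=len(obs)))
    return math.exp(score(s)) / z

# Two toy angle states and a three-residue observation sequence.
labels = ('H', 'E')
obs = ('A', 'G', 'A')
node_w = {('A', 'H'): 1.0, ('G', 'E'): 0.5}   # label-feature weights
edge_w = {('H', 'H'): 0.3}                    # edge-feature weights
p_hhh = crf_prob(labels, obs, edge_w, node_w, ('H', 'H', 'H'))
```

Because Z(o) sums over every possible label sequence, the probabilities of all sequences sum to one; real samplers replace the enumeration with dynamic programming.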
The underlying reason is that CRFs only optimize the conditional probability P_Λ(S|o) instead of the joint probability P_Λ(S, o), avoiding calculating the generative probability of the observation sequence.
Model parameter estimation
Given a set of observation sequences and their corresponding label sequences (o^j, S^j), CRFs train the parameters Λ = {λ₁, λ₂, …, λ_p} by maximizing the conditional log-likelihood L of the data:

L = Σ_j log P_Λ(S^j | o^j) − ‖Λ‖² / (2σ²)    (3.6)

This kind of training is also called discriminative training or conditional training. Different from the generative training of the HMM model, discriminative training directly optimizes the predictive ability of the model while ignoring the generative probability of the observations. The squared-norm term in Equation 3.6 is a regularization item to deal with the sparsity of the training data. When the complexity of the model is high (i.e., the model has many features and parameters) and the training data is sparse, overfitting may occur and many models can fit the training data. To prevent this, we place a Gaussian prior, exp( −(1/2σ²) Σ_k λ_k² ), on the model parameters to choose a model with "small" parameters. This regularization can improve the generalization capability of the model in both theory and practice. The objective function in Equation 3.6 is convex, and hence a globally optimal solution can theoretically be found using any efficient gradient-based optimization technique. There is no analytical solution to the above equation for a real-world application; quasi-Newton methods such as L-BFGS (Liu and Nocedal, 1989) can be used to solve it and usually converge to a good solution within a couple of hundred iterations. The log-likelihood gradient component for λ_k is

∂L/∂λ_k = Σ_j C_k(S^j, o^j) − Σ_j Σ_S P_Λ(S | o^j) C_k(S, o^j) − λ_k / σ²    (3.7)

The first two terms on the right of the above equation give the difference between the empirical and model-expected values of the feature count C_k.
The expected value Σ_S P_Λ(S|o) C_k(S, o) for a given o can be computed using a simple dynamic programming algorithm if the model only has the edge feature functions defined in Equation 3.3.

Model topology
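The dynamic programming at the heart of that computation is the standard forward recursion of a linear-chain CRF. A minimal sketch (illustrative weight dictionaries, log-space for numerical stability) computing log Z(o):

```python
import math

def log_partition(labels, obs, edge_w, node_w):
    """log Z(o) of a linear-chain CRF via the forward recursion:
        alpha_1(h) = node(o_1, h)
        alpha_i(h) = logsumexp_g[ alpha_{i-1}(g) + edge(g, h) ] + node(o_i, h)
    Runs in O(N * |labels|^2) instead of enumerating |labels|^N sequences."""
    def logsumexp(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))
    alpha = {h: node_w.get((obs[0], h), 0.0) for h in labels}
    for o in obs[1:]:
        alpha = {h: logsumexp([alpha[g] + edge_w.get((g, h), 0.0)
                               for g in labels]) + node_w.get((o, h), 0.0)
                 for h in labels}
    return logsumexp(list(alpha.values()))
```

Combining this forward pass with the analogous backward pass gives the per-position marginals needed for the expected feature counts in Equation 3.7.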
As illustrated in Figure 7, we use a CRF model to capture the relationship between a protein sequence and its (pseudo) backbone angles. Let S_i denote the label at position i; each label represents a distribution of backbone angles at a protein position. We use the Sine model (Mardia et al., 2007; Singh et al., 2002) to describe the distribution of the (ϕ, ψ) angles and the FB5 model (Kent, 1982) for the (θ, τ) angles, respectively. Each label depends on a window of residues in the primary sequence, their secondary structure types, and any nonlinear combinations of them. There is also interdependence between two adjacent labels. The CRF model is not necessarily a linear-chain graph; it can easily be extended to model the long-range relationship between two positions. For example, if distance restraints are available from NMR or threading programs, we can add edges to the CRF model to capture the long-range interactions between two nonadjacent residues. In the CRF model, we do not assume that the residues in the primary sequence are independent of each other, nor that the primary sequence is independent of the secondary structure; our CRF model can easily capture this kind of interdependence in its conditional probability P_Λ(S|o).

Figure 7 An example CRF model for protein conformation sampling. In this example, S_i (i.e., the label at position i) depends on the residues and secondary structure types at positions i−2, i−1, i, i+1, and i+2 and any nonlinear combinations of them. There is also interdependence between two adjacent labels. This CRF model can also be extended to incorporate long-range interdependence between two labels.

Model features
CRF-Sampler uses two different types of feature functions. At each position i, CRF-Sampler uses e(S_{i−1}, S_i) = [S_{i−1} = h′][S_i = h] as its edge feature function. That is, currently we only consider the first-order dependence between labels; we are also investigating the second-order dependence. We use the following label feature functions to model the relationship among primary sequence, secondary structure, and backbone angles. Here, w denotes half of the window size.

1. ν_j(o, S_i) = [A_{i+j} = a][S_i = h]. This feature set describes the interdependence between the label at position i and the residue at position i + j, where −w ≤ j ≤ w. A feature in this set is identified by a triple (a, h, j).
2. ν_j(o, S_i) = [X_{i+j} = x][S_i = h]. This feature set describes the interdependence between the label at position i and the secondary structure type at position i + j, where −w ≤ j ≤ w. A feature in this set is identified by a triple (x, h, j).
3. ν_j(o, S_i) = [A_{i+j} = a][X_{i+j} = x][S_i = h]. This feature set describes the interdependence among the label at position i, the residue at position i + j, and the secondary structure type at position i + j, where −w ≤ j ≤ w. A feature in this set represents a nonlinear combination of secondary structure and primary sequence and is identified by a quadruple (a, x, h, j).

We use a window size of 9 (i.e., w = 4), which is slightly better than a window size of 5 when predicted secondary structure information is used as input to CRF-Sampler (see Table 2). In total there are more than ten thousand features in CRF-Sampler quantifying the protein sequence-structure relationship.

Extension to continuous-valued observations
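A minimal sketch of the three label feature sets above, represented as a map from feature identifiers to values; the function name and identifier tuples are hypothetical, not CRF-Sampler's internals. With string inputs every value is 0 or 1; replacing the 1.0 values with profile frequencies and PSIPRED likelihoods yields the continuous-valued variant.

```python
# Indicator-style label features for a window of half-size w, keyed by
# (feature type, offset j, observed symbol(s), label h). Illustrative only.
def label_features(seq, ss, i, h, w=4):
    feats = {}
    for j in range(-w, w + 1):
        if 0 <= i + j < len(seq):
            a, x = seq[i + j], ss[i + j]
            feats[("res", j, a, h)] = 1.0        # [A_{i+j}=a][S_i=h]
            feats[("ss", j, x, h)] = 1.0         # [X_{i+j}=x][S_i=h]
            feats[("both", j, a, x, h)] = 1.0    # nonlinear combination
    return feats

# Toy sequence and secondary structure string, label id 7, window 5 (w=2).
f = label_features("MKVLA", "CHHHC", i=2, h=7, w=2)
```

For position 2 of a five-residue toy sequence with w = 2, all five window positions are in range, so the call yields 3 × 5 = 15 active features.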
We can also extend CRF-Sampler to make use of sequence profiles and predicted secondary structure likelihood scores to improve sampling performance. In this case, for each protein, the observation is no longer two strings but two matrices. One is the position-specific frequency matrix containing 20 × N entries (where N is the number of residues in the protein); each element in this matrix is the frequency of one amino acid occurring at a given position. The other is the PSIPRED-predicted secondary structure likelihood matrix containing 3 × N elements; each element is the predicted likelihood of one secondary structure type at a specific position. To use this kind of continuous-valued observation, we extend the label feature functions as follows. In defining ν_{l,j}(o, S_i) (l = 1, 2, 3), instead of assigning [A_{i+j} = a] a binary value (i.e., 0 or 1), we assign it the frequency of amino acid a appearing at position i + j. Similarly, we assign [X_{i+j} = x] the PSIPRED-predicted likelihood of secondary structure type x at position i + j.

Table 2 F1-Values (%) of CRF-Sampler with Respect to Window Size
Window size    a        b        c        d
1              10.01    10.02    15.3     17.34
5              14.08    17.5     19.65    20.49
9              14.76    18.58    19.81    20.89

Note: F1-value is an even combination of precision (p) and recall (r), defined as 2pr/(p + r). a, trained and tested using primary sequence; b, trained and tested using PSI-BLAST sequence profile; c, trained and tested using primary sequence and PSIPRED-predicted secondary structure; d, trained and tested using PSI-BLAST sequence profile and PSIPRED-predicted secondary structure confidence scores.

Sample one conformation for the whole protein
Given a CRF model and its parameters, we use a forward-backward sampling algorithm to generate protein conformations. The algorithm is an extension of the sampling algorithm described for the HMM model (Hamelryck et al., 2006). The major differences are that our sampling algorithm deals with many more sophisticated features and that we need to transform likelihood into probability during sampling. Let ν_l(i, h) denote a label feature function associated with position i and label h. For a given position i and a label h, we recursively define and calculate α(i, h) from the N-terminal to the C-terminal as follows:

$\alpha(0, h) = \exp\Big(\sum_l \lambda_l \nu_l(0, h)\Big)$

$\alpha(i, h) = \exp\Big(\sum_l \lambda_l \nu_l(i, h)\Big) \sum_{h'} \alpha(i-1, h')\, e^{\lambda_{h', h}}$

where $\lambda_l$ is the trained parameter for the label feature $\nu_l$ and $\lambda_{h', h}$ is the trained parameter for the edge feature $e(h', h)$. After $\alpha(N-1, h)$ (N is the protein size) is calculated, we can sample a conformation from the C-terminal to the N-terminal. First, we sample the label h for the last position according to the probability

$P(h) = \frac{\alpha(N-1, h)}{\sum_{h'} \alpha(N-1, h')}$

Then we sample the label $h'$ for position i according to the probability

$P(h') = \frac{\alpha(i, h')\, e^{\lambda_{h', h}}}{\sum_{h''} \alpha(i, h'')\, e^{\lambda_{h'', h}}}$

assuming that the sampled label at position i + 1 is h. Note that each label corresponds to a distribution of backbone angles. Based on the sampled labels, we can sample the two backbone angles for each position and build a backbone conformation for the protein.

Resample a small segment of the backbone conformation
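The whole-chain forward-backward sampling procedure above can be sketched on a toy model as follows: forward scores α(i, h) are accumulated from the N- to the C-terminal, and labels are then sampled back from the C- to the N-terminal. The node and edge scores stand in for the trained feature terms, and all names are illustrative.

```python
import math, random
from itertools import product

K, n = 3, 4  # toy label set and chain length
node = [[0.3 * i - 0.2 * h for h in range(K)] for i in range(n)]
edge = [[0.1 * a * b for b in range(K)] for a in range(K)]

def weighted_choice(weights, rng):
    """Sample an index in proportion to non-negative weights."""
    t = rng.random() * sum(weights)
    for h, w in enumerate(weights):
        t -= w
        if t <= 0:
            return h
    return len(weights) - 1

def forward():
    """alpha(i, h): forward scores from N- to C-terminal."""
    A = [[math.exp(node[0][h]) for h in range(K)]]
    for i in range(1, n):
        A.append([math.exp(node[i][h]) *
                  sum(A[i - 1][hp] * math.exp(edge[hp][h]) for hp in range(K))
                  for h in range(K)])
    return A

def sample_labels(A, rng):
    """Sample labels from C- to N-terminal, then reverse."""
    path = [weighted_choice(A[-1], rng)]          # last position
    for i in range(n - 2, -1, -1):
        nxt = path[-1]
        path.append(weighted_choice(
            [A[i][h] * math.exp(edge[h][nxt]) for h in range(K)], rng))
    return list(reversed(path))

A = forward()
# Sanity check: the forward scores at the last position sum to Z.
Z = sum(A[-1])
Z_brute = sum(math.exp(sum(node[i][S[i]] for i in range(n)) +
                       sum(edge[S[i - 1]][S[i]] for i in range(1, n)))
              for S in product(range(K), repeat=n))
labels = sample_labels(A, random.Random(0))
```

Each sampled label would then be mapped to its angle distribution and the backbone rebuilt, as described above.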
Given a backbone conformation, we generate the next conformation by resampling a small segment of the protein. First, we randomly sample the starting position of the segment and its length; the length is uniformly sampled from 1 to 15. We resample the labels of the segment consisting of positions i, i + 1, …, j, conditioned on the current labels at positions i − 1 and j + 1. Suppose that the labels at positions i − 1 and j + 1 are $h_{i-1}$ and $h_{j+1}$, respectively. We calculate $\bar\alpha(k, h)$ $(i \le k \le j)$ from position i to j as follows:

$\bar\alpha(i, h) = \exp\Big(\sum_l \lambda_l \nu_l(i, h)\Big)\, e^{\lambda_{h_{i-1}, h}}$

$\bar\alpha(k, h) = \exp\Big(\sum_l \lambda_l \nu_l(k, h)\Big) \sum_{h'} \bar\alpha(k-1, h')\, e^{\lambda_{h', h}} \quad (i < k \le j)$

After calculating $\bar\alpha(k, h)$ for all k between i and j, we sample the labels for the segment from j down to i. At position j, we sample a label h according to the probability

$P(h) = \frac{\bar\alpha(j, h)\, e^{\lambda_{h, h_{j+1}}}}{\sum_{h'} \bar\alpha(j, h')\, e^{\lambda_{h', h_{j+1}}}}$

For any position k (i ≤ k ≤ j − 1), we sample a label h according to the probability

$P(h) = \frac{\bar\alpha(k, h)\, e^{\lambda_{h, h_{k+1}}}}{\sum_{h'} \bar\alpha(k, h')\, e^{\lambda_{h', h_{k+1}}}}$

where $h_{k+1}$ is the label just sampled at position k + 1. After resampling the labels of this segment, we resample the angles of the segment and rebuild the backbone conformation.

Folding simulation
Since the focus of this chapter is the protein conformation sampling algorithm, we use only compactness and self-avoidance constraints to drive conformation search during the folding simulation. We start by sampling the whole backbone conformation of a given protein and then optimize the conformation by minimizing its radius of gyration. Given a conformation, we generate the next candidate conformation by resampling the local conformation of a small segment. If this candidate has no serious steric clashes among atoms, we compare its radius of gyration with that of the current conformation; if the candidate has a smaller radius, we accept it, otherwise we reject it. The process terminates when no better conformation can be found within 1000 consecutive resamplings. There is a steric clash if the distance between two Cα atoms is less than 4 Å.

Data set
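The folding simulation above amounts to a simple accept/reject loop. Below is a sketch under stated assumptions: the clash check skips adjacent Cα pairs (which sit near 3.8 Å by covalent geometry, an interpretation of the 4 Å rule rather than something the text spells out), and `resample_segment` is a hypothetical stand-in for the CRF segment-resampling move.

```python
import math, random

# Greedy folding simulation: accept a resampled conformation only if it is
# clash-free and has a smaller radius of gyration; stop after `patience`
# consecutive rejections (1000 in the thesis). A conformation is a list of
# (x, y, z) C-alpha coordinates.
def radius_of_gyration(coords):
    n = len(coords)
    center = [sum(c[d] for c in coords) / n for d in range(3)]
    return math.sqrt(sum(sum((c[d] - center[d]) ** 2 for d in range(3))
                         for c in coords) / n)

def clash_free(coords, min_dist=4.0):
    # Steric clash: two non-adjacent C-alpha atoms closer than 4 Angstroms.
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):
            if sum((coords[i][d] - coords[j][d]) ** 2 for d in range(3)) < min_dist ** 2:
                return False
    return True

def fold(conf, resample_segment, rng, patience=1000):
    best = radius_of_gyration(conf)
    fails = 0
    while fails < patience:
        cand = resample_segment(conf, rng)
        if clash_free(cand) and radius_of_gyration(cand) < best:
            conf, best, fails = cand, radius_of_gyration(cand), 0
        else:
            fails += 1
    return conf

# Toy usage: an extended 10-residue chain compacted by single-residue jitter.
rng = random.Random(1)
chain = [(3.8 * i, 0.0, 0.0) for i in range(10)]
def jitter(conf, rng):
    c = list(conf)
    i = rng.randrange(len(c))
    c[i] = tuple(v + rng.uniform(-1.0, 1.0) for v in c[i])
    return c
folded = fold(chain, jitter, rng, patience=200)
```

The radius of gyration never increases across accepted moves, so the loop is a strict descent with clash-feasibility as the only other constraint.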
The first-order CRF-Sampler is tested on the following proteins: 1FC2, 1ENH, 2GB1, 2CRO, 1CTF, 4ICB, 1AA2, 1BEO, 1DKT, 1FCA, 1FGP, 1JER, 1NKL, 1PGB, 1SRO, 1TRL, T0052 (PDB code: 2EZM), T0056 (1JWE), T0059 (1D3B), T0061 (1BG8), T0064 (1B0N), and T0074 (1EH2). The first six proteins have been studied in (Simons et al., 1997) and (Hamelryck et al., 2006), and the last 18 in (Xia et al., 2000); the last six proteins are also CASP3 targets. We obtained a set of non-redundant protein structures using the PISCES server (Wang and Dunbrack, 2003) as our training data. Each protein in this set has resolution better than 2.0 Å, an R factor no larger than 0.25, and at least 30 residues, and any two proteins in this set share no more than 30% sequence identity. To avoid overlap between the training data and the test proteins, we removed the following proteins from our training data: (1) proteins sharing at least 25% sequence identity with our test proteins; (2) proteins in the same fold class as our test proteins according to the SCOP classification (Murzin et al., 1995); and (3) proteins having a TM-score ≥ 0.5 with our test proteins, in case some recently released proteins do not have a SCOP ID. According to (Zhang and Skolnick, 2005b), if the TM-score of two protein structures is smaller than 0.5, then a threading program such as PROSPECTOR_3 (Skolnick et al., 2004) cannot identify their similarity relationship with high confidence.

Label assignment and distribution parameters
To train CRF-Sampler, we also need to assign a label to each position in a protein. Here we only tested our algorithm on the (θ, τ) representation of a protein backbone conformation. There are various ways to assign a label to a protein position; for example, we could cluster all the (θ, τ) angles into dozens of groups, with each group corresponding to a label. Instead, we simply use the five-residue fragment libraries developed by Kolodny et al. (Kolodny et al., 2002), since these libraries have already been carefully designed. The library containing 100 five-residue fragments is used as the set of hidden labels; each label corresponds to a cluster in the fragment library. We calculated the (θ, τ) distribution for each cluster from the training proteins using the KentEstimator program enclosed in Mocapy (Hamelryck et al., 2006). Only the angles of the middle residue in a fragment are used to calculate the angle distribution parameters. We also tested the other four-residue and five-residue fragment libraries developed by Kolodny et al., and the five-residue library with 100 clusters yields the best performance.
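The (θ, τ) pseudo-angles themselves are ordinary vector geometry: θ(i) is the bond angle at Cα(i) formed by Cα(i−1), Cα(i), Cα(i+1), and τ(i) is the torsion defined by four consecutive Cα atoms. A self-contained sketch of that geometry follows (standard formulas, not the KentEstimator implementation):

```python
import math

def _sub(a, b): return tuple(x - y for x, y in zip(a, b))
def _dot(a, b): return sum(x * y for x, y in zip(a, b))
def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])
def _norm(a): return math.sqrt(_dot(a, a))

def pseudo_theta(p0, p1, p2):
    """Angle (degrees) at p1 formed by p0-p1-p2."""
    u, v = _sub(p0, p1), _sub(p2, p1)
    return math.degrees(math.acos(_dot(u, v) / (_norm(u) * _norm(v))))

def pseudo_tau(p0, p1, p2, p3):
    """Signed torsion angle (degrees) of the four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    m = _cross(n1, tuple(x / _norm(b2) for x in b2))
    return math.degrees(math.atan2(_dot(m, n2), _dot(n1, n2)))

# A right angle, and a planar zig-zag whose torsion is 180 degrees (trans):
theta = pseudo_theta((1, 0, 0), (0, 0, 0), (0, 1, 0))
tau = pseudo_tau((0, 0, 0), (1, 0, 0), (1, 1, 0), (2, 1, 0))
```

In real use, the four points would be consecutive Cα coordinates read from a PDB file, and the resulting (θ, τ) pairs would be pooled per fragment cluster to fit the FB5 parameters.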
Parameter tuning
We randomly divided the training proteins into five sets of the same size and used them for five-fold cross validation. We trained CRF-Sampler using several different regularization factors [i.e., σ in Equation 3.7]: 50, 100, 200, 400, and 800, and chose the one with the best F1-value. The F1-value is a widely used measure of the predictive capability of a machine learning model; it is an even combination of precision (p) and recall (r), defined as 2pr/(p + r). The higher the F1-value, the better. When both the PSI-BLAST sequence profile and the PSIPRED-predicted secondary structure likelihood scores are used and a window size of 9 is used to define the model features, the average F1-values for regularization factors 50, 100, 200, 400, and 800 are 20.82%, 20.89%, 20.83%, 20.71%, and 20.56%, respectively. In fact, there is little difference among these regularization factors in terms of F1-value; we nevertheless prefer the relatively small regularization factor 100 to control the model complexity. The regularization factor is the only parameter that we need to tune manually; all other model parameters (i.e., feature weights) are estimated automatically in training. In addition, we also tested the performance of our algorithm with respect to the window size used in defining model features. As shown in Table 2, when 100 labels are used in CRF-Sampler, a window size of 5 yields a much higher F1-value than a window size of 1. Increasing the window size to 9 improves the F1-value further, but the improvement is small when PSIPRED-predicted secondary structure is used, perhaps because the predicted secondary structure already contains partial information about neighboring residues. In our remaining experiments, we used window size 9 to define model features for CRF-Sampler.

Comparison with the HMM model
It is not easy to fairly compare two protein conformation sampling algorithms. Many ab initio folding programs use a sophisticated energy function to drive conformation search, and it is hard to evaluate the performance of their conformation sampling algorithms alone, without considering their energy functions. The focus of this chapter lies only in the protein conformation sampling algorithm; to evaluate it, we drive conformation search by minimizing the radius of gyration instead of a well-designed energy function. Here we compare CRF-Sampler mainly with the HMM method described in (Hamelryck et al., 2006), which is also a protein conformation sampling algorithm and likewise drives conformation search by minimizing the radius of gyration rather than an energy function. The major difference between the two is that CRF-Sampler generates conformations using a CRF model while the other uses an HMM. We tested CRF-Sampler on the six proteins studied in (Hamelryck et al., 2006) and compared the quality of the decoys generated by CRF-Sampler with those generated by the HMM model. Since for most proteins without known structures we cannot obtain their true secondary structures, we compare the HMM model and CRF-Sampler using only PSIPRED-predicted secondary structure and sequence information as inputs. As shown in Table 3, when only sequence information is used, CRF-Sampler generates decoys of much higher quality than the HMM model. The only exception is 1ENH, on which the HMM model performs comparably to CRF-Sampler when only the primary sequence is used. When only the primary sequence is used, the number of good decoys (RMSD ≤ 6 Å) generated by CRF-Sampler is up to roughly four times that generated by the HMM model (Hamelryck et al., 2006). This confirms that CRF-Sampler captures well the relationship between a sequence stretch and its local conformation.

Table 3 Decoy Quality Comparison between the HMM Model and CRF-Sampler
Note: Only sequence information is used in both training and testing.
In total, 100,000 decoys are generated by the HMM model. Column "L" lists the length of the test proteins; column "α,β" lists the number of α-helices and β-strands of the test proteins; columns "Good" and "Best" list the percentage of good decoys (with RMSD ≤ 6 Å) and the RMSD of the best decoy, respectively. a Trained and tested using primary sequence; results taken from (Hamelryck et al., 2006). b Trained and tested using primary sequence; 40,000 decoys are generated. c Trained and tested using PSI-BLAST sequence profile; only 20,000 decoys are generated.

Test proteins              L    α,β   HMM a            CRF-Sampler b    CRF-Sampler c
Name, PDB code                        Good(%)  Best(Å) Good(%)  Best(Å) Good(%)  Best(Å)
Protein A, 1FC2            43   2,0   9.59     2.7     20.9     2.08    24.8     2.09
Homeodomain, 1ENH          54   2,0   6.6      2.5     6.23     2.68    14       1.98
Protein G, 2GB1            56   1,4   0.04     4.9     0.16     4.67    10.1     3.36
Cro repressor, 2CRO        65   5,0   0.46     3.9     1.94     4.05    13.3     2.37
Protein L7/L12, 1CTF       68   3,1   0.01     5.4     0.04     4.94    0.15     4.49
Calbindin, 4ICB            76   4,0   0.09     4.3     0.17     4.57    0.42     4.72

Table 4
Decoy Quality Comparison between the HMM Model and CRF-Sampler
Note: Both sequence and secondary structure information are used. In total, 100,000 decoys are generated by the HMM model, while only 20,000 decoys are generated by each CRF-Sampler. Please refer to Table 3 for the column definitions. a Trained using true secondary structure and primary sequence, tested using predicted secondary structure (by PSIPRED (McGuffin et al., 2000)) and primary sequence; results taken from (Hamelryck et al., 2006). b Trained and tested using predicted secondary structure (by PSIPRED) and primary sequence. c Trained and tested using predicted secondary structure likelihood scores (by PSIPRED) and PSI-BLAST sequence profile.

Test proteins              L    α,β   HMM a            CRF-Sampler b    CRF-Sampler c
Name, PDB code                        Good(%)  Best(Å) Good(%)  Best(Å) Good(%)  Best(Å)
Protein A, 1FC2            43   2,0   17.1     2.6     26.8     2.13    49.1     1.94
Homeodomain, 1ENH          54   2,0   12.2     3.8     16.7     2.29    22.4     2.32
Protein G, 2GB1            56   1,4   0        5.9     26.4     3.05    23.3     2.91
Cro repressor, 2CRO        65   5,0   1.1      4.1     18.3     2.76    16.8     2.79
Protein L7/L12, 1CTF       68   3,1   0.35     4.1     3        4.04    2.4      3.7
Calbindin, 4ICB            76   4,0   0.38     4.5     0.24     4.45    0.51     4.63

Figure 8 Native structures (in orange) and the best decoys of 1FC2, 1ENH, 2GB1, 2CRO, 1CTF, and 4ICB

Table 5 Secondary Structure Content of Good Decoys
Note: a Percentage of correct secondary structure (Q3-value) and secondary structure content of good decoys (RMSD ≤ 6 Å) generated using primary sequence. b Q3-value and secondary structure content of good decoys generated using PSI-BLAST sequence profile. c Q3-value and secondary structure content of good decoys generated using primary sequence and PSIPRED-predicted secondary structure. d Q3-value and secondary structure content of good decoys generated using PSI-BLAST sequence profile and PSIPRED-predicted secondary structure likelihood scores.
(Table 5 lists, for each test protein, the Q3-value and the H, E, C content under each of CRF-Sampler a, b, c, and d; the values are not reproduced here.)

Comparison with Xia et al
Xia et al. (Xia et al., 2000) developed a hierarchical method to generate decoys. This method first exhaustively enumerates all possible conformations on a lattice model for a given protein sequence and then builds conformations with increasing detail. At each step, the method chooses a good subset of conformations using a hydrophobic compactness constraint and empirical energy functions such as RAPDF (Samudrala and Moult, 1998) and Shell (Park et al., 1997), and finally generates 10,000 or 40,000 decoys for a protein sequence. Table 6 lists the RMSD ranges of all the decoys generated by this method and by CRF-Sampler for the 18 test proteins. As shown in this table, CRF-Sampler generates decoys with smaller RMSD on all the test proteins with fewer than 100 residues. CRF-Sampler is significantly better than the hierarchical method on 1CTF, 1NKL, 1PGB, 1SRO, 1TRL-A, T0052, T0059, and T0074 in terms of the best decoys. For the test proteins with more than 100 residues, CRF-Sampler is worse than the method of Xia et al. on two proteins (1AA2 and T0056), better on one protein (T0064), and comparable on 1JER. This may indicate that CRF-Sampler needs further improvement to search the conformation space more effectively for proteins with more than 100 residues. We also used the Wilcoxon signed-rank test (Wilcoxon, 1945), a nonparametric alternative to the paired Student's t-test, to calculate the significance level at which CRF-Sampler is better than the method of Xia et al. Since we only had the RMSD ranges of the decoys generated by Xia et al., we considered only the best decoys in the statistical test. For the first 12 proteins in Table 6, we used the best decoys in a set of 10,000 randomly chosen CRF-Sampler decoys, since Xia et al. only generated 10,000 final decoys for these proteins. For the last six proteins, we used their best decoys listed in Table 6.
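The signed-rank statistic is simple to compute by hand. The sketch below uses the best-RMSD columns of Table 3 as illustrative paired data (the thesis applied the test to the Xia et al. comparison, whose per-protein values are not listed in this text) and assumes no zero or tied differences; the helper name is hypothetical.

```python
# Wilcoxon signed-rank statistic for paired best-decoy RMSDs: rank the
# absolute differences, then sum the ranks of positive and negative
# differences separately. The smaller of the two sums is the test statistic.
def wilcoxon_w(x, y):
    diffs = [a - b for a, b in zip(x, y)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w_plus = sum(r + 1 for r, i in enumerate(order) if diffs[i] > 0)
    w_minus = sum(r + 1 for r, i in enumerate(order) if diffs[i] < 0)
    return w_plus, w_minus

hmm_best = [2.7, 2.5, 4.9, 3.9, 5.4, 4.3]        # Table 3, HMM best RMSD
crf_best = [2.08, 2.68, 4.67, 4.05, 4.94, 4.57]  # Table 3, CRF-Sampler c
wp, wm = wilcoxon_w(hmm_best, crf_best)
```

The statistic min(wp, wm) would then be compared against the signed-rank critical values (or a normal approximation for larger samples) to obtain the significance level.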
When the absolute RMSD difference between the best decoys is used in the statistical test, CRF-Sampler is better than the method of Xia et al. at significance level 0.01 (in fact, the significance level is very close to 0.005). When the relative RMSD difference is used, CRF-Sampler is better at significance level 0.005. Finally, CRF-Sampler tends to generate decoys with larger RMSD variance because it does not use any empirical energy function to filter out bad conformations. CRF-Sampler generated 20,000 decoys for each test protein using the PSI-BLAST sequence profile and predicted secondary structure likelihood scores. Xia et al. (Xia et al., 2000) conducted a complete enumeration on a lattice model for each test protein and then generated 10,000 decoys for each of the first 12 proteins and 40,000 decoys for each of the six CASP3 targets, respectively, using predicted secondary structure and empirical energy functions as filters.

Table 6 Decoy Quality Comparison Between Xia et al. (Xia et al., 2000) and CRF-Sampler
Test proteins              Xia et al.          CRF-Sampler
PDB code   L   Class       All   RMSD range    All   RMSD range
(values not reproduced here)

Table 7 Percentage of correct secondary structure (Q3-value) and secondary structure content of all the decoys generated by CRF-Sampler, compared with PSIPRED predictions

Test proteins              CRF-Sampler         PSIPRED
PDB code   L   Class       Q3   H   E   C      Q3   H   E   C
(values not reproduced here)
Comparison with Rosetta
Here we compare CRF-Sampler with the well-known fragment-assembly-based program Rosetta (Simons et al., 1997). Rosetta uses multiple sequence alignment information to choose 25 fragments for each nine-residue sequence segment and then assembles them into decoys using a time-consuming simulated annealing procedure. Rosetta drives conformation search using a well-developed energy function and generates a few hundred decoys, while CRF-Sampler generates 20,000 decoys without using any energy function. Although this comparison is interesting, we want to point out that it is unfair to both CRF-Sampler and Rosetta: on one hand, CRF-Sampler does not use an energy function to drive conformation search; on the other hand, the Rosetta results are taken from an article published in 1997 and may not reflect the performance of the state-of-the-art Rosetta. As indicated in Table 8, by quickly generating a large number of decoys, CRF-Sampler can obtain decoys with much smaller RMSD than Rosetta. However, it is also not surprising that Rosetta generates a higher percentage of good decoys for five of the six test proteins by using a time-consuming energy minimization procedure.

Computational efficiency
Although we have not optimized the C++ code of CRF-Sampler, CRF-Sampler can generate a decoy within seconds for a test protein. Table 9 lists the approximate running time in minutes spent by CRF-Sampler in generating 100 decoys for each test protein, on a single 2.2 GHz CPU. As indicated in this table, CRF-Sampler generates decoys for these test proteins very quickly: it takes approximately one hour to generate 100 decoys for protein G (2GB1) and no more than ten minutes for 1FC2. Using more information as input, such as the PSI-BLAST sequence profile and secondary structure likelihood scores, does not increase the running time of CRF-Sampler; instead, it tends to reduce the running time, perhaps because of the reduction in the entropy of the conformation search space.

Table 8 Decoy Quality Comparison Between ROSETTA and CRF-Sampler
Note: Only sequence information is used in both training and testing. ROSETTA generates a few hundred decoys using energy function optimization, while CRF-Sampler generates 20,000 decoys by minimizing the radius of gyration. Refer to Table 3 for the column definitions. a Multiple sequence alignment information is used; results taken from (Simons et al., 1997). b Trained and tested using primary sequence. c Trained and tested using PSI-BLAST sequence profile.
Test proteins              L    α,β   ROSETTA a        CRF-Sampler b    CRF-Sampler c
Name, PDB code                        Good(%)  Best(Å) Good(%)  Best(Å) Good(%)  Best(Å)
Protein A, 1FC2            43   2,0   95       3.3     20.9     2.08    24.8     2.09
Homeodomain, 1ENH          54   2,0   47       2.7     6.23     2.68    14       1.98
Protein G, 2GB1            56   1,4   0        6.3     0.16     4.67    10.1     3.36
Cro repressor, 2CRO        65   5,0   18       4.2     1.94     4.05    13.3     2.37
Protein L7/L12, 1CTF       68   3,1   6        5.3     0.04     4.94    0.15     4.49
Calbindin, 4ICB            76   4,0   17       4.7     0.17     4.57    0.42     4.72

Table 9 Approximate Running Time in Minutes Spent by CRF-Sampler in Generating 100 Decoys
Note: The features are used in both the training and testing data sets; times are in minutes. The six columns correspond to the six test proteins of Table 8, in the same order.

Features                                                             1FC2  1ENH  2GB1  2CRO  1CTF  4ICB
Primary sequence only                                                7     13    45.5  29.5  46    37.5
PSI-BLAST sequence profile only                                      8     12    63    14.5  78    43.5
PSIPRED-predicted secondary structure and primary sequence           6     7.5   67.5  7.5   16    29.5
PSIPRED-predicted secondary structure confidence scores and
PSI-BLAST sequence profile                                           4     8     56    7     13    27.5

This chapter presented an extensible and fully automatic framework, CRF-Sampler, which can be used to effectively sample conformations of a protein from its sequence information and predicted secondary structure. CRF-Sampler uses thousands of parameters to quantify the relationship among backbone angles, primary sequence, and secondary structure without risking overfitting. Experimental results demonstrate that CRF-Sampler is more effective in sampling conformations than the HMM model (Hamelryck et al., 2006). CRF-Sampler is also quite flexible: the user only needs to choose a set of appropriate features describing the relationship between protein sequence and structure, and CRF-Sampler takes care of the remaining tasks such as parameter estimation and conformation sampling. The first-order CRF-Sampler only takes into consideration the dependency between two adjacent positions. It may explore the conformation space of medium-sized proteins (≥ 100 residues) more effectively by incorporating the interdependency among more residues. For example, given the labels at positions i and i + 1, there are on average only 10 possible labels at position i + 2.
Given the labels at positions i and i + 2, there are on average only 16 possible labels at position i + 4. If this kind of constraint information is incorporated into CRF-Sampler, it can greatly reduce the entropy of the conformation search space and is very likely to scale CRF-Sampler up to proteins with more than 100 residues. In this chapter, we drove the conformation optimization by minimizing the radius of gyration. Our next step is to couple CRF-Sampler with a good energy function such as DOPE (Shen and Sali, 2006) or DFIRE (Zhou and Zhou, 2002) to do real protein structure prediction. The decoys generated by CRF-Sampler can also be used to benchmark energy functions, since CRF-Sampler does not employ any energy function to generate these decoys and thus no energy bias is introduced into them. We can also incorporate other predicted information, such as solvent accessibility and contact capacity (i.e., the number of contacts for a residue), into CRF-Sampler, which may improve sampling performance. In addition, if some distance restraints can be obtained from NMR data or comparative modeling, it is also possible to extend CRF-Sampler to incorporate long-range interdependency. CRF-Sampler can also be extended to make use of other NMR data sources such as chemical shifts (Neal et al., 2006) and residual dipolar coupling (RDC) data (Meiler et al., 2000). We will address these issues in the discussion of the second-order CRF-Sampler in the next chapter. Since the experimental results indicate that using predicted secondary structure dramatically improves the sampling performance of CRF-Sampler compared to using the PSI-BLAST sequence profile only, we will employ those features in the second-order CRF-Sampler.

Chapter 4. Second-order CRF-Sampler with Energy
In Chapter 3, we described a protein conformation sampling algorithm based on 1st-order conditional random fields (CRFs) and directional statistics. The CRF model is a generalization of the HMM model and is much more powerful. Our CRF model can accurately describe the complex sequence-angle relationship and estimate the probability of a conformation by incorporating various sequence and structure features and directly taking nearest-neighbor effects into consideration. We have shown that with the 1st-order CRF model we can sample conformations of better quality than Hamelryck et al.'s FB5-HMM (Zhao et al., 2008). These studies demonstrate that it is promising to do template-free modeling without using discrete representations of protein conformational space. This chapter presents the first template-free modeling method that searches conformations in a continuous space and at the same time achieves performance comparable to the popular fragment assembly methods. This method differs from our previous work discussed in Chapter 3 (Zhao et al., 2008) and FB5-HMM (Hamelryck et al., 2006) in that the latter two only describe a method for conformation sampling in a continuous space but do not demonstrate that this sampling technique actually leads to a template-free modeling method with performance comparable to the fragment assembly method. By contrast, in this chapter we describe a 2nd-order CRF model of protein conformational space and show that, with a simple energy function, the 2nd-order CRF model works well for template-free modeling. We will show that it is necessary to use the 2nd-order model instead of the 1st-order model described in our previous work, since the former dramatically improves sampling efficiency over the latter, which makes the 2nd-order model feasible for real-world template-free modeling.
Blindly tested in the CASP8 evaluation, our CRF method compares favorably with the Robetta server (Misura et al., 2006; Simons et al., 1997), especially on alpha and small beta proteins. Our method also generated 3D models better than template-based methods for a couple of CASP8 hard targets.

A 2nd-order CRF model of protein conformation space
We have described a 1st-order CRF model for protein conformation sampling in Chapter 3. Here we extend it to a 2nd-order model to more accurately capture the local sequence-angle relationship. Given a protein with solved structure, we can calculate its backbone angles at each position and determine to which of the 100 groups (i.e., states or labels) the angles at each position belong. Each group is described by an FB5 distribution. Let
$S = (s_1, s_2, \dots, s_N)$ ($s_i \in H$) denote such a sequence of states/labels (i.e., FB5 distributions) for this protein, where $H$ is the set of 100 states. We also denote the sequence profile of this protein as M and its secondary structure as X. As shown in Figure 9, our CRF model defines the conditional probability of S given M and X as follows:

$P_\Lambda(S \mid M, X) = \frac{\exp\big(\sum_i F(S, M, X, i)\big)}{Z(M, X)}$    (4.1)

where $\Lambda = (\lambda_1, \lambda_2, \dots, \lambda_K)$ is the model parameter and

$Z(M, X) = \sum_S \exp\Big(\sum_i F(S, M, X, i)\Big)$

is a normalization factor summing over all possible label sequences for the given M and X. Compared to Equation 3.1, we use both the sequence profile M and the predicted secondary structure X as observation features, since we learned from the results in Chapter 3 that this combination generates a higher percentage of good decoys. F(S, M, X, i) consists of two edge features and two label features at position i:

$F(S, M, X, i) = e(s_{i-1}, s_i) + e(s_{i-1}, s_i, s_{i+1}) + \sum_{j=i-w}^{i+w} v(s_i, M_j, X_j) + \sum_{j=i-w}^{i+w} v(s_{i-1}, s_i, M_j, X_j)$    (4.2)

where $e(s_{i-1}, s_i)$ and $e(s_{i-1}, s_i, s_{i+1})$ are the 1st-order and 2nd-order edge feature functions, respectively, and $v(s_i, M_j, X_j)$ and $v(s_{i-1}, s_i, M_j, X_j)$ are the 1st-order and 2nd-order label feature functions, respectively. If we remove $e(s_{i-1}, s_i, s_{i+1})$ and $v(s_{i-1}, s_i, M_j, X_j)$, we obtain a 1st-order CRF model. The two edge feature functions model local conformation dependency:

$e(s_{i-1}, s_i) = \lambda(h', h'')\, [s_{i-1} = h'][s_i = h'']$    (4.3)

$e(s_{i-1}, s_i, s_{i+1}) = \lambda(h', h'', h''')\, [s_{i-1} = h'][s_i = h''][s_{i+1} = h''']$    (4.4)

Here $[s_i = h]$ is an indicator function, equal to 1 if the state at position i is $h \in H$ and 0 otherwise; $\lambda(h', h'')$ is a model parameter identified by two states h′ and h″, and $\lambda(h', h'', h''')$ is a model parameter identified by three states. The two label feature functions are

$v(s_i, M_j, X_j) = \sum_{x, a} \lambda(j-i, x, a, h)\, X_j(x) M_j(a) [s_i = h] + \sum_{x} \lambda(j-i, x, h)\, X_j(x) [s_i = h] + \sum_{a} \lambda(j-i, a, h)\, M_j(a) [s_i = h]$    (4.5)

$v(s_{i-1}, s_i, M_j, X_j) = \sum_{x, a} \lambda(j-i, x, a, h', h'')\, X_j(x) M_j(a) [s_{i-1} = h'][s_i = h''] + \sum_{x} \lambda(j-i, x, h', h'')\, X_j(x) [s_{i-1} = h'][s_i = h''] + \sum_{a} \lambda(j-i, a, h', h'')\, M_j(a) [s_{i-1} = h'][s_i = h'']$    (4.6)

where $X_j(x)$ denotes the predicted likelihood of secondary structure type x at position j and $M_j(a)$ the frequency of amino acid a at position j. The label feature functions model the dependency of backbone angles on protein sequence profiles and predicted secondary structure. Equations 4.5 and 4.6 indicate that not only the state (i.e., angle distribution) itself but also the state transition depends on sequence profiles and predicted secondary structure. As shown in the third and fourth terms on the right-hand side of Equation 4.2, the state (or state transition) at one position depends on the sequence profile and secondary structure in a window of width 2w + 1, where w is set to 4 in our experiments; setting a larger window size only slightly improves sampling performance. Since secondary structure is predicted from sequence profiles, the former is not independent of the latter. Therefore, we need to consider the correlation between sequence profiles and predicted secondary structure, as shown in the first terms on the right-hand sides of Equations 4.5 and 4.6. The model parameters for the label features are identified by one or two states, a secondary structure type, an amino acid identity, and the relative position of the observations. The 2nd-order CRF model has millions of features, each corresponding to a model parameter to be trained. Once this model is trained, we can use it to sample protein conformations in a continuous space. Coupled with an energy function and a folding simulation method, we can also use it for template-free modeling.

Figure 9 A second-order Conditional Random Fields (CRF) model of protein conformation space.
Each backbone angle state depends on a window (size 9) of sequence profiles and secondary structure, and also on the states in its local neighborhood.

Model parameter training
Given a set of m proteins with sequence profiles M^i, predicted secondary structures X^i and corresponding backbone angle states S^i (i = 1, 2, …, m), our CRF model trains its parameters Λ = {λ_1, λ_2, …, λ_p} by maximizing the conditional log-likelihood L of the data:

L = Σ_i log P_Λ(S^i | M^i, X^i) − ||Λ||² / (2σ²)    (4.7)

The second term in Equation 4.7 is a regularization factor that deals with the sparsity of the training data. When the complexity of the model is high (i.e., the model has many features and parameters) and the training data is sparse, overfitting may occur, and many different models can fit the training data. Our 2nd-order CRF model has around one million parameters; we therefore place a Gaussian prior exp(−(1/2σ²) Σ λ²) on the model parameters to choose a model with "small" parameters and thus avoid overfitting. This regularization improves the generalization capability of the model in both theory and practice (Vapnik, 1998). This kind of training is also called discriminative or conditional training. Unlike the generative training in the FB5-HMM model, discriminative training directly optimizes the predictive ability of the model while ignoring the generative probability of the observations. The objective function in Equation 4.7 is convex, so in theory a globally optimal solution can be found by any efficient gradient-based optimization technique; there is, however, no analytical solution for a real-world application. Quasi-Newton methods such as L-BFGS (Liu and Nocedal, 1989) can be used to solve the above equation and usually converge to a good solution within a couple of hundred iterations. For a detailed description of how to train a CRF model, see elsewhere (Lafferty et al., 2001; Sha and Pereira, 2003).
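The effect of the Gaussian prior in Equation 4.7 can be illustrated on a toy discriminative model. The one-parameter logistic stand-in, the data, and the plain gradient ascent below are all illustrative simplifications (the real model is the 2nd-order CRF, trained with L-BFGS), chosen only to show how σ controls shrinkage:

```python
# Toy sketch of the regularized discriminative objective in Equation 4.7:
# maximize sum_i log P(y_i | x_i) - ||w||^2 / (2 sigma^2). A one-parameter
# logistic model and plain gradient ascent stand in for the CRF and L-BFGS.
import math

def train(data, sigma=5.0, lr=0.1, iters=500):
    """data: list of (x, y) pairs with scalar feature x and label y in {0, 1}."""
    w = 0.0
    for _ in range(iters):
        grad = -w / sigma**2                 # gradient of the Gaussian prior term
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-w * x))
            grad += (y - p) * x              # gradient of the log-likelihood term
        w += lr * grad
    return w

# Stronger regularization (smaller sigma) shrinks the parameter toward zero.
data = [(1.0, 1), (2.0, 1), (-1.0, 0), (-2.0, 0)]
w_loose = train(data, sigma=100.0)
w_tight = train(data, sigma=0.5)
```

On this separable toy data the weakly regularized parameter grows much larger than the strongly regularized one, which is exactly the "small parameters" effect the prior is meant to achieve.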
We revised the FlexCRFs program (Phan et al., 2005) to train our CRF model; it takes approximately 24 hours to train a single model on a cluster of 150 2-GHz CPUs. We used a set of ∼ proteins as training data. The training set is randomly divided into five subsets of the same size, which are then used for five-fold cross-validation. We trained our CRF model using three different regularization factors (i.e., σ in Equation 4.7): 25, 125, and 625, and chose the one with the best F1-value. The F1-value is a widely used measure of the predictive capability of a machine learning model; it is an even combination of precision p and recall r, calculated as 2pr/(p + r). The higher the F1-value, the better the CRF model. The average F1-values for regularization factors 25, 125, and 625 are 21.73%, 21.55%, and 22.03%, respectively. In terms of F1-value, the difference among these regularization factors is small. Therefore, we choose the small regularization factor 25 to control model complexity, since a model with lower complexity usually generalizes better to test data. The regularization factor is the only parameter that we need to tune manually; all the other model parameters (i.e., feature weights) are estimated automatically during training.

Conformation sampling and resampling
Initial conformation sampling
Once the CRF model is trained, we can sample a whole protein conformation, or resample the local conformation of a segment, by probability using a forward-backward algorithm. We first sample labels (i.e., angle distributions) by probability estimated from our CRF model and then sample real-valued angles from the labels. Let V(i, S_i, S_{i+1}) denote the sum of all the edge features associated with edge (s_i, s_{i+1}) and all the label features associated with labels S_i, S_{i+1} and (S_i, S_{i+1}). Let G(i, S_i, S_{i+1}) denote the marginal probability of a label pair (S_i, S_{i+1}). We can recursively calculate G(i, S_i, S_{i+1}) from the N-terminus to the C-terminus as follows:

G(0, S_0, S_1) = e^{V(0, S_0, S_1)}
G(i, S_i, S_{i+1}) = e^{V(i, S_i, S_{i+1})} Σ_{S_{i−1}} G(i − 1, S_{i−1}, S_i) e^{λ(S_{i−1}, S_i, S_{i+1})}

where λ(s_{i−1}, s_i, s_{i+1}) can be interpreted as a state-transition log-likelihood. Once G(N − 1, s_{N−1}, s_N) is calculated, where N is the protein size, we can sample a conformation from the C-terminus back to the N-terminus. First, we sample a label pair (s_{N−1}, s_N) for the last two positions by probability

G(N − 1, s_{N−1}, s_N) / Σ_{s_{N−1}, s_N} G(N − 1, s_{N−1}, s_N)

Then, supposing the sampled labels at positions i + 1 and i + 2 are s_{i+1} and s_{i+2}, we sample the label s_i for position i by probability

G(i, s_i, s_{i+1}) e^{λ(s_i, s_{i+1}, s_{i+2})} / Σ_{s_i} G(i, s_i, s_{i+1}) e^{λ(s_i, s_{i+1}, s_{i+2})}

Conformation resampling
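The forward pass over label pairs and the C-to-N backward sampling described for initial sampling, which segment resampling adapts, can be sketched for a toy 2nd-order chain. The state set and the functions V and lam below are hypothetical stand-ins for the trained CRF feature sums:

```python
# Sketch of G-recursion plus backward sampling for a toy 2nd-order chain.
# G[i][(a, b)] is the forward score of the label pair at positions (i, i+1);
# sampling proceeds from the C-terminal pair back to the N-terminus.
import math, random

def sample_labels(states, n, V, lam, rng):
    # forward pass over label pairs
    G = [dict() for _ in range(n - 1)]
    for a in states:
        for b in states:
            G[0][(a, b)] = math.exp(V(0, a, b))
    for i in range(1, n - 1):
        for b in states:
            for c in states:
                G[i][(b, c)] = math.exp(V(i, b, c)) * sum(
                    G[i - 1][(a, b)] * math.exp(lam(a, b, c)) for a in states)

    def draw(weights):
        keys, w = list(weights), list(weights.values())
        return rng.choices(keys, weights=w)[0]

    # backward sampling: last pair first, then one label at a time
    labels = [None] * n
    labels[n - 2], labels[n - 1] = draw(G[n - 2])
    for i in range(n - 3, -1, -1):
        b, c = labels[i + 1], labels[i + 2]
        w = {a: G[i][(a, b)] * math.exp(lam(a, b, c)) for a in states}
        labels[i] = draw(w)
    return labels
```

With uniform V and lam every label sequence is equally likely; the real model replaces them with the learned edge and label feature sums.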
The algorithm for resampling the local conformation of a randomly chosen segment is similar. We first randomly determine a segment whose backbone angles are to be resampled, and then resample the angles for this segment using a forward-backward algorithm similar to the initial conformation sampling algorithm. The major difference is that in this scenario we calculate G(i, S_i, S_{i+1}) for the segment conditioned on the labels of the two residues flanking it on the left, and, when resampling, we must also consider the two residues flanking it on the right.

Biased sampling
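As a preview of the scheme detailed in this section, the weighted segment choice (position weights 1, 5 and 3 for alpha, beta and loop; segment length uniform in [1, 15]) might be sketched as follows; the function name and encoding are illustrative:

```python
# Hypothetical sketch of biased segment selection: each position gets a
# weight from its predicted secondary structure (alpha=1, beta=5, loop=3),
# segment length is uniform in [1, 15], and the start position is drawn
# with probability proportional to the segment's summed weight.
import random

WEIGHT = {'H': 1, 'E': 5, 'C': 3}   # alpha, beta, loop

def choose_segment(ss, rng, max_len=15):
    length = rng.randint(1, min(max_len, len(ss)))
    starts = range(len(ss) - length + 1)
    seg_w = [sum(WEIGHT[ss[k]] for k in range(s, s + length)) for s in starts]
    start = rng.choices(list(starts), weights=seg_w)[0]
    return start, length
```

Beta-rich segments thus get resampled more often than alpha-rich ones of the same length.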
Our sampling method works well in alpha regions but less well in beta regions. We therefore sample beta and loop regions more frequently than alpha regions, because beta and loop regions are more varied than alpha regions; by sampling them more frequently, we can generate decoys of better quality. We achieve this by empirically assigning each position a weight depending on its predicted secondary structure type: the weights for alpha, beta and loop positions are 1, 5 and 3, respectively. These weights were determined empirically by simple enumeration on the test proteins in Table 10. To determine which segment to resample, we first sample the segment length l uniformly between 1 and 15. Then we sample the starting position of the segment by biased sampling: the weight of a segment is the sum of the weights of all positions in it, and we sample a segment of length l with probability proportional to its weight. Biased sampling is employed only when we do folding simulations using the energy function described in this work; when the energy function is not used, we use uniform sampling.

Energy function
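The progressive grid search used to combine the three energy terms, as described in this section, can be sketched as follows; the callable `avg_rmsd` and the grid values are hypothetical stand-ins for the real decoy-evaluation pipeline:

```python
# Sketch of the progressive grid search for the energy weights: DOPE is
# fixed at 1, the ESP weight is tuned first by minimizing average decoy
# RMSD, then the KMBhbond weight with DOPE and ESP frozen.
def progressive_grid_search(avg_rmsd, grid):
    """avg_rmsd(w_dope, w_esp, w_hb) -> average decoy RMSD (lower is better)."""
    w_esp = min(grid, key=lambda w: avg_rmsd(1.0, w, 0.0))
    w_hb = min(grid, key=lambda w: avg_rmsd(1.0, w_esp, w))
    return 1.0, w_esp, w_hb
```

Fixing one weight at a time keeps the search linear in the grid size rather than exponential in the number of energy terms, at the cost of ignoring interactions between the terms.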
The energy function we use for folding simulation consists of three terms: DOPE, KMBhbond, and ESP. The weight factors combining these three terms are trained on the proteins in Table 10 using grid search in a progressive way. First, we fix the weight factor of DOPE to 1 and determine the weight factor of ESP by minimizing the average RMSD of the generated decoys. Then we fix the weight factors of both DOPE and ESP and determine the weight factor of KMBhbond in the same way.

DOPE
DOPE is a full-atom, distance-dependent pairwise statistical potential originally designed by Shen and Sali and later improved by the Sosnick group (Fitzgerald et al., 2007; Shen and Sali, 2006). DOPE performs as well as or better than many other statistical potentials and force fields in differentiating a native structure from decoys. The statistical potential in DOPE distinguishes the amino acid identity and atomic identity of two interacting particles. In our folding simulation, we only build coordinates for main-chain and Cβ atoms, so only the statistical potentials related to main-chain and Cβ atoms are used to calculate the energy of a conformation. We denote this revised DOPE as DOPE-Cβ. According to (Fitzgerald et al., 2007), DOPE-Cβ is highly correlated with the full-atom DOPE. DOPE-Cβ also performs favorably in applications to intra-basin protein folding (Colubri et al., 2006).

Hydrogen bonding
KMBhbond is a statistical potential for hydrogen bonding developed by the Baker group (Morozov et al., 2004). It depends on the distance between the geometric centers of the N–H bond vector and the C=O bond vector, the bond angle between the N–H bond vector and the hydrogen bond, the bond angle between the C=O bond vector and the hydrogen bond, and the dihedral angle about the acceptor-acceptor base bond. The three angles describe the relative orientation of the bond vectors in the hydrogen bond.
ESP
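The ESP term described in this section, ESP(aa, n) = −ln[P(n | R, aa) / P(n | R)], can be estimated from simple counts once the radius of gyration R has been assigned to a bin. The observation format and the helper below are hypothetical, not the thesis implementation:

```python
# Sketch of estimating the ESP statistical potential from counts:
# ESP(aa, n) = -ln[ P(n | R, aa) / P(n | R) ], with R already binned.
import math
from collections import Counter

def esp_table(observations):
    """observations: list of (R_bin, aa, n) triples from training structures."""
    joint = Counter((r, aa, n) for r, aa, n in observations)
    by_aa = Counter((r, aa) for r, aa, n in observations)
    marg = Counter((r, n) for r, aa, n in observations)
    total_r = Counter(r for r, aa, n in observations)

    def esp(r, aa, n):
        p_cond = joint[(r, aa, n)] / by_aa[(r, aa)]   # P(n | R, aa)
        p_marg = marg[(r, n)] / total_r[r]            # P(n | R)
        return -math.log(p_cond / p_marg)
    return esp
```

A residue that sees its typical contact count n more often than average gets a negative (favorable) score; an atypical contact count gets a positive penalty. A real implementation would also smooth sparse bins.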
ESP is an approximation to the Ooi-Scheraga solvent-accessible surface area (SASA) potential (Li et al., 2008). Since our conformation representation does not contain side-chain atoms, which are necessary for calculating a solvent-accessible surface area potential, we employ a simple ESP that assigns each residue an environmental energy score. ESP is a function of the protein size and the number of Cα atoms contained within an 8.5-Å sphere centered on the residue's Cα atom (Fernandez et al., 2002). Explicitly, the ESP statistical potential has the form

ESP(aa, n) = −ln [ P(n | R, aa) / P(n | R) ]

where n is the number of Cα atoms in an 8.5-Å sphere centered on the Cα atom of the residue, R is the radius of gyration of the protein, aa is the amino acid identity of the residue, P(n | R) is the probability of observing n Cα atoms in an 8.5-Å sphere for a given protein radius regardless of amino acid identity, and P(n | R, aa) is the same probability for a given amino acid identity. We calculate ESP(aa, n) from a set of ∼ training proteins by discretizing R, which ranges from 7 Å to 39 Å in our training set. We tested the following three discretization schemes: (1) R is discretized into 65 bins of equal width 0.5 Å; (2) R is discretized into 33 bins of equal width 1 Å; and (3) R is first discretized into 33 bins of equal width 1 Å, and then the intervals [7, 9), [34, 36) and [37, 39] are each merged into a single bin to guarantee sufficient statistics. We calculated the Pearson correlation coefficient between the resulting ESP energies and the TM-scores of the decoys. The third scheme yields the best correlation and is therefore used in our energy function.

Energy minimization
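The simulated-annealing procedure described in this section might be skeletonized as follows; `energy()` and `propose()` are stand-ins for the real combined energy function and the CRF-based segment resampling, and the clash check is omitted:

```python
# Skeleton of the SA loop: initial temperature t0 = -mean(dE)/ln(p0) with
# p0 = 0.8 estimated from trial samplings, Metropolis acceptance
# min(1, exp(-dE/t)), exponential cooling t <- 0.9 t, and (100 + N) moves
# per temperature, stopping after max_conf conformations.
import math, random

def anneal(conf, energy, propose, n_res, rng, p0=0.8, cool=0.9,
           t_min=1e-3, max_conf=10000):
    # trial samplings (accept everything) to estimate the mean energy increase
    increases, c = [], conf
    for _ in range(50):
        c2 = propose(c, rng)
        d = energy(c2) - energy(c)
        if d > 0:
            increases.append(d)
        c = c2
    mean_up = sum(increases) / len(increases) if increases else 1.0
    t = -mean_up / math.log(p0)

    e, generated = energy(conf), 0
    while t > t_min and generated < max_conf:
        for _ in range(100 + n_res):
            cand = propose(conf, rng)
            d = energy(cand) - e
            if d <= 0 or rng.random() < math.exp(-d / t):
                conf, e = cand, e + d
            generated += 1
            if generated >= max_conf:
                break
        t *= cool
    return conf, e
```

On a toy quadratic energy this reliably descends from the starting conformation toward the minimum.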
We employ a simulated annealing (SA) algorithm to minimize the energy function for a given protein. The SA routine is based on the algorithm proposed by Aarts and Korst (Aarts and Korst, 1991). We start by sampling an initial conformation and then search for a better one by minimizing the energy function. Given a conformation, we propose a new conformation by resampling the local conformation of a randomly chosen small segment using the CRF model. The new conformation is rejected if there are serious steric clashes among atoms; otherwise, it is accepted with probability min(1, e^{−ΔE/t}), where ΔE is the energy increment and t is the annealing temperature. The initial annealing temperature is chosen so that at the beginning of the annealing process an energy increase is accepted with a given probability p (= 0.8); it is determined by

t_0 = −ΔE / ln p

where ΔE here denotes the average energy increase. To determine this average, we first conduct a series of trial conformation samplings in which all generated conformations are accepted, and then estimate it as the average energy increase observed in these trial samplings. During the folding simulation, we decrease the annealing temperature gradually using an exponential cooling schedule, t_{k+1} = 0.9 t_k. At each annealing temperature, the number of sampled conformations is set to (100 + N), where N is the number of residues in the protein; this number is chosen to approach thermal equilibrium. The SA process terminates when either of two conditions is satisfied: (1) the temperature is so low that almost no energy increase is accepted and the annealing process is trapped in a local minimum; or (2) the number of conformations generated in a single simulation reaches a threshold (say 10,000).

The 2nd-order CRF model is much better than the 1st-order model
We compare our 2nd-order CRF model with the 1st-order model described in Chapter 3 to see how much improvement we can achieve by considering the interdependency among three adjacent residues. To exclude the impact of an energy function in this comparison, we guide conformation search using only compactness and self-avoiding constraints rather than an energy function (see Chapter 3 for more details). In total, we tested our models on a set of 22 proteins with different structural properties. We generated ∼ decoys for each test protein. In terms of the best decoys, the 2nd-order model is better on 13 out of 22 test proteins and worse on seven. However, the best decoys may be generated by chance, so they cannot be reliably used to evaluate the performance of the two CRF models. We therefore further examine their difference in terms of the percentage of decoys with RMSD smaller than a given threshold. A general trend we observed is that for proteins that are not big (<100 residues), the 2nd-order model generally outperforms the 1st-order model by a large margin; the only exception is 4icbA. The performance difference between the two CRF models is small on relatively large proteins such as 1aa2, 1jer, and T056. This may be because a large protein tends to have a large conformation space, and neither CRF model can search a very large conformation space efficiently. The reason the 2nd-order model performs worse on 4icb is that 4icb contains a cis-proline, and the length of the virtual Cα-bond ending at this proline is approximately 3.2 Å rather than our assumed 3.8 Å. Therefore, the more accurately our CRF models predict the backbone angles, the more the decoys deviate from the native structure of 4icb. This issue is not very difficult to resolve, since from the PSI-BLAST sequence profile we can predict with 92% accuracy whether a residue is a cis-proline (data not shown). This comparison indicates that we can dramatically improve sampling efficiency by using the 2nd-order CRF model.
Table 10
Quality Comparison of the Decoys Generated by the 1st-Order and 2nd-Order CRF Models

Note: Columns 1–3 list the PDB code, protein size, and type of the test proteins. Column "Best" lists the RMSD (Å) of the best decoys; the other columns list the percentage of decoys with RMSD smaller than a given threshold. "O1" and "O2" denote the 1st-order and 2nd-order CRF models, respectively. In total, ∼ decoys were generated for each protein by each model.

PDB     Size  Class  Model  Best   ≤6Å     ≤7Å     ≤8Å     ≤9Å     ≤10Å   ≤11Å   ≤12Å
-       -     α      O1     7.34   0       0       0.035   0.245   1.05   4.27   13.5
                     O2     7.31   0       0       0.0145  0.116   0.97   4.99   17.7
1beo    98    α      O1     6.42   0       0.02    0.2     0.99    3.27   9.7    23
                     O2     5.84   0.0096  0.048   0.385   1.37    4.5    12.5   29.3
1ctfA   68    αβ     O1     3.7    2.41    7.18    16.2    31.4    51.7   77.2   95.9
                     O2     3.67   6.62    22.3    47.1    64.7    78.9   94.8   99.5
1dktA   72    β      O1     6.15   0       0.1     0.87    3.81    12.5   33.4   62.8
                     O2     5.07   0.121   1.48    5.94    16.2    34.3   59.3   82.6
1enhA   54    α      O1     2.32   22.4    32.4    44.7    61.2    85.4   98.5   100
                     O2     2.21   69.3    72      77.7    87.1    97.4   99.9   100
1fc2C   43    α      O1     1.94   49.1    64.1    85      97.6    99.7   100    100
                     O2     2.28   85.4    91.7    97.3    99.8    100    100    100
1fca    55    β      O1     4.99   0.145   1.3     6       19.9    49.7   85.5   99.3
                     O2     4.96   0.207   2.65    13.5    36.3    68.2   94     99.8
1fgp    67    β      O1     7.4    0       0       0.035   0.46    4.21   20.2   54.2
                     O2     5.94   0.0048  0.043   0.582   4.25    18.2   48.2   81.8
1jer    110   β      O1     9.64   0       0       0       0       0.005  0.12   0.91
                     O2     10.2   0       0       0       0       0      0.185  1.11
1nkl    78    α      O1     3.64   5.91    14.1    25.1    45.2    66.6   86.8   97.7
                     O2     3.06   20.3    30.1    44.2    65.3    84     96.7   99.8
1pgb    56    αβ     O1     3.15   22.3    45      65.1    81      93     98.6   99.9
                     O2     2.6    63.2    85.8    93.9    97.7    99.5   99.9   100
1sro    76    β      O1     6.22   0       0.07    0.525   2.75    10     27.1   54.7
                     O2     5.39   0.0193  0.289   1.54    5.09    14.5   32.9   60
1trlA   62    α      O1     3.53   13.5    25.3    38.5    57.4    85.1   99.1   99.9
                     O2     3.72   34.1    45.3    53.5    68      94.8   100    100
2croA   65    α      O1     2.8    16.8    31.4    47.9    63.3    79.9   93     99.6
                     O2     2.58   35.4    52.9    67.4    79.3    88.9   95.6   99.9
2gb1A   56    β      O1     2.91   23.3    45.6    65.8    81.8    93.2   98.8   99.9
                     O2     2.04   65.2    86.1    93.8    97.8    99.5   100    100
4icbA   76    α      O1     4.63   0.515   2.65    7.64    16.8    33.6   59.1   84.3
                     O2     4.4    0.125   0.6     2.85    12.3    31     58.9   88.5
T052    98    β      O1     7.58   0       0       0.01    0.035   0.135  0.8    3.52
                     O2     8.37   0       0       0       0.025   0.296  1.78   6.58
T056    114   α      O1     7.78   0       0       0.0198  0.084   0.51   2.26   7.07
                     O2     7.57   0       0       0.005   0.094   1.07   3.83   8.38
T059    71    β      O1     6.3    0       0.01    0.135   1.14    7.2    26.8   61.1
                     O2     6.21   0       0.025   0.421   3.85    17.4   45.1   77.3
T061    76    α      O1     5.36   0.01    0.37    2.89    10.7    27     50.9   77.6
                     O2     6.04   0       0.282   4.73    19.9    40.5   62.7   82.6
T064    103   α      O1     7.23   0       0       0.035   0.2     0.91   2.79   7.3
                     O2     7.47   0       0       0.032   0.412   1.53   3.62   9.05
T074    98    α      O1     4.86   0.015   0.235   1.23    4       10.4   22     41
                     O2     4.22   0.098   0.835   3.56    9.62    18.7   30.4   49.3

Comparison with FB5-HMM and fragment assembly
We further compare our two CRF models with the FB5-HMM model in (Hamelryck et al., 2006), as shown in Table 11. Here we compare FB5-HMM and our CRF models using PSIPRED-predicted secondary structure and sequence information as their input. For each test protein, FB5-HMM generated 100,000 decoys, while we generated only ∼ decoys with our CRF models. The advantage of our CRF models over the FB5-HMM model lies in that, in estimating the probability of the angles at one residue, they directly take into consideration the effects of its neighboring residues. Our 2nd-order CRF also models the relationship among three adjacent residues, whereas FB5-HMM only considers the relationship between two adjacent residues and ignores the effects of neighboring residues when estimating the probability of angles at one residue. We also compare our 2nd-order CRF model with the fragment assembly method, again without using an energy function: we revised the Rosetta code to do conformation optimization using the compactness and self-avoiding constraints instead of the Rosetta energy function. As shown in Table 11, the advantage of our 1st-order CRF model over Rosetta is not obvious. However, our 2nd-order model can generate a much larger percentage of good decoys than Rosetta for five out of six proteins; the protein 4icb is an exception, as explained in previous sections. This comparison further indicates that it is essential to use the 2nd-order model instead of the 1st-order model for template-free modeling. In terms of the quality of the best decoys, Rosetta is slightly better. One major difference between the two methods is that our CRF model uses a more simplified representation of protein conformation than Rosetta: we use pseudo backbone angles to represent a conformation, while Rosetta uses the true backbone angles (i.e., phi/psi).
The phi/psi representation has almost twice the degrees of freedom of the pseudo backbone angle representation. This may explain why our method tends to generate more decoys with RMSD smaller than 6 Å, while the best decoys generated by Rosetta tend to have smaller RMSD.

Table 11
Quality Comparison of the Decoys Generated by FB5-HMM, the 1st-Order CRF, the 2nd-Order CRF, and Rosetta

Note: For each protein, 100,000 decoys were generated by FB5-HMM, while only ∼ decoys were generated by each of the other methods. Columns "L" and "α, β" list the length, α-helices, and β-strands of the test proteins. Columns "Good" and "Best" list the percentage of good decoys (with RMSD ≤ 6 Å) and the RMSD of the best decoys, respectively.

Test proteins    FB5-HMM      1st-order CRF   2nd-order CRF   Rosetta
PDB  L  α, β     Good  Best   Good  Best      Good  Best      Good  Best

Comparison with lattice model
By combining a simple energy function and our 2nd-order CRF model, we can do template-free modeling as well as the lattice-based program TOUCHSTONE-II (see Table 12).

Performance in the blind CASP8 evaluation
We tested the performance of our method by participating in the blind CASP8 evaluation with RAPTOR++, a new protein structure prediction method. Our 2nd-order CRF model was trained before CASP8 started (in May 2008), so it is unlikely that we overfit our model to the CASP8 targets.
Multi-domain proteins
Some of the CASP target proteins are large and contain multiple domains. For those targets, we parse them into several possible domains by searching the Pfam database (Finn et al., 2008) with HMMER (Eddy, 1998; Krogh et al., 1994). If a target protein can be aligned to a single template, domain parsing is skipped. If a big chunk of the target is not aligned to any of the top templates, we treat this unaligned chunk as a separate target and model it independently. Except for the last several CASP8 targets, the models for multiple domains were not assembled into a single coordinate system.

Comparison with Robetta
We first examine the performance of our method by comparing it with Baker's Robetta server on some CASP8 hard targets, on which both Robetta and CRFFolder did template-free modeling before their experimental structures were released. These hard targets have no good templates in the PDB. It is unclear how many decoys Robetta generated for each target, but the top five models it produced for each target are available from the official CASP8 website (http://predictioncenter.org/download_area/CASP8/server_predictions/). Using our template-free modeling program CRFFolder, we generated ∼ decoys for each target.

Table 12
Performance Comparison between CRFFolder and Skolnick's TOUCHSTONE-II Note: Columns 1–3 list the PDB code, size, and type of the test proteins. Column “Best Cluster” lists the RMSDs of the representative decoys of the best clusters. Column “best” lists the RMSDs of the best decoys. The first number in parentheses denotes the rank of the best cluster and the second is the total number of clusters. Columns “1%” and “2%” list the average RMSDs of the top 1% and 2% decoys, respectively. The results of TOUCHSTONE-II are from (Zhang et al., 2003).
Target  Size  Class  TOUCHSTONE-II   CRFFolder
                     Best Cluster    Best Cluster  Best  1%  2%

As shown in Table 13, overall CRFFolder is better than Robetta by ∼.

Comparison with template-based methods
Our program CRFFolder can also generate 3D models better than template-based methods for a couple of hard CASP8 targets. According to the CASP8 official assessment, if only the first-ranked models are evaluated, CRFFolder produced the best model among all the CASP8 human and server groups for T0510_D3, a small alpha/beta protein with 43 residues (http://predictioncenter.org/casp8/results.cgi). T0510_D3 was treated as a free-modeling target by CASP8, while Grishin et al. classified it as a fold-recognition target (Shi et al., 2009). We also examined all the template-based models generated by our threading methods in CASP8 for this target; the best template-based model has a TM-score of 0.339.

Table 13
Performance Comparison (TM-Score) between CRFFolder and Robetta on Some CASP8 Hard Targets
Target     Size  Class  Robetta  CRFFolder
T0397_D1   70    αβ
T0465      157   αβ
T0468      109   αβ
T0476      108   αβ
T0482      120   αβ
T0496_D1   110   αβ
T0496_D2   68    α
T0510_D3   43    αβ
T0513_D2   77    αβ

CRFFolder also produced one of the best models for T0496_D1, better than the template-based models. Both CASP8 and Grishin et al. classified T0496_D1 as a free-modeling target (Shi et al., 2009). In fact, the first-ranked model we submitted for T0496_D1 is much worse than the best decoy we generated for this target among the ∼ decoys produced. The CASP8 assessment by Grishin's group (http://prodata.swmed.edu/CASP8/evaluation/DomainsAll.First.html) defined 146 domains and classified them into five categories. As shown in Columns 3–5 of Table 14, the best models generated by RAPTOR++ for FM targets are much better than the first-ranked models, which indicates that we still need to improve our model selection method for FM targets. As shown in Columns 5–7, for FM targets the best models submitted by all CASP8 servers are much better than the best models generated by RAPTOR++; this means that, in addition to improving model selection, we also need to further improve our model generation method for FM targets. We make similar observations when Grishin's domain definition and classification are used. Our template-free modeling method samples protein conformations in a continuous space without using fragments from the PDB. It aims to overcome two major issues with the currently popular fragment assembly and lattice model methods, both of which may exclude the native structure from the search space: sampling in a discrete space can do so because a small change in a backbone angle can result in a totally different fold, and assembling a structure from even medium-sized fragments can do so because a "new fold" appears to be composed of rarely occurring super-secondary structure motifs (Andras Fisher, CASP8 talk). Compared to the Robetta server (see Table 13), our method performs very well on mainly-alpha proteins, e.g., T0460, T0496_D1 and T0496_D2.
This is not surprising, since our CRF model captures the local sequence-structure relationship well. Our method also works well on small mainly-beta proteins; for example, it is better than Robetta on T0480 and T0510_D3. However, our method does not fare well on relatively large proteins (>100 residues) with a few beta strands, e.g., T0482 and T0513_D2. This is probably because our CRF method can only model the local sequence-structure relationship, while a beta sheet is stabilized by non-local hydrogen bonding. Although it samples in a continuous space, our method can still efficiently search the conformation space of a small beta protein; for a large protein with a few beta sheets, however, the search space is too big to be explored efficiently by our continuous conformation sampling algorithm. It is also worth noting that, compared to Robetta, our method works well on T0397_D1 and T0496_D1, which, according to Nick Grishin, are the only two CASP8 targets with really new folds.

Table 14
Summarized results of RAPTOR++ predictions of free modeling targets in CASP8. Note: The upper half table contains the results of 13 CASP8 official domains and the lower half contains the results of 35 domains by Grishin’s definition (http://prodata.swmed.edu/CASP8/evaluation/). R1: GDT-TS score sum of the first-ranked models by RAPTOR. RB: GDT-TS score sum of the best models submitted by RAPTOR. RBAll: GDT-TS score sum of the best models generated by RAPTOR. S1: GDT-TS score sum of the best first models submitted by all servers. SB: GDT-TS score sum of the best models submitted by all servers.
FR: targets whose top-10 first-model GDT_TS average is ≥ 30
FM: targets whose top-10 first-model GDT_TS average is < 30

Category                     R1       RB       RBAll    S1       SB
CASP8 official definition
FM (13)                      393.20   459.24   511.74   591.34   646.47
Grishin's definition
FR (30)                      997.41   1125.73  1242.65  1386.72  1456.00
FM (5)                       106.46   117.58   131.59   157.11   168.15

This chapter has presented a probabilistic and continuous model of protein conformation space for template-free modeling. By using the 2nd-order CRF model and directional statistics, we can accurately describe the protein sequence-angle relationship and explore a continuous conformation space by probability, without worrying that the native fold is excluded from our conformation space. This method overcomes the following limitations of the fragment assembly method: (1) fragment assembly samples conformations in a discrete space; and (2) fragment assembly is not really template-free, since it still uses short fragments (e.g., 9-mers) extracted from the PDB. Both restrictions may cause a loss of prediction accuracy. Even though we use a simple energy function to guide conformation search, our probabilistic model enables us to do template-free modeling as well as two well-developed programs, TOUCHSTONE-II and Robetta, both of which have been developed for many years and have well-tuned, sophisticated energy functions. Our template-free modeling is much better than TOUCHSTONE-II on alpha proteins and has similar performance on mainly-beta proteins. Blindly tested on some CASP8 hard targets, our method is also better than the Robetta server on quite a few (mainly-alpha and small beta) proteins but worse on some relatively large beta-containing proteins. Finally, our method also generated 3D models better than template-based methods for a couple of CASP8 targets (i.e., the mainly-alpha target T0496_D1 and the small alpha/beta target T0510_D3). The good performance on alpha proteins indicates that our 2nd-order CRF model captures the local sequence-structure relationship of alpha proteins well. The good performance on small beta proteins indicates that, by sampling in a continuous space, we can explore the conformation space of small beta proteins more thoroughly.
To improve the performance of our template-free modeling on relatively large beta-containing proteins, we need to further improve our probabilistic model of beta regions and develop a better hydrogen-bonding energy term for the formation of beta sheets.

Chapter 5. CNF-Folder: Modeling with Nonlinear Features
This chapter presents a new probabilistic graphical model, Conditional Neural Fields (CNF), for ab initio protein folding; see (Peng et al., 2009) for a detailed exposition. CNF is similar to, but much more powerful than, CRF for modeling sequential data, in that CNF can naturally model the nonlinear relationship between input and output while CRF cannot. Thus, CNF can better model the sophisticated relationship between backbone angles, sequence profile and predicted secondary structure, estimate the probability distribution of backbone angles more accurately, and sample protein conformations more efficiently. In addition, this work differs from that of Chapters 3 and 4 in two ways: (1) we developed a Replica Exchange Monte Carlo (REMC) method for folding simulation instead of using simulated annealing (SA); and (2) we use the position-specific scoring matrix (PSSM) generated by PSI-BLAST as the input of our CNF model instead of the position-specific frequency matrix (PSFM). The REMC method enables us to minimize the energy function to a lower level and thus potentially produce better decoys. Also, it has been shown that PSSM contains more information than PSFM for structure prediction tasks such as secondary structure prediction. We did not use PSSM with CRF because CRF cannot easily take PSSM as input; by contrast, we can easily feed PSSM into our CNF model. We will show that our new method is much more effective than our previous one: it dramatically improves sampling efficiency and generates much better decoys on a variety of test proteins.
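The REMC scheme mentioned above can be sketched in miniature: several replicas run Metropolis moves at fixed temperatures and periodically attempt to swap configurations between neighboring temperatures with the standard criterion min(1, exp((1/t_i − 1/t_j)(E_i − E_j))). The `energy()` and `propose()` callables are stand-ins, not the CNF-Folder implementation:

```python
# Toy sketch of Replica Exchange Monte Carlo: Metropolis moves within each
# replica at its own temperature, plus periodic swap attempts between
# neighboring temperatures using the standard exchange criterion.
import math, random

def remc(x0, energy, propose, temps, rng, sweeps=200, moves=20):
    reps = [x0] * len(temps)
    for _ in range(sweeps):
        # Metropolis moves within each replica
        for k, t in enumerate(temps):
            for _ in range(moves):
                cand = propose(reps[k], rng)
                d = energy(cand) - energy(reps[k])
                if d <= 0 or rng.random() < math.exp(-d / t):
                    reps[k] = cand
        # swap attempts between neighboring temperatures
        for k in range(len(temps) - 1):
            d = (1 / temps[k] - 1 / temps[k + 1]) * (
                energy(reps[k]) - energy(reps[k + 1]))
            if d >= 0 or rng.random() < math.exp(d):
                reps[k], reps[k + 1] = reps[k + 1], reps[k]
    return min(reps, key=energy)
```

The hot replicas cross energy barriers easily while the cold ones refine low-energy states, which is why REMC can reach lower energies than a single SA run.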
A 2nd-order CNF model of conformation space
The Conditional Random Fields (CRF) methods developed earlier for protein conformation sampling use a linear combination of input features (i.e., PSI-BLAST sequence profile and predicted secondary structure) to estimate the probability distribution of backbone angles. This kind of linear parameterization implicitly assumes that all the features are linearly independent, which contradicts the fact that some input features are highly correlated. For example, the predicted secondary structure is correlated with the sequence profile since the former is usually predicted from the latter using tools such as PSIPRED (Jones, 1999). An easy way to model the correlation between predicted secondary structure and sequence profile is to explicitly enumerate all possible combinations of secondary structure type and amino acid identity in the linear CRF model. In fact, we can always combine some basic features to form a complex feature. However, explicitly defining complex features introduces several serious issues. First, it results in a combinatorial explosion in the number of complex features, and hence in the model complexity; it is challenging to train a model with a huge number of parameters without overfitting. Second, explicit enumeration may miss some important complex features. For example, the CRF model presented in Chapters 3 & 4 (Zhao et al., 2008; Zhao et al., 2009) does not accurately model the correlation among sequence information at several adjacent positions. Finally, explicit enumeration of complex features may also introduce a large number of unnecessary features, which increases the running time of probability estimation. Instead of explicitly enumerating all possible nonlinear combinations of the basic sequence and structure features, we can use a better graphical model to implicitly account for the nonlinear relationship between sequence and structure.
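To make the idea of implicit nonlinear feature extraction concrete, the sketch below implements a toy hidden layer of logistic gate functions over a raw feature vector. The dimensions, parameter values and function names here are illustrative assumptions, not the trained model described in this chapter.

```python
import math

def gate(theta, x):
    """Logistic gate function G_theta(x) = 1 / (1 + exp(-theta . x)).
    Each gate nonlinearly transforms the raw feature vector x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def hidden_layer(thetas, x):
    """Map raw input features x (e.g., a window of sequence profile and
    predicted secondary structure) to K implicit nonlinear features,
    one per gate."""
    return [gate(theta, x) for theta in thetas]

# Toy example: K=2 gates over a 3-dimensional feature vector.
thetas = [[1.0, -2.0, 0.5], [0.3, 0.3, 0.3]]
x = [0.2, 0.1, 0.7]
h = hidden_layer(thetas, x)  # two nonlinear features, each in (0, 1)
```

A downstream linear model over `h` can then capture feature interactions that a linear model over `x` alone cannot.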
Very recently, we have developed a new probabilistic graphical model, Conditional Neural Fields (CNF) (Peng et al., 2009), which can implicitly model the nonlinear relationship between input and output. As shown in Figure 10, CNF consists of at least three layers: one or more hidden layers, an input layer (i.e., sequence profile and secondary structure) and an output layer (i.e., backbone angles), while CRF consists of only two layers: input and output. The relationship between the backbone angles and the hidden layer is still linear. However, the hidden layer uses gate functions to nonlinearly transform the input features into complex features. Here we use the logistic function G_θ(x) = 1 / (1 + exp(−θᵀx)) as the gate function, where θ is the parameter vector and x a feature vector. CNF can also be viewed as a seamless integration of CRF and neural networks (NN). The neurons in the hidden layer automatically extract nonlinear relationships among input features. Therefore, without explicit enumeration, CNF can directly model the nonlinear relationship between input and output. The training of a CNF model is similar to that of a CRF, but more complicated. We have tested this CNF model for protein secondary structure (SS) prediction from sequence profiles. Table 15 compares the performance of various machine learning methods for SS prediction. The results are averaged over a 7-fold cross-validation on the CB513 data set, except that SPINE uses 10-fold cross-validation. As shown in Table 15, by using only one hidden layer to model the nonlinear relationship between output and input, CNF achieves almost 10% relative improvement over CRF. CNF also outperforms other methods including SVMpro (Hua and Sun, 2001), SVMpsi (Kim and Park, 2003), YASSPP (Karypis, 2006), PSIPRED (Jones, 1999), SPINE (Dor and Zhou, 2007) and TreeCRFpsi (Dietterich et al., 2004). The linear CRF is the worst since it does not model the nonlinear relationship between secondary structure and sequence profile.
This result indicates that we can indeed benefit from modeling the nonlinear sequence-structure relationship. We expect that using CNF, we are able to model the sequence-angle relationship more accurately and thus sample conformations more efficiently. In the context of CNF, the PSI-BLAST sequence profile (i.e., position-specific scoring matrix) and predicted secondary structure are viewed as observations; the backbone angles and their FB5 distributions are treated as hidden states or labels. Similar to what we discussed in Chapter 4, let H denote the 100 groups (i.e., states or labels) generated from clustering of the backbone angles. Each group is described by an FB5 distribution. Given a protein with solved structure, we calculate its backbone angles at each position and determine the group among the 100 to which the angles at that position belong. Let S = (s_1, s_2, ..., s_L), s_i ∈ H, denote such a sequence of states/labels (i.e., FB5 distributions) for this protein. We also denote the sequence profile of this protein as M and its secondary structure as X. As shown in Figure 10, our CNF model defines the conditional probability of S given M and X in the same form as in Equation (4.1):

P_w(S | M, X) = exp(Σ_i F(S, M, X, i)) / Z(M, X)

where w is the vector of model parameters and Z(M, X) = Σ_S exp(Σ_i F(S, M, X, i)) is a normalization factor summing over all possible label sequences for the given M and X. F(S, M, X, i) consists of two edge feature functions and one label feature function at position i:

F(S, M, X, i) = e_1(s_{i−1}, s_i) + e_2(s_{i−1}, s_i, s_{i+1}) + ψ(s_{i−1}, s_i, M, X, i)    (5.1)

where e_1(s_{i−1}, s_i) and e_2(s_{i−1}, s_i, s_{i+1}) are the 1st-order and 2nd-order edge feature functions, respectively, and ψ(s_{i−1}, s_i, M, X, i) is the label feature function, which depends on the sequence profile and secondary structure in a window centered at position i. The edge functions describe the interdependency between two or three neighboring labels. CNF differs from CRF in the label feature function. In CRF, the label feature function is defined as a linear combination of features. In CNF, there is an extra hidden layer between the input and output, which consists of K gate functions (see Figure 10). The K gate functions extract a K-dimensional implicit nonlinear representation of the input features. Therefore, CNF can be viewed as a CRF with its inputs being K homogeneous hidden feature-extractors at each position.

Figure 10. A 1st-order CNF model consists of three layers: input, output and hidden layer. A 2nd-order model is similar but not shown for simplicity. By contrast, a CRF model consists of only input and output.

Table 15. Secondary structure prediction accuracy comparison with the current top predictors. Note: Q3 denotes the percentage of correctly predicted secondary structure.

Method      Q3 (%)    Method    Q3 (%)
CRF         72.3      CNF       80.1
TreeCRFpsi  77.6      YASSPP    77.8
SVMpro      73.5      PSIPRED   76.0
SVMpsi      76.6      SPINE     76.8

The label feature function of CNF is defined as follows:

ψ(s_{i−1}, s_i, M, X, i) = Σ_{k=1}^{K} w_{s_{i−1}, s_i, k} G_{θ_k}(f(M, X, i))    (5.2)

That is, the label feature function is a linear combination of K gate functions G. In the above definition, w is the parameter vector and f is a vector of basic features at position i. In our current implementation, f contains 23×9 (=207) elements, corresponding to the sequence profile and secondary structure information in a window of size 9 centered at position i. We use PSIPRED to predict the secondary structure of a protein from its sequence profile. PSIPRED generates a likelihood score of the three secondary structure types for each residue, which is used as the input of our CNF model. Similar to CRF, we use the maximum likelihood method to train the model parameters such that P_w(S | M, X) is maximized. That is, we maximize the probability of observing a set of ~3000 non-redundant high-resolution protein structures. Although both the output and hidden layers contain model parameters, all the parameters can be learned together by gradient-based optimization. We use LBFGS (Liu and Nocedal, 1989) as the optimization routine to search for the optimal model parameters. Since CNF contains a hidden layer of gate functions G, the log-likelihood function is no longer convex. Therefore, it is very likely that we can only obtain a locally optimal solution of the model parameters. To achieve a good solution, we run the training algorithm several times and use the solution with the best objective value as the final solution of the model. See (Peng et al., 2009) for a detailed description of training CNF.

Model parameter training
To make a fair comparison between our previous CRF model and this CNF model, we used exactly the same data to train both: a set of ~3000 non-redundant proteins. Any two proteins in the training set share no more than 30% sequence identity, and every training protein has a resolution of 2.0Å or better. To avoid overlap between the training data and the test proteins, we removed the following proteins from our training set: 1) proteins sharing at least 25% sequence identity with our test proteins; 2) proteins in the same fold class as our test proteins according to the SCOP classification; and 3) proteins having a TM-score (Zhang and Skolnick, 2007) of at least 0.5 with our test proteins. Finally, the training data was prepared before CASP8 started; therefore, we can use our CRF/CNF models to test the CASP8 free-modeling targets without worrying about bias. The training set is randomly divided into five sets of equal size for 5-fold cross-validation. To train a CNF model, we must determine the number of gate functions in the hidden layer. In addition, since the CNF model contains a very large number of parameters, we must also control the model complexity to avoid overfitting. We achieve this by regularizing the L2-norm of the model parameters with a regularization factor. We trained our CNF model by enumerating the number of gate functions (50, 100, 200, and 300) and different regularization factors (25, 50, 100, and 200) to see which combination yields the best F1-value. The F1-value is widely used to measure the prediction capability of a machine learning model; it is the harmonic mean of precision p and recall r, defined as F1 = 2pr / (p + r). The higher the F1-value, the better the CNF model. Our CNF model achieves the best F1-value (23.44%) when 200 gate functions are used with regularization factor 50.
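The F1 criterion used above for model selection can be computed as follows; the precision and recall values fed in below are illustrative, not the actual cross-validation results.

```python
def f1_score(precision, recall):
    """F1 = 2pr / (p + r): the harmonic mean of precision p and recall r."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# E.g., precision 0.25 and recall 0.22 give an F1 on the order of
# the ~23% values reported above.
score = f1_score(0.25, 0.22)
```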
By contrast, the best F1-value achieved by our previous CRF method is 22.0%. The F1-value improvement achieved by CNF over CRF may not seem large, partly because 100 labels in total are used in our models. Later we will show that CNF performs conformation sampling much better than CRF.

Conformation sampling and resampling
Using the trained CNF model, we can sample the whole conformation of a protein, or propose a new conformation from an existing one by resampling the local conformation of a segment. This procedure is very similar to the conformation sampling algorithm of our CRF method described in Chapters 3 & 4. That is, we use the forward-backward algorithm to first sample labels (i.e., angle distributions) according to the probability estimated by our CNF model, and then sample real-valued angles from the labels. See Chapter 3 for a detailed description of the algorithm.
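The two-stage sampling procedure (labels first, then real-valued angles) can be sketched as follows. This is a simplified stand-in: labels are drawn independently per position from given marginals rather than by the forward-backward algorithm, and a Gaussian replaces the FB5 distribution; all names and values are illustrative.

```python
import random

random.seed(0)

def sample_label(probs):
    """Draw one label index from a categorical distribution."""
    r, acc = random.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if r < acc:
            return k
    return len(probs) - 1

def sample_conformation(label_probs, label_angles, sigma=0.1):
    """For each position: sample a label, then sample (theta, tau)
    angles near that label's mean angles. A Gaussian stand-in is used
    here in place of the FB5 distribution for simplicity."""
    conf = []
    for probs in label_probs:
        k = sample_label(probs)
        theta, tau = label_angles[k]
        conf.append((random.gauss(theta, sigma), random.gauss(tau, sigma)))
    return conf

# Toy model: 3 positions, 2 labels with mean (theta, tau) angles in radians.
label_angles = [(1.9, 0.8), (2.1, -2.9)]
label_probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
conf = sample_conformation(label_probs, label_angles)
```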
Replica exchange Monte Carlo simulation
The energy function we use for folding simulation consists of three items: DOPE (a pairwise statistical potential) (Fitzgerald et al., 2007; Shen and Sali, 2006), KMBhbond (a hydrogen-bonding energy) (Morozov et al., 2004) and ESP (a simplified solvent accessibility potential) (Fernandez et al., 2002). We use the weight factors previously trained for the CRF model for these three energy items, so the energy function is not biased towards our CNF method. The weight factor for DOPE is always fixed to 1, so only two weight factors need to be determined. See Chapter 4 for a detailed description of weight determination. Previously we employed a simulated annealing (SA) algorithm to minimize the energy function, based upon the algorithm proposed by Aarts and Korst (Aarts and Korst, 1991). In this work, we employ a Replica Exchange Monte Carlo (REMC) method (Earl and Deem, 2005; Swendsen and Wang, 1986). Using REMC, we can minimize the energy function to lower values and thus produce better decoys for most of our test proteins. Our REMC method employs 20 replicas and the highest temperature is set to 100; the temperature of replica i (i = 1, 2, ..., 20) is set to 5i. We have also tested other temperature assignments, but have not seen much difference in folding performance. Each replica consists of 24,000 time steps. At each time step a new conformation is proposed and then accepted with probability min{1, exp(−ΔE/T_i)}, where ΔE is the energy difference between the new and old conformations and T_i is the temperature of the replica. The conformations of two neighboring replicas are exchanged every 30 time steps, so in total 800 conformation exchange events occur between two neighboring replicas during the whole folding simulation. Yielding only the single decoy with the lowest energy at the end of the folding simulation would make the simulation process very inefficient.
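The replica-exchange schedule described above (temperatures T_i = 5i, Metropolis acceptance, periodic neighbor swaps) can be sketched as follows on a toy one-dimensional energy. The energy and proposal functions are placeholders for the real conformation moves, and the standard swap criterion between neighboring temperatures is assumed.

```python
import math, random

random.seed(1)

def metropolis_accept(delta_e, temperature):
    """Accept a proposed move with probability min(1, exp(-dE/T))."""
    return delta_e <= 0 or random.random() < math.exp(-delta_e / temperature)

def remc(energy, propose, x0, n_replicas=20, steps=24000, swap_every=30):
    """Replica exchange Monte Carlo with temperatures T_i = 5*i, as in the
    schedule described above. `energy` and `propose` are problem-specific;
    a toy quadratic energy is used below."""
    temps = [5.0 * (i + 1) for i in range(n_replicas)]
    states = [x0] * n_replicas
    energies = [energy(x0)] * n_replicas
    best = (energies[0], x0)
    for t in range(steps):
        for i in range(n_replicas):
            x_new = propose(states[i])
            e_new = energy(x_new)
            if metropolis_accept(e_new - energies[i], temps[i]):
                states[i], energies[i] = x_new, e_new
                if e_new < best[0]:
                    best = (e_new, x_new)
        if t % swap_every == 0:
            for i in range(n_replicas - 1):
                # Standard replica-swap criterion between neighbors:
                # accept with probability min(1, exp((1/Ti - 1/Tj)(Ei - Ej))).
                d = (1 / temps[i] - 1 / temps[i + 1]) * (energies[i] - energies[i + 1])
                if d >= 0 or random.random() < math.exp(d):
                    states[i], states[i + 1] = states[i + 1], states[i]
                    energies[i], energies[i + 1] = energies[i + 1], energies[i]
    return best

# Toy run: minimize E(x) = x^2 starting from x = 8.
e_min, x_min = remc(lambda x: x * x,
                    lambda x: x + random.uniform(-0.5, 0.5),
                    8.0, n_replicas=4, steps=2000)
```

Hot replicas cross energy barriers while cold replicas refine low-energy states; swapping lets good states migrate toward the coldest replica.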
To generate more decoys from a single folding simulation, we output the final decoy of each replica as long as its energy is within 15% of the lowest energy achieved. Experimental results indicate that, on average, each folding simulation generates ~10 decoys. In Chapter 4, we demonstrated that our CRF method compares favorably with the popular fragment-based Robetta server in the CASP8 blind prediction. In this chapter, we focus on the comparison between our CNF and CRF methods and show that the new method is indeed superior to our previous one. We test our new method using two datasets, both of which were used to evaluate our previous method. The first dataset consists of 22 proteins: 1aa2, 1beo, 1ctfA, 1dktA, 1enhA, 1fc2C, 1fca, 1fgp, 1jer, 1nkl, 1pgb, 1sro, 1trlA, 2croA, 2gb1A, 4icbA, T052, T056, T059, T061, T064 and T074. These proteins have very different secondary structure types and their sizes range from 40 to 120 residues. Some proteins (e.g., T052, T056, T059, T061, T064 and T074) in this dataset are very old CASP targets; therefore, we denote this dataset the "old testset". The second dataset contains 12 CASP8 free-modeling targets: T0397_D1, T0405_D1, T0405_D2, T0416, T0443_D1, T0443_D2, T0465, T0476, T0482, T0496_D1, T0510_D3 and T0513_D2. These proteins are called free-modeling targets because a structurally similar template cannot be identified for them by a template-based method. We denote this dataset the "CASP8 testset". To avoid bias, we removed all proteins similar to the first dataset from our training set (see the section Model parameter training). Since the training set was constructed before CASP8 started, there is no overlap between our training data and the CASP8 testset.
Performance on the old test set
As shown in Table 16, we evaluate our CNF and CRF methods in terms of their capability of generating good decoys. We run both methods on each test protein and generate a similar number of decoys (5,000-10,000). Each decoy is compared to its native structure and its RMSD to the native is calculated. Then we rank all the decoys of one test protein in ascending order of RMSD. Finally, we calculate the average RMSD of the top 1%, 2%, 5% and 10% of decoys, respectively. We do not compare the two methods using the best decoys because the best decoys may be generated by chance, and usually the more decoys are generated, the better the best decoys will be. In terms of the average RMSD of the top 5% or 10% of decoys, our CNF method outperforms the CRF method on all test proteins except 1ctfA, 1dktA, 1fc2C and 1fgp. The CNF method reduces the average RMSD of the top 10% of decoys by at least 1Å for many proteins, such as 1aa2, 1beo, 1fca, 1pgb, 1sro, 2gb1A, 4icbA, T052, T056, T059, T061 and T064. Furthermore, our CNF method dramatically reduces the average RMSD of the top 10% of decoys for some proteins. For example, it reduces the average RMSD of the top 10% of decoys for 4icbA from 8.0Å to 5.2Å, for T056 from 11.1Å to 7.2Å and for T061 from 7.6Å to 5.6Å. Even for test proteins (e.g., 1enhA, 1pgb and 2gb1A) on which the CRF method already performed well, our CNF method still improves considerably.

Performance on the CASP8 test set
To further compare our CRF and CNF methods, we also evaluate them on the 12 CASP8 free-modeling (FM) targets, for which structurally similar templates could not be identified during the CASP8 competition. As before, we evaluate both methods in terms of the average RMSD of the top 1%, 2%, 5% and 10% of decoys. Compared to CRF, our CNF method does not significantly worsen the decoy quality on any of the 12 CASP8 targets. Instead, our CNF method outperforms the CRF method on 10 of the 12 targets and yields only slightly worse performance on the other two, T0397_D1 and T0482. In particular, our CNF method reduces the average RMSD of the top 10% of decoys by at least 1Å for the following seven targets: T0405_D1, T0405_D2, T0416_D2, T0443_D2, T0476, T0496_D1, and T0510_D3. Our CNF method reduces the average RMSD of the top 10% of decoys for T0510_D3 from 9.1Å to 6.3Å and for T0496_D1 from 10.1Å to 8.1Å. Even for T0416_D2, a target on which our CRF method performed well, our CNF method improves the average RMSD of the top 10% of decoys by 1Å. We have also examined the average TM-score/GDT-TS of the top 10% of decoys; on average, our CNF method is better than the CRF method by ~10% (data not shown due to space limitation).

Table 16
Performance of the CNF and CRF methods on the old test set. Note: Column "s/t" lists the size and secondary structure content of the test proteins. Column "M" indicates the method: "N" and "R" represent the CNF and CRF methods, respectively. Column "x%" lists the average RMSD (Å) of the decoys among the top x% of the generated decoys. Column "best" lists the RMSD of the best decoy.

Protein  s/t        M   best  1%    2%    5%    10%
1aa2     108, 5α    N   6.0   7.0   7.6   8.5   9.2
                    R   7.1   9.0   9.4   10.0  10.4
1beo     98, 5α     N   5.5   6.1   6.5   7.4   8.3
                    R   5.6   7.2   7.8   8.7   9.3
1ctfA    68, 3α3β   N   3.6   4.5   4.8   5.4   6.1
                    R   3.3   3.9   4.1   4.6   5.2
1dktA    72, 4β     N   4.5   5.1   5.5   6.2   6.9
                    R   4.5   5.0   5.3   5.9   6.6
1enhA    54, 3α     N   1.5   2.0   2.1   2.3   2.4
                    R   2.1   2.6   2.7   2.9   3.0
1fc2C    43, 2α     N   2.0   2.3   2.4   2.5   2.6
                    R   2.1   2.3   2.3   2.4   2.4
1fca     55, 4β     N   3.2   3.9   4.2   4.6   5.0
                    R   5.0   5.6   5.8   6.2   6.4
1fgp     67, 6β     N   6.4   7.5   8.0   8.6   9.1
                    R   6.6   7.3   7.6   8.1   8.6
1jer     110, 2α6β  N   9.6   10.8  11.1  11.6  12.1
                    R   10.0  11.5  11.9  12.4  12.8
1nkl     78, 5α     N   1.8   2.5   2.6   2.8   3.0
                    R   2.3   2.8   2.9   3.2   3.4
1pgb     56, 1α4β   N   1.4   1.9   2.0   2.3   2.6
                    R   2.2   3.0   3.2   3.5   3.7
1sro     76, 6β     N   4.2   5.2   5.9   6.7   7.4
                    R   5.1   6.4   6.9   7.7   8.4
1trlA    62, 6α     N   3.2   3.6   3.7   3.9   4.1
                    R   3.9   4.2   4.4   4.5   4.7
2croA    65, 5α     N   1.8   2.2   2.3   2.4   2.5
                    R   2.2   2.5   2.6   2.7   2.8
2gb1A    56, 1α4β   N   1.7   1.9   2.0   2.3   2.6
                    R   1.9   3.1   3.3   3.6   3.8
4icbA    76, 4α     N   4.1   4.8   4.9   5.1   5.2
                    R   5.3   6.1   6.5   7.3   8.0
T052     98, 8β     N   7.6   8.1   8.5   9.1   9.6
                    R   8.6   9.6   10.0  10.7  11.3
T056     114, 6α    N   4.1   4.9   5.3   6.1   7.2
                    R   7.9   9.4   9.7   10.3  11.1
T059     71, 7β     N   5.7   6.9   7.3   7.7   8.1
                    R   6.9   8.4   8.7   9.2   9.6
T061     76, 4α     N   2.8   3.4   3.7   4.6   5.6
                    R   5.9   6.6   6.8   7.2   7.6
T064     103, 8α    N   6.5   7.0   7.2   7.5   7.9
                    R   5.9   7.1   7.5   8.2   8.9
T074     98, 4α     N   3.7   5.0   5.4   5.9   6.3
                    R   5.0   6.0   6.4   6.7   6.9

Table 17. Performance of our CNF and CRF methods on the CASP8 test set. Columns are as in Table 16.

Target    s/t         M   best  1%    2%    5%    10%
T0397_D1  70, 7β      N   6.4   8.2   8.5   9.0   9.4
                      R   7.0   8.0   8.3   8.9   9.4
T0405_D1  80, 4α      N   5.0   5.4   5.5   5.7   5.9
                      R   5.7   6.6   6.8   7.1   7.4
T0405_D2  112, 3α6β   N   7.1   9.0   9.5   10.1  10.5
                      R   8.5   10.1  10.5  11.0  11.5
T0416_D2  57, 4α      N   1.4   1.9   2.1   2.3   2.6
                      R   1.6   2.6   2.8   3.3   3.6
T0443_D1  86, 6α      N   4.8   6.0   6.4   7.2   7.9
                      R   5.6   7.1   7.7   8.3   8.7
T0443_D2  114, 2α8β   N   9.3   10.6  10.9  11.5  11.9
                      R   10.4  11.9  12.3  12.9  13.4
T0465     157, 5α8β   N   11.0  11.8  12.2  12.9  13.5
                      R   10.2  12.2  12.7  13.4  13.9
T0476     108, 4α6β   N   5.3   6.3   6.8   7.4   8.0
                      R   5.9   7.8   8.2   8.7   9.3
T0482     120, 3α5β   N   10.7  11.9  12.2  12.8  13.2
                      R   8.8   10.9  11.5  12.3  13.0
T0496_D1  110, 3α6β   N   5.7   6.2   6.6   7.3   8.1
                      R   6.3   8.2   8.7   9.5   10.1
T0510_D3  44, 1α3β    N   3.0   4.0   4.5   5.3   6.3
                      R   4.7   7.2   7.7   8.6   9.1
T0513_D2  77, 2α4β    N   7.5   8.4   8.7   9.1   9.5
                      R   8.0   9.3   9.6   10.0  10.4

We have also examined the relationship between RMSD and energy. Due to space limitation, we visualize the RMSD-energy relationship only for several typical targets: T0397_D1, T0416_D2, T0476, T0482, T0496_D1 and T0510_D3, as shown in Figure 11. Note that in the figure, we normalize the energy of each decoy by the mean and standard deviation of the energies of all the decoys of the target.
By energy normalization, we can clearly see the energy difference between the decoys generated by the CNF and CRF methods. Figure 11 clearly demonstrates that our CNF method can generate decoys with much lower energy than the CRF method. However, decoys with lower energy may not have better quality if the correlation between RMSD and energy is weak. For example, our CNF method generates decoys for T0397_D1 and T0482 with much lower energy, but cannot improve decoy quality for them. To improve the decoy quality for T0397_D1 and T0482, we have to improve the energy function. By contrast, the correlation between RMSD and energy is positive for T0416_D2, T0476, T0496_D1 and T0510_D3. Therefore, we can improve decoy quality for these four targets by generating decoys with lower energy. Our CNF method dramatically improves the decoy quality on T0416_D2 over the CRF method, as shown in Figure 11(b). The underlying reason is that our CNF method can estimate the backbone angle probability more accurately. Around half of the decoys generated by the CRF method for T0416_D2 are mirror images of the other half. These mirror images are introduced by non-native-like backbone angles around one residue.

Figure 11. The relationship between RMSD (y-axis) and energy (x-axis) for (a) T0397_D1, (b) T0416_D2, (c) T0476, (d) T0482, (e) T0496_D1 and (f) T0510_D3. The red and blue colors represent the CRF and CNF methods, respectively. See text for the energy normalization method.

Figure 12. Two typical mirror images generated by the CRF method for T0416_D2. The decoys in blue and gold represent the lower and upper regions in Figure 11(b), respectively.

Comparison with CASP8 models
Specific examples
In CASP8, we made predictions with the CRF method for T0476, T0496_D1 and T0510_D3, but not for T0416_D2 because our CRF method was not ready at the beginning of CASP8. The server model generated by our CRF method for T0510_D3 is among the best CASP8 server models (available at http://predictioncenter.org/casp8/results.cgi). Our CNF method further improves the predictions for these four targets over the CRF method.
T0416_D2.
The first and best cluster centroids have GDT-TS 69.3 and 76.8, respectively. As shown in Figure 13(a), the best cluster centroid is better than all the CASP8 server models. In fact, the best cluster centroid is also better than all the CASP8 human models (data not shown). The best cluster centroid also has a small RMSD of 2.7Å.
T0476.
The first and best cluster centroids have GDT-TS 34.2 and 35.6, respectively. Our first and best cluster centroids for T0476 are ranked No.4 out of 66 and No. 15 out of 287 CASP8 server models, respectively. The best human model for T0476 has GDT-TS 48.3 and RMSD 7.8Å. Our best cluster centroid also has RMSD 7.8 Å.
T0496_D1.
According to the Grishin group, T0496_D1 is one of only two CASP8 targets representing new folds (Shi et al., 2009).
Our first and best cluster centroids have GDT-TS 30.5 and 49.1, respectively.
As shown in Figure 13(c), the best cluster centroid is significantly better than all the CASP8 server models. In fact, it is also significantly better than all the CASP8 human models; the best CASP8 model has a GDT-TS of only 33.96. The smallest RMSD among the CASP8 models with 100% coverage is 11.34Å, whereas our best cluster centroid has a good RMSD of 6.2Å, considering that this target has more than 100 residues. In summary, our CNF method can predict an almost correct fold for this target.

T0510_D3.
The first and best cluster centroids have GDT-TS 47.7 and 51.7, respectively. The best cluster centroid has RMSD 6.9Å. As shown in Figure 13(d), our first cluster centroid is better than all the CASP8 server models.
This chapter has presented a new fragment-free approach to ab initio protein folding using a recently-invented probabilistic graphical model, Conditional Neural Fields (CNF). Our fragment-free approach overcomes some limitations of the popular fragment assembly method: the new method can sample protein conformations in a continuous space, while fragment-based methods cannot. This CNF method is also better than our previous CRF method in that 1) it can easily model the nonlinear relationship between protein sequence and structure; and 2) it can minimize the energy function to lower values. Experimental results indicate that our CNF method clearly outperforms the CRF method on most of the test proteins. Previously, we compared our CRF method with the popular fragment-based Robetta server in the CASP8 blind prediction and showed that our CRF method is on average better than Robetta on mainly-alpha or small beta proteins (Zhao et al., 2009). This work further confirms our advantage on mainly-alpha and small beta proteins. Since CNF is better than CRF at modeling the nonlinear sequence-structure relationship, we plan to incorporate more information (such as amino acid physical-chemical property profiles) into our model to further improve sampling efficiency.

Table 18. Clustering result of the 12 CASP8 free-modeling targets. Column "GDT" lists the GDT-TS of the first and best cluster centroids. Column "CASP8 Rank" lists the rank of the cluster centroid among the CASP8 server models.
Target    First cluster                          Best cluster
          GDT   CASP8 Rank  Internal Rank(%)     GDT   CASP8 Rank  Internal Rank(%)
T0397_D1  25.7  12/60       50.6                 28.6  28/262      18.8
T0405_D1  39.2  6/63        41.6                 48.4  14/285      6.5
T0405_D2  27.0  10/62       72.3                 34.6  19/280      5.1
T0416_D2  69.3  1/53        5.4                  76.8  1/242       3.5
T0443_D1  46.9  3/64        38.2                 49.2  6/253       19.7
T0443_D2  24.8  26/59       35.3                 27.9  73/252      12.1
T0465     31.3  12/65       12.6                 31.3  34/286      12.6
T0476     34.2  4/66        17.5                 35.6  15/287      10.0
T0482     34.2  34/65       4.3                  34.2  132/279     4.3
T0496_D1  30.5  1/59        30.3                 49.1  1/266       0.4
T0510_D3  47.7  1/54        15.7                 51.7  2/244       3.3
T0513_D2  57.7  5/50        3.8                  57.7  17/225      3.8

Figure 13. Ranking of our CNF predictions for (a) T0416_D2, (b) T0476, (c) T0496_D1 and (d) T0510_D3. The x-axis is percentile ranking and the y-axis GDT-TS. Our first and best cluster centroids are plotted as black and magenta lines, respectively.

Part 2. Statistical Potentials Using Machine Learning Methods

Chapter 6. EPAD: A position-specific pairwise distance potential
Many statistical potentials are derived from the inverse of the Boltzmann law. In the traditional position-independent, distance-dependent statistical potentials (e.g., DOPE and DFIRE), the interaction potential of two atom types a and b is estimated as follows:

U(d | a, b) = −kT ln [ P(d | a, b) / P_ref(d) ]    (6.1)

where k is the Boltzmann constant, T is the temperature, and d represents the inter-atom distance shell [d, d + Δd]. Meanwhile, P(d | a, b) is the observed probability of two atoms of types a and b interacting within the distance shell and P_ref(d) is the reference state (i.e., the expected probability of two non-interacting atoms falling within the distance shell). The reference state is used to rule out the average, generic correlation of two atoms that is not due to atomic interactions. Most statistical potentials parameterize the observed atomic interacting probability by (residue-specific) atom types and use a simple counting method to estimate it. For example,
P(d | a, b) in Eq. (6.1) is often calculated as

P(d | a, b) = count(a, b, d) / Σ_{d'} count(a, b, d')

where count(a, b, d) is the number of observed occurrences of two atoms of types a and b within the distance shell [d, d + Δd]. The distance-dependent statistical potentials developed so far mainly differ from one another in how they estimate the reference state (Shen and Sali, 2006; Wu et al., 2007b; Zhang and Zhang, 2010; Zhou and Zhou, 2002). Some (e.g., DFIRE and DOPE) use analytical methods to estimate the reference state while others use statistical methods (e.g., KBP (Lu and Skolnick, 2001) and RAPDF (Samudrala and Moult, 1998)). Although they use different reference states, these potentials do not have very different energy curves (see Figure 2 in the RW paper (Zhang and Zhang, 2010) and Figure 4 in the DOPE paper (Shen and Sali, 2006)). These traditional position-independent potentials share a couple of common properties: 1) once the atom distance and types are given, the atomic interaction potential is fixed across all proteins and residues; and 2) the atomic interaction potentials approach 0 when the distance is larger than 8Å. This chapter presents a novel protein-specific and position-specific statistical potential, EPAD. We parameterize the observed probability in EPAD by the evolutionary information and radius of gyration of the protein under consideration, in addition to atom types. EPAD distinguishes itself from others in that it may have different energy profiles for two atoms of given types, depending on the protein under consideration and the sequence profile context of the atoms (i.e., evolutionary information).
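As a minimal illustration of Eq. (6.1), the sketch below inverts observed distance-shell counts against a reference distribution to obtain a potential. The counts, the uniform reference state and kT = 1 are toy assumptions, not EPAD's actual statistics.

```python
import math

def distance_potential(counts, ref_probs, kT=1.0, pseudo=1e-6):
    """Boltzmann inversion of Eq. (6.1): U(d) = -kT * ln(P_obs(d) / P_ref(d)).
    `counts[d]` is the number of observed (a, b) atom pairs in distance
    shell d; `ref_probs[d]` is the reference-state probability of shell d.
    A small pseudo-count guards against empty shells."""
    total = sum(counts) + pseudo * len(counts)
    potential = []
    for c, p_ref in zip(counts, ref_probs):
        p_obs = (c + pseudo) / total
        potential.append(-kT * math.log(p_obs / p_ref))
    return potential

# Toy example: 4 distance shells. Observed counts are enriched in shell 1
# relative to a uniform reference state, so that shell gets a favorable
# (negative) potential, while depleted shells get positive values.
u = distance_potential([10, 70, 15, 5], [0.25, 0.25, 0.25, 0.25])
```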
Evolutionary information has been extensively used in protein secondary structure prediction (Jones, 1999; Wang et al., 2011), fold recognition (Maiorov and Crippen, 1992; Panchenko et al., 2000; Peng and Xu, 2009, 2010; Sippl and Weitckus, 1992; Skolnick et al., 2000), protein alignment (Notredame et al., 2000; Pei et al., 2008; Wu and Zhang, 2008b; Xu, 2005; Zhang and Skolnick, 2005b), model quality assessment (Jones and Thornton, 1996; Panchenko et al., 2000; Peng and Xu, 2010; Reva et al., 1997; Sippl, 1993) and even protein conformation sampling (Bystroff et al., 2000; Simons et al., 1997; Zhao et al., 2008; Zhao et al., 2010). However, evolutionary information is rarely used to design a statistical potential suitable for ab initio protein folding. Panjkovich et al. have developed a structure-specific statistical potential using evolutionary information for the assessment of comparative models (Panjkovich et al., 2008). Nevertheless, this potential is not position-specific and is subject to a couple of restrictions: 1) it requires knowledge of at least one native structure in a protein family, so it cannot be applied to ab initio folding of a protein with a novel fold or to the assessment of models built from distantly-related templates; and 2) it requires at least 50 sequence homologs for sufficient statistics. By contrast, our statistical potential is not subject to such restrictions and thus is more widely applicable. We term our statistical potential the Evolutionary PAirwise Distance-dependent potential (EPAD). Experimental results show that our position-specific statistical potential outperforms many currently popular ones in several decoy discrimination tests. These results imply that, in addition to the reference state, the observed atomic interacting probability is also critical to statistical potentials and can be estimated much more accurately using context-specific evolutionary information.
Let $a_i$ and $a_j$ denote two atoms of two residues at positions $i$ and $j$, respectively. Let $M$ denote the sequence profile of the protein under consideration. It is generated by running PSI-BLAST on the NR database with at most 8 iterations and E-value 0.001. $M$ is a position-specific scoring matrix with dimension $20 \times n$ where $n$ is the sequence length. Each column of $M$ is a vector of 20 elements, containing the mutation potential to the 20 amino acids at the corresponding sequence position. The sequence profile context of the residue at sequence position $i$ is a $20 \times 15$ submatrix of $M$, consisting of the 15 columns $i-7, i-6, \ldots, i, i+1, \ldots, i+7$. In case a column does not exist in $M$ (when $i \le 7$ or $i + 7 > n$), the zero vector is used. Let $M_i$ and $M_j$ denote the position-specific sequence profile contexts at positions $i$ and $j$, respectively. Our distance-dependent statistical potential is defined as follows.

$$E(d \mid a_i, a_j, M_i, M_j, R_g) = -kT \ln \frac{P(d \mid a_i, a_j, M_i, M_j, R_g)}{Q(d \mid R_g)} \quad (6.2)$$

where $k$ is the Boltzmann constant and $T$ is the temperature, $Q(d \mid R_g)$ is the reference state, and $P(d \mid a_i, a_j, M_i, M_j, R_g)$ is the observed probability of the two atoms $a_i$ and $a_j$ interacting within a distance shell $[d, d + \Delta d]$ conditioned on atom types, residue sequence profile contexts and $R_g$ (the estimated radius of gyration of the protein under consideration). We use $R_g = 2.2 N^{0.38}$ to estimate the radius of gyration, where $N$ is the protein sequence length. Compared to Eq. (6.1), our statistical potential differs from the traditional position-independent potentials (e.g., DOPE and DFIRE) in a couple of aspects. First, the interaction potential of two atoms is protein-specific since it depends on the evolutionary information and radius of gyration of the protein under consideration. Second, our potential is position-specific since it is parameterized by sequence profile contexts in addition to atom types. We use the same reference state as DOPE (Shen and Sali, 2006), which is a finite sphere of uniform density with appropriate radius. That is, the reference state depends only on the size of a sample protein structure (see the section on Estimating the reference state for more details). 
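The construction of a 20×15 profile context with zero padding can be sketched as follows; the matrix below is randomly generated as a stand-in for a real PSI-BLAST profile, and 0-based indexing is used:

```python
import numpy as np

def profile_context(M, i, half_window=7):
    """Extract the 20 x 15 sequence profile context centered at position i
    from a PSSM M of shape (20, n); out-of-range columns are zero vectors."""
    n = M.shape[1]
    cols = []
    for j in range(i - half_window, i + half_window + 1):
        cols.append(M[:, j] if 0 <= j < n else np.zeros(20))
    return np.stack(cols, axis=1)

M = np.random.rand(20, 50)   # a hypothetical PSSM for a 50-residue protein
C = profile_context(M, 3)    # near the N-terminus: leading columns are zero
```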
We cannot use the simple counting method to calculate $P(d \mid a_i, a_j, M_i, M_j, R_g)$ since there are insufficiently many solved protein structures in the PDB for reliable counting over the sequence profile contexts $M_i$ and $M_j$. Instead, we apply a probabilistic neural network (PNN) (Specht, 1990) to estimate $P(d \mid a_i, a_j, M_i, M_j, R_g)$ when both $a_i$ and $a_j$ are Cα atoms. The PNN effectively learns the sophisticated relationship between inter-atom distance and sequence profiles and yields an accurate distance probability distribution. We then estimate $P(d \mid a_i, a_j, M_i, M_j, R_g)$ for non-Cα atoms conditioned upon the Cα distance distribution. Distribution of Cα distances in a representative set of protein structures. Simple statistics on the PDB25 data set indicate that nearly 90% of residue pairs have Cα distance larger than 15Å and only 1% of them have Cα distance less than 4Å (see Figure 14). Estimating pairwise Cα distance distribution using a probabilistic neural network (PNN). We discretize all the Cα-Cα distances into 13 bins (3-4Å, 4-5Å, 5-6Å, …, 14-15Å, and >15Å). Given a protein and its k-th residue pair of two residues i and j, let $d_k$ denote the bin into which the distance of the k-th residue pair falls, and $x_k$ the position-specific feature vector, which contains the sequence profile contexts $M_i$ and $M_j$ centered at the two residues i and j and the estimated radius of gyration of the protein under consideration. Figure 14 Distribution of pairwise Cα distances in a representative set of protein structures. We always use $R_g = 2.2 N^{0.38}$ to estimate the radius of gyration of a protein, where $N$ is the protein sequence length. That is, $R_g$ is independent of any 3D models including the native structure. We do not use an $R_g$ specific to a decoy because our training set does not contain any decoys. 
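The 13-bin discretization above can be written as a small helper; treating the rare distances below 3 Å as part of the first bin is our assumption:

```python
def distance_bin(d):
    """Map a Ca-Ca distance (in angstroms) to one of the 13 label bins:
    [3,4), [4,5), ..., [14,15), and >=15 (bin index 12).
    Distances below 3 angstroms (extremely rare) are clamped to bin 0."""
    if d < 4.0:
        return 0
    if d >= 15.0:
        return 12
    return int(d) - 3
```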
We do not use $R_g$ calculated from the native structures either, because in realistic settings they are unavailable. Let $P_\theta(d_k \mid x_k)$ be the probability of the distance label $d_k$ conditioned on the feature vector $x_k$, where $\theta$ is the model parameter vector. We estimate $P_\theta(d_k \mid x_k)$ as follows.

$$P_\theta(d_k \mid x_k) = \exp\big(F_\theta(x_k, d_k)\big) \big/ Z_\theta(x_k) \quad (6.3)$$

where $Z_\theta(x_k) = \sum_{d} \exp\big(F_\theta(x_k, d)\big)$ is the partition function and $F_\theta(x, d)$ is a two-layer neural network. Figure 15 shows an example of the neural network with three and five neurons in the first and second hidden layers, respectively. Each neuron represents a sigmoid function $h(x) = 1/(1 + \exp(x))$. Therefore, we have

$$F_\theta(x_k, d_k) = \sum_{g_1=1}^{u_1} \theta^{0}_{d_k, g_1}\, h\!\left( \sum_{g_2=1}^{u_2} \theta^{1}_{g_1, g_2}\, h\big( \langle \theta^{2}_{g_2}, x_k \rangle \big) \right) \quad (6.4)$$

where $u_1$ and $u_2$ are the numbers of gates in the two hidden layers, $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors, $\theta^{2}_{g_2}$ is the weight vector of the $g_2$-th neuron (also known as gate) in the 2nd layer, $\theta^{1}_{g_1, g_2}$ is the weight connecting the $g_2$-th neuron in the 2nd layer to the $g_1$-th neuron in the 1st layer, and $\theta^{0}_{d_k, g_1}$ is the weight connecting the $g_1$-th neuron in the 1st layer to the label $d_k$. In the implementation, our neural network consists of two hidden layers. The first hidden layer (i.e., the layer connecting to the input layer) contains 100 neurons and the second hidden layer (i.e., the layer connecting to the output layer) has 40 neurons. This neural network is similar to what is used by the Zhou group for inter-residue contact prediction (Xue et al., 2009), which uses 100 and 30 neurons in the two hidden layers, respectively. The Zhou group has shown that using two hidden layers obtains slightly better performance than using a single hidden layer. The input layer of our network has about 600 neurons, so in total our neural network has between 60,000 and 70,000 parameters to be trained. Model parameter training.
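Eq. (6.3) and (6.4) amount to a feed-forward pass followed by a softmax over the 13 distance bins. The sketch below uses randomly initialized weights purely for illustration; the layer sizes follow the text (about 600 inputs, 100 and 40 hidden neurons, 13 labels):

```python
import numpy as np

def sigmoid(x):
    # h(x) = 1 / (1 + exp(x)), as defined in the text
    return 1.0 / (1.0 + np.exp(x))

def pnn_forward(x, W2, W1, W0):
    """Forward pass of the two-hidden-layer network with a softmax output.
    W2: (u2, n_in)      input -> first hidden layer
    W1: (u1, u2)        first hidden -> second hidden layer
    W0: (n_labels, u1)  second hidden layer -> distance labels"""
    h2 = sigmoid(W2 @ x)          # first hidden layer outputs
    h1 = sigmoid(W1 @ h2)         # second hidden layer outputs
    F = W0 @ h1                   # one score F(x, d) per distance label
    p = np.exp(F - F.max())       # softmax with max-shift for stability
    return p / p.sum()            # P(d | x), Eq. (6.3)

rng = np.random.default_rng(0)
n_in, u2, u1, n_labels = 600, 100, 40, 13
x = rng.normal(size=n_in)
p = pnn_forward(x,
                rng.normal(size=(u2, n_in)) * 0.01,
                rng.normal(size=(u1, u2)) * 0.1,
                rng.normal(size=(n_labels, u1)) * 0.1)
```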
We use the maximum likelihood method to train the model parameter $\theta$ and to determine the window size and the number of neurons in each hidden layer, by maximizing the probability of the native Cα-Cα distances in a set of training proteins. Given a training protein $t$ with a solved experimental structure, let $D_t$ denote the set of pairwise residue distances and $X_t$ the set of all feature vectors. By assuming any two residue pairs to be independent of one another, we have

$$P_\theta(D_t \mid X_t) = \prod_{k=1}^{m_t} P_\theta(d_{kt} \mid x_{kt}) \quad (6.5)$$

where $m_t$ is the number of residue pairs in protein $t$. Given $T$ training proteins, we need to maximize $\prod_{t=1}^{T} P_\theta(D_t \mid X_t)$, which is equivalent to the following optimization problem.

$$\min_\theta \sum_{t=1}^{T} -\log P_\theta(D_t \mid X_t) + \lambda \|\theta\|_2^2 = \min_\theta \sum_{t=1}^{T} \sum_{k=1}^{m_t} \big( \log Z_\theta(x_{kt}) - F_\theta(x_{kt}, d_{kt}) \big) + \lambda \|\theta\|_2^2 \quad (6.6)$$

Here, $\lambda \|\theta\|_2^2$ is an $\ell_2$-norm regularization term to avoid overfitting and $\lambda$ is a hyper-parameter to be determined. This optimization problem can be solved by LBFGS (Liu and Nocedal, 1989). It is very challenging to solve this non-convex optimization problem due to the huge amount of training data. We generate an initial solution randomly and then run the training algorithm on a supercomputer for about a couple of weeks. Our training algorithm terminates when the probability of either the training set or the validation set no longer improves. Note that all the model parameters are learned from the training set, not the validation set; the validation set, combined with the training set, is only used to determine when our training algorithm shall terminate. Our training algorithm usually terminates after 3000 iterations. We also reran our training algorithm starting from 9 initial solutions and did not observe a noticeable performance difference among these runs. Figure 15 An example probabilistic neural network, in which $M_i$ and $M_j$ are the sequence profile contexts centered at the i-th and j-th residues, respectively. $h_{g_1}$ and $h_{g_2}$ are the neurons in the 1st and 2nd hidden layers. Training and validation data.
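The regularized negative log-likelihood of Eq. (6.6) can be sketched with a simplified linear scorer standing in for the two-layer network; the data below are synthetic, and in the thesis the objective is minimized with LBFGS rather than evaluated once:

```python
import numpy as np

def regularized_nll(theta, feats, labels, lam):
    """Eq. (6.6) for a simplified linear scorer F(x, d) = theta[d] . x:
    sum over residue pairs of (log Z(x) - F(x, d)) plus lam * ||theta||^2."""
    loss = 0.0
    for x, d in zip(feats, labels):
        scores = theta @ x                                        # F(x, d') for all d'
        logZ = np.log(np.exp(scores - scores.max()).sum()) + scores.max()
        loss += logZ - scores[d]
    return loss + lam * np.sum(theta ** 2)

rng = np.random.default_rng(1)
feats = rng.normal(size=(50, 8))        # 50 toy residue pairs with 8 features each
labels = rng.integers(0, 13, size=50)   # 13 distance bins
theta = np.zeros((13, 8))
nll0 = regularized_nll(theta, feats, labels, lam=0.1)
```

With all-zero parameters every label is equally likely, so the loss reduces to $50 \log 13$, a convenient sanity check for an implementation.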
We use the PDB25 set of the PISCES server (Wang and Dunbrack, 2003) from early 2011 as the training and validation data. Any two proteins in PDB25 share no more than 25% sequence identity. This set in total includes more than 6000 proteins. We randomly chose about 5000 proteins from this PDB25 set as the training and validation proteins and made sure that they have no overlap (i.e., no more than 25% sequence identity) with the Rosetta set (Qian et al., 2007) and the Decoy ‘R’ Us set (Samudrala and Levitt, 2000). We randomly split the 5000 proteins into training and validation data, which contain ~73 million training and ~19 million validation residue pairs, respectively. It is challenging to train our neural network model because 1) the number of training residue pairs is huge; and 2) the distance distribution is extremely unbalanced. As shown in Figure 14, 90% of residue pairs have Cα distance larger than 15Å and only 1% of them have Cα distance less than 4Å. It takes a couple of weeks to train a single neural network model using 1296 CPUs on a Cray supercomputer. Estimating inter-atom distance distribution for non-Cα main chain atoms. We discretize the inter-atom distance of non-Cα atoms into 26 equal-width bins, each 0.5Å wide. Due to limited computational resources, instead of training neural network models for each pair of atom types, which would take months or even a year to finish, we use a different approach to estimate the pairwise distance probability distribution for non-Cα main chain atoms. In particular, we calculate the inter-atom distance probability distribution for non-Cα main chain atoms conditioned upon the Cα-Cα distance probability distribution. Let $P_{\alpha\alpha}(d_{\alpha\alpha} \mid M_i, M_j, R_g)$ denote the Cα-Cα distance probability distribution for residues i and j, which can be estimated by our probabilistic neural network. 
Let a and b denote the amino acid types of the residues at positions i and j, respectively. For simplicity, we use the N and O atoms as an example to show how to calculate the observed atomic interacting probability. Let $P(d \mid N, O, M_i, M_j, R_g)$ denote the distance probability distribution for the nitrogen atom in residue i and the oxygen atom in residue j. We calculate $P(d \mid N, O, M_i, M_j, R_g)$ as follows.

$$P(d \mid N, O, M_i, M_j, R_g) = \sum_{d_{\alpha\alpha}} P_{ab}(d \mid d_{\alpha\alpha})\, P_{\alpha\alpha}(d_{\alpha\alpha} \mid M_i, M_j, R_g)$$

where $P_{ab}(d \mid d_{\alpha\alpha})$ is the conditional distance probability distribution for atom N in amino acid a and atom O in amino acid b when the Cα distance of these two amino acids is $d_{\alpha\alpha}$. Since $P_{ab}(d \mid d_{\alpha\alpha})$ is position-independent, it can be estimated by simple counting. Estimating the reference state.
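The marginalization above is a mixture over the Cα distance bins. In the sketch below, the conditional table and the PNN output are both hypothetical placeholders:

```python
import numpy as np

def non_ca_distance_distribution(p_cond, p_ca):
    """Distance distribution for a non-Ca atom pair (e.g. N-O) as a mixture
    over Ca-Ca distance bins.
    p_cond[k_ca, k] = P(d in bin k | Ca distance in bin k_ca), by counting
    p_ca[k_ca]      = P(Ca distance in bin k_ca), from the PNN"""
    return p_ca @ p_cond

rng = np.random.default_rng(2)
p_cond = rng.random((13, 26))                 # 13 Ca bins x 26 non-Ca bins
p_cond /= p_cond.sum(axis=1, keepdims=True)   # each row is a distribution
p_ca = np.full(13, 1.0 / 13)                  # a flat PNN output, for illustration
p_no = non_ca_distance_distribution(p_cond, p_ca)
```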
We calculate the reference state of a distance d as the probability of two random points at distance d within a 3D ball of uniform density. Such a reference state has been discussed in detail before (Deltheil, 1919; Garcia-Pelayo, 2005; Shen and Sali, 2006). Here we briefly introduce it for completeness. Let $R_g$ be the radius of gyration of a protein. The estimated radius of the corresponding 3D sphere, denoted as $R$, is $R = \sqrt{5/3}\, R_g$. The probability density $Q(d \mid R_g)$ for the distance $d$ between two randomly chosen points $\vec{r}_1$ and $\vec{r}_2$ inside such a 3D sphere can be calculated as follows.

$$Q(d \mid R_g) = \frac{\int_V d\vec{r}_1 \int_V d\vec{r}_2\, \delta\big(|\vec{r}_1 - \vec{r}_2| - d\big)}{\int_V d\vec{r}_1 \int_V d\vec{r}_2} \quad (6.8)$$

where the delta function is defined by $\delta(r) = 1$ if $r = 0$ and $0$ otherwise. To calculate $Q(d \mid R_g)$, we fix the point $\vec{r}_1$ and move $\vec{r}_2$ around $\vec{r}_1$ within the sphere, and then consider the spherical area formed by $\vec{r}_2$, denoted as $O(h, d)$, where $h$ is the distance between the center of the 3D ball and the point $\vec{r}_1$. See (Garcia-Pelayo, 2005) for a detailed derivation of $O(h, d)$. We only give the final result as follows.

$$O(h, d) = \begin{cases} 4\pi d^2, & d < R - h \\ \pi d (d + R - h)\left(1 + \dfrac{R - d}{h}\right), & R - h \le d \le R + h \\ 0, & d > R + h \end{cases}$$

Since $\int_V d\vec{r}_1 = \int_V d\vec{r}_2 = \frac{4\pi R^3}{3} = V$, we have

$$Q(d \mid R_g) = \frac{3}{V R^3} \int_0^R dh\, h^2\, O(h, d) = \frac{3 d^2 (d - 2R)^2 (d + 4R)}{16 R^6} \quad (6.9)$$

Window size and the number of neurons in the hidden layers.
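The closed form of Eq. (6.9) is easy to check numerically: the density should integrate to 1 over [0, 2R]. The protein length below is an arbitrary example:

```python
import math

def reference_density(d, Rg):
    """Density of the distance d between two uniformly random points in a
    ball whose radius R matches a protein with radius of gyration Rg
    (R = sqrt(5/3) * Rg); Eq. (6.9)."""
    R = math.sqrt(5.0 / 3.0) * Rg
    if d < 0 or d > 2 * R:
        return 0.0
    return 3 * d**2 * (d - 2 * R)**2 * (d + 4 * R) / (16 * R**6)

Rg = 2.2 * 100 ** 0.38                 # estimated Rg for a 100-residue protein
R = math.sqrt(5.0 / 3.0) * Rg
steps = 100000                         # midpoint-rule integration over [0, 2R]
total = sum(reference_density((k + 0.5) * 2 * R / steps, Rg)
            for k in range(steps)) * (2 * R / steps)
```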
The window size of a sequence profile context and the numbers of neurons in the hidden layers are important hyper-parameters of our probabilistic neural network. A small window size or number of neurons may result in a neural network with suboptimal performance, while a very large window size or number of neurons may cause overfitting. Because it is time-consuming to train even a single PNN model for the estimation of the distance probability distribution, we determine these hyper-parameters by training a neural network for inter-residue contact prediction instead. This takes much less time since there are fewer labels and we also use less training data. This contact-prediction network obtains the best performance when the window size is 15 and the numbers of neurons in the first and second hidden layers are 40 and 100, respectively. The window size used by us is consistent with what is used by the Zhang group (Wu and Zhang, 2008a) and the numbers of hidden neurons are not very different from what is used by the Zhou group (Xue et al., 2009). Following the CASP definition, there is a contact between two residues if their Cβ-Cβ distance is no more than 8Å. A pair of residues is treated as a positive example if they are in contact, and otherwise as a negative example. To compare our prediction results with CASP8 and CASP9 (human and server group) submissions, we use the PDB25 set generated by the PISCES server before CASP8 started to train our neural network. This PDB25 set contains ~3000 proteins and any two of them share no more than 25% sequence identity. 
We randomly chose ¾ of the set as the training data and the remaining ¼ as the validation data, and use PSI-BLAST sequence profiles and PSIPRED-predicted secondary structures as the input features. To avoid bias, we also used the versions of PSIPRED, PSI-BLAST and the NR database released before CASP8 in 2008. We test our neural network models on the 13 CASP8 and 28 CASP9 contact prediction targets, all of which are template-free modeling targets. Our method obtains an L/5 average accuracy of 57.87% on the CASP8 targets and of 36.63% on the CASP9 targets, respectively. As shown in Figure 16, our method exceeds all the CASP8 top groups and the 10 CASP9 server groups (represented by blue bars). Our method works particularly well for the beta and alpha-beta proteins, obtaining an accuracy of 49.30% on the CASP9 beta proteins, significantly better than the accuracy (25.65%) on the alpha or mainly-alpha proteins. After determining the numbers of neurons in the hidden layers, we also examine the Cα-Cα distance prediction accuracy of our PNN with respect to the window size. As shown in Table 19, the neural network model yields almost the best performance when the window size is 15, which is consistent with what is used by the Zhang group for contact prediction (Wu and Zhang, 2008a). The results in this table are obtained from the EPAD validation data. 
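The top-L/5 accuracy used above can be sketched as follows; the pair scores and contact set below are toy values, and residue pairs are assumed to be stored as ordered (i, j) tuples:

```python
def top_l5_accuracy(scores, true_contacts, L):
    """CASP-style top-L/5 accuracy: rank residue pairs by predicted contact
    score and report the fraction of the top L/5 pairs that are true contacts
    (Cb-Cb distance <= 8 angstroms)."""
    top = sorted(scores, key=scores.get, reverse=True)[: max(1, L // 5)]
    return sum(1 for pair in top if pair in true_contacts) / len(top)

# A toy 25-residue protein: the top 25 // 5 = 5 predictions are examined.
scores = {(1, 9): 0.9, (2, 12): 0.8, (3, 20): 0.7, (4, 15): 0.6,
          (5, 18): 0.5, (6, 22): 0.1}
true_contacts = {(1, 9), (3, 20), (4, 15)}
acc = top_l5_accuracy(scores, true_contacts, L=25)   # 3 of the top 5 are correct
```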
Table 19 Cα-Cα distance prediction accuracy (precision/recall, %) with respect to the window size

              window size 9     window size 15    window size 19    window size 23
distance      prec.   recall    prec.   recall    prec.   recall    prec.   recall
<8 Å          26.79    5.97     38.04   20.10     36.84   19.60     36.23   20.17
8-9 Å         10.23    0.01      0       0         0       0         0       0
9-10 Å        16.25    0.01     25.79    0.66     23.93    0.49     22.29    0.65
10-11 Å       15.79    0.17     48.14    9.56     49.84    9.30     50.35    9.36
11-12 Å       24.03    0.01     17.92    0.22     17.58    0.19     18.37    0.20
12-13 Å       58.13    0.87     58.59    5.38     58.76    5.38     54.58    5.63
13-14 Å       11.11    0.01     25.09    0.05     22.75    0.04     22.81    0.04
>14 Å         83.65   99.90     81.38   99.70     81.36   99.70     81.44   99.63

Distance dependence of the statistical potentials.
To examine the difference between our potential EPAD and the popular DOPE, we plot the potentials as a function of inter-atom distance for two atom pairs, as shown in Figure 17. Figure 17(A) shows the DOPE interaction potential for the atom pair ALA Cα and LEU Cα and also the EPAD interaction potential for this pair at three different positions of protein 1gvp. DOPE has the same energy curve for this atom pair regardless of its sequence positions. In particular, DOPE always has a favorable potential when the distance of this pair of atoms is between 5 and 7Å and has an interaction potential close to 0 when the distance is larger than 8.0Å. By contrast, EPAD has a unique energy curve for this atom pair at each position. The figure legend indicates the corresponding native distances (Å) between atom ALA Cα and atom LEU Cα at the three different sequence positions. For example, the bottom curve in Figure 17(A) visualizes the EPAD interaction potential for the ALA Cα and LEU Cα pair with native distance 8.31Å. This curve shows that when the distance between ALA Cα and LEU Cα is close to the native, their EPAD interaction potential is favorable. In fact, EPAD always has a favorable potential for these three ALA Cα and LEU Cα pairs when their distances are close to the natives. Figure 17(B) compares the EPAD and DOPE interaction potentials for another atom pair, Cys N and Trp O, in three different proteins: 1b3a, 1bkr and 1ptq. Similar to Figure 17(A), EPAD has different interaction potentials for the same atom pair in the three different proteins while DOPE has the same potential across all proteins. In particular, EPAD has a favorable potential when the distance between Cys N and Trp O is close to the native. Nevertheless, DOPE has a favorable potential when their distance is between 2 and 4Å and a potential close to 0 when the distance is >8.0Å. 
In summary, our statistical potential EPAD is significantly different from currently popular potentials such as DOPE and DFIRE. DOPE, DFIRE, RAPDF and RW have more or less similar energy profiles for atom pairs of the same type. The difference between EPAD and any of DOPE, DFIRE, RAPDF and RW is much larger than that among DOPE (Shen and Sali, 2006), DFIRE (Zhou and Zhou, 2002), RAPDF (Samudrala and Moult, 1998) and RW (Zhang and Zhang, 2010). Figure 17 Distance dependence of DOPE and our potential EPAD. (A) The solid curve shows the DOPE interaction potential for atom Cα in ALA and atom Cα in LEU. The other 3 curves show the EPAD potentials for the same atom pair at three different positions of protein 1gvp. The legend shows the native distances of this atom pair at these positions. (B) The curves show the DOPE and EPAD potentials for atom N in Cys and atom O in Trp in three different proteins: 1b3a, 1bkr and 1ptq. Performance on decoy discrimination.
We test our backbone-based potential EPAD on several decoy sets, including the Rosetta set (Qian et al., 2007), the CASP9 models, the I-TASSER dataset (Zhang and Zhang, 2010), the CASP5-8 dataset (Rykunov and Fiser, 2010) and the Decoy ‘R’ Us set (Samudrala and Levitt, 2000), as well as an in-house large set of template-based models. We evaluate EPAD and several other potentials, DOPE, DFIRE, OPUS (Lu et al., 2008) and RW (Zhang and Zhang, 2010), using five performance metrics: the number of correctly-identified natives, the ranking of the native structures, the Z-score of the native energy, the model quality of the first-ranked decoys, and the Pearson correlation coefficient (Pearson CC) between the energy and the model quality. The first three metrics evaluate how well a statistical potential can differentiate natives from decoys. The Pearson CC is more important when we want to apply the potentials to folding a protein sequence. We evaluate the model quality of a decoy using the widely-used GDT (Zemla, 2003; Zemla et al., 1999, 2001), which compares a decoy with its native and generates a quality value between 0 and 100. The higher the GDT, the better the quality of the decoy.
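Three of these per-target metrics can be sketched directly; the energies and GDT values below are synthetic, and the convention is that lower energy should correspond to higher GDT (hence a negative correlation is good):

```python
import math

def decoy_metrics(native_energy, decoy_energies, decoy_gdts):
    """For one target: rank of the native among all structures (1 = lowest
    energy), Z-score of the native energy, and the Pearson correlation
    between decoy energy and GDT."""
    energies = [native_energy] + decoy_energies
    rank = 1 + sum(1 for e in decoy_energies if e < native_energy)
    mean = sum(energies) / len(energies)
    std = math.sqrt(sum((e - mean) ** 2 for e in energies) / len(energies))
    z = (native_energy - mean) / std
    n = len(decoy_energies)
    me = sum(decoy_energies) / n
    mg = sum(decoy_gdts) / n
    cov = sum((e - me) * (g - mg) for e, g in zip(decoy_energies, decoy_gdts))
    se = math.sqrt(sum((e - me) ** 2 for e in decoy_energies))
    sg = math.sqrt(sum((g - mg) ** 2 for g in decoy_gdts))
    return rank, z, cov / (se * sg)

# A hypothetical target whose energy tracks decoy quality well.
rank, z, cc = decoy_metrics(-120.0, [-100.0, -90.0, -80.0], [70.0, 50.0, 30.0])
```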
Performance on the 2007 Rosetta dataset.
The set contains decoys generated and refined by the popular fragment assembly ab initio folding program Rosetta (Qian et al., 2007) for 58 proteins. To evaluate our potential in a more realistic setting, for each protein we use only the 100 low-quality decoys in the set, excluding the high-quality decoys. The average GDT of the best decoys is about 60. As shown in Table 20, our EPAD, which currently considers only backbone atoms, correctly identifies 34 native structures with the lowest Z-score (-2.46), while the two full-atom potentials DOPE and DFIRE can identify only 21 natives each. EPAD also exceeds DFIRE and DOPE in terms of the average ranking of the native structures (15.70 vs. 23.71 and 21.59). In terms of the average per-target Pearson correlation coefficient (CC) between the energy and GDT, EPAD (-0.42) is significantly better than DOPE (-0.32) and DFIRE (-0.25). EPAD also exceeds RW by all 5 metrics. EPAD compares favorably to OPUS-PSP, a full-atom statistical potential. OPUS can correctly identify many more native structures than EPAD, but it has a very low Pearson CC to decoy quality, which indicates that OPUS-PSP may not be good for ab initio folding. Since EPAD does not contain side-chain atoms, we simply build a full-atom potential EPAD2 by linearly combining EPAD with the side-chain component of OPUS-PSP (with equal weight). EPAD2 significantly outperforms DOPE, DFIRE, RW and OPUS-PSP by all 5 metrics. EPAD2 greatly outperforms EPAD in correctly recognizing the native structures, which may imply that side-chain information is very helpful for the identification of the native structures. This trend is also observed on other datasets (e.g., I-TASSER and CASP5-8). Table 21 compares the performance of several statistical potentials when only Cα atoms are considered in scoring a decoy. Again, EPAD significantly outperforms DOPE, DFIRE and OPUS. EPAD even performs as well as the full-atom potentials DOPE, DFIRE and RW. 
To exclude the impact of the different datasets used to build EPAD and DOPE, we rebuilt DOPE using the EPAD training data, denoted as MyDope. MyDope performs slightly worse than DOPE, possibly because we do not fine-tune MyDope. However, EPAD performs significantly better than both DOPE and MyDope. This indicates that EPAD outperforms DOPE not because of the training data, but because of the novel methodology. Performance on template-based models.
To examine the performance of EPAD on template-based models, we constructed a large set of 3600 protein pairs, denoted as
InHouse , with the following properties: 1) any two proteins in a pair share less than 25% sequence identity; 2) all protein classes (i.e., alpha, beta and alpha-beta proteins) are covered by this set; 3) the protein lengths are widely distributed; 4) the structure similarity of two proteins in a pair, measured by TM-score, ranges from 0.5 to 0.8 and is almost uniformly distributed; and 5) within each protein pair, one protein is designated as the target protein and the other as the template protein. Any two target proteins share less than 25% sequence identity. We generated four different target-template alignments for each protein pair using our in-house protein structure alignment program DeepAlign, our two threading programs BoostThreader (Peng and Xu, 2009) and CNFpred and HHpred (Soding, 2005). CNFpred is an improved version of BoostThreader, replacing regression trees in the latter with neural networks and making use of more protein features. Then we used MODELLER (N. Eswar, 2006) to build four different 3D models for each target protein based upon the four alignments, respectively. MODELLER also builds models for the unaligned regions. To remove overlap with the training data, we constructed 4 subsets
Set1,
Set2, Set3 and
Set4 of InHouse as follows.
Set1 contains no target proteins sharing >25% sequence identity with the EPAD training proteins.
Set2 is a subset of
Set1 , containing no target proteins sharing >25% sequence identity with the EPAD validation proteins.
Set3 contains no target proteins with a BLAST E-value<0.01 with the EPAD training proteins.
Set4 is a subset of
Set3 , containing no target proteins with a BLAST E-value<0.01 against the EPAD training and validation proteins. In total,
Set1,
Set2, Set3 and
Set4 contain 1139, 331, 965, and 266 protein pairs, respectively. Table 22 lists the performance of several statistical potentials in identifying the 3D models with the best GDT in the five datasets:
InHouse , Set1 , Set2 , Set3 and
Set4 . As shown in Table 22, EPAD is able to recognize many more of the best template-based models than the others, which are no better than random guess. Further, EPAD has similar performance on the 5 sets, which confirms that our probabilistic neural network (PNN) model is not over-trained. For over 95% of protein pairs in
InHouse , the 3D models built from the structure alignments have the best GDT. This implies that, except for EPAD, the other potentials are not able to differentiate structure alignments from threading-generated alignments. Table 20
Performance of EPAD and several popular full-atom statistical potentials on the Rosetta decoy set. Numbers in bold indicate the best performance. The per-target Pearson CC is calculated between the energy and GDT and then the average value is reported in the table.

                     EPAD    DOPE    DFIRE   OPUS    RW      EPAD2
ranking of native    15.7    23.7    21.6     9.8    23.9
first-ranked GDT     51.6    49.7    49.4    49.7    48.5
Pearson CC          -0.42   -0.32   -0.25   -0.20   -0.32   -0.39
Z-score             -2.46   -1.61   -1.67   -3.27   -1.51   -3.28
Table 21 Performance of EPAD and several popular statistical potentials on the Rosetta decoy sets when only Cα atoms are considered. Numbers in bold indicate the best performance. MyDope is a re-compiled DOPE using the EPAD training data.

                     EPAD    DOPE    DFIRE   MyDope  OPUS
natives identified           11      12      10       6
ranking of native            18.7    30.7    21.7    55.3
Pearson CC          -0.40   -0.24   -0.20   -0.21   -0.15
Z-score             -2.45   -1.51   -0.66   -1.23    0.25

Table 22 Performance of EPAD and several popular statistical potentials on the template-based models. Only EPAD is a backbone-based potential while the others are full-atom potentials. In each cell, the numbers outside and inside the parentheses are the number and percentage of correctly-identified models (i.e., models with the lowest energy and the best GDT). Bold numbers indicate the best performance.
         InHouse      Set1        Set2       Set3        Set4
EPAD
DOPE     900 (25%)    288 (25%)   82 (25%)   252 (26%)   74 (28%)
DFIRE    936 (26%)    286 (25%)   86 (26%)   253 (26%)   74 (28%)
OPUS     900 (25%)    289 (25%)   73 (22%)   251 (26%)   69 (26%)
RW       762 (21%)    248 (22%)   68 (21%)   218 (23%)   60 (22%)

Performance on the CASP9 models.
To further examine the performance of EPAD, we compile a test set from the CASP9 models submitted by the top 18 servers. We exclude the CASP9 targets with many domains since some servers do not place the models of all the domains in a single coordinate system. These 18 servers are BAKER-ROSETTASERVER (Raman et al., 2009), chunk-TASSER (Zhou et al., 2009), chuo-fams (Kanou et al., 2009), CLEF-Server (Shao et al., 2011), FAMSD (Kanou et al., 2009), gws (Joo et al., 2009), HHpredA (Hildebrand et al., 2009), Jiang_Assembly (Hu et al., 2011), MULTICOM-CLUSTER (Tegge et al., 2009), MULTICOM-NOVEL (Tegge et al., 2009), Pcomb (Larsson et al., 2008), Phyre2 (Kelley and Sternberg, 2009), pro-sp3-TASSER (Zhou and Skolnick, 2009), QUARK (Xu et al., 2011), RaptorX (Peng and Xu, 2011), Seok-server (Lee et al., 2010), Zhang-Server (Xu et al., 2011), ZHOU-SPARKS-X (Yang et al., 2011). We do not include the models from RaptorX-MSA, RaptorX-Boost, HHpredB, HHpredC, MULTICOM-REFINE, MULTICOM-CONSTRUCT and Jiang_THREADER since they are not very different from some of the 18 servers. In summary, this CASP9 dataset contains the first models submitted by 18 servers for 92 targets. This set is very challenging for any energy potentials because the models submitted by these top servers have similar quality especially for those not-so-hard targets. The first-ranked models by EPAD, DOPE, DFIRE and RW have an average GDT of 58.6, 55.7, 56.0, and 57.4, respectively. The average Pearson correlation coefficient (between GDT and energy values) for EPAD is -0.364, which is significantly better than DOPE (-0.25), DFIRE (-0.23) and RW (-0.28). Note that RW parameters are fine-tuned using the CASP8 and CASP9 models while EPAD, DOPE and DFIRE are independent of any decoy sets. In addition, EPAD is only a backbone-based potential while the other three are full-atom potentials. 
Table 23 shows the performance of EPAD, DOPE, DFIRE and RW with respect to the hardness of the targets, which is judged by the average GDT of all the models of each target. We divide the targets into four groups according to the average GDT: <30, 30-50, 50-70 and >70. EPAD performs very well across all difficulty levels and has a particularly good correlation coefficient for the targets with average GDT less than 30. Even for easy targets, EPAD outperforms the others, although it is believed that sequence profiles are not very effective in dealing with easy targets. The only exception is that EPAD has a worse average GDT of the first-ranked models than RW for the targets with average GDT between 30 and 50. This is because RW performs exceptionally well on a single target, T0576: the best model identified by RW has GDT 53.3 while EPAD, DOPE and DFIRE can only identify a model with GDT 17.0.
Performance on the Decoy ‘R’ Us dataset.
The set is taken from http://dd.compbio.washington.edu/, containing decoys for some very small proteins. The results of DOPE and DFIRE in Table 24 are taken from (Shen and Sali, 2006), those of OPUS_PSP are from (Lu et al., 2008), and those of RW are calculated by the program in (Zhang and Zhang, 2010). As shown in the table, all the energy functions are able to correctly differentiate the native structures from decoys for the proteins in Lattice_ssfit. EPAD identifies more native structures for the proteins in Lmds, while DOPE and OPUS_PSP do so for 4state_reduced and Fisa_casp3. In terms of the average rank of the native structures EPAD significantly exceeds the others, but EPAD correctly identifies slightly fewer native structures than DOPE and OPUS_PSP, in part because EPAD does not contain side-chain information.
Performance on the I-TASSER dataset.
This set contains decoys for 56 proteins generated by I-TASSER (http://zhanglab.ccmb.med.umich.edu/). The average TMscore of the decoys ranges from 0.346 to 0.678. As shown in Table 25, EPAD outperforms DFIRE and DOPE no matter which measure is used. EPAD is slightly better than RW in terms of the first-ranked TMscore and the correlation, but slightly worse than RW in terms of the Z-score of the natives. EPAD2 (i.e., the combination of the OPUS-PSP side-chain potential and EPAD) obtains a much better Z-score of the natives, although the correlation is slightly decreased. This is consistent with what is observed on the Rosetta set.
Table 23 Performance of statistical potentials with respect to the hardness of the CASP9 targets. To save space, DOPE and DFIRE are denoted as "DP" and "DF", respectively, and we also omit the negative sign of the correlation coefficient. The first column indicates the hardness of the targets, judged by the average GDT of all the models of the target; the remaining columns give, for each potential (EPAD, DP, DF, RW), the GDT of the first-ranked models and the correlation coefficient.
Table 24 Performance of several statistical potentials on the Decoy ‘R’ Us dataset. In each entry of columns 2-6, the numbers in and out parenthesis are the numbers of correctly-identified native structures and the average rankings of the natives, respectively. Numbers in bold indicate the best performance.
Decoy Set            EPAD    DFIRE    DOPE    OPUS_PSP    RW
[per-set entries omitted]
Table 25 Performance of several statistical potentials on the I-TASSER dataset. CC-TM is the correlation coefficient between energy and the quality, measured by TMscore, of the decoys. Numbers in bold indicate the best performance.
                      EPAD     DFIRE    DOPE     RW       EPAD2
CC-TM                 -0.532   -0.492   -0.317   -0.500   -0.512
Z-score of natives    -3.61    -3.58    -2.18    -4.42    -5.94
Performance on the CASP5-8 dataset.
Table 26 shows the performance of a variety of potentials on the CASP5-8 models selected by Rykunov and Fiser (Rykunov and Fiser, 2010). As shown in this table, EPAD2, the combination of our potential EPAD and the side-chain potential in OPUS, outperforms all the others in terms of the average ranking of the best models in the absence of the native structures as well as the average ranking of the native structures, and it also performs very well in terms of the number of correctly-identified best models (or native structures). EPAD is second only to QMEAN6, RW and RWplus in ranking the best models in the absence of the native structures, but it does not perform as well when the native structures are included. All these results suggest that the side-chain component helps a lot in recognizing the native structures, and that side-chain information is needed for their accurate identification.
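The ranking measures used in these comparisons, the rank assigned to the truly best model and the Z-score of the native structure, can be sketched on toy numbers (all data below are hypothetical):

```python
import statistics

def rank_of_best(energies, gdts):
    """Rank (1-based) that a potential assigns to the truly best model:
    sort models by energy ascending and locate the highest-GDT model."""
    order = sorted(range(len(energies)), key=lambda i: energies[i])
    best = max(range(len(gdts)), key=lambda i: gdts[i])
    return order.index(best) + 1

def native_z_score(native_energy, decoy_energies):
    """Z-score of the native energy against the decoy energy distribution;
    more negative means the native is better separated from the decoys."""
    mu = statistics.fmean(decoy_energies)
    sigma = statistics.pstdev(decoy_energies)
    return (native_energy - mu) / sigma

# Toy decoy set (hypothetical numbers): the lowest-energy decoy is also the
# best one, so this potential ranks the best model first.
energies = [-150.0, -120.0, -90.0, -130.0]
gdts = [62.0, 40.0, 28.0, 55.0]
r = rank_of_best(energies, gdts)
z = native_z_score(-190.0, energies)
print(r, round(z, 2))
```

Averaging `rank_of_best` over all decoy sets gives the "average ranking" columns of Table 26, and counting the sets with rank 1 gives the "number of correctly-identified" columns.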
Is our probabilistic neural network (PNN) model over-trained?
Our PNN model has 60,000-70,000 parameters to be trained. A natural question is whether the model is biased towards specific patterns in the training data, and whether it generalizes well to proteins of novel folds or sequences. According to our experimental results on contact prediction (see the "Window size and the number of neurons in the hidden layers" section in Methods and Experimental procedures), our PNN model is not over-trained. In that experiment, we used a training set built before CASP8 started, which is unlikely to share similar folds and sequence profiles with our test set (i.e., the CASP8 and CASP9 free modeling targets). Experimental results indicate that our PNN method compares favorably to the best CASP8 and CASP9 server predictors, which implies that our PNN model is not biased towards the training data.
Table 26 Performance of a variety of potentials on the selected CASP5-8 models. Note: The results of EPAD and EPAD2 are shown in bold; the results of all the other potentials are taken from (Zhang and Zhang, 2010). a. The average rank of the lowest-energy decoy according to GDT score (over 143 decoy sets) in the absence of native structures. b. The number of sets in which the best model was ranked first, in the absence of native structures. c. The average rank of the lowest-energy decoy in GDT when native structures are present. d. The number of sets in which the best model was ranked first when native structures are present. e. Expected random values were generated by picking a winning model from the decoy sets randomly; average values over 1000 random trials are shown in the last row.
Scoring function       models only               native structures included
                       Average(a)  Ranking(b)    Average(c)  Ranking(d)
EPAD2                  2.79        69            1.24        127
QMEAN6                 2.87        85            1.71        113
RWplus                 2.97        57            1.78        106
RW                     3.08        51            1.71        110
EPAD                   3.29        62            2.71        74
QMEANall_atom          3.59        74            1.71        119
QMEANSSE_agree         3.74        62            3.72        39
QMEANACC_agree         4.04        40            3.78        48
RF_CB_SRS_OD           4.16        61            2.08        110
RF_CB_OD               4.62        62            2.00        111
RF_HA_SRS              4.65        49            1.38        137
RF_CB_SRS              4.72        56            2.18        114
OPUS_CA                4.72        79            5.13        55
VSCOREcombined         4.79        53            2.2         117
QMEAN-pairwise         4.8         54            3.15        85
Rosetta                5.01        57            4.09        68
Dong-pair              5.01        58            6.32        4
RF_CB                  5.06        52            2.46        106
VSCORE-pair            5.08        54            1.85        128
PROSAcombined          5.11        57            3.38        87
OPUS_PSP               5.39        54            2.99        118
RF_HA                  5.44        62            2.78        112
DOPE                   5.77        54            3.27        95
DFIRE                  6.03        50            5.69        33
PROSA-pair             6.03        56            3.54        95
QMEAN-torsion          6.71        45            3.24        114
Shortle2006            6.85        35            1.79        129
Liang_geometric        6.88        44            2.48        114
QMEANsolvation         7.32        33            6.27        54
Shortle2005            7.73        42            3.39        109
Floudas-CM             7.75        38            7.05        42
Floudas-Ca             7.79        33            8.36        10
NAMD_1000              8.06        24            4.96        78
Melo-ANOLEA            9.62        19            5.19        86
PC2CA                  9.75        19            5.06        85
Melo-NL                9.99        14            5.85        80
NAMD_1                 11.91       5             10.98       24
Random(e)
Note that our PNN model for the statistical potential uses exactly the same architecture (2 hidden layers with 100 and 40 hidden neurons, respectively) as our PNN model for contact prediction. Considering that much more training data (~73 million residue pairs) is used for the derivation of our statistical potential than for contact prediction, it is even less likely that our PNN model for the statistical potential is biased towards specific patterns in the training set. The result in Table 22 further confirms this. We use a 25% sequence identity or an E-value of 0.01 as the cutoff to exclude proteins in InHouse with sequences similar to the training set, generating two subsets, Set2 and Set4. Even if Set2 (Set4) still contains some sequence profiles similar to the training set, the similarity between the whole InHouse set and the training set is much larger than that between Set2 (Set4) and the training set, yet the performance on the whole InHouse set is even slightly worse than that on Set2 (Set4).
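For concreteness, the architecture described above (two hidden layers with 100 and 40 units, outputting a probability distribution over discretized inter-atom distances) can be sketched as follows; the input dimension, bin count, tanh activations and random weights are illustrative assumptions, not the thesis's exact feature encoding:

```python
import numpy as np

rng = np.random.default_rng(0)

def pnn_forward(x, params):
    """Minimal sketch of the PNN: feature vector -> 100 hidden units ->
    40 hidden units -> softmax over distance bins, so the output is a
    normalized distance probability distribution for one residue pair."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = np.tanh(x @ W1 + b1)            # hidden layer 1: 100 units
    h2 = np.tanh(h1 @ W2 + b2)           # hidden layer 2: 40 units
    logits = h2 @ W3 + b3                # one logit per distance bin
    p = np.exp(logits - logits.max())    # numerically stable softmax
    return p / p.sum()

n_in, n_bins = 43, 30   # e.g. two 20-dim profile columns plus a few extras; 30 bins
params = [(rng.normal(0, 0.1, (n_in, 100)), np.zeros(100)),
          (rng.normal(0, 0.1, (100, 40)), np.zeros(40)),
          (rng.normal(0, 0.1, (40, n_bins)), np.zeros(n_bins))]
x = rng.normal(size=n_in)               # features for one residue pair
p = pnn_forward(x, params)
print(p.shape, round(float(p.sum()), 6))
```

Training would fit the weight matrices so that the softmax output matches the observed distance distribution; only the forward pass is shown here.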
Folding Results with EPAD in the blind CASP 10 experiment
One of the most important functions of a good potential is its support of the folding process. We tested EPAD using CNF-Folder (see Chapter 5) by replacing the DOPE term in the energy function (see Chapter 4 for more details) with EPAD. The energy guiding the folding simulation is a linear combination of four potential terms, E_total = w1*E_1 + w2*E_2 + w3*E_3 + w4*E_4, where the weight parameters w_i are learned by grid search through a series of folding experiments over a set of proteins containing all the targets from Table 18. Based on observations from these experiments, our CNF-Folder applies DOPE with a larger weight on low-Neff targets and on targets with short sequences (length ≤ 65), while keeping the DOPE weight at zero for other targets. We have tested this new CNF-Folder with the new energy function on the CASP9 free modeling targets. Table 27 compares the performance of CNF-Folder and Rosetta on these targets. We downloaded the Rosetta server predictions from the CASP9 official web site (http://predictioncenter.org/download_area/CASP9/server_predictions/) and calculated the GDT scores using TM-score (Zhang and Skolnick, 2005b). In the previous CASP experiments, the GDT scores of our best clusters have almost always been within the top 7% of decoys. Hence, for simplicity, we do not generate clusters as we normally do in the CASP blind experiments, but instead use the average GDT score of the top 10% of decoys for each target in the comparison. As can be seen from Table 27, with the EPAD term, the new CNF-Folder compares favorably to the most up-to-date Rosetta server. Backed by this encouraging result, we participated in the CASP 10 blind experiment using the new CNF-Folder-EPAD program. The server we registered for the CASP 10 FM targets is named RaptorX-roll. There are 16 template-free modeling targets in the CASP 10 experiment.
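The weight-fitting step can be sketched as follows; the term values, the grid, and the keep-the-weights-whose-first-ranked-decoy-is-best criterion are illustrative assumptions rather than the exact protocol:

```python
from itertools import product

def grid_search_weights(terms_per_decoy, gdt_per_decoy, grid):
    """Sketch of fitting the weights of a linear energy combination
    E = sum_i w_i * E_i by grid search: for every weight vector on the
    grid, rank the decoys by combined energy and keep the weights whose
    lowest-energy decoy has the highest GDT."""
    n_terms = len(terms_per_decoy[0])
    best_w, best_gdt = None, -1.0
    for w in product(grid, repeat=n_terms):
        energies = [sum(wi * ti for wi, ti in zip(w, terms))
                    for terms in terms_per_decoy]
        picked = gdt_per_decoy[min(range(len(energies)), key=energies.__getitem__)]
        if picked > best_gdt:
            best_w, best_gdt = w, picked
    return best_w, best_gdt

# Toy data: 3 decoys, 2 energy terms each (hypothetical numbers).
terms = [[-10.0, 5.0], [-4.0, -6.0], [-8.0, -2.0]]
gdts = [30.0, 55.0, 45.0]
w, g = grid_search_weights(terms, gdts, [0.0, 0.5, 1.0])
print(w, g)
```

In practice the search would aggregate the criterion over many training targets rather than a single decoy set, but the structure of the loop is the same.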
We used a standalone classification program to pick out the FM targets, of which 10 overlap with the CASP 10 official classifications. Table 28 compares the performance of our FM method RaptorX-roll on the FM targets with RaptorX-ZY, one of the top threading servers in CASP 10. RaptorX-roll yields better decoys than RaptorX-ZY on most of the FM targets. Figure 18 visualizes RaptorX-roll's predictions on targets T0734 and T0740. The native structures are plotted in blue and decoy structures in pink. The percentage-of-residues comparison figures are listed beside each target structure, with RaptorX-roll's distributions marked in pink; the further a decoy's distribution curve stands to the right, the higher the quality of the decoy structure. T0734 has two domains and T0740 has one domain. RaptorX-roll gave much better predictions on these targets than any of the other predictors. One thing to notice is that we gave better predictions on most of the alpha targets but were mediocre on the beta proteins, which emphasizes the need for more focus on the hydrogen bond potential in future work.
Table 27 Folding comparison with Rosetta (GDT) on CASP9 FM targets. Note: The Rosetta data are obtained from the CASP9 official web site. Rosetta's Best denotes the GDT score of the best decoy submitted by Rosetta. We list the average GDT of the top 10% of decoys generated by CNF-Folder as well as the best GDT. For convenience, the low-Neff targets are listed below the other targets.
Target    Type    Neff    Len    Rosetta Best    CNF-Folder Top 10%    CNF-Folder Best
[per-target rows omitted]
Table 28 Comparison with threading method RaptorX-ZY on CASP10 FM targets. Note: RaptorX-ZY is one of the top threading servers in CASP 10. The scores listed are the GDT scores of the best decoys submitted by each server. The CaspBest column lists the GDT scores of the best decoys out of all the CASP10 servers.
Target    Type    Neff    Length    RaptorX-roll    RaptorX-ZY    CaspBest
[per-target rows omitted]
Figure 18. RaptorX-roll's results on targets T0734 (A) and T0740 (B). The native structures are plotted in blue and decoy structures in pink. In the percentage-of-residues comparison figures, RaptorX-roll's distributions are marked in pink; the further a decoy's distribution curve stands to the right, the higher the quality of the decoy structure.
This chapter has presented a novel protein-specific and position-specific knowledge-based statistical potential, EPAD, for protein structure and function study. EPAD is unique in that it may assign different energy profiles to two atoms of given types, depending on the protein under consideration and the sequence profile contexts of the residues containing them, whereas other potentials use the same energy profile for a given atom pair across all proteins. We achieved this by parameterizing EPAD using evolutionary information and the radius of gyration of the protein under consideration, in addition to atom types, which enables a much more accurate statistical potential. We also made a novel technical contribution to estimating the observed atomic interaction probability by introducing a probabilistic neural network that calculates the inter-atom distance probability distribution from sequence profiles and the radius of gyration. This is very different from the simple counting method widely used to derive position-independent statistical potentials such as DOPE and DFIRE. The simple counting method does not work for our potential simply because there are not enough solved protein structures in the PDB for reliable counting of sequence profile contexts. Experimental results indicate that EPAD significantly outperforms several popular higher-resolution full-atom potentials in several decoy discrimination tests, even though only backbone atoms are considered in EPAD.
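A minimal sketch of how a predicted, position-specific distance distribution can be converted into an energy term of this kind uses the standard inverse-Boltzmann form; the distributions and reference state below are hypothetical:

```python
import math

def pair_energy(d, p_pred, p_ref, bin_width=1.0):
    """E(d) = -ln( P_pred(bin) / P_ref(bin) ), with kT taken as 1.
    p_pred stands in for the position-specific distribution estimated by
    the probabilistic neural network, p_ref for a reference state; both
    are assumed normalized over the same fixed-width distance bins."""
    b = min(int(d / bin_width), len(p_pred) - 1)  # discretize the distance
    return -math.log(p_pred[b] / p_ref[b])

# Toy distributions over four 1A bins (hypothetical numbers).
p_pred = [0.10, 0.50, 0.30, 0.10]   # this residue pair prefers bin 1 (1-2A)
p_ref  = [0.25, 0.25, 0.25, 0.25]   # uniform reference state
e_fav = pair_energy(1.4, p_pred, p_ref)    # preferred distance: negative energy
e_unfav = pair_energy(0.2, p_pred, p_ref)  # disfavored distance: positive energy
print(round(e_fav, 3), round(e_unfav, 3))
```

Distances that the predicted distribution favors relative to the reference get negative energy, which is what lets the potential rank near-native decoys below misfolded ones.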
If we combine EPAD with the side-chain component of OPUS-PSP, we can achieve much better decoy discrimination performance, especially in the presence of native structures. As opposed to the RW potential and many others, EPAD is not trained on any decoys, so in principle it is not restricted to any decoy generation method. Currently EPAD uses only a 1Å resolution for the Cα-Cα distance discretization. We will further improve EPAD by using a 0.5Å resolution, although training a neural network model for accurate estimation of the resulting, extremely unbalanced distance probability distribution will take a very long time. Chapter 7. Conclusion and Future Works
This thesis aims to solve the template-free protein folding problem by tackling two important components: efficient sampling in the vast conformation space, and the design of highly accurate knowledge-based potentials. The CRF-Samplers are proposed to sample structures from the continuous space of local dihedral angles by modeling the lower- and higher-order conditional dependencies between neighboring dihedral angles given the primary sequence information. A framework combining Conditional Random Fields with an energy function is introduced to guide local conformation sampling using long-range constraints from the energy function. To model the complex nonlinear relationship between the sequence profile and the local dihedral angle distribution, we further built CNF-Folder by applying a novel machine learning model, Conditional Neural Fields, which combines a structural graphical model with a neural network. The CRF-Samplers and CNF-Folder performed very well in CASP8 and CASP9. On the other track, we have designed a novel pairwise distance statistical potential (EPAD) to capture the dependency of the energy profile on the positions of the interacting amino acids as well as their types, as opposed to the common assumption that this energy profile depends only on the amino acid types. EPAD has also been successfully applied with CNF-Folder in the CASP 10 Free Modeling experiment, performing especially well on some targets with uncommon structures.
Conformation Sampling
Although the primary purpose of free modeling folders such as the CRF-Samplers and CNF-Folder is template-free protein folding, they can be applied to other important problems. One direct example is the refinement of template-based models; some preliminary results in CASP blind experiments can be found in (Xu et al., 2009) and (Ma et al., 2013). The methods can also be applied to protein loop modeling, model refinement, and even RNA tertiary structure prediction (Wang and Xu, 2011). The free modeling folders also have some immediate limitations to address. By extracting distance constraints from template-based models, the conformational space of a target is dramatically reduced, so that the reduced space can be searched more thoroughly, which may lead to better prediction accuracy. The folders can also be extended to model the long-range hydrogen bonding effect, which should significantly enhance performance on beta proteins.
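The template-constraint idea can be sketched as a simple filter on sampled conformations; the constraint format and tolerance below are illustrative assumptions:

```python
def satisfies_constraints(coords, constraints, tol=2.0):
    """Sketch of how template-derived distance constraints prune the
    conformational space: keep a sampled conformation only if every
    constrained Ca-Ca distance lies within `tol` angstroms of the value
    extracted from the template-based model."""
    for i, j, d_template in constraints:
        xi, yi, zi = coords[i]
        xj, yj, zj = coords[j]
        d = ((xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2) ** 0.5
        if abs(d - d_template) > tol:
            return False
    return True

# Toy 3-residue Ca trace with one template constraint between residues 0 and 2.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.0, 2.0, 0.0)]
ok = satisfies_constraints(coords, [(0, 2, 7.5)])     # d ~ 7.28, within 2A
bad = satisfies_constraints(coords, [(0, 2, 15.0)])   # far from template value
print(ok, bad)
```

In a sampler, conformations failing the filter would be rejected (or penalized) before any expensive energy evaluation, which is what shrinks the search space.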
Statistical Potentials
We may extend our statistical potential as follows. First, EPAD currently considers only backbone atoms and is orientation-independent; we can extend it to side-chain atoms and make it orientation-dependent. Second, in estimating the distance probability distribution of two positions, we use only the sequence profile contexts of those two positions. We shall also use information from the sequence segment connecting the two residues, which carries important information for determining their relative orientation. Third, we may estimate the distance probability distribution more accurately by adding physical constraints. For example, given any three atoms in a protein, their pairwise distances must satisfy the triangle inequality. Further, for any three residues close to one another along the primary sequence, their Cα distances are also subject to the restrictions of local atomic interaction. If we assume a contact between two residues when their Cα or Cβ atoms are within 8Å, then the number of contacts for any given residue is limited to a constant (~13) by geometric restraints. By enforcing these constraints, we shall be able to estimate the inter-atom distance probability distribution much more accurately and thus design a much better statistical potential. Two other important potentials to probe are a hydrogen bonding potential and an environmental potential. On the latter, we have some preliminary results: a contact number potential has been designed, and experimental results support the hypothesis that the wrapping of each residue by surrounding amino acids is guided by maximizing the local static electric field. We will keep working on this in the near future. References
Aarts, E., and Korst, J. (1991). Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing (Wiley).
Anfinsen, C.B. (1973). Principles that govern the folding of protein chains. Science, 223-230.
Bauer, A., and Beyer, A. (1994). An Improved Pair Potential to Recognize Native Protein Folds. Proteins, 254-261.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I.N., and Bourne, P.E. (2000). The protein data bank. Nucleic Acids Res, 235-242.
Boomsma, W., Mardia, K.V., Taylor, C.C., Ferkinghoff-Borg, J., Krogh, A., and Hamelryck, T. (2008). A generative, probabilistic model of local protein structure. Proceedings of the National Academy of Sciences, 8932-8937.
Bowie, J.U., and Eisenberg, D. (1994). An Evolutionary Approach to Folding Small Alpha-Helical Proteins That Uses Sequence Information and an Empirical Guiding Fitness Function. P Natl Acad Sci USA, 4436-4440.
Bradley, P., Malmstrom, L., Qian, B., Schonbrun, J., Chivian, D., Kim, D.E., Meiler, K., Misura, K.M.S., and Baker, D. (2005). Free modeling with Rosetta in CASP6. Proteins-Structure Function and Bioinformatics, 128-134.
Brooks, B.R., Brooks, C.L., Mackerell, A.D., Nilsson, L., Petrella, R.J., Roux, B., Won, Y., Archontis, G., Bartels, C., Boresch, S., et al. (2009). CHARMM: The Biomolecular Simulation Program. J Comput Chem, 1545-1614.
Bryngelson, J.D., Onuchic, J.N., Socci, N.D., and Wolynes, P.G. (1995). Funnels, Pathways, and the Energy Landscape of Protein-Folding - a Synthesis. Proteins, 167-195.
Bystroff, C., Thorsson, V., and Baker, D. (2000). HMMSTR: a hidden Markov model for local sequence-structure correlations in proteins. J Mol Biol, 173-190.
Casari, G., and Sippl, M.J. (1992). Structure-Derived Hydrophobic Potential - Hydrophobic Potential Derived from X-Ray Structures of Globular-Proteins Is Able to Identify Native Folds. J Mol Biol, 725-732.
Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz, K.M., Onufriev, A., Simmerling, C., Wang, B., and Woods, R.J. (2005). The Amber biomolecular simulation programs. J Comput Chem, 1668-1688.
Claessens, M., Vancutsem, E., Lasters, I., and Wodak, S. (1989). Modeling the Polypeptide Backbone with Spare Parts from Known Protein Structures. Protein Eng, 335-345.
Colubri, A., Jha, A.K., Shen, M.Y., Sali, A., Berry, R.S., Sosnick, T.R., and Freed, K.F. (2006). Minimalist representations and the importance of nearest neighbor effects in protein folding simulations. J Mol Biol, 535-557.
Deltheli, R. (1919). Sur la théorie des probabilités géométriques. Ann Fac Sci Univ Toulouse, 1-65.
Dietterich, T., Ashenfelter, A., and Bulatov, Y. (2004). Training conditional random fields via gradient tree boosting. In Proceedings of the 21st International Conference on Machine Learning (ICML).
Dill, K.A. (1985). Theory for the Folding and Stability of Globular-Proteins. Biochemistry-Us, 1501-1509.
Dill, K.A. (1997). Additivity principles in biochemistry. J Biol Chem, 701-704.
Dobson, C.M., Sali, A., and Karplus, M. (1998). Protein folding: A perspective from theory and experiment. Angew Chem Int Edit, 868-893.
Dor, O., and Zhou, Y.Q. (2007). Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins-Structure Function and Bioinformatics, 838-845.
Earl, D.J., and Deem, M.W. (2005). Parallel tempering: Theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 3910-3916.
Eddy, S.R. (1998). Profile hidden Markov models. Bioinformatics, 755-763.
Feldman, H.J., and Hogue, C.W. (2002). Probabilistic sampling of protein conformations: new hope for brute force? Proteins-Structure Function and Bioinformatics, 8-23.
Fernandez, A., Sosnick, T.R., and Colubri, A. (2002). Dynamics of hydrogen bond desolvation in protein folding. J Mol Biol, 659-675.
Finn, R.D., Tate, J., Mistry, J., Coggill, P.C., Sammut, S.J., Hotz, H.R., Ceric, G., Forslund, K., Eddy, S.R., Sonnhammer, E.L.L., et al. (2008). The Pfam protein families database. Nucleic Acids Res, D281-D288.
Fitzgerald, J.E., Jha, A.K., Colubri, A., Sosnick, T.R., and Freed, K.F. (2007). Reduced C-beta statistical potentials can outperform all-atom potentials in decoy identification. Protein Science, 2123-2139.
Garcia-Pelayo, R. (2005). Distribution of distance in the spheroid. J Phys a-Math Gen, 3475-3482.
Gatchell, D.W., Dennis, S., and Vajda, S. (2000). Discrimination of near-native protein structures from misfolded models by empirical free energy functions. Proteins, 518-534.
Gilis, D., and Rooman, M. (1996). Stability changes upon mutation of solvent-accessible residues in proteins evaluated by database-derived potentials. J Mol Biol, 1112-1126.
Gilis, D., and Rooman, M. (1997). Predicting protein stability changes upon mutation using database-derived potentials: Solvent accessibility determines the importance of local versus non-local interactions along the sequence. J Mol Biol, 276-290.
Gong, H.P., Fleming, P.J., and Rose, G.D. (2005). Building native protein conformation from highly approximate backbone torsion angles. P Natl Acad Sci USA, 16227-16232.
Gront, D., Kmiecik, S., and Kolinski, A. (2007). Backbone building from quadrilaterals: A fast and accurate algorithm for protein backbone reconstruction from alpha carbon coordinates. J Comput Chem, 1593-1597.
Hamelryck, T., Kent, J.T., and Krogh, A. (2006). Sampling Realistic Protein Conformations Using Local Structural Bias. PLoS Comput Biology.
Hendlich, M., Lackner, P., Weitckus, S., Floeckner, H., Froschauer, R., Gottsbacher, K., Casari, G., and Sippl, M.J. (1990). Identification of Native Protein Folds Amongst a Large Number of Incorrect Models - the Calculation of Low-Energy Conformations from Potentials of Mean Force. J Mol Biol, 167-180.
Hildebrand, A., Remmert, M., Biegert, A., and Soding, J. (2009). Fast and accurate automatic structure prediction with HHpred. Proteins-Structure Function and Bioinformatics 77 Suppl 9, 128-132.
Holm, L., and Sander, C. (1991). Database Algorithm for Generating Protein Backbone and Side-Chain Coordinates from a C-Alpha Trace Application to Model-Building and Detection of Coordinate Errors. J Mol Biol, 183-194.
Hu, Y., Dong, X., Wu, A., Cao, Y., Tian, L., and Jiang, T. (2011). Incorporation of Local Structural Preference Potential Improves Fold Recognition. Plos One.
Hua, S.J., and Sun, Z.R. (2001). A novel method of protein secondary structure prediction with high segment overlap measure: Support vector machine approach. J Mol Biol, 397-407.
Jha, A.K., Colubri, A., Zaman, M.H., Koide, S., Sosnick, T.R., and Freed, K.F. (2005). Helix, sheet, and polyproline II frequencies and strong nearest neighbor effects in a restricted coil library. Biochemistry-Us, 9691-9702.
Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol, 195-202.
Jones, D.T., and Thornton, J.M. (1996). Potential energy functions for threading. Curr Opin Struc Biol, 210-216.
Jones, T.A., and Thirup, S. (1986). Using Known Substructures in Protein Model-Building and Crystallography. Embo J, 819-822.
Joo, K., Lee, J., Seo, J.H., Lee, K., Kim, B.G., and Lee, J. (2009). All-atom chain-building by optimizing MODELLER energy function using conformational space annealing. Proteins-Structure Function and Bioinformatics, 1010-1023.
Kachalo, S., Lu, H.-M., and Liang, J. (2006). Protein folding dynamics via quantification of kinematic energy landscape. Physical Review Letters, 058106.
Kanou, K., Iwadate, M., Hirata, T., Terashi, G., Umeyama, H., and Takeda-Shitaka, M. (2009). FAMSD: A powerful protein modeling platform that combines alignment methods, homology modeling, 3D structure quality estimation and molecular dynamics. Chem Pharm Bull (Tokyo), 1335-1342.
Karypis, G. (2006). YASSPP: Better kernels and coding schemes lead to improvements in protein secondary structure prediction. Proteins-Structure Function and Bioinformatics, 575-586.
Kelley, L.A., and Sternberg, M.J.E. (2009). Protein structure prediction on the Web: a case study using the Phyre server. Nat Protoc, 363-371.
Kent, J.T. (1982). The Fisher-Bingham Distribution on the Sphere. J Roy Stat Soc B Met, 71-80.
Kihara, D., Lu, H., Kolinski, A., and Skolnick, J. (2001). TOUCHSTONE: An ab initio protein structure prediction method that uses threading-based tertiary restraints. P Natl Acad Sci USA, 10125-10130.
Kim, H., and Park, H. (2003). Protein secondary structure prediction based on an improved support vector machines approach. Protein Eng, 553-560.
Kolodny, R., Koehl, P., Guibas, L., and Levitt, M. (2002). Small libraries of protein fragments model native protein structures accurately. J Mol Biol, 297-307.
Kortemme, T., and Baker, D. (2002). A simple physical model for binding energy hot spots in protein-protein complexes. P Natl Acad Sci USA, 14116-14121.
Krogh, A., Brown, M., Mian, I.S., Sjolander, K., and Haussler, D. (1994). Hidden Markov-Models in Computational Biology - Applications to Protein Modeling. J Mol Biol, 1501-1531.
Labesse, G., Colloc'h, N., Pothier, J., and Mornon, J.P. (1997). P-SEA: a new efficient assignment of secondary structure from C alpha trace of proteins. Comput Appl Biosci, 291-295.
Lafferty, J., McCallum, A., and Pereira, F.C.N. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
Larsson, P., Wallner, B., Lindahl, E., and Elofsson, A. (2008). Using multiple templates to improve quality of homology models in automated homology modeling. Protein Science, 990-1002.
Laurie, A.T.R., and Jackson, R.M. (2005). Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics, 1908-1916.
Lee, J., Lee, D., Park, H., Coutsias, E.A., and Seok, C. (2010). Protein loop modeling by using fragment assembly and analytical loop closure. Proteins-Structure Function and Bioinformatics, 3428-3436.
Levitt, M. (1976). A simplified representation of protein conformations for rapid simulation of protein folding. J Mol Biol, 59-107.
Levitt, M. (1992). Accurate Modeling of Protein Conformation by Automatic Segment Matching. J Mol Biol, 507-533.
Li, S.C., Bu, D., Xu, J., and Li, M. (2008). Fragment-HMM: a new approach to protein structure prediction. Protein Science, 1925-1934.
Li, X., and Liang, J. (2007). Knowledge-based energy functions for computational studies of proteins. In Computational methods for protein structure prediction and modeling (Springer), pp. 71-123.
Liu, D., and Nocedal, J. (1989). On the Limited Memory BFGS Method for Large-Scale Optimization. Math Program, 503-528.
Lu, H., and Skolnick, J. (2001). A distance-dependent atomic knowledge-based potential for improved protein structure selection. Proteins, 223-232.
Lu, M., Dousis, A., and Ma, J. (2008). OPUS-PSP: An orientation-dependent statistical all-atom potential derived from side-chain packing. J Mol Biol, 288-301.
Ma, J., Wang, S., Zhao, F., and Xu, J. (2013). Protein threading using context-specific alignment potential. Bioinformatics, i257-i265.
Maiorov, V.N., and Crippen, G.M. (1992). Contact Potential That Recognizes the Correct Folding of Globular-Proteins. J Mol Biol, 876-888.
Mardia, K.V., Taylor, C.C., and Subramaniam, G.K. (2007). Protein bioinformatics and mixtures of bivariate von Mises distributions for angular data. Biometrics, 505-512.
Maupetit, J., Gautier, R., and Tuffery, P. (2006). SABBAC: online structural alphabet-based protein backbone reconstruction from alpha-carbon trace. Nucleic Acids Res, W147-W151.
McGuffin, L.J., Bryson, K., and Jones, D.T. (2000). The PSIPRED protein structure prediction server. Bioinformatics, 404-405.
Meiler, J., Peti, W., and Griesinger, C. (2000). DipoCoup: A versatile program for 3D-structure homology comparison based on residual dipolar couplings and pseudocontact shifts. Journal of biomolecular NMR, 283-294.
Misura, K.M.S., Chivian, D., Rohl, C.A., Kim, D.E., and Baker, D. (2006). Physically realistic homology models built with ROSETTA can be more accurate than their templates. P Natl Acad Sci USA, 5361-5366.
Miyazawa, S., and Jernigan, R.L. (1985). Estimation of Effective Interresidue Contact Energies from Protein Crystal-Structures - Quasi-Chemical Approximation. Macromolecules, 534-552.
Morozov, A.V., Kortemme, T., Tsemekhman, K., and Baker, D. (2004). Close agreement between the orientation dependence of hydrogen bonds observed in protein structures and quantum mechanical calculations. Proc Natl Acad Sci U S A, 6946-6951.
Moult, J. (2005). A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol, 285-289.
Moult, J., Fidelis, K., Kryshtafovych, A., Rost, B., Hubbard, T., and Tramontano, A. (2007). Critical assessment of methods of protein structure prediction - Round VII. Proteins-Structure Function and Bioinformatics 69 Suppl 8, 3-9.
Moult, J., Fidelis, K., Rost, B., Hubbard, T., and Tramontano, A. (2005). Critical assessment of methods of protein structure prediction (CASP) - round 6. Proteins: Structure, Function and Bioinformatics 61 Suppl 7, 3-7.
Moult, J., Fidelis, K., Zemla, A., and Hubbard, T. (2003). Critical assessment of methods of protein structure prediction (CASP) - round V. Proteins-Structure Function and Bioinformatics 53 Suppl 6, 334-339.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 536-540.
Eswar, N., Marti-Renom, M.A., Webb, B., Madhusudhan, M.S., Eramian, D., Shen, M., Pieper, U., and Sali, A. (2006). Comparative Protein Structure Modeling With MODELLER. Current Protocols in Bioinformatics Supplement 15, 5.6.1-5.6.30.
Neal, S., Berjanskii, M., Zhang, H., and Wishart, D.S. (2006). Accurate prediction of protein torsion angles using chemical shifts and sequence homology. Magnetic Resonance in Chemistry, S158-S167.
Notredame, C., Higgins, D.G., and Heringa, J. (2000). T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol, 205-217.
Panchenko, A.R., Marchler-Bauer, A., and Bryant, S.H. (2000). Combination of threading potentials and sequence profiles improves fold recognition. J Mol Biol, 1319-1331.
Panjkovich, A., Melo, F., and Marti-Renom, M. (2008). Evolutionary potentials: structure specific knowledge-based potentials exploiting the evolutionary record of sequence homologs. Genome Biol.
Park, B.H., Huang, E.S., and Levitt, M. (1997). Factors affecting the ability of energy functions to discriminate correct from incorrect folds. J Mol Biol, 831-846.
Pei, J., Kim, B., and Grishin, N. (2008). PROMALS3D: a tool for multiple protein sequence and structure alignments. Nucleic Acids Res, 2295-2300.
Peng, J., Bo, L., and Xu, J. (2009). Conditional Neural Fields. In Advances in Neural Information Processing Systems (NIPS) (Vancouver, Canada).
Peng, J., and Xu, J. (2009). Boosting Protein Threading Accuracy. Lect Notes Comput Sci, 31.
Peng, J., and Xu, J. (2010). Low-homology protein threading. Bioinformatics, i294-i300.
Peng, J., and Xu, J. (2011). RaptorX: exploiting structure information for protein alignment by statistical inference. Proteins-Structure Function and Bioinformatics 79 Suppl 10, 161-171.
Phan, X.-H., Nguyen, L.-M., and Nguyen, C.-T. (2005). FlexCRFs: Flexible Conditional Random Fields.
Qian, B., Raman, S., Das, R., Bradley, P., McCoy, A., Read, R., and Baker, D. (2007). High-resolution structure prediction and the crystallographic phase problem. Nature, 259-U257.
Raman, S., Vernon, R., Thompson, J., Tyka, M., Sadreyev, R., Pei, J.M., Kim, D., Kellogg, E., DiMaio, F., Lange, O., et al. (2009). Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins-Structure Function and Bioinformatics, 89-99.
Reva, B.A., Finkelstein, A.V., Sanner, M.F., and Olson, A.J. (1997). Residue-residue mean-force potentials for protein structure recognition. Protein Eng, 865-876.
Rykunov, D., and Fiser, A. (2010). New statistical potential for quality assessment of protein models and a survey of energy functions. BMC Bioinformatics, 128.
Samudrala, R., and Levitt, M. (2000). Decoys 'R' Us: A database of incorrect conformations to improve protein structure prediction. Protein Science, 1399-1401.
Samudrala, R., and Moult, J. (1998). An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol, 895-916.
Schuler, L.D., Daura, X., and Van Gunsteren, W.F. (2001). An improved GROMOS96 force field for aliphatic hydrocarbons in the condensed phase. J Comput Chem, 1205-1218.
Sha, F., and Pereira, F. (2003). Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1 (Association for Computational Linguistics).
Shakhnovich, E. (2006). Protein folding thermodynamics and dynamics: Where physics, chemistry, and biology meet. Chem Rev, 1559-1588.
Shao, M.F., Wang, S., Wang, C., Yuan, X.Y., Li, S.C., Zheng, W.M., and Bu, D.B. (2011).
Incorporating Ab Initio energy into threading approaches for protein structure prediction. BMC Bioinformatics . Shen, M., and Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Science , 2507-2524. Shi, S., Pei, J., Sadreyev, R., Kinch, L., Majumdar, I., Tong, J., Cheng, H., Kim, B.-H., and Grishin, N. (2009). Analysis of casp8 targets, predictions and assessment methods, . Database . Simon, I., Glasser, L., and Scheraga, H.A. (1991). Calculation of Protein Conformation as an Assembly of Stable Overlapping Segments: Application to Bovine Pancreatic Trypsin Inhibitor. Proceedings of National Academy Sciences, USA , 3661-3665. Simons, K.T., Kooperberg, C., Huang, E., and Baker, D. (1997). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol , 209-225. Simons, K.T., Ruczinski, I., Kooperberg, C., Fox, B.A., Bystroff, C., and Baker, D. (1999). Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins , 82-95. Singh, H., Hnizdo, V., and Demchuk, E. (2002). Probabilistic model for two dependent circular variables. Biometrika , 719-723. Sippl, M. (1993). Recognition of errors in three-dimensional structures of proteins. Proteins: Structure, Function, and Bioinformatics , 355-362. Sippl, M.J. (1990). Calculation of Conformational Ensembles from Potentials of Mean Force - an Approach to the Knowledge-Based Prediction of Local Structures in Globular-Proteins. J Mol Biol , 859-883. Sippl, M.J., and Weitckus, S. (1992). Detection of native-like models for amino acid sequences of unknown three-dimensional structure in a data base of known protein conformations. Proteins-Structure Function and Bioinformatics , 258-271. Skolnick, J. (2006). In quest of an empirical potential for protein structure prediction. Curr Opin Struc Biol , 166-171. 
Skolnick, J., Kihara, D., and Zhang, Y. (2004). Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm. Proteins, 502-518.
Skolnick, J., Kolinski, A., and Ortiz, A. (2000). Derivation of protein-specific pair potentials based on weak sequence fragment similarity. Proteins, 3-16.
Soding, J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics, 951-960.
Specht, D.F. (1990). Probabilistic neural networks. Neural Networks, 109-118.
Swendsen, R.H., and Wang, J.S. (1986). Replica Monte Carlo simulation of spin-glasses. Physical Review Letters, 2607-2609.
Tanaka, S., and Scheraga, H.A. (1976). Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules, 945-950.
Tegge, A., Wang, Z., Eickholt, J., and Cheng, J. (2009). NNcon: improved protein contact map prediction using 2D-recursive neural networks. Nucleic Acids Res, W515-W518.
Tooze, J., and Branden, C. (1999). Introduction to Protein Structure (New York, USA: Garland Publishing, Inc.).
Tuffery, P., and Derreumaux, P. (2005). Dependency between consecutive local conformations helps assemble protein structures from secondary structures using Go potential and greedy algorithm. Proteins-Structure Function and Bioinformatics, 732-740.
Unger, R., Harel, D., Wherland, S., and Sussman, J.L. (1989). A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins: Structure, Function and Genetics, 355-373.
Vapnik, V. (1998). Statistical Learning Theory (New York: Wiley).
Vendruscolo, M., Najmanovich, R., and Domany, E. (2000). Can a pairwise contact potential stabilize native protein folds against decoys obtained by threading? Proteins, 134-148.
Wang, G., and Dunbrack, R. (2003). PISCES: a protein sequence culling server. Bioinformatics, 1589-1591.
Wang, Z., and Xu, J. (2011). A conditional random fields method for RNA sequence-structure relationship modeling and conformation sampling. Bioinformatics, i102-i110.
Wang, Z., Zhao, F., Peng, J., and Xu, J. (2011). Protein 8-class secondary structure prediction using conditional neural fields. Proteomics, 3786-3792.
Wendoloski, J.J., and Salemme, F.R. (1992). Probit - a statistical approach to modeling proteins from partial coordinate data using substructure libraries. Journal of Molecular Graphics, 124-126.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 80-83.
Wu, S., Skolnick, J., and Zhang, Y. (2007a). Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biol.
Wu, S., and Zhang, Y. (2008a). A comprehensive assessment of sequence-based and template-based methods for protein contact prediction. Bioinformatics, 924-931.
Wu, S., and Zhang, Y. (2008b). MUSTER: Improving protein sequence profile-profile alignments by using multiple sources of structure information. Proteins-Structure Function and Bioinformatics, 547-556.
Wu, Y., Lu, M., Chen, M., Li, J., and Ma, J. (2007b). OPUS-Ca: A knowledge-based potential function requiring only C alpha positions. Protein Science, 1449-1463.
Xia, Y., Huang, E.S., Levitt, M., and Samudrala, R. (2000). Ab initio construction of protein tertiary structures using a hierarchical approach. J Mol Biol, 171-185.
Xu, D., Zhang, J., Roy, A., and Zhang, Y. (2011). Automated protein structure modeling in CASP9 by I-TASSER pipeline combined with QUARK-based ab initio folding and FG-MD-based structure refinement. Proteins-Structure Function and Bioinformatics 79 Suppl 10, 147-160.
Xu, J. (2005). Fold recognition by predicted alignment accuracy. IEEE/ACM Trans Comput Biol Bioinform, 157-165.
Xu, J., Peng, J., and Zhao, F. (2009). Template-based and free modeling by RAPTOR++ in CASP8. Proteins: Structure, Function, and Bioinformatics, 133-137.
Xue, B., Faraggi, E., and Zhou, Y. (2009). Predicting residue-residue contact maps by a two-layer, integrated neural-network method. Proteins-Structure Function and Bioinformatics, 176-183.
Yang, Y., Faraggi, E., Zhao, H., and Zhou, Y. (2011). Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates. Bioinformatics, 2076-2082.
Zemla, A. (2003). LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res, 3370-3374.
Zemla, A., Venclovas, C., Moult, J., and Fidelis, K. (1999). Processing and analysis of CASP3 protein structure predictions. Proteins, 22-29.
Zemla, A., Venclovas, C., Moult, J., and Fidelis, K. (2001). Processing and evaluation of predictions in CASP4. Proteins-Structure Function and Bioinformatics, 13-21.
Zhang, C., Vasmatzis, G., Cornette, J.L., and DeLisi, C. (1997). Determination of atomic desolvation energies from the structures of crystallized proteins. J Mol Biol, 707-726.
Zhang, J., and Zhang, Y. (2010). A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PLoS One.
Zhang, Y., Kolinski, A., and Skolnick, J. (2003). TOUCHSTONE II: a new approach to ab initio protein structure prediction. Biophys J, 1145-1164.
Zhang, Y., and Skolnick, J. (2005a). The protein structure prediction problem could be solved using the current PDB library. P Natl Acad Sci USA, 1029-1034.
Zhang, Y., and Skolnick, J. (2005b). TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res, 2302-2309.
Zhang, Y., and Skolnick, J. (2007). Scoring function for automated assessment of protein structure template quality (vol 57, pg 702, 2004). Proteins-Structure Function and Bioinformatics, 1020-1020.
Zhao, F., Li, S., Sterner, B.W., and Xu, J. (2008). Discriminative learning for protein conformation sampling. Proteins: Structure, Function, and Bioinformatics, 228-240.
Zhao, F., Peng, J., DeBartolo, J., Freed, K., Sosnick, T., and Xu, J. (2009). A probabilistic graphical model for ab initio folding. In Research in Computational Molecular Biology (Tucson, Arizona: Springer Berlin Heidelberg), pp. 59-73.
Zhao, F., Peng, J., and Xu, J. (2010). Fragment-free approach to protein folding using conditional neural fields. Bioinformatics, i310-i317.
Zhou, H., Pandit, S.B., and Skolnick, J. (2009). Performance of the Pro-sp3-TASSER server in CASP8. Proteins-Structure Function and Bioinformatics 77 Suppl 9, 123-127.
Zhou, H., and Skolnick, J. (2009). Protein structure prediction by pro-Sp3-TASSER. Biophys J, 2119-2127.
Zhou, H., and Zhou, Y. (2002). Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Science, 2714-2726.