A glance into the evolution of template-free protein structure prediction methodologies


Surbhi Dhingra, Ramanathan Sowdhamini, Frédéric Cadet, and Bernard Offmann ∗

Université de Nantes, CNRS, UFIP, UMR6286, F-44000 Nantes, France
Computational Approaches to Protein Science (CAPS), National Centre for Biological Sciences (NCBS), Tata Institute for Fundamental Research (TIFR), Bangalore 560-065, India
University of Paris, BIGR (Biologie Intégrée du Globule Rouge), Inserm, UMR_S1134, Paris F-75015, France
Laboratory of Excellence GR-Ex, Boulevard du Montparnasse, Paris F-75015, France
DSIMB, UMR_S1134, BIGR, Inserm, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis F-97715, France

Abstract

Prediction of protein structures using computational approaches has been explored for over two decades, paving the way for more focused research and development of algorithms in comparative modelling, ab initio modelling and structure refinement protocols. Tremendous success has been witnessed in template-based modelling protocols, whereas strategies that involve template-free modelling still lag behind, specifically for larger proteins (> 150 a.a.). Various improvements have been observed in ab initio protein structure prediction methodologies over time, with recent ones attributed to the use of deep learning approaches to construct the protein backbone structure from its amino acid sequence. This review highlights the major strategies undertaken for template-free modelling of protein structures while discussing a few tools developed under each strategy. It also briefly comments on the progress observed in the field of ab initio modelling of proteins over the course of time, as seen through the evolution of the CASP platform.

This paper is dedicated to the memory of Anna Tramontano (1957-2017), who was an Italian computational biologist and chair professor of biochemistry at the Sapienza University of Rome.

Declarations of interest: none

Keywords: protein structure prediction; ab-initio; template-free modelling; artificial intelligence

∗ corresponding author: bernard.off[email protected]

arXiv [q-bio.QM]

Abbreviations: Critical Assessment of protein Structure Prediction (CASP), Template-based modelling (TBM), Template-free modelling (TFM), Fragment-based approaches (FBA), Artificial Intelligence (AI).

Ab initio Protein Structure Prediction

A significant amount of sequence data shares no homology with well-studied protein families. This called for the development of approaches that can predict protein structures with minimal or no known structural information. Such approaches fall into the second major class of computational protein structure prediction, called “Template-Free modelling/Free-modelling” (TFM/FM). The word “free” reflects the initial intent of such algorithms to rely on physical laws alone to determine protein structures, although most of the algorithms developed since are guided by structural information. In this review we touch on the evolution of free modelling and the approaches that have been used to predict 3D models. Throughout this review, free modelling, ab initio modelling and de novo modelling are used interchangeably to refer to template-free modelling approaches.

Template-free modelling comprises algorithms, pipelines and methods for generating protein models when no known structural homologs are available. These approaches mainly focused on using physics-based principles and energy terms to model proteins. The nomenclature remains debatable because, in several cases, information from known structures is used in one way or another. This review adopts the following definition as best suited to describe our understanding of TFM: “Ab initio protein structure prediction or Free modelling (FM) can most appropriately be defined as an effort to construct 3D structure without using homologous proteins as template” [14, 22, 32–34]. FM approaches depend chiefly on two components: an algorithm able to rapidly locate the global energy minimum, and a scoring function capable of selecting the best available conformation from the many generated models [35–37].

Figure 1: Computational Protein Structure Prediction Approaches. This figure provides a broad classification of a few computational strategies developed and used to determine protein structure. The two major classes of these methodologies are Template-Based Modelling and Template-Free Modelling. Each of these categories can be further split into a set of strategies based on the basic principle followed by the parent approaches for structure prediction.

The aim of free-modelling protocols is to predict the most stable spatial arrangement of the protein, that is, the one with the lowest free energy. The major challenge faced while developing ab initio approaches is searching the conformational space, which is usually huge considering the dynamic nature of proteins. Since these approaches involve building the protein structure from scratch, the focus is on building effective energy functions that minimise the conformational search space and facilitate accurate folding [22, 34]. Ab initio algorithms can also be guided by experimental data available in the form of abstract NMR restraints, predicted residue-residue contact maps, Cryo-EM density maps etc. [38–40].
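To give a feel for why conformational search dominates the design of ab initio methods, the following back-of-the-envelope sketch counts conformations under a Levinthal-style assumption. The choice of three discrete backbone states per residue and the sampling rate of 10^12 conformations per second are illustrative assumptions, not values taken from this review, and the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def count_conformations(n_residues: int, states_per_residue: int = 3) -> int:
    """Count conformations if every residue independently adopts one of
    `states_per_residue` discrete backbone states (an assumed toy model)."""
    if n_residues < 0 or states_per_residue < 1:
        raise ValueError("need n_residues >= 0 and states_per_residue >= 1")
    return states_per_residue ** n_residues


def exhaustive_search_years(n_conformations: int, rate_per_second: float = 1e12) -> float:
    """Years needed to enumerate every conformation at an assumed sampling rate."""
    return n_conformations / rate_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Lengths chosen to match the sizes discussed in the text.
    for length in (50, 150, 256):
        total = count_conformations(length)
        years = exhaustive_search_years(total)
        print(f"{length:>3} residues: {total:.2e} conformations, {years:.2e} years")
```

Even under this crude model, exhaustive enumeration is out of reach well below 150 residues, which is why the methods discussed here rely on energy functions and restraints to prune the search rather than on enumeration.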

Ab initio Prediction of Protein Structure

Free modelling has flourished over the years owing to the several strategies developed for structure prediction, a few of which are listed in Table 1. Initially the scientific community resorted to pure physics-based laws, MD simulations etc. to explore the atomic dynamics of protein molecules. Over time the prediction horizon expanded to include restraints such as C α -C α distances, dihedral angles, solvent interactions, side-chain atoms, contact map information and more, derived from available structures. The newer fundamentals involve building saturated libraries of structural information in the form of small fragments, secondary structural elements, motifs, foldons etc. Below we have broadly classified the ab initio protein structure prediction approaches.

Table 1: Strategies available for protein structure prediction. The first two columns provide the name and a brief description of a few algorithms developed over time for ab initio protein structure prediction. The third column states the length which, for most tools, is indicative of the state-of-the-art results observed throughout the field, i.e., able to achieve a GDT_TS score of …% and above. For the most part, generation of high-accuracy models is still limited to small proteins (< … residues long).

| Algorithms/Servers | Strategy/Approach | Maximum predicted length | Reference |
|---|---|---|---|
| Rosetta | Fragment assembly using MC simulations, all-atom energy function to determine structure and clustering of models | ∼ … | [ ] |
| Quark | Fragment-based assembly using REMC simulation guided by knowledge-based potentials | up to 150 | [ , ] |
| EdaRose | Utilising the EdaFold algorithm for fragment-based assembly with cluster-based and energy-based variations | up to 150 | [ ] |
| Chunk-Tasser | Hybrid approach using restraints derived from supersecondary structure chunks as well as by threading the templates | < … | [ ] |
| UNRES | Physics-based conformational space search using the UNRES energy function | < … | [ ] |
| BCL::FOLD | Assembling secondary structural elements using MC sampling and knowledge-based energy functions | ∼ … | [ , , ] |
| SS-Thread | Prediction of contacting pairs of α-helix and β-strand | < … | [ ] |
| UniCon3D | Using foldons and probabilistic models to capture local backbone structural preference and side-chain conformation search space | ∼ … | [ ] |
| Bhageerath-H | Hybrid approach involving a combination of several tools developed in the lab with the goal of reducing the conformational search space | ∼ … | [ ] |
| SmotifTF | Fragment-based approach developed using a saturated library of supersecondary structure fragments | ∼ … | [ ] |
| Touchstone | Secondary and tertiary restraints prediction through threading-based approaches | ∼ … | [ , ] |
| PconsFold | Evolutionary-based structure prediction pipeline using PconsC contact predictions on Rosetta folding protocols | ∼ … | [ ] |
| EdaFold | Fragment-based all-atom energy function to produce atomic models | up to 150 | [ , ] |
| Astro-Fold | Folding amino acid sequences using first-principle based approaches | ∼ … | [ , ] |
| DMFold | Deep learning based protein modelling by incorporating predicted structural constraints like inter-atomic distance bonds, hydrogen bonds and torsion angles | up to 200 | [ ] |

Physics-based approaches formed the basis of the initial algorithms built in this emerging field. The main idea behind developing these approaches is to rely on MD simulations to trace the folding path of proteins. The philosophy backing their design is to obtain the lowest-energy conformation by folding the protein sequence using quantum mechanics and the Coulomb potential [63, 64]. Owing to the high computational requirements, however, the field relies mostly on interatomic interactions and force fields to solve the protein folding problem.

Free energy calculations have been explored from the very beginning of computational protein structure prediction. It is believed that these approaches can go beyond documented structures and capture novel folds and patterns by exploring the inherent dynamic motion of proteins [65, 66].
Despite the availability of better computing resources, physics-based approaches continue to lag behind, both because of the time required to reach the native state and because inaccuracies in force fields can prevent the model from attaining it [12, 67–69].

MELD (Modelling Employing Limited Data) [65] is a recently developed physics-based protein structure prediction approach which applies Bayes' law to atomistic molecular dynamics of proteins for structural modelling. It has proven to be effective in determining high-resolution structures of proteins up to 260 residues long [65]. A similar effort was made by David Shaw’s group, who used different sets of restraints to shorten the MD simulation runs and prevent the model from getting trapped in non-native energy states [70]. H. Nguyen et al. demonstrated that the combination of an implicit solvent and a force field can yield near-native models for small proteins (fewer than 100 amino acids) [71]. Another group showed that simulation time can be reduced and energy landscapes can be managed using a residue-specific force field (RSFF1) in explicit solvent together with replica exchange molecular dynamics (REMD) [72].
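The replica exchange idea mentioned above can be illustrated with the standard Metropolis criterion for swapping configurations between two temperatures. This is a minimal sketch of the textbook exchange rule, not the specific protocol of [72]; the function names are hypothetical, and energies are assumed to be in kcal/mol with temperatures in kelvin.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def swap_probability(e_i: float, t_i: float, e_j: float, t_j: float) -> float:
    """Probability of exchanging the configurations of replica i (energy e_i at
    temperature t_i) and replica j (energy e_j at temperature t_j).

    Standard criterion: min(1, exp((beta_i - beta_j) * (e_i - e_j))),
    with beta = 1 / (k_B * T).
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def attempt_swap(e_i: float, t_i: float, e_j: float, t_j: float, rng=random) -> bool:
    """Draw a uniform random number and decide whether the swap is accepted."""
    return rng.random() < swap_probability(e_i, t_i, e_j, t_j)
```

A swap that moves the lower-energy configuration to the colder replica is always accepted, while the reverse is accepted only occasionally. This is how high-temperature replicas help the cold replica escape non-native energy traps.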

Fragment-based assembly is by far the most successful strategy used for template-free prediction of protein structure. This approach revolves around the construction of fragment libraries of varied lengths, where each fragment represents a pseudo-structure. The idea is to map information from protein fragments instead of using entire templates to construct the protein model. Segments of the query sequence are replaced by the fragment coordinates recorded in the fragment library or by their predicted fold. Since it is computationally prohibitive to go through all possible fold conformations for a structure built from scratch, fragmenting the sequence limits the number of folding patterns and thus reduces the computational expense. Bowie and Eisenberg introduced the fragment-based assembly approach to predict protein structures [41]. They used fragments of length 9 to 25 from a database of known proteins and an energy function (composed of 6 terms) to guide the building of energetically stable models [41]. This attempt set the path for the evolution of computational 3D modelling of protein structures using fragments.

Through the years several fragment-based approaches have been developed, a few of which have done exceedingly well and remain the best options for ab initio protein structure prediction to date. The basic idea behind these algorithms remains the same; they typically differ in the fragment type, the fragment length and the scoring functions used to generate an energetically minimised, stable structure. Rosetta [43, 44], one of the most renowned fragment-based approaches, uses fragment libraries of length 3 and 9. It follows a Monte Carlo simulation based strategy to predict globally minimised protein models. The scoring function used in Rosetta is based on a Bayesian separation of the total energy into individual components. Its highest achievement was noted during CASP11, where it correctly predicted the structure of a 256 amino acid long sequence [73].

SmotifTF [54] builds a library of supersecondary structure fragments, known as Smotifs, to build probable models. The construction and use of the fragment library are based on fragment assembly protocols. The fragment collection is governed by weak sequence similarities and generates fragments of 25 residues in length on average. QUARK [49] has a more dynamic fragment length range of up to 20 residues; the fragments are assembled using replica-exchange Monte Carlo simulations guided by a knowledge-based force field. It was ranked as the best predictor in the FM category in both CASP9 [18] and CASP10 [19], and was among the two dominant tools in CASP11 [73].

The energy functions or scoring functions used in FBA are directed by micro-state interactions existing within known protein structures. These energy terms or functions are also termed “Knowledge Based Potentials” [74], and FBA algorithms seek to optimise them. On the one hand, FBA-based algorithms have witnessed the most success in the biennial CASP competition by designing algorithms around the principle that certain local structures are favoured by the local amino acid sequence. On the other hand, this limits their ability to search for alternate conformations of the protein within a single run, which reduces their probability of discovering a novel fold.
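The fragment insertion loop shared by these methods can be sketched as follows. This is a toy illustration under stated assumptions, not Rosetta's or QUARK's actual implementation: the conformation is reduced to backbone (phi, psi) pairs, the fragment library is filled with random torsions rather than fragments from known structures, and `toy_energy` is a hypothetical stand-in for a knowledge-based potential. Only the fragment length of 3 comes from the text.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Torsion = Tuple[float, float]  # (phi, psi) in degrees
FRAG_LEN = 3                   # one of the fragment lengths quoted for Rosetta


def toy_energy(conf: List[Torsion]) -> float:
    """Hypothetical score: penalise deviation from helix-like torsions (-60, -45)."""
    return sum(((phi + 60.0) / 180.0) ** 2 + ((psi + 45.0) / 180.0) ** 2
               for phi, psi in conf)


def make_toy_library(n_residues: int, per_position: int = 20,
                     seed: int = 1) -> Dict[int, List[List[Torsion]]]:
    """Random fragments for every start position (a real library would be mined
    from known structures using local sequence similarity)."""
    rng = random.Random(seed)
    return {
        start: [[(rng.uniform(-180, 180), rng.uniform(-180, 180))
                 for _ in range(FRAG_LEN)]
                for _ in range(per_position)]
        for start in range(n_residues - FRAG_LEN + 1)
    }


def fragment_assembly(n_residues: int, library: Dict[int, List[List[Torsion]]],
                      energy: Callable[[List[Torsion]], float],
                      n_steps: int = 2000, temperature: float = 0.5,
                      seed: int = 0) -> Tuple[List[Torsion], float]:
    """Metropolis Monte Carlo over fragment insertions; returns the best state seen."""
    rng = random.Random(seed)
    conf: List[Torsion] = [(-120.0, 120.0)] * n_residues  # extended starting chain
    current = energy(conf)
    best, best_e = list(conf), current
    for _ in range(n_steps):
        start = rng.randrange(n_residues - FRAG_LEN + 1)
        fragment = rng.choice(library[start])
        trial = conf[:start] + fragment + conf[start + FRAG_LEN:]
        e = energy(trial)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e <= current or rng.random() < math.exp(-(e - current) / temperature):
            conf, current = trial, e
            if current < best_e:
                best, best_e = list(conf), current
    return best, best_e
```

Because each move replaces a whole window of torsions with a locally plausible fragment, the search space shrinks from all torsion combinations to the combinations offered by the library. That restriction is both the strength of the approach and the source of the limitation on discovering novel folds noted above.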

Algorithms that use secondary structure elements (SSEs) for building protein models usually focus on assembling the core backbone of the protein, leaving out the loop regions, which are handled later by model refinement protocols. BCL::FOLD [15] is one such algorithm, built with the objective of overcoming the size and complexity limits faced by most approaches. In a later edition, restraints recovered from sparse NMR data were also incorporated into the pipeline, aiding rapid identification of protein topology [40]. This was benchmarked on a protein data set of up to 565 residues in length, containing both soluble and membrane proteins. The algorithm was tested on 20 CASP11 targets, for twelve of which it was able to produce a GDT_TS score of 30% on average [52]. The average GDT_TS score was reported as 36%. The study was conducted using targets belonging to different categories offered by CASP, for example T0 (regular targets), TP (predicted residue-residue contacts) and TS (NMR-NOE restraints). This study also pointed out that better structures were predicted for proteins dominated by α-helices than for those dominated by β-strands. The prediction accuracy also decreased as the size of the protein increased.

Another algorithm based on a similar principle is SSThread [53]. It predicts contacting pairs of α-helices and β-strands from experimental structures, secondary structure prediction and contact map predictions. The overlapping pairs are then assembled into a core structure, followed by the prediction of loop regions. The contact pairing strategy employed by SSThread has been shown to be better at predicting β-strand pairs than all-α pairs.

Quite recently, neural network based deep learning approaches have seen a boom in the field of protein structure prediction (PSP). So far, deep learning approaches for PSP have mostly been used as one component of the entire pipeline rather than as its driving force. The majority of their use revolves around the prediction of residue-residue contacts, which are primarily derived through co-evolutionary approaches and/or by building sequence alignment profiles [75–77].

Recent work by AlQuraishi [78] focused on building a pure deep learning based prediction approach. It was designed as a one-step algorithm for prediction of protein structure relying on an end-to-end differentiable deep learning strategy. The emphasis was on not using any co-evolutionary data or information from existing templates for protein model construction. Instead, the algorithm relied on data derived solely from the protein sequence in question and the evolutionary profile of individual residues within the sequence. This method achieved state-of-the-art results for ab initio modelling protocols.

Another tool that showed prominence in CASP13 is DeepMind’s AlphaFold [79]. It uses a two-step process for protein structure determination, which also involves the use of co-evolutionary profiles to guide model building. Through this methodology, high-accuracy models were constructed for 24 out of 43 test proteins, achieving a TM-score of 0.7 and above in the template-free modelling domain.

The community is still beginning to explore the benefits of deep learning approaches for PSP. The major setback that these techniques may encounter relates to the limited availability of structural data, since these approaches are trained on patterns found in the available data. Structure determination has always been slower than sequencing, which translates into less data that can be used to train the algorithms. Hence, although deep learning based approaches can be implemented more readily in the sequence domain of protein biology, it will take other advances in structural biology to push them forward for structure prediction. Another problem to which such approaches can be prone is over-training of the algorithms.
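As a concrete picture of what contact predictors are trained to output, the sketch below derives a residue-residue contact map from known coordinates. It is only an illustration of the target representation, not a predictor. The 8 Å distance cutoff and the minimum sequence separation of 3 are commonly used conventions assumed here, one representative atom per residue is assumed, and `contact_map` is a hypothetical helper name.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def contact_map(coords: Sequence[Point], cutoff: float = 8.0,
                min_separation: int = 3) -> List[List[bool]]:
    """Symmetric boolean matrix: True where two residues at least
    `min_separation` apart in sequence lie closer than `cutoff` angstroms."""
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts[i][j] = contacts[j][i] = True
    return contacts


if __name__ == "__main__":
    # A six-residue toy hairpin with 3.8 angstrom spacing between neighbours.
    hairpin = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
               (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
    for row in contact_map(hairpin):
        print("".join("#" if c else "." for c in row))
```

A predicted map of this kind can then be converted into distance restraints that guide fragment assembly or other folding protocols, which is how most of the pipelines described here use deep learning.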

With the advancements made in computational approaches to protein structure prediction, the line between individual methodologies is blurring. The structure prediction community is now moving towards “Hybrid Approaches”, which do not strictly rely on pure template-based or template-free prediction criteria but on an amalgamation of both. Bhageerath [80] is one such homology/ab initio hybrid protocol. It is available in the form of a web server called Bhageerath-H [37]. The main focus of the pipeline is on reducing the conformational search space. Out of thousands of predicted models, the top 5 are selected based on a physico-chemical metric (pcSM) scoring function specific to this algorithm. The efficiency of this software was tested on CASP10 targets with promising prediction results. After an assessment of its shortcomings, an updated version was presented at the CASP12 meeting as BhageerathH+ [81].

In another study, QUARK [49] and fragment-guided molecular dynamics (FG-MD) were added to the I-TASSER pipeline [11, 82] to improve on the existing protocol [33, 83]. The basic idea was to feed ab initio structures generated by QUARK into LOMETS [84] to find any hit among existing homologous templates with a good TM-score. The top hits are then passed into the I-TASSER pipeline for atomic refinement to obtain a structure with low RMSD. This combination produced better results for FM targets in the CASP10 and CASP11 experiments than QUARK alone [33, 85]. The MULTICOM_NOVEL approach is one more example of a hybrid algorithm, constructed by combining various complementary structure prediction pipelines including the MULTICOM server, I-TASSER, RaptorX [14], Rosetta etc.

Chunk-Tasser can also be put into this category, as it uses chunks of folded secondary structural fragments along with fold recognition to assemble protein structures [50].

On similar grounds, an initiative was undertaken in 2014 to combine the methods of the best known protein structure prediction techniques into a pipeline that could generate better structures. This initiative came to be known as WeFold, in which 13 labs collaborated to merge their algorithms, forming 5 major branches [86]. The outcome was promising, and the authors of this study discussed further improvements to be made in prediction protocols as a result of this ’coopetition’ [86].
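Since the TM-score is used above as the criterion for a good template hit, a short sketch of its standard definition may help. The score sums 1 / (1 + (d_i / d0)^2) over aligned residue pairs and normalises by the target length, with d0 = 1.24 (L - 15)^(1/3) - 1.8. The sketch assumes the per-residue distances come from one fixed superposition, whereas the full TM-score maximises over superpositions. The 0.5 Å floor on d0 for short chains is an assumption made here to keep the formula defined, and the function names are hypothetical.

```python
from typing import Sequence


def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 = 1.24 * (L - 15)^(1/3) - 1.8.
    The 0.5 angstrom floor for short chains is an assumption of this sketch."""
    if target_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score for one fixed superposition.

    `distances` holds the distance in angstroms for each aligned residue pair;
    unaligned target residues simply contribute nothing to the sum.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if len(distances) > target_length:
        raise ValueError("more aligned pairs than target residues")
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length
```

Because d0 grows with chain length, the score is far less length-dependent than RMSD, which is one reason it is convenient for ranking hits across targets of different sizes.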

Evolution of CASP and its contribution

CASP has been a major contributing factor to the work done in the field of computational protein structure prediction. It is a biennial competition that has been conducted for around two decades, serving as a platform to judge the accuracy of prediction pipelines. It has grown over time into a protein structure prediction platform that assesses prediction strategies in domains such as template-based modelling, template-free modelling, refinement protocols, contact prediction etc. [12, 17, 87, 88].

To keep track of the advancement in PSP techniques, CASP prepares a list of protein sequences with as yet unreleased structures in each category every two years. This provides uniformity in assessing the advancement in each area of structure prediction. The list provided for blind testing of ab initio modelling approaches often consists of protein sequences with “soon to be released” structures. The best models are determined on the basis of several criteria, one of them being a local-global alignment score called the GDT_TS score (Global Distance Test) [89]. It evaluates the C α distances between corresponding residues of the model and the experimental target structure at defined distance cut-off values, thereby capturing both local and global similarities between the two structures.

The initial achievement in protein tertiary structure prediction was observed in CASP4, but mainly for small proteins ( ≤ 120 residues). In later years, the ab initio prediction field remained stagnant for about a decade, until better contact prediction approaches were introduced in the pipelines competing in CASP11, with promising improvements in prediction accuracy [90]. A similar trend was observed in CASP12 with the inclusion of alignment-based contact prediction methods [91].

The recently conducted CASP13 demonstrated further improvement in the average GDT_TS score owing to the use of deep learning approaches in structure prediction [78]. This served as an encouragement to dig further into deep learning based approaches to solve the protein folding problem.
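To make the GDT_TS score concrete, the sketch below computes it from per-residue C α distances between a model and the experimental structure. GDT_TS averages the percentage of residues lying within 1, 2, 4 and 8 Å. The sketch assumes the distances come from a single, already fixed superposition, whereas the actual Global Distance Test searches over many superpositions to maximise each percentage. `gdt_ts` is a hypothetical helper name.

```python
from typing import Sequence


def gdt_ts(distances: Sequence[float],
           cutoffs: Sequence[float] = (1.0, 2.0, 4.0, 8.0)) -> float:
    """GDT_TS on a 0-100 scale for one fixed superposition.

    `distances` holds one C-alpha distance (in angstroms) per target residue.
    """
    if not distances:
        raise ValueError("at least one residue distance is required")
    n = len(distances)
    fractions = [sum(d <= cutoff for d in distances) / n for cutoff in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Five residues: 1, 2, 3 and 4 of them fall within 1, 2, 4 and 8 angstroms.
    print(gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0]))  # prints 50.0
```

Using several cutoffs at once is what lets the score reward a model that places most of the chain correctly even when a loop or terminus is badly off, a case in which a single global RMSD would be dominated by the outliers.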

Template-based predictions are in general quicker than experimental methods, at least in providing an initial spatial arrangement of the protein. One of the major drawbacks of these approaches is the redundancy of information, i.e., no new fold or family can be discovered because they rely on building models from existing structures. In addition, these methods fail to establish the structural integrity of a protein sequence as sequence or structure identity decreases.

This review looks at a few of the free-modelling methods and techniques developed and available for the prediction of protein structure. Ab initio protein structure prediction methods still bear the influence of PDB structures, which are used to optimise the parameters of protein folding. This information helps them reduce the conformational space sampling requirements by maximising the efficiency of energy functions. Most of the algorithms are still directed by a combination of knowledge-based potentials and physics-based approaches [92].

To date, free modelling has been well adapted for protein sequences of 150 residues or fewer [16, 92, 93]. In a few instances, algorithms have outdone themselves and gone beyond this length restriction to predict structures for longer proteins. CASP11 witnessed a major success in ab initio protein structure prediction for a structure of length 256 a.a. [90].

The major challenge faced by this field starts with finding an efficient way to explore the conformational search space, as a protein sequence can fold into an indefinite number of forms. Reducing the plausible folding possibilities to the single most probable fold is therefore a hard task.

The length limitation could point to the design strategy of the algorithms, many of which rely on defining domain boundaries prior to structure prediction. This is not a very strong argument, however, as the most reliable predictions still lie under a length of 150 residues, although a single domain could extend to a length of 200 to 250 residues. The only exception seen over the past years of the CASP competition is the case in which a protein of length 256 a.a. was accurately predicted [90].

One could also argue that, as a community, we are still at the domain level of structure prediction given the length limitations. Algorithms have been built that specifically target solving the structure of single-domain proteins [57, 94], or that use restraints limited to single-domain proteins only [56]. Many of the designed algorithms also tend to be validated on data sets of single-domain proteins [75].

Even so, in the case of small proteins, not all predictions deliver a model close to the native structure. It is only that the chances of obtaining a good structure are higher if the length of the protein is around 150 residues or less.

The stagnancy of the field over many years has also been pointed out repeatedly. One example is the discussion provided at the end of CASP10 [95]. The study pointed out that the results from CASP10 lay close to those observed during CASP5. It also noted that this might be due to the gradual increase in the complexity of CASP targets, along with the inclusion of multi-domain targets provided for modelling.

Another drawback faced by a few of these algorithms is the run time, which can vary considerably depending on the size of the protein and the internal functioning of the algorithm. With the inclusion of artificial intelligence, the time scale for modelling has been reduced to milliseconds, but in general a single prediction can take anywhere from a few minutes to hours or days to complete.

Most of the algorithms still rely on manual intervention to complete the runs, so human error should also be considered.

The point of PSP is not just high-accuracy structure determination but also to ascertain the basis of the underlying biological process (protein folding), thereby finally answering questions like “why do good proteins become faulty and cause disease”.

De novo protein structure prediction still requires a lot of improvement, but at the same time it promises better prospects for structure prediction in the future. It brings with it the hope of predicting newer folds at a faster pace than experimental approaches, which can remain stuck for years for numerous reasons. In general, computational structure prediction techniques, though they have room for improvement, are still quick when compared to traditional approaches [16]. Template-based modelling approaches still have a few limitations, whereas ab initio approaches can move a step ahead and might help in understanding the basic principles of protein folding [93].

Acknowledgements

The authors are most thankful to Yves-Henri Sanejouand for critical reading of the manuscript. SD is thankful to Conseil Régional de La Réunion and Fonds Social Européen for providing a PhD scholarship under tier 234275, convention DIRED/20161451. BO is thankful to Conseil Régional Pays de la Loire for support in the framework of GRIOTE grant.

Conﬂict of interest

Authors declare no competing interests.

Authors contributions

Conception of the work was done by SD, RS, FC and BO. SD, RS and BO performed literature data collection and analysis. All four authors SD, RS, FC and BO wrote the manuscript. All authors approved the final version of the article.

References [1] James C Whisstock and Arthur M Lesk. Prediction of protein function from proteinsequence and structure.

Quarterly reviews of biophysics , 36(3):307–340, 2003.[2] Gina Kolata. Trying to crack the second half of the genetic code.

Science , 233:1037–1040,1986.[3] Emmanuel Boutet, Damien Lieberherr, Michael Tognolli, Michel Schneider, and AmosBairoch. UniProtKB/Swiss-Prot.

Methods in molecular biology (Clifton, N.J.) , 406:89–112, 2007.[4] Peter W. Rose, Andreas Prlić, Ali Altunkaya, Chunxiao Bi, Anthony R. Bradley,Cole H. Christie, Luigi Di Costanzo, Jose M. Duarte, Shuchismita Dutta, Zukang Feng,Rachel Kramer Green, David S. Goodsell, Brian Hudson, Tara Kalro, Robert Lowe, EzraPeisach, Christopher Randle, Alexander S. Rose, Chenghua Shao, Yi Ping Tao, YanaValasatava, Maria Voigt, John D. Westbrook, Jesse Woo, Huangwang Yang, Jasmine Y.Young, Christine Zardecki, Helen M. Berman, and Stephen K. Burley. The RCSB proteindata bank: Integrative view of protein, gene and 3D structural information.

Nucleic AcidsResearch , 45(D1):D271–D281, 2017.[5] Max F. Perutz, Michael G. Rossmann, Ann F. Cullis, Hilary Muirhead, Georg Will, andA. C. T. North. Structure of Haemoglobin: A Three-Dimensional Fourier Synthesis at5.5-Å. Resolution, Obtained by X-Ray Analysis.

Nature , 185(4711):416–422, 1960.[6] Xavier Morelli, Alain Dolla, Myrjam Czjzek, P. Nuno Palma, Francis Blasco, Ludwig Krip-pahl, Jose J.G. Moura, and Françoise Guerlesquin. Heteronuclear NMR and soft docking:An experimental approach for a structural model of the cytochrome c553-ferredoxin com-plex.

Biochemistry , 39(10):2530–2537, 2000.[7] Dong Xu and Yang Zhang. Ab Initio structure prediction for Escherichia coli: towardsgenome-wide protein structure modeling and fold assignment.

Scientiﬁc reports , 3:1895,2013.[8] Ewen Callaway. The Revolution Will Not Be Crystallized.

Nature , 525:172–174, 2015.[9] Christian B. Anﬁnsen. The Formation and Stabilization of Protein Structure.

BiochemicalJournal , 128(4):737–749, 1972.[10] Christian B Anﬁnsen. Principles that Govern the Folding of Protein Chains.

Science ,181(4096):223–230, 1973.[11] Ambrish Roy, Alper Kucukural, and Yang Zhang. I-TASSER: a uniﬁed platform for au-tomated protein structure and function prediction.

Nature Protocols , 5(4):725–738, 2010.[12] Michael Feig. Computational protein structure reﬁnement: almost there, yet still so far togo.

Wiley Interdisciplinary Reviews: Computational Molecular Science, 7(3), 2017.

[13] Andras Fiser. Template-based protein structure modeling. Methods in Molecular Biology (Clifton, N.J.), 673:73–94, 2010.

[14] Morten Källberg, Haipeng Wang, Sheng Wang, Jian Peng, Zhiyong Wang, Hui Lu, and Jinbo Xu. Template-based protein structure modeling using the RaptorX web server. Nature Protocols, 7(8):1511–1522, 2012.

[15] Mert Karakas, Nils Woetzel, Rene Staritzbichler, Nathan Alexander, Brian E. Weiner, and Jens Meiler. BCL::Fold - De Novo Prediction of Complex and Large Protein Topologies by Assembly of Secondary Structure Elements. PLoS ONE, 7(11), 2012.

[16] Ling-Hong Hung, Shing-Chung Ngan, and Ram Samudrala. De Novo Protein Structure Prediction, pages 43–63. Springer New York, New York, NY, 2007.

[17] Moshe Ben-David, Orly Noivirt-Brik, Aviv Paz, Jaime Prilusky, Joel Sussman, and Yaakov Levy. Assessment of CASP8 structure predictions for template free targets. Proteins: Structure, Function, and Bioinformatics, 77(Suppl. 9):50–65, 2009.

[18] Lisa Kinch, Shuo Yong Shi, Qian Cong, Hua Cheng, Yuxing Liao, and Nick V. Grishin. CASP9 assessment of free modeling target predictions. Proteins: Structure, Function, and Bioinformatics, 79(Suppl. 10):59–73, 2011.

[19] Chin Hsien Tai, Hongjun Bai, Todd J. Taylor, and Byungkook Lee. Assessment of template-free modeling in CASP10 and ROLL. Proteins: Structure, Function, and Bioinformatics, 82(Suppl. 2):57–83, 2014.

[20] David Simoncini, Thomas Schiex, and Kam Y.J. Zhang. Balancing exploration and exploitation in population-based sampling improves fragment-based de novo protein structure prediction.
Proteins: Structure, Function, and Bioinformatics, 85(5):852–858, 2017.

[21] Lisa Kinch, Wenlin Li, Richard Schaeffer, Roland Dunbrack, Bohdan Monastyrskyy, Andriy Kryshtafovych, and Nick Grishin. CASP 11 target classification. Proteins: Structure, Function, and Bioinformatics, 84:20–33, 2016.

[22] Bee Yin Khor, Gee Jun Tye, Theam Soon Lim, and Yee Siew Choong. General overview on structure prediction of twilight-zone proteins. Theoretical Biology and Medical Modelling, 12(1):15, 2015.

[23] Konstantin Arnold, Lorenza Bordoli, Jürgen Kopp, and Torsten Schwede. The SWISS-MODEL workspace: A web-based environment for protein structure homology modelling. Bioinformatics, 22(2):195–201, 2006.

[24] Lorenza Bordoli, Florian Kiefer, Konstantin Arnold, Pascal Benkert, James Battey, and Torsten Schwede. Protein structure homology modeling using SWISS-MODEL workspace. Nature Protocols, 4(1):1–13, 2008.

[25] Alexander Miguel Monzon, Diego Javier Zea, Cristina Marino-Buslje, and Gustavo Parisi. Homology modeling in a dynamical world. Protein Science, 26:2195–2206, 2017.

[26] Andras Fiser, Roberto Sánchez, Francisco Melo, and Andrej Sali. Comparative Protein Structure Modeling. Computational Biochemistry and Biophysics, 2001.

[27] Benjamin Webb and Andrej Sali. Comparative protein structure modeling using MODELLER.
Current Protocols in Bioinformatics, 54(1):5.6.1–5.6.37, 2016.

[28] Burkhard Rost, Reinhard Schneider, and Chris Sander. Protein fold recognition by prediction-based threading. Journal of Molecular Biology, 270(3):471–480, 1997.

[29] Jeffrey Skolnick and Daisuke Kihara. Defrosting the frozen approximation: PROSPECTOR–a new approach to threading. Proteins: Structure, Function, and Bioinformatics, 42(3):319–331, 2001.

[30] William R Taylor and Inge Jonassen. A structural pattern-based method for protein fold recognition. Proteins: Structure, Function, and Bioinformatics, 56(2):222–34, 2004.

[31] Jinbo Xu, Feng Jiao, and Libo Yu. Protein structure prediction using threading. Methods in Molecular Biology (Clifton, N.J.), 413:91–121, 2008.

[32] David E. Kim, Ben Blum, Philip Bradley, and David Baker. Sampling Bottlenecks in De novo Protein Structure Prediction. Journal of Molecular Biology, 393(1):249–260, 2009.

[33] Wenxuan Zhang, Jianyi Yang, Baoji He, Sara Walker, Hongjiu Zhang, Brandon Govindarajoo, Jouko Virtanen, Zhidong Xue, Hong-Bin Shen, and Yang Zhang. Integration of QUARK and I-TASSER for ab initio protein structure prediction in CASP11. Proteins: Structure, Function, and Bioinformatics, 84(Suppl. 1):76–86, 2015.

[34] Debswapna Bhattacharya, Renzhi Cao, and Jianlin Cheng. UniCon3D: De novo protein structure prediction using united-residue conformational search via stepwise, probabilistic sampling. Bioinformatics, 32(18):2791–2799, 2016.

[35] Paul IW de Bakker, Nicholas Furnham, Tom L. Blundell, and Mark A. DePristo. Conformer generation under restraints.
Current Opinion in Structural Biology, 16(2):160–165, 2006.

[36] Adam Liwo, Cezary Czaplewski, Stanisław Ołdziej, and Harold A Scheraga. Computational techniques for efficient conformational sampling of proteins. Structure, 18(2):134–139, 2008.

[37] B Jayaram, Priyanka Dhingra, Avinash Mishra, Rahul Kaushik, Goutam Mukherjee, Ankita Singh, and Shashank Shekhar. Bhageerath-H: A homology/ab initio hybrid server for predicting tertiary structures of monomeric soluble proteins. BMC Bioinformatics, 15(Suppl. 16):S7, 2014.

[38] Peter M Bowers, Charlie E Strauss, and David Baker. De novo protein structure determination using sparse NMR data. Journal of Biomolecular NMR, 18(4):311–318, 2000.

[39] Maya Topf, Matthew L. Baker, Marc A. Marti-Renom, Wah Chiu, and Andrej Sali. Refinement of protein structures by iterative comparative modeling and cryoEM density fitting. Journal of Molecular Biology, 357(5):1655–1668, 2006.

[40] Brian Weiner, Nathan Alexander, Louesa Akin, Nils Woetzel, Mert Karakas, and Jens Meiler. BCL::Fold - Protein topology determination from limited NMR restraints. Proteins: Structure, Function, and Bioinformatics, 82:587–595, 2014.

[41] James U. Bowie and David Eisenberg. An evolutionary approach to folding small alpha-helical proteins that uses sequence information and an empirical guiding fitness function. Proceedings of the National Academy of Sciences, 91(10):4436–4440, 1994.

[42] Kim T. Simons, Charles Kooperberg, Enoch Huang, and David Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions.
Journal of Molecular Biology, 268(1):209–225, 1997.

[43] Kim T. Simons, Rich Bonneau, Ingo Ruczinski, and David Baker. Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins: Structure, Function, and Bioinformatics, 37(Suppl. 3):171–176, 1999.

[44] Richard Bonneau, Jerry Tsai, Ingo Ruczinski, Dylan Chivian, Carol Rohl, Charlie E M Strauss, and David Baker. Rosetta in CASP4: Progress in ab initio protein structure prediction. Proteins: Structure, Function, and Bioinformatics, 45(Suppl. 5):119–126, 2001.

[45] Srivatsan Raman, Robert Vernon, James Thompson, Michael Tyka, Ruslan Sadreyev, Jimin Pei, David Kim, Elizabeth Kellogg, Frank DiMaio, Oliver Lange, Lisa Kinch, Will Sheffler, Bong Hyun Kim, Rhiju Das, Nick V. Grishin, and David Baker. Structure prediction for CASP8 with all-atom refinement using Rosetta. Proteins: Structure, Function, and Bioinformatics, 77(Suppl. 9):89–99, 2009.

[46] Hahnbeom Park, Frank DiMaio, and David Baker. CASP11 refinement experiments with Rosetta. Proteins: Structure, Function, and Bioinformatics, 84(Suppl. 1):314–322, 2015.

[47] Sergey Ovchinnikov, David Kim, Ray Wang, Yuan Liu, Frank DiMaio, and David Baker. Improved de novo structure prediction in CASP11 by incorporating co-evolution information into Rosetta. Proteins: Structure, Function, and Bioinformatics, 84:67–75, 2015.

[48] Rebecca Faye Alford, Andrew Leaver-Fay, Jeliazko R. Jeliazkov, Matthew J O’Meara, Frank P. DiMaio, Hahnbeom Park, Maxim V Shapovalov, P. Douglas Renfrew, Vikram Khipple Mulligan, Kalli Kappel, Jason W. Labonte, Michael Steven Pacella, Richard Bonneau, Philip Bradley, Roland L. Dunbrack, Rhiju Das, David Baker, Brian Kuhlman, Tanja Kortemme, and Jeffrey J. Gray. The Rosetta all-atom energy function for macromolecular modeling and design. Journal of Chemical Theory and Computation, 13(6):3031–3048, 2017.

[49] Dong Xu and Yang Zhang. Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field.
Proteins: Structure, Function, and Bioinformatics, 80(7):1715–1735, 2012.

[50] Hongyi Zhou and Jeffrey Skolnick. Ab initio protein structure prediction using chunk-TASSER. Biophysical Journal, 93(5):1510–1518, 2007.

[51] S Ołdziej, C Czaplewski, A Liwo, M Chinchio, M Nanias, J A Vila, M Khalili, Y A Arnautova, A Jagielska, M Makowski, H D Schafroth, R Kaźmierkiewicz, D R Ripoll, J Pillardy, J A Saunders, Y K Kang, K D Gibson, and H A Scheraga. Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proceedings of the National Academy of Sciences of the United States of America, 102(21):7547–7552, 2005.

[52] Axel W. Fischer, Sten Heinze, Daniel K. Putnam, Bian Li, James C. Pino, Yan Xia, Carlos F. Lopez, and Jens Meiler. CASP11 - An evaluation of a modular BCL::Fold-based protein structure prediction pipeline. PLoS ONE, 11(4), 2016.

[53] Kevin J. Maurice. SSThread: Template-free protein structure prediction by threading pairs of contacting secondary structures followed by assembly of overlapping pairs. Journal of Computational Chemistry, 35(8):644–656, 2014.

[54] Brinda Vallat, Carlos Madrid-Aliste, and Andras Fiser. Modularity of Protein Folds as a Tool for Template-Free Modeling of Structures.
PLoS Computational Biology, 11(8):1–16, 2015.

[55] Daisuke Kihara, Hui Lu, Andrzej Kolinski, and Jeffrey Skolnick. TOUCHSTONE: an ab initio protein structure prediction method that uses threading-based tertiary restraints. Proceedings of the National Academy of Sciences of the United States of America, 98(18):10125–10130, 2001.

[56] Yang Zhang, Andrzej Kolinski, and Jeffrey Skolnick. TOUCHSTONE II: a new approach to ab initio protein structure prediction. Biophysical Journal, 85(2):1145–64, 2003.

[57] Mirco Michel, Sikander Hayat, Marcin J. Skwark, Chris Sander, Debora S. Marks, and Arne Elofsson. PconsFold: Improved contact predictions improve protein models. Bioinformatics, 30(17):482–488, 2014.

[58] David Simoncini, Francois Berenger, Rojan Shrestha, and Kam Y J Zhang. A probabilistic fragment-based protein structure prediction algorithm. PLoS ONE, 7(7):1–11, 2012.

[59] David Simoncini and Kam Y J Zhang. Efficient Sampling in Fragment-Based Protein Structure Prediction Using an Estimation of Distribution Algorithm. PLoS ONE, 8(7):1–10, 2013.

[60] John L. Klepeis and Christodoulos A. Floudas. ASTRO-FOLD: a combinatorial and global optimization framework for Ab initio prediction of three-dimensional structures of proteins from the amino acid sequence. Biophysical Journal, 85(4):2119–2146, 2003.

[61] Ashwin Subramani, Yinan Wei, and Christodoulos A. Floudas. ASTRO-FOLD 2.0: An enhanced framework for protein structure prediction. AIChE Journal, 58(5):1619–1637, 2012.

[62] Joe G. Greener, Shaun M. Kandathil, and David T. Jones. Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints.
Nature Communications, 10(1):1–13, 2019.

[63] Adam Liwo, Jooyoung Lee, Daniel R Ripoll, Jaroslaw Pillardy, and Harold A Scheraga. Protein structure prediction by global optimization of a potential energy function. Proceedings of the National Academy of Sciences of the United States of America, 96(10):5482–5485, 1999.

[64] Corey Hardin, Taras V. Pogorelov, and Zaida Luthey-Schulten. Ab initio protein structure prediction. Current Opinion in Structural Biology, 12(2):176–181, 2002.

[65] Alberto Perez, Joseph A Morrone, Emiliano Brini, Justin L Maccallum, and Ken A Dill. Blind protein structure prediction using accelerated free-energy simulations. Science Advances, 2:e1601274, 2016.

[66] Alpan Raval, Stefano Piana, Michael P. Eastwood, and David E. Shaw. Assessment of the utility of contact-based restraints in accelerating the prediction of protein structure using molecular dynamics simulations. Protein Science, 25(1):19–29, 2016.

[67] Richard Bonneau and David Baker. Ab initio protein structure prediction: progress and prospects. Annual Review of Biophysics and Biomolecular Structure, 30(1):173–189, 2001.

[68] Carlos Simmerling, Bentley Strockbine, and Adrian E. Roitberg. All-atom structure prediction and folding simulations of a stable protein.
Journal of the American Chemical Society, 124(38):11258–11259, 2002.

[69] Alpan Raval, Stefano Piana, Michael P. Eastwood, Ron O. Dror, and David E. Shaw. Refinement of protein structure homology models via long, all-atom molecular dynamics simulations. Proteins: Structure, Function, and Bioinformatics, 80(8):2071–2079, 2012.

[70] Alpan Raval, Stefano Piana, Michael P. Eastwood, and David E. Shaw. Assessment of the utility of contact-based restraints in accelerating the prediction of protein structure using molecular dynamics simulations. Protein Science, 25(1):19–29, 2016.

[71] Hai Nguyen, James Maier, He Huang, Victoria Perrone, and Carlos Simmerling. Folding simulations for proteins with diverse topologies are accessible in days with a physics-based force field and implicit solvent. Journal of the American Chemical Society, 136(40):13959–13962, 2014.

[72] Fan Jiang and Yun Dong Wu. Folding of fourteen small proteins with a residue-specific force field and replica-exchange molecular dynamics. Journal of the American Chemical Society, 136(27):9536–9539, 2014.

[73] Lisa Kinch, Wenlin Li, Bohdan Monastyrskyy, Andriy Kryshtafovych, and Nick Grishin. Evaluation of free modeling targets in CASP11 and ROLL. Proteins: Structure, Function, and Bioinformatics, 84(Suppl. 1):51–66, 2015.

[74] Evandro Ferrada and Francisco Melo. Effective knowledge-based potentials. Protein Science, 18(7):1469–1485, 2009.

[75] Jörg Schaarschmidt, Bohdan Monastyrskyy, Andriy Kryshtafovych, and Alexandre Bonvin. Assessment of contact predictions in CASP12: Co-evolution and deep learning coming of age.
Proteins: Structure, Function, and Bioinformatics, 86(Suppl. 1):51–66, 2017.

[76] Sheng Wang, Siqi Sun, Zhen Li, Renyu Zhang, and Jinbo Xu. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Computational Biology, 13(1):1–34, 2017.

[77] Shaun M. Kandathil, Joe G. Greener, and David T. Jones. Prediction of interresidue contacts with DeepMetaPSICOV in CASP13. Proteins: Structure, Function, and Bioinformatics, 87(12):1092–1099, 2019.

[78] Mohammed AlQuraishi. End-to-end differentiable learning of protein structure. Cell Systems, 8(4):292–301.e3, 2019.

[79] Andrew Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David Jones, David Silver, Koray Kavukcuoglu, and Demis Hassabis. Improved protein structure prediction using potentials from deep learning. Nature, 577:1–5, 2020.

[80] B. Jayaram, Priyanka Dhingra, Bharat Lakhani, and Shashank Shekhar. Bhageerath - Targeting the near impossible: Pushing the frontiers of atomic models for protein tertiary structure prediction. Journal of Chemical Sciences, 124(1):83–91, 2012.

[81] Rahul Kaushik, Ankita Singh, Debarati Dasgupta, Amita Pathak, Shashank Shekhar, and B. Jayaram. BhageerathH+: A hybrid methodology based software suite for protein tertiary structure prediction. In CASP12 Proceedings, pages 25–26, 2016.

[82] Jianyi Yang and Yang Zhang. I-TASSER server: New development for protein structure and function predictions.
Nucleic Acids Research, 43(W1):W174–W181, 2015.

[83] Dong Xu, Jian Zhang, Ambrish Roy, and Yang Zhang. Automated protein structure modeling in CASP9 by I-TASSER pipeline combined with QUARK-based ab initio folding and FG-MD-based structure refinement. Proteins: Structure, Function, and Bioinformatics, 79(Suppl. 10):147–160, 2011.

[84] Sitao Wu and Yang Zhang. LOMETS: A local meta-threading-server for protein structure prediction. Nucleic Acids Research, 35(10):3375–3382, 2007.

[85] Yang Zhang. Interplay of I-TASSER and QUARK for template-based and ab initio protein structure prediction in CASP10. Proteins: Structure, Function, and Bioinformatics, 82(Suppl. 2):175–187, 2014.

[86] George A. Khoury, Adam Liwo, Firas Khatib, Hongyi Zhou, Gaurav Chopra, Jaume Bacardit, Leandro O. Bortot, Rodrigo A. Faccioli, Xin Deng, Yi He, Pawel Krupa, Jilong Li, Magdalena A. Mozolewska, Adam K. Sieradzan, James Smadbeck, Tomasz Wirecki, Seth Cooper, Jeff Flatten, Kefan Xu, David Baker, Jianlin Cheng, Alexandre C.B. Delbem, Christodoulos A. Floudas, Chen Keasar, Michael Levitt, Zoran Popović, Harold A. Scheraga, Jeffrey Skolnick, and Silvia N. Crivelli. WeFold: A coopetition for protein structure prediction. Proteins: Structure, Function, and Bioinformatics, 82(9):1850–1868, 2014.

[87] Ralf Jauch, Hock Chuan Yeo, Prasanna R Kolatkar, and Neil D Clarke. Assessment of CASP7 structure predictions for template free targets. Proteins: Structure, Function, and Bioinformatics, 69(S8):57–67, 2007.

[88] John Moult, Krzysztof Fidelis, Andriy Kryshtafovych, Torsten Schwede, and Anna Tramontano. Critical assessment of methods of protein structure prediction (CASP) - Round XII.
Proteins: Structure, Function, and Bioinformatics, 86:7–15, 2018.

[89] Adam Zemla. LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Research, 31(13):3370–3374, 2003.

[90] John Moult, Krzysztof Fidelis, Andriy Kryshtafovych, Torsten Schwede, and Anna Tramontano. Critical assessment of methods of protein structure prediction (CASP) - progress and new directions in Round XI. Proteins: Structure, Function, and Bioinformatics, 84:4–14, 2016.

[91] Luciano A. Abriata, Giorgio E. Tamò, Bohdan Monastyrskyy, Andriy Kryshtafovych, and Matteo Dal Peraro. Assessment of hard target modeling in CASP12 reveals an emerging role of alignment-based contact prediction methods. Proteins: Structure, Function, and Bioinformatics, 86:97–112, 2018.

[92] Yang Zhang. Progress and challenges in protein structure prediction. Current Opinion in Structural Biology, 18(3):342–348, 2008.

[93] Jooyoung Lee, Peter L Freddolino, and Yang Zhang. Ab initio protein structure prediction. In From protein structure to function with bioinformatics, pages 3–35. Springer, 2017.

[94] David E. Kim, Dylan Chivian, and David Baker. Protein structure prediction and analysis using the Robetta server. Nucleic Acids Research, 32(WEB SERVER ISS.):526–531, 2004.

[95] Andriy Kryshtafovych, Krzysztof Fidelis, and John Moult. CASP10 results compared to those of previous CASP experiments.