Assessment of protein assembly prediction in CASP13
Dmytro Guzenko, Aleix Lafita, Bohdan Monastyrskyy, Andriy Kryshtafovych, Jose M. Duarte
Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA
European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Protein Structure Prediction Center, Genome and Biomedical Sciences Facilities, University of California, Davis, CA 95616, USA
Correspondence
Dmytro Guzenko, Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA
Email: [email protected]
Funding information
DG and JD were supported by the RCSB PDB, jointly funded by the National Science Foundation, the National Institute of General Medical Sciences, the National Cancer Institute, and the Department of Energy (NSF-DBI 1338415; Principal Investigator: Stephen K. Burley). AK and BM were supported by the US National Institute of General Medical Sciences (NIGMS/NIH) grant GM100482. AL was supported by the EMBL International PhD Programme.
We present the assembly category assessment in the 13th edition of the CASP community-wide experiment. For the second time, protein assemblies constitute an independent assessment category. Compared to the last edition we see a clear increase in participation, more oligomeric targets released, and a consistent, albeit modest, improvement in prediction quality. Looking at the tertiary structure predictions, we observe that ignoring the oligomeric state of the targets hinders modelling success. We also note that some contact prediction groups successfully predicted homomeric interfacial contacts, though it appears that these predictions were not used for assembly modelling. Homology modelling with sizeable human intervention appears to form the basis of the assembly prediction techniques in this round of CASP. Future developments should see more integrated approaches to modelling where multiple subunits are a natural part of the modelling process, which would benefit the structure prediction field as a whole.
KEYWORDS
CASP, protein assembly, protein interfaces, structure prediction
1 | INTRODUCTION
In their physiological environment, protein chains commonly associate with other chains or copies of themselves to form protein assemblies. This is the so-called quaternary structure, an intrinsic property of the native state of a protein, known before the first atomic structures were solved [1]. Protein function is linked to, and often determined or regulated by, the oligomeric structure [2][3][4]. As of March 2019, the average structure in the Protein Data Bank (PDB) [5] is a dimer and approximately half of the PDB is annotated as oligomeric. Estimates of the average protein oligomeric state in the cell point to an even higher, tetrameric assembly [6].
Protein oligomerization is a broad term that encompasses states with different degrees of affinity. The association between polypeptide chains in stable obligate oligomers can be regarded as an extension of protein folding and often occurs simultaneously with it [7]. At the other extreme are transient protein-protein complexes where the association is opportunistic and promiscuous, representing the functions of the proteins involved [8]. It is important to note that there is a continuum between these states, and in the context of CASP no effort has yet been made to distinguish them.
Due to intrinsic limitations of the different experimental methods used for structure determination, protein assemblies are likely underrepresented in the PDB. The three methods most commonly used are X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy and 3-dimensional electron microscopy (3DEM).
X-ray crystallography has been and remains the main source of atomic-resolution protein structures in the PDB. The majority of these are homomeric (85% of depositions in 2018), of which about half are oligomeric.
Crystallization of hetero-oligomers is more technically challenging, especially as the interaction becomes more transient [9]. Consequently, hetero-oligomeric complexes are severely underrepresented in the X-ray crystallographic output.
Historically the second-most popular method for protein structure determination, NMR spectroscopy, does not contribute significantly to our knowledge of protein oligomerization. It accounted for 3% of overall depositions to the PDB in 2018, with 90% of entries being monomers. The reasons are mostly technical: protein complexes are often large and symmetric, and both of these factors complicate NMR data analysis.
The rapidly expanding 3DEM technique is naturally suited for determination of protein complexes (95% of the EM entries) and has the most potential to boost our quaternary structure knowledge. In 2018 3DEM accounted for 10% of PDB depositions and, notably, for about a third of all deposited hetero-oligomeric complexes. Traditionally, the interpretation of the experimental maps was more challenging due to low resolution (median 4.3 Å) and less well-developed data-model fit quality metrics. However, there is plenty of room for optimism as the technique continues to develop actively and achieves ever higher resolutions (the median resolution was 3.8 Å in 2018) [10][11].
The Critical Assessment of protein Structure Prediction (CASP) experiment was established as a means to consistently evaluate the state of the protein structure computational modeling field. The experiment focuses on problems at the frontier of research and evolves together with it. New prediction categories deemed attainable are regularly introduced, and those where progress is believed to have been exhausted are discontinued [12].
Quaternary structure has a rather peculiar history within the experiment. While oligomeric protein targets were incidentally featured in CASP2 (1996), CASP7 (2006) and CASP9 (2010), the experiment was mainly focused on tertiary structure prediction.
On the other hand, the Critical Assessment of PRedicted Interactions (CAPRI), an independent experiment inspired by CASP, was established in 2001 to address the protein-protein docking problem. With such an arrangement, the assessment of quaternary structure modeling was explicitly branched into “subunits” (CASP) and “interfaces” (CAPRI).
Recognizing the growing importance of integrated quaternary structure prediction, CASP and CAPRI conducted a parallel assessment of selected oligomeric targets in 2014 (CASP11/CAPRI30). In 2016 (CASP12), a separate “Assembly” category was introduced to evaluate predictions of the complete 3-dimensional functional units on all oligomeric CASP targets. The assembly category serves to highlight the importance of considering proteins in their native solution state, with the ultimate goal of producing complete models that can shed light on the biology and function of the molecular systems under scrutiny.
By introducing new assessment categories, the CASP experiment shapes and drives the development of methods necessary to excel in them [12]. Recent breakthroughs in both domain structure [13] and contact predictions [14] suggest that higher-order complexity targets, protein assemblies, are feasible. Here we present our analysis of the CASP13 assembly predictions, compare the results to those of CASP12 and discuss the status and outlook of the field.
2 | METHODS
2.1 | Assembly targets
In CASP13, the organizers proactively gathered protein assemblies, specifically targeting heteromeric complexes. This resulted in 64% of the targets (42 out of 66) being oligomeric, a marked increase from 42% (30 targets out of 71) in CASP12 [15]. 20 targets were selected for the combined CAPRI/CASP experiment [ref CAPRI assessment].
In terms of experimental methods, the vast majority of targets came from X-ray crystallography (36 out of 42), whilst the rest were solved with the 3DEM technique. Compared to CASP12 (26 X-ray, 2 NMR and 2 3DEM) we observe a significant increase in structures solved with 3DEM, consistent with the recent developments in experimental structural biology.
Assigning the oligomeric state of targets was not always a straightforward task, specifically in the case of crystal structures, where the contacts in the crystal lattice can lead to different interpretations [16]. This step was done in collaboration with the CAPRI assessment team, with contributions from the CASP organizers. In broad terms, to assign the oligomeric state we considered the following (in order of priority):
• the experimentalists' indication, preferred if backed by experimental evidence;
• if the structure was known, EPPIC [17] and PISA [18] analysis;
• stoichiometry consensus of homologous structures in the PDB found with HHpred [19].
All CASP13 targets were examined in this way, even when assumed to be monomers by the experimentalists. After this procedure, 5 cases remained ambiguous and were assigned with low confidence (see Table S1). This illustrates that one of the challenges in assembly prediction is the definition of the ground truth [16].
The selection process resulted in a wide range of stoichiometries and symmetries (see Table S1). They included a helical symmetry (T0995) and a very large complex with A6B6C6 stoichiometry (H1021) solved by 3DEM. Out of 42 targets, 12 were heteromeric and 30 homomeric, double the proportion of heteromers that would be expected if drawn randomly from the PDB [20]. Two of the heteromeric targets presented uneven stoichiometry (H0953, with stoichiometry A3B1, and H1022, with A6B3), a rather unusual event in the PDB with only 10% occurrence among all known heteromers [20].
2.2 | Target difficulty
We have classified the targets into three difficulty levels based on the information available to the predictors prior to the experiment, similarly to the CASP12 assembly assessment [21]. The outcome of predictions (i.e., posterior difficulty) was not considered.
We define three difficulty classes with the following criteria:
• Easy: the target has templates for both the subunits and the overall assembly, findable by sequence homology detection methods.
• Medium: the target has partial templates identifiable by sequence homology detection methods. Partial can mean that the full subunit templates are known but no information to model the interface can be found, or that information on only part of the interfaces is known (e.g. a dimer template available for half of a tetrameric target).
• Difficult: the target does not have templates findable by sequence homology detection methods, for either the subunits or the assembly.
One of the targets (T0965) was classified as Medium (see Table S1), despite availability of a complete template, because the arrangement of helices at the interface differed substantially in the target structure.
2.3 | Evaluation scores
We assess the accuracy of the predicted protein-protein interfaces with the two measures introduced in the CASP12 assembly assessment: Interface Contact Similarity (ICS) and Interface Patch Similarity (IPS) [21]. In the official evaluation tables on the predictioncenter.org website, these scores are called F1 and Jaccard respectively. Evaluation of the interfaces is sufficient if the subunits are known or are relatively easy to model independently of each other. However, CASP assembly targets are not selected with this assumption in mind and in practice often require non-trivial subunit modelling. To capture the performance of the tertiary structure prediction methods in the context of quaternary structure, we have chosen to add two other scores to the pool: local Distance Difference Test (lDDT) [22] for local model quality and Global Distance Test (GDT) [23] for similarity of the global fold. These scores are not directly applicable to multi-chain models, as the order of chains in the file is not necessarily preserved with respect to their 3-dimensional arrangement. Therefore, a 'chain mapping' has to be established between the target and the prediction prior to regular scoring. We used the QS-score algorithm [24] (all targets except H1021) and QS-align [25] (H1021) for this purpose. The obtained scores were rescaled to the [0, 1] range and are referred to here as GDT/lDDT Oligomeric (or GDTo/lDDTo for brevity). In addition, we calculated these scores for the CASP12 targets and predictions to enable direct comparison of the results. Figure 1 shows score correlations for all models in CASP13, with clear blocks showing that interface (local) scores capture different information from assembly (global) scores.
Z-scores were calculated for every score per evaluation target. The first submitted model (supposedly the best out of five allowed) was used for each group. To avoid penalizing unsuccessful prediction attempts and software glitches, we followed the CASP convention of removing outliers (Z < −2), recalculating the Z-scores and flattening negative values to zero. The total group score is a simple sum of the Z-scores for all targets for which the group submitted predictions. It has been noted [26] that difficult targets with few good predictions may result in inflated Z-scores. To mitigate this effect we performed 'leave-one-out ranking', whereby each target is consecutively removed from consideration, and each group's mean total score is used for the ranking. The maximum and minimum total score values can be used to assess the significance of the differences between closely ranked groups (shown in Figure 4 as error bars).
3 | RESULTS
A total of 45 groups participated in the CASP13 assembly category. Of those, 22 groups participated only in the subset of targets selected for the joint CASP/CAPRI experiment, while 23 submitted predictions for all targets. 17 groups submitted models for more than 10 targets, compared to only 10 such groups in the CASP12 assembly category [21]. In terms of the number of models submitted there was a dramatic increase, from 1600 in CASP12 to more than 5000 in CASP13.
Clear improvements in the prediction format and methodology were introduced in this edition compared to the first assembly category experiment in CASP12. First, the stoichiometry information is now provided to the prediction servers in an automated way. Second, model files can now be multi-chain, eliminating the need for assessors to guess whether predictors are actually attempting assembly prediction or not.
3.1 | Performance
We present detailed score distributions for all targets in Figure 2, each panel corresponding to one of the 4 scores used. We used the Seok-naive_assembly method [27] as an indication of the baseline for each target. In order to qualitatively analyze the outcome of the predictions, we consider a target to be solved if there exist models for which all four scores (ICS, IPS, lDDTo, GDTo) have values greater than 0.5. It follows that 9 assembly targets out of 42 are solved in CASP13: T0961o, T0973o, H0974, T0983o, T1003o, T1004o, T1006o, T1016o, T1020o (Figure 2). However, 4 of these are also solved by the baseline method.
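The "solved" criterion above can be written as a short check. The following Python sketch is illustrative only: the function name `is_solved`, the dictionary layout and the example score values are assumptions made for this example and are not part of the CASP evaluation infrastructure. It also assumes that a single model must exceed the threshold on all four scores simultaneously, which is one reading of the criterion.

```python
from typing import Dict, List

# The four scores used in this assessment, all on a [0, 1] scale.
SCORES = ("ICS", "IPS", "lDDTo", "GDTo")


def is_solved(models: List[Dict[str, float]], threshold: float = 0.5) -> bool:
    """Return True if at least one model exceeds the threshold on all four scores.

    Assumption: the same model has to pass on every score.
    """
    return any(all(m[s] > threshold for s in SCORES) for m in models)


# Hypothetical scores for two models of one target (illustrative values only).
example_models = [
    {"ICS": 0.62, "IPS": 0.71, "lDDTo": 0.48, "GDTo": 0.66},  # fails on lDDTo
    {"ICS": 0.55, "IPS": 0.68, "lDDTo": 0.57, "GDTo": 0.61},  # passes all four
]
print(is_solved(example_models))  # True
```

Applying the same function to the baseline models of a target would indicate whether that target is also solved by the naïve method.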
T1004o is a notable improvement on the baseline, as it had two partial assembly templates (PDB IDs 5EFV and 5M9F), which most groups successfully combined. In contrast to the results of tertiary structure prediction in this round of CASP, absence of detectable assembly templates with near-complete coverage guarantees absence of good models. Using the same criteria as above, we find that 6 (easy) targets out of 30 were solved in CASP12, the same proportion as in CASP13. To evaluate the progress quantitatively, we assume that the difficulty of the assembly targets in CASP12 and CASP13 has roughly the same distribution (evidence in [ref this year’s domain prediction assessment]), and compare the relative performance of the predictors by matching score percentiles. For example, a GDTo value of 0.5 in CASP12 is at the 76th percentile of all best predictions. In CASP13, the 76th percentile corresponds to a GDTo value of 0.55, which indicates a 5% improvement (in absolute score units). Figure 3 reveals the complete picture of this analysis and shows a 5-15% improvement for all scores across the board.
Finally, the CASP13 group ranking is shown in Figure 4. The Venclovas group consistently outperformed the rest in all difficulty classes, followed by Seok and BAKER. The success of the top-performing groups appears to be in large part due to human intervention, as all participating servers are ranked similarly to the naïve strategy.
3.2 | Prediction highlights
An interesting and quite successful prediction target was T0976. The homodimer target is composed of 4 copies of a well-known domain with many templates available in the PDB (CATH superfamily 3.40.250.10, Oxidized Rhodanese domain 1 [28]). However, there were no templates with this particular dimer. Rather, a monomeric template (PDB ID: 1YT8) had a similar overall arrangement of the 4 domains, with interdomain interfaces resembling the dimeric interface in the target (see Figure 7A). Groups like D-Haven, ZouTeam and ClusPro achieved relatively good scores for the dimeric interface and for the assembly.
Target T1001, classified as difficult, was another success story for predictors. A good dimeric template exists in the PDB (PDB ID: 5LLW); however, the matching domain in 5LLW is only a small part of the full-length protein (Figure 7B) and, importantly, contains a very long insertion when compared to T1001. Indeed, HHpred is not able to find either this or a tertiary-only template (PDB ID: 3OOV) when submitting different subsets of the target sequence. Relatively good predictions were submitted by the Seok and BAKER groups.
An example of an unsuccessful multimeric prediction was H0968, classified as difficult due to the lack of assembly templates and with both monomers being FM targets. The subunits were well modelled by a few groups, presumably aided by contact prediction. However, essentially no group came close to either of the two interfaces present in the target (Figure 7C). Nevertheless, some groups could predict interface contacts for this target’s homomeric interface, as detailed in the section on contact predictions below.
3.3 | Importance of quaternary modelling
While analyzing the results, we noticed a tendency in how the quaternary structure is handled by the predictors, in particular those who did not participate in the assembly category. Most groups seemingly split the problem into two consecutive steps: 1) modelling the subunits, 2) modelling the complex. However, results from this CASP show that such a strategy is flawed. This can be appreciated very clearly in multiple targets (Figure 5), which we discuss below.
T0973, T0991 and T0998: all 3 targets have similar folds and dimeric quaternary structures. The dimeric interface is formed by the swapping of a helix folding onto the beta sheet of the other monomer, with an enormous buried surface area resulting in an intimate and very stable dimer. However, the evaluation unit for the regular prediction was the full monomer (including the swapped helix) in all 3 cases. Unsurprisingly, these targets received poor overall predictions. A good quaternary template was available for the target T0973, which resulted in some modellers achieving good scores. Notably, the best performing group in the regular category, AlphaFold, did not use templates explicitly and showed poor performance for T0973 (GDT_TS=32.62).
Target H0953 is an A3B1 multimer, composed of a trimeric part with a beta helix fold attached to a monomeric receptor recognition protein. The trimer consists of single-chain beta sheets in the N-terminal region and of interdigitated beta strands coming from each of the chains in the C-terminal region. The interface buried area is not exceptionally large, but the intertwining geometry makes it an obligate multimer. Again in this case, the evaluation unit (T0953s1-D1) was assigned to a single full-length monomer out of the trimer. This resulted in overall bad predictions in the C-terminal region for regular category models. BAKER is the only group that comes close to a reasonable prediction for the C-terminal region.
Other examples are T0981, T0989 and H0957. Without going into detail, all of these had relatively low-quality predictions due to treating the chains as completely independent folding units.
3.4 | Contact predictions for homomeric interfaces
Next, we examined whether contact predictions are in some way useful for quaternary structure modelling. Although interface contacts are not considered in the contact prediction category in CASP13 [ref Fiser 2019], homomeric interfaces are formed by contacts within a single target and should therefore be accounted for. In total, 37 CASP13 targets form homomeric interactions, which on average account for 13% of all contacts in the target, ranging from 2% to over 50% (Figure S1). To our surprise, we find that homomeric contacts are usually among the top ranked predictions from the best groups in each respective target. In the examples shown in Figure 6, good predictions exist for both the tertiary and interface contacts. The interface contacts are regarded as false positives in the current evaluation. In fact, we find that considering homomeric contacts would have changed the group ranking for contact prediction of some targets, e.g. T0968s2. In view of these results, future CASP editions should consider evaluating homomeric contacts.
Homomeric interface contacts also present a challenge for protein structure modelling from contact matrix predictions, since currently most regular predictors try to fold a single subunit. The additional interface contacts in the matrix would impose unrealistic constraints between residues in the folding protocol, similarly to false positives, which are known to negatively affect 3D reconstruction [29, 30]. Modellers would need to disentangle intra-chain from inter-chain contacts in the matrix and adapt their pipelines to fold multiple chains according to the given stoichiometry, similar to what has been done for heteromeric interface predictions [31, 32].
Among all types of homomeric interactions, isologous interfaces (as found in cyclic dimers and dihedral symmetries) present yet another challenge for protein assembly modelling from contact predictions. Due to the 2-fold symmetry, many of the contacts at the interface, especially those close to the axis of symmetry, will be between the same residues (a residue interacting with itself in another subunit) or residues very close in sequence, which are excluded by design from contact predictions. For example, this is the case for the homodimeric interface in target T0968s2.
3.5 | Data-assisted predictions and assemblies
A total of 7 assembly targets were also released as ‘data assisted’ targets (Fig. S2), a category that attempts to evaluate advances in integrative modelling methods [33]. SAXS data was collected for all 7 of the targets, whilst cross-linking data was collected for 5 of them and NMR data for 1 (H0980). The experimental details and data-assisted specific assessment are discussed in the respective papers [refs Hura 2019, Fiser 2019, Montelione 2019]. Here, as part of our assembly analysis, we looked into how the data-assisted assembly predictions compare with the regular ones, using the regular evaluation strategy. All 7 targets were selected from the difficult group, for which there is little homology information available to perform traditional modelling. SAXS data has the potential to provide valuable information about the global shape of the assemblies and thus should be particularly helpful for this category. At the same time, cross-linking and NMR data can provide information on the inter-chain interfaces, potentially helping the assembly modelling process.
Figure S3 presents the evaluation of all the targets on the 4 scores used here (see Methods). The score ranges for all of them are not significantly different from the regular predictions. Barring target X0957 (Fig. S4), no systematic improvement is detectable in this experiment. The reasons appear to be twofold. First, the difficulty of the targets may have limited the search space of the prediction methods too early in the pipeline (Fig. S5). Second, the groups with the best non-assisted predictions generally did not participate in the data-assisted category, which limits comparability of the outcomes between the categories.
Footnote: Indeed, quoting Kaspars Tars (Latvian Biomedical Research and Study Centre) who provided the experimental structure: "Monomers do not exist in a free state, so modelling a monomer structure makes no sense. (...) The hydrophobic core of the protein is in part composed of inter-monomer contacts in dimer."
4 | CONCLUSIONS AND OUTLOOK
We have presented the CASP13 assembly category assessment, the second edition of CASP with a dedicated assembly category. We have seen a significant increase in participation, indicating more interest in quaternary structure modelling, a trend that can only be beneficial to the further development of methods. In addition, the quality of the predictions consistently increased as well. We hope that the trend will continue in the next CASPs and that quaternary structure modelling becomes mainstream. Unfortunately, predictions in the regular categories are still not taking into account quaternary structure as an essential part of their modelling pipelines. We also showed that contact prediction for homomeric interfaces is already surprisingly successful, an aspect likely ignored by both predictors and assessors at the moment.
We still see room for improvement in several places. Automation is rather limited in this category. For instance, only 2 servers (SWISS-MODEL [34] and Robetta [35]) participate in the multimeric section of the fully automated CAMEO experiment [36]. The sophistication of the methods in assembly modelling is falling behind traditional tertiary modelling. Specifically, we have not seen much use of machine learning methods, which are popular in the tertiary structure and contact prediction categories. It appears that traditional homology modelling still dominates the field.
In conclusion, we would like to emphasize that quaternary modelling is intrinsic to the protein modelling problem and must be considered from the outset in the design of modelling pipelines. Correspondingly, a CASP evaluation unit should match the functional form of a protein structure, be it a monomer or an assembly, with consistent metrics throughout.
ACKNOWLEDGEMENTS
We would like to thank the CASP management committee for organization and support. We are grateful to Chaok Seok for contributing the naïve prediction method. We thank Susan Tsutakawa, Gregory Hura, Gaetano Montelione and Andras Fiser for their help in interpreting data-assisted predictions. We thank Spencer Bliven for discussions of the CASP12 assembly prediction results.
REFERENCES
[1] Svedberg T, Nichols J. The application of the oil turbine type of ultracentrifuge to the study of the stability region of carbon monoxide-hemoglobin. Journal of the American Chemical Society 1927;49(11):2920–2934.
[2] Goodsell DS, Olson AJ. Structural symmetry and protein function. Annual review of biophysics and biomolecular structure 2000;29(1):105–153.
[3] Selwood T, Jaffe EK. Dynamic dissociating homo-oligomers and the control of protein function. Archives of Biochemistry and Biophysics 2012;519(2):131–143.
[4] Hashimoto K, Panchenko AR. Mechanisms of protein oligomerization, the critical role of insertions and deletions in maintaining different oligomeric states. Proceedings of the National Academy of Sciences 2010;107(47):20352–20357.
[5] Burley SK, Berman HM, Bhikadiya C, Bi C, Chen L, Di Costanzo L, et al. RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic acids research 2018;47(D1):D464–D474.
[6] Goodsell DS. Inside a living cell. Trends in biochemical sciences 1991;16:203–206.
[7] Tsai CJ, Xu D, Nussinov R. Protein folding via binding and vice versa. Folding and Design 1998;3(4):R71–R80.
[8] Nooren IM, Thornton JM. Diversity of protein–protein interactions. The EMBO journal 2003;22(14):3486–3492.
[9] Radaev S, Li S, Sun PD. A survey of protein–protein complex crystallizations. Acta Crystallographica Section D: Biological Crystallography 2006;62(6):605–612.
[10] Kühlbrandt W. The resolution revolution. Science 2014;343(6178):1443–1444.
[11] Ognjenović J, Grisshammer R, Subramaniam S. Frontiers in Cryo Electron Microscopy of Complex Macromolecular Assemblies. Annual review of biomedical engineering 2019;21.
[12] Kryshtafovych A, Fidelis K, Moult J. CASP: A driving force in protein structure modeling. Introduction to Protein Structure Prediction: Methods and Algorithms 2010;p. 15–32.
[13] Abriata LA, Tamò GE, Monastyrskyy B, Kryshtafovych A, Dal Peraro M. Assessment of hard target modeling in CASP12 reveals an emerging role of alignment-based contact prediction methods. Proteins: Structure, Function, and Bioinformatics 2018;86:97–112.
[14] Schaarschmidt J, Monastyrskyy B, Kryshtafovych A, Bonvin AM. Assessment of contact predictions in CASP12: Co-evolution and deep learning coming of age. Proteins: Structure, Function, and Bioinformatics 2018;86:51–66.
[15] Moult J, Fidelis K, Kryshtafovych A, Schwede T, Tramontano A. Critical assessment of methods of protein structure prediction (CASP)—Round XII. Proteins: Structure, Function, and Bioinformatics 2018;86:7–15.
[16] Capitani G, Duarte JM, Baskaran K, Bliven S, Somody JC. Understanding the fabric of protein crystals: computational classification of biological interfaces and crystal contacts. Bioinformatics 2015;32(4):481–489.
[17] Bliven S, Lafita A, Parker A, Capitani G, Duarte JM. Automated evaluation of quaternary structures from protein crystals. PLoS computational biology 2018;14(4):e1006104.
[18] Krissinel E. Stock-based detection of protein oligomeric states in jsPISA. Nucleic acids research 2015;43(W1):W314–W319.
[19] Zimmermann L, Stephens A, Nam SZ, Rau D, Kübler J, Lozajic M, et al. A completely reimplemented MPI bioinformatics toolkit with a new HHpred server at its core. Journal of molecular biology 2018;430(15):2237–2243.
[20] Xu Q, Dunbrack Jr RL. Principles and characteristics of biological assemblies in experimentally determined protein structures. Current opinion in structural biology 2019;55:34–49.
[21] Lafita A, Bliven S, Kryshtafovych A, Bertoni M, Monastyrskyy B, Duarte JM, et al. Assessment of protein assembly prediction in CASP12. Proteins: Structure, Function, and Bioinformatics 2018;86:247–256.
[22] Mariani V, Biasini M, Barbato A, Schwede T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 2013;29(21):2722–2728.
[23] Zemla A, Venclovas Č, Moult J, Fidelis K. Processing and analysis of CASP3 protein structure predictions. Proteins: Structure, Function, and Bioinformatics 1999;37(S3):22–29.
[24] Bertoni M, Kiefer F, Biasini M, Bordoli L, Schwede T. Modeling protein quaternary structure of homo- and hetero-oligomers beyond binary interactions by homology. Scientific reports 2017;7(1):10480.
[25] Lafita A, Bliven S, Prlić A, Guzenko D, Rose PW, Bradley A, et al. BioJava 5: A community driven open-source bioinformatics library. PLoS computational biology 2019;15(2):e1006791.
[26] Cozzetto D, Kryshtafovych A, Fidelis K, Moult J, Rost B, Tramontano A. Evaluation of template-based models in CASP8 with standard measures. Proteins: Structure, Function, and Bioinformatics 2009;77(S9):18–28.
[27] Lensink MF, Velankar S, Baek M, Heo L, Seok C, Wodak SJ. The challenge of modeling protein assemblies: the CASP12-CAPRI experiment. Proteins: Structure, Function, and Bioinformatics 2018;86:257–273.
[28] Dawson NL, Lewis TE, Das S, Lees JG, Lee D, Ashford P, et al. CATH: an expanded resource to predict protein function through structure and sequence. Nucleic acids research 2016;45(D1):D289–D295.
[29] Duarte JM, Sathyapriya R, Stehr H, Filippis I, Lappe M. Optimal contact definition for reconstruction of contact maps. BMC bioinformatics 2010;11(1):283.
[30] Sathyapriya R, Duarte JM, Stehr H, Filippis I, Lappe M. Defining an essence of structure determining residue contacts in proteins. PLoS computational biology 2009;5(12):e1000584.
[31] Hopf TA, Schärfe CP, Rodrigues JP, Green AG, Kohlbacher O, Sander C, et al. Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife 2014;3:e03430.
[32] Ovchinnikov S, Kamisetty H, Baker D. Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information. Elife 2014;3:e02030.
[33] Ogorzalek TL, Hura GL, Belsom A, Burnett KH, Kryshtafovych A, Tainer JA, et al. Small angle X-ray scattering and cross-linking for data assisted protein structure prediction in CASP 12 with prospects for improved accuracy. Proteins: Structure, Function, and Bioinformatics 2018;86:202–214.
[34] Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, et al. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic acids research 2018;46(W1):W296–W303.
[35] Kim DE, Chivian D, Baker D. Protein structure prediction and analysis using the Robetta server. Nucleic acids research 2004;32(suppl_2):W526–W531.
[36] Haas J, Barbato A, Behringer D, Studer G, Roth S, Bertoni M, et al. Continuous Automated Model EvaluatiOn (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins: Structure, Function, and Bioinformatics 2018;86:387–398.
FIGURE 1
Score correlations. A heat map with correlations among all relevant scores used in the predictioncenter.org web site. The “local” block of scores captures interface features, the “global” block captures features of the whole assembly.
FIGURE 2
Per-target score distributions and comparison to the baseline (naïve) values, if present. The targets for which the median prediction is worse than the baseline in each score are labeled in red.
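The red labelling rule in the Figure 2 caption can be illustrated with a minimal sketch. The function name, the data layout and the example numbers below are hypothetical and are not taken from the CASP13 data; the sketch also assumes that the direction of "worse" is supplied per score, since some scores are better when higher and others when lower.

```python
from statistics import median


def targets_below_baseline(predictions, baseline, higher_is_better=True):
    """Return the targets whose median predicted score is worse than the baseline.

    predictions: dict mapping target id -> list of scores (one per submitted model)
    baseline:    dict mapping target id -> baseline (naive) score; targets
                 without a baseline are simply absent from this dict
    """
    flagged = []
    for target, scores in predictions.items():
        if target not in baseline or not scores:
            continue  # no baseline available, or nothing submitted
        med = median(scores)
        if higher_is_better:
            worse = med < baseline[target]
        else:
            worse = med > baseline[target]
        if worse:
            flagged.append(target)
    return sorted(flagged)


# Hypothetical scores, for illustration only (not CASP13 data).
example_predictions = {
    "target_a": [0.10, 0.35, 0.40, 0.80],
    "target_b": [0.55, 0.60, 0.70],
    "target_c": [0.20, 0.25],  # no baseline for this target
}
example_baseline = {"target_a": 0.50, "target_b": 0.50}

print(targets_below_baseline(example_predictions, example_baseline))
```

In this toy input only `target_a` is flagged, because its median falls below the baseline; `target_c` is skipped because no baseline is present for it.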
FIGURE 3
Performance comparison between CASP13 and CASP12. The top 5 predictions per target (at most 1 per group) were selected for each score from the CASP12 and CASP13 submissions. The scores were matched by percentiles and plotted as CASP12 (x axis) vs. CASP13 (y axis). Values above the diagonal correspond to improvement in CASP13.
FIGURE 4
Group rankings in the assembly category. The groups are sorted by the sum of Z-scores for all difficulty classes. The error bars are obtained by iteratively excluding every target from each difficulty class and recalculating the cumulative Z-scores. The server groups are labeled in violet.
FIGURE 5
Importance of quaternary modelling. A) Targets T0973, T0991 and T0998 with very large dimeric interfaces and the main hydrophobic core split at the interface. The best regular prediction GDT_TS scores for their corresponding monomeric evaluation units were: 82.62 for T0973-D1, 37.16 for T0991-D1 and 35.54 for T0998-D1. B) Trimeric part of target H0953 showing the intertwined beta-strand geometry in the C-terminal half of the fold.
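The leave-one-target-out error bars described in the Figure 4 caption can be sketched as follows. This is an illustrative reading, not the assessors' implementation: it assumes a single score and a single difficulty class, computes Z-scores per target across groups with the population standard deviation, and reports the minimum and maximum cumulative Z-score over all leave-one-out runs. All function names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(scores_by_group):
    """Z-scores of one target's scores across groups (population std. dev.)."""
    values = list(scores_by_group.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {group: 0.0 for group in scores_by_group}
    return {group: (s - mu) / sigma for group, s in scores_by_group.items()}


def cumulative_z(per_target_scores, skip=None):
    """Sum each group's Z-scores over all targets, optionally skipping one target.

    A group that did not submit for a target contributes nothing for it
    (an assumption of this sketch).
    """
    totals = {}
    for target, scores_by_group in per_target_scores.items():
        if target == skip:
            continue
        for group, z in z_scores(scores_by_group).items():
            totals[group] = totals.get(group, 0.0) + z
    return totals


def leave_one_out_ranges(per_target_scores):
    """Per group, the (min, max) cumulative Z-score over all leave-one-target-out runs."""
    ranges = {}
    for left_out in per_target_scores:
        run = cumulative_z(per_target_scores, skip=left_out)
        for group, total in run.items():
            lo, hi = ranges.get(group, (total, total))
            ranges[group] = (min(lo, total), max(hi, total))
    return ranges


# Hypothetical scores for three targets and three groups (not CASP13 data).
example_scores = {
    "t1": {"g1": 0.9, "g2": 0.5, "g3": 0.1},
    "t2": {"g1": 0.6, "g2": 0.6, "g3": 0.3},
    "t3": {"g1": 0.2, "g2": 0.8, "g3": 0.5},
}

totals = cumulative_z(example_scores)
ranking = sorted(totals, key=totals.get, reverse=True)
ranges = leave_one_out_ranges(example_scores)
print(ranking)
for group in ranking:
    print(group, round(totals[group], 3), tuple(round(v, 3) for v in ranges[group]))
```

Wide leave-one-out ranges indicate that a group's position depends heavily on one or two targets, which is what the error bars in the figure are meant to expose.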
FIGURE 6
Homomeric interface contacts (upper-right of the contact matrix) and best interface contact predictions (lower-left) for three CASP13 FM targets: A) interdigitated trimer T0953s1 and prediction by group RR106; B) dimeric interface (isologous) of T0968s2 and prediction by group RR036; and C) hexameric subunit T1022s1 and prediction by group RR164.
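The split layout used in Figure 6, with observed interface contacts above the diagonal and predicted contacts below it, can be sketched as a small data structure. The function name, the cell labels and the toy contact lists are hypothetical. The sketch assumes that contacts are given as pairs of 0-based residue indices and that a pair and its mirror image are folded onto the same cell, which is reasonable only because both subunits of a homomer share one sequence.

```python
def combined_contact_matrix(n_residues, native_contacts, predicted_contacts):
    """Build an n x n grid of labels.

    'N' marks a native interface contact (upper-right triangle, diagonal included),
    'P' marks a predicted interface contact (lower-left triangle),
    '.' marks an empty cell.
    """
    matrix = [["." for _ in range(n_residues)] for _ in range(n_residues)]
    for i, j in native_contacts:
        row, col = min(i, j), max(i, j)  # fold onto row <= col
        matrix[row][col] = "N"
    for i, j in predicted_contacts:
        if i == j:
            continue  # the diagonal is reserved for native contacts in this sketch
        row, col = max(i, j), min(i, j)  # fold onto row > col
        matrix[row][col] = "P"
    return matrix


# Toy example with 5 residues (not taken from any CASP13 target).
example_native = [(0, 3), (1, 4), (2, 2)]
example_predicted = [(3, 0), (1, 2)]
example_matrix = combined_contact_matrix(5, example_native, example_predicted)

for row in example_matrix:
    print(" ".join(row))
```

A correct prediction shows up as a 'P' cell that mirrors an 'N' cell across the diagonal, as for the pair (0, 3) in the toy input.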
FIGURE 7
Prediction highlights. A) The homodimeric target T0976 and the monomeric template that matches the global arrangement of the 4 domains. B) Homodimeric target T1001 and the template PDB entry 5LLW, a much larger protein; the highlighted central domain has a very close tertiary structure and a similar interface region. C) The A2B2 heterotetramer T0968 with a main homomeric interface (cyan and yellow chains) via beta pairing, composing a large beta sandwich. The other subunit attaches on either side of the beta sheets.
FIGURE S1
Percentage of homomeric interface contacts in CASP13 targets.
FIGURE S2
Data-assisted targets.
FIGURE S3
Score distributions for all predictions of data-assisted and the corresponding non-assisted targets. Two types of crosslinks and two types of scattering datasets are merged for the purpose of this figure.
FIGURE S4
Target X0957 shows improvement across all scores considered due to several fortunate intermolecular crosslinks. Crosslinked residues in the target and the assisted prediction are highlighted in red and connected with a dashed yellow line. Crosslinks between missing residues are not shown. The best regular prediction (bottom) has a significantly lower GDT.
FIGURE S5