Customised fragment libraries for ab initio protein structure prediction using a structural alphabet

Abstract

Motivation: Computational protein structure prediction has become a major focus of the structural community over the past few decades, with most effort going into the development of Template-Free Modelling (TFM), or ab initio modelling, protocols. Fragment-based assembly (FBA) falls under this category and is by far the most popular approach to solving the spatial arrangement of proteins. FBA approaches usually rely on sequence-based profile comparison to generate fragments from a representative structural database. Here we report the use of Protein Blocks (PBs), a structural alphabet (SA), to perform such sequence comparison and to build customised fragment libraries for TFM. Results: We demonstrate that predicted PB sequences for a query protein can be used to search for high-quality fragments that together cover more than 90% of the query. The fragments generated have a minimum length of 11 residues, and fragments covering more than 30% of the query length were often obtained. Our work shows that PBs are a good way to extract structurally similar fragments from a database of representative non-homologous structures, including proteins that contain less ordered regions.


Surbhi Dhingra, Ramanathan Sowdhamini, Yves-Henri Sanejouand, Frédéric Cadet, and Bernard Offmann ∗

Université de Nantes, CNRS, UFIP, UMR6286, F-44000 Nantes, France
Computational Approaches to Protein Science (CAPS), National Centre for Biological Sciences (NCBS), Tata Institute for Fundamental Research (TIFR), Bangalore 560-065, India
University of Paris, BIGR - Biologie Intégrée du Globule Rouge, Inserm, UMR_S1134, Paris F-75015, France
Laboratory of Excellence GR-Ex, Boulevard du Montparnasse, Paris F-75015, France
DSIMB, UMR_S1134, BIGR, Inserm, Faculty of Sciences and Technology, University of La Reunion, Saint-Denis F-97715, France
PEACCEL, Protein Engineering ACCELerator, 6 Square Albin Cachot, box 42, 75013 Paris, France


Availability:

Data and scripts are available for download at https://frama.link/z2wh7rKC

Contact: bernard.off[email protected] (∗ corresponding author)

Supplementary information:

Supplementary data are available as an annexure to this preprint online.

arXiv [q-bio.QM], May

A shift has been observed in the field of protein structure prediction (PSP) over the past few decades, with research increasingly directed towards computational approaches, mostly Free Modelling (FM) or Template-Free Modelling (TFM) protocols [1,2]. These approaches are validated through the biennial CASP competition [3], which aims to resolve the sequence-structure-function paradigm. The core methodologies around which free modelling protocols have been developed are fragment-based assembly methods [4–6], threading approaches [7], physics-based methods [8,9] and, quite recently, machine learning approaches [10]. We have recently reviewed the progress made in the field [11].

Of these, fragment-based approaches (FBA) have been explored the most for the construction of ab initio protein models. Such methodologies rely on assembling structural fragments, covering short or long stretches of amino acid sequence, that are taken from a representative structural database and share sequence similarity with a query protein. The fundamental idea behind FBA is that local protein sequence patterns follow a general trend of structural features [12]. This led to the stipulation that the local conformations for a given protein sequence can be recovered from fragments sharing local sequence similarity with local regions in existing protein structures [13]. FBA proceeds by collecting such local conformations in a fragment library and assembling them to construct potential structural models, mostly by using knowledge-based scoring functions. In short, a number of fragments are generated for each position within the target protein sequence and are then reduced to the best representatives, based on different scoring criteria. The fragment lengths vary with the algorithm in question, usually lying within the range of 20 residues [14].
Nonetheless, accurate models have been constructed using fragments as short as 3 residues [15,16].

In general, fragment-based approaches are beneficial in restricting the dimension of the conformational search space by limiting the number of fragments used per position. This is also a major bottleneck for such algorithms, as they inherently fall behind in exploring alternative conformations for the same sequence [17]. Recently, efforts have been made to overcome this drawback by redesigning fragment search heuristics [13].

Algorithms have also been developed for fragment mining using non-traditional methods. One such approach is SA-Frag [18], which uses a type of structural alphabet to construct fragment libraries. It builds local profile comparisons between target and template structures based on predicted SA sequences. Though this study placed SAs in the conversation on PSP, the approach is still not on par with the available sequence-based counterparts [19]. Yet this algorithm has left room to explore further how different SAs can be exploited for structure prediction.

In the current work, we have evaluated the potential of Protein Blocks (PBs) [20], a structural alphabet, in constructing effective fragment libraries for ab initio protein structure modelling. There are several types of structural alphabets (SAs) [21–23] available that focus on clustering protein backbone conformations into a limited set of representative local stretches. Protein Blocks (PBs) emerged with the goal of obtaining a good local structure approximation of protein 3D structures when converted into 1D PB sequences and a good prediction of local structures directly from the amino acid sequence [20,24]. This structural alphabet was obtained by unsupervised clustering using a Self-Organizing Map (SOM) [25,26]. Protein Blocks constitute a set of 16 structural prototypes lettered a to p.
Each PB is 5 residues in length and is described by 2(M-1) dihedral angles, where M (here M = 5) is the number of residues constituting the prototype. Using PBs, any protein structure n residues long can be converted into a PB sequence of length n-4. PBs have been shown to approximate every local conformation of protein structures to within 0.42Å on average [27].

Our work makes use of two available applications of PBs. The first is the translation of 3D protein structure coordinates into a readable 1D PB sequence, a procedure commonly named PB assignment (PBA). The second is the prediction of the probable backbone conformation of a protein sequence, in the absence of secondary structure and sequence alignment profiles, using a knowledge-based scoring function. This algorithm is available as a web-based tool called PB-kPRED [28], and the methodology is loosely termed PB prediction (PBP). We have used this ability of PBs to approximate the protein backbone to mine fragments for ab initio modelling of protein structures. The fragments are retained in the form of PB sequence hits from a non-redundant, non-homologous protein database. The quality of the fragments is assessed for PB sequences coming from both PB assignment (PBA) and PB prediction (PBP).

The significance of this work centres on the idea of an extensive conformational space search using structural prototypes as an initial step. PBs are shown to be beneficial in such cases by mining the template database for local backbone conformations, relying primarily on local conformation.

≤ ≤ ≥ 40 residues. These accounted for a total of 23,989 unique protein chains. The hits were further clustered at 30% sequence identity using the KClust algorithm [30], resulting in a total of 7632 protein chains. Additionally, any protein with chain breaks was removed from the clustered sequence set, finally resulting in 5391 unique chains.

PB sequences were assigned for each of these 5391 templates in the final database using an in-house script, which also generates the equivalent DSSP output files and torsion angle files. Secondary structure assignments for each protein chain were collected using Pdb-tools [31]. All these files, along with the protein sequence corresponding to each template, constituted our curated database.
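As a rough illustration of the PB assignment step applied to each template, the sketch below slides a five-residue window along a chain and labels each window with the closest prototype by angular deviation. The function names, the ordering of the eight dihedral angles and the two toy prototypes are assumptions made for illustration only. This is not the in-house script, and the prototype values are placeholders rather than the published PB definitions.

```python
import math


def angle_diff(a, b):
    """Smallest signed difference between two angles, in degrees."""
    return (a - b + 180.0) % 360.0 - 180.0


def pb_windows(phi, psi):
    """Yield the 2(M-1) = 8 dihedrals of each overlapping pentapeptide (M = 5).

    phi and psi hold one angle per residue, in degrees. A chain of n residues
    gives n - 4 windows, one per central residue i. The angle ordering used
    here is an assumption of this sketch.
    """
    n = len(phi)
    for i in range(2, n - 2):
        yield [psi[i - 2], phi[i - 1], psi[i - 1], phi[i],
               psi[i], phi[i + 1], psi[i + 1], phi[i + 2]]


def rmsda(v, w):
    """Root mean square deviation between two vectors of angles."""
    return math.sqrt(sum(angle_diff(a, b) ** 2 for a, b in zip(v, w)) / len(v))


def assign_pbs(phi, psi, prototypes):
    """Return one letter per window: the prototype with the lowest deviation."""
    return "".join(
        min(prototypes, key=lambda letter: rmsda(window, prototypes[letter]))
        for window in pb_windows(phi, psi)
    )


# Placeholder prototypes (NOT the published values): an idealised helix-like
# block labelled "m" and an idealised strand-like block labelled "d".
TOY_PROTOTYPES = {
    "m": [-45.0, -60.0, -45.0, -60.0, -45.0, -60.0, -45.0, -60.0],
    "d": [135.0, -120.0, 135.0, -120.0, 135.0, -120.0, 135.0, -120.0],
}

helix_phi, helix_psi = [-60.0] * 10, [-45.0] * 10
print(assign_pbs(helix_phi, helix_psi, TOY_PROTOTYPES))
```

With n = 10 residues the sketch returns n - 4 = 6 letters, which matches the PB sequence length stated above. A real assignment would pass all 16 prototypes in place of the two placeholders.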

The query dataset used in this work is derived from a previously published study that focused on building fragment libraries [14]. This dataset comprises 43 query protein structures ranging from 59 to 508 residues in length and is provided in Table 1. Each of them is a monomer and has been further categorized into the four main SCOP classes, specifically all alpha, all beta, alpha plus beta (α+β) and alpha and beta (α/β).

Table 1: The query dataset and its characteristics.

The table summarises the lengths in terms of number of residues and the estimated accuracies of PB predictions as observed for each protein from the query dataset. A PB-kPRED prediction score (last column) of 1 or above has been determined as an indicator of reliable PB sequence prediction [28]. All the proteins in the dataset, except for one, lay above this score cut-off value.

| PDB id | SCOP Class | Length (AA) | Length (PDB) | Accuracy (%) | kPred score |
|--------|------------|-------------|--------------|--------------|-------------|
| 1AIL | all-α | 73 | 70 | 63.6 | 1.5 |
| 1RRO | all-α | 108 | 108 | 73.1 | 1.99 |
| 1U61 | all-α | 138 | 127 | 65.2 | 1.43 |
| 1SL8 | all-α | 191 | 181 | 77.5 | 2.23 |
| 1QUU | all-α | 250 | 248 | 63.8 | 1.51 |
| 1T5J | all-α | 313 | 301 | 65.8 | 1.61 |
| 1PO5 | all-α | 476 | 465 | Too low | Too low |
| 1MHN | all-β | 59 | 59 | 69.2 | 1.79 |
| 1TEN | all-β | 90 | 90 | 72.1 | 1.94 |
| 2G1L | all-β | 104 | 103 | 70.5 | 1.86 |
| 1IFR | all-β | 121 | 113 | 67.1 | 1.68 |
| 1BFG | all-β | 146 | 126 | 83.3 | 2.53 |
| 2FR2 | all-β | 172 | 161 | 70.4 | 1.85 |
| 1EE6 | all-β | 197 | 197 | 79.5 | 2.33 |
| 1UAI | all-β | 224 | 223 | 71.1 | 1.9 |
| 2C9A | all-β | 259 | 259 | 68.9 | 1.77 |
| 1O4Y | all-β | 288 | 270 | 66.1 | 1.63 |
| 1HG8 | all-β | 349 | 349 | 81.9 | 2.46 |
| 1NKG | all-β | 508 | 508 | 82.7 | 2.5 |
| 1VJW | α+β | 60 | 59 | 59.4 | 1.28 |
| 1MWP | α+β | 96 | 96 | 89.7 | 2.86 |
| 1GNU | α+β | 117 | 117 | 80.6 | 2.39 |
| 1R9H | α+β | 135 | 118 | 85.8 | 2.66 |
| 206L | α+β | 164 | 162 | 81.9 | 2.45 |
| 2FS3 | α+β | 282 | 280 | 82.1 | 2.47 |
| 1DZF | α+β | 215 | 211 | 85 | 2.62 |
| 1DXJ | α+β | 242 | 242 | 74.2 | 2.1 |
| 1MAT | α+β | 264 | 263 | 83.9 | 2.56 |
| 1JKS | α+β | 294 | 280 | 80.8 | 2.36 |
| 1MC4 | α+β | 370 | 369 | 77.3 | 2.22 |
| 2FKF | α+β | 462 | 455 | 80.3 | 2.37 |
| 1H75 | α/β | 81 | 76 | 74.3 | 2.06 |
| 1IU9 | α/β | 111 | 111 | 68.8 | 1.77 |
| 1E6K | α/β | 130 | 130 | 77.7 | 2.24 |
| 1P90 | α/β | 145 | 123 | 64.8 | 1.56 |
| 1FTG | α/β | 168 | 168 | 79.8 | 2.34 |
| 1QCY | α/β | 193 | 193 | 75.9 | 2.14 |
| 2A14 | α/β | 263 | 257 | 63.4 | 1.48 |
| 1IZZ | α/β | 283 | 276 | 77.6 | 2.23 |
| 1QUE | α/β | 303 | 303 | 82.8 | 2.51 |
| 1KRM | α/β | 356 | 349 | 81.2 | 2.42 |
| 3BSG | α/β | 414 | 404 | 77.2 | 2.21 |
| 1PGN | α/β | 482 | 473 | 72.7 | 1.98 |

• PB-Assignment (PBA) procedure. In this case the query PB sequence is retrieved directly from the PDB coordinate files using an in-house PB assignment script, following the basic principle of PB generation, i.e. conversion of 3D information into a 1D sequence as described before [20]. In short, the dihedral angles of all constitutive overlapping pentapeptides from a query are used to classify each of them into one of the 16 PB classes a to p. Secondary structure representations for the target protein were here obtained using the DSSP assignment protocol [32].

• PB-Prediction (PBP) procedure. Here the potential query PB sequences were predicted manually using the PB-kPRED web tool [28]. This step was done to account for the loss of information incurred through the probabilistic estimation of PBs per position for a given sequence. PB-kPRED features a prediction quality criterion in the form of a prediction score and a standardized accuracy percentile. Generally, a score of < 1 indicates an unreliable prediction. The prediction accuracies and kPRED scores for the query dataset are shown in Table 1. Secondary structure representations for the target protein were here estimated by the secondary structure prediction tool Psipred [33].

A procedure was set up to mine non-homologous fragments from the curated template database. To avoid any bias in the benchmarking of our method, the primary step of the procedure involved looking for sequence homologs in the template database. Any template protein sharing more than 30% sequence identity with the query protein sequence was considered for removal: pairwise sequence identity was calculated for each query against the entire curated template database using the Needleman and Wunsch global alignment algorithm [34]. Any hit sharing more than the defined sequence identity cut-off with the target sequence was removed from the template database for that run. A local PB alignment tool, PB-Align [35], was used to generate alignment hits with a minimum length of 7 PBs, corresponding to sequence stretches 11 amino acids long. PB alignment hits in which the query PB sequence was identical to the template fragment were collected as viable fragments, since they should correspond to identical backbone conformations. Further fragment quality assessment was performed on all identical hits using the criteria discussed below. The overall process of fragment generation is summarised in Figure 1.

Figure 1: Fragment Generation Procedure. Shown is a schematic representation of the general flow of fragment generation carried out using protein blocks as the primary source. Local PB alignment between query and template PB sequences yields fragments with a minimum length of 11 amino acid residues. The alignment hits are then analysed based on several criteria to assess the quality of the generated fragments.

The first main criterion was precision, defined as the number of fragments lying under an rmsd of 2Å divided by the total number of fragments generated for the query protein. The second main criterion examined was coverage, defined as the number of positions in the query sequence for which at least one fragment was obtained by the fragment mining procedure. Other assessment criteria that were tested include amino acid sequence identity and similarity and secondary structure identity between each fragment hit and the target sequence.
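The mining and coverage steps described above can be sketched in a few lines. This is a simplified stand-in that looks for exact, ungapped identical stretches between two PB strings. It is not the PB-Align tool, and the function names and toy PB strings are hypothetical.

```python
def identical_stretches(query_pb, template_pb, min_len=7):
    """Find maximal identical ungapped runs between two PB sequences.

    Returns (query_start, template_start, length) tuples with 0-based PB
    indices, keeping only runs of at least min_len PBs.
    """
    hits = []
    nq, nt = len(query_pb), len(template_pb)
    # Walk every diagonal of the comparison matrix (offset = q - t).
    for offset in range(-(nt - 1), nq):
        q = max(offset, 0)
        t = q - offset
        run = 0
        while q < nq and t < nt:
            if query_pb[q] == template_pb[t]:
                run += 1
            else:
                if run >= min_len:
                    hits.append((q - run, t - run, run))
                run = 0
            q += 1
            t += 1
        if run >= min_len:  # run reaching the end of the diagonal
            hits.append((q - run, t - run, run))
    return hits


def coverage(hits, n_residues):
    """Return (percent of residues covered, per-residue fragment counts).

    A run of L PBs starting at PB index k spans residues k .. k + L + 3,
    because each PB describes a five-residue window (L PBs -> L + 4 residues).
    """
    counts = [0] * n_residues
    for q_start, _t_start, length in hits:
        for residue in range(q_start, q_start + length + 4):
            counts[residue] += 1
    covered = sum(1 for c in counts if c > 0)
    return 100.0 * covered / n_residues, counts


query = "abcdefghijklm"      # toy PB sequence: 13 PBs, i.e. 17 residues
template = "nncdefghijkoo"   # shares the 9-PB stretch "cdefghijk"
hits = identical_stretches(query, template)
percent, per_position = coverage(hits, len(query) + 4)
print(hits, round(percent, 1))
```

The default of 7 PBs mirrors the minimum hit length used in the text, so every reported hit spans at least 11 residues. Low-complexity stretches, such as long helices, match on several neighbouring diagonals, so a real pipeline would likely need to merge or rank such overlapping hits.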

The final template database consisted of 5391 protein chains clustered at < 30% sequence identity and having a resolution of < 3Å. Examination of the distribution of secondary structure elements in the curated database showed that it is populated with 44.76% helices, 27.82% strands and 27.42% coils and loops. In terms of PBs, the database corresponded to 29.9% PB m (the central part of alpha-helices), 19.13% PB d (the central part of a strand) and 51% other PBs (or coils), which approximates the general distribution of regular and irregular protein secondary structure [24].

Our database contained templates from 10 out of 12 SCOP representative classes, with the majority being associated with the 4 main SCOP classes. Out of 5391 proteins, 1962 found no corresponding SCOP hit. This might be due to a delay in the synchronisation of structural annotation data across platforms.

The fragment generation procedure is illustrated in Figure 1. Any protein from our curated template database that could have been a probable sequence homologue of a protein from our query dataset was excluded on the fly from the template database. In this scenario, any template sharing >30% identity with the query sequence was considered a homologue. This preserves the premise of Free Modelling and the legitimacy of our pipeline in picking up good fragments in the absence of sequence homologues.

An average of ∼60k hits were generated by local PB-align for each query sequence, with the fewest being 34,655 for 1AIL and the most being 86,866 for 1HG8. The fragments were further filtered on the basis of PB sequence identity. All the PB hits that were 100% identical to the query PB sequence, whether this was assigned (PBA) or predicted (PBP), were chosen as the best representative fragments. This functioned as a preliminary filtering step by limiting the fragment search area to the best PB fit. In doing so, the average number of fragments was reduced to ∼13k hits per query without significantly affecting the overall coverage. The lowest number of identical hits was documented for query protein 1EE6 with 3761 hits and the highest for 1MWP with 21732 hits. Detailed counts of the total hits before and after filtering are provided in Supplementary Tables 1 and 2 for PBA and PBP respectively.

As seen in Figure 2, most fragments were below 15 residues in length. This remained the case for all the query proteins, whatever their total length. The longest fragments obtained were 65 and 61 residues long for the PBA and PBP schemes respectively.

Figure 2: Fragment length distribution: Protein Block Assignment (left) and Protein Block Prediction (right). In both cases the largest number of fragments lies at the assigned minimum fragment length of 7 PBs. The histogram plots fragment length against the number of fragments generated. There is an exponential drop in the number of fragments generated as fragment length increases.

The overall effectiveness of the fragment-generating pipeline was judged on the basis of rmsd calculations and the coverage attained by the fragments for all the queries. Figure 3 shows an overview of the rmsd distribution observed across all cases for both the PBA and PBP schemes. It is clear from the graph that most of the fragment hits, approximately 85% of the generated fragments, lie below the rmsd cut-off of 2.5Å.

Figure 3: Rmsd distribution: Protein Block Assignment (left) and Protein Block Prediction (right). The graph plots the total number of fragments generated from all the queries in the test dataset against the rmsd observed when these fragments are fitted onto the original query structure. In both cases of PB production, most fragments lie below the rmsd cut-off value of 2.5Å.

A barplot depicting the overall percentage of sequence space covered (coverage) for each query protein is shown in Figure 4. Higher coverage was observed for PBA, with an average of 96.5%, than for PBP, with an average of 93.9%. In 6 cases, PBP gave better percentage coverage than PBA. Protein queries belonging to the all-β SCOP class showed the lowest percentage coverage when compared to queries from the other three main SCOP classes.

Figure 4: Coverage analysis of fragments generated. Top: barplot showing, for each query protein from our query dataset, the percentage of positions covered by a fragment (the y-axis runs from 80 to 100% and the x-axis shows the PDB codes of the queries). Results obtained with the PBA and PBP procedures are shown in blue and green respectively. Query proteins are grouped according to their SCOP classes. The queries with the lowest coverage belong to the SCOP all-β class. Bottom: detailed coverage results illustrated for 4 query examples for both PBA and PBP schemes. The x-axis represents the residue positions and the y-axis is the raw count of the number of fragments that covered each of the positions. Positions not covered by any fragment are shown in red on the plots.

The number of hits observed per position also varied between the two cases for each query. This is illustrated in Figure 4 for four examples, one drawn from each SCOP class, and in Supplementary Figures 1 and 2 for all the queries. Coverage is shown for both PBA and PBP schemes. In many queries, the coverage was heterogeneous, with some positions covered by a low number of fragments and others by a high number of hits. This is well illustrated by the spikes in the number of hits along the query protein length. These highly covered stretches mostly correspond to regions with canonical secondary structures. Notably, on average, fewer hits were observed per position for PBP, without affecting the overall coverage of the query sequence. This can be explained by the loss of accuracy incurred in PB predictions (Figure 4 and Supplementary Figures 1 and 2).

Assessment of fragment quality

The quality assessment was primarily done by fitting fragment hits onto the original structures using a protein structure least-squares fitting program. It was measured in terms of precision, which is defined as the percentage of hits lying under a given rmsd cut-off. Table 2 summarises the precision values obtained for each SCOP class analysed in the current work at three rmsd cut-off values. An overall precision of 75.4% and 64.3% was calculated for PBA and PBP respectively at the rmsd cut-off of 2Å. The lowest precision was found for the SCOP class all-beta, reaching 71.8% and 47.7% for the PBA and PBP schemes respectively. The rest of the classes had higher precision levels, lying above 65% collectively. With more stringent rmsd cut-off values, the precision dropped.

Table 2: Precision of the pipeline in generating fragments at three rmsd cut-off values.

The table describes the precision achieved by the fragment generation pipeline at three different rmsd cut-off values of 1.5Å, 2Å and 2.5Å. It also compares the precision achievable in the best-case scenario of an exact PB sequence (PBA) with that obtained from predicted PB sequences (PBP). At the rmsd cut-off of 2Å, the pipeline is able to retain an average precision of 64%. The lowest precision was noted for hits belonging to SCOP class all-β proteins.

[Table 2 values were not recoverable from the source. Columns: SCOP Class; PBA scheme (rmsd ≤ 1.5Å, ≤ 2Å, ≤ 2.5Å); PBP scheme (rmsd ≤ 1.5Å, ≤ 2Å, ≤ 2.5Å). Rows: all-α, all-β, α+β, α/β, Average.]
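To make the precision measure concrete, the following minimal sketch computes the percentage of fragments at or below each rmsd cut-off, per SCOP class and for all fragments pooled. The function name, the input layout and the toy rmsd values are assumptions for illustration, not data from this study. The pooled row is simply computed over every fragment, which may differ from how the Average row of Table 2 was derived.

```python
from collections import defaultdict


def precision_table(fragments, cutoffs=(1.5, 2.0, 2.5)):
    """Percentage of fragments with rmsd <= cutoff, per class and pooled.

    fragments is an iterable of (scop_class, rmsd_in_angstrom) pairs.
    Returns {class_name: {cutoff: percentage}} with an extra "pooled" entry.
    """
    by_class = defaultdict(list)
    for scop_class, rmsd in fragments:
        by_class[scop_class].append(rmsd)
        by_class["pooled"].append(rmsd)
    return {
        name: {
            cutoff: 100.0 * sum(1 for r in values if r <= cutoff) / len(values)
            for cutoff in cutoffs
        }
        for name, values in by_class.items()
    }


# Toy input, purely illustrative.
toy_fragments = [
    ("all-alpha", 1.0), ("all-alpha", 1.8),
    ("all-beta", 2.2), ("all-beta", 3.0),
]
for name, row in precision_table(toy_fragments).items():
    print(name, row)
```

Because each cut-off includes all fragments counted at the tighter ones, precision can only stay equal or fall as the cut-off becomes more stringent, which is the behaviour reported above.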

Amino acid sequence identity and similarity and secondary structure identity were calculated for all fragments against the corresponding positions in the query sequence. Figure 5 depicts their overall distributions. Note that the amino acid sequence identity distribution is dominated by fragments sharing no identity (0%) with the query sequence. Sequence similarity values are higher, but the distributions are still skewed towards low values, indicating low similarity between query and template amino acid sequences. This was not the case for secondary structure identity: the distribution shifted to the right, with the largest number of hits reaching 100% identity. This can be attributed to the nature of PB sequences, which are 1D sequences describing the local backbone conformation, and suggests that secondary structure identity is an objective criterion for assessing and qualifying fragments.

Interestingly, 40% of the fragments lying under an rmsd of 2Å did not share any sequence identity with the query sequence and 90% shared less than 37.50% sequence identity. This result is in line with the template database being non-homologous, which lowers the chances of finding the exact same amino acid sequence for the variable lengths of fragments extracted. The trend was consistent in both PBA and PBP schemes: 50% of the data shared <25% sequence similarity with the query sequence, with 75% of the data reaching ∼36% sequence similarity in the case of PBP. Notably, an average secondary structure identity of 85% was noted for fragments lying below the 2Å mark, with most being 100% identical for both PBA and PBP.

Figure 5: Qualitative analysis of fragment hits. The graphs exhibit the distribution of sequence identity, sequence similarity and secondary structure identity observed for fragments generated in the two working cases of PB generation, PBA (left) and PBP (right). [A] Distribution of amino acid sequence identity. [B] Amino acid sequence similarity distribution observed between query sequences from the test dataset and the fragments generated by the pipeline. [C] Secondary structure identity shared between query sequences and fragments generated by the pipeline. Both cases of PB generation, i.e. PBA and PBP, follow a similar trend of identity and similarity distribution.

ROC curve analysis (Figure 6) performed for the aforementioned elements against rmsd (2Å) confirmed that secondary structure identity was indeed the best of the three tested criteria for prioritising fragments. The sensitivity and specificity curves for individual SCOP classes are further detailed in Supplementary Figures 2. A visual inspection of the fragments in Pymol [37] shows that the global fold of the target structures can be retained by the fragments. A sample of this visualization is illustrated in Figure 7.
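The ROC comparison of candidate criteria can be illustrated with a small rank-based AUC calculation. In this sketch a fragment counts as positive when its rmsd is at or below 2Å, and a criterion such as secondary structure identity is used as the score. The function name and the toy values are hypothetical, and the pairwise formulation is chosen for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, with ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy fragments: (secondary structure identity in %, rmsd in angstrom).
toy = [(100, 0.8), (90, 1.5), (80, 2.6), (40, 1.9), (30, 3.1)]
ss_identity = [identity for identity, _ in toy]
is_good = [rmsd <= 2.0 for _, rmsd in toy]
print(round(roc_auc(ss_identity, is_good), 3))
```

An AUC near 0.5 means the criterion does not separate good from poor fragments, which is how the sequence identity and similarity curves are described in Figure 6. Values well above 0.5 indicate a useful ranking criterion.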

In the current paper, protein blocks have been shown to hold potential for effective fragment mining towards the goal of ab initio protein structure prediction. The work concludes that a large number of good-quality fragments can be extracted from a structural database represented in the form of PBs in the absence of sequence homology. It is a very simple pipeline that mimics amino acid pairwise sequence alignment algorithms to detect, through PB alignment, local structural hits within the working database of structural templates. This shifts fragment-based approaches from relying on amino acid sequence to directly accessing structural patterns from a curated database. A large pool of structural stretches of varying lengths was recovered for each query in the dataset, yielding an aggregated set of fragments. Importantly, this included loop regions that connect regular secondary structures.

Figure 6: ROC analysis using three objective criteria. A) and B) represent results for the PBA and PBP schemes respectively. For all the fragments lying below the rmsd cut-off of 2Å, a ROC curve was plotted to visualize effective criteria for choosing viable fragments. The three criteria are: (i) amino acid sequence identity (green), (ii) amino acid sequence similarity (red) and (iii) secondary structure identity (blue). Sequence identity and similarity lie at the 50% mark, showing no direct correlation with the quality of the fragments generated. SS identity, on the other hand, has an AUC of approximately 70%, supporting its role in recovering good fragments.

Figure 7: Superimposition of fragments generated onto the original query structures. The figure shows the original structure of two queries from the test dataset, namely 1AIL and 1VJW, along with the clusters of best fragments retained for each. Here, (A) is the original structure, (B) is the fragment clusters obtained for the PBA scheme and (C) is the fragment clusters obtained for the PBP scheme.

The primary requirement of the study was to effectively convert amino acid sequences into reliable PB sequences while retaining the probable local conformation. This was achieved by using previous work done in our group [28] for optimizing the possible PB per position for each overlapping pentapeptide within a protein sequence (PB-kPRED webserver). The calculated PB prediction scores for all the target proteins from the dataset were above the standardized threshold, with the exception of one (1PO5). The average accuracy of PB prediction was 74.98% for 42 out of 43 targets in the dataset, which was ample to guide valid fragment picking through the pipeline.

Prior to running the fragment extraction protocol, we abided by the constraints of free modelling approaches to avoid biasing the subsequent fragment quality assessment. Accordingly, any potential homologous template sequence was removed from the curated database as a preliminary step. This consequently produced a personalised database for each query sequence. This step also ensured the objectivity of the pipeline when inspecting the quality of fragments generated through PB-predicted sequences.

From our analysis, it was noted that fragments generated via this pipeline show no correspondence between amino acid sequence identity or similarity and rmsd. On the other hand, comparison of the secondary structure of the template fragments with the predicted secondary structure of the query showed excellent relevance. The data consistently favoured secondary structure identity comparison, with a large proportion of fragments reaching 100% secondary structure identity. A similar trend was observed when examining the correlation between rmsd and secondary structure identity.
Higher secondary structure identity corresponded to lower rmsd between template fragments and query (data not shown). This is not surprising since a PB is itself a representation of the backbone.

The quality of fragments was measured by fitting fragment coordinates onto the original structure and was termed here “precision”. This was evaluated for the 4 main SCOP classes, all agreeing with the cumulative results of the pipeline. Significant discrepancies were noted for the all-beta class, where the accuracy of fragment prediction did not reach that of the other classes. In general, it has been noticed that alpha folds are easier to map back to than beta strands. This was also seen in the current study, where the all-beta queries showed overall higher rmsd and a more scattered secondary structure identity distribution.

The pipeline was tested against the best possible case of PB representation, i.e. PB assignment (PBA), and was compared to the real-life scenario of approximating the PB sequence for a given query through PB prediction (PBP). This accounted for the effects of missing information on the process of fragment picking. Our results showed that the overall quality of fragments retained in the case of PBP was not compromised.

The quality of fragments was judged on the basis of two common criteria used in earlier work on fragment library generation pipelines, i.e. precision and coverage. The overall efficiency of our pipeline is on a par with other available algorithms such as HHFrag and NNMake, which report precisions of 62.16% and 38.17% respectively [14]. The higher precision of HHFrag can be attributed to the presence of sequence homologues in the working database [14], which is not the case here. SAFrag, another SA-based fragment mining pipeline, has shown a precision of 86.7% [18]. The SAFrag strategy was built on grounds similar to HHFrag and uses HMM-based profile-profile comparison to generate fragment hits. It makes use of a structural alphabet that describes 27 states.
The pipeline carried out profile comparison on query sequence segments pre-partitioned into subsets of varying lengths of 6 to 27 residues. These subsets were then used to search for similar profiles within the established structural data banks. It has to be noted that SAFrag used two types of structural data banks, PDB25 and PDB50, for fragment generation. The higher precision and coverage of this algorithm can be attributed to the fact that, during the fragment search, the target structure as well as data coming from structural homologs are accepted in the template database.

The coverage attained in our work lies in the higher range when compared to these other algorithms (HHFrag - 71% ± 13, NNMake - ∼ [...]

Our work shows that structural alphabets are useful for finding fragments for the prediction of protein structures in template-free modelling approaches. SAs are less restrictive than amino acid sequence-based fragment libraries owing to their potential to access a more expanded conformational space. This helps to broaden the search space without losing the potential to capture the native fold of the target protein. They can successfully be used to search for potential structural homologues of proteins that otherwise share low or no amino acid sequence similarity. We show here that fragments covering the entire length of small proteins can be readily generated, easing the task for structure prediction protocols.

Since PBs are local conformations represented in the form of a 1D sequence, the likelihood of retrieving fragments sharing a similar fold is higher than with amino acid sequence comparison. Longer fragments representing entire domains can be extracted using the protocol. In general, PBs represent a promising step towards protein structure prediction protocols that use local conformations as a starting point.

Acknowledgements

The authors thank Dr Alexandre G. de Brevern and Prof Narayanaswamy Srinivasan for fruitful discussions on this work.

Funding

This work has been supported by the Conseil Régional de La Réunion and Fonds Social Européen in the form of a PhD scholarship to SD under tier number 234275, convention number DIRED/20161451. BO is thankful to the Conseil Régional Pays de la Loire for support in the framework of the GRIOTE grant.

Competing interests

F.C. is linked to Peaccel. SD, RS, YHS and BO declare no competing interests.

References

[1] Kevin J Maurice. SSThread: Template-free protein structure prediction by threading pairs of contacting secondary structures followed by assembly of overlapping pairs. Journal of Computational Chemistry, 35(8):644–656, 2014.

[2] Bee Yin Khor, Gee Jun Tye, Theam Soon Lim, and Yee Siew Choong. General overview on structure prediction of twilight-zone proteins. Theoretical Biology and Medical Modelling, 12(1):15, 2015.

[3] John Moult, Jan T Pedersen, Richard Judson, and Krzysztof Fidelis. A large-scale experiment to assess protein structure prediction methods. Proteins: Structure, Function, and Bioinformatics, 23(3):ii–iv, 1995.

[4] Carol A Rohl, Charlie EM Strauss, Kira MS Misura, and David Baker. Protein structure prediction using Rosetta. In Methods in Enzymology, volume 383, pages 66–93. Elsevier, 2004.

[5] Yang Zhang, Adrian K Arakaki, and Jeffrey Skolnick. TASSER: an automated method for the prediction of protein tertiary structures in CASP6. Proteins: Structure, Function, and Bioinformatics, 61(S7):91–98, 2005.

[6] Dong Xu and Yang Zhang. Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field. Proteins: Structure, Function, and Bioinformatics, 80(7):1715–1735, 2012.

[7] Ambrish Roy, Alper Kucukural, and Yang Zhang. I-TASSER: a unified platform for automated protein structure and function prediction. Nature Protocols, 5(4):725, 2010.

[8] JL Klepeis and CA Floudas. ASTRO-FOLD: a combinatorial and global optimization framework for ab initio prediction of three-dimensional structures of proteins from the amino acid sequence. Biophysical Journal, 85(4):2119–2146, 2003.

[9] S Ołdziej, C Czaplewski, A Liwo, M Chinchio, M Nanias, J A Vila, M Khalili, Y A Arnautova, A Jagielska, M Makowski, H D Schafroth, R Kaźmierkiewicz, D R Ripoll, J Pillardy, J A Saunders, Y K Kang, K D Gibson, and H A Scheraga. Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: Assessment in two blind tests. Proceedings of the National Academy of Sciences, 102(21):7547–7552, 2005.

[10] Mohammed AlQuraishi. AlphaFold at CASP13. Bioinformatics, 35(22):4862–4865, 2019.

[11] Surbhi Dhingra, Ramanathan Sowdhamini, Frédéric Cadet, and Bernard Offmann. A glance into the evolution of template-free protein structure prediction methodologies. arXiv preprint arXiv:2002.06616, 2020.

[12] Kim T Simons, Charles Kooperberg, Enoch Huang, and David Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. Journal of Molecular Biology, 268(1):209–225, 1997.

[13] Shaun M Kandathil, Mario Garza-Fabre, Julia Handl, and Simon C Lovell. Improved fragment-based protein structure prediction by redesign of search heuristics. Scientific Reports, 8(1):1–14, 2018.

[14] Saulo HP de Oliveira, Jiye Shi, and Charlotte M Deane. Building a better fragment library for de novo protein structure prediction. PloS One, 10(4), 2015.

[15] Richard Bonneau, Jerry Tsai, Ingo Ruczinski, Dylan Chivian, Carol Rohl, Charlie EM Strauss, and David Baker. Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins: Structure, Function, and Bioinformatics, 45(S5):119–126, 2001.

[16] Dominik Gront, Daniel W Kulp, Robert M Vernon, Charlie EM Strauss, and David Baker. Generalized fragment picking in Rosetta: design, protocols and applications. PloS One, 6(8), 2011.

[17] Shaun M Kandathil, Julia Handl, and Simon C Lovell. Toward a detailed understanding of search trajectories in fragment assembly approaches to protein structure prediction. Proteins: Structure, Function, and Bioinformatics, 84(4):411–426, 2016.

[18] Yimin Shen, Géraldine Picord, Frédéric Guyon, and Pierre Tuffery. Detecting protein candidate fragments using a structural alphabet profile comparison approach. PloS One, 8(11), 2013.

[19] Jad Abbass and Jean-Christophe Nebel. Customised fragments libraries for protein structure prediction based on structural class annotations. BMC Bioinformatics, 16(1):136, 2015.

[20] Alexandre G. de Brevern, Catherine Etchebest, and Serge Hazout. Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins: Structure, Function, and Bioinformatics, 41(3):271–287, 2000.

[21] Ron Unger, David Harel, Scot Wherland, and Joel L Sussman. A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins: Structure, Function, and Bioinformatics, 5(4):355–373, 1989.

[22] Anne-Claude Camproux, Romain Gautier, and Pierre Tuffery. A hidden Markov model derived structural alphabet for proteins. Journal of Molecular Biology, 339(3):591–605, 2004.

[23] Shuai Cheng Li, Dongbo Bu, Xin Gao, Jinbo Xu, and Ming Li. Designing succinct structural alphabets. Bioinformatics, 24(13):i182–i189, 2008.

[24] Agnel Praveen Joseph, Garima Agarwal, Swapnil Mahajan, Jean-Christophe Gelly, Lakshmipuram S Swapna, Bernard Offmann, Frédéric Cadet, Aurélie Bornot, Manoj Tyagi, Hélène Valadié, Bohdan Schneider, Catherine Etchebest, Narayanaswamy Srinivasan, and Alexandre G De Brevern. A short survey on protein blocks. Biophysical Reviews, 2(3):137–145, 2010.

[25] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1):59–69, 1982.

[26] T. Kohonen, M. R. Schroeder, and T. S. Huang, editors. Self-Organizing Maps. Springer-Verlag, Berlin, Heidelberg, 3rd edition, 2001.

[27] Alexandre G de Brevern. New assessment of a structural alphabet. In Silico Biology, 5(3):283–289, 2005.

[28] Iyanar Vetrivel, Swapnil Mahajan, Manoj Tyagi, Lionel Hoffmann, Yves-Henri Sanejouand, Narayanaswamy Srinivasan, Alexandre G De Brevern, Frederic Cadet, and Bernard Offmann. Knowledge-based prediction of protein backbone conformation using a structural alphabet. PloS One, 12(11), 2017.

[29] Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000.

[30] Maria Hauser, Christian E Mayer, and Johannes Söding. kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics, 14(1):248, 2013.

[31] João PGLM Rodrigues, João MC Teixeira, Mikaël Trellet, and Alexandre MJJ Bonvin. Pdb-tools: a swiss army knife for molecular structures. F1000Research, 7, 2018.

[32] Wolfgang Kabsch and Christian Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577–2637, 1983.

[33] Liam J McGuffin, Kevin Bryson, and David T Jones. The PSIPRED protein structure prediction server. Bioinformatics, 16(4):404–405, 2000.

[34] Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.

[35] Manoj Tyagi, Venkataraman S Gowri, Narayanaswamy Srinivasan, Alexandre G de Brevern, and Bernard Offmann. A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications. Proteins: Structure, Function, and Bioinformatics, 65(1):32–39, 2006.

[36] Andrew D McLachlan. Rapid comparison of protein structures. Acta Crystallographica Section A, 38(6):871–873, 1982.

[37] Warren L DeLano et al. Pymol: An open-source molecular graphics tool. CCP4 Newsletter on Protein Crystallography, 40(1):82–92, 2002.

Supplementary Figures and Tables

Supplementary Figures 1. Coverage density plots.

The plots under this heading show the distribution of the number of fragments per position. The analysis has been performed for both treatments of the query proteins: (1) Protein Block Assignment (PBA) and (2) Protein Block Prediction (PBP). The graphs are further classified into sections: (a) SCOP Class - all α, (b) SCOP Class - all β, (c) SCOP Class - α+β and (d) SCOP Class - α/β. The percentage of coverage for each test protein is marked above the plot. Along with it, the positions for each protein with no fragment hits are marked in red and the number of residues with no hits is also shown above each graph.
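The quantities summarised in the coverage density plots (number of fragments per position, positions with no hits, and percentage coverage) can be sketched in Python as follows. This is only an illustration under stated assumptions: fragments are given as hypothetical `(start, length)` pairs with 0-based start positions on the query, and the names `per_position_counts` and `coverage` are not taken from the pipeline.

```python
from typing import List, Tuple


def per_position_counts(query_length: int,
                        fragments: List[Tuple[int, int]]) -> List[int]:
    """Count how many fragments overlap each query position.

    Each fragment is an assumed (start, length) pair, 0-based start.
    """
    counts = [0] * query_length
    for start, length in fragments:
        # Clip the fragment to the query boundaries before counting.
        for pos in range(max(start, 0), min(start + length, query_length)):
            counts[pos] += 1
    return counts


def coverage(counts: List[int]) -> float:
    """Percentage of query positions covered by at least one fragment."""
    covered = sum(1 for c in counts if c > 0)
    return 100.0 * covered / len(counts)


# Hypothetical 20-residue query with three 11-residue fragments.
counts = per_position_counts(20, [(0, 11), (3, 11), (8, 11)])
uncovered = [i for i, c in enumerate(counts) if c == 0]
print(counts)
print(f"coverage = {coverage(counts):.1f}%, positions with no hits: {uncovered}")
```

The example uses 11-residue fragments only because that is the minimum fragment length reported for the pipeline; any length works with the same functions.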

Supplementary Figure 1.1a. Protein Block Assignment (PBA) - SCOP Class - all α

Supplementary Figure 1.1b. Protein Block Assignment (PBA) - SCOP Class - all β

Supplementary Figure 1.1c. Protein Block Assignment (PBA) - SCOP Class - α+β

Supplementary Figure 1.1d. Protein Block Assignment (PBA) - SCOP Class - α/β

Supplementary Figure 1.2a. Protein Block Prediction (PBP) - SCOP Class - all α

Supplementary Figure 1.2b. Protein Block Prediction (PBP) - SCOP Class - all β

Supplementary Figure 1.2c. Protein Block Prediction (PBP) - SCOP Class - α+β

Supplementary Figure 1.2d. Protein Block Prediction (PBP) - SCOP Class - α/β

Supplementary Figures 2. Sensitivity and specificity plots.

The plots under this heading show the correlation between rmsd (2 Å) and three chosen criteria for prioritizing fragment selection, i.e., protein sequence identity, protein sequence similarity and secondary structure identity. The analysis has been performed for both treatments of the query proteins: (1) Protein Block Assignment (PBA) and (2) Protein Block Prediction (PBP). The graphs are further classified into sections: (a) SCOP Class - all α, (b) SCOP Class - all β, (c) SCOP Class - α+β and (d) SCOP Class - α/β.

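Supplementary Figure 2.1a. Protein Block Assignment (PBA) - SCOP Class - all α

Supplementary Figure 2.1b. Protein Block Assignment (PBA) - SCOP Class - all β

Supplementary Figure 2.1c. Protein Block Assignment (PBA) - SCOP Class - α+β

Supplementary Figure 2.1d. Protein Block Assignment (PBA) - SCOP Class - α/β

Supplementary Figure 2.2a. Protein Block Prediction (PBP) - SCOP Class - all α

Supplementary Figure 2.2b. Protein Block Prediction (PBP) - SCOP Class - all β

Supplementary Figure 2.2c. Protein Block Prediction (PBP) - SCOP Class - α+β

Supplementary Figure 2.2d. Protein Block Prediction (PBP) - SCOP Class - α/β

PB Assignment

| SCOP class | PDB ID | Length (AA) | Length (PDB) | All Hits | Identical PB Hits | Longest Hits | Coverage (%) |
|---|---|---|---|---|---|---|---|
| All Alpha | 1AIL | 73 | 70 | 34655 | 14402 | 150 | 100 |
| All Alpha | 1RRO | 108 | 108 | 49734 | 9728 | 415 | 100 |
| All Alpha | 1U61 | 138 | 127 | 38640 | 6490 | 358 | 100 |
| All Alpha | 1SL8 | 191 | 181 | 44102 | 8195 | 402 | 98.9 |
| All Alpha | 1QUU | 250 | 248 | 44824 | 11293 | 225 | 100 |
| All Alpha | 1T5J | 313 | 301 | 51463 | 6480 | 658 | 100 |
| All Alpha | 1PO5 | 476 | 465 | 63837 | 9992 | 1569 | 97.8 |
| All Beta | 1MHN | 59 | 59 | 45693 | 7763 | 172 | 100 |
| All Beta | 1TEN | 90 | 89 | 41975 | 5675 | 232 | 100 |
| All Beta | 2G1L | 104 | 103 | 44320 | 4135 | 343 | 98.1 |
| All Beta | 1IFR | 121 | 113 | 43452 | 5637 | 270 | 97.3 |
| All Beta | 1BFG | 146 | 126 | 57287 | 7341 | 343 | 98.4 |
| All Beta | 2FR2 | 172 | 161 | 78928 | 18226 | 392 | 99.4 |
| All Beta | 1EE6 | 197 | 197 | 42734 | 3761 | 471 | 94.4 |
| All Beta | 1UAI | 224 | 223 | 45133 | 5718 | 437 | 96.4 |
| All Beta | 2C9A | 259 | 259 | 44350 | 5378 | 717 | 98.5 |
| All Beta | 1O4Y | 288 | 270 | 63734 | 10210 | 791 | 95.9 |
| All Beta | 1HG8 | 349 | 349 | 86866 | 22834 | 505 | 86.8 |
| All Beta | 1NKG | 508 | 508 | 55584 | 8251 | 1841 | 98.8 |
| Alpha and Beta | 1VJW | 60 | 59 | 72137 | 12525 | 235 | 100 |
| Alpha and Beta | 1MWP | 96 | 96 | 73965 | 21732 | 461 | 97.9 |
| Alpha and Beta | 1GNU | 117 | 117 | 73953 | 19932 | 583 | 100 |
| Alpha and Beta | 1R9H | 135 | 118 | 76827 | 17704 | 558 | 100 |
| Alpha and Beta | 206L | 164 | 162 | 55886 | 9290 | 861 | 98.1 |
| Alpha and Beta | 2FS3 | 282 | 280 | 64493 | 13195 | 674 | 97.5 |
| Alpha and Beta | 1DZF | 215 | 211 | 70425 | 15982 | 850 | 99.1 |
| Alpha and Beta | 1DXJ | 242 | 242 | 59053 | 8455 | 791 | 93 |
| Alpha and Beta | 1MAT | 264 | 263 | 62926 | 13142 | 1024 | 98.1 |
| Alpha and Beta | 1JKS | 294 | 280 | 66566 | 13929 | 885 | 98.6 |
| Alpha and Beta | 1MC4 | 370 | 369 | 61250 | 9888 | 739 | 99.2 |
| Alpha and Beta | 2FKF | 462 | 455 | 61147 | 13037 | 1300 | 98.9 |
| Alpha/Beta | 1H75 | 81 | 76 | 69484 | 13346 | 299 | 100 |
| Alpha/Beta | 1IU9 | 111 | 111 | 66589 | 13346 | 342 | 100 |
| Alpha/Beta | 1E6K | 130 | 130 | 71746 | 16321 | 621 | 100 |
| Alpha/Beta | 1P90 | 145 | 123 | 73913 | 20568 | 363 | 100 |
| Alpha/Beta | 1FTG | 168 | 168 | 72925 | 18987 | 711 | 99.4 |
| Alpha/Beta | 1QCY | 193 | 193 | 69619 | 15430 | 532 | 100 |
| Alpha/Beta | 2A14 | 263 | 257 | 64677 | 12697 | 835 | 99.2 |
| Alpha/Beta | 1IZZ | 283 | 276 | 71107 | 16494 | 770 | 95.7 |
| Alpha/Beta | 1QUE | 303 | 303 | 62765 | 12752 | 742 | 99.7 |
| Alpha/Beta | 1KRM | 356 | 349 | 67668 | 12602 | 779 | 94.8 |
| Alpha/Beta | 3BSG | 414 | 404 | 67000 | 14272 | 832 | 98.3 |
| Alpha/Beta | 1PGN | 482 | 473 | 66032 | 14479 | 1019 | 99.6 |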

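Supplementary Table 1.

This table provides the fragment hit counts obtained after Protein Block Assignment (PBA) for each protein from the query dataset.

PB Prediction

| SCOP class | PDB ID | Length (AA) | Length (PDB) | All Hits | Identical PB Hits | Longest Hits | Coverage (%) |
|---|---|---|---|---|---|---|---|
| All Alpha | 1AIL | 73 | 70 | 38447 | 11501 | 414 | 98.6 |
| All Alpha | 1RRO | 108 | 108 | 40842 | 7661 | 451 | 99.1 |
| All Alpha | 1U61 | 138 | 127 | 56844 | 10419 | 248 | 96.9 |
| All Alpha | 1SL8 | 191 | 181 | 50381 | 10150 | 984 | 97.2 |
| All Alpha | 1QUU | 250 | 248 | 53552 | 12696 | 441 | 96.4 |
| All Alpha | 1T5J | 313 | 301 | 54456 | 4973 | 668 | 91.7 |
| All Alpha | 1PO5 | 476 | 465 | 62554 | 9430 | 1591 | 97.6 |
| All Beta | 1MHN | 59 | 59 | 70551 | 6379 | 380 | 94.9 |
| All Beta | 1TEN | 90 | 90 | 69989 | 3775 | 859 | 95.5 |
| All Beta | 2G1L | 104 | 103 | 45732 | 3721 | 317 | 90.3 |
| All Beta | 1IFR | 121 | 113 | 66507 | 15170 | 946 | 93.8 |
| All Beta | 1BFG | 146 | 126 | 53359 | 7121 | 482 | 99.2 |
| All Beta | 2FR2 | 172 | 161 | 60031 | 4934 | 2542 | 97.5 |
| All Beta | 1EE6 | 197 | 197 | 49351 | 3073 | 792 | 92.4 |
| All Beta | 1UAI | 224 | 223 | 47480 | 2926 | 740 | 87.9 |
| All Beta | 2C9A | 259 | 259 | 59190 | 5551 | 705 | 88.8 |
| All Beta | 1O4Y | 288 | 270 | 67912 | 12023 | 1869 | 95.9 |
| All Beta | 1HG8 | 349 | 349 | 50990 | 6720 | 659 | 87.4 |
| All Beta | 1NKG | 508 | 508 | 56424 | 8486 | 1530 | 94.5 |
| Alpha+Beta | 1VJW | 60 | 59 | 61326 | 12716 | 157 | 96.6 |
| Alpha+Beta | 1MWP | 96 | 96 | 73836 | 21621 | 308 | 99 |
| Alpha+Beta | 1GNU | 117 | 117 | 82325 | 21173 | 780 | 97.4 |
| Alpha+Beta | 1R9H | 135 | 118 | 51549 | 5452 | 636 | 95.8 |
| Alpha+Beta | 206L | 164 | 162 | 53082 | 8458 | 736 | 100 |
| Alpha+Beta | 2FS3 | 282 | 280 | 64098 | 12754 | 1236 | 96.8 |
| Alpha+Beta | 1DZF | 215 | 211 | 70028 | 17204 | 824 | 100 |
| Alpha+Beta | 1DXJ | 242 | 242 | 63698 | 9378 | 500 | 93 |
| Alpha+Beta | 1MAT | 264 | 263 | 63583 | 13768 | 909 | 99.2 |
| Alpha+Beta | 1JKS | 294 | 280 | 69244 | 16160 | 1134 | 98.6 |
| Alpha+Beta | 1MC4 | 370 | 369 | 66492 | 11564 | 1209 | 97.8 |
| Alpha+Beta | 2FKF | 462 | 455 | 64106 | 14040 | 1236 | 95.4 |
| Alpha/Beta | 1H75 | 81 | 76 | 72386 | 14483 | 729 | 90.8 |
| Alpha/Beta | 1IU9 | 111 | 111 | 66329 | 18860 | 1258 | 95.5 |
| Alpha/Beta | 1E6K | 130 | 130 | 72078 | 15837 | 1205 | 100 |
| Alpha/Beta | 1P90 | 145 | 123 | 74454 | 15671 | 2383 | 95.1 |
| Alpha/Beta | 1FTG | 168 | 168 | 70585 | 14923 | 587 | 99.4 |
| Alpha/Beta | 1QCY | 193 | 193 | 69674 | 15537 | 687 | 100 |
| Alpha/Beta | 2A14 | 263 | 257 | 66681 | 14408 | 2781 | 97.3 |
| Alpha/Beta | 1IZZ | 283 | 276 | 71995 | 14779 | 1563 | 94.9 |
| Alpha/Beta | 1QUE | 303 | 303 | 64904 | 14981 | 889 | 97.7 |
| Alpha/Beta | 1KRM | 356 | 349 | 67838 | 13370 | 605 | 99.4 |
| Alpha/Beta | 3BSG | 414 | 404 | 67614 | 11980 | 1006 | 95.3 |
| Alpha/Beta | 1PGN | 482 | 473 | 65827 | 11766 | 1414 | 94.7 |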

Supplementary Table 2.