Deep transfer learning in the assessment of the quality of protein models
David Menéndez Hurtado, ∗ Karolis Uziela, and Arne Elofsson † Department of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University
Motivation:
Proteins fold into complex structures that are crucial for their biological functions. Experimental determination of protein structures is costly and therefore limited to a small fraction of all known proteins. Hence, different computational structure prediction methods are necessary for the modelling of the vast majority of all proteins. In most structure prediction pipelines, the last step is to select the best available model and to estimate its accuracy. This model quality estimation problem has been growing in importance during the last decade, and progress is believed to be important for large-scale modelling of proteins. The current generation of model quality estimation programs performs well at separating incorrect and good models, but fails to consistently identify the best possible model. State-of-the-art model quality assessment methods use a combination of features that describe a model and the agreement of the model with features predicted from the protein sequence.
Results:
We first introduce a deep neural network architecture to predict model quality using significantly fewer input features than state-of-the-art methods. Thereafter, we propose a methodology to train the deep network that leverages the comparative structure of the problem. We also show the possibility of applying transfer learning on databases of known protein structures. We demonstrate its viability by reaching state-of-the-art performance using only a reduced set of input features and a coarse description of the models.
Availability:
The code will be freely available for download at github.com/ElofssonLab/ProQ4.

I. INTRODUCTION
Proteins perform the vast majority of all biological functions. They are constructed as long polymers built of twenty amino acids that fold into a three-dimensional shape after synthesis. During the last half century, the structures of more than 100 000 proteins have been experimentally obtained using X-ray crystallography, NMR, or electron microscopy. Although the experimental techniques have improved significantly, the average cost for a new protein structure is still close to $100k, limiting the number of experimentally determined protein structures [1].

A single organism contains thousands of genes, each coding for at least one protein. Due to the exponential decrease of sequencing costs, the sequences of more than 100 million proteins have been deposited in public databases like UniRef [2]. Further, almost one order of magnitude more sequences are available from metagenomic projects. This means that there are at least three orders of magnitude separating the number of known protein sequences and structures. This gap is increasing, and only computational methods will be able to close it.

Luckily, methods to generate accurate three-dimensional models of proteins exist. If the structure of a homologous protein has been solved, it can be used as a template for modelling. This method can be applied to about 50% of all protein families [3, 4]. For the rest, one has to rely on other methods. Here, rapid progress in contact prediction has recently enabled the modelling of the structure of many proteins. The accuracy of these models varies and depends on several factors, and no single method always produces the best model. Therefore, in many modelling approaches a combination of methods and/or parameter settings is used to produce multiple models. These are later analysed to identify the best one.

A. Deep learning
Deep learning is a family of machine learning algorithms that has brought a revolution in fields such as computer vision, speech recognition, and artificial intelligence [5]. The improvements can be attributed to two discoveries: (i) new algorithms make it possible to define and efficiently train much more complex models that can take advantage of large training sets; and (ii) deep learning methods can make use of the structure of the data to take a data-driven approach to feature engineering. Instead of hand-crafting high-level features, deep learning can use lower-level representations of the data that preserve its structure, such as the ordering of amino acids in a protein chain, or the proximity of pixels in an image. A deep learning model is composed of multiple layers that successively transform the inputs, learning to extract the relevant information during training; in effect, replacing manual feature engineering with automatic feature extraction. In a deep model, each layer learns to combine some of its inputs, creating a hierarchy of features automatically extracted from the data. For example, when trying to identify an object in an image, the first layers might learn to detect edges, the next layers would combine them into textures, object parts, and eventually the whole training label.

B. Transfer learning
Deep learning can be used to train high-capacity models, but it requires large amounts of data, which are not always available or cheap. However, since most of the network acts as a feature extractor, it is possible to train a network on a larger and different but related problem, and reuse the learned feature extraction. Yosinski et al. [6] studied how well these features generalise across categories in image recognition, and transfer learning has since become a standard procedure in deep learning.
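The reuse described above can be illustrated with a minimal numpy sketch: a feature extractor whose weights were fitted on a large source task is frozen, and only a small task-specific head is trained on the target task. All names, shapes, and random weights here are illustrative stand-ins, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained feature extractor: in real transfer
# learning these weights come from training on a large related task,
# and are frozen (never updated) when solving the target task.
W_feat = rng.normal(size=(8, 16))

def elu(z):
    """Exponential Linear Unit activation."""
    return np.where(z > 0, z, np.expm1(z))

def extract_features(x):
    """Frozen, reused feature extraction (x has shape (n, 8))."""
    return elu(x @ W_feat)

# Transfer step: only this small task-specific head would be trained
# on the (much smaller) target dataset.
W_head = rng.normal(size=(16, 1)) * 0.1

def predict(x):
    return extract_features(x) @ W_head

x = rng.normal(size=(4, 8))
assert extract_features(x).shape == (4, 16)
assert predict(x).shape == (4, 1)
```

In practice the extractor can also be fine-tuned with a small learning rate instead of being fully frozen; the trade-off depends on how similar the two tasks are.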
C. Predicting protein structural features
The structure of a protein consists of secondary structure elements, α-helices and β-sheets, that are packed in such a way that hydrophobic amino acids are mostly hidden from the surface. We also know that homologous sequences fold into similar structures, so we can search sequence databases for proteins that are likely to be similar to the one we are interested in, and compile them into an aligned list called a multiple sequence alignment.

Both the secondary structure and the surface accessibility of a residue in a protein can be predicted with rather high accuracy by applying machine learning to the statistics extracted from the columns of this multiple sequence alignment. The accuracy of these methods has reached close to 80% in two- [7] or three-state [8] classifications.

D. CASP: Critical Assessment of protein Structure Prediction
CASP is a biennial experiment aimed at assessing the state of the structure prediction field. The organizers release the sequences of proteins of hitherto unknown structure, and allow a few days for groups around the world to submit predictions. In a second stage, these models are published and evaluated by model quality assessment methods. The latest editions count around 100 individual sequences (also called targets), and received around 200 models per target, coming from circa 30 independent methods.
E. Model quality assessment background
Estimating the free energies of protein models has a long history within the protein structure prediction field [9, 10]. Here, the ultimate goal is to understand the fundamental physics governing protein folding to such an extent that it is possible to simulate the folding of a protein. The purpose of a model quality assessment program is to accurately estimate the distance of the model from the native structure.

FIG. 1: Detail of the 3D structure of the protein 3TDU. Highlighted in yellow are the residues that smoothly transition between helix and coil. Predictions are commonly wrong about the exact position of the boundary.

There are two general strategies to evaluate the quality of a single protein model: comparison with auxiliary predictions, and evaluation of physico-chemical properties of the model. A limitation of the second approach is that even if a method could perfectly describe the free energy of a protein model, this measure might not be at all correlated with the difference of a model from the native structure. Take the trivial example of a model where all atoms except one are perfectly placed. This last atom is then positioned on top of another atom. Such a model would have a near-infinite free energy, but by any distance measure it would be almost perfect.

On the other hand, methods relying solely on predictions might provide important low-resolution information and reduce the complexity of 3D prediction to simpler problems, such as secondary structure. A good model is expected to agree with the predictions. In addition, not all disagreements carry the same information. The exact boundaries of secondary structure elements are notoriously difficult to predict, and depend on the exact thresholds used to define them, as seen in Figure 1. In contrast, it is rare that predictions confuse different secondary structure classes outright.
F. Related work
For more than a decade, several groups have developed protein model quality assessment methods using various input features [11–16]. The improvements achieved during the last few years can be attributed to small, but significant, improvements to the methods by (i) including additional descriptions of a protein model [17], (ii) applying deep learning techniques [18], and (iii) combining many features [16].

Here follows a brief description of the inputs and algorithms used by other comparable methods. These mostly focus on physico-chemical descriptors, such as statistical potentials, and train simple machine learning algorithms.

• QMEAN [19]: Linear combination of the agreement of secondary structure and surface area, and three statistical potentials describing the torsion angles, amino acid contacts, and exposure to the water.

• DeepQA [20]: Deep Belief Network combining several scoring potentials and energies, other quality assessment methods, and seven different agreement metrics between observed and predicted secondary structure and surface area.

• ProQ3 [17]: A linear SVM trained on atom and amino acid contacts, observed secondary structure, multiple sequence alignment statistics, agreements with predicted secondary structure and surface area, as well as terms of the Talaris energy function [21].

• ProQ3D [18]: Same input as ProQ3, but replacing the linear SVM with a multi-layer perceptron.

• VoroMQA [22]: Statistical potential based on the frequencies of observed atom contacts.

• SVMQA [15]: SVM combining different statistical potentials and agreement with predicted secondary structure and surface area.

In contrast to all other methods, our method uses only coarse structural features and a multiple sequence alignment, but no statistical potentials nor any other chemical description.
II. METHODS

A. Datasets
In order to directly compare results with the most recent method, ProQ3D [18] retrained on LDDT [23], we have used the same datasets for training and testing: all the submitted models to CASP editions 9 and 10 for training, and CASP 11 for evaluation, excluding all the targets shorter than 50 residues. In total, we have 57 263 models from 212 different targets in the training set, and 14 580 models from 79 targets in the test set, excluding targets cancelled by the organisers.

We also present the results on the same subset of the Cameo dataset used by ProQ3D: 19 899 models from 676 different targets.

For the pre-trained networks, we used 5687 structures from the PISCES dataset [24], while an additional 300 were left out for validation. The multiple sequence alignments were produced using Jackhmmer [25] to search Uniref50 [26], with an E-value threshold of 10− for 3 iterations.

B. Description of inputs
Prediction of structural features of a protein is improved by using multiple sequence alignments [27]. From the multiple sequence alignment we extract two statistics: the self-information (Equation 1a) and the partial entropy (Equation 1b) of the position:

I_i = -\log\left(\frac{p_i}{\bar{p}_i}\right),  (1a)

S_i = -p_i \log\left(\frac{p_i}{\bar{p}_i}\right),  (1b)

where p_i is the frequency of the amino acid i at the position and \bar{p}_i is the average frequency in the data set. We also include the protein sequence itself using sparse encoding.

The protein models are defined in a coarse representation by the sine and cosine of the dihedral angles [28] ϕ and ψ, the secondary structure, relative surface area, and energies of hydrogen bonds in the backbone, as defined by DSSP [29]. In DSSP, 8 different secondary structure states are assigned to a protein. Due to lack of data, we merged the following two pairs of states: G (3₁₀ helix) and I (π-helix), and T (hydrogen-bonded turn) and S (bend).

C. Output: the target function
There are several different metrics that can be used to evaluate the similarity between a protein model and the native structure. In this work we choose to use a local scoring function: the Local Distance Difference Test (LDDT) [30], a number between 0 and 1 that represents the fraction of conserved contacts between all pairs of atoms in the native structure for several distance thresholds. An LDDT of 1 means perfect agreement, and a good model shall have scores higher than 0.5. Other quality estimation metrics could also be used, providing similar performance. To estimate the quality of a model, we define the global score to be the average of the local scores, giving a score of 0 to any residue missing from the model, and ignoring those absent from the native structure.
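The global score defined above can be sketched as a small function. This is a hedged illustration of the averaging rule only (score 0 for residues missing from the model, residues absent from the native ignored); the bookkeeping in the actual implementation may differ.

```python
def global_score(local_lddt, in_model, in_native):
    """Global quality as the mean of per-residue LDDT scores.

    local_lddt: dict residue_index -> LDDT in [0, 1] for scored residues
    in_model:   set of residue indices present in the model
    in_native:  set of residue indices present in the native structure

    Residues missing from the model contribute 0; residues absent
    from the native structure are ignored entirely, since they cannot
    be evaluated against anything.
    """
    scored = [local_lddt.get(i, 0.0) if i in in_model else 0.0
              for i in sorted(in_native)]
    return sum(scored) / len(scored) if scored else 0.0


# Example: native has residues 1-4, the model only covers 1-3.
# Residue 4 counts as 0, so the global score is (1.0+0.5+0.5+0)/4.
s = global_score({1: 1.0, 2: 0.5, 3: 0.5}, {1, 2, 3}, {1, 2, 3, 4})
assert abs(s - 0.5) < 1e-12
```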
D. Figures of merit
The performance of the models will be evaluated on several metrics.

• Local correlation: Pearson correlation between the predicted and true values assigned to every residue. This measures the reliability of the predicted scores assigned to each residue.

• Local RMSE: Root Mean Squared Error between the aforementioned predicted local scores and the true values.

• Per model correlation: average Pearson correlation between predicted and true values of every residue for every model. This measures how well we can differentiate the correct and incorrect parts of the model.

• Global correlation: Pearson correlation between predicted and true values for the overall model. This is defined as the average of the local scores, ignoring any residue not present in the native structure, and setting the score of missing residues to 0. This measures how well we can rank targets on a global scale.

• Global RMSE: Root Mean Squared Error between the global predicted and true scores.

• Per target correlation: average Pearson correlation for the global scores per target. A measurement of the ability of the program to rank models from the same target.

• First rank loss: average difference between the best and the top-ranked model. It measures how well we can select the best model for each target.
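Two of the metrics above drive most of the later discussion, so a minimal numpy sketch may help fix their definitions. This is an illustration, not the authors' evaluation code; the dict-of-lists layout is an assumption for the example.

```python
import numpy as np

def per_target_correlation(true_by_target, pred_by_target):
    """Average Pearson correlation of global scores within each target.

    Both arguments map a target id to the list of global scores of its
    models (true and predicted, respectively)."""
    rs = [np.corrcoef(true_by_target[t], pred_by_target[t])[0, 1]
          for t in true_by_target]
    return float(np.mean(rs))

def first_rank_loss(true_by_target, pred_by_target):
    """Average true-score gap between the best model of a target and
    the model the predictor ranks first for that target."""
    losses = []
    for t in true_by_target:
        true = np.asarray(true_by_target[t], dtype=float)
        pred = np.asarray(pred_by_target[t], dtype=float)
        losses.append(true.max() - true[pred.argmax()])
    return float(np.mean(losses))


# One target with three models: the predictor top-ranks the model with
# true score 0.7 while the best model scores 0.9, so the loss is 0.2.
t = {"T1": [0.9, 0.5, 0.7]}
assert abs(first_rank_loss(t, {"T1": [0.1, 0.2, 0.9]}) - 0.2) < 1e-9
```

A first rank loss of 0 means the top-ranked model is always the best one, which is why the paper treats it (together with per target correlation) as the metric closest to real use.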
III. IMPLEMENTATION

A. Network architecture
In order to exploit the spatial distribution of the local features, and to allow the network to compare observations and predictions locally, we have implemented a 1D fully convolutional network trained on the local scores. We used the ELU activation function [31] as a non-linearity, and also applied a small L penalty of 10− to every convolutional layer, except for the output layers. The training was guided by the Adam optimiser [32].
The convolutional block from ResNet [33] inspired our architecture due to its capacity to converge efficiently while still keeping a deep architecture that provides a large effective field of view. As shown in Figure 2, it is composed of two blocks of successive convolutions of width 3, followed by an activation function, batch normalisation [34], and dropout, whose output is summed to the inputs. When the number of input and output channels is different, we modify the skip connection to include only one convolution, activation, batch normalisation, and dropout.

FIG. 2: The 1D ResNet module, the main building block of our convolutional nets.
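A forward-pass sketch of the module in plain numpy may help fix ideas. Batch normalisation and dropout are omitted for clarity, the weights are illustrative, and only the equal-channel case is shown, where the skip connection is the identity.

```python
import numpy as np

def conv1d(x, W):
    """'Same'-padded width-3 1D convolution.

    x is (L, C_in) — one feature vector per residue;
    W is (3, C_in, C_out)."""
    L = x.shape[0]
    xp = np.pad(x, ((1, 1), (0, 0)))  # pad one residue on each side
    return np.stack([np.tensordot(xp[i:i + 3], W, axes=([0, 1], [0, 1]))
                     for i in range(L)])

def elu(z):
    """Exponential Linear Unit activation."""
    return np.where(z > 0, z, np.expm1(z))

def resnet_block(x, W1, W2):
    """One 1D residual module: two width-3 convolutions with ELU,
    summed with the input through the skip connection.
    (Batch normalisation and dropout are left out of this sketch.)"""
    h = elu(conv1d(x, W1))
    h = elu(conv1d(h, W2))
    return x + h  # the skip connection


# With zero weights the convolutions contribute nothing, so the skip
# connection passes the input through unchanged.
x = np.ones((5, 4))
W0 = np.zeros((3, 4, 4))
assert np.allclose(resnet_block(x, W0, W0), x)
```

Stacking such blocks grows the effective field of view by two residues per convolution while the identity path keeps gradients flowing, which is what lets the deep stack converge.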
C. Simple Convolutional Neural Network
The simplest implementation of a deep convolutional network consists of several branches that combine all the inputs, as can be seen in Figure 3. In one branch, the three input vectors derived from the sequence (the sequence itself, self-information, and partial entropy) are followed by a convolution of size 1, to project them into a 16-dimensional vector space per residue. They are then followed by two ResNet modules, merged into a single branch by concatenation, and followed by two more ResNet modules. The structural inputs are likewise projected into a 64-dimensional space and passed through four ResNet modules. Finally, both branches are concatenated, and sent through four more ResNet modules. A single convolutional layer of width 7 and no L penalty is used as the final layer.

FIG. 3: The convolutional architecture.

D. Sequence pre-trained network
Our dataset contains roughly 200 times more models than unique sequences, which is an obvious source of bias. For example, some of the targets are hard, and not a single model is of good quality. The network could therefore easily learn that a particular sequence is always bad, which we want to avoid; after all, a new method or more data could result in a good model in the future.

To tackle this issue, we pre-train the branch corresponding to the sequence inputs on 5687 known protein structures from PISCES [24], showing our network a much more diverse set of proteins. As shown in Figure 4a, we inserted one hidden convolutional layer connected to four more layers predicting the 3- and 6-state secondary structure, surface accessibility, and sine and cosine of the dihedral angles. We call the output of the hidden layer before these the alignment features, and it is used to replace the sequence branch in the previous network. The resulting model has exactly the same architecture and parameters, but with some weights set through training on a different task.

The performance of each auxiliary predictor is comparable to state-of-the-art methods, but due to possible overlaps between the training and test sets of the different methods, a completely fair comparison would need more careful studies that are beyond the scope of this paper.
E. Tricephalous network
The ultimate application of model quality assessment is to rank models of the same protein. In this section we describe an architecture designed to exploit this structure of the problem by reducing it to pairwise comparisons.

In our dataset, for each target we have around 200 individual models, but 20 000 pairs, which we can use as data augmentation. So, instead of feeding one structure at a time, we present the network with two models of the same target and ask it to predict the scores, as well as which model is better, for every residue, as shown in Figure 5.

In order to ensure the symmetry of the problem is respected, each model is passed through a pair of identical copies of the pre-trained network described previously, keeping the parameters of each copy tied. This is called a Siamese configuration, because the network is composed of two conjoined twins. The comparison prediction is done by a symmetrised perceptron with one hidden layer, the SortNet [35], represented as grey boxes in the Figure. SortNet is a small variation of the classical perceptron designed to represent proper preference: if we predict that a is better than b with a probability of 0.8, we will also predict b to be better than a with a probability of 0.2.

F. Regression as a classification
Deep learning usually performs better on classification than on regression tasks. Therefore, we divide the [0, 1] range into N equally sized bins and replace the mean squared error loss function with a cross entropy. In order to recover the predicted score, we then average the scores with equally spaced weights:

p = \sum_{i=1}^{N} s_i \left( \sigma_{\mathrm{low}} + i \cdot \sigma_{\mathrm{step}} \right) + \sigma_{\mathrm{offset}},  (2)

where s_i is the predicted probability of being in the i-th bin, and σ_low, σ_step, and σ_offset are three free parameters that were obtained by minimising the mean squared error on the training set.

This is the final architecture, and we will refer to it as ProQ4.

G. Dependencies
The networks were implemented with Keras [37], using Tensorflow [38] as a backend. The data are stored in HDF5 files accessed through PyTables [39]. All the networks were trained on a single Nvidia 1070Ti GPU, equipped with 8 GB of VRAM.

In order to make predictions, the only dependencies are Python 3 with Numpy, Biopython, Keras, Tensorflow, and H5Py; DSSP; and a multiple sequence alignment. All dependencies are open source and can be freely distributed. Running on GPUs requires a CUDA-enabled NVIDIA GPU card, but the method can also run on CPUs.

FIG. 4: The two stages of the pre-trained network: (a) sequence pre-training, learning the alignment features; (b) the sequence pre-trained network. The sequence pre-training is used to extract the alignment features that are used subsequently throughout the paper.

FIG. 5: The Tricephalous architecture: the two stages of the Comparative network combined at once.
IV. RESULTS AND DISCUSSION
In Tables I and II we present the results on the CASP 11 dataset for the three networks, compared with the state-of-the-art method, ProQ3D [18]. No other publicly available method was trained on LDDT, so we cannot establish a fair comparison with them. We also present the results on Cameo [40] in Tables III and IV. The last row of each table, the Tricephalous network trained on classification, is our final model, ProQ4, also plotted in Figure 6.

Of all the reported figures, we believe the per target correlation to be the most important metric, since one is usually interested in comparing models corresponding to the same protein. First rank loss is also of interest, but since it depends on a single data point per target, it is noisier.

For a baseline, we trained a simple feed-forward network on our features with a window of 21 residues, with two hidden layers of 512 units, ELU activation, Batch Normalisation, and Dropout (p_drop = 0.…).

TABLE I: Summary of results on the local scores on CASP 11.

Method                       R-local  RMSE local  R-per model
ProQ3D (retrained on LDDT)
Base MLP                     0.62     0.180       0.44
Pre-trained MLP              0.52     0.198       0.47
Simple CNN                   0.51     0.205       0.42
Pre-trained CNN              0.68     0.172       0.55
Tricephalous                 0.72     0.160       0.55
ProQ4                        0.77     0.147       0.56

R-local is the correlation between all local predicted and true scores in the dataset; RMSE stands for Root Mean Squared Error. R-per model is the average correlation between predicted and true scores for each model in the dataset. Find a more detailed explanation in section II D.

TABLE II: Summary of results on the global scores on CASP 11.
Method                       R-global  Global RMSE  R-per target  First rank loss
ProQ3D (retrained on LDDT)   0.90      0.080        0.82          0.040
Base MLP                     0.77      0.118        0.82          0.046
Pre-trained MLP              0.67      0.135        0.82          0.046
Simple CNN                   0.67      0.157        0.68          0.091
Pre-trained CNN              0.85      0.121        0.85          0.036
Tricephalous                 0.87      0.106        0.88          0.029
ProQ4

R-global is the correlation between all global predicted and true scores in the dataset; R-per target is the average correlation of global scores of models for each protein; and first rank loss is the average difference in true scores between the best model and the top-ranked one for each target.
FIG. 6: 2D histogram of local (upper) and global (lower) scores for CASP 11 and Cameo.
FIG. 7: Comparison of the performance of ProQ3D (in blue) and ProQ4 (green) in per target correlations (upper) and first rank loss (lower). For Cameo, the histograms have been truncated to the same range as CASP for clarity.

… which are indeed difficult to rank, as shown in Figure 9.
A. Number of classes in the scores
When treating the regression as a classification, we tested all splits between 2 and 9 classes, using equally spaced bins. The differences between 3 or more classes are smaller than 2% in the local, global, and per target correlations.

FIG. 8: Comparison between per target correlations and the quality of the sequence-based predictor. The correlation is weak (0.33), suggesting that the quality of the sequence-based features is not a significant limiting factor.
FIG. 9: The per target correlations of ProQ3D and ProQ4 as a function of the average quality of the model on CASP 11. The dependency is stronger for ProQ3D (R_ProQ3D = 0.52 vs R_ProQ4 = 0.…).

B. The effect of larger networks
We tested deeper networks, increasing from 8 up to 50 ResNet blocks deep; while this improved the local correlation, the global metrics remained the same, or slightly lower. For example, with a depth of 50 layers, we increased the local correlation on CASP 11 from 0.77 to 0.81, while the global correlation remained at 0.91, but the per target correlation dropped from 0.90 to 0.87. Even though the local correlations improve significantly, the per model correlation is slightly reduced from 0.56 to 0.55. Larger networks seem to predict significantly better local scores, but are hampered by biases.

TABLE III: Summary of results on the local scores on Cameo.

Method                       R-local  RMSE local  R-per model
ProQ3D (retrained on LDDT)
Base MLP                     0.48     0.232       0.42
Pre-trained MLP              0.50     0.248       0.47
Simple CNN                   0.43     0.219       0.41
Pre-trained CNN              0.61     0.202       0.55
Tricephalous                 0.65     0.199       0.56
ProQ4                        0.65     0.201       0.56

R-local is the correlation between all local predicted and true scores in the dataset; RMSE stands for Root Mean Squared Error. R-per model is the average correlation between predicted and true scores for each model in the dataset. Find a more detailed explanation in section II D.

TABLE IV: Summary of results on the global scores on Cameo.

Method                       R-global  Global RMSE  R-per target  First rank loss
ProQ3D (retrained on LDDT)
Base MLP                     0.71      0.180        0.75          0.034
Pre-trained MLP              0.76      0.202        0.76          0.032
Simple CNN                   0.70      0.163        0.71          0.041
Pre-trained CNN              0.80      0.158        0.80          0.028
Tricephalous                 0.82      0.158        0.83          0.025
ProQ4                        0.82      0.156

R-global is the correlation between all global predicted and true scores in the dataset; R-per target is the average correlation of global scores of models for each protein; and first rank loss is the average difference in true scores between the best model and the top-ranked one for each target.
C. Global vs local scores
We are training on local scores, so directly trying to optimise the local RMSE. It is surprising, then, that while the performance on local scores is significantly worse than ProQ3D, ProQ4 shows an improvement on global metrics, especially on per target correlations. So, ProQ4 is better at telling the global differences between models, but fails to locate them accurately. This may be due to the fact that during the training procedure we are presenting the network with the full model, or due to the lack of a description of the chemical properties of the model.

As shown in section IV B, an increase in local performance does not necessarily translate into global improvements.
D. Difference between CASP and Cameo
The main differences in Cameo are that there are many fewer models per target, and that they come from fewer servers. Many of the targets have easy templates available, so most of the models tend to be similar to each other.

The average first rank loss on the Cameo dataset is dominated by a single target, 4UYQ_B, with a loss of 0.44 (see Figure 7). If that target were excluded, the average first rank loss would drop to 0.021, the same as for ProQ3D.
E. Are we learning something new?
Figure 10 shows a correlation matrix between the global scores for all the architectures presented. We can clearly identify two close clusters: one for the MLP architectures, and another for all the convolutional networks using pre-trained features. Simple CNN is an outlier, showing only a moderate correlation with its architectural twin, the pre-trained network. This indicates that the Tricephalous network and ProQ4 are learning the same thing; ProQ4 is just slightly better.

On the other hand, the agreement between ProQ3D and each of our models is similar to the agreement between the model and the true values, suggesting that both are learning something fundamentally different, despite having been trained on the same dataset.
F. Method biases
There are multiple strategies to generate protein models. Different methods fall into distinct kinds of errors, and produce models of diverse chemical characteristics. This is a challenge when evaluating the model quality: a model presenting unnatural chemical properties can be due to low quality or just the existence of non-optimised local geometries, for instance caused by sub-optimal side-chain packing.

A common solution to the diversity problem is to uniformise the models by repacking the side chains with a single program such as SCWRL [41]. However, there are limitations to this approach, and it might be computationally expensive. Also, from a practical point of view, it makes a method dependent on additional programs.

In our proposed method, we use only a coarse description of the protein that is not sensitive to the packing of the side chains. This should make it quicker and easier to apply on a large scale.

Another source of bias arises when both the method generating the model and the assessment make use of the same auxiliary predictor, such as PSIPRED [8]. In this case, we find a bias for models generated with a particular method. In order to tackle this issue, we replace the comparison with explicit predictors with our own method and train on the hidden representation instead of the final output.
V. CONCLUSIONS

A. Limitation of the method
Our proposed method uses a very coarse description of the protein structures, focused on features that can be predicted from the sequence. This restricted description limits the performance at the local level, but it is compensated by an increased performance in the global and, especially, per target ranking, due to the in-built comparative structure.

Finding the right representation for the chemical properties of proteins from a deep learning point of view remains an open question, and we hope this work will incentivise this line of research.
B. The importance of structured data
One of the reasons behind the great success of deep learning is its ability to take full advantage of the structure of the data. For machine learning in bioinformatics this has not been utilised fully, except for contact prediction in the works of Golkov et al. [42] and Wang et al. [43]. Many machine learning methods in bioinformatics still rely on sliding-window approaches. Even when deep learning has been applied, this has often been limited to increasing the complexity of a multi-layer perceptron architecture.

FIG. 10: Correlation matrix between the predicted global scores on CASP 11 for each method. The order of the rows is the same as in the tables, with the last corresponding to the true scores. A similar picture is obtained if we compare local scores instead.

Since the advent of the No Free Lunch theorem [44], we know that the success of a machine learning algorithm is tied to how much domain knowledge can be included in its training. In traditional machine learning, this is done through careful feature engineering, trying to find the closest representation to our objective. In this work we propose a training framework that replaces the traditional end-to-end fitting with a multi-stage process designed to inject domain knowledge at every step of the way:

1. Spatial relationships and translation invariance are coded as convolutions.

2. The sequence information is extracted in the pre-training.

3. The tricephalous architecture encodes the ranking nature of the problem.

Both the pre-training and the comparative training bring an improvement to the results across all the metrics we have evaluated on both datasets.

The importance of structure is also seen in the effect of pre-training. While it improves CNN-based architectures, Multi-Layer Perceptron (MLP) models don't benefit, or are even hindered by pre-training.
FUNDING
This work was supported by grants from the Swedish Research Council (VR-NT 2016-03798 to AE) and the Swedish e-Science Research Center (BW). The Swedish National Infrastructure for Computing (SNIC) at NSC provided computational resources.

[1] T. Terwilliger, D. Stuart, and S. Yokoyama, Annu Rev Biophys, 371 (2009).
[2] The UniProt Consortium, Nucleic Acids Res, D158 (2017).
[3] M. Michel, D. Menéndez Hurtado, K. Uziela, and A. Elofsson, Bioinformatics, i23–i29 (2017).
[4] S. Ovchinnikov, H. Park, N. Varghese, P.-S. Huang, G. A. Pavlopoulos, D. E. Kim, H. Kamisetty, N. C. Kyrpides, and D. Baker, Science, 294–298 (2017).
[5] Y. LeCun, Y. Bengio, and G. Hinton, Nature, 436–444 (2015).
[6] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14 (MIT Press, Cambridge, MA, USA, 2014) pp. 3320–3328.
[7] B. Petersen, T. Petersen, P. Andersen, M. Nielsen, and C. Lundegaard, BMC Structural Biology, 51 (2009).
[8] D. T. Jones, Journal of Molecular Biology, 195 (1999).
[9] B. Park and M. Levitt, Journal of Molecular Biology, 367 (1996).
[10] T. Lazaridis and M. Karplus, Journal of Molecular Biology, 477 (1999).
[11] B. Wallner and A. Elofsson, Protein Science, 1073 (2003).
[12] B. Wallner, Protein Science, 900 (2006).
[13] A. Ray, E. Lindahl, and B. Wallner, BMC Bioinformatics, 224 (2012).
[14] S. Mirzaei, T. Sidi, C. Keasar, and S. Crivelli, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1 (2016).
[15] B. Manavalan and J. Lee, Bioinformatics (2017), 10.1093/bioinformatics/btx222.
[16] A. Maghrabi and L. J. McGuffin, Nucleic Acids Research (2017), 10.1093/nar/gkx332.
[17] K. Uziela, N. Shu, B. Wallner, and A. Elofsson, Scientific Reports, 33509 (2016).
[18] K. Uziela, D. Menéndez Hurtado, N. Shu, B. Wallner, and A. Elofsson, Bioinformatics, btw819 (2017).
[19] P. Benkert, S. C. E. Tosatto, and D. Schomburg, Proteins: Structure, Function, and Bioinformatics, 261 (2008).
[20] R. Cao, D. Bhattacharya, J. Hou, and J. Cheng, BMC Bioinformatics (2016), 10.1186/s12859-016-1405-y.
[21] A. Leaver-Fay, M. J. O'Meara, M. Tyka, R. Jacak, Y. Song, E. H. Kellogg, J. Thompson, I. W. Davis, R. A. Pache, S. Lyskov, J. J. Gray, T. Kortemme, J. S. Richardson, J. J. Havranek, J. Snoeyink, D. Baker, and B. Kuhlman, in Methods in Protein Design, Methods in Enzymology, Vol. 523, edited by A. E. Keating (Academic Press, 2013) pp. 109–143.
[22] K. Olechnovic and C. Venclovas, Proteins: Structure, Function, and Bioinformatics (2017), 10.1002/prot.25278.
[23] K. Uziela, D. Menéndez Hurtado, N. Shu, B. Wallner, and A. Elofsson, Proteins: Structure, Function, and Bioinformatics (2018), 10.1002/prot.25492.
[24] G. Wang and R. L. Dunbrack, Jr., Bioinformatics, 1589 (2003).
[25] L. S. Johnson, S. R. Eddy, and E. Portugaly, BMC Bioinformatics, 431 (2010).
[26] B. E. Suzek, Y. Wang, H. Huang, P. B. McGarvey, and C. H. Wu, Bioinformatics, 926 (2014).
[27] B. Rost, C. Sander, and R. Schneider, Journal of Molecular Biology, 13 (1994).
[28] B. Xue, O. Dor, E. Faraggi, and Y. Zhou, Proteins: Structure, Function, and Bioinformatics, 427 (2008).
[29] W. Kabsch and C. Sander, Biopolymers, 2577 (1983).
[30] V. Mariani, M. Biasini, A. Barbato, and T. Schwede, Bioinformatics, 2722 (2013).
[31] D. Clevert, T. Unterthiner, and S. Hochreiter, Proceedings of the 32nd International Conference on Machine Learning (2015).
[32] D. Kingma and J. Ba, International Conference for Learning Representations (2015).
[33] K. He, X. Zhang, S. Ren, and J. Sun, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 10.1109/cvpr.2016.90.
[34] S. Ioffe and C. Szegedy, Proceedings of the 32nd International Conference on Machine Learning (2015).
[35] L. Rigutini, T. Papini, M. Maggini, and F. Scarselli, IEEE Transactions on Neural Networks, 1368 (2011).
[36] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Journal of Machine Learning Research, 1929 (2014).
[37] F. Chollet et al., "Keras," https://github.com/fchollet/keras (2015).
[38] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," (2015), software available from tensorflow.org.
[39] F. Alted, I. Vilata, et al., "PyTables: Hierarchical datasets in Python," (2002–).
[40] J. Haas, A. Barbato, D. Behringer, G. Studer, S. Roth, M. Bertoni, K. Mostaguir, R. Gumienny, and T. Schwede, Proteins: Structure, Function, and Bioinformatics, 387–398 (2017).
[41] Q. Wang, A. A. Canutescu, and R. L. Dunbrack, Nature Protocols, 1832 (2008).
[42] V. Golkov, M. J. Skwark, A. Golkov, A. Dosovitskiy, T. Brox, J. Meiler, and D. Cremers, in Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Inc., 2016) pp. 4222–4230.
[43] S. Wang, S. Sun, Z. Li, R. Zhang, and J. Xu, PLoS Comput Biol, e1005324 (2017).
[44] D. H. Wolpert, Neural Computation.