Deep transfer learning in the assessment of the quality of protein models
David Menéndez Hurtado, ∗ Karolis Uziela, and Arne Elofsson † Department of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University
Motivation:
Proteins fold into complex structures that are crucial for their biological functions. Experimental determination of protein structures is costly and therefore limited to a small fraction of all known proteins. Hence, different computational structure prediction methods are necessary for the modelling of the vast majority of all proteins. In most structure prediction pipelines, the last step is to select the best available model and to estimate its accuracy. This model quality estimation problem has been growing in importance during the last decade, and progress is believed to be important for large-scale modelling of proteins. The current generation of model quality estimation programs performs well at separating incorrect and good models, but fails to consistently identify the best possible model. State-of-the-art model quality assessment methods use a combination of features that describe a model and the agreement of the model with features predicted from the protein sequence.
Results:
We first introduce a deep neural network architecture to predict model quality using significantly fewer input features than state-of-the-art methods. Thereafter, we propose a methodology to train the deep network that leverages the comparative structure of the problem. We also show the possibility of applying transfer learning on databases of known protein structures. We demonstrate its viability by reaching state-of-the-art performance using only a reduced set of input features and a coarse description of the models.
Availability:
The code will be freely available for download at github.com/ElofssonLab/ProQ4.

I. INTRODUCTION
Proteins perform the vast majority of all biological functions. They are constructed as long polymers built of twenty amino acids that fold into a three-dimensional shape after synthesis. During the last half century, the structures of more than 100 000 proteins have been experimentally obtained using X-ray crystallography, NMR, or electron microscopy. Although the experimental techniques have improved significantly, the average cost for a new protein structure is still close to $100k, limiting the number of experimentally determined protein structures [1].

A single organism contains thousands of genes, each coding for at least one protein. Due to the exponential decrease of sequencing costs, the sequences of more than 100 million proteins have been deposited in public databases like UniRef [2]. Further, almost one order of magnitude more sequences are available from metagenomic projects. This means that there are at least three orders of magnitude separating the number of known protein sequences and structures. This gap is increasing, and only computational methods will be able to close it.

Luckily, methods to generate accurate three-dimensional models of proteins exist. If the structure of a homologous protein has been solved, it can be used as a template for modelling. This method can be applied to about 50% of all protein families [3, 4]. For the rest, one has to rely on other methods. Here, rapid progress in contact prediction has recently enabled the modelling of the structure of many proteins. The accuracy of these models varies and depends on several factors, and no single method always produces the best model. Therefore, in many modelling approaches a combination of methods and/or parameter settings is used to produce multiple models. These are later analysed to identify the best one.

A. Deep learning
Deep learning is a family of machine learning algorithms that has brought a revolution in fields such as computer vision, speech recognition, and artificial intelligence [5]. The improvements can be attributed to two discoveries: (i) new algorithms make it possible to define and efficiently train much more complex models that can take advantage of large training sets; and (ii) deep learning methods can make use of the structure of the data to take a data-driven approach to feature engineering. Instead of hand-crafting high-level features, deep learning can use lower-level representations of the data that preserve its structure, such as the ordering of amino acids in a protein chain, or the proximity of pixels in an image. A deep learning model is composed of multiple layers that successively transform the inputs, learning to extract the relevant information during training; in effect, replacing manual feature engineering with automatic feature extraction. In a deep model, each layer learns to combine some of its inputs, creating a hierarchy of features automatically extracted from the data. For example, when trying to identify an object in an image, the first layers might learn to detect edges, the next layers would combine them into textures, object parts, and eventually the whole training label.

B. Transfer learning
Deep learning can be used to train high-capacity models, but it requires large amounts of data, which are not always available or cheap. However, since most of the network acts as a feature extractor, it is possible to train a network on a larger and different but related problem, and reuse the learned feature extraction. Yosinski et al. [6] studied how well these features generalise across categories in image recognition, and transfer learning has since become a standard procedure in deep learning.
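The reuse described above can be illustrated with a minimal numpy sketch: a feature extractor whose weights were fitted on a large source task is frozen, and only a small task-specific head is trained on the target task. All names, shapes, and random weights here are illustrative stand-ins, not the paper's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained feature extractor: in real transfer
# learning these weights come from training on a large related task,
# and are frozen (never updated) when solving the target task.
W_feat = rng.normal(size=(8, 16))

def elu(z):
    """Exponential Linear Unit activation."""
    return np.where(z > 0, z, np.expm1(z))

def extract_features(x):
    """Frozen, reused feature extraction (x has shape (n, 8))."""
    return elu(x @ W_feat)

# Transfer step: only this small task-specific head would be trained
# on the (much smaller) target dataset.
W_head = rng.normal(size=(16, 1)) * 0.1

def predict(x):
    return extract_features(x) @ W_head

x = rng.normal(size=(4, 8))
assert extract_features(x).shape == (4, 16)
assert predict(x).shape == (4, 1)
```

In practice the extractor can also be fine-tuned with a small learning rate instead of being fully frozen; the trade-off depends on how similar the two tasks are.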
C. Predicting protein structural features
The structure of a protein consists of secondary structure elements, α-helices and β-sheets, that are packed in such a way that hydrophobic amino acids are mostly hidden from the surface. We also know that homologous sequences fold into similar structures, so we can search sequence databases for proteins that are likely to be similar to the one we are interested in, and compile them into an aligned list called a multiple sequence alignment.

Both the secondary structure and the surface accessibility of a residue in a protein can be predicted with rather high accuracy by applying machine learning to the statistics extracted from the columns of this multiple sequence alignment. The accuracy of these methods has reached close to 80% in two- [7] or three-state [8] classifications.

D. CASP: Critical Assessment of protein Structure Prediction
CASP is a biennial experiment aimed at assessing the state of the structure prediction field. The organizers release the sequences of proteins of hitherto unknown structure, and allow a few days for groups around the world to submit predictions. In a second stage, these models are published and evaluated by model quality assessment methods. The latest editions count around 100 individual sequences (also called targets), and received around 200 models per target, coming from circa 30 independent methods.
E. Model quality assessment background
Estimating the free energies of protein models has a long history within the protein structure prediction field [9, 10]. Here, the ultimate goal is to understand the fundamental physics governing protein folding to such an extent that it is possible to simulate the folding of a protein. The purpose of a model quality assessment program is to accurately estimate the distance of the model from the native structure.

FIG. 1: Detail of the 3D structure of the protein 3TDU. Highlighted in yellow are the residues that smoothly transition between helix and coil. Predictions are commonly wrong about the exact position of the boundary.

There are two general strategies to evaluate the quality of a single protein model: comparison with auxiliary predictions, and evaluation of physico-chemical properties of the model. A limitation of the second approach is that even if a method could perfectly describe the free energy of a protein model, this measure might not be at all correlated with the difference of a model from the native structure. Take the trivial example of a model where all atoms except one are perfectly placed. This last atom is then positioned on top of another atom. Such a model would have a near-infinite free energy, but by any distance measure it would be almost perfect.

On the other hand, methods relying solely on predictions might provide important low-resolution information and reduce the complexity of 3D prediction to simpler problems, such as secondary structure. A good model is expected to agree with the predictions. In addition, not all disagreements carry the same information. The exact boundaries of secondary structure elements are notoriously difficult to predict, and depend on the exact thresholds used to define them, as seen in Figure 1. In contrast, it is rare that predictions confuse different secondary structure classes outright.
F. Related work
For more than a decade, several groups have developed protein model quality assessment methods using various input features [11–16]. The improvements achieved during the last few years can be attributed to small, but significant, improvements to the methods by (i) including additional descriptions of a protein model [17], (ii) applying deep learning techniques [18], and (iii) combining many features [16].

Here follows a brief description of the inputs and algorithms used by other comparable methods. These mostly focus on physico-chemical descriptors, such as statistical potentials, and train simple machine learning algorithms.

• QMEAN [19]: Linear combination of the agreement of secondary structure and surface area, and three statistical potentials describing the torsion angles, amino acid contacts, and exposure to the water.

• DeepQA [20]: Deep Belief Network combining several scoring potentials and energies, other quality assessment methods, and seven different agreement metrics between observed and predicted secondary structure and surface area.

• ProQ3 [17]: A linear SVM trained on atom and amino acid contacts, observed secondary structure, multiple sequence alignment statistics, agreements with predicted secondary structure and surface area, as well as terms of the Talaris energy function [21].

• ProQ3D [18]: Same input as ProQ3, but replacing the linear SVM with a multi-layer perceptron.

• VoroMQA [22]: Statistical potential based on the frequencies of observed atom contacts.

• SVMQA [15]: SVM combining different statistical potentials and agreement with predicted secondary structure and surface area.

In contrast to all other methods, our method uses only coarse structural features and a multiple sequence alignment, but no statistical potentials nor any other chemical description.
II. METHODS

A. Datasets
In order to directly compare results with the most recent method, ProQ3D [18] retrained on LDDT [23], we have used the same datasets for training and testing: all the submitted models to CASP editions 9 and 10 for training, and CASP 11 for evaluation, excluding all the targets shorter than 50 residues. In total, we have 57 263 models from 212 different targets in the training set, and 14 580 models from 79 targets in the test set, excluding targets cancelled by the organisers.

We also present the results on the same subset of the Cameo dataset used by ProQ3D: 19 899 models from 676 different targets.

For the pre-trained networks, we used 5687 structures from the PISCES dataset [24], while an additional 300 were left out for validation. The multiple sequence alignments were produced using Jackhmmer [25] to search Uniref50 [26], with an E-value threshold of 10− for 3 iterations.

B. Description of inputs
Prediction of structural features of a protein is improved by using multiple sequence alignments [27]. From the multiple sequence alignment we extract two statistics: the self-information (Equation 1a) and the partial entropy (Equation 1b) of the position:

I_i = -\log\left(\frac{p_i}{\bar{p}_i}\right),  (1a)

S_i = -p_i \log\left(\frac{p_i}{\bar{p}_i}\right),  (1b)

where p_i is the frequency of the amino acid i at the position and \bar{p}_i is the average frequency in the data set. We also include the protein sequence itself using sparse encoding.

The protein models are defined in a coarse representation by the sine and cosine of the dihedral angles [28] ϕ and ψ, the secondary structure, relative surface area, and energies of hydrogen bonds in the backbone, as defined by DSSP [29]. In DSSP, 8 different secondary structure states are assigned to a protein. Due to lack of data, we merged the following two pairs of states: G (3₁₀ helix) and I (π-helix), and T (hydrogen-bonded turn) and S (bend).

C. Output: the target function
There are several different metrics that can be used to evaluate the similarity between a protein model and the native structure. In this work we choose to use a local scoring function: the Local Distance Difference Test (LDDT) [30], a number between 0 and 1 that represents the fraction of conserved contacts between all pairs of atoms in the native structure for several distance thresholds. An LDDT of 1 means perfect agreement, and a good model shall have scores higher than 0.5. Other quality estimation metrics could also be used, providing similar performance. To estimate the quality of a model, we define the global score to be the average of the local scores, giving a score of 0 to any residue missing from the model, and ignoring those absent from the native structure.
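The global score defined above can be sketched as a small function. This is a hedged illustration of the averaging rule only (score 0 for residues missing from the model, residues absent from the native ignored); the bookkeeping in the actual implementation may differ.

```python
def global_score(local_lddt, in_model, in_native):
    """Global quality as the mean of per-residue LDDT scores.

    local_lddt: dict residue_index -> LDDT in [0, 1] for scored residues
    in_model:   set of residue indices present in the model
    in_native:  set of residue indices present in the native structure

    Residues missing from the model contribute 0; residues absent
    from the native structure are ignored entirely, since they cannot
    be evaluated against anything.
    """
    scored = [local_lddt.get(i, 0.0) if i in in_model else 0.0
              for i in sorted(in_native)]
    return sum(scored) / len(scored) if scored else 0.0


# Example: native has residues 1-4, the model only covers 1-3.
# Residue 4 counts as 0, so the global score is (1.0+0.5+0.5+0)/4.
s = global_score({1: 1.0, 2: 0.5, 3: 0.5}, {1, 2, 3}, {1, 2, 3, 4})
assert abs(s - 0.5) < 1e-12
```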
D. Figures of merit
The performance of the models will be evaluated on several metrics.

• Local correlation: Pearson correlation between the predicted and true values assigned to every residue. This measures the reliability of the predicted scores assigned to each residue.

• Local RMSE: Root Mean Squared Error between the aforementioned predicted local scores and the true values.

• Per model correlation: average Pearson correlation between predicted and true values of every residue for every model. This measures how well we can differentiate the correct and incorrect parts of the model.

• Global correlation: Pearson correlation between predicted and true values for the overall model. This is defined as the average of the local scores, ignoring any residue not present in the native structure, and setting the score of missing residues to 0. This measures how well we can rank targets on a global scale.

• Global RMSE: Root Mean Squared Error between the global predicted and true scores.

• Per target correlation: average Pearson correlation for the global scores per target. A measurement of the ability of the program to rank models from the same target.

• First rank loss: average difference between the best and the top-ranked model. It measures how well we can select the best model for each target.
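Two of the metrics above drive most of the later discussion, so a minimal numpy sketch may help fix their definitions. This is an illustration, not the authors' evaluation code; the dict-of-lists layout is an assumption for the example.

```python
import numpy as np

def per_target_correlation(true_by_target, pred_by_target):
    """Average Pearson correlation of global scores within each target.

    Both arguments map a target id to the list of global scores of its
    models (true and predicted, respectively)."""
    rs = [np.corrcoef(true_by_target[t], pred_by_target[t])[0, 1]
          for t in true_by_target]
    return float(np.mean(rs))

def first_rank_loss(true_by_target, pred_by_target):
    """Average true-score gap between the best model of a target and
    the model the predictor ranks first for that target."""
    losses = []
    for t in true_by_target:
        true = np.asarray(true_by_target[t], dtype=float)
        pred = np.asarray(pred_by_target[t], dtype=float)
        losses.append(true.max() - true[pred.argmax()])
    return float(np.mean(losses))


# One target with three models: the predictor top-ranks the model with
# true score 0.7 while the best model scores 0.9, so the loss is 0.2.
t = {"T1": [0.9, 0.5, 0.7]}
assert abs(first_rank_loss(t, {"T1": [0.1, 0.2, 0.9]}) - 0.2) < 1e-9
```

A first rank loss of 0 means the top-ranked model is always the best one, which is why the paper treats it (together with per target correlation) as the metric closest to real use.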
III. IMPLEMENTATION

A. Network architecture
In order to exploit the spatial distribution of the local features, and to allow the network to compare observations and predictions locally, we have implemented a 1D fully convolutional network trained on the local scores. We used the ELU activation function [31] as a non-linearity, and also applied a small L penalty of 10− to every convolutional layer, except for the output layers. The training was guided by the Adam optimiser [32].
The convolutional block from ResNet [33] inspired our architecture due to its capacity to converge efficiently while still keeping a deep architecture that provides a large effective field of view. As shown in Figure 2, it is composed of two blocks of successive convolutions of width 3, followed by an activation function, batch normalisation [34], and dropout, whose output is summed to the inputs. When the number of input and output channels is different, we modify the skip connection to include only one convolution, activation, batch normalisation, and dropout.

FIG. 2: The 1D ResNet module, the main building block of our convolutional nets.
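A forward-pass sketch of the module in plain numpy may help fix ideas. Batch normalisation and dropout are omitted for clarity, the weights are illustrative, and only the equal-channel case is shown, where the skip connection is the identity.

```python
import numpy as np

def conv1d(x, W):
    """'Same'-padded width-3 1D convolution.

    x is (L, C_in) — one feature vector per residue;
    W is (3, C_in, C_out)."""
    L = x.shape[0]
    xp = np.pad(x, ((1, 1), (0, 0)))  # pad one residue on each side
    return np.stack([np.tensordot(xp[i:i + 3], W, axes=([0, 1], [0, 1]))
                     for i in range(L)])

def elu(z):
    """Exponential Linear Unit activation."""
    return np.where(z > 0, z, np.expm1(z))

def resnet_block(x, W1, W2):
    """One 1D residual module: two width-3 convolutions with ELU,
    summed with the input through the skip connection.
    (Batch normalisation and dropout are left out of this sketch.)"""
    h = elu(conv1d(x, W1))
    h = elu(conv1d(h, W2))
    return x + h  # the skip connection


# With zero weights the convolutions contribute nothing, so the skip
# connection passes the input through unchanged.
x = np.ones((5, 4))
W0 = np.zeros((3, 4, 4))
assert np.allclose(resnet_block(x, W0, W0), x)
```

Stacking such blocks grows the effective field of view by two residues per convolution while the identity path keeps gradients flowing, which is what lets the deep stack converge.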
C. Simple Convolutional Neural Network
The simplest implementation of a deep convolutional network consists of several branches that combine all the inputs, as can be seen in Figure 3. In one branch, the three input vectors derived from the sequence (the sequence itself, self-information, and partial entropy) are followed by a convolution of size 1, to project them into a 16-dimensional vector space per residue. They are then followed by two ResNet modules, merged into a single branch by concatenation, and followed by two more ResNet modules. The structural inputs are likewise projected into a 64-dimensional space and passed through four ResNet modules. Finally, both branches are concatenated, and sent through four more ResNet modules. A single convolutional layer of width 7 and no L penalty is used as the final layer.

FIG. 3: The convolutional architecture.

D. Sequence pre-trained network
Our dataset contains roughly 200 times more models than unique sequences, which is an obvious source of bias. For example, some of the targets are hard, and not a single model is of good quality. The network could therefore easily learn that a particular sequence is always bad, which we want to avoid; after all, a new method or more data could result in a good model in the future.

To tackle this issue, we pre-train the branch corresponding to the sequence inputs on 5687 known protein structures from PISCES [24], showing our network a much more diverse set of proteins. As shown in Figure 4a, we inserted one hidden convolutional layer connected to four more layers predicting the 3- and 6-state secondary structure, surface accessibility, and sine and cosine of the dihedral angles. We call the output of the hidden layer before these the alignment features, and it is used to replace the sequence branch in the previous network. The resulting model has exactly the same architecture and parameters, but with some weights set through training on a different task.

The performance of each auxiliary predictor is comparable to state-of-the-art methods, but due to possible overlaps between the training and test sets of the different methods, a completely fair comparison would need more careful studies that are beyond the scope of this paper.
E. Tricephalous network
The ultimate application of model quality assessment is to rank models of the same protein. In this section we describe an architecture designed to exploit this structure of the problem by reducing it to pairwise comparisons.

In our dataset, for each target we have around 200 individual models, but 20 000 pairs, which we can use as data augmentation. So, instead of feeding one structure at a time, we present the network with two models of the same target and ask it to predict the scores, as well as which model is better, for every residue, as shown in Figure 5.

In order to ensure the symmetry of the problem is respected, each model is passed through a pair of identical copies of the pre-trained network described previously, keeping the parameters of each copy tied. This is called a Siamese configuration, because the network is composed of two conjoined twins. The comparison prediction is done by a symmetrised perceptron with one hidden layer, the SortNet [35], represented as grey boxes in the Figure. SortNet is a small variation of the classical perceptron designed to represent proper preference: if we predict that a is better than b with a probability of 0.8, we will also predict b to be better than a with a probability of 0.2.

F. Regression as a classification
Deep learning usually performs better on classification than on regression tasks. Therefore, we divide the [0, 1] range into N equally sized bins and replace the mean squared error loss function with a cross entropy. In order to recover the predicted score, we then average the scores with equally spaced weights:

p = \sum_{i=1}^{N} s_i \left( \sigma_{\mathrm{low}} + i \cdot \sigma_{\mathrm{step}} \right) + \sigma_{\mathrm{offset}},  (2)

where s_i is the predicted probability of being in the i-th bin, and σ_low, σ_step, and σ_offset are three free parameters that were obtained by minimising the mean squared error on the training set.

This is the final architecture, and we will refer to it as ProQ4.

G. Dependencies
The networks were implemented with Keras [37], using Tensorflow [38] as a backend. The data are stored in HDF5 files accessed through PyTables [39]. All the networks were trained on a single Nvidia 1070Ti GPU, equipped with 8 GB of VRAM.

In order to make predictions, the only dependencies are Python 3 with Numpy, Biopython, Keras, Tensorflow, and H5Py; DSSP; and a multiple sequence alignment. All dependencies are open source and can be freely distributed. Running on GPUs requires a CUDA-enabled NVIDIA GPU card, but the method can also run on CPUs.

FIG. 4: The two stages of the pre-trained network: (a) sequence pre-training, learning the alignment features; (b) the sequence pre-trained network. The sequence pre-training is used to extract the alignment features that are used subsequently throughout the paper.

FIG. 5: The Tricephalous architecture: the two stages of the Comparative network combined at once.
IV. RESULTS AND DISCUSSION
In Tables I and II we present the results on the CASP 11 dataset for the three networks, compared with the state-of-the-art method, ProQ3D [18]. No other publicly available method was trained on LDDT, so we cannot establish a fair comparison with them. We also present the results on Cameo [40] in Tables III and IV. The last row of each table, the Tricephalous network trained on classification, is our final model, ProQ4, also plotted in Figure 6.

Of all the reported figures, we believe the per target correlation to be the most important metric, since one is usually interested in comparing models corresponding to the same protein. First rank loss is also of interest, but since it depends on a single data point per target, it is noisier.

For a baseline, we trained a simple feed-forward network on our features with a window of 21 residues, with two hidden layers of 512 units, ELU activation, Batch Normalisation, and Dropout (p_drop = 0.…).

TABLE I: Summary of results on the local scores on CASP 11.

Method                       R-local  RMSE local  R-per model
ProQ3D (retrained on LDDT)
Base MLP                     0.62     0.180       0.44
Pre-trained MLP              0.52     0.198       0.47
Simple CNN                   0.51     0.205       0.42
Pre-trained CNN              0.68     0.172       0.55
Tricephalous                 0.72     0.160       0.55
ProQ4                        0.77     0.147       0.56

R-local is the correlation between all local predicted and true scores in the dataset; RMSE stands for Root Mean Squared Error. R-per model is the average correlation between predicted and true scores for each model in the dataset. Find a more detailed explanation in section II D.

TABLE II: Summary of results on the global scores on CASP 11.
Method                       R-global  Global RMSE  R-per target  First rank loss
ProQ3D (retrained on LDDT)   0.90      0.080        0.82          0.040
Base MLP                     0.77      0.118        0.82          0.046
Pre-trained MLP              0.67      0.135        0.82          0.046
Simple CNN                   0.67      0.157        0.68          0.091
Pre-trained CNN              0.85      0.121        0.85          0.036
Tricephalous                 0.87      0.106        0.88          0.029
ProQ4

R-global is the correlation between all global predicted and true scores in the dataset; R-per target is the average correlation of global scores of models for each protein; and first rank loss is the average difference in true scores between the best model and the top-ranked one for each target.
FIG. 6: 2D histogram of local (upper) and global (lower) scores for CASP 11 and Cameo.
FIG. 7: Comparison of the performance of ProQ3D (in blue) and ProQ4 (green) in per target correlations (upper) and first rank loss (lower). For Cameo, the histograms have been truncated to the same range as CASP for clarity.

… which are indeed difficult to rank, as shown in Figure 9.
A. Number of classes in the scores
When treating the regression as a classification, we tested all splits between 2 and 9 classes, using equally spaced bins. The differences between 3 or more classes are smaller than 2% in the local, global, and per target correlations.

FIG. 8: Comparison between per target correlations and the quality of the sequence-based predictor. The correlation is weak (0.33), suggesting that the quality of the sequence-based features is not a significant limiting factor.
FIG. 9: The per target correlations of ProQ3D and ProQ4 as a function of the average quality of the model on CASP 11. The dependency is stronger for ProQ3D (R_ProQ3D = 0.52 vs R_ProQ4 = 0.…).

B. The effect of larger networks
We tested deeper networks, increasing from 8 up to 50 ResNet blocks deep; while this improved the local correlation, the global metrics remained the same, or slightly lower. For example, with a depth of 50 layers, we increased the local correlation on CASP 11 from 0.77 to 0.81, while the global correlation remained at 0.91, but the per target correlation dropped from 0.90 to 0.87. Even though the local correlations improve significantly, the per model correlation is slightly reduced from 0.56 to 0.55. Larger networks seem to predict significantly better local scores, but are hampered by biases.

TABLE III: Summary of results on the local scores on Cameo.

Method                       R-local  RMSE local  R-per model
ProQ3D (retrained on LDDT)
Base MLP                     0.48     0.232       0.42
Pre-trained MLP              0.50     0.248       0.47
Simple CNN                   0.43     0.219       0.41
Pre-trained CNN              0.61     0.202       0.55
Tricephalous                 0.65     0.199       0.56
ProQ4                        0.65     0.201       0.56

R-local is the correlation between all local predicted and true scores in the dataset; RMSE stands for Root Mean Squared Error. R-per model is the average correlation between predicted and true scores for each model in the dataset. Find a more detailed explanation in section II D.

TABLE IV: Summary of results on the global scores on Cameo.

Method                       R-global  Global RMSE  R-per target  First rank loss
ProQ3D (retrained on LDDT)
Base MLP                     0.71      0.180        0.75          0.034
Pre-trained MLP              0.76      0.202        0.76          0.032
Simple CNN                   0.70      0.163        0.71          0.041
Pre-trained CNN              0.80      0.158        0.80          0.028
Tricephalous                 0.82      0.158        0.83          0.025
ProQ4                        0.82      0.156

R-global is the correlation between all global predicted and true scores in the dataset; R-per target is the average correlation of global scores of models for each protein; and first rank loss is the average difference in true scores between the best model and the top-ranked one for each target.
C. Global vs local scores
We are training on local scores, so directly trying to optimise the local RMSE. It is surprising, then, that while the performance on local scores is significantly worse than ProQ3D, ProQ4 shows an improvement on global metrics, especially on per target correlations. So, ProQ4 is better at telling the global differences between models, but fails to locate them accurately. This may be due to the fact that during the training procedure we are presenting the network with the full model, or due to the lack of a description of the chemical properties of the model.

As shown in section IV B, an increase in local performance does not necessarily translate into global improvements.
D. Difference between CASP and Cameo
The main differences in Cameo are that there are many fewer models per target, and that they come from fewer servers. Many of the targets have easy templates available, so most of the models tend to be similar to each other.

The average first rank loss on the Cameo dataset is dominated by a single target, 4UYQ_B, with a loss of 0.44 (see Figure 7). If that target were excluded, the average first rank loss would drop to 0.021, the same as for ProQ3D.
E. Are we learning something new?
Figure 10 shows a correlation matrix between the global scores for all the architectures presented. We can clearly identify two close clusters: one for the MLP architectures, and another for all the convolutional networks using pre-trained features. Simple CNN is an outlier, showing only a moderate correlation with its architectural twin, the pre-trained network. This indicates that the Tricephalous network and ProQ4 are learning the same thing; ProQ4 is just slightly better.

On the other hand, the agreement between ProQ3D and each of our models is similar to the agreement between the model and the true values, suggesting that both are learning something fundamentally different, despite having been trained on the same dataset.
F. Method biases
There are multiple strategies to generate protein models. Different methods fall into distinct kinds of errors, and produce models of diverse chemical characteristics. This is a challenge when evaluating the model quality: a model presenting unnatural chemical properties can be due to low quality or just the existence of non-optimised local geometries, for instance caused by sub-optimal side-chain packing.

A common solution to the diversity problem is to uniformise the models by repacking the side chains with a single program such as SCWRL [41]. However, there are limitations to this approach, and it might be computationally expensive. Also, from a practical point of view, it makes a method dependent on additional programs.

In our proposed method, we use only a coarse description of the protein that is not sensitive to the packing of the side chains. This should make it quicker and easier to apply on a large scale.

Another source of bias arises when both the method generating the model and the assessment make use of the same auxiliary predictor, such as PSIPRED [8]. In this case, we find a bias for models generated with a particular method. In order to tackle this issue, we replace the comparison with explicit predictors with our own method and train on the hidden representation instead of the final output.
V. CONCLUSIONS

A. Limitation of the method
Our proposed method uses a very coarse description of the protein structures, focused on features that can be predicted from the sequence. This restricted description limits the performance at the local level, but it is compensated by an increased performance in the global and, especially, per target ranking, due to the in-built comparative structure.

Finding the right representation for the chemical properties of proteins from a deep learning point of view remains an open question, and we hope this work will incentivise this line of research.
B. The importance of structured data
One of the reasons behind the great success of deep learning is its ability to take full advantage of the structure of the data. For machine learning in bioinformatics this has not been utilised fully, except for contact prediction in the works of Golkov et al. [42] and Wang et al. [43]. Many machine learning methods in bioinformatics still rely on sliding-window approaches. Even when deep learning has been applied, this has often been limited to increasing the complexity of a multi-layer perceptron architecture.

FIG. 10: Correlation matrix between the predicted global scores on CASP 11 for each method. The order of the rows is the same as in the tables, with the last corresponding to the true scores. A similar picture is obtained if we compare local scores instead.

Since the advent of the No Free Lunch theorem [44], we know that the success of a machine learning algorithm is tied to how much domain knowledge can be included in its training. In traditional machine learning, this is done through careful feature engineering, trying to find the closest representation to our objective. In this work we propose a training framework that replaces the traditional end-to-end fitting with a multi-stage process designed to inject domain knowledge at every step of the way:

1. Spatial relationships and translation invariance are coded as convolutions.

2. The sequence information is extracted in the pre-training.

3. The tricephalous architecture encodes the ranking nature of the problem.

Both the pre-training and the comparative training bring an improvement to the results across all the metrics we have evaluated on both datasets.

The importance of structure is also seen in the effect of pre-training. While it improves CNN-based architectures, Multi-Layer Perceptron (MLP) models don't benefit, or are even hindered by pre-training.
FUNDING
This work was supported by grants from the Swedish Research Council (VR-NT 2016-03798 to AE) and the Swedish e-Science Research Center (BW). The Swedish National Infrastructure for Computing (SNIC) at NSC provided computational resources.

[1] T. Terwilliger, D. Stuart, and S. Yokoyama, Annu Rev Biophys, 371 (2009).
[2] The UniProt Consortium, Nucleic Acids Res, D158 (2017).
[3] M. Michel, D. Menéndez Hurtado, K. Uziela, and A. Elofsson, Bioinformatics, i23–i29 (2017).
[4] S. Ovchinnikov, H. Park, N. Varghese, P.-S. Huang, G. A. Pavlopoulos, D. E. Kim, H. Kamisetty, N. C. Kyrpides, and D. Baker, Science, 294–298 (2017).
[5] Y. LeCun, Y. Bengio, and G. Hinton, Nature, 436–444 (2015).
[6] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14 (MIT Press, Cambridge, MA, USA, 2014) pp. 3320–3328.
[7] B. Petersen, T. Petersen, P. Andersen, M. Nielsen, and C. Lundegaard, BMC Structural Biology, 51 (2009).
[8] D. T. Jones, Journal of Molecular Biology, 195 (1999).
[9] B. Park and M. Levitt, Journal of Molecular Biology, 367 (1996).
[10] T. Lazaridis and M. Karplus, Journal of Molecular Biology, 477 (1999).
[11] B. Wallner and A. Elofsson, Protein Science, 1073 (2003).
[12] B. Wallner, Protein Science, 900 (2006).
[13] A. Ray, E. Lindahl, and B. Wallner, BMC Bioinformatics, 224 (2012).
[14] S. Mirzaei, T. Sidi, C. Keasar, and S. Crivelli, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1 (2016).
[15] B. Manavalan and J. Lee, Bioinformatics (2017), 10.1093/bioinformatics/btx222.
[16] A. Maghrabi and L. J. McGuffin, Nucleic Acids Research (2017), 10.1093/nar/gkx332.
[17] K. Uziela, N. Shu, B. Wallner, and A. Elofsson, Scientific Reports, 33509 (2016).
[18] K. Uziela, D. Menéndez Hurtado, N. Shu, B. Wallner, and A. Elofsson, Bioinformatics, btw819 (2017).
[19] P. Benkert, S. C. E. Tosatto, and D. Schomburg, Proteins: Structure, Function, and Bioinformatics, 261 (2008).
[20] R. Cao, D. Bhattacharya, J. Hou, and J. Cheng, BMC Bioinformatics (2016), 10.1186/s12859-016-1405-y.
[21] A. Leaver-Fay, M. J. O'Meara, M. Tyka, R. Jacak, Y. Song, E. H. Kellogg, J. Thompson, I. W. Davis, R. A. Pache, S. Lyskov, J. J. Gray, T. Kortemme, J. S. Richardson, J. J. Havranek, J. Snoeyink, D. Baker, and B. Kuhlman, in Methods in Protein Design, Methods in Enzymology, Vol. 523, edited by A. E. Keating (Academic Press, 2013) pp. 109–143.
[22] K. Olechnovic and C. Venclovas, Proteins: Structure, Function, and Bioinformatics (2017), 10.1002/prot.25278.
[23] K. Uziela, D. Menéndez Hurtado, N. Shu, B. Wallner, and A. Elofsson, Proteins: Structure, Function, and Bioinformatics (2018), 10.1002/prot.25492.
[24] G. Wang and R. L. Dunbrack, Jr., Bioinformatics, 1589 (2003).
[25] L. S. Johnson, S. R. Eddy, and E. Portugaly, BMC Bioinformatics, 431 (2010).
[26] B. E. Suzek, Y. Wang, H. Huang, P. B. McGarvey, and C. H. Wu, Bioinformatics, 926 (2014).
[27] B. Rost, C. Sander, and R. Schneider, Journal of Molecular Biology, 13 (1994).
[28] B. Xue, O. Dor, E. Faraggi, and Y. Zhou, Proteins: Structure, Function, and Bioinformatics, 427 (2008).
[29] W. Kabsch and C. Sander, Biopolymers, 2577 (1983).
[30] V. Mariani, M. Biasini, A. Barbato, and T. Schwede, Bioinformatics, 2722 (2013).
[31] D. Clevert, T. Unterthiner, and S. Hochreiter, Proceedings of the 32nd International Conference on Machine Learning (2015).
[32] D. Kingma and J. Ba, International Conference for Learning Representations (2015).
[33] K. He, X. Zhang, S. Ren, and J. Sun, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), 10.1109/cvpr.2016.90.
[34] S. Ioffe and C. Szegedy, Proceedings of the 32nd International Conference on Machine Learning (2015).
[35] L. Rigutini, T. Papini, M. Maggini, and F. Scarselli, IEEE Transactions on Neural Networks, 1368 (2011).
[36] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Journal of Machine Learning Research, 1929 (2014).
[37] F. Chollet et al., "Keras," https://github.com/fchollet/keras (2015).
[38] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," (2015), software available from tensorflow.org.
[39] F. Alted, I. Vilata, et al., "PyTables: Hierarchical datasets in Python," (2002–).
[40] J. Haas, A. Barbato, D. Behringer, G. Studer, S. Roth, M. Bertoni, K. Mostaguir, R. Gumienny, and T. Schwede, Proteins: Structure, Function, and Bioinformatics, 387–398 (2017).
[41] Q. Wang, A. A. Canutescu, and R. L. Dunbrack, Nature Protocols, 1832 (2008).
[42] V. Golkov, M. J. Skwark, A. Golkov, A. Dosovitskiy, T. Brox, J. Meiler, and D. Cremers, in Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Inc., 2016) pp. 4222–4230.
[43] S. Wang, S. Sun, Z. Li, R. Zhang, and J. Xu, PLoS Comput Biol, e1005324 (2017).
[44] D. H. Wolpert, Neural Computation.