Align-gram: Rethinking the Skip-gram Model for Protein Sequence Analysis
Nabil Ibtehaz, S. M. Shakhawat Hossain Sourav, Md. Shamsuzzoha Bayzid, M. Sohel Rahman
Samsung R&D Institute, Bangladesh
Department of CSE, BUET, ECE Building, West Palasi, Dhaka-1205, Bangladesh
*Corresponding author: [email protected]
December 8, 2020
Abstract
Background: The inception of next-generation sequencing technologies has exponentially increased the volume of biological sequence data. Protein sequences, often quoted as the 'language of life', have been analyzed for a multitude of applications and inferences.

Motivation: Owing to the rapid development of deep learning, in recent years there have been a number of breakthroughs in the domain of Natural Language Processing. Since these methods are capable of performing different tasks when trained with a sufficient amount of data, off-the-shelf models are used to perform various biological applications. In this study, we investigated the applicability of the popular Skip-gram model for protein sequence analysis and made an attempt to incorporate some biological insights into it.

Results: We propose a novel k-mer embedding scheme, Align-gram, which is capable of mapping similar k-mers close to each other in a vector space. Furthermore, we experiment with other sequence-based protein representations and observe that the embeddings derived from Align-gram aid modeling and training deep learning models better. Our experiments with a simple baseline LSTM model and the much more complex CNN model of DeepGOPlus show the potential of Align-gram in performing different types of deep learning applications for protein sequence analysis.
Index terms— Deep Learning, k-mer, Protein Sequence Analysis, Skip-gram Model, Word Embedding

1 Introduction
With the rapid development of sequencing technology, there has been an exponential increase in the amount of biological sequence data [1]. This abundance of sequence data enables us to apply big data and deep learning-based technologies in biological sequence analysis, which is revolutionizing almost all the different aspects of life [2]. Most notably, the application of deep learning in protein sequence analysis is showing great promise [1, 3, 4]. Previously, machine learning-based methods trained with biological features have been used in tasks like secondary structure prediction [5], surface residue identification [6], protein subcellular localization [7], etc. Deep learning-based approaches are, arguably, superior to and more convenient than the traditional machine learning based bioinformatics analysis pipeline, as they make the feature extraction step unnecessary and are capable of working with the raw data instead. Since the identification and extraction of such biological features are time-consuming, expensive, and require a lot of effort, deep learning-based methods are potentially more effective to keep up with the ever-increasing number of protein sequences. As a result, more and more deep learning models are being developed to perform tasks like contact prediction [8], protein secondary structure prediction [9], protein function prediction [10], protein sub-cellular localization [11], peptide-MHC binding prediction [12], etc.

The selection of the input representation is the first step in leveraging deep learning approaches in protein sequence analysis. The idea is to break the protein sequences into tokens (amino acids or k-mers) and assign a numeric value or vector to each of the tokens. This assignment can be done using several approaches. The simplest and most popular method is to apply one-hot encoding [4], which assumes that different amino acids or k-mers are completely uncorrelated with each other and accordingly treats them independently. The second-most widely used approach is to use the BLOcks SUbstitution Matrix (BLOSUM) [13], encoding the amino acids using their corresponding rows in the BLOSUM matrix [14, 15]. This incorporates some evolutionary information about the amino acids, as BLOSUM values signify which pairs of amino acids are more likely to substitute for one another during the course of evolution. This sense of relation between the amino acids may prove beneficial in certain analysis tasks. However, sometimes the simpler one-hot encoding performs better [16], and sometimes the combination of both yields superior results [12]. In addition to sequence-based approaches, evolutionary information like Position-Specific Scoring Matrices (PSSM) [17, 18], physicochemical properties [19, 20], and handcrafted features [21, 22] have also shown great promise. Although these latter representations may often yield surpassing results, this usually comes at the cost of a significant amount of analysis, study, and experimentation, which are essential to compute them. When we consider the level of effort and contrast it with the almost on-par results obtained from using simpler sequence-based representations in novel network architectures, their utility and enticement rather fade away.

Distributed representation of words has been one of the most influential and ground-breaking works in Natural Language Processing (NLP) [23, 24]. Earlier works, using one-hot encoding, treated words as independent tokens.
In reality, however, words are interrelated with each other: some are similar like 'Perfect' and 'Excellent', some are opposite like 'Swift' and 'Sluggish', whereas some are completely unrelated like 'Computer' and 'Penguin'. Thus, it is understandable that leveraging these relations between and among the words, instead of treating them as completely separate entities, will greatly benefit modeling. Although the idea of generating representations for words was investigated quite early [25], the simplistic nature of one-hot encoding retained its acceptance, mostly because such simple approaches proved to be robust when trained on a large amount of data [23]. Nevertheless, with their revolutionary works, Mikolov et al. popularized the concept of word embedding [23, 24]. Now, the word2vec [26] and doc2vec [27] models are being applied in a multitude of applications. Like all other domains working with text data, bioinformatics has also embraced the potentials and possibilities of such word embedding models [28, 29, 30, 31]. In the context of bioinformatics, the concept of words is translated to k-mers, and using a large database of protein sequences instead of a text corpus, embeddings are learned in an unsupervised manner. This type of approach has also obtained success in certain applications [32, 33, 34].

Despite the success of such approaches, in this work, we have made an attempt to investigate whether off-the-shelf NLP concepts like the popular and widely adopted Skip-gram model are compatible with bioinformatics tasks like protein sequence analysis. Deep neural networks being universal function approximators, it is always possible to model any problem with them, provided that we have a sufficient amount of data. Nevertheless, we sought out means to coalesce some biological insights with the concepts of NLP. While doing so, we identified some potential pitfalls of applying off-the-shelf Skip-gram models for protein sequence analysis and propose a novel model, Align-gram, which is likely to be more suitable for bioinformatics applications. We first present a brief description of the popular Skip-gram model.

2 Methods

2.1 The Skip-gram Model
The Skip-gram model [23] was developed to generate embeddings for words such that the syntactic and semantic relations between the words are preserved. The model constructs a vector space and maps the different words with the objective of placing similar words close to each other. To achieve this, a shallow neural network with a single hidden layer is used, which tries to predict the nearby words of a given input word. To train the model, a large text corpus is used; the sentences are tokenized into words, and each word is represented by a unique id. For each word w in the sentences we consider a window, i.e., we take some words W before and after w, and train the model to output those words (W) when w is given as input. Since the words are organized in natural human sentences obeying specific grammatical and semantic rules, this procedure, in essence, allows us to approximate the meaning of the word w. After training the model, the weights from the hidden layer are used to generate embeddings for the words.

In an earlier section, we mentioned that off-the-shelf Skip-gram models have been applied in protein sequence analysis. However, upon diligent inspection, we point out several aspects of the Skip-gram model that might not be compatible with protein sequence analysis. First of all, in bioinformatics, we consider k-mers as tokens instead of words, which is the logical thing to do. However, unlike natural language, where words as tokens are well defined, for k-mers in a protein sequence they are not. When working with 3-mers, the common practice is to shift the sequence 2 times and break the shifted sequences into k-mers. For example, for a sequence ADTIVAVET, we have the 3 shifted sequences ADTIVAVET, DTIVAVET and TIVAVET. We can break the 3 sequences into sets of k-mers as {ADT, IVA, VET}, {DTI, VAV}, {TIV, AVE}. This approach has been followed by [28, 31], but we can observe that the same sequence results in different contextual representations. This kind of situation is rare for natural texts, as the sentences follow specific sets of rules. Furthermore, even if we consider contexts this way, we need to verify that the model manages to map the similar k-mers in the vector space properly, based on the contextual information about the k-mers that it is given. Although a study [35] investigated the semantic meaning latent in the position or context of protein domains, to the best of our knowledge there has been no study that experiments with the semantic representation obtained from k-mer contexts in protein sequences. We inspect the vector space generated for k-mers using the Skip-gram model and discover that there is hardly a correlation between k-mer similarity scores and the vector similarity scores of the embeddings (please refer to Section 3.1). Moreover, homo-repeats [36] like AAAAAAAA exist in protein sequences, but the Skip-gram model does not consider repetitions, as they are seldom found in natural sentences.
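To make the ambiguity concrete, the following minimal Python sketch (ours, for illustration; the function name is not from [28, 31]) reproduces the shifting tokenization described above:

```python
def shifted_kmers(sequence, k=3):
    """Tokenize a protein sequence into non-overlapping k-mers
    for each of the k reading frames, as in the shifting scheme."""
    kmer_sets = []
    for shift in range(k):
        shifted = sequence[shift:]
        # keep only complete k-mers, discarding any trailing residues
        kmers = [shifted[i:i + k] for i in range(0, len(shifted) - k + 1, k)]
        kmer_sets.append(kmers)
    return kmer_sets

# The example from the text: one sequence yields three different tokenizations
print(shifted_kmers("ADTIVAVET"))
# [['ADT', 'IVA', 'VET'], ['DTI', 'VAV'], ['TIV', 'AVE']]
```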
2.2 Align-gram

The question that has driven us is whether the contextual information obtained from the location of k-mers in a protein sequence is sufficient to infer the relations between the k-mers. Traditionally, the similarity between protein sequences or k-mers is determined using alignment scores [37]. During the course of evolution, through point mutation one amino acid can be transformed into another, the likelihood of which can be deduced from a substitution matrix like BLOSUM [13]. Furthermore, gaps in the alignment correspond to indels, i.e., insertion or deletion mutations, which are regulated with suitable gap penalties. As a result, for decades, alignment scores have been used by biologists to measure the degree of similarity between protein sequences.

Therefore, to map the k-mers based on their similarity information, we incorporate this biological intuition into the Skip-gram model. Instead of working with the contextual information of the surrounding k-mers, we compute the alignment scores of all the k-mer pairs. We then modify the Skip-gram model to become a regression model by using linear activation for the output layer. Since cascaded linear layers collapse into a single linear layer, we use the sigmoid activation function for the hidden layer. We represent the input k-mers to this network using one-hot encoding, similar to Skip-gram, but unlike Skip-gram, we try to predict the alignment scores of all the k-mers with the input k-mer. We obtain a weight matrix from training the model and derive the embeddings for the k-mers from it, similar to the Skip-gram model. In this way, by modifying the Skip-gram model, we develop Align-gram to map the k-mers based on their alignment scores instead of the contextual information.

2.3 Implementation Details

We computed the alignment scores between the 3-mers using the BioPython [38] library. The values for the gap open and gap extend penalties were taken as 11 and 1 respectively, and we used the BLOSUM62 substitution matrix; these are the default options for BLASTP [39]. The shallow neural network with a single hidden layer was implemented in Keras [40] with a TensorFlow [41] backend. The hidden layer consists of 100 neurons, which corresponds to the length of the embeddings. We used sigmoid and linear as the activation functions for the hidden layer and the output layer respectively. The input one-hot vector was mapped to its corresponding alignment-score vector as a regression problem. We minimized the mean squared error using the Adam [42] optimizer. The model was trained for 5000 epochs, which took about 30 minutes on a Tesla P-100 GPU.
Figure 1: Overview of the process. (a) presents breaking the sequence into 3-mers. In the Skip-gram model (b), the 3-mers are then used to train the model using the proximity of the 3-mers. For the proposed Align-gram model (c), we compute the alignment scores between all the 3-mers and train the model to predict those alignment scores, obtaining the embedding matrix for 3-mers in the process.
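The sketch below outlines this training setup (ours, for illustration; the batch size and any scaling of the alignment scores are not specified in the paper, and Bio.SubsMat is the BLOSUM62 source in Biopython versions contemporary with this work). Note that computing all pairwise alignment scores is the expensive step and would be precomputed once and cached in practice.

```python
import itertools
import numpy as np
from Bio import pairwise2
from Bio.SubsMat.MatrixInfo import blosum62
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
kmers = ["".join(p) for p in itertools.product(AMINO_ACIDS, repeat=3)]  # all 8000 3-mers
n = len(kmers)

def alignment_score(a, b):
    # global alignment with BLOSUM62 and BLASTP's default penalties (open 11, extend 1)
    return pairwise2.align.globalds(a, b, blosum62, -11, -1, score_only=True)

# Target matrix: alignment score of every 3-mer pair (the O(n^2) bottleneck).
scores = np.array([[alignment_score(a, b) for b in kmers] for a in kmers])

# Shallow regression network: one-hot 3-mer in, vector of n alignment scores out.
model = Sequential([
    Dense(100, activation="sigmoid", input_shape=(n,)),  # hidden layer = embedding size
    Dense(n, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")
model.fit(np.eye(n, dtype=np.float32), scores, epochs=5000, batch_size=256)

# As in Skip-gram, the embeddings are read off the hidden-layer weights.
embeddings = model.layers[0].get_weights()[0]  # shape (n, 100)
```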
3 Results

3.1 Mapping Similar k-mers Close to Each Other

The primary objective of word embeddings is to generate such a vector representation for a word that, in the vector space, it is close to the vectors representing words with similar meaning. For example, we want a word like 'Pizza' to be close to words like 'Burger' or 'Sandwich', but far from words like 'Car' or 'Computer' in the vector space. Therefore, for k-mer embeddings, we expect the k-mers to be placed near similar k-mers. The degree of similarity between vectors and between k-mers is computed using cosine similarity and alignment score respectively. Thus, for both Skip-gram and Align-gram embeddings, we compute the alignment score between every pair of k-mers (3-mers) and measure the cosine similarity scores of the corresponding vectors. We collected the 100-dimensional Skip-gram based embeddings for 3-mers from [28]. We study the correlation between cosine similarity and alignment score for the two embeddings and present them using box-plots in Fig. 2.
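A minimal sketch of this correlation check follows (ours, for illustration; it reuses kmers, embeddings, and alignment_score from the previous sketch, and samples pairs rather than enumerating all of them to keep the computation tractable):

```python
import random
import numpy as np
from scipy.stats import pearsonr

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

emb = {kmer: vec for kmer, vec in zip(kmers, embeddings)}  # 3-mer -> vector

random.seed(0)
pairs = [tuple(random.sample(kmers, 2)) for _ in range(100_000)]  # sampled k-mer pairs

cos_sims = [cosine_similarity(emb[a], emb[b]) for a, b in pairs]
aln_scores = [alignment_score(a, b) for a, b in pairs]

r, _ = pearsonr(cos_sims, aln_scores)
print(f"Pearson correlation coefficient: {r:.3f}")
```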
Figure 2: Equivalence between vector similarity and alignment score. We present the correlation between the two using box-plots for both Skip-gram (a) and Align-gram (b) based embeddings. For Align-gram embeddings, we observe a strong correlation, with a Pearson correlation coefficient of 90.2%. For better visualization, the outliers of the box-plots have been omitted.

This demonstrates that Align-gram, unlike Skip-gram, maps similar k-mers close to each other.

3.2 Relating to the Amino Acid Properties

In addition to maintaining the similarity between k-mers, we are interested in examining the capability of Align-gram to map the amino acids while preserving their physicochemical properties. In order to do so, we train an Align-gram model with the 20 amino acids instead of the k-mers. For the purpose of visualization, we compute 2-dimensional embeddings for them. Venkatarajan and Braun [43] proposed quantitative descriptors for the amino acids by summarizing 237 physicochemical properties into a 5-dimensional property space. We collect the eigenvectors for the amino acids from [43] and compare them with our produced vector space. Fig. 3 presents the relation between our vector space and the summarized 5-dimensional property space. It can be observed that, in most cases, nearby points have quite similar eigenvector values. For a few exceptions like Glycine (G) and Serine (S), we can observe that in terms of E2 and E5 values they are a bit different, but for E1, E3 and E4 they are almost identical. These minor mismatches are due to the fact that we are trying to map a 2-dimensional space into a 5-dimensional one, which itself is an approximate mapping of a 237-dimensional space. Thus, overall, it is apparent that amino acid properties are conserved in our vector space.
Figure 3: Relating to the amino acid properties. We map the amino acids using Align-gram in a 2-dimensional vector space and compare them with the 5-dimensional property space presented in [43]. It can be observed that the eigenvector values for nearby amino acids are close to each other.
3.3 Representing Protein Sequences for Deep Learning Applications

In this section, we demonstrate that Align-gram based embeddings are suitable for representing protein sequences in deep learning based applications. We compare with the usual baseline sequence-based representations and perform experiments with four different datasets, each considering a different type of problem.

3.3.1 Baseline Representations
We compare our proposed Align-gram based embeddings with the other widely used sequence-based representations, namely, one-hot encoding, BLOSUM, and Skip-gram based embeddings. We represent the individual amino acids with sparse one-hot encoded vectors or with their corresponding rows from BLOSUM62. On the other hand, in order to use Skip-gram and Align-gram based embeddings, we tokenize the protein sequence into 3-mers and use their corresponding embeddings. We have collected the 100-dimensional Skip-gram based embeddings for 3-mers from [28]. Hence, to keep our embeddings comparable to those of Skip-gram, we have computed 100-dimensional embeddings. Protein sequences are seldom of uniform length; therefore, to make the sequences evenly long, we pre-pad them with zeros.
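For illustration, a sketch of these representations (ours; it assumes blosum_matrix is a 20x20 array ordered consistently with AA_INDEX, and embeddings is a dict mapping each 3-mer to its learned vector):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """Sparse per-residue one-hot encoding, shape (len, 20)."""
    x = np.zeros((len(sequence), 20))
    for i, aa in enumerate(sequence):
        x[i, AA_INDEX[aa]] = 1.0
    return x

def blosum_rows(sequence, blosum_matrix):
    """Each residue replaced by its row of BLOSUM62, shape (len, 20)."""
    return np.array([blosum_matrix[AA_INDEX[aa]] for aa in sequence])

def kmer_embeddings(sequence, embeddings, k=3):
    """Non-overlapping k-mers replaced by their learned vectors
    (Skip-gram or Align-gram), shape (len // k, dim)."""
    kmers = [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]
    return np.array([embeddings[kmer] for kmer in kmers])

def pre_pad(x, max_len):
    """Left-pad a (timesteps, features) matrix with zero rows up to max_len."""
    pad = np.zeros((max_len - x.shape[0], x.shape[1]))
    return np.vstack([pad, x])
```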
3.3.2 Tasks and Datasets

In order to investigate the versatility of Align-gram based embeddings, we consider four different types of machine learning problems, namely, binary classification, multiclass classification, regression, and sequence-to-sequence prediction. For the purpose of evaluating the capability of Align-gram embeddings in performing such tasks, we use four different datasets. For binary classification, we predict DNA-binding proteins using the PDB1075 dataset [44]. For regression, multiclass classification and sequence-to-sequence prediction, we use the Protein Stability Prediction [45], Remote Homology Detection [46] and Secondary Structure Prediction [47, 48, 9] datasets from the TAPE benchmark [49]. We take the training split of these datasets to perform 5-fold cross-validation, and consider the PDB186 dataset [50] and the validation split of TAPE for testing the models. All these datasets are publicly available.

The PDB1075 dataset consists of 1075 protein sequences with a maximum length of 1323. The Protein Stability Prediction dataset consists of 53614 proteins whose lengths are confined to 49-50. The Remote Homology Detection dataset consists of a total of 12312 proteins with 1195 different fold labels. However, their lengths are distributed quite sporadically; the lengths cover a range of 17-1419 with a median value of only 134. Therefore, both to mitigate the computational expense and to avoid zero-padding over several magnitudes of length, we limit our analysis to sequences up to the 75th percentile length, which equals 217. Furthermore, the dataset is highly imbalanced; hence, we only consider fold classes with at least 50 samples. This brings down the training set to 4814 sequences, and similar selections were performed for the test data. The Secondary Structure Prediction dataset also suffers from a similar scattered distribution of sequence lengths, ranging from 20-1632 with a mean of 256, thus we similarly limit our experiments to the 75th percentile length, i.e., 333. The tasks and datasets are summarized in Table 1.
Table 1: Tasks and datasets. An overview of the tasks, their types and dataset information is presented here.

Task                              Task Type                         Dataset
DNA-Binding Protein Prediction    Binary classification             PDB1075 [44] (train), PDB186 [50] (test)
Protein Stability Prediction      Regression                        TAPE benchmark [45, 49]
Remote Homology Detection         Multiclass classification         TAPE benchmark [46, 49]
Secondary Structure Prediction    Sequence-to-sequence prediction   TAPE benchmark [47, 48, 9, 49]

The datasets are available at http://server.malab.cn/Local-DPP/Datasets.html and https://github.com/songlab-cal/tape.

3.3.3 Baseline Model

Since the objective of this work has been developing a representation for protein sequences, it is beyond our scope to design suitable deep network architectures for solving the different problems and to properly tune their hyperparameters. Therefore, we develop a simple baseline model and experiment with the different baseline representations using the same model. The baseline model consists of 3 LSTM layers, each with 256 cells. They are followed by 2 fully connected layers with 128 and 64 neurons respectively, each with the ReLU activation function and 20% dropout. After these layers, we define the output layer: for the regression and binary classification tasks, the output layer consists of a single neuron with linear and sigmoid activation functions respectively. For the multiclass classification problem, the output layer consists of as many neurons as there are classes, activated using the softmax activation function. On the other hand, for sequence-to-sequence prediction, we use time-distributed fully connected layers.
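A Keras sketch of this baseline follows (ours; losses and other training hyperparameters are not fully specified in the text, and the usage example below uses illustrative shapes):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, TimeDistributed

def build_baseline(input_shape, task, n_classes=None):
    """Baseline from the text: 3 LSTM layers (256 cells each), then dense layers
    of 128 and 64 units with ReLU and 20% dropout, then a task-specific head."""
    seq2seq = (task == "seq2seq")
    model = Sequential([
        LSTM(256, return_sequences=True, input_shape=input_shape),
        LSTM(256, return_sequences=True),
        LSTM(256, return_sequences=seq2seq),  # keep the time axis only for seq2seq
        Dense(128, activation="relu"),
        Dropout(0.2),
        Dense(64, activation="relu"),
        Dropout(0.2),
    ])
    if task == "binary":
        model.add(Dense(1, activation="sigmoid"))
    elif task == "regression":
        model.add(Dense(1, activation="linear"))
    elif task == "multiclass":
        model.add(Dense(n_classes, activation="softmax"))
    else:  # seq2seq: per-residue labels, e.g. secondary structure states
        model.add(TimeDistributed(Dense(n_classes, activation="softmax")))
    return model

# e.g. 100-dimensional 3-mer embeddings, sequences padded to 72 tokens (217 // 3);
# n_classes here is purely illustrative
model = build_baseline(input_shape=(72, 100), task="multiclass", n_classes=10)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```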
3.3.4 Results

We train the baseline model with the different input representations for the different tasks. We evaluate the models on the test datasets using task-specific evaluation metrics. For binary classification, we use Precision, Recall, F1, and Accuracy; for multiclass classification, we use Top-1, Top-5, and Top-10 Accuracies; for regression, we use Mean Squared Error (MSE), Mean Absolute Error (MAE), and Spearman's rank correlation coefficient (ρ); for the sequence-to-sequence task, we compute the suitable 3-class and 8-class accuracies over the entire predicted sequence. The experimental results for the different tasks are presented in Table 2.
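A sketch of these metric computations (ours; the sequence-to-sequence 3-class and 8-class accuracies are plain accuracies over residue positions, using the respective label sets):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

def binary_metrics(y_true, y_pred):
    """Precision, Recall, F1 and Accuracy for DNA-binding protein prediction."""
    return (precision_score(y_true, y_pred), recall_score(y_true, y_pred),
            f1_score(y_true, y_pred), accuracy_score(y_true, y_pred))

def top_k_accuracy(y_true, y_prob, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(y_prob, axis=1)[:, -k:]
    return float(np.mean([y in row for y, row in zip(y_true, top_k)]))

def regression_metrics(y_true, y_pred):
    """MSE, MAE and Spearman's rho for stability prediction."""
    mse = float(np.mean((y_true - y_pred) ** 2))
    mae = float(np.mean(np.abs(y_true - y_pred)))
    rho, _ = spearmanr(y_true, y_pred)
    return mse, mae, rho
```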
Table 2: Experimental results. Here, we present the results on the test data for the different tasks, using the different representations. For the different types of predictions, suitable metrics have been used. Although the results are quite mediocre due to the simplicity of the baseline model, it can be observed that Align-gram based embeddings led to the best results.

[Per-metric mean ± standard deviation scores for the one-hot, BLOSUM62, Skip-gram and Align-gram representations were not recoverable from the extracted text, apart from the one-hot Precision of 48.11 (DNA-binding protein prediction) and Top-1 accuracy of 17.57 (remote homology detection); Align-gram obtained the best value for every metric.]

From the table, it is evident that Align-gram based embeddings have obtained superior results in all the different tasks for all the evaluation metrics. The individual performance metric scores may appear mediocre, but this is due to the simplicity of the baseline model. For these experiments, our purpose was to demonstrate that, under similar settings and constraints, Align-gram embeddings are capable of producing better representations, which has been satisfied. Keeping the model simple has allowed us to perform the experiments in reduced time, and we resorted to that as our objective has been to develop a novel representation, not novel architectures. Nevertheless, we show the potential of Align-gram in improving much more complex deep learning models like DeepGOPlus [10] in Section 3.5.
3.4 Comparison with Evolutionary Features

Evolutionary features, although the most capable ones for modeling biological sequences, are not always applicable, for a number of reasons. For example, PSSM [17, 18] keeps changing with the discovery of new proteins [16], and HMM state-transition probabilities [9] require sequence alignment, which is time-consuming and computationally expensive. As a result, language model based approaches are gaining popularity for their simplicity and convenience, despite evolutionary features powering the more capable state-of-the-art methods [49].

Thus, it is expected that evolutionary features like PSSM or HMM will outperform simple sequence-based representations, including Align-gram embeddings. Nevertheless, we conduct experiments with the baseline model using evolutionary features. For Remote Homology Detection we use PSSM [46], and for Secondary Structure Prediction we use HMM [9] features. Using them as input, we observe a significant improvement over the sequential representations, which is expected. However, most often these evolutionary features are coupled with one-hot encoding [9, 51, 52]. Therefore, we examined coupling Align-gram embeddings with these features, to see whether Align-gram manages to contribute to the score. From the experiments, it was observed that including Align-gram embeddings with evolutionary features improves the results over using only evolutionary features or combining them with one-hot encoding, which is the popular practice. It may be noted here that since PSSM results in a 20-dimensional vector per residue, we found 20-dimensional Align-gram embeddings better suited to combine with PSSM. Align-gram allows us to use embeddings of various dimensions, and more investigation is necessary to determine the optimal dimension for Align-gram embeddings. In this way, though Align-gram could not surpass evolutionary features, it is still able to supplement them by improving the overall performance. The results are presented in Table 3.
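The paper does not spell out how the per-3-mer Align-gram vectors are aligned with the per-residue PSSM rows; the sketch below (ours) assumes one plausible scheme, assigning each residue the embedding of the 3-mer starting at its position, and then concatenating along the feature axis:

```python
import numpy as np

def residue_level_embedding(sequence, embeddings, k=3):
    """Assumed scheme: give each residue the embedding of the k-mer starting
    at its position (the trailing residues reuse the final k-mer)."""
    vecs = []
    for i in range(len(sequence)):
        start = min(i, len(sequence) - k)
        vecs.append(embeddings[sequence[start:start + k]])
    return np.array(vecs)

def combine_with_pssm(pssm, sequence, embeddings):
    """Concatenate 20-d PSSM rows with 20-d Align-gram vectors -> (len, 40)."""
    return np.concatenate(
        [pssm, residue_level_embedding(sequence, embeddings)], axis=-1)
```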
Table 3: Results obtained from using evolutionary features. For Remote Homology Detection we use PSSM and for Secondary Structure Prediction we use HMM features. It can be observed that using evolutionary features yields much better results compared to those obtained from the sequence-based representations. Although Align-gram embeddings fail to outperform evolutionary features, using them together yields superior results.
Task: Remote Homology Detection                      Task: Secondary Structure Prediction
Embedding                               Top-1 Acc   Top-5 Acc    3 Class Acc   8 Class Acc
Evolutionary Features                   28.361 ±    …            …             …
Evolutionary Features + One-Hot         …           …            …             …
Evolutionary Features + Align-gram      …           …            …             …

[The remaining numeric scores were not recoverable from the extracted text.]

3.5 Experiments with DeepGOPlus

As mentioned in Section 3.3.3, since we primarily worked with protein sequence encoding, it was beyond our scope to design suitable deep network architectures for solving different problems. This limited us to conducting our experiments with a generic, boilerplate baseline model. Therefore, in order to demonstrate the efficacy and applicability of Align-gram in deep networks, we consider DeepGOPlus [10], one of the state-of-the-art methods for protein function prediction.

DeepGOPlus employs a novel deep convolutional neural network (CNN) architecture to extract motifs that are likely indicators of possible protein functions. Furthermore, it combines the neural network predictions with Diamond [53] based sequence-similarity predictions using a weighted sum model. DeepGOPlus shows competitive performance with the contemporary best-performing methods. Moreover, it demonstrates astounding results in the CAFA3 evaluation [54], placing it as one of the three best-scoring predictors for CCO and the second best for the BPO and MFO evaluations.

The authors of DeepGOPlus have made both their code and data publicly available. Therefore, we have collected their code and reproduced their results using their provided data splits, following the CAFA3 evaluations. DeepGOPlus uses one-hot encoding to represent the protein sequences; thus, we replace the input scheme with our proposed Align-gram based embedding and rerun the experiments following precisely the same protocol. It should be noted that, DeepGOPlus being a model with over 54 million parameters, owing to the massive computational requirements we ran the experiments with 20-dimensional embeddings generated using Align-gram for 3-mers. Also, the results reproduced from the official implementation of DeepGOPlus were slightly different from the results presented in the paper. This may be due to differences in software or library versions, floating-point precision in hardware, random seeds, etc. Therefore, to make the comparisons on even ground, we compare the results obtained from using Align-gram embeddings with the results we reproduced. The authors also shared their computed Diamond scores, and thus we perform the weighted sum ensemble as well. The results are presented in Table 4.
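A sketch of how such an input swap could look in Keras (ours; DeepGOPlus's actual input pipeline differs in its details, and the matrix and maximum length below are placeholders): sequences are tokenized into 3-mer ids, and the one-hot input is replaced with an Embedding layer initialized from the precomputed Align-gram vectors.

```python
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

def aligngram_input_layer(aligngram_matrix, max_len, trainable=False):
    """Embedding layer whose weights are the precomputed Align-gram
    vectors, one row per 3-mer id."""
    n_kmers, dim = aligngram_matrix.shape
    return Embedding(input_dim=n_kmers, output_dim=dim,
                     embeddings_initializer=Constant(aligngram_matrix),
                     input_length=max_len, trainable=trainable)

# e.g. 20-dimensional embeddings for all 8000 3-mers; max_len is illustrative
emb = aligngram_input_layer(np.random.rand(8000, 20), max_len=700)
```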
Table 4: Results obtained from the DeepGOPlus model. We reproduce the original results presented in the DeepGOPlus paper. Furthermore, we rerun the experiments using the Align-gram based protein sequence representation and also ensemble with Diamond scores. Using Align-gram improves the MFO and BPO metrics, while a slight dip is noticed for CCO. The improvements are more noticeable for the results using the CNN model only, as the Diamond scores seem to dominate the ensemble result a bit.

                            Fmax                    Smin                       AUPR
Method                      MFO    BPO    CCO       MFO    BPO     CCO         MFO    BPO    CCO
Deep CNN results
  In paper                  0.420  0.378  0.607     9.711  24.234  8.153       0.355  0.323  0.616
  Reproduced                0.405  0.380  …         …      …       …           …      …      …
  Align-gram                …      …      …         …      …       …           …      …      …
Diamond ensemble results
  Align-gram                …      …      …         …      …       …           …      …      …

[The remaining values were not recoverable from the extracted text.]

The DeepGOPlus code and data are available at https://github.com/bio-ontology-research-group/deepgoplus and http://deepgoplus.bio2vec.net/data/data-cafa.tar.gz.

4 Availability

Align-gram is available as free, open-source software at: https://github.com/nibtehaz/align-gram
We have made our trained Align-gram embeddings public at the same link. Furthermore, we have published Align-gram as a customizable interface, with sufficient documentation, to aid researchers in the advancement of protein sequence analysis. The datasets used to develop and evaluate Align-gram are publicly available, and the sources have been properly mentioned throughout the paper.
5 Conclusion

In this work, we started by analyzing the Skip-gram model thoroughly, with a dedicated focus on its application to protein sequence analysis. We investigated and conjectured on certain scenarios and conditions which may hinder the relevance of the Skip-gram model for proteomic analysis. The significance and contribution of the Skip-gram model in the domain of Natural Language Processing (NLP) are indisputable. Nevertheless, we raise the question of whether off-the-shelf ideas and tools from broader NLP are sufficient for bioinformatics applications.

At one end, we have the conventional biological pipelines, which compute handcrafted features or physicochemical properties, or work with evolutionary information involving PSSM or HMM profiles. All of these require significant endeavors, but as they are built upon decades of knowledge compiled by biologists and bioinformaticians, they most often yield interesting and superior results. At the other end, we have sequence-based deep learning methods. These approaches are mostly based on off-the-shelf ideas from the NLP domain, lacking sufficient tailoring for biological applications. Despite this, these methods have proven effective, owing to the volume of sequence data we now have, which even in unannotated form is adequate for unsupervised or self-supervised learning. We made an attempt to coalesce these two contrasting ideologies, to fuse the best of both worlds.

Thus, in this work, we have developed a novel word or k-mer embedding scheme for protein sequence analysis. We call it Align-gram, maintaining the analogy with the successful and popular Skip-gram model from NLP. Align-gram manages to correlate the embedding vector similarity of the k-mers with the alignment score, signifying that evolutionarily and biologically similar k-mers are projected together. This finding is further strengthened by the conservation of amino acid properties in the learned vector space. In addition to bridging the gap between NLP models and biological insights, Align-gram provides us with a better embedding to train deep learning models. Although Align-gram based embeddings are not on par with computationally expensive PSSM and HMM based evolutionary features, the embeddings can supplement them well, improving the overall performance further.

The future directions of this research can be manifold. Firstly, we are interested in applying Align-gram embeddings to a multitude of diverse problems. It would be interesting to analyze how Align-gram based embeddings affect the performance of the existing state-of-the-art sequence-based deep learning models for different tasks in protein sequence analysis. Furthermore, we are intrigued to perform similar investigations on more sophisticated NLP models like Elmo [55] and BERT [56]. Very recently, some attempts at using Elmo and BERT models for protein sequences have been made [57, 58], albeit employing off-the-shelf NLP models without much specific modification. Thus, it is worth exploring the possibilities of augmenting such models with biological insights and intuitions to develop more refined models, better suited for bioinformatics applications. Also, further experiments merging Align-gram embeddings with the myriad of existing handcrafted features or evolutionary information will be beneficial for suitable input representation selection.
Therefore, we believe that Align-gram based embeddings can be a proper surrogate for Skip-gram based embeddings in protein sequence analysis, and will thereby prove an effective tool for bioinformaticians.

References

[1] Pedro Larranaga, Borja Calvo, Roberto Santana, Concha Bielza, Josu Galdiano, Inaki Inza, José A Lozano, Rubén Armañanzas, Guzmán Santafé, Aritz Pérez, et al. Machine learning in bioinformatics. Briefings in Bioinformatics, 7(1):86–112, 2006.

[2] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, Angela Hung Byers, et al. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, 2011.

[3] Seonwoo Min, Byunghan Lee, and Sungroh Yoon. Deep learning in bioinformatics. Briefings in Bioinformatics, 18(5):851–869, 2017.

[4] Bo Wen, Wenfeng Zeng, Yuxing Liao, Zhiao Shi, Sara R Savage, Wen Jiang, and Bing Zhang. Deep learning in proteomics. Proteomics, page 1900335, 2020.

[5] Joachim Selbig, Theo Mevissen, and Thomas Lengauer. Decision tree-based formation of consensus protein secondary structure prediction. Bioinformatics, 15(12):1039–1046, 1999.

[6] Changhui Yan, Drena Dobbs, and Vasant Honavar. A two-stage classifier for identification of protein–protein interface residues. Bioinformatics, 20(suppl 1):i371–i378, 2004.

[7] Ying Huang and Yanda Li. Prediction of protein subcellular locations using fuzzy k-NN method. Bioinformatics, 20(1):21–28, 2004.

[8] Pietro Di Lena, Ken Nagata, and Pierre Baldi. Deep architectures for protein contact map prediction. Bioinformatics, 28(19):2449–2457, 2012.

[9] Michael Schantz Klausen, Martin Closter Jespersen, Henrik Nielsen, Kamilla Kjaergaard Jensen, Vanessa Isabell Jurtz, Casper Kaae Soenderby, Morten Otto Alexander Sommer, Ole Winther, Morten Nielsen, Bent Petersen, et al. NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins: Structure, Function, and Bioinformatics, 2019.

[10] Maxat Kulmanov and Robert Hoehndorf. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics, 36(2):422–429, 2020.

[11] José Juan Almagro Armenteros, Casper Kaae Sønderby, Søren Kaae Sønderby, Henrik Nielsen, and Ole Winther. DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics, 33(21):3387–3395, 2017.

[12] Haoyang Zeng and David K Gifford. Quantification of uncertainty in peptide-MHC binding prediction improves high-affinity peptide selection for therapeutic design. Cell Systems, 9(2):159–166, 2019.

[13] Steven Henikoff and Jorja G Henikoff. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, 89(22):10915–10919, 1992.

[14] Timothy J O'Donnell, Alex Rubinsteyn, Maria Bonsack, Angelika B Riemer, Uri Laserson, and Jeff Hammerbacher. MHCflurry: open-source class I MHC binding affinity prediction. Cell Systems, 7(1):129–132, 2018.

[15] Jing Jin, Zhonghao Liu, Alireza Nasiri, Yuxin Cui, Stephen Louis, Ansi Zhang, Yong Zhao, and Jianjun Hu. Attention mechanism-based deep learning pan-specific model for interpretable MHC-I peptide binding prediction. bioRxiv, page 830737, 2019.

[16] Aaron Hein, Casey Cole, and Homayoun Valafar. An investigation in optimal encoding of protein primary sequence for structure prediction by artificial neural networks. arXiv preprint arXiv:2008.00539, 2020.

[17] David T Jones. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2):195–202, 1999.

[18] Jack Hanson, Kuldip Paliwal, Thomas Litfin, Yuedong Yang, and Yaoqi Zhou. Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. Bioinformatics, 35(14):2403–2410, 2019.

[19] Duolin Wang, Yanchun Liang, and Dong Xu. Capsule network for protein post-translational modification site prediction. Bioinformatics, 35(14):2386–2394, 2019.

[20] Hongli Fu, Yingxi Yang, Xiaobo Wang, Hui Wang, and Yan Xu. DeepUbi: a deep learning framework for prediction of ubiquitination sites in proteins. BMC Bioinformatics, 20(1):1–10, 2019.

[21] Jennifer G Abelin, Dewi Harjanto, Matthew Malloy, Prerna Suri, Tyler Colson, Scott P Goulding, Amanda L Creech, Lia R Serrano, Gibran Nasir, Yusuf Nasrullah, et al. Defining HLA-II ligand processing and binding rules with mass spectrometry enhances cancer epitope prediction. Immunity, 51(4):766–779, 2019.

[22] Bin Yu, Zhaomin Yu, Cheng Chen, Anjun Ma, Bingqiang Liu, Baoguang Tian, and Qin Ma. DNNAce: Prediction of prokaryote lysine acetylation sites through deep neural networks with multi-information fusion. Chemometrics and Intelligent Laboratory Systems, page 103999, 2020.

[23] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[25] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

[26] Yoav Goldberg and Omer Levy. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.

[27] Jey Han Lau and Timothy Baldwin. An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368, 2016.

[28] Ehsaneddin Asgari and Mohammad RK Mofrad. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE, 10(11):e0141287, 2015.

[29] Dhananjay Kimothi, Akshay Soni, Pravesh Biyani, and James M Hogan. Distributed representations for biological sequence analysis. arXiv preprint arXiv:1608.05949, 2016.

[30] Patrick Ng. dna2vec: Consistent vector representations of variable-length k-mers. arXiv preprint arXiv:1701.06279, 2017.

[31] Kevin K Yang, Zachary Wu, Claire N Bedbrook, and Frances H Arnold. Learned protein embeddings for machine learning. Bioinformatics, 34(15):2642–2648, 2018.

[32] Carlo Mazzaferro. Predicting protein binding affinity with word embeddings and recurrent neural networks. bioRxiv, page 128223, 2017.

[33] Poomarin Phloyphisut, Natapol Pornputtapong, Sira Sriswasdi, and Ekapol Chuangsuwanich. MHCSeqNet: a deep neural network model for universal MHC binding prediction. BMC Bioinformatics, 20(1):270, 2019.

[34] Johanna Vielhaben, Markus Wenzel, Wojciech Samek, and Nils Strodthoff. USMPep: universal sequence models for major histocompatibility complex binding affinity prediction. BMC Bioinformatics, 21(1):1–16, 2020.

[35] Daniel Buchan and David Jones. Inferring protein domain semantic roles using word2vec. bioRxiv, page 617647, 2019.

[36] Michail Yu Lobanov, Petr Klus, Igor V Sokolovsky, Gian Gaetano Tartaglia, and Oxana V Galzitskaya. Non-random distribution of homo-repeats: links with biological functions and human diseases. Scientific Reports, 6:26941, 2016.

[37] Swathik Clarancia Peter, Jaspreet Kaur Dhanjal, Vidhi Malik, Navaneethan Radhakrishnan, Mannu Jayakanthan, and Durai Sundar. Encyclopedia of bioinformatics and computational biology. Ranganathan, S., Gribskov, M., Nakai, K., Schönbach, C., Eds., pages 661–676, 2018.

[38] Peter JA Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423, 2009.

[39] BLAST options and defaults, 2020.

[40] François Chollet et al. Keras, 2015.

[41] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[42] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[43] Mathura S Venkatarajan and Werner Braun. New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical–chemical properties. Molecular Modeling Annual, 7(12):445–453, 2001.

[44] Bin Liu, Jinghao Xu, Xun Lan, Ruifeng Xu, Jiyun Zhou, Xiaolong Wang, and Kuo-Chen Chou. iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PLoS ONE, 9(9):e106691, 2014.

[45] Gabriel J Rocklin, Tamuka M Chidyausiku, Inna Goreshnik, Alex Ford, Scott Houliston, Alexander Lemak, Lauren Carter, Rashmi Ravichandran, Vikram K Mulligan, Aaron Chevalier, et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science, 357(6347):168–175, 2017.

[46] Naomi K Fox, Steven E Brenner, and John-Marc Chandonia. SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Research, 42(D1):D304–D309, 2013.

[47] Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000.

[48] John Moult, Krzysztof Fidelis, Andriy Kryshtafovych, Torsten Schwede, and Anna Tramontano. Critical assessment of methods of protein structure prediction (CASP)—Round XII. Proteins: Structure, Function, and Bioinformatics, 86:7–15, 2018.

[49] Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, and Yun Song. Evaluating protein transfer learning with TAPE. In Advances in Neural Information Processing Systems, pages 9689–9701, 2019.

[50] Wangchao Lou, Xiaoqing Wang, Fan Chen, Yixiao Chen, Bo Jiang, and Hua Zhang. Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naive Bayes. PLoS ONE, 9(1):e86703, 2014.

[51] Fei He, Rui Wang, Jiagen Li, Lingling Bao, Dong Xu, and Xiaowei Zhao. Large-scale prediction of protein ubiquitination sites using a multimodal deep architecture. BMC Systems Biology, 12(6):109, 2018.

[52] Kai-Yao Huang, Justin Bo-Kai Hsu, and Tzong-Yi Lee. Characterization and identification of lysine succinylation sites based on deep learning method. Scientific Reports, 9(1):1–15, 2019.

[53] Benjamin Buchfink, Chao Xie, and Daniel H Huson. Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1):59–60, 2015.

[54] Naihui Zhou, Yuxiang Jiang, Timothy R Bergquist, Alexandra J Lee, Balint Z Kacsoh, Alex W Crocker, Kimberley A Lewis, George Georghiou, Huy N Nguyen, Md Nafiz Hamid, et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biology, 20(1):1–23, 2019.

[55] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

[56] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[57] Michael Heinzinger, Ahmed Elnaggar, Yu Wang, Christian Dallago, Dmitrii Nechaev, Florian Matthes, and Burkhard Rost. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics, 20(1):723, 2019.

[58] Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rihawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Debsindhu Bhowmik, et al. ProtTrans: Towards cracking the language of life's code through self-supervised deep learning and high performance computing. arXiv preprint arXiv:2007.06225, 2020.