EPGAT: Gene Essentiality Prediction With Graph Attention Networks
João Schapke, Anderson Tavares, and Mariana Recamonde-Mendoza
Abstract
The identification of essential genes/proteins is a critical step towards a better understanding of human biology and pathology. Computational approaches helped to mitigate experimental constraints by exploring machine learning (ML) methods and the correlation of essentiality with biological information, especially protein-protein interaction (PPI) networks, to predict essential genes. Nonetheless, their performance is still limited, as network-based centralities are not exclusive proxies of essentiality, and traditional ML methods are unable to learn from non-Euclidean domains such as graphs. Given these limitations, we proposed EPGAT, an approach for essentiality prediction based on Graph Attention Networks (GATs), which are attention-based Graph Neural Networks (GNNs) that operate on graph-structured data. Our model directly learns patterns of gene essentiality from PPI networks, integrating additional evidence from multiomics data encoded as node attributes. We benchmarked EPGAT for four organisms, including humans, accurately predicting gene essentiality with AUC scores ranging from 0.78 to 0.97. Our model significantly outperformed network-based and shallow ML-based methods and achieved a very competitive performance against the state-of-the-art node2vec embedding method. Notably, EPGAT was the most robust approach in scenarios with limited and imbalanced training data. Thus, the proposed approach offers a powerful and effective way to identify essential genes and proteins.
Index Terms
Bioinformatics, deep learning, essential genes, essential proteins, graph neural networks, multiomics data, prediction
J. Schapke, A. Tavares, and M. Recamonde-Mendoza are with the Institute of Informatics, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil. E-mail: [email protected], {artavares, mrmendoza}@inf.ufrgs.br.
Preprint (arXiv, q-bio.MN). This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
I. INTRODUCTION
Discovering the essentialome, i.e., the set of essential genes for the survival of a living organism, is an important research question in bioinformatics and genomics. A plethora of fundamental functions are played in the cell by proteins coded by essential genes [1], resulting in an irreplaceable role in controlling cellular processes. Therefore, essential genes are regarded as the basis of life, such that characterizing an organism's essentialome can give us insights into the inner workings and interdependence within its genome. Knowledge of essential proteins and genes not only improves our understanding of biological processes and molecular functions, but has also contributed to a range of fields, such as synthetic biology, for the definition of a minimal genome [2]; drug design, for the choice of drug targets of antimicrobial and anticancer compounds [3], [4]; the understanding of diseases, by helping discover pathogenic genes [5]; and metabolic engineering [6].

Several experimental procedures, such as targeted single-gene knockout and transposon mutagenesis, have been developed to identify essential genes [7]. Although they are generally robust and accurate, wet-lab approaches are costly in laboratory resources and time consuming, which may impose significant limitations on their use, especially for large-scale analyses and for complex organisms like mice and humans [8]. Considering this scenario, computational approaches have become appealing solutions as they provide fast and inexpensive results, which can reduce the workload in experimental procedures by suggesting a list of candidate genes and proteins to be tested. With the rapid development of high-throughput techniques, a large volume of biological data is available for in silico studies, covering a wide variety of organisms.
This factor was crucial for the further development of computational methods to predict essential genes and proteins.

As recently reviewed [8], computational methods usually rely on network topology features extracted from Protein-Protein Interaction (PPI) networks as the main source of data to predict gene essentiality. These methods leverage the fact that highly connected genes in the PPI network are more likely to be essential, a pattern known as the centrality-lethality rule [9]. Therefore, correlation between node centralities and gene essentiality has been explored in several computational approaches previously proposed, including neighborhood-based methods such as degree centrality (DC) [10], local average connectivity (LAC) [11], and edge-clustering coefficient centrality (NC) [12].

(We note that although PPI networks encode interactions among proteins, we refer to the nodes of the network as genes or proteins interchangeably throughout the text, considering the genes by which these proteins are encoded.)

Although information regarding interaction among genes is strongly related to gene essentiality, using only PPI network topological features leads to limited prediction accuracy [13]. First, PPI networks are known for being noisy and containing many false positive connections [14]. Second, there are many protein interaction databases, with significantly different networks among them, which not only emphasizes the potential incompleteness of single PPI networks, but may also generate varying and unstable results depending on the network used. Third, due to the complexity of living organisms, it is natural to assume that gene essentiality is determined by multiple biological factors that cannot be fully captured by topological characteristics of the network [7].

Recent works aim at alleviating these issues by using multiomics datasets either to manually prune untrustworthy connections or as additional input for the statistical and computational methods being applied.
In this direction, gene expression profiles are perhaps the most common choice of additional evidence among works in the field [8], being applied, for instance, in [15] and [16]. Other types of biological information already used in conjunction with PPI networks are subcellular localization information [17], orthology information [18], protein complexes [19] and domains [20], sequence data [21], and functional annotations such as Gene Ontology [22]. In addition, there is an increasing number of studies exploring three or more sources of biological information in the construction of statistical or computational models (e.g., [23]–[25]).

Given the nature of this prediction problem, computational approaches often apply machine learning (ML) algorithms, especially supervised learning methods, as part of the solution [13]. In this sense, the identification of essential genes and proteins is regarded as a binary classification task, and algorithms build a prediction model based on features related to gene and protein essentiality. Most works in this scope have applied shallow ML algorithms, among which Support Vector Machines (SVM) [26], naïve Bayes [27], and decision trees [28] are recurrent ones. This methodology undoubtedly helped advance the frontier of knowledge regarding essential genes and proteins. Nonetheless, whereas centrality-based methods need prior knowledge to create good score functions, shallow ML methods demand the specialist's input to select representative biological properties as features for the learning problem, which is not a trivial task, especially when knowledge regarding these features is still evolving and may not be fully elucidated [7].

Deep learning methods, on the contrary, avoid the critical step of extracting handcrafted features in the design of ML methods, automatically learning features from the training data during their operation.
Thus, deep learning does not require prior knowledge or assumptions regarding relevant biological features of gene essentiality. Moreover, deep learning methods have demonstrated an excellent performance in a wide range of prediction and network-based problems in Bioinformatics [29], [30].

Zeng et al. [31] proposed a deep learning framework for identifying essential proteins (DeepEP) without any prior knowledge, adopting PPI network, gene expression data, and subcellular localization information. The node2vec algorithm [32] was applied for network representation learning, a long short-term memory network was used to process the gene expression data, and the output vectors originating from pre-processing the three data sources are concatenated and classified by a fully connected layer with the sigmoid activation function. Computational experiments with the PPI network of Saccharomyces cerevisiae indicate an area under the Receiver Operating Characteristic (ROC) curve (AUC) of 0.832.

Also in this direction, PPI networks and gene expression data were used as input for the deep learning framework proposed in [33]. The node2vec technique was used for encoding the PPI network into a low-dimensional space, which is concatenated with patterns extracted by a multi-scale Convolutional Neural Network (CNN) from an image-based representation of gene expression profiles. The output vector is analyzed by a sequence of two fully connected layers using the rectified linear unit (ReLU) and softmax activation functions to predict the final label of a protein. An AUC score of 0.82 is achieved for S. cerevisiae data. The works by [31] and [33] show that the dense vectors generated by the node2vec technique contribute to an improved performance over commonly used centrality measures.

More recently, Zhang et al. [34] proposed DeepHE, a deep learning-based method to predict human essential genes. DeepHE integrates sequence features extracted from DNA and protein sequences with features learned from the PPI network using node2vec. A deep neural network was trained with a cost-sensitive technique to address the imbalanced learning problem inherent to this domain, predicting human gene essentiality with an average AUC score higher than 0.94. The method is shown to outperform traditional shallow ML algorithms such as SVM, RF, and naïve Bayes.

Despite the success of these deep learning methods for gene or protein essentiality prediction, they have in common the need to use a node embedding technique (e.g., node2vec) to transform the PPI network into an ordered and fixed-size input lying in the Euclidean space, as expected by statistical and ML models. In other words, these approaches do not learn from the full graph data, but rather from low-dimensional representations obtained from them. Therefore, it is reasonable to assume that during this transformation, valuable information may be lost and more complex patterns may not be discovered, impacting the performance and generalization power of the model.

Graph Neural Networks (GNNs) [35], [36] were introduced as a class of deep learning based methods capable of dealing with the non-Euclidean nature of graphs, automatically learning network topology-preserving node-level vector representations from networks. Graph Convolutional Networks (GCNs) [37] are the current state-of-the-art for problems posed for GNNs, performing node classification based on node features and network topology by aggregating information from neighboring nodes in a hierarchical fashion.
GCNs have significantly improved performance in other prediction tasks in Bioinformatics, such as estimating drug-target binding affinity [38], identifying cancer driver genes [39], and classifying breast cancer subtype [40].

In this work, we hypothesize that GNN-based models could offer high performance in the identification of essential genes and proteins compared to previous deep classifiers, by learning more complex relations directly from the PPI networks. We propose EPGAT, a novel computational method for gene essentiality prediction based on Graph Attention Networks (GATs) [41], an extension to GCNs that adds an attention mechanism to the original model. As far as we are aware, no previous work has adopted GCNs, or more specifically GATs, for prediction of essential genes and proteins.

Our approach aims at tackling the main limitations of current methods, i.e., (i) the low reliability of PPI networks, (ii) the need to integrate multiomics data to more broadly capture the biological notion of essentiality, and (iii) the limited use of graph-embedded knowledge by previously proposed network topology measures, shallow ML algorithms, and deep learning methods. Our solution operates directly over the graph structure by using GATs, and incorporates multiomics datasets, namely gene expression profiles, orthology information, and subcellular localization information, as node features, which are collectively used with PPI networks in the model learning process. To deal with issue (i), we evaluate our experiments on three distinct PPI network databases. Additionally, we benchmark our method on four organisms: Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens.

Our contributions in this article are: 1) we show that EPGAT provides state-of-the-art results compared to other ML and network-based approaches and 2) we analyze and integrate different PPI and multiomics datasets in order to infer which of them are the most relevant for gene essentiality prediction.

The remainder of this paper is organized as follows. We introduce the relevant background and the setting of our experiments throughout Section II. In Section III, we present our experiments, results, and discussion of our findings. Lastly, in Section IV, we conclude our study and offer perspectives for future research.
II. MATERIALS AND METHODS
In this section, we present the basis of our proposed method, including an overview of the data and learning algorithm used, evaluation strategies, and baselines adopted for comparison purposes.
A. Data Collection and Preprocessing
The proposed method was evaluated in the prediction of essential genes on four organisms: Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, and Homo sapiens. S. cerevisiae and E. coli were chosen as they have their complete essentialome published [42], [43] and are widely used as testbeds for statistical models. We also tested our method on the human genome since the performance of computational models in complex organisms that do not have a known essentialome is an open question and, therefore, a valuable research direction. D. melanogaster is also an interesting benchmark organism because, as we will later discuss, it is the most negatively biased organism of our dataset regarding annotation on essential genes.

Besides the essential genes dataset, we used four kinds of biological datasets in our experiments: PPI networks, gene expression profiles, subcellular localization, and orthology information. To standardize nomenclature among these sources and allow data integration, the identification of genes and proteins was mapped to UniProtKB nomenclature. Elements without a corresponding UniProtKB identifier were removed from our dataset. In what follows we explain our data sources.
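The identifier standardization described above can be illustrated with a minimal sketch. The function name `map_to_uniprot`, the gene names, and the accession strings below are all hypothetical placeholders; in practice the mapping would come from UniProtKB itself, and the exact procedure used in this work is not detailed in the text.

```python
# Hypothetical sketch: rename records to a common identifier space and
# drop every element that has no counterpart in the mapping.
def map_to_uniprot(records, id_map):
    """Return a copy of `records` keyed by mapped IDs; unmapped entries are removed."""
    mapped = {}
    for source_id, value in records.items():
        uniprot_id = id_map.get(source_id)
        if uniprot_id is not None:
            mapped[uniprot_id] = value
    return mapped

# Made-up identifiers, used only for illustration.
id_map = {"geneA": "P00001", "geneB": "P00002"}
labels = {"geneA": 1, "geneB": 0, "geneC": 0}  # 1 = essential, 0 = non-essential

mapped_labels = map_to_uniprot(labels, id_map)
print(mapped_labels)
```

Here `geneC` has no entry in the mapping and is therefore dropped, which mirrors how the number of labeled genes can shrink after standardization.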
1) Essential Genes Dataset:
Annotations for essential and non-essential genes for all organisms were downloaded from the database of Online GEne Essentiality (OGEE) (downloaded on 25/03/2020) [44]. A total of 5,636 genes (18.63% essential) were obtained for S. cerevisiae, 4,322 genes (8.2% essential) for E. coli, 13,781 genes (2.96% essential) for D. melanogaster, and 21,556 genes (8.9% essential) for H. sapiens.

After the pre-processing step to standardize the nomenclature of genes using the UniProtKB standard, the number of genes labeled for each species was: 5,636 genes (18.63% essential) for S. cerevisiae, 3,686 genes (6.56% essential) for E. coli, 12,329 genes (1.95% essential) for D. melanogaster, and 18,476 genes (9.88% essential) for H. sapiens. As we may observe, D. melanogaster is the most negatively biased organism since only 1.95% of labeled data refer to essential genes (i.e., positive instances, in the context of classification).
2) PPI Networks:
As previously reported, PPI networks have many false positive interactions [14], and networks for the same organism from different databases are extremely heterogeneous. This may lead to unstable results depending on the choice of PPI dataset. To analyze and alleviate the effect of network choice, we evaluated the results for each organism in three different PPI networks: DIP (data as of 2017-02-05), BioGRID (Version 3.5.182), and STRING (Version 11.0).

DIP, the Database of Interacting Proteins [45], lists protein pairs that were experimentally shown to bind to each other. Although DIP has been used by previous works, it is a sparse network. In our work, the DIP dataset contains the lowest number of nodes and edges among all networks used, for all organisms evaluated.

BioGRID, the Biological General Repository for Interaction Datasets [46], provides the curation and storage of protein interactions reported in the biomedical literature for all major model organisms and for humans. Interactions in this database are divided between physical and genetic, both of which are used in our dataset.
TABLE I
CHARACTERISTICS OF THE PPI NETWORKS USED IN THIS STUDY.

                    S. cerevisiae                 E. coli
                    BioGRID   STRING    DIP       BioGRID   STRING    DIP
N. nodes            6908      6049      5126      3971      4068      2924
N. edges            526392    393022    22941     178271    114432    12246
N. labeled genes    5548      5455      4570      3489      3521      1902
N. essential genes  1048      1050      981       201       234       205
N. test labels      1110      1091      914       698       705       381

                    H. sapiens                    D. melanogaster
                    BioGRID   STRING    DIP       BioGRID   STRING    DIP
N. nodes            16562     18822     4615      2484      11499     972
N. edges            479496    1340788   7417      22653     622980    1401
N. labeled genes    14486     17894     3370      1778      8592      625
N. essential genes  1590      1806      726       79        161       41
N. test labels      2898      3579      674       356       1719      125
The STRING [47] dataset assembles PPIs collected and scored from different 'evidence channels', depending on the origin and type of the biological evidence. A combined and final score is computed for each interaction, which is typically used as an estimate of the likelihood that a given interaction is biologically meaningful, given the supporting evidence. Heuristically, we filtered out connections with a confidence score below 0.5, aiming at reducing false positive interactions. As part of our experimental approach, we further analyzed this choice, as discussed in Section III-D.

The DIP dataset lists only proteins that bind to each other, while STRING and BioGRID may also include indirect interactions between proteins, which makes them much denser. We also note that we added self-connections for every node in the networks, a standard procedure for training GATs and GCNs in general (see Section II-B for further details).

Details of the collected PPI networks are summarized in Table I. The number of nodes and edges refer to the structure of the networks collected from the aforementioned databases. The number of labeled genes of each PPI network is the number of genes contained within the essential genes dataset that intersect with elements from the PPI network, including both positive and negative labels regarding essentiality. The positive labels, which are of special interest in our work, are shown in the "N. essential genes" field. Finally, the number of test labels is the size of the partitions reserved for model evaluation from the labeled genes dataset.
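The network preparation steps above (confidence filtering, self-connections, and the intersection with the labeled genes) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function name, the toy edge list, and the decision to keep nodes whose edges were all filtered out are assumptions, and scores are taken to lie in [0, 1].

```python
def preprocess_network(scored_edges, threshold=0.5):
    """Drop edges scoring below `threshold` and add one self-loop per node.

    Assumption: nodes are kept even if all of their edges are filtered out.
    """
    nodes = set()
    for u, v, _score in scored_edges:
        nodes.update((u, v))
    kept = {(u, v) for u, v, score in scored_edges if score >= threshold and u != v}
    self_loops = {(n, n) for n in nodes}
    return nodes, kept | self_loops

# Toy scored interactions (protein, protein, combined confidence score).
scored_edges = [("A", "B", 0.9), ("B", "C", 0.3), ("C", "D", 0.7)]
labels = {"A": 1, "B": 0, "E": 1}  # essentiality labels; "E" is absent from the network

nodes, edges = preprocess_network(scored_edges)

# Labeled genes of a network: genes with a label that also appear as nodes.
labeled = nodes & set(labels)
n_essential = sum(labels[g] for g in labeled)
print(sorted(edges), sorted(labeled), n_essential)
```

In this toy case the low-confidence edge B-C is removed, four self-loops are added, and only two of the three labeled genes intersect with the network, which is how the "N. labeled genes" row of Table I can be smaller than the size of the label set.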
3) Gene Expression Profiles:
In our multiomics-based approach, we used gene expression profiles from the Gene Expression Omnibus (GEO) database [48]. For E. coli, we collected gene expression data from GSE7326 and GSE40693, which we found empirically to be highly correlated with the E. coli essentialome. GSE7326 evaluates E. coli gene expression during cell death, whereas GSE40693 measures the transcriptomic changes of E. coli in response to an antimicrobial compound. For S. cerevisiae, we used the expression profiles from the GSE3431 dataset, which was previously applied for essential gene prediction [49]. GSE86354 and GSE67547 were used for H. sapiens and D. melanogaster, respectively. GSE86354 provides expression profiles for 1,558 samples across 8 tissue sites generated by the Genotype-Tissue Expression (GTEx) project, and GSE67547 expression is obtained over the lifespan of 120,000 fruit flies. Further information about gene expression data may be obtained in the GEO database through the datasets' accession numbers.
4) Subcellular Localization and Orthology Information:
As additional biological evidence, we used gene orthology information gathered from the InParanoid (v8) [50] database and protein subcellular localization data from the COMPARTMENTS database (version 8) [51]. We note that the COMPARTMENTS database does not provide information on subcellular localization for E. coli. For this reason, we evaluate this organism using only the gene expression and orthology information as additional data.
B. Graph Neural Networks
A PPI dataset may be denoted by a graph $G$ composed of a set of nodes $\mathcal{N} = \{1, \ldots, N\}$, representing proteins, and a set of edges $E$ reflecting the interactions among the proteins. Additionally, we may represent information about each protein in the network by adding attributes to the node that represents it. With this reasoning we frame essential gene prediction as a node classification problem over a PPI network attributed with additional multiomics data.

Traditional ML methods usually rely on preprocessing procedures in order to parse graph data and generate a tabular dataset. As neighborhood information is important in a graph, the entry of each node in the resulting tabular dataset would need to contain features of neighboring nodes. As the number of neighbors for each node varies, the resulting dataset would be unwieldy. Hence, methods that deal with raw graph data are desirable.

Graph Neural Networks (GNNs) comprise a family of ML methods that handle graph data without preprocessing. In this work we use a specific GNN method, namely Graph Attention Networks (GATs) [41]. GATs combine ideas of generalized convolutions [37], which allow graph nodes to aggregate information from their irregular neighborhoods, with self-attention mechanisms [52], which allow nodes to learn the relative importance of each neighbor during the aggregation process. Next, we describe GATs for node classification tasks, where the goal is to predict the class $c$ of each graph node out of $C$ possible classes.

In a GAT, each layer $l \in \{1, \ldots, L\}$ contains a vector-valued representation, also called embedding, $\vec{h}^{(l)}_i$, for each node $i \in \mathcal{N}$. For an arbitrary node $i$, the first layer, $\vec{h}^{(1)}_i$, may contain features that reflect properties of the node (e.g., multiomics protein data), domain knowledge, or just be randomly initialized. Embeddings of intermediate (hidden) layers represent increasingly "higher-level" latent features of the nodes. Embedding dimensions can be arbitrary in the hidden layers. The final layer, $\vec{h}^{(L)}_i$, is a vector of dimension $C$, containing the probability of node $i$ belonging to each one of the $C$ classes.

For a given node $i$, the embeddings of all its neighbors $j$ in a layer, $\vec{h}^{(l)}_j$, are used to compute $i$'s embedding in the next layer, $\vec{h}^{(l+1)}_i$. To make use of its own embedding, the set of $i$'s neighbors, $\mathcal{N}_i$, includes $i$ itself. Equation 1 shows the embedding calculation procedure, where $\sigma$ is a non-linear activation function, applied element-wise to the resulting vector, $W^{(l)}$ is the matrix of learnable weights of layer $l$, and $\alpha_{ij} \in [0, 1]$ is the learnable attention between $i$ and $j$, which indicates how strongly $i$ "listens" to information on node $j$.

$$\vec{h}^{(l+1)}_i = \sigma\Big( \sum_{j \in \mathcal{N}_i} \alpha_{ij} W^{(l)} \vec{h}^{(l)}_j \Big) \qquad (1)$$

Thus, the goal in GATs is to adjust the weights $W^{(l)}$ of each layer and the attention coefficients $\alpha_{ij}$ between every pair $i, j$ of nodes so as to capture latent node features and, ultimately, generate good class predictions. Note that the weights of each layer, $W^{(l)}$, are shared, i.e., all nodes use the same weight matrix to update their next layer's embeddings. Hence the importance of the self-attention mechanism: it weighs the importance of each neighbor when updating a node's embedding. Moreover, this value is also learned from data.

As node embeddings lie in a continuous space, GATs can be seen as a deep learning technique. This means that, if differentiable activation functions are used, GATs can be trained end-to-end with gradient descent methods.
C. Model Training and Performance Evaluation
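Before the training details, the layer update of Equation 1 (Section II-B) can be made concrete with a small, dependency-free sketch of a single attention head. The text above does not spell out how the attention coefficients are computed, so the sketch assumes the scoring used in the original GAT formulation [41] (a LeakyReLU over a learned vector applied to the concatenated transformed embeddings, normalized with a softmax over each neighborhood). The ReLU chosen for $\sigma$, the toy graph, and all numeric values are illustrative assumptions.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def leaky_relu(z, slope=0.2):
    return z if z > 0 else slope * z

def gat_layer(h, neighbors, W, a):
    """Single-head GAT layer: h_i' = sigma(sum_j alpha_ij * W h_j).

    h: node -> input embedding; neighbors: node -> list of neighbors (self included);
    W: weight matrix shared by all nodes; a: attention vector of length 2 * out_dim.
    """
    z = {i: matvec(W, h_i) for i, h_i in h.items()}  # shared linear transform
    out, attention = {}, {}
    for i in h:
        # Unnormalized attention scores over i's neighborhood (assumed as in [41]).
        scores = [leaky_relu(sum(ak * zk for ak, zk in zip(a, z[i] + z[j])))
                  for j in neighbors[i]]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        alphas = [e / sum(exps) for e in exps]
        attention[i] = dict(zip(neighbors[i], alphas))
        aggregated = [sum(alpha * z[j][k] for alpha, j in zip(alphas, neighbors[i]))
                      for k in range(len(W))]
        out[i] = [max(0.0, v) for v in aggregated]  # sigma = ReLU (assumption)
    return out, attention

# Toy path graph 0 - 1 - 2 with self-connections and 2-dimensional features.
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
W = [[0.5, -0.2], [0.1, 0.3]]   # illustrative weights
a = [0.4, -0.1, 0.2, 0.3]       # illustrative attention vector

h_next, attention = gat_layer(h, neighbors, W, a)
print(h_next)
print(attention)
```

Each node's attention coefficients sum to one over its neighborhood, so the update is a learned weighted average of the transformed neighbor embeddings. A multi-head layer, as listed among the hyperparameters of this work, would run several such heads with independent parameters and combine their outputs.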
For model training and evaluation, we randomly split the labeled data into a training set (80%) and a test set (20%) in a stratified manner, i.e., preserving the original class proportions in both sets. To better estimate the models' generalization power, ten random splits were used and performance metrics were averaged across repetitions. Additionally, we reserved a subset of the training dataset and used it for validation while training statistical methods (including GAT). The validation data is used for early stopping of the training procedure.

Essential genes compose only a small fraction of the coding genome of most organisms, including E. coli, S. cerevisiae, D. melanogaster, and H. sapiens. As a result, datasets of essential genes are heavily skewed towards negative cases. To account for this class imbalance, we trained EPGAT using the weighted binary cross-entropy (CE) function, defined as:

$$CE = -c_1\, y \log(p) - c_0\, (1 - y) \log(1 - p) \qquad (2)$$

where $y$ is the true label of the instance and assumed to be 1 (i.e., positive or essential) or 0 (i.e., negative or non-essential), $p$ is the probability predicted by the model for the positive class (and thus $1 - p$ is the probability predicted for the negative class), and $c_1$ and $c_0$ are inversely proportional to the number of instances in the training dataset for the positive and negative classes, respectively.

Training GAT models involves tuning several hyperparameters, which can be very time consuming, especially when a number of models are being developed, as in our work. To reduce computational costs, we first determined the best combination of hyperparameters for the S. cerevisiae networks using Optuna [53], a framework for hyperparameter optimization in ML. Next, based on the results from this analysis, we performed an empirical evaluation for the other organisms by manually varying the hyperparameters using a smaller number of combinations that had given good results for the S. cerevisiae networks. The hyperparameters used for EPGAT for each organism are listed in Table II.

In order to improve the generalization and robustness of our models, we applied the regularization techniques dropout [54] and L2 regularization. After every GAT layer, we added a dropout layer with probability $p$, specified for each organism as summarized in Table II. We also added a regularization term based on the L2-norm of the model's weights when computing the gradient. This causes the weights to decay to smaller values, thus biasing the learning procedure towards simpler solutions.

The performance was assessed using the area under the ROC curve (AUC). The ROC curve shows the performance of a classification model in distinguishing between two classes at distinct threshold settings, by plotting the corresponding True Positive Rate (TPR, on the y-axis) against the False Positive Rate (FPR, on the x-axis). The AUC score may be interpreted as the probability that the model ranks a random positive instance more highly than a random negative instance. Therefore, higher AUC scores indicate better classification models. The choice of this metric aims to allow a qualitative comparison with previous works and to better deal with the inherent class imbalance in our datasets, in contrast to measures such as accuracy.

TABLE II
HYPERPARAMETERS USED FOR TRAINING THE GAT MODELS FOR EACH ORGANISM.

Parameter         S. cerevisiae   D. melanogaster   E. coli & H. sapiens
Learning rate     0.005           0.005             0.005
Weight decay      2e-4            5e-4              5e-4
Hidden units      12              16                8
Attention heads   8               8                 8
Dropout           0.3             0.6               0.4
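The weighted cross-entropy of Equation 2 and the ranking interpretation of the AUC can both be checked on a toy example. The sketch below is illustrative only: the normalization of the class weights (n / (2 x class count)), the averaging of the loss over instances, the handling of ties in the AUC, and all labels and scores are assumptions, since the text only states that the weights are inversely proportional to the class counts.

```python
import math

def class_weights(labels):
    """Weights inversely proportional to class counts (assumed scaling: n / (2 * count))."""
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    return n / (2 * n_pos), n / (2 * n_neg)  # (positive weight, negative weight)

def weighted_bce(y, p, w_pos, w_neg, eps=1e-12):
    """Mean weighted binary cross-entropy over all instances."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)  # guard against log(0)
        total += -w_pos * yi * math.log(pi) - w_neg * (1 - yi) * math.log(1.0 - pi)
    return total / len(y)

def auc(y, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 1/2."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Toy imbalanced data: 2 essential genes out of 10 (made-up predictions).
y = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
p = [0.9, 0.2, 0.1, 0.3, 0.4, 0.6, 0.1, 0.2, 0.05, 0.3]

w_pos, w_neg = class_weights(y)
print(w_pos, w_neg)                      # 2.5 0.625
print(weighted_bce(y, p, w_pos, w_neg))
print(auc(y, p))                         # 0.9375
```

With these weights, a confidently wrong prediction on an essential gene is penalized four times as heavily as an equally wrong prediction on a non-essential one, which counteracts the tendency to predict the majority class. The AUC of 0.9375 follows from 15 of the 16 positive-negative pairs being ranked correctly.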
D. Baselines
As baselines in our study, we adopted approaches from the three classes of computational strategies most common in the related literature, namely, network topology measures, shallow ML algorithms, and node embedding coupled with deep learning.

As for topology measures, we used degree centrality (DC), which calculates the number of neighbors of a node i in the network; neighborhood centrality (NC), which is based on the edge clustering coefficient; and local average connectivity (LAC), which evaluates the local connectivity of node i's neighbors. For further details about these measures, we refer the reader to the survey by Li et al. [8].

The traditional, shallow ML algorithms multilayer perceptrons (MLP) and support vector machines (SVM) were also applied as baselines. These algorithms do not explicitly handle graph data as input. Therefore, a standard feature vector for each gene was constructed by concatenating its information from our multiomics dataset (i.e., gene expression, orthology, and subcellular localization) with information extracted from the STRING PPI network topology (i.e., node degree). Next, the feature vectors for all genes are stacked in a single table and used as input for training these models. The MLP network was configured with a single hidden layer of 32 units and trained with dropout to reduce overfitting. The SVM was trained using the radial basis function (RBF) kernel.

Finally, we compared our results against models built upon network features extracted by the node2vec (N2V) embedding method, as applied in previous works (e.g., [31], [33], [34]). N2V uses biased random walks on graphs to learn an embedding for each node, which captures the network structure in a regular format. These embeddings are then used as additional features for the MLP training set, so that we carry out a fair performance comparison with our GAT model. The parameters used for training the N2V-based MLP model were the same as in the comparison with shallow ML algorithms.
III. RESULTS AND DISCUSSION
Our experiments were implemented in Python using the scikit-learn library [55] for the general training and evaluation methodology, and for the SVM model; PyTorch [56] for the MLP models; and the PyTorch Geometric extension [57] for the graph-based models, namely EPGAT and N2V.

We set up our experiments aiming at i) comparing our approach to other network-based and ML approaches (both shallow and deep learning methods) used for gene or protein essentiality prediction and ii) assessing the impact on performance of the different choices of PPI databases and multiomics datasets. Performance for the trained models is shown as the mean and the standard deviation of ten runs using random stratified splits of the dataset. We carried out statistical comparisons by means of a two-tailed Student's t-test with a 95% confidence level to evaluate the significance of differences in the metrics.
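The statistical comparison described above can be sketched with the standard library alone. The sketch assumes an unpaired, equal-variance Student's t-test on two sets of ten AUC scores, which is one reading of the text; whether the original analysis used paired or unpaired samples is not stated. The AUC values are invented for illustration, and the constant 2.101 is the two-tailed critical value of the t distribution for 18 degrees of freedom at the 5% level, valid only for two samples of size ten.

```python
import math
from statistics import mean, variance

# Two-tailed critical value of Student's t for alpha = 0.05 and
# 18 degrees of freedom (two samples of ten runs each).
T_CRIT_DF18 = 2.101

def student_t(a, b):
    """Equal-variance (pooled) two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

# Hypothetical AUC scores over ten random stratified splits.
auc_model_a = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.90, 0.89, 0.91, 0.92]
auc_model_b = [0.85, 0.86, 0.84, 0.87, 0.85, 0.86, 0.85, 0.84, 0.86, 0.87]

t_stat = student_t(auc_model_a, auc_model_b)
significant = abs(t_stat) > T_CRIT_DF18
print(f"t = {t_stat:.2f}, significant at the 5% level: {significant}")
```

Comparing the statistic against a tabulated critical value avoids the need for a t-distribution CDF; a statistics package would report the p-value directly instead.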
A. Comparison with network topology-based methods
Our first batch of experiments compared the proposed EPGAT model with the network centralities DC, LAC, and NC. For the analysis of topology-based methods, genes in the testing set were ordered by their centrality measure in descending order, and the resulting rank was used for AUC score analysis.

The results of this comparison for the three PPI networks, namely STRING, BioGRID, and DIP, are shown in Figure 1. In most organisms and networks, our approach achieved a significant improvement over the baselines. The exceptions were the D. melanogaster DIP and BioGRID networks, for which none of the comparisons achieved statistical significance, and the E. coli DIP network, in which EPGAT was significantly superior only to LAC. In all other comparisons, EPGAT presented the highest performance with statistical significance (p < 0.05).

Surprisingly, the topology-based methods were less consistent than EPGAT. This is mainly perceived in the results for E. coli and D. melanogaster. The improvement brought by the proposed model in terms of performance values and stability is explained by two main factors. First, these baseline methods are analytical approaches that provide only a crude simplification of node patterns and roles within the PPI network, and thus probably capture only part of the essentiality-related information expected to be contained in this type of biological evidence (i.e., protein interaction networks). In contrast, GAT is an expressive neural network that is able to model and detect complex relationships by directly analyzing the complete graph structure, as opposed to centrality metrics derived from it. Second, our EPGAT model parses additional datasets, which provide richer and more complete information for the prediction task. As previously discussed, gene or protein essentiality is a multifactorial phenomenon, which is corroborated by our results. The contribution of each of these two factors is explored in the data ablation study presented in Section III-C.

[Figure 1: panels (a)-(d), one per organism (S. cerevisiae, E. coli, D. melanogaster, H. sapiens); y-axis: ROC AUC; x-axis: PPI database (STRING, BioGRID, DIP); methods: DC, LAC, NC, EPGAT.]
Fig. 1. Performance of EPGAT and network-based methods on each network database and organism. Metrics are averaged over 10 repetitions and the shaded areas display the standard deviation.

Regarding the poor performance of EPGAT for
D. melanogaster, we note that the networks provided by BioGRID and DIP are severely restricted for this organism. First, the average degree of these networks is relatively low compared with STRING and with other organisms. Second, while our dataset contains over twelve thousand labeled genes for D. melanogaster, the DIP and BioGRID networks only map interactions among 625 and 1,778 of them, respectively, so that the training set is also limited in contrast to other scenarios. Finally, it should be noted that the D. melanogaster essentialome was by far the most challenging one due to its severe class imbalance (1.95% of labeled genes are classified as essential). Considering the total number of nodes in the networks gathered from the BioGRID and DIP databases, only 4.21% and 3.18% of them are essential genes. These constraints, both in the number of samples and in the skewness of the class distribution, are hard to overcome with statistical methods. In the BioGRID network, EPGAT achieved results comparable to the baselines, whereas in the DIP network every method was equivalent to random guessing. We note, however, that the D. melanogaster STRING network contained over 8,000 labeled genes, and in this case EPGAT showed a significant improvement over the baselines, which indicates that, given enough data, our approach can learn important features even in highly imbalanced datasets.

Noticeably, the DIP network had the worst performance on every organism. For the D. melanogaster and H. sapiens networks, the performance of most methods was equivalent to random guessing. EPGAT was able to introduce improvements for the human DIP network, but its performance was still inferior to that obtained with BioGRID and STRING. As summarized in Table I, DIP is the database with the lowest number of mapped interactions. For D. melanogaster and H. sapiens, DIP provides extremely sparse networks, with average degrees of 1.44 and 1.6, respectively, in contrast to 4.47 for S. cerevisiae and 4.18 for E. coli. This characteristic impairs classification performance given the underlying functioning of GATs, which depend on a node's connectivity to update its embedding. Nonetheless, even for S. cerevisiae, for which DIP presents the highest average degree and ratio of labeled genes (i.e.,
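As an illustration of how the centrality baselines in this subsection are scored, the sketch below ranks genes by degree centrality (DC) and computes the AUC of that ranking, together with the average degree used above as a sparsity indicator. The toy edge list, the labels, and the function names are hypothetical, and the average degree is assumed here to be the mean number of distinct partners per node in the network; the paper's exact definition is not restated in this section.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Number of distinct interaction partners per node of an undirected edge list."""
    partners = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-interactions
            partners[u].add(v)
            partners[v].add(u)
    return {node: len(p) for node, p in partners.items()}


def average_degree(edges):
    """Mean degree over the nodes that have at least one interaction."""
    degrees = degree_centrality(edges)
    return sum(degrees.values()) / len(degrees)


def ranking_auc(scores, labels):
    """AUC of a descending ranking by score; labels maps gene -> 1 (essential) or 0."""
    pos = [scores.get(g, 0) for g, y in labels.items() if y == 1]
    neg = [scores.get(g, 0) for g, y in labels.items() if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy network and essentiality labels (not data from the paper).
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
toy_labels = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0}

dc = degree_centrality(toy_edges)
print(average_degree(toy_edges))   # 2.0
print(ranking_auc(dc, toy_labels))
```

Because the centralities need no training, the ranking is computed directly on the test genes; a sparse network yields many tied, low scores, which is one way to see why such rankings approach random guessing.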
B. Comparison with machine learning methods
We compared the proposed approach with three ML classifiers: MLP, SVM, and a node2vec-based model whose extracted features are further analyzed using an MLP (N2V). All classifiers were trained using the collected multiomics dataset (see Section II-A). Nonetheless, while the standard, shallow ML methods (SVM and MLP) relied on a structured table with features extracted from PPI networks and other omics data, the graph-based methods (i.e., EPGAT and N2V) are tailored to the task of learning gene essentiality-related patterns directly from the input network. We note, however, that they differ in how they create a low-dimensional feature vector for each node in the network, as well as in the strategy used to combine multiomics data with network-based features.

Figure 2 displays the performance of the EPGAT model in contrast to the baseline ML methods. All models are based on the STRING PPI network. Our approach notably improved the prediction performance over MLP and SVM for S. cerevisiae, E. coli, and H. sapiens, achieving statistically significant differences (p < 0.05) in these comparisons. MLP was significantly superior to SVM for E. coli, but in all other organisms the average performance of both shallow ML methods was very similar.

[Figure 2: panels (a)-(d), one per organism (D. melanogaster, H. sapiens, E. coli, S. cerevisiae).]
Fig. 2. Comparison between EPGAT and other machine learning approaches based on 10 random splits of labeled data. The models were evaluated using interaction information from the STRING network. Black bars display the standard deviation.

Regarding the comparison between EPGAT and N2V, overall, we observed a very competitive performance of our model in relation to the node2vec embedding, which is the state-of-the-art method used to learn from graph-structured data in related works. EPGAT achieved a superior performance with statistical significance for E. coli ( p = 0 . ) and S. cerevisiae ( p = 0 . ). For D. melanogaster, the difference was not statistically significant ( p = 0 . ); nonetheless, EPGAT had a very positive impact on the stability of the model, notably reducing the standard deviation obtained from the ten random splits of data in relation to the other methods, including N2V. Finally, EPGAT and N2V achieved very close performance for the identification of human essential genes (90.95 ± ± with p = 0 . ).

Thus, in general, our approach improved the identification of essential genes over previous ML methods for all organisms, either by increasing the average AUC score or by generating a more robust model (i.e., presenting lower variance in predictive performance). Our results show the importance of a more elaborate approach to deal with graphs in this prediction task. The advantage of approaches that preserve as much of the graph's properties and information as possible while identifying gene essentiality patterns is especially clear in the comparison between N2V and MLP, two models based on the MLP learning algorithm. By adding the node embeddings generated with node2vec to its feature set, N2V was significantly better (p < 0.05) than the shallow MLP for all organisms except D. melanogaster.

C. Data ablation study
To elucidate the importance of each type of biological evidence in our multiomics-based approach, we carried out an ablation study for the proposed EPGAT model. The performance of EPGAT was first evaluated with the PPI network (i.e., STRING, BioGRID, or DIP) without any additional attributes, followed by the incremental inclusion of other omics data to investigate their impact on the model AUC score. Results are shown in Table III, in which the best performance per organism network is highlighted in bold.

TABLE III
DATA ABLATION STUDY SHOWING AUC PERFORMANCE FOR THE GAT MODEL WITH INCREMENTALLY MORE OMICS DATA. DC CENTRALITY IS USED AS THE BASELINE.

Method   S. cerevisiae               E. coli
         DIP     BioGRID  STRING     DIP     BioGRID  STRING
DC       67.22   72.01    75.38      72.90   77.21    82.44
GAT      71.57 88.58 * 87.66 90.13 *

Method   H. sapiens                  D. melanogaster
         DIP     BioGRID  STRING     DIP     BioGRID  STRING
DC       57.64   81.22    79.75      52.65   80.25    68.83
GAT      63.70 87.81 90.34 * * *

* Statistically significant (p < 0.05) performance improvement in comparison to the previous model in the incremental analysis.

We perceive mixed results, with performance varying not only with the choice of data but also with the organism analyzed. Nonetheless, in general, additional types of biological evidence seem to positively impact the models' predictive power compared with the exclusive use of PPI networks. Notably, gene expression data showed great potential to boost performance for every choice of PPI network and organism. The only exception was the D. melanogaster DIP model, which had a very unstable performance due to the sparseness of this network (see Figure 2-d) and the large class imbalance in the labeled dataset. Therefore, results for this specific scenario may not be statistically relevant, as the previous AUC performance was already inferior to that of a random classifier. Interestingly, the use of additional omics data with the DIP network for S. cerevisiae, E. coli, and H. sapiens decreased the performance gap in relation to the other networks, despite the general limitation of sparseness observed in the DIP database.

Subcellular localization information had only a slight quantitative impact on the results, but in four out of the 12 scenarios analyzed (i.e., combinations of PPI networks and organisms), the EPGAT model trained with PPI plus gene expression and sublocalization data led to the best performance. This is an interesting result: although the complete multiomics models were the best in 5 of these scenarios, for E. coli we could not assess models trained with sublocalization due to data unavailability. Therefore, it is not possible to assert whether models with ortholog data were in fact the best framework, or whether this result is associated with the lack of sublocalization information for E. coli.

Finally, we note that even without other sources of omics data, our EPGAT model surpasses the DC baseline in most cases. A statistically better performance (p < 0.05) of EPGAT in relation to DC was observed in all models, except for the D. melanogaster DIP and BioGRID networks and for the E. coli DIP network. In these two models, DC achieved a higher AUC score, although without statistical significance considering a 95% confidence level.
D. Analysis of PPI thresholds on STRING network
The impact of the choice of PPI network database on model performance is quite noticeable from our experimental results summarized in Table III. This observation led us to examine the impact of filtering the networks used as input for EPGAT based on distinct confidence thresholds, since in the previous experiments we kept only interactions with a STRING confidence score higher than 0.5.

We reran the experiments with the EPGAT model in all organisms using different filtering thresholds. Results are given in Figure 3, in which the confidence thresholds are indicated on the x-axis. Experiments for H. sapiens were truncated at a confidence score of 0.3 due to memory constraints with denser networks.

The best AUC score was obtained at a threshold of 0.2 for E. coli (98.13%), 0.3 for H. sapiens (91.00%), 0.5 for S. cerevisiae (90.42%), and 0.1 for D. melanogaster (82.53%). Hence, for three of the four organisms the best score was obtained at a threshold below the 0.5 used in the previous experiments, which indicates a trend towards smaller thresholds achieving better performance. This finding corroborates our previous assertion that denser networks carry more information about gene essentiality, even if at the cost of more misinformation, and yield better results for our approach. Nonetheless, a surprising observation is that, for most organisms, EPGAT performance was relatively consistent even with largely diverging threshold values and, thus, different numbers of interactions within the PPI network. The variation between the best and the worst performance was relatively small for E. coli (0.069), S. cerevisiae (0.029), and H. sapiens (0.047). This result reinforces the robustness and potential of EPGAT for this prediction task.

For the D. melanogaster dataset, we observed an erratic behavior. Not only was the average performance unstable across distinct PPI filter thresholds, but the standard deviation was also consistently higher than for the other organisms. We believe this is caused by the limited volume and the class imbalance of the essential genes data. The D. melanogaster dataset comprised only 161 positively labeled genes in the STRING PPI network used in our work, so that any perturbation, even a small one, affects classification and exerts a significant impact on the AUC score. Moreover, as we may see in Table I, the D. melanogaster STRING PPI network is relatively large compared with those of S. cerevisiae and E. coli, which aggravates the situation, as changes in the filtering threshold lead to a more prominent perturbation of the network structure. Finally, we note that EPGAT models could be enhanced by optimizing the interaction filtering threshold on a case-by-case basis, which was not explored in the current work.

[Figure 3: x-axis: STRING PPI filter threshold; y-axis: ROC AUC; series: S. cerevisiae, E. coli, H. sapiens, D. melanogaster.]
Fig. 3. Performance with the STRING network under different filtering thresholds. Results on human were truncated due to memory limitations.
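The threshold analysis in this subsection amounts to rebuilding the input network for each confidence cutoff before retraining. A minimal sketch of that filtering step follows, assuming interactions are available as (protein, protein, confidence) triples with confidence on a 0 to 1 scale. The toy triples and function names are hypothetical, and retraining EPGAT on each filtered network is omitted.

```python
def filter_interactions(scored_edges, threshold):
    """Keep the interactions whose confidence score is higher than the threshold."""
    return [(u, v) for u, v, score in scored_edges if score > threshold]


def network_size_by_threshold(scored_edges, thresholds):
    """Map each threshold to (number of interactions, number of proteins) retained."""
    sizes = {}
    for threshold in thresholds:
        kept = filter_interactions(scored_edges, threshold)
        nodes = {node for edge in kept for node in edge}
        sizes[threshold] = (len(kept), len(nodes))
    return sizes


# Hypothetical scored interactions (not STRING data).
toy_scored_edges = [
    ("A", "B", 0.95),
    ("A", "C", 0.55),
    ("B", "C", 0.45),
    ("C", "D", 0.25),
    ("D", "E", 0.15),
]

sizes = network_size_by_threshold(toy_scored_edges, [0.1, 0.2, 0.3, 0.4, 0.5])
for threshold, (n_edges, n_nodes) in sizes.items():
    print(f"threshold {threshold}: {n_edges} interactions, {n_nodes} proteins")
```

Even in this toy case, raising the cutoff removes proteins from the network altogether, not just interactions, which mirrors the trade-off between network density and interaction reliability discussed above.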
E. GAT model interpretation
Our model is based on an attention mechanism, which weights every interaction between two genes i and j. Intuitively, the weight is a value $\alpha_{i,j} \in [0, 1]$ that expresses the importance of node j for the prediction returned for node i. In other words, the weights can be seen as the influence that a gene j exerts on the classification of another gene i. Therefore, the learned attention weights may improve model interpretability.

We visualize the attention weights for the H. sapiens STRING PPI network in Figure 4. For better visualization, we extracted a subgraph of highly connected nodes to reduce the network dimension. The weight is displayed by the transparency of each edge, such that strongly shaded edges mean that the interaction had a greater influence on the classification of the target gene. Green and red nodes correspond to essential and non-essential genes, respectively, while black nodes refer to network genes without a label (i.e., not included in the collected essential genes dataset).

Fig. 4. Subgraph of the human network with the STRING database. The shade of each edge in the network represents the weight given by GAT's attention mechanism: darker edges correspond to interactions to which the model assigned more weight for predictions, and lighter edges to interactions with less weight. Green and red nodes correspond to essential and non-essential genes, respectively, while black nodes correspond to genes without a label.

We observe an interesting phenomenon. Nodes with very few connections neither affect nor are strongly affected by their neighbors, which means that the model relies on the features given by the additional datasets for their prediction. Highly connected nodes (i.e., hubs), on the other hand, show the opposite effect, as they greatly affect the predictions of their neighbors. Complementing the centrality-lethality rule, which states that hubs have a higher likelihood of being essential, our model indicates that genes interacting with hubs are likely to have a similar essentiality status. In this network visualization, this is illustrated by the highly interacting negatively labeled genes in the lower left of Figure 4.

IV. CONCLUSION

In this study, we approached the gene essentiality prediction problem by using GATs, a type of GNN, combined with the integration of multiomics datasets. While most previous research in the field evaluates its algorithms on only one or two organisms (especially model organisms), we adopted a more comprehensive benchmark that included H. sapiens datasets. We showed that our approach, named EPGAT, achieved a significant improvement in performance compared with other network-based and machine learning methods commonly used in the field. Notably, results obtained by EPGAT were at least comparable (when not superior) to node2vec, a state-of-the-art method for gene or protein essentiality prediction (used, for instance, in [31], [33], [34]). Besides maintaining a slight edge, EPGAT has the benefits of a more straightforward training procedure and a shorter training time.

Our experiments indicated that denser PPI networks are more informative and reliable for gene essentiality prediction. However, even under the challenge posed by network sparseness (i.e., the D. melanogaster network in our study), EPGAT was the most robust ML method. Lastly, we observed that our approach can also provide insights into gene relations, as the GAT attention mechanism provides a measure of the influence among interacting genes. Further investigation of the interpretability of the model's attention weights may help shed light on network-based features of gene and protein essentiality patterns. Other interesting research directions include (i) expanding our approach to capture the context-dependent nature of gene essentiality, for instance, by integrating more omics datasets, and (ii) incorporating node-level feature selection techniques in our model to control dimensionality and improve its generalization power.

ACKNOWLEDGMENT
This work was partially supported by CNPq/Brazil and by CAPES Finance Code 001.

REFERENCES

[1] M. Juhas, L. Eberl, and J. I. Glass, “Essence of life: essential genes of minimal genomes,” Trends in Cell Biology, vol. 21, no. 10, pp. 562–568, 2011.
[2] E. V. Koonin, “How many genes can make a cell: the minimal-gene-set concept,” Annual Review of Genomics and Human Genetics, vol. 1, no. 1, pp. 99–116, 2000.
[3] T. Hart, M. Chandrashekhar, M. Aregger, Z. Steinhart, K. R. Brown, G. MacLeod, M. Mis, M. Zimmermann, A. Fradet-Turcotte, S. Sun et al., “High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities,” Cell, vol. 163, no. 6, pp. 1515–1526, 2015.
[4] M. S. Paul, A. Kaur, A. Geete, and M. E. Sobhia, “Essential gene identification and drug target prioritization in Leishmania species,” Molecular BioSystems, vol. 10, no. 5, pp. 1184–1195, 2014.
[5] D. Park, J. Park, S. G. Park, T. Park, and S. S. Choi, “Analysis of human disease genes in the context of gene essentiality,” Genomics, vol. 92, no. 6, pp. 414–418, 2008.
[6] J. Becker and C. Wittmann, “Systems and synthetic metabolic engineering for amino acid production–the heartbeat of industrial strain development,” Current Opinion in Biotechnology, vol. 23, no. 5, pp. 718–726, 2012.
[7] G. Rancati, J. Moffat, A. Typas, and N. Pavelka, “Emerging and evolving concepts in gene essentiality,” Nature Reviews Genetics, vol. 19, no. 1, p. 34, 2018.
[8] X. Li, W. Li, M. Zeng, R. Zheng, and M. Li, “Network-based methods for predicting essential genes or proteins: a survey,” Briefings in Bioinformatics, vol. 21, no. 2, pp. 566–583, Feb 2019.
[9] X. He and J. Zhang, “Why do hubs tend to be essential in protein networks?” PLoS Genetics, vol. 2, no. 6, p. e88, 2006.
[10] H. Jeong, S. P. Mason, A.-L. Barabási, and Z. N. Oltvai, “Lethality and centrality in protein networks,” Nature, vol. 411, no. 6833, pp. 41–42, May 2001.
[11] M. Li, J. Wang, X. Chen, H. Wang, and Y. Pan, “A local average connectivity-based method for identifying essential proteins from the network level,” Computational Biology and Chemistry, vol. 35, no. 3, pp. 143–150, 2011.
[12] J. Wang, M. Li, H. Wang, and Y. Pan, “Identification of essential proteins based on edge clustering coefficient,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 4, pp. 1070–1080, 2011.
[13] X. Zhang, M. L. Acencio, and N. Lemke, “Predicting essential genes and proteins based on machine learning and network topological features: a comprehensive review,” Frontiers in Physiology, vol. 7, p. 75, 2016.
[14] M. A. Mahdavi and Y.-H. Lin, “False positive reduction in protein-protein interaction predictions using gene ontology annotations,” BMC Bioinformatics, vol. 8, no. 1, p. 262, Jul 2007.
[15] X. Tang, J. Wang, J. Zhong, and Y. Pan, “Predicting essential proteins based on weighted degree centrality,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 11, no. 2, pp. 407–418, 2014.
[16] B. Zhao, J. Wang, X. Li, and F.-X. Wu, “Essential protein discovery based on a combination of modularity and conservatism,” Methods, vol. 110, pp. 54–63, 2016.
[17] X. Peng, J. Wang, J. Wang, F.-X. Wu, and Y. Pan, “Rechecking the centrality-lethality rule in the scope of protein subcellular localization interaction networks,” PloS One, vol. 10, no. 6, p. e0130743, 2015.
[18] G. Li, M. Li, J. Wang, Y. Li, and Y. Pan, “United neighborhood closeness centrality and orthology for predicting essential proteins,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018.
[19] M. Li, Y. Lu, Z. Niu, and F.-X. Wu, “United complex centrality for identification of essential proteins from PPI networks,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 14, no. 2, pp. 370–380, 2015.
[20] W. Peng, J. Wang, Y. Cheng, Y. Lu, F. Wu, and Y. Pan, “UDoNC: an algorithm for identifying essential proteins based on protein domains and protein-protein interaction networks,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 2, pp. 276–288, 2014.
[21] J. Deng, L. Deng, S. Su, M. Zhang, X. Lin, L. Wei, A. A. Minai, D. J. Hassett, and L. J. Lu, “Investigating the predictability of essential genes across distantly related organisms using an integrative approach,” Nucleic Acids Research, vol. 39, no. 3, pp. 795–807, 2011.
[22] W. Kim, “Prediction of essential proteins using topological properties in GO-pruned PPI network based on machine learning methods,” Tsinghua Science and Technology, vol. 17, no. 6, pp. 645–658, 2012.
[23] M. Li, Z. Niu, X. Chen, P. Zhong, F. Wu, and Y. Pan, “A reliable neighbor-based method for identifying essential proteins by integrating gene expressions, orthology, and subcellular localization information,” Tsinghua Science and Technology, vol. 21, no. 6, pp. 668–677, 2016.
[24] G. Li, M. Li, J. Wang, J. Wu, F.-X. Wu, and Y. Pan, “Predicting essential proteins based on subcellular localization, orthology and PPI networks,” BMC Bioinformatics, vol. 17, no. 8, pp. 571–581, 2016.
[25] X. Lei, J. Zhao, H. Fujita, and A. Zhang, “Predicting essential proteins based on RNA-Seq, subcellular localization and GO annotation datasets,” Knowledge-Based Systems, vol. 151, pp. 136–148, 2018.
[26] F.-B. Guo, C. Dong, H.-L. Hua, S. Liu, H. Luo, H.-W. Zhang, Y.-T. Jin, and K.-Y. Zhang, “Accurate prediction of human essential genes using only nucleotide composition and association information,” Bioinformatics, vol. 33, no. 12, pp. 1758–1764, 2017.
[27] A. M. Gustafson, E. S. Snitkin, S. C. Parker, C. DeLisi, and S. Kasif, “Towards the identification of essential genes using targeted genome sequencing and comparative analysis,” BMC Genomics, vol. 7, no. 1, pp. 1–16, 2006.
[28] M. L. Acencio and N. Lemke, “Towards the prediction of essential genes by integration of network topology, cellular localization and biological process information,” BMC Bioinformatics, vol. 10, no. 1, p. 290, 2009.
[29] S. Min, B. Lee, and S. Yoon, “Deep learning in bioinformatics,” Briefings in Bioinformatics, vol. 18, no. 5, pp. 851–869, 2017.
[30] S. Jin, X. Zeng, F. Xia, W. Huang, and X. Liu, “Application of deep learning methods in biological networks,” Briefings in Bioinformatics, p. bbaa043, 2020.
[31] M. Zeng, M. Li, Z. Fei, F. Wu, Y. Li, Y. Pan, and J. Wang, “A deep learning framework for identifying essential proteins by integrating multiple types of biological information,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2019.
[32] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.
[33] M. Zeng, M. Li, F.-X. Wu, Y. Li, and Y. Pan, “DeepEP: a deep learning framework for identifying essential proteins,” BMC Bioinformatics, vol. 20, no. 16, p. 506, 2019.
[34] X. Zhang, W. Xiao, and W. Xiao, “DeepHE: Accurately predicting human essential genes based on deep learning,” bioRxiv, 2020.
[35] M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., vol. 2. IEEE, 2005, pp. 729–734.
[36] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
[37] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[38] T. Nguyen, H. Le, and S. Venkatesh, “GraphDTA: prediction of drug–target binding affinity using graph convolutional networks,” bioRxiv, p. 684662, 2019.
[39] R. Schulte-Sasse, S. Budach, D. Hnisz, and A. Marsico, “Graph convolutional networks improve the prediction of cancer driver genes,” in International Conference on Artificial Neural Networks. Springer, 2019, pp. 658–668.
[40] S. Rhee, S. Seo, and S. Kim, “Hybrid approach of relation network and localized graph convolutional filtering for breast cancer subtype classification,” arXiv preprint arXiv:1711.05859, 2017.
[41] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” 2017.
[42] E. A. Winzeler, D. D. Shoemaker, A. Astromoff, H. Liang, K. Anderson, B. Andre, R. Bangham, R. Benito, J. D. Boeke, H. Bussey et al., “Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis,” Science, vol. 285, no. 5429, pp. 901–906, 1999.
[43] T. Baba, T. Ara, M. Hasegawa, Y. Takai, Y. Okumura, M. Baba, K. A. Datsenko, M. Tomita, B. L. Wanner, and H. Mori, “Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection,” Molecular Systems Biology, vol. 2, no. 1, pp. 2006–0008, 2006.
[44] W.-H. Chen, G. Lu, X. Chen, X.-M. Zhao, and P. Bork, “OGEE v2: an update of the online gene essentiality database with special focus on differentially essential genes in human cancer cell lines,” Nucleic Acids Research, p. gkw1013, 2016.
[45] L. Salwinski, C. S. Miller, A. J. Smith, F. K. Pettit, J. U. Bowie, and D. Eisenberg, “The database of interacting proteins: 2004 update,” Nucleic Acids Research, vol. 32, no. suppl 1, pp. D449–D451, 2004.
[46] R. Oughtred, C. Stark, B.-J. Breitkreutz, J. Rust, L. Boucher, C. Chang, N. Kolas, L. O'Donnell, G. Leung, R. McAdam et al., “The BioGRID interaction database: 2019 update,” Nucleic Acids Research, vol. 47, no. D1, pp. D529–D541, 2019.
[47] D. Szklarczyk, A. L. Gable, D. Lyon, A. Junge, S. Wyder, J. Huerta-Cepas, M. Simonovic, N. T. Doncheva, J. H. Morris, P. Bork et al., “STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets,” Nucleic Acids Research, vol. 47, no. D1, pp. D607–D613, 2019.
[48] T. Barrett, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman, M. Holko et al., “NCBI GEO: archive for functional genomics data sets - update,” Nucleic Acids Research, vol. 41, no. D1, pp. D991–D995, 2012.
[49] Y. Fan, X. Hu, X. Tang, Q. Ping, and W. Wu, “A novel algorithm for identifying essential proteins by integrating subcellular localization,” in , 2016, pp. 107–110.
[50] E. L. Sonnhammer and G. Östlund, “InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic,” Nucleic Acids Research, vol. 43, no. D1, pp. D234–D239, 2014.
[51] J. X. Binder, S. Pletscher-Frankild, K. Tsafou, C. Stolte, S. I. O'Donoghue, R. Schneider, and L. J. Jensen, “COMPARTMENTS: unification and visualization of protein subcellular localization evidence,” Database, vol. 2014, 2014.
[52] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[53] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama, “Optuna: A next-generation hyperparameter optimization framework,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 2623–2631.
[54] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[55] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[56] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in , 2017.
[57] M. Fey and J. E. Lenssen, “Fast graph representation learning with PyTorch Geometric,” arXiv preprint arXiv:1903.02428.