Analysis of Gene Interaction Graphs as Prior Knowledge for Machine Learning Models
Paul Bertin, Mohammad Hashir, Martin Weiss, Vincent Frappier, Theodore J. Perkins, Geneviève Boucher, Joseph Paul Cohen
Paul Bertin
Mila, Université de Montréal, Montréal, Canada

Mohammad Hashir
Mila, Université de Montréal, Montréal, Canada

Martin Weiss
Mila, Université de Montréal, Montréal, Canada

Vincent Frappier
Mila, Université de Montréal, Montréal, Canada

Theodore J. Perkins
Ottawa Hospital Research Institute, University of Ottawa, Ottawa, Canada

Geneviève Boucher
Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, Canada

Joseph Paul Cohen
Mila, Université de Montréal, Montréal, Canada
Abstract
Gene interaction graphs aim to capture various relationships between genes and can represent decades of biology research. When trying to make predictions from genomic data, those graphs could be used to overcome the curse of dimensionality by making machine learning models sparser and more consistent with biological common knowledge. In this work, we focus on assessing how well those graphs capture the dependencies seen in gene expression data, in order to evaluate the adequacy of the prior knowledge they provide. We propose a condition graphs should satisfy to provide good prior knowledge and test it using ‘Single Gene Inference’ tasks. We also compare with randomly generated graphs, aiming to measure the true benefit of using biologically relevant graphs in this context, and validate our findings with five clinical tasks. We find that some graphs capture relevant dependencies for most genes while being very sparse. Our analysis with random graphs finds that dependencies can be captured almost as well at random, which suggests that, in terms of gene expression levels, the relevant information about the state of the cell is spread across many genes. Our source code is available on GitHub.
A major challenge in using machine learning on gene expression data is overcoming the curse of dimensionality (Yeung and Ruzzo, 2001; Kim and Tidor, 2003; Lee et al., 2008; Zuber and Strimmer, 2009; Heimberg et al., 2016). The number of samples in most datasets is typically much smaller than the number of genes. Making predictions based on gene expression can be problematic as the
model would tend to overfit and learn the noise and spurious correlations in place of any biologically relevant patterns.

Penalizing model complexity can reduce this problem (Gustafsson et al., 2005; Cawley and Talbot, 2006; Ma and Huang, 2009; Simon et al., 2013; Min et al., 2018) and can simultaneously perform feature selection, identifying relevant genes. However, gene selection is not generally stable unless special algorithms are used (He and Yu, 2010). Moreover, regularization typically emphasizes some general notion of sparsity rather than any specific prior biological knowledge. Some recent works have looked at incorporating biological knowledge into regularization, e.g. based on pathway relationships (Simon et al., 2013; Min et al., 2018).

In this work, we focus on gene interaction graphs as a form of prior biological knowledge for machine learning models. Many groups have developed a number of gene interaction graphs, structuring domain knowledge from different areas of molecular biology (Ogris et al., 2018; Warde-Farley et al., 2010; Himmelstein and Baranzini, 2015; Himmelstein et al., 2017; Lee et al., 2011; Hwang et al., 2019; Subramanian et al., 2017; Liu et al., 2015; Kanehisa et al., 2017; Szklarczyk et al., 2019). These graphs can represent any number of different biological, molecular, or phenomenological relationships such as protein-protein interactions, transcriptional regulation, transcriptional co-regulation, and co-expression at the mRNA or protein levels. Gene interaction graphs can be used with machine learning algorithms as a proxy for biological intuition to leverage decades of biology research (Zhang et al., 2017). They can act as a biological prior on machine learning techniques to automate feature importance and selection and help to overcome the curse of dimensionality.

Corresponding authors: [email protected], [email protected] or [email protected]

Preprint. Under review.
For example, network-based linear regression (Li and Li, 2008; Min et al., 2016) regularizes the weights of a linear model based on the connectivity of the nodes found in an interaction graph. Preliminary work by Rhee et al. (2018) and Dutil et al. (2018) found that the same can be done for non-linear models. This would be useful in the majority of tasks where gene expression or single-nucleotide polymorphism data is the input. As these graphs were not developed as an input for machine learning applications, there is value in investigating whether they can aid machine learning algorithms. Thus, there is a need for a systematic approach to assess the value of this prior biological knowledge for machine learning pipelines.

We propose a method to quantitatively evaluate the prior knowledge provided by gene interaction graphs: we formulate a condition that a graph should satisfy to provide good prior knowledge for machine learning algorithms, and then test to which extent this condition holds. Seven major gene interaction graphs created by different research groups (which we refer to as ‘curated graphs’) were tested. We compare curated graphs with randomly generated graphs, aiming to measure the true benefit of using biologically relevant graphs in this context. We also perform experiments on clinical prediction tasks: we compare the performance of models provided with different graph-based prior knowledge in order to validate our evaluation method.

With this work, we aim to gain a deeper insight into the behavior of machine learning pipelines that make use of graph-based prior knowledge when applied to gene expression data. This effort can help in improving interpretability, which is very important for genomics as it is a domain where we have relatively limited intuition compared to images or text.
Having more interpretable models could provide a “research gradient” to biologists, allowing them to focus on specific subgroups of genes, which could lead to a fruitful feedback loop between biological experiments and machine learning predictions. Interpreting those models could also help in generating new hypotheses that may be validated with experiments. Our approach is different from those of most module detection methods (Saelens et al., 2018), which infer modules of genes that work together from gene expression data and typically aim at better understanding biological mechanisms. Our method, on the other hand, was developed to facilitate the use of pre-established biological knowledge to exploit gene expression data more efficiently and make accurate predictions.

For a set of gene expressions {g_i}, i ∈ [1..N], where N is the number of genes, the joint probability of the expressions is denoted by P(g_1, ..., g_N). We hypothesize that there is a true causal (directed) graph H that generated this distribution: each gene expression was generated by a function f such that g_i = f(PA_i, ε), where PA_i refers to the set of expressions of the parents of node i and ε is some random noise.

In order to provide “good” prior knowledge, we would like our gene interaction graph to be equivalent to the hypothesized true causal graph. Figure 1 illustrates the relationship between the two graphs.

Definition
A “good” prior knowledge is a gene interaction (undirected) graph G which covers the moralized counterpart of the true causal graph H, i.e. if an edge i → j exists in H, the edge {i, j} should exist in G, and if two nodes have a common child in H, an edge should exist between them in G.

Inclusivity Property
If an interaction graph G is a “good” prior knowledge, then for any gene node i, the Markov blanket of i in the true causal graph is contained in the set of neighbours of node i in G. Equivalently, if G is a “good” prior knowledge, the following holds for all gene nodes i:

P(g_i | g_{-i}) = P(g_i | neighbours_i)    (1)

where g_{-i} = {g_j}, j ≠ i, is the set that contains every gene expression except the i-th one, and neighbours_i = {g_j | ∃ edge {i, j} in graph G}.

If the equality (Eq. 1) holds for a gene i, it means that the conditional probability of g_i given all the other genes depends only on the first-degree neighbours of g_i. Note that the inclusivity property does not ensure that the interaction graph has no spurious edges. An edge {i, j} in the interaction graph is called spurious if it does not exist in the moralized counterpart of the hypothesized true causal graph. The detection of spurious edges is not our main concern as we deal with fairly sparse graphs. The goal of this work is to identify graphs that are sparse while still satisfying the inclusivity property.

Figure 1: We define a “good” prior knowledge as an interaction graph G that covers the moral (undirected) graph equivalent to the hypothesized causal graph, i.e. edges in G cover all edges found in the hypothesized true causal graph, and parents with a common child are connected in G. G could also contain other spurious edges (in red) while still satisfying the inclusivity property, as in this example.

As there is no direct way to test the equality (Eq. 1), we model the conditional probability of the expression of gene g_i given all the other genes with a neural network. This task of predicting a gene expression value given a set of other gene expressions is similar to the Single Gene Inference task formulated in Dutil et al. (2018), which was inspired by Chen et al. (2016) and Subramanian et al.
(2017).

For each gene i, we train two different models that try to predict its expression level g_i. The first model takes all the other genes as input and the second takes only the first-degree neighbours as input. If the equality holds for g_i, both models should achieve similar performance and approximate the conditional probability equally well. We can even expect slightly better performance from the second model, as its input is lower dimensional and supposed to contain only relevant information. Conversely, if the equality does not hold for g_i, then we expect the second model to achieve poorer performance as it is provided with incomplete information. Figure 2 shows a schematic view of our method.

We restrict the prediction of g_i to a binary classification task to simplify interpretation of the results. The alternative is to define a regression task, but depending on the pattern of expression of the gene, its range, its level of noise or any other specificity, the regression metric for a given gene (e.g. R-squared) can be arbitrarily high even for a good fit. A reduction of performance in some genes can be missed when looking at global statistics of the regression task. The key point here is that we are interested in aggregated results (e.g. mean over all genes), which requires metrics that are comparable between genes. We believe that the area under the ROC curve (AUC) of a binary classification task matches those requirements. Another reason we use classification is that we are not aiming to precisely model gene expression patterns; rather, we intend to compare the two models to know whether the equality (Eq. 1) holds or not.

Figure 2: Given a graph and a target gene g_i, we train two different models taking first-degree neighbours and all genes as input, respectively. They model P(g_i | neighbours_i) and P(g_i | g_{-i}), respectively.
We can compare the performance of the two models to infer the quality of the prior knowledge provided by the graph for g_i.

We perform our analysis with gene expression data not only from healthy cells, but also from cancerous cells where biological processes might somehow be perturbed, giving us an idea of the usefulness of using gene interaction graphs in different contexts. We use the TCGA dataset (Weinstein et al., 2013), where values are log(x + 1)-transformed RSEM values (Li and Dewey, 2011), and the GTEx dataset (Lonsdale et al., 2013), where values are RPKMs (Mortazavi et al., 2008). Both those datasets are public and well-studied. The TCGA PANCAN database spans multiple tissues and measures 20,530 gene expressions for 10,459 samples; most samples come from cancer biopsies but many healthy examples are also included. The GTEx dataset consists of samples mostly from healthy subjects and has a higher number of genomic features (34,218 features) but only 2,921 samples. We normalize both datasets by the mean (gene-wise) for our analysis.

We evaluate six existing graphs covering a variety of relationships in the genome, namely FunCoup, GeneMANIA, Hetionet, HumanNet, RegNet, and StringDB. In addition, we create a graph from the Landmark gene set (Subramanian et al., 2017). We refer to these seven graphs as ‘curated’ graphs. We also create graphs with n neighbours, selected randomly, for each gene in a given dataset, which we call random R-n graphs.

We made several simplifications to be able to use all the graphs in a similar manner. The naming convention of genes differed from one graph to another; we used the HUGO gene symbols as a common schema and generated conversion maps from the HUGO Gene Nomenclature Committee website. We renamed the nodes in each graph to their HUGO equivalent to ensure uniformity and have an easier comparison between graphs.
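The renaming and edge-handling simplifications can be sketched in a few lines. This is an illustrative toy example: the edge list, the weights, and `hugo_map` are placeholders, not the actual conversion maps generated from the HGNC website.

```python
from collections import defaultdict

# Toy (possibly weighted) edge list and a hypothetical HUGO conversion map.
edges = [("P53", "MDM2", 0.9), ("P53", "EP300", 0.4), ("MDM2", "EP300", 0.2)]
hugo_map = {"P53": "TP53"}  # nodes absent from the map keep their name

# Collapse into binary, undirected neighbour sets; weights are discarded.
neighbours = defaultdict(set)
for a, b, _weight in edges:
    a, b = hugo_map.get(a, a), hugo_map.get(b, b)
    if a != b:
        neighbours[a].add(b)
        neighbours[b].add(a)

# neighbours["TP53"] is now {"MDM2", "EP300"}
```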
Further, we considered edges to be binary (present or absent) and did not take into account any weights that might have been associated with them.

(GTEx data: version 2016-08-16, https://cbcl.ics.uci.edu/public_data/D-GEX/GTEx_RNASeq_RPKM_n2921x55993.gctx)

Fully Connected

We consider that the baseline model taking all gene expressions except the target as input can be represented as a ‘fully connected’ graph. Indeed, in the fully connected graph, the first-degree neighbours of a gene are all the other genes.
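The two-model comparison (first-degree neighbours versus the fully connected baseline) can be sketched as follows. This is a minimal illustration on synthetic data with scikit-learn, assuming placeholder gene indices; the actual architecture and training details are given in the experimental setup below.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic expression matrix: 600 samples x 50 genes (placeholder data).
X = rng.normal(size=(600, 50))
target = 0                # index of the gene whose expression we predict
neighbours = [1, 2, 3]    # hypothetical first-degree neighbours of the target

# Binarize the target around its mean expression (over- vs under-expressed).
y = (X[:, target] > X[:, target].mean()).astype(int)

def sgi_auc(feature_idx):
    """Train a small MLP on the given features and return its test AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, feature_idx], y, test_size=0.33, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
    clf.fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

all_other = [j for j in range(X.shape[1]) if j != target]
auc_all = sgi_auc(all_other)    # model conditioned on every other gene
auc_nbrs = sgi_auc(neighbours)  # model conditioned on graph neighbours only

# If the graph captured the target's Markov blanket, the two AUCs are close.
improvement = auc_nbrs - auc_all
```

On real data, a near-zero or positive `improvement` suggests the neighbour set carries the relevant conditional information.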
FunCoup
FunCoup (Ogris et al., 2018) used a Bayesian algorithm and orthology-based information transfer to infer functional associations. It builds interaction networks for 17 organisms; we use the network for humans. The edges in FunCoup have been given confidence scores for supporting different kinds of interactions such as metabolic, signaling, complex and physical protein interactions. It contains 5,505,787 edges covering 16,765 gene nodes.
GeneMANIA (Warde-Farley et al., 2010) is an interaction network built from hundreds of datasets collected from GEO, BioGRID, Pathway Commons and I2D. Ridge regression as well as a label propagation algorithm were used to predict gene functions. It contains 264,657 edges covering 16,297 gene nodes.

Hetionet (Himmelstein and Baranzini, 2015; Himmelstein et al., 2017) is a heterogeneous biological network consisting of multiple node and edge types which was created using public biological databases. It contains around 47,000 nodes of 11 types such as genes, compounds, pathways, etc., with more than 2 million edges. The gene nodes in Hetionet were curated from Entrez Gene and have three types of edges between them: interaction, co-variation and regulation. In our analysis, we combine all these edges and create an undirected graph to represent the gene sub-network in Hetionet. This graph has 471,046 edges covering 17,681 gene nodes.
HumanNet (Lee et al., 2011; Hwang et al., 2019) contains functional associations that were inferred from many sources such as protein-protein interactions (PPIs), gene co-expression, protein co-occurrence, and genomic contexts. Further, the original graphical database was improved by augmenting its data sources, such as the source texts for the co-citation network and interologs from other species. HumanNetV2 has an inclusive hierarchy in its constituent gene networks. We use the highest level of HumanNetV2, called HumanNet-XN, in our analysis; it contains 476,399 edges covering 16,243 gene nodes.
RegNetwork (Liu et al., 2015) is composed of experimental and predicted up/down regulating effects between genes and includes information from KEGG (Kanehisa et al., 2017). RegNetwork contains 247,848 edges covering 7,220 gene nodes.
STRINGdb (Szklarczyk et al., 2019) is a network of protein-protein interactions. It includes several types of interactions, among which are co-citation in research papers (text mining), co-expression, gene neighbourhood, and involvement in common metabolic pathways. For each protein, we retrieve the corresponding protein-coding gene(s) in order to create a gene-gene interaction network. StringDB contains different sub-graphs depending on the interaction type; we use the co-expression graph and the entire graph in our analysis. The entire graph contains 5,703,510 edges between 18,851 gene nodes and the co-expression graph contains 2,945,888 edges between 18,520 gene nodes.
Landmark Graph
We generate a separate graph based on the Landmark genes (Subramanian et al., 2017). Those 978 genes were chosen to optimally recover the observed connections seen in the pilot Connectivity Map dataset. We build a graph in which each gene in a given dataset is connected to the set of the 978 landmark genes. The latter set of genes are themselves connected together, forming a clique. We refer to this graph as the
Landmark graph.

Random R-n graphs

We generated random graphs with a fixed number n of randomly sampled neighbours from the set of genes in the dataset. For each target gene in the dataset, we sample n other genes from the dataset and connect them to the target gene node to create an R-n graph. We created 15 such graphs with n varying between 10 and 10,000.

(BioGRID: https://thebiogrid.org/ ; I2D: http://ophid.utoronto.ca/ophidv2.204/)

Metrics on graphs

In this section, we introduce two graph metrics which aid in the evaluation of a graph as prior knowledge for machine learning models. Different graphs have different numbers of nodes, and we are interested in graphs that contain a good proportion of the genes in a given dataset as nodes. As the final purpose is to use the graph to include prior knowledge in machine learning models, it is important to have a graph with sufficient coverage of all genes present in a dataset, as opposed to a small subset of well-studied genes. Moreover, the distribution of degrees also varies significantly across graphs.

We define two metrics, called coverage and sparsity, to have a clearer picture of how graphs can differ from each other. The coverage is the percentage of the genes in the dataset that are present in the graph as nodes. The sparsity is a measure of the reduction in edges of a graph relative to the fully connected graph. It is calculated by subtracting the ratio of the number of edges in the graph to the number of edges in the fully connected graph (which is dataset-dependent) from 1, and is then represented as a percentage.
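The two metrics can be computed directly from gene sets and edge lists. A minimal sketch, assuming the fully connected graph on n genes has n(n−1)/2 undirected edges, with toy gene names:

```python
def coverage(dataset_genes, graph_genes):
    """Percent of dataset genes present as nodes in the graph."""
    return 100.0 * len(dataset_genes & graph_genes) / len(dataset_genes)

def sparsity(dataset_genes, graph_edges):
    """Percent reduction in edges relative to the fully connected graph."""
    n = len(dataset_genes)
    full = n * (n - 1) / 2  # edges in the fully connected (undirected) graph
    # Count only edges whose endpoints are both present in the dataset.
    e = sum(1 for a, b in graph_edges
            if a in dataset_genes and b in dataset_genes)
    return 100.0 * (1 - e / full)

d = {"A", "B", "C", "D"}            # genes in the dataset
g_nodes = {"A", "B", "C"}           # genes covered by the graph
g_edges = [("A", "B"), ("B", "C")]  # graph edges

# coverage(d, g_nodes) -> 75.0 ; sparsity(d, g_edges) -> ~66.7
```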
Coverage = n(D ∩ G) / n(D) × 100

Sparsity = (1 − E_g / (n(D) × (n(D) − 1) / 2)) × 100

where D and G are the sets of the genes present in a particular dataset and graph respectively, E_g is the number of edges in the sub-graph of the graph that contains only the genes present in the dataset, i.e. the intersection set D ∩ G, and n(·) represents the cardinality of the set, i.e. the total number of elements in the set.

In order to predict the over- or under-expression of the target gene, we began by binarizing its expression values based on that gene's mean expression. Then, two multi-layer perceptrons (MLPs) were trained to predict the target gene expression level using the two types of inputs mentioned in Section 2.2: all the genes (the baseline) and only the first-degree neighbours. The AUC was computed on a test set after training. If the target gene had no neighbours in the graph, an AUC of 0.5 was assigned because the absence of any input features made the prediction a random guess.

We performed these experiments for all seven curated graphs and 15 R-n graphs with both datasets and all the genes in each dataset. We ran three trials for each combination of a gene, graph and dataset and averaged all metrics across the trials for a robust evaluation. To ensure small overlap of example sets between trials, we used 3,000 samples for TCGA and 1,500 samples for GTEx with equal splits between the training, testing and validation sets. The data and also the gene neighbours for the R-n graphs were randomly sampled for every trial, but the data remained the same for every gene regardless of graph.

Protocol
For all datasets, graphs and genes, we:

1. Binarize the target gene expression values with a threshold based on the mean expression for the target gene.
2. Find the neighbours of the gene in the graph. If the gene is not present in the graph or has no neighbours, an AUC of 0.5 is assigned and the next steps are skipped.
3. Fit a prediction model on the subset of the dataset corresponding to the first-degree neighbourhood.

We utilized a multilayer perceptron (MLP) with a single hidden layer of 16 neurons and ReLU activation functions. The binary cross-entropy loss was used with an Adam optimizer and a learning rate of 0.001 for the FunCoup, Hetionet, and fully connected graphs, and a different rate for the rest, on all datasets. Weight decay was also used. These hyperparameters were obtained with a search over the different graphs and datasets and 20 genes. We used early stopping based on the AUC computed on a validation set. The MLP achieved slightly better performance than logistic regression on average over 20 genes. L1 regularization was not used as it did not improve performance.

We present results showing the performance of our models, as well as aggregated statistics computed for the different curated graphs. We then compare with random graphs and evaluate the curated graphs on clinical tasks. For our analysis, we defined three distinct types of gene sets for each dataset. The first was
All Genes, which consisted of all the genes in the dataset and was the same for all graphs. The second was
Covered Genes, which referred to the set of genes in the dataset covered by a given graph. Each graph's set of Covered Genes was unique. We report several results on both these sets. We decided to do this because our protocol of assigning an AUC of 0.5 to uncovered genes makes the performance of graphs artificially poorer when they have low coverage, while it would not be equitable to compare different graphs only on their covered genes, as they cover different sets of genes, which could include genes that are easier or harder to predict. We also defined a third type of gene set called
Intersection Genes. This consisted of the set of genes in the dataset that were common to all graphs except RegNet. We chose not to include RegNet as we wanted this set to have a high cardinality and RegNet covers a relatively small number of genes (7,220). The Intersection Genes set had 14,445 and 14,270 genes for TCGA and GTEx respectively.

Table 1: Statistics for each graph and dataset.
Coverage is the percentage of genes in the dataset that are represented as nodes in the graph. Sparsity represents the percentage of missing edges in the graph compared to the fully connected graph. The Covered Genes AUC and the All Genes AUC refer to the average AUC achieved by the graph on only its covered genes and on the entire dataset (after adding uncovered genes with an AUC of 0.5), respectively. The Improvement is computed with respect to the fully connected baseline for both the covered genes and all genes. Per-gene AUC SD refers to the per-gene standard deviation of AUCs across the three trials (averaged over all genes for a given graph and dataset). All uncertainties are computed as the standard deviation across the three trials.
                            FullyConn.  GeneMANIA  RegNet  HNetV2  Hetio   FunCoup  StringDB(all)  Landmark
TCGA
  Coverage (%)                 100         79        35      87      86      82        92            100
  Sparsity (%)                   0          –         –       –       –       –         –              –
  Per-gene AUC SD             0.017      0.010     0.005   0.011   0.013   0.011     0.011          0.013
  All Genes AUC               0.782      0.636     0.590   0.699   0.640   0.702     0.771             –
  All Genes improvement           —     -0.146    -0.193  -0.083  -0.143  -0.081    -0.011             –
  Covered Genes AUC           0.782      0.673     0.753   0.730   0.663   0.749         –             –
  Covered Genes improvement       —     -0.126    -0.056  -0.062  -0.130  -0.045         –             –
GTEx
  Coverage (%)                 100         48        21      52      52      49        55            100
  Sparsity (%)                   0          –         –       –       –       –         –              –
  Per-gene AUC SD             0.036      0.012     0.005   0.012   0.015   0.015     0.013          0.029
  All Genes AUC               0.734      0.599     0.564   0.647   0.599   0.630     0.685             –
  All Genes improvement           —     -0.222    -0.170  -0.212  -0.135  -0.105    -0.050             –
  Covered Genes AUC           0.734      0.707     0.802   0.781   0.693   0.764         –             –
  Covered Genes improvement       —     -0.116    -0.027  -0.037  -0.126  -0.054         –             –

We first visualize how well the different models fit, for the different graphs and genes. The distribution of the AUCs over the set of all genes, averaged across trials, for all the graphs is visualized in Figure 3. For both datasets, there is a considerable spike at 0.5, which is mostly due to the fact that we assign an AUC of 0.5 to the genes not present in the graph and some graphs cover a very small proportion of the genes in the datasets (especially for the GTEx dataset). However, some part of the peak at 0.5 AUC does correspond to covered genes for which the trained model actually achieved an AUC of 0.5, which could be due to either bad convergence or too restrictive feature selection. We also plot
Figure 3: Distribution of AUCs over genes, averaged over 3 trials, using different graphs. Left: TCGA dataset. Right: GTEx. The distribution is computed over all genes in the datasets.

distributions of AUCs of the graphs over their sets of covered genes (which differ between graphs) as supplementary material in Figure A.1.

We report the mean AUCs along with other statistics in Table 1, and also report the per-gene standard deviation of AUC across the three trials to assess the robustness of the evaluation. It is noticeable that, overall, the best fit is not achieved using the fully connected graph. The best fit is obtained using the Landmark graph for both datasets if all the genes in the dataset are considered. Note that the Landmark graph has the highest coverage and lowest sparsity among all graphs but the fully connected one. If we only consider the performance on the set of covered genes, the best performance is achieved using StringDB. There is an anomaly in the case of RegNet, which is consistently the worst graph on the set of all genes for both datasets due to its very low coverage. In general, predictive performance was better on TCGA than GTEx.
Figure 4: Distribution of the All Genes AUC improvement relative to the fully connected graph. Left: TCGA dataset. Right: GTEx. For each gene, the difference in AUC is averaged over 3 trials.

For a given gene g_i, we define its AUC improvement as AUC(neighbours_i) − AUC(all genes), which is the difference between the AUC of the model taking first-degree neighbours as input and the AUC of the model taking all other genes as input. This difference measures to which extent the inclusivity property holds, i.e. the quality of the prior knowledge provided by the graph for gene g_i. The distribution of AUC improvements over all genes is plotted in Figure 4, where each difference has been averaged over three trials. Distributions of
AUC improvements over the set of Covered Genes are provided as supplementary material in Figure A.2.

For graphs such as StringDB and Landmark, most of the mass of the distribution is close to zero or in the positive region, meaning that the equality (Eq. 1) holds for most genes. For those same graphs, the distribution has a heavy tail in the negative region, corresponding to genes with highly negative
AUC improvement, for which the equality (Eq. 1) is not satisfied. Note that StringDB and Landmark have high coverage and lower sparsity than most other graphs. Those failure cases should be the consequence of missing edges in the graphs. To confirm this hypothesis, we plot the distribution of per-gene AUC differences with respect to the number of neighbours of genes in Figure 5. Highly negative AUC differences seem to correspond to genes with a relatively small number of neighbours. Figure 5 also highlights that when the number of neighbours is sufficient, most genes have a positive
AUC improvement: removing redundancy in the features, while keeping relevant information, allowed models to achieve slightly better performance. For other graphs such as Hetionet and GeneMANIA, the large majority of the mass is in the negative region, meaning that the equality (Eq. 1) does not hold for most genes.

Figure 5: Covered Genes AUC improvement with respect to the number of neighbours, for StringDB (all) on the TCGA dataset. Colour indicates density, with yellower being higher. Averaged over 3 trials.

R-n graphs

We would like to assess how much of a graph's internal structure contributes to its performance in our evaluation by comparing it to random graphs. We generate graphs with a fixed number n of randomly sampled neighbours from the set of genes in the dataset and evaluate them on the task. For each target gene in the dataset, we sample n other genes from the dataset and connect them to the target gene node to create an R-n graph. We created 15 such graphs with n varying between 10 and 10,000 and ran three trials, with neighbours randomly resampled for every trial instead of being kept fixed across the three trials.

We plot the improvement in AUC for some of these graphs in Figure A.3 of the supplementary material. As expected, as the number of random neighbours n increased, the feature selection related to the graphs achieved better performance. R-n graphs perform on par with or better than the baseline on average for n greater than 500, meaning that using 500 randomly sampled features actually performs on par with or better than using the entire gene set. The standard deviation across trials of the per-gene AUC is small. There is a bump of well-predicted genes even when n is small, for the GTEx dataset only. One explanation could be that GTEx is mostly composed of healthy cells, which makes the population of cells more homogeneous and predictions easier for some genes, but further investigation is needed to confirm this hypothesis.
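An R-n graph of this kind can be generated in a few lines; a sketch with placeholder gene names:

```python
import random

def random_rn_graph(genes, n, seed=0):
    """For each target gene, sample n other genes as its neighbours."""
    rng = random.Random(seed)
    graph = {}
    for gene in genes:
        others = [g for g in genes if g != gene]  # a gene is never its own neighbour
        graph[gene] = set(rng.sample(others, min(n, len(others))))
    return graph

genes = [f"g{i}" for i in range(100)]
rn = random_rn_graph(genes, n=10)  # an R-10 graph over 100 toy genes
```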
Figure 6 shows the mean AUC over genes as a function of the average number of neighbours in the graph. The mean AUC is computed over all genes in the dataset. Results on the set of genes covered by all graphs are presented as supplementary material in Figure A.4. The R-n graphs are depicted as a black line, which can be thought of as the level of randomness. On the set of all genes, almost all the curated graphs are below the level of randomness, except for HumanNetV2 (GTEx), GeneMANIA (GTEx) and Landmark. Poor performance is in part due to the limited coverage of some of the graphs, in which some genes do not have any neighbours, making the prediction a random guess. On the set of genes covered by all graphs, most curated graphs are above the level of randomness, but only StringDB achieves better performance than the best performing random graph, and even then only by a very small margin.
Figure 6: Mean AUC on all genes as a function of the average number of neighbours in the graph (logarithmic scale). The black line represents the average AUCs achieved with the multiple R-n graphs, and one can think of it as the level of randomness. The orange lines indicate the standard deviation of the number of neighbours for each graph. The standard deviation of AUCs is too small to be plotted.

In order to support our evaluation, we evaluated on different clinical tasks a one-hidden-layer, sparsely connected neural network whose connections are based on the prior knowledge provided by the different curated graphs. The number of hidden units was chosen to be equal to the dimension of the input, and ReLU activation functions were used. The weights of the first layer are masked with Ã = A + Id, where A is the adjacency matrix of the graph and Id the identity matrix. This means that the activation of the first hidden unit is a feature computed from the first gene and its neighbours only. Even if this model does not correspond to state-of-the-art graph neural networks, it allowed us to make a simple comparison between sparsified models and an unbiased (fully connected) one. We trained each model with 5-fold cross-validation for 100 epochs on the set of genes covered by all graphs (except RegNet). We used the Adam optimizer. We only optimized the learning rate for the fully connected model on the (‘PAM50Call_RNAseq’, ‘BRCA’) task.

We evaluated on five tasks taken from the TCGA benchmark (Samiei et al., 2019). These particular tasks were chosen because they have been shown to allow improvement compared to the majority class predictor. Each task was cast as a binary classification (majority class versus all). Test accuracies

Figure 7: Test accuracy on the set of genes covered by all graphs (Intersection Set) along training. The task is to predict the histological type of Esophageal Carcinoma.
The reported accuracy is anaverage over 5 folds of cross-validation. Shaded areas represent the standard deviation across folds.along training are reported in Figure 7 for the prediction of histological type in Esophageal Carcinoma,and as supplementary material in Figure A.5 for the other tasks.Overall, the comparative performance between graphs are consistent with our map of graphs shownin Figure 6. The model without prior knowledge (fully connected) consistently outperforms the othermodels. The relative performance between graphs, both in terms of speed of adaptation and finalaccuracy, seem to be consistent with our analysis. Second best performance is achieved using theSTRINGdb and Landmark graphs. FunCoup and HumanNet follow, before Hetionet and finallyGeneMANIA. The poor performance of GeneMANIA could be related to its extreme sparsity. Notethat for some tasks, some models do not converge well, this is a well known phenomenon observedwhen training sparsely connected neural networks (Frankle and Carbin, 2018; Cohen et al. , 2016).
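The graph-masked first layer described above can be sketched in PyTorch as follows. This is a minimal illustration of the masking idea, not the authors' released code; the class name `GraphMaskedMLP` and the toy adjacency matrix are assumptions.

```python
import torch
import torch.nn as nn

class GraphMaskedMLP(nn.Module):
    """One-hidden-layer network whose first layer is masked by
    A_tilde = A + Id, so each hidden unit only sees one gene and
    its neighbours in the graph."""

    def __init__(self, adjacency: torch.Tensor, n_classes: int = 2):
        super().__init__()
        n_genes = adjacency.shape[0]
        # A_tilde = A + Id keeps each gene connected to itself
        mask = (adjacency + torch.eye(n_genes)).clamp(max=1.0)
        self.register_buffer("mask", mask)
        self.fc1 = nn.Linear(n_genes, n_genes)  # one hidden unit per gene
        self.fc2 = nn.Linear(n_genes, n_classes)

    def forward(self, x):
        # Re-apply the mask at every forward pass, so masked-out
        # weights never contribute and receive no gradient signal
        h = torch.relu(nn.functional.linear(x, self.fc1.weight * self.mask,
                                            self.fc1.bias))
        return self.fc2(h)

# Toy usage: 10 genes with a random symmetric adjacency matrix
torch.manual_seed(0)
adj = (torch.rand(10, 10) < 0.2).float()
adj = ((adj + adj.T) > 0).float().fill_diagonal_(0)
model = GraphMaskedMLP(adj)
logits = model(torch.randn(4, 10))  # batch of 4 samples
print(logits.shape)  # torch.Size([4, 2])
```

Masking inside `forward` (rather than zeroing the weights once at initialization) keeps the connectivity pattern fixed throughout training, since gradients through the masked entries are zero.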
We proposed a property that gene interaction graphs should satisfy in order to provide "good" prior knowledge, as we defined it, in the context of predictions on gene expression data. A Single Gene Inference task was used to test whether the property holds for a given graph. This test provides a measure, for each gene and graph, that reveals whether the graph captures the right dependencies for this gene.

We studied several existing curated graphs and found that large graphs such as StringDB and Landmark allow the equality (Eq. 1) to hold for most genes, while being very sparse. This suggests that using the entire gene set to predict the expression of a gene tends to include mostly uninformative features, as the expression of most genes can be explained with a fraction of the gene set. These results validate the findings of Subramanian et al. (2017) that a lot of information is shared across gene expressions.

We then compared existing curated graphs against randomly generated graphs to assess the usefulness of the prior knowledge they provide. We found that randomly selecting 500 or more genes as neighbours for a target gene can perform on par with or improve over the baseline for most genes in TCGA and GTEx. This means that a random graph R-500 allows the equality (Eq. 1) to hold for most genes, while being quite sparse. These results confirm that the relevant information about the state of the cell is spread across many genes. Future work could explore random graphs that preserve the statistics (e.g. distribution of degrees) of a given graph, in order to allow a more precise comparison between curated graphs and their random counterparts.

We finally validated those results by comparing the performance of models provided with prior knowledge coming from the different curated graphs. Relative performances are consistent with our analysis of the goodness of the prior knowledge provided by the different graphs.

There are several limitations to this pipeline.
We test whether the graph captures the right dependencies in the data, which are not restricted to causal dependencies. For example, consider three genes A, B and C where A downregulates B and B downregulates C (A → B → C), and suppose no noise or randomness is added in the process, i.e., the links between these genes are deterministic. Our pipeline will not be able to distinguish between the true causal graph and any other graph in which A, B and C are connected. Another limitation concerns noise: suppose that there is a true graph and that our data corresponds to noisy observations of the true activation of genes. Second- and higher-degree neighbours can then have a denoising effect and give information on the true level of expression of a gene. Consider the results from Dutil et al. (2018): for the genes they report, taking into account second- and higher-degree neighbours did not yield any improvement compared to only first-degree neighbours. Nonetheless, further investigation would be needed in order to assert that one does not need to take into account the noise of the data to test the goodness of graph-based prior knowledge.

To summarize, the added value of using curated graphs to provide prior knowledge appears to be limited. Nonetheless, curated graphs could be valuable when one is interested in specific subgroups of genes which have been well studied by biologists. This analysis is left for future work.

Funding
This work is partially funded by a grant from the Fonds de Recherche en Santé du Québec and the Institut de valorisation des données (IVADO). This work utilized the supercomputing facilities managed by Mila, NSERC, Compute Canada, and Calcul Québec. We also thank NVIDIA for donating a DGX-1 computer used in this work.
Acknowledgements
We thank Francis Dutil for sharing code. We also thank Yoshua Bengio and Mandana Samiei for their useful feedback.
References
Cawley, G. C. and Talbot, N. L. (2006). Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics, (19), 2348–2355.

Chen, Y., Li, Y., Narayan, R., Subramanian, A., and Xie, X. (2016). Gene expression inference with deep learning. Bioinformatics.

Cohen, J. P., Lo, H. Z., and Ding, W. (2016). RandomOut: Using a convolutional gradient norm to rescue convolutional filters.

Dutil, F., Cohen, J. P., Weiss, M., Derevyanko, G., and Bengio, Y. (2018). Towards Gene Expression Convolutions using Gene Interaction Graphs. In International Conference on Machine Learning (ICML) Workshop on Computational Biology (WCB).

Frankle, J. and Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks.

Gustafsson, M., Hornquist, M., and Lombardi, A. (2005). Constructing and analyzing a large-scale gene-to-gene regulatory network: lasso-constrained inference and biological validation. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), (3), 254–261.

He, Z. and Yu, W. (2010). Stable feature selection for biomarker discovery. Computational Biology and Chemistry, (4), 215–225.

Heimberg, G., Bhatnagar, R., El-Samad, H., and Thomson, M. (2016). Low dimensionality in gene expression data enables the accurate extraction of transcriptional programs from shallow sequencing. Cell Systems, (4), 239–250.

Himmelstein, D. S. and Baranzini, S. E. (2015). Heterogeneous Network Edge Prediction: A Data Integration Approach to Prioritize Disease-Associated Genes. PLOS Computational Biology, (7), e1004259.

Himmelstein, D. S., Lizee, A., Hessler, C., Brueggeman, L., Chen, S. L., Hadley, D., Green, A., Khankhanian, P., and Baranzini, S. E. (2017). Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife.

Hwang, S., Kim, C. Y., Yang, S., Kim, E., Hart, T., Marcotte, E. M., and Lee, I. (2019). HumanNet v2: human gene networks for disease research. Nucleic Acids Research, (D1), D573–D580.

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., and Morishima, K. (2017). KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research.

Kim, P. M. and Tidor, B. (2003). Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Research, (7), 1706–1718.

Lee, G., Rodriguez, C., and Madabhushi, A. (2008). Investigating the efficacy of nonlinear dimensionality reduction schemes in classifying gene and protein expression studies. IEEE/ACM Transactions on Computational Biology and Bioinformatics, (3), 368–384.

Lee, I., Blom, U. M., Wang, P. I., Shim, J. E., and Marcotte, E. M. (2011). Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Research, (7), 1109–1121.

Li, B. and Dewey, C. N. (2011). RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 323.

Li, C. and Li, H. (2008). Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics.

Liu, Z.-P., Wu, C., Miao, H., and Wu, H. (2015). RegNetwork: an integrated database of transcriptional and post-transcriptional regulatory networks in human and mouse. Database: The Journal of Biological Databases and Curation.

Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., Hasz, R., Walters, G., Garcia, F., Young, N., and others (2013). The genotype-tissue expression (GTEx) project. Nature Genetics, (6), 580–585.

Ma, S. and Huang, J. (2009). Regularized gene selection in cancer microarray meta-analysis. BMC Bioinformatics, (1), 1.

Min, W., Liu, J., and Zhang, S. (2016). Network-regularized Sparse Logistic Regression Models for Clinical Risk Prediction and Biomarker Discovery. IEEE/ACM Transactions on Computational Biology and Bioinformatics.

Min, W., Liu, J., and Zhang, S. (2018). Network-regularized sparse logistic regression models for clinical risk prediction and biomarker discovery. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), (3), 944–953.

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, (7), 621–628.

Ogris, C., Guala, D., Kaduk, M., and Sonnhammer, E. L. L. (2018). FunCoup 4: new species, data, and visualization. Nucleic Acids Research, (D1), D601–D607.

Rhee, S., Seo, S., and Kim, S. (2018). Hybrid Approach of Relation Network and Localized Graph Convolutional Filtering for Breast Cancer Subtype Classification. In International Joint Conference on Artificial Intelligence.

Saelens, W., Cannoodt, R., and Saeys, Y. (2018). A comprehensive evaluation of module detection methods for gene expression data. Nature Communications, (1), 1090.

Samiei, M., Würfl, T., Deleu, T., Weiss, M., Dutil, F., Fevens, T., Boucher, G., Lemieux, S., and Cohen, J. P. (2019). The TCGA meta-dataset clinical benchmark.

Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2013). A sparse-group lasso. Journal of Computational and Graphical Statistics, (2), 231–245.

Subramanian, A. et al. (2017). A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell, (6), 1437–1452.e17.

Szklarczyk, D., Gable, A. L., Lyon, D., Junge, A., Wyder, S., Huerta-Cepas, J., Simonovic, M., Doncheva, N. T., Morris, J. H., Bork, P., Jensen, L. J., and Mering, C. (2019). STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, (D1), D607–D613.

Warde-Farley, D., Donaldson, S. L., Comes, O., Zuberi, K., Badrawi, R., Chao, P., Franz, M., Grouios, C., Kazi, F., Lopes, C. T., Maitland, A., Mostafavi, S., Montojo, J., Shao, Q., Wright, G., Bader, G. D., and Morris, Q. (2010). The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research.

Weinstein, J. N., Collisson, E. A., Mills, G. B., Shaw, K. R. M., Ozenberger, B. A., Ellrott, K., Shmulevich, I., Sander, C., and Stuart, J. M. (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics.

Yeung, K. Y. and Ruzzo, W. L. (2001). Principal component analysis for clustering gene expression data. Bioinformatics, (9), 763–774.

Zhang, W., Chien, J., Yong, J., and Kuang, R. (2017). Network-based machine learning and graph theory algorithms for precision oncology. npj Precision Oncology, (1), 25.

Zuber, V. and Strimmer, K. (2009). Gene ranking and biomarker discovery under correlation. Bioinformatics, (20), 2700–2707.

Appendix: Supplementary figures
Figure A.1: Distribution of Covered AUCs, averaged over 3 trials, using different graphs.
Top: TCGA dataset. Bottom: GTEx. The distribution is computed over genes covered by all graphs except RegNet (14442 genes for TCGA and 14270 for GTEx).
Figure A.2: Distribution of the improvement relative to the fully-connected graph on the set of genes covered by all graphs.
Top: TCGA dataset. Intersection set composed of 14442 genes. Bottom: GTEx. Intersection set composed of 14270 genes. For each gene, the difference in AUC is averaged over 3 trials.