Incorporating network based protein complex discovery into automated model construction
Paul Scherer, Maja Trȩbacz, Nikola Simidjievski, Zohreh Shams, Helena Andres Terre, Pietro Liò, Mateja Jamnik
Department of Computer Science and Technology, University of Cambridge, UK
Abstract
We propose a method for gene expression based analysis of cancer phenotypes incorporating network biology knowledge through unsupervised construction of computational graphs. The structural construction of the computational graphs is driven by the use of topological clustering algorithms on protein-protein networks which incorporate inductive biases stemming from network biology research in protein complex discovery. This structurally constrains the hypothesis space over the possible computational graph factorisation whose parameters can then be learned through supervised or unsupervised task settings. The sparse construction of the computational graph enables differential protein complex activity analysis whilst also interpreting the individual contributions of genes/proteins involved in each individual protein complex. In our experiments analysing a variety of cancer phenotypes, we show that the proposed methods outperform SVM, Fully-Connected MLP, and Randomly-Connected MLPs in all tasks. Our work introduces a scalable method for incorporating large interaction networks as prior knowledge to drive the construction of powerful computational models amenable to introspective study.
Preprint. Under review. (arXiv category: q-bio.MN)

Gene expression data is commonly used in research at the intersection of cancer research and machine learning, as it is seen as a crucial component towards understanding the molecular status of tumour tissue. In its most common form, an observation of gene expression data is presented as an n-dimensional feature vector of continuous values, where each element of the vector corresponds to the expression level of a particular gene in the sample. Classically, this representation is directly used to learn a prediction model for tasks such as cancer subtype classification, or as part of a larger system integrating data from multiple modalities [1, 2].

The high dimensionality and noisiness of gene expression data pose significant problems for learning algorithms: models overfit, learn noise, and fail to capture any biologically relevant information. As a result, practitioners commonly aim to constrain model complexity by incorporating various approaches for regularisation, including dimensionality reduction and the use of prior biological knowledge to inductively bias models towards learning representations with favourable characteristics [3, 1, 4, 5, 6, 7]. A part of this research on using prior knowledge focuses on the incorporation of gene interaction networks as external priors into the predictive model to guide the learning process. The overall goal of applying network-based analysis to personal genomic profiles is to identify network modules that are both informative of cancer mechanisms and predictive of cancer phenotypes. A survey of such approaches is given in Zhang et al. [8].

In this work we utilise topological clustering algorithms, chiefly used for the identification of protein complexes and functional modules within PPI networks, to define the structure of computational graphs in an unsupervised manner.
This deterministic procedure produces sparse computation graphs which relate genes to named protein complexes, structurally parameterising individual functions for the "activity" of each complex based on an input gene expression profile. Further connecting the complex activities to cancer phenotypes defines a supervised predictive model which relates the activity patterns of higher level functional modules (protein complexes) to cancer phenotypes. Our approach effectively constrains the hypothesis space of models via structural biases obtained through unsupervised analyses of network biology entities. Figure 2 in Appendix A features a simplified diagram of this process over an input genomic profile dataset and a toy interaction network used to construct the topology of the computational graph.

The proposed method, which we will call PComplexNet, incorporates prior biological knowledge imbued within the structure of supplied PPI networks and protein complexes discovered via topological clustering algorithms to construct a bipartite graph between genes/proteins and functional modules. This bipartite graph serves as the structural foundation of the computational graphs that will be further augmented into predictive models for cancer phenotype. Crucially, this means that the structure of the output computational graphs is defined in a purely unsupervised and deterministic manner over external curated knowledge.

The procedure for constructing the computational graphs is best described in three stages: (i) obtaining a study specific subgraph of the PPI network, (ii) discovering protein complexes that serve as higher level features, and (iii) constructing the factor and computational graphs.
Let us assume an input gene expression dataset $X \in \mathbb{R}^{m \times k}$ describing $m$ patient observations with $k$-dimensional vectors of gene expression values. Furthermore, let us assume an external PPI network $G_{PPI} = (V_{PPI}, E_{PPI})$, such as the STRING-DB 9606 Homo sapiens PPI network [9]. For our purpose, this PPI network is an unweighted graph with nodes ($V_{PPI}$) labelled by the names of proteins, and no additional node or edge features. We induce a subgraph of the input network, $G_S \subseteq G_{PPI}$. The nodes of $G_S$ are the intersection of the $k$ genes in the input gene expression dataset, $X_{genes}$, and the genes in the PPI network; in other words, $V_S = X_{genes} \cap V_{PPI}$. The induced subgraph $G_S = (V_S, E_S)$ is the graph whose vertex set is $V_S$ and whose edge set $E_S$ consists of all of the edges in $E_{PPI}$ that have both endpoints in $V_S$. This action is illustrated in the top row of Figure 2. We call $G_S$ our study PPI network, since it is the "cut out" of the external PPI network relevant to our study.

Given the induced study subnetwork, we use a topological clustering algorithm $C$, such as DPCLUS [10], to discover protein complexes within the study PPI network $G_S$. The aim of the clustering algorithm is to discover protein complexes represented as a set of induced subgraphs $C(G_S) = \{c_1, c_2, \ldots, c_l\}$, where $l$ is the number of complexes discovered by $C$. The number of protein complexes found, $l$, is not chosen by the user, but is determined by applying the clustering algorithm $C$ to the input study network.

It is worth noting that we specifically chose clustering algorithms that do not partition the graph. In other words, a single protein may be part of multiple complexes. This reflects the fact that proteins may be involved in several biological processes and complexes. Note also that not all proteins in $G_S$ will necessarily be assigned to clusters by $C$. We do not arbitrarily force all genes to be part of our constructed models, and this acts as a form of feature selection upon the input $X$ by $C(G_S)$.

The output of the clustering algorithm, $C(G_S) = \{c_1, c_2, \ldots, c_l\}$, enables the construction of a bipartite factor graph.
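As an illustration of stages (i) and (ii), the following minimal Python sketch induces the study network $G_S$ from a toy PPI network and shows how overlapping, non-exhaustive clusters act as a feature selector. The function names (`induce_study_network`, `selected_genes`) and the toy data are hypothetical and are not taken from the authors' implementation. The clustering step itself (e.g. DPCLUS) is not implemented here; its output is simply assumed.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def induce_study_network(
    expr_genes: Iterable[str],
    ppi_nodes: Iterable[str],
    ppi_edges: Iterable[Edge],
) -> Tuple[Set[str], Set[FrozenSet[str]]]:
    """Return (V_S, E_S) for the induced study network G_S.

    V_S is the intersection of the expression genes and the PPI nodes;
    E_S keeps every PPI edge whose two endpoints both lie in V_S.
    """
    v_s = set(expr_genes) & set(ppi_nodes)
    e_s = {frozenset(e) for e in ppi_edges if e[0] in v_s and e[1] in v_s}
    return v_s, e_s


def selected_genes(complexes: List[Set[str]]) -> Set[str]:
    """Genes that belong to at least one discovered complex."""
    return set().union(*complexes)


# Toy data (hypothetical): gene Z is measured but absent from the PPI network,
# protein E is in the PPI network but not measured.
expr_genes = ["A", "B", "C", "D", "Z"]
ppi_nodes = ["A", "B", "C", "D", "E"]
ppi_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]

v_s, e_s = induce_study_network(expr_genes, ppi_nodes, ppi_edges)

# Assumed output of a non-partitioning clustering algorithm on G_S:
# B belongs to two complexes, D belongs to none.
complexes = [{"A", "B"}, {"B", "C"}]
kept = selected_genes(complexes)
dropped = v_s - kept  # genes in G_S that no complex uses
```

In this toy case the study network drops `Z` and `E`, and the clustering output further drops `D`, mirroring the two-step reduction of input features described in the text.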
In this factor graph, each of the protein complexes is assigned a uniquely labelled node $c_i$, and each protein within the set of proteins involved in one or more complexes is also given a node labelled by its name. Directed edges link proteins to the complexes $c_i$ they are a member of. This construction gives the factorisation of a parametric function $f_{c_i} : c_i \to \mathbb{R}$ computed from the proteins involved in $c_i$. The function $f_{c_i}(\cdot)$ can be set by the practitioner or learned through a neural network.

Figure 1: A side-by-side comparison of a) a typical Fully-Connected MLP and b) the factor graph produced through PComplexNet. The factor graphs produced through PComplexNet are considerably sparser and incorporate biological knowledge from the PPI network and the protein complexes discovered within it. The input features used in the model are cut down in two steps. The first set of genes is removed by the extraction of the study network $G_S$. The second set of features is removed through the clustering process $C(G_S)$.

The parameterisation $f_{c_i} : c_i \to \mathbb{R}$ in our proposal is in stark contrast to the arbitrarily chosen hidden-state activations $h_i : \mathbb{R}^k \to \mathbb{R}$ found in conventional applications of fully-connected multi-layer perceptrons. Firstly, each $c_i$ denotes a protein complex activity, a biologically relevant structure modelled through the incorporation of an external PPI network and a topological clustering algorithm, instead of an arbitrarily chosen hidden state node. Secondly, the proteins that are members of $c_i$, and only those proteins, affect its activity level $f_{c_i}$, instead of all input features. This is a strong and explicit inductive bias if $f_{c_i}$ is learned through a neural network.

We construct computational graphs for cancer phenotype prediction by further augmenting the current gene/protein to protein complex factor graph with complete connections from the protein complexes $c_i$ to the target nodes obtained when encoding the target observations $Y$.
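One way to realise this construction is a masked linear layer followed by a dense output layer. The sketch below is a minimal, dependency-free illustration under stated assumptions: each $f_{c_i}$ is taken to be a single `tanh` unit over the member genes of $c_i$, and the phenotype scores are a softmax over a fully connected layer on the complex activities. The paper does not specify this exact functional form, and all function names (`build_mask`, `complex_activities`, `predict`) are hypothetical.

```python
import math
from typing import List, Sequence, Set


def build_mask(genes: Sequence[str], complexes: Sequence[Set[str]]) -> List[List[float]]:
    """mask[i][j] is 1.0 if gene j is a member of complex i, else 0.0."""
    return [[1.0 if g in c else 0.0 for g in genes] for c in complexes]


def complex_activities(
    x: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    mask: Sequence[Sequence[float]],
) -> List[float]:
    """Assumed form of f_{c_i}: tanh of a weighted sum over member genes only."""
    activities = []
    for w_row, m_row, b in zip(weights, mask, biases):
        pre = sum(m * w * xi for m, w, xi in zip(m_row, w_row, x)) + b
        activities.append(math.tanh(pre))
    return activities


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, weights, biases, mask, out_weights, out_biases) -> List[float]:
    """Genes -> (masked) complex activities -> (dense) phenotype probabilities."""
    acts = complex_activities(x, weights, biases, mask)
    logits = [
        sum(w * a for w, a in zip(row, acts)) + b
        for row, b in zip(out_weights, out_biases)
    ]
    return softmax(logits)


# Toy example (hypothetical values): three genes, two overlapping complexes,
# two phenotype classes.
genes = ["A", "B", "C"]
complexes = [{"A", "B"}, {"B", "C"}]
mask = build_mask(genes, complexes)
weights = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
biases = [0.0, 0.0]
out_weights = [[1.0, -1.0], [-1.0, 1.0]]
out_biases = [0.0, 0.0]
probs = predict([1.0, 2.0, 3.0], weights, biases, mask, out_weights, out_biases)
```

Because the mask zeroes every non-member weight, changing the expression of gene `C` cannot change the activity of the first complex, which is the structural constraint the factor graph encodes. In a trainable implementation the same mask would also be applied to the gradients, or the layer would be stored sparsely.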
As such, each $f_{c_i} : c_i \to \mathbb{R}$ computing the individual protein complex "activity" is learned by minimising the global cross-entropy loss between the phenotypes predicted from the gene expression values and the target phenotypes.

In order to evaluate the proposed method for model construction, we used publicly available gene expression data from the METABRIC Breast Cancer Consortium (METABRIC) [11] and The Cancer Genome Atlas Head-Neck Squamous Cell Carcinoma (TCGA-HNSC) [12, 13]. Using the former, we evaluate on three classification tasks: Distant Relapse (DR, binary classification), PAM50 breast tumour cancer subtypes (5-class classification), and Integrative Cluster (IC10) subtypes (11-class classification). Using the latter, we evaluate on two classification tasks: tumour grade (4-class classification) and 2 year relapse free survival (RFS, binary classification). All classification tasks were evaluated by mean percentage accuracy over a stratified 5-fold cross-validation. The specific details about the datasets, experimental setup, and methods are given in Appendix B.

Amongst the considered methods are: a support vector machine with RBF kernel (SVM), a Fully-Connected (FC) two layer neural network with 1600 hidden layer nodes (this number was chosen to closely match the number of protein complexes used in PComplexNet + DPCLUS, the best performing of the proposed methods), a Randomly-Connected (RC) MLP, and our proposed model constructor coupled with a variety of topological clustering algorithms. Each of our models is referred to as PComplexNet + $C$, where $C$ refers to one of the MCODE [14], COACH [15], IPCA [16], or DPCLUS [10] clustering algorithms. The hyperparameters of the clustering algorithms were set to their default values.

Table 1: Mean percentage accuracy over stratified 5-fold cross-validation on the METABRIC and TCGA-HNSC classification tasks.

| Method | METABRIC DR | METABRIC PAM50 | METABRIC IC10 | TCGA-HNSC Tumour Grade | TCGA-HNSC 2 Year RFS |
|---|---|---|---|---|---|
| SVM | 59.39 ± 9.23 | 75.84 ± 2.07 | 68.28 ± 2.83 | 56.61 ± 3.67 | 56.92 ± 5.32 |
| FC MLP | 64.24 ± 3.86 | 76.03 ± 2.62 | 66.11 ± 3.12 | 59.03 ± 1.49 | 55.96 ± 5.52 |
| RC MLP | 65.30 ± 1.04 | 75.87 ± 1.60 | 67.76 ± 3.14 | 54.67 ± 2.88 | 57.07 ± 1.33 |
| PComplexNet + MCODE | 66.06 ± 1.92 | 77.10 ± 1.61 | 67.92 ± 3.56 | 56.42 ± 2.49 | 58.65 ± 7.91 |
| PComplexNet + COACH | 57.27 ± 3.14 | 76.89 ± 2.89 | 71.56 ± 2.01 | 57.84 ± 2.49 | 56.15 ± 2.48 |
| PComplexNet + IPCA | 65.05 ± 3.42 | 77.71 ± 2.27 | 68.88 ± 5.05 | 55.82 ± 2.34 | 55.58 ± 2.94 |
| PComplexNet + DPCLUS | | | | | |

The main comparative results are summarised in Table 1 for the METABRIC and TCGA-HNSC datasets. The results show that all variations of the computational graphs produced by PComplexNet (regardless of the clustering algorithm) outperform both the SVM and Fully-Connected MLP baselines. More specifically, PComplexNet + DPCLUS considerably outperforms the baselines on all five classification tasks, making an especially substantial gain in IC10 subtype prediction in the case of METABRIC.

We attribute these performance gains of PComplexNet over Fully-Connected MLPs to two related advantages. Firstly, PComplexNet's sparser model complexity allows more "weight" to be assigned to each of the input signals used. Similarly, the sparse connectivity also helps generalisability in a similar way to the dropout regularisation method. In contrast to dropout, however, the connectivity is set, explicit, and realised through the incorporation of prior knowledge rather than being random and ephemeral. This brings us to the second advantage of PComplexNet: the structure of the computational graphs, and thus the representations, explicitly incorporate biological knowledge of protein complex membership as intermediate states.
In other words, they are not "hidden" nodes. The learned activities of the protein complexes are explicitly factorised over the gene expression measurements of the genes/proteins that are members of the complex.

To show that PComplexNet benefits from both of these advantages, and not only from the first advantage of regularisation via sparse connections, we show that PComplexNet + DPCLUS also outperforms computational graphs constructed through a random process (RC MLP).

The differing performance across choices of clustering algorithm $C(\cdot)$ reflects the different assumptions made by researchers on what topological structures within $G_S$ contain protein complexes. MCODE and DPCLUS impose stricter rules on complex candidates, yielding fewer, smaller, and more tightly knit clusters than either COACH or IPCA. This may be interpreted as these two methods constraining the hypothesis space more and incorporating "more" expert knowledge which is helpful to the classification tasks. Naturally, PComplexNet is agnostic to the choice of $C(\cdot)$, therefore various combinations or sets of complexes may be explored in further work.

We presented PComplexNet, a scalable unsupervised approach to incorporating biological knowledge embedded in the structure of PPI networks for automated construction of computational graphs for genome analysis. PComplexNet has several distinguishing properties. First, it provides a biologically relevant mechanism for model regularisation, resulting in structurally constrained models that yield better predictive performance. Second, PComplexNet is scalable and readily applicable to other genomic data analysis tasks. For example, the computational graphs can be seamlessly incorporated into larger integrative frameworks handling multiple modalities, such as the integrative variational auto-encoders [1]. Finally, there is no arbitrary decision making on the number of hidden nodes or their biological relevance as in standard MLPs.
Each node within our computational graphs is either a gene, a phenotype, or a protein complex. The structure describes a knowledge-directed factorisation of the parametric function for the activity of a protein complex based on the expression levels of its constituent genes/proteins. This makes the model more amenable to introspective study of the individual contributions of its entities and of its patterns as a whole.

References

[1] Nikola Simidjievski, Cristian Bodnar, Ifrah Tariq, Paul Scherer, Helena Andres Terre, Zohreh Shams, Mateja Jamnik, and Pietro Liò. Variational autoencoders for cancer data integration: design principles and computational practice. Frontiers in Genetics, 10:1205, 2019.
[2] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature Medicine, 25(1):24–29, 2019.
[3] Francis Dutil, Joseph Paul Cohen, Martin Weiss, Georgy Derevyanko, and Yoshua Bengio. Towards gene expression convolutions using gene interaction graphs. In International Conference on Machine Learning (ICML) Workshop on Computational Biology (WCB), 2018.
[4] Noah Simon, Jerome Friedman, Trevor Hastie, and Rob Tibshirani. A sparse-group lasso. Journal of Computational and Graphical Statistics, 2013.
[5] Mika Gustafsson, Michael Hornquist, and Anna Lombardi. Constructing and analyzing a large-scale gene-to-gene regulatory network lasso-constrained inference and biological validation. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(3):254–261, 2005.
[6] Gavin C. Cawley and Nicola L. C. Talbot. Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics, 22(19):2348–2355, 2006.
[7] Wenwen Min, Juan Liu, and Shihua Zhang. Network-regularized sparse logistic regression models for clinical risk prediction and biomarker discovery. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 15(3):944–953, 2018.
[8] Wei Zhang, Jeremy Chien, Jeongsik Yong, and Rui Kuang. Network-based machine learning and graph theory algorithms for precision oncology. npj Precision Oncology, 1(1):25, 2017.
[9] Damian Szklarczyk, Annika L. Gable, David Lyon, Alexander Junge, S. Wyder, Jaime Huerta-Cepas, M. Simonovic, N. Doncheva, J. Morris, P. Bork, L. Jensen, and C. V. Mering. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47:D607–D613, 2019.
[10] Md Altaf-Ul-Amin, Yoko Shinbo, Kenji Mihara, Ken Kurokawa, and Shigehiko Kanaya. Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics, 7:207, 2006.
[11] Christina Curtis, Sohrab P. Shah, Suet-Feung Chin, Gulisa Turashvili, Oscar M. Rueda, Mark J. Dunning, Doug Speed, Andy G. Lynch, Shamith Samarajiwa, Yinyin Yuan, et al. The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403):346–352, 2012.
[12] Michael C. Rendleman, John M. Buatti, Terry A. Braun, Brian J. Smith, Chibuzo Nwakama, Reinhard R. Beichel, Bart Brown, and Thomas L. Casavant. Machine learning with the TCGA-HNSC dataset: improving usability by addressing inconsistency, sparsity, and high-dimensionality. BMC Bioinformatics, 20(1):339, 2019.
[13] Cancer Genome Atlas Network. Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature, 517(7536):576–582, 2015.
[14] Gary D. Bader and Christopher W. V. Hogue. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4:2, 2003.
[15] Min Wu, Xiaoli Li, Chee-Keong Kwoh, and See-Kiong Ng. A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics, 10:169, 2009.
[16] Min Li, Jian-er Chen, Jian-xin Wang, Bin Hu, and Gang Chen. Modifying the DPClus algorithm for identifying protein complexes based on new topological structures. BMC Bioinformatics, 9(1):398, 2008.
[17] Aleix Prat, Joel S. Parker, Olga Karginova, Cheng Fan, Chad Livasy, Jason I. Herschkowitz, Xiaping He, and Charles M. Perou. Phenotypic and molecular characterization of the claudin-low intrinsic subtype of breast cancer. Breast Cancer Research, 12(5):R68, 2010.
Diagram of PComplexNet constructions
[Figure 2 diagram. Inputs: external PPI network; input gene expression data (observations × genes). Stage 1: induced subnetwork over the common genes/proteins of the external PPI network and the gene expression data. Stage 2: set of protein complexes and functional modules found by the clustering algorithm on the study network. Stage 3: construction of the computational graph for gene expression data based on the functional modules discovered over the PPI network; each complex activity function can be learned or set individually.]

Figure 2: An overview of our procedure for incorporating PPI network based protein complex discovery and constructing computational graphs for gene expression analysis. Each row corresponds to a distinct stage of the procedure detailed in Section 2.
Data and experimental setup