Archive | 2021
Topological Analysis of Interaction Patterns in Cancer-Specific Gene Regulatory Network: Persistent Homology Approach
Abstract
In this study, we investigated cancer cellular networks in the context of gene interactions and their associated patterns in order to recognize the structural features underlying this disease. We aim to propose that the quest of understanding cancer takes us beyond pairwise interactions between genes to a higher-order construction. We characterize the most prominent network deviations in the gene interaction patterns between cancer and normal samples that contribute to the complexity of this disease. What we hope is that through understanding these interaction patterns we will notice a deeper structure in the cancer network. This study uncovers the significant deviations that topological features in cancerous cells show from the healthy one, where the last stage of filtration confirms the importance of 1-dimensional holes (topological loops) in cancerous cells and 2-dimensional holes (topological voids) in healthy cells. In the small threshold region, the drop in the number of connected components of the cancer network, along with the rise in the number of loops and voids, all occurring at some smaller weight values compared to the normal case, reveals the cancerous network tendency to certain pathways. Introduction Cancer is one of the most common human genetic diseases characterized by cellular over-proliferation1–3. Through the gene expression process, genetic code modulates biological functions and associated molecular pathways. The subsequent cellular phenotype is modulated by a dynamic network of interactions among genes. Perturbations in these interactions affect the overall manifestation of genetically driven diseases such as cancer. Genes and their dynamic interactions can be modeled by complex networks represented by nodes and links4. In network systems, each node is considered as a dynamic entity, evolving under the influence of others5–9. Systems of interacting units consist of links having positive, negative, or zero weight and they together develop a weighted signed network, called Gene Regulatory Network (GRN)10–14. GRNs can be constructed by maximum entropy models, analyzed by balance theory approaches15 and topological methods16. Moreover, responses to driving forces on the structure formation of these networks cause the development of new features and subsequently lead to the identification of unique patterns in the observational data. These patterns can arise from non-trivial connections that go beyond classical pairwise interactions, leading to a higher-order construction17. These constructions can be described by simplices of different dimensions and hence, can be studied in the framework of Balance Theory and Topological Data Analysis (TDA). From TDA, we employ the Persistent Homology (PH) analysis tool, which is based on algebraic topology and has been applied to problems in a variety of fields such as network science, physics, chemistry, biology, and medicine18–27, 27–32. PH has been previously used to study protein-protein interaction networks to inform cancer therapy by determining the correlation between Betti numbers and the survival of cancer patients33. In order to study states of balanced and imbalanced cancer networks, we previously modeled GRNs by groups of three interacting genes, forming triangles (triads) of interactions15. The resulting signed weighted network analysis in the context of Balance Theory showed significant differences between cancer and healthy cases of GRNs in the number of characteristics such as energy, number, and distribution of imbalanced triangles. This paper aims to study the higher-order representation of gene regularity interaction networks derived from cancer and normal samples. Using PH, we address theoretical concepts using empirical data and report network features of cancer samples compared to normal counterparts. Finally, we propose PH as unsupervised network analysis to study human diseases such as cancer. Network Construction from Real Data and the Result of Balance Theory Analysis of the Interaction Network Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product which leads to the production of protein as the final functional product. Cells go through a wide range of mechanisms known as Gene regulation to increase or decrease the production of specific gene products. Gene expression data is large-scale measurements of the degrees of freedom of a biological system such as a cell. In the language of statistical physics, these describe the micro-states of a cell. A gene regulatory network is a complex network34 which its nodes represent the genes, and its links between them represent the interactive couplings between genes which can be used to predict the global properties of the cells. We used mRNA (expression level) data of 20532 genes in the case of Breast Cancer (BRCA: Breast invasive carcinoma) from The Cancer Genome Atlas (TCGA)35, 36. Since RPKM (Reads Per Kilobase transcript per Million reads) puts together the ideas of normalizing by sample and by gene, we used the RPKM normalized data to find the correlation between the expression levels of the genes. Due to computational purposes, we only kept the top 483 most variable genes for all analyses by calculating for each gene the variance of its expression level over its samples. This cohort consisted of two sets of 114 healthy and 764 cancer samples. We constructed a pairwise correlation matrix37, 38 from our data-set based on pairwise gene expressions in the obtained data-set. To find the regulatory connections between genes, we needed a statistical description of the data in terms of suitable observables and infer39, 40 the underlying regulatory connections. Therefore, we restricted ourselves to an undirected pairwise maximum-entropy probability model with terms up to second order41–43, which we derive for continuous, real-valued variables. This can be considered as a problem in inverse Statistical Physics44, 45 where we want to infer parameters of a model based on observations, instead of calculating observables on the basis of model parameters. We applied the following model with pairwise couplings P({Si}) = 1 Z exp ∑ i< j Ji j Si S j (1) where Si represents the expression level of gene i as a continuous real-valued variable, and interaction matrix Ji j, describes the strength of the net interaction between two genes. Z is the so-called partition function, for normalizing the model. The corresponding Hamiltonian (energy function) for this Boltzmann distribution function is then H =−∑i< j Ji j Si S j. Model parameters can be found by satisfying these conditions through the use of Lagrange multipliers; i) Our model should give the same first and second moments as we measure from the data and ii) it must maximize the Gibbs-Shannon entropy function defined as S[P] =−∑P({Si}) ln(P({Si})). The obtained model is a multivariate Gaussian distribution of the form: P(S;〈S〉,C) = exp [ − 1 2 (S−〈S〉)TC−1(S−〈S〉) ] (2π)L/2 det(C) , (2) where L is the number of genes in the distribution and the couplings can be inferred simply by inverting the matrix of variances and covariances of expression levels Ji j =−C−1 i j . This approach is also linked to the concept of partial correlations in statistics46, 47 such that the inverse of the covariance matrix, C−1, also known as precision matrix, offers information about the partial correlations of variables. Elements of the proposed interaction matrix J, represent pairwise interaction between genes in the proposed model, where the weight of the link i− j, represented by Ji j denote the strength of the interaction between gene i and gene j. Furthermore, genetic interaction (GI) between two genes can be inferred from the sign of their interactions, indicating the way they may affect each other’s functions. Positive and negative interactions on the foundation of the constructed network imply gene expression modulation within the network. Therefore, we expect J to be a sparse matrix since each gene interacts only with a couple of other genes. Inverting a large covariance matrix computationally, however, yields to a matrix which almost none of its elements are zero. To keep this at bay, the inverse of the covariance matrix has been obtained by means of the Graphical Lasso (GLasso) algorithm48. GLasso is generally a sparse penalized maximum likelihood estimator for the concentration or inverse of covariance matrix of a multivariate elliptical distribution. When dealing with a multivariate Gaussian distribution with limited observations (lack of enough samples)49, 50, GLasso yields a sparse network (−C−1) while preserving the global features of the network51. In a network analysis, simple thresholding methods can be misleading because removing weak ties may results in the fragmentation of the network; A pair of genes may be weakly connected, while that tie plays a significant role in the structure of the network. On the other hand, removing a strong connection between insignificant or isolated pair of nodes may not destroy the global features of the network, G-Lasso is wary of such issues.