[PDF] Discovering novel cancer bio-markers in acquired lapatinib resistance using Bayesian methods

Abstract

Genes/Proteins do not work alone within our body, rather as a group they perform certain activities indicated as pathways. Signalling transduction pathways (STPs) are some of the important pathways that transmit biological signals from protein-to-protein controlling several cellular activities. However, many diseases such as cancer target some of these signalling pathways for their growth and malignance, but demystifying their underlying mechanisms are a very complicated tasks. In this study, we use a fully Bayesian approach to develop methodologies in discovering novel driver bio-markers in aberrant STPs given two-conditional high-throughput gene expression data. This project, namely PathTurbEr (Pathway Perturbation Driver), is applied on a global gene expression dataset derived from the lapatinib (an EGFR/HER dual inhibitor) sensitive and resistant samples from breast cancer cell lines (SKBR3). Differential expression analysis revealed 512 differentially expressed genes (DEGs) and their signalling pathway enrichment analysis revealed 22 singalling pathways as aberrated including PI3K-AKT, Hippo, Chemokine, and TGF-beta singalling pathway as highly dysregulated in lapatinib resistance. Next, we model the aberrant activities in TGF-beta STP as a causal Bayesian network (BN) from given observational datasets using three Markov Chain Monte Carlo (MCMC) sampling methods, i.e. Neighbourhood sampler (NS) and Hit-and-Run (HAR) sampler, which has already proven to have more robust inference with lower chances of getting stuck at local optima and faster convergence compared to other state-of-art methods. Next, we examined the structural features of the optimal BN as a statistical process that generates the global structure using, p 1 -model, a special class of Exponential Random Graph Models (ERGMs) and MCMC methods for their hyper-parameter sampling....

Full PDF

DD ISCOVERING NOVEL CANCER BIO - MARKERS IN ACQUIREDLAPATINIB RESISTANCE USING B AYESIAN METHODS

A P

REPRINT

AKM Azad iThree InstituteFaculty of ScienceUniversity of Technology Sydney, Australia [email protected]

Salem Alyami ∗ Department of Mathematics & StatisticsImam Mohammad Ibn Saud Islamic UniversityRiyadh, Saudi Arabia [email protected]

December 2, 2020 A BSTRACT

Genes/Proteins doesn’t work alone within our body, rather as a group they perform certain activitiesindicated as pathways. Signalling transduction pathways (STPs) are some of the important path-ways that transmits biological signals from protein-to-protein controlling several cellular activities.However, many diseases such as cancer target some of these signalling pathways for their growthand malignance, but demystifying their underlying mechanisms are very complicated tasks. In thisstudy, we use a fully Bayesian approaches to develop methodologies in discovering novel driverbio-markers in aberrant STPs given two-conditional high-throughput gene expression data. Thisproject, namely PathTurbEr (Pathway Perturbation Driver), is applied on a global gene expressiondataset derived from the lapatinib (an EGFR/HER dual inhibitor) sensitive and resistant samplesfrom breast cancer cell-lines (SKBR3). Differential expression analysis revealed 512 differentiallyexpressed genes (DEGs) and their signalling pathway enrichment analysis revealed 22 singallingpathways as aberrated including PI3K-AKT, Hippo, Chemokine, and TGF- β singalling pathway ashighly dys-regulated in lapatinib resistance. Next, we model the aberrant activities in TGF- β STPas a causal Bayesian network (BN) from given observational datasets using three Markov ChainMonte Carlo (MCMC) sampling methods, i.e. Neighbourhood sampler (NS) and Hit-and-Run (HAR)sampler, which has already proven to have more robust inference with lower chances of getting stuckat local optima and faster convergence compared to other state-of-arts methods. Next, we examinedthe structural features of the optimal BN as a statistical process that generate the global structureusing, p -model, a special class of Exponential Random Graph Models (ERGMs) and MCMC meth-ods for their hyper-parameter sampling. This step will enable us to detect key players that wouldsupposedly drive the aberration within the chosen BN structure of STP, which yielded 31, 33, and22 perturbation driver genes out of 79 constituent genes of three STP models of TGF- β signalling.Functional enrichment with GO terms of these driver genes suggested their signiﬁcant associationswith breast cancer progression/resistance. The full code base for PathTurbEr is available in here: https://github.com/Akmazad/PathTurbEr/ . K eywords Signal Transduction Pathways · MCMC Methods · Neighbourhood Sampler · Hit-and-Run Sampler · Metropolis-Hastings Sampler · Exponential Random Graph Models

Signal transduction pathways (STPs) involve collections of signalling proteins, transducing biological signal from oneto another that controls cell growth and division, cell death, cell fate, and cell motility [1]. Hence, altered activitiesin one or more of these STPs, e.g. PI3K/AKT signalling, Ras/MAP signalling, Notch signalling and Wnt signalling ∗ corresponding author a r X i v : . [ q - b i o . Q M ] N ov PREPRINT - D

ECEMBER

2, 2020mediates uncontrolled cell growth/death resulting tumour initiation, progression, metastasis [2], or even resistancephenotype to targeted therapies [3]. Induction of these perturbed signalling activities often refers to oncogenic mutationsof key signalling proteins that are structurally central (hub) in corresponding STP structures, resulting in their or others’hyper-activation or inactivation [1]. Therefore, elucidating mechanisms underpinning aberrant signalling activities arecrucial for successful cancer therapeutics.Hight-throughput datasets e.g. Gene/Protein expressions, copy number variation, methylation, and microRNA ex-pression from cohorts of cancer-related sample patients (i.e. treated, untreated, and resistant) enable us genome-wideanalyses of disease phenotypes (e.g. cancer malignance or cancer drug-resistance) and identiﬁcation of relevant biomark-ers in silico. Utilizing those high-throughput datasets for deriving data-driven models facilitates critical assessments ofthe systems behaviour in response to any perturbation from wild-types, i.e. due to the presence of disease markers,treatment with drugs or resistance phenotypes. Therefore, these data-driven model inference offers unique scopeto deduce novel ﬁndings that may better reﬂect the systems dynamics under study. However, it’s a very complexcomputational task and still an open problem in the ﬁeld of systems biology since inferring optimal model from thedata require smart searching algorithms within the space of all possible models. The simplest of all approaches couldprobably be deriving co-expression network from available datasets among all signalling proteins but those models willlack causalities that are often crucial for better understanding interactions among biological entities.A BN is deﬁned as a directed acyclic graph (DAG), ‘G’ = (V,E), where ‘V’ are a set of random variables (heregenes/Proteins) and ‘E’ represents relationships among those random variables. Each random variable is associatedwith conditional probability distributions given it’s parents except the root variable, which are only associated withcorresponding prior probability distributions. Causal BNs possess causal edges, for example X −→ Y indicates causalityfrom random variable ‘X’ to random variable ‘Y’. An important property of BNs is called ‘Markov condition’, whichmust be satisﬁed for the probability distributions of the constituent random variables. ‘Markov condition’ states thateach variable in the BN should be conditionally independent of other random variables that are its non-descendantsconditional on its parents.To model STP as a BN, we can model the phosphorylation activities of signalling proteins as random variables andany phosphorylation activities (i.e. signalling activities) among signalling proteins as arcs among those variables.Nonetheless, BN models like such must meet the above-mentioned ‘Markov condition’ to represent the joint probabilitydistribution of the random variables. Woolf et al. [4] suggested that the steady-state concentration of signalling proteinsand their corresponding signalling activities supports this condition, and Sachs et al. [5] has conducted a successfulproof-of-principle analysis on that.Recently, Bayesian network (BN) has been applied to model aberrant STPs by studying case/control gene expressiondata [6] using MCMC methods such as Metropolis-Hastings or Gibbs sampling algorithm. Like other non-MCMCapproaches (i.e. score-based methods such as Greedy, Hill-Climbing, Tabu search algorithm), these MCMC approachesalso suffers local optimal problem, which means models claimed as optimal from those methods may not be globallyoptimal. Recently, we have shown that compared to Metropolis-Hastings algorithm, newly developed Neighbourhoodsampler (NS) and Hit-and-Run sampler ﬁnds BN structures with better accuracy and faster convergence towards truedistribution. For example, the main rationale behind NS sampler’s efﬁciency lies in its reduction step, where the rejectedBNs are excluded from being chosen second time, offering greater chances to others to be selected, and therefore haveless chance to get stuck at local optima [7]. We hope using these two methods, we will be able to infer robust BN as aperturbed STP in given high-throughput datasets.Biological networks reveal scale-free topology where majority of nodes possess fewer connections whereas small ofamount nodes have very high connectivity, typically known as hub nodes [8]. It is shown that data-driven networkmodels also feature the similar behaviour [9], i.e. the existence of hub nodes within the network. This behaviouris expected in biological systems as for disease perturbations to take place and/or spread within the system, it musttarget central players that has maximum reachability to other nodes via excessive connectedness. Hence hub nodesserve as key bio-markers which must be analysed, found and targeted in order to counteract the perturbations. It’simportant to acknowledge that biological processes (e.g. signalling activities) underlying data and the experimentalmeasurements are stochastic. For example, derived interactions in the data-driven BN model may not be truly reliablewhereas some critical interactions may remain undetected. Therefore, after modelling data-driven STP (optimal BN),statistical approaches relying on probabilistic models should be adopted to assess the probability of forming eachinteractions (i.e. aberrant activities among signalling proteins) within that BN structure. Being originally proposed insocial network analyses, Exponential Random Graph Models (ERMGs), or p*-models [10] are central of statisticalmodelling of networks, where each possible edge in a network is considered as a random variable and modelled ascombinations structural properties of the constituent nodes, e.g. global density of nodes, attractiveness/expansiveness ofnodes (commonly known as sociality parameters). Note, non-statistical methods like descriptive approaches may alsomodel hub nodes based on their degree distributions which may not address stochastic nature of the underlying data and2

PREPRINT - D

ECEMBER

2, 2020experimental measurements [11]. In the context of this study, hub nodes should be highly social, which would allowus to analyse the possibilities of detecting those hub nodes as a key bio-marker underlying the aberrant STP structureformation.The primary research questions we would like address in this study are as two-folds. First, is there exists a causalstructure that optimally model the aberrant signalling activities within a signal transduction pathways (STP) giventwo-conditional datasets e.g. cancer-vs-normal or resistant-vs-treatment?

Second, how the data-driven STP structurehas emerged – is there any local biological process or local structure (e.g. star-like shapes) that contributed ingenerating the global structure?

We hypothesize that the capabilities of new MCMC sampling methods in accurateBN inference can be leveraged for ﬁnding optimally aberrant STP structures by studying two conditional studies –structures that would model perturbed signalling activities in cancer-vs-normal or resistant-vs-treatment conditions.In statistical modelling, it can be hypothesized that global structure of a network emerges from the agglomeration oflocal structures that relies on the local inter-dependencies of edges. We also hypothesize that statistical analyses of thegeneric and structural properties of the inferred aberrant STP using ERGM models along with hierarchical Bayesianmodelling of their hyper-parameter inter-dependencies would unravel key genes/proteins (e.g. network hubs) that drivesthose perturbations within that particular STP.

PathTerbEr is a data-driven approach in detecting perturbation driver bio-markers in signaling pathways. The schematicview of the proposed research plan is depicted in Figure 1.Figure 1:

PathTurbEr : a schematic diagram of our proposed approach.

Acquiring datasets and pre-processing

A global gene expression dataset for lapatinib sensitive-vs-resistant samples, for breast cancer cell-line, namely SKBR3,from Gene expression omnibus (GEO) with accession number GSE38376, published by Komurov et al. [12]. Thisdataset includes samples with basal conditions (i.e. lapatinib-sensitive) dosed with 0 µ M, 0.1 µ M, and 1.0 µ M of lapatinibdrug, and samples with resistant conditions with the same doses, respectively. Thus, two data matrices, each for3

PREPRINT - D

ECEMBER

2, 2020lapatinib-sensitive and lapatinib-resistant were obtained, which was further considered as control-vs-case data for thedownstream analyses. 45 signalling pathways were collected from KEGG pathway database [13].

Differential expression and functional enrichment test

For the differential expression analysis, the raw gene expression were ﬁtted with linear models using limma R package[14] with eBayes function to calculate the Bayes statistics. Furthermore, the signiﬁcance score (i.e. p-values) werefurther adjusted with FDR correction techniques. Next, a set of differentially expressed genes (DEGs) based on theadjusted p-values, were used for over-representation test with 45 KEGG signalling pathways using the hyper-geometrictest by phyper function available in stats R package. The enrichment signiﬁcance score (p-values) were further adjustedusing FDR-correction.

Inferring optimally perturbed STP structure using MCMC sampling algorithms

In this study, we have modelled the signal transduction pathways (STPs) as Bayesian Networks (BNs), which is agraphical model for inferring causal dependencies among statistical variables, which is the genes/proteins in this case.Let, G := ( V, E ) be a Bayesian network as graph with V is the set of genes, and E is the set of causal edges (i.e.directed) connecting the nodes from V, which are also acyclic and connected graph. Parameter inference of BNs using Dirichlet-Multinomial distribution

Prior to BN structure learning, we modeled the perturbation of the lapatinib-resistant samples using the followingequation: z Rij = r Rij − µ Si σ Si (1)where, r Rij , µ Si , σ Si , and z Rij are raw expression value of a gene i in the sample j , the average and standard deviation ofall the samples in the lapatinib-sensitive condition of gene i , and the transformed value of R ij , respectively. Next, thetransformed lapatinib-resistant data matrix were further discretized using the following formula: z R (cid:48) ij =  z Rij > . − z Rij < − . otherwise (2)In BNs, the causality is modelled as the conditional probabilities between parent and the child nodes, where the childrennodes probabilities are conditioned on the probabilities of parents. With a given data D and a candidate BN structure, G ,a conditional probability is deﬁned as P ( X i | P a ( X i )) , where X i is a particular gene, and P a ( X i ) is the set of parentsof X i . When the data, D is given, this conditional probability can be deﬁned as P ( X i = k | P a ( X i ) = j ) = θ ijk , where θ i jk is is the probability of each state value k within each node X i , given that its parents are in conﬁguration j . Finally,to calculate the likelihood score (i.e. probability) of a BN structure, G , a multinomial distribution (MD) was used torelate the data with model, with dirichlet distribution for its prior. Hence, the formula for the Dirichlet-Multinomialmodel is as follows: P ( D | G ) = n (cid:89) i =1 q i (cid:89) j =1 Γ( α ij )Γ( α ij + N ij ) r i (cid:89) k =1 Γ( α ijk + N ijk )Γ( α ijk ) (3)where N ijk is the number of observations in bin k of node i corresponding to a parent conﬁguration j , r i is the numberof possible state values (bins) for a particular variable X i , q i is the total number of conﬁgurations of parent state valuesof X i , N ij = (cid:80) r i k =1 N ijk , and α ijk are the hyper-parameters (hyper-conditional probabilities), α ij = (cid:80) r i k =1 α ijk .Here, we have considered α ijk = αq i r i , where α is the total imaginary counts for the Dirichlet prior. The posteriorprobability distribution of the graph G given data D can now be constructed as: P ( G | D ) = P ( D | G ) P ( G ) P ( D ) (4)where we only consider the numerator as the P ( D ) does not depends of G . Moreover, we’ve used a uniform prior for P ( G ) . 4 PREPRINT - D

ECEMBER

2, 2020

Structure inference using Neighbourhood sampler, Hit-and-Run and Metropolis-Hastings sampler

Here, we have three MCMC (Markov Chain Monte Carlo) sampling algorithms, namely Neighbourhood sampler,Hit-and-Run sampler and Metropolis-Hastings algorithms for inferring optimal STP from the whole BN search space.At each sampling iteration, each of these algorithms considers a candidate BN and looks around its neighbourhoodspace and evaluates the likelihood of it using equation 4, and selects the one that maximizes the likelihoods as optimalBN models for a signalling pathway of interest. The details of each of these algorithms are explained in the originalarticles [7, 15], which is beyond the scope of this article. Finally, the inferred BN STPs (from each of these MCMCsampling algorithms) was considered as the optimally perturbed STP structure model.

Bayesian statistical modelling of the optimal STP to infer perturbation driver genesNetwork model

In this study, we have used a fully Bayesian statistical modelling approach for analysing the statistical aspect ofthe perturbed STP structure yielded from the MCMC sampling algorithms of BN structure learning (see previoussub-section). We have used, p -model, initially proposed by Holland and Leinhardt [16], is a special class of exponentialfamilies of distributions, for this study that offer robust and ﬂexible parametric models, which is used to evaluate theprobability that a gene to be hub in the perturbation network inferred in previous sub-section.A STP BN model can be referred as a gene-gene relationship causal network with g genes, which can be considered asa random variable U over a space of g ( g − graphs. Let U=u is the realization of U and the binary outcome u ij = 1 if gene i interacts with gene j , or u ij = 0 otherwise, then u is a binary data matrix. Let P r ( u ) be the probability functiongiven by the following formula: P r ( u ) = P r ( U = u ) = 1 κ ( θ ) exp (cid:88) p θ p z p ( u ) (5)where z p ( u ) is the network statistic of type p , θ p is the parameter associated with z p ( u ) and κ ( θ ) is the normalizingconstant that ensures P r ( u ) is a proper probability distribution (sums to 1 over all u in G ) [17]. The parameter θ is avector of model parameters associated with network statistics and needs to be estimated. See [ ? ] for further details.A major challenge in computing with the p -model is the calculation of κ ( θ ) , for which the maximum likelihoodestimation is intractable due to the large graph space cardinality. A technique called maximum pseudolikelihoodestimation [18] has been developed to address this issue, which employes MCMC sampling method such as Gibbs orMetropolis-Hastings sampling algorithms [19]In this study, we have considered the undirected version of p -model, which can be simpliﬁed by using only twoBernoulli variables Y ij and Y ij as follows: Y ijk = (cid:26) if u ij = k, otherwise Now the p -model can be deﬁned using the following two equations to predict the probability of an edge being presentbetween gene i and gene j : log { P r ( Y ij = 1) } = λ ij + θ + α i + α j (6) log { P r ( Y ij = 1) } = λ ij (7)for i < j , where θ is the global density parameter, and α , which represents the sociality of a gene to be connected in anundirected network. Note, λ ij is there to ensure P r ( Y ij = 1) + P r ( Y ij = 1) = 1 . Bayesian model to infer perturbation drivers

Here, we have employed a fully Bayesian approach to infer the posteriori of parameters of the above p -model. For thatpurpose, MCMC sampling methods i.e. Gibbs sampling approach were adopted, which generate samples from the jointdistribution of P ( M , θ |D ) , where M is model under consideration, θ is the set of model parameters to be inferred, andthe D is the data. Gibbs sampling iteratively samples on parameter at a time with full conditional distribution of thecurrent and old values of other parameters. Our approach relies on a hierarchical Bayesian model where the model5 PREPRINT - D

ECEMBER

2, 2020parameters depends on several hyper-parameters i.e., θ and α , both assumed to be following a normal distribution,which in tern relies on 0 mean and standard deviation of σ θ like the following: θ ∼ N (cid:0) , σ θ (cid:1) (8) α ∼ N (cid:0) , σ α (cid:1) (9)Assuming τ = σ − , we can consider a gamma prior for both τ θ and τ α since, gamma is the conjugate prior of normaldistribution: τ θ ∼ Gamma ( a , b ) (10) τ α ∼ Gamma ( a , b ) (11)where we set a = 0.001 and b = 0.001 to make the prior for θ non-informative , making its standard deviation wideenough to express large uncertainty [11]. To implement Gibbs sampling we have used JAGS (Just another GibbsSampler). We hypothesized that the perturbation driver genes would yield higher connectivity, i.e. reveal larger socialityscore ( α values) in a STP. Differential Expression analysis and Functional enrichment test

As a pilot experiment to observe the expression dynamics between case-vs-control studies, we conducted a differentialexpression analysis with lapatinib-sensitive and lapatinib-resistant gene expression from Breast cancer patients collectedfrom Gene expression omnibus (accession number: GSE38376). Using limma R package with a threshold of 0.00001for adjusted p-value (bonferroni corrected), we have found 512 differentially expressed genes (DEGs) [AdditionalFile 1] as shown in Figure 2A. Next, to observed which signal transduction pathways (STPs) were enriched with theselected DEGs, we conducted a statistical over-representation analysis with a hyper-geometric test with phyper functionavailable in stat

R package. Using 45 KEGG signalling pathways, we have found 20 STPs that are signiﬁcantly enrihcedwith the selected DEGs yeilding adjusted p-value < . × − ), Hippo signalling (adj. p-Value= . × − ), Chemokine signalling (adj. p-Value = . × − ), TGF- β signaling (adj. p-Value = . × − ),and Thyroid hormone signaling (adj. p-Value = . × − ). As a case study of all our downstream experimentsand analyses, i.e optimal STP structure learning and its perturbation driver identiﬁcation, we have selected the TGF- β signaling pathway (the number of nodes = 79) as it demonstrated signiﬁcant enrichment with DEGs and reportedto play signiﬁcant role in breast cancer progression/resistance phenomenon via cross-talking with EGFR-mediated(lapatinib-targeted) signaling pathways [20]. Novel MCMC methods for sampling Bayesian networks ﬁnds optimal STP as dys-regulated pathway inLapatinib ResistanceData pre-processing

We hypothesized that the causal Bayesian network models may potentially reveal aberrant signalling activities givenperturbation data from two-conditional dataset. At ﬁrst, for a particular gene in a particular sample in the raw geneexpression data for lapatinib-resistant cell-line (SKBR3) was standardized into z-score, based on the mean and standarddeviation of expression of the all the samples in the corresponding gene in the lapatinib-sensitive conditions. Thisstandardized score would measure how each resistant sample (case data) data is deviated from the control expression,which is in this case lapatinib-sensitive condition, thus revealing the perturbation measurements of resistant samplescompared to the distribution of sensitive samples. Next, each expression values (i.e. z-score transformed) of lapatinib-resistant dataset were discretized in three levels based on thresholding ( > < -1.5): over-expressed , down-expressed ,or neutral to model the data-driven Bayesian networks. 6 PREPRINT - D

ECEMBER

2, 2020Figure 2: Differential expression (A) and function enrichment analyses (B) of resistant-vs-sensitive gene expressiondata in breast cancer cell-lines, SKBR3.

Optimal STP sampling for TGF- β signalling using Neighbourhood Sampler, Hit-and-Run andMetropolis-Hastings algorithm In this study, we’ve considered a Bayesian network (BN) approach to model the signalling transduction pathways(STPs). With the standardized and discretized expression data for lapatinib-resistant cell-lines (SKBR3), we haveconducted three MCMC sampling algorithms to search through a space of Bayesian network models of a STP of interest:i.e. TGF- β signalling pathway. Those three MCMC sampling algorithms are: Neighbourhood sampler, Hit-and-Runsampler and Metropolis-hastings sampler. For each of these samplers, we chose the threshold of the number parentsand the number of children is 4. Each sampling algorithm ran for 5000 sampling iteration and the burn-in iterationwas considered as 3000, which indicates the latest 2000 iterations were used for actual sampling. Figure 3A depictsthe convergence of these three sampling algorithms, where Neighbourhood sampler (NS) starts to converges beforeMetropolis-Hastings (MH) and Hit-and-Run (HAR) samplers, but after the burn-in period (3000 iterations), all of themconvergences were fairly observed.At each sampling iteration, each sampler picks a candidate BN models that achieves the best log-likelihood score amongall others in their neighbourhood BN space. When ﬁnished, each sampler returns the BN that was sampled with highestprobability of sampled (highest frequency out of the remaining 2000 sampling iterations) as the optimal STP model forthe TGF- β signalling pathway. Figure 3B-D, depicts three TFG- β BN models that were sampled from NS, HAR andMH samplers, respectively.

Studying statistical properties of optimal STP reveals important genes that drive dys-regulation in lapatinibresistance

With the optimal STP model, we conducted experiment to observed the statistical aspect of the network formation withregards to their local structure formation. To that extent, we have used a fully Bayesian approach to infer the posteriori of each node’s sociality information, which is analogous to it’s importance as hub-gene in its interactome, and thusreveal its potentiality of being a gene that drive the perturbation in the signalling pathway. We have used the MCMCsampling approach again to infer the posteriori of network parameters of the p -model that represent the edge formationprobability of an edge in the STP as a linear combination of several structure parameters, including α , the socialityparameters of involving node-pairs. This MCMC sampling algorithm was conducted for each of the three inferredTGF- β signalling pathways individually (see above) for 5000 iterations, 3000 of which were considered as the burn-in7 PREPRINT - D

ECEMBER

2, 2020Figure 3: MCMC sampling and inferred perturbation BN model for TGF- β signalling pathway using Neighbourhoodsampler, Hit-and-Run sampler, and Metropolis-hasting sampler.period. When completed, MCMC sampling with these three BN models of TGF- β signalling pathway yielded 31, 33,and 22 social genes with of α values >

0, respectively, that are listed along with their sensitive-vs-resistance log2 foldchange in Table 1.

Network analysis of aberrant STP

To observe the biological relevance of inferred driver genes in TGF- β signalling pathways, we conducted GO term(Biological processes) enrichment using clueGO [21] plugin in Cytoscape. Figure 4A-C depicts the signiﬁcantlyenriched GO terms, where positive regulation of pathway-restricted SMAD protein phosphorylation, activin receptorsignalling, regulation of androgen receptor signalling, regulation of lymphocyte differentiation, and regulation ofhormone secretion are observed. Using PathView [22] R package, we have also shown TFG- β signalling pathwaydiagram overlayed with the constituent genes/proteins with their log2 fold change values in Figure 5. Comparing with other methods in identifying aberrant STP

We have also experimented how our method performs in identifying pathway perturbation compared to other method.There are several approaches for ﬁnding perturbed pathways or functional terms. SPIA (Signaling pathway impactanalysis) [23] is one of such approaches, which not only considers the functional enrichment tests but also the abnormalperturbation (based on pathway topology) of that pathway given a set of DE markers. SPIA have found that, both of8

PREPRINT - D

ECEMBER

2, 2020Table 1: Perturbation drivers for three TGF- β signalling pathway models identiﬁed by MCMC sampling Gene From NS From HAR From MH LogFC GeneSymbol sampling sampling sampling Name

NOG – 0.036803 – 0.061517 nogginTHBS1 0.16256 0.152164 – 1.221493 thrombospondin 1DCN – 0.028003 – 0.10499 decorinLEFTY1 0.166053 – – 0.04226 left-right determination factor 1FST 0.054609 0.151098 0.081761 0.040524 follistatinBMP2 – 0.045136 0.082975 -0.15355 bone morphogenetic protein 2BMP4 0.053143 – 0.081372 0.04963 bone morphogenetic protein 4BMP5 0.057964 – – -0.0029 bone morphogenetic protein 5BMP6 0.054911 – – -0.01492 bone morphogenetic protein 6BMP7 0.051767 0.037539 0.177469 -0.93311 bone morphogenetic protein 7BMP8B – 0.0506 – -0.0773 bone morphogenetic protein 8bBMP8A – 0.142381 – 0.045929 bone morphogenetic protein 8aGDF7 0.055338 – – -0.18404 growth differentiation factor 7AMH 0.054537 – 0.165438 0.095488 anti-Mullerian hormoneTGFB2 0.164959 – – 0.706354 transforming growth factor beta 2INHBA 0.054199 0.033856 0.077547 -0.00524 inhibin beta A subunitINHBB – 0.030165 – -0.38803 inhibin beta B subunitNODAL – 0.037555 – -0.07754 nodal growth differentiation factorBMPR2 – 0.142948 – 0.7048 bone morphogenetic protein receptor type 2TGFBR2 – 0.160855 – 1.005743 transforming growth factor beta receptor 2ACVR2B 0.159379 – 0.173425 0.133561 activin A receptor type 2BBMPR1B – 0.157393 0.082578 0.11865 bone morphogenetic protein receptor type 1BACVR1 0.16285 0.038792 0.173751 0.151194 activin A receptor type 1ACVR1B 0.164295 0.035427 – 0.086201 activin A receptor type 1BBAMBI – 0.037039 0.087164 0.350395 BMP and activin membrane bound inhibitorSMAD1 0.050226 – – 0.199748 SMAD family member 1SMAD5 – 0.040064 – -0.06311 SMAD family member 5SMAD9 0.057111 – – 0.108667 SMAD family member 9SMAD2 0.161696 – – 0.150392 SMAD family member 2SMAD3 – 0.137773 – 0.273339 SMAD family member 3SMAD4 – – 0.173165 0.197391 SMAD family member 4SMAD6 – 0.156279 – -0.08563 SMAD family member 6SMAD7 0.055232 – – 0.15036 SMAD family member 7SMURF1 0.056516 0.040997 – 0.251961 SMAD speciﬁc E3 ubiquitin protein ligase 1ZFYVE9 0.053843 0.150341 – 0.118893 zinc ﬁnger FYVE-type containing 9ZFYVE16 0.16138 – 0.16913 -0.05421 zinc ﬁnger FYVE-type containing 16ID1 0.166364 – 0.175444 0.305308 inhibitor of DNA binding 1, HLH proteinID3 0.167612 0.04118 – -0.00622 inhibitor of DNA binding 3, HLH proteinID4 0.04719 – 0.173964 0.500908 inhibitor of DNA binding 4, HLH proteinRBL1 – 0.043975 – -0.25319 RB transcriptional corepressor like 1E2F5 – 0.149304 0.172732 0.477958 E2F transcription factor 5TFD p – 0.15106 – -0.17405 transcription factor Dp-1CREBBP 0.055956 0.050031 – 0.536597 CREB binding proteinEP300 – 0.153432 – 0.058004 E1A binding protein p300S p – 0.14603 – -0.00254 S p transcription factorCDKN2B 0.058374 0.149919 – 1.095818 cyclin dependent kinase inhibitor 2BPITX2 – – 0.085195 -0.60363 paired like homeodomain 2RBX1 0.16479 – 0.083075 -0.01859 ring-box 1CUL1 – 0.149773 – -0.01605 cullin 1MAPK1 – – 0.084654 0.141796 mitogen-activated protein kinase 1MAPK3 0.165836 0.043168 – 0.134163 mitogen-activated protein kinase 3IFNG 0.061818 0.145331 0.081417 0.085719 interferon gammaRHOA – – 0.164908 -0.26438 ras homolog family member APPP2CA 0.059932 – – 0.115046 protein phosphatase 2 catalytic subunit alphaPPP2CB – – 0.08517 -0.38888 protein phosphatase 2 catalytic subunit betaRPS6KB2 0.048165 – – -0.18307 ribosomal protein S6 kinase B2ACVR1C – – 0.075583 0.052312 activin A receptor type 1C9 PREPRINT - D

ECEMBER

2, 2020Figure 4: Functional enrichment of perturbation driver genesthose experiments are not necessarily dependent to each other, i.e. evidence from both of those tests signiﬁes somewhatindependent biological evidence. Hence, SPIA conducts both (perturbation test + enrichment tests) and statisticallycombines their signiﬁcance and provides a combined signiﬁcance level, namely PG . For mathematical details, pleaserefer to the original article [23]. The left panel of Figure 6 shows Log2 fold-change vs perturbation accumulation scoresfor TGF- β signalling pathway (KEGG pathway ID = hsa04350) with their 9 enriched DEGs. Moreover the right panelshows the density plot of total perturbation accumulation scores and a probability of perturbation. As we can see fromthis plot that SPIA has detected TGF- β signalling pathway as perturbed with the probability of perturbation, pPERT =0.038 ( < PREPRINT - D

ECEMBER

2, 2020Figure 5: Perturbation detection of TGF- β signalling pathway using PathView Discussion

In this study, namely ‘PathTurbEr’, we provide a novel probabilistic approach of data-driven modeling of aberrantsignal transduction pathways (STPs) via solving Bayesian Network structure learning problem, where each STP wasmodelled as a Bayesian Network. We adopted three MCMC sampling algorithm for this structure learning tasks, namelyNeighbourhood sampler, Hit-and-Run sampler and Metropolis-Hastings sampler, where each of the sampling algorithmyielded an optimal STP BN model that best describes the aberration activities that are latent in two conditional studies,e.g. lapatinib resistant-vs-sensitive expression datasets. Note, this approach only models the aberrant activities in asignalling pathway of interest (e.g. TGF- β signalling) by learning above two-conditional data, not just any data-drivenstructure of the signalling pathway. Next, we conducted MCMC sampling of the structural features of that those inferredSTP to observe how the local feature can describe the overall structure formation probability by employing a fullyBayesian statistical modelling approach with p -model. This experiment yielded a set of important genes, namely socialgenes (analogous to hub genes in the interactome), in those aberrant BN models of STPs. Functional enrichment testand comparison with state-of-the-arts methods conﬁrms the signiﬁcant functional relevance of the inferred driver genes.This robustly summarizes the important contribution of our approach to this problem.A set of bio-markers detected by differential expression analyses followed by the their functional pathway enrichmenttests manifested several signalling pathways revealing perturbed expression in resistant-vs-sensitive conditions, includingPI3K-AKT signalling, Hippo signalling, Chemokine signalling, TGF- β signalling, and Thyroid hormone signalling.TGF- β signalling has been reported to be contributing in resistance mechanism to targeted treatment in HER2-positivebreast cancer [24]. Komurov et al. also reported TGF- β signalling pathway as perturbed in resistant-vs-sensitiveconditions in acquired lapatinib resistance in breast cancer cell-line, SKBR3 [12]. Tortora et al. previously reported thatTGF- β type I receptor activation is responsible for increased expression of HER ligands, and that mediates increasedsecretion of TGF- α , amphiregulin, and heregulin, and that activates PI3K-AKT signalling pathway, which is also foundto be perturbed in our analyses [25]. Moreover, our previous study of analysing drug-resistive cross-talks also revealed11 PREPRINT - D

ECEMBER

2, 2020Figure 6: SPIA analyses to detect perturbation of TGF- β signalling pathwaythat TGF- β signalling pathway cross-talks with EGFR/HER2 signalling pathway via activation of TFG- β receptor, andthus provide a compensating route for resistant cells to survive despite continuous drug intake, i.e. lapatinib [20].Above evidence of TGF- β signalling perturbation in resistant-vs-sensitive condition led us to investigate furthermore,which factors may drive this perturbation, i.e. if any bio-marker can be associated with this aberrant phenomenon.Therefore, we adopted a fully Bayesian statistical modelling approach in order to decipher important genes in aprobabilistic manner. Unlike the frequentist/descriptive approach to identify hub genes as important genes in gene-generelationship networks, probabilistic approaches adopt uncertainty in network formation, which is a intrinsic featureobserved in biological networks, especially when derived from high-throughput experimental data [11]. Our hypothesiswas that the statistically important genes in a given perturbation network structure in the context of connectivity, wouldindicate a key role underpinning the aberration. Results from this analysis revealed 31, 33, and 22 perturbation drivergenes out of 79 constituent genes of three STP models of TGF- β signalling yielded from Neighbourhood sampler,Hit-and-Run sampler, and Metropolis-Hastings sampler, respectively 1. Out of these perturbation driver genes found byMCMC sampling, 9 genes, namely THBS1, TGFB2, TGFBR2, ID4, E3F5, CREBBP, CDKN2B, BMP7, and PITX2were also identiﬁed as differentially expressed (ﬁrst 7 genes were up-regulated whereas remaining were down-reguated)in resistant-vs-sensitive conditions in lapatinib resistance, in our independent differential expression analyses.Next, we aimed to observe the functional association of these putative genes driving the signalling perturbationby conducting an over-representation test with a set of GO terms (Biological process). We hypothesized that, theperturbation driver genes would be associated with many important biological functions triggered by TGF- β receptorsignalling that are relevant to breast cancer progression/resistance phenomenon. We have observed several biologicalfunctions including positive regulation of pathway-restricted SMAD protein phosphorylation, regulation of lymphocytedifferentiation, positive regulation of transmembrane receptor protein serine/threonine kinase signaling, regulationof activin receptor signalling, cellular response to growth factor stimulus, regulation of tgf-beta receptor signalling,and cellular response to BMP stimulus. It has been reported that TGF- β -SMAD signalling plays signiﬁcant rolein EGFR/HER-positive breast cancer oncogenesis [26, 27]. Chen et al. recently reported that tumor inﬁltrating12 PREPRINT - D

ECEMBER

2, 2020lymphocytes serves as critical pathological factor in predicting prognosis of breast cancer patients that are treatedwith anti-HER2 drugs, in our case which is lapatinib [28]. The role of serine/threonine kinase signaling involvingtrans-membrane receptor proteins is well reported, i.e. it controls both cell proliferation and cell death in response tovarious stresses [29]. Jeruss et al. has also reported that regulation of both TGF- β and activin signalling components areassociated with advanced oncogenic progression in aggressive breast carcinoma [30]. BMPs are found over-expressedin breast cancer patients, but results from Owens et al. suggested that the inhibition of BMP stimulated signalling mayreduce breast cancer metastasis by targeting both the tumor and its surrounding micro-environments [31]. Conclusion

Resistance to targeted therapies is a major obstacle in sustained and efﬁcacious treatment plan for ﬁghting againstany cancer. Lapatinib, an EGFR/HER-2 dual inhibitor, although shown great initial promise in treating breast cancerpatients, have ultimately been bypassed by the cancer-cells’ alternate survival mechanism via compensatory pathways.Therefore, a root-cause analysis i.e. detection of bio-markers driving the pathway perturbation in resistant-vs-sensitiveconditions with a robust framework is greatly needed to offer better therapeutic developments. Previous approachesmainly adopted descriptive methodologies with less emphasis on the stochastic phenomenon, which is very commonlyexist in biological systems. Hence, ‘PathTurbEr’ adopts statistical models (e.g. Bayesian networks) to detect aberrantsignalling networks, followed by fully Bayesian statistical approach to identify the markers driving those perturbation.Our framework has been developed in a generalized way, therefore we hope that it can be applied to any case-vs-controlexpression studies, and analyse similar hypothesis for any given cancer datasets.

Competing interests

The authors declare that they have no competing interests.

Author’s contributions

AKMA and SA conceived the idea. SA contributed to the MCMC sampling method design for Bayesian networkinference from data. AKMA contributed the methodologies for statistical parameter sampling methodologies usingMCMC methods. AKMA conducted the experiments, analysed the data and wrote the manuscript. SA contributed tothe manuscript.

Additional Files

Additional ﬁle 1 — 512 DEGs and their function enrichment results with 45 KEGG signalling pathways

References [1] R. Sever and J. S. Brugge. Signal transduction in cancer.

Cold Spring Harb Perspect Med , 5(4), Apr 2015.[2] Luca Grumolato and Stuart A. Aaronson.

Aberrant Signaling Pathways in Cancer , pages 1–8. American CancerSociety, 2017.[3] A. K. Azad, A. Lawen, and J. M. Keith. Bayesian model of signal rewiring reveals mechanisms of genedysregulation in acquired drug resistance in breast cancer.

PLoS One , 12(3):e0173331, 2017.[4] P. J. Woolf, W. Prudhomme, L. Daheron, G. Q. Daley, and D. A. Lauffenburger. Bayesian analysis of signalingnetworks governing embryonic stem cell fate decisions.

Bioinformatics , 21(6):741–753, Mar 2005.[5] Karen Sachs.

Bayesian Network Models of Biological Signaling Pathways . PhD thesis, Massachusetts Institute ofTechnology, 2006.[6] R. Neapolitan and X. Jiang. Inferring Aberrant Signal Transduction Pathways in Ovarian Cancer from TCGAData.

Cancer Inform , 13(Suppl 1):29–36, 2014.[7] Salem A. Alyami, A. K. M. Azad, and Jonathan M. Keith. The neighborhood MCMC sampler for learningBayesian networks. In Xudong Jiang, Guojian Chen, Genci Capi, and Chiharu Ishll, editors,

First InternationalWorkshop on Pattern Recognition , volume 10011, pages 326 – 336. International Society for Optics and Photonics,SPIE, 2016. 13

PREPRINT - D

ECEMBER

2, 2020[8] A. L. Barabási and Z. N. Oltvai. Network biology: understanding the cell’s functional organization.

Nat RevGenet , 5(2):101–113, Feb 2004.[9] M. Ernst, Y. Du, G. Warsow, M. Hamed, N. Endlich, K. Endlich, H. Murua Escobar, L. M. Sklarz, S. Sender,C. Junghan?, S. M?ller, G. Fuellen, and S. Struckmann. FocusHeuristics - expression-data-driven networkoptimization and disease gene prediction.

Sci Rep , 7:42638, 02 2017.[10] Stanley Wasserman and Garry Robins.

An Introduction to Random Graphs, Dependence Graphs, and p* , page148–161. Structural Analysis in the Social Sciences. Cambridge University Press, 2005.[11] S. Bulashevska, A. Bulashevska, and R. Eils. Bayesian statistical modelling of human protein interaction networkincorporating protein disorder information.

BMC Bioinformatics , 11:46, 2010.[12] K. Komurov, J. T. Tseng, M. Muller, E. G. Seviour, T. J. Moss, L. Yang, D. Nagrath, and P. T. Ram. Theglucose-deprivation network counteracts lapatinib-induced toxicity in resistant ErbB2-positive breast cancer cells.

Mol Syst Biol , 8:596, 2012.[13] M. Kanehisa. The KEGG database.

Novartis Found Symp , 247:91–101, 2002.[14] M. E. Ritchie, B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi, and G. K. Smyth. limma powers differentialexpression analyses for RNA-sequencing and microarray studies.

Nucleic Acids Res , 43(7):e47, Apr 2015.[15] SALEM ALI S ALYAMI. Markov chain monte carlo methods for bayesian network inference, with applicationsin systems biology. 2017.[16] Paul W. Holland and Samuel Leinhardt. An exponential family of probability distributions for directed graphs.

Journal of the American Statistical Association , 76(373):33–50, 1981.[17] Stanley Wasserman and Philippa Pattison. Logit models and logistic regressions for social networks: I. anintroduction to markov graphs andp.

Psychometrika , 61(3):401–425, 1996.[18] D Strauss and M Ikeda. Pseudolikelihood estimation for social networks.

Journal of the American StatisticalAssociation , 85(409):204–212, March 1990.[19] Tom A. B. Snijders. Markov chain monte carlo estimation of exponential random graph models.

Journal of SocialStructure , 3, 2002.[20] A. K. Azad, A. Lawen, and J. M. Keith. Prediction of signaling cross-talks contributing to acquired drug resistancein breast cancer cells by Bayesian statistical modeling.

BMC Syst Biol , 9:2, Jan 2015.[21] B. Mlecnik, J. Galon, and G. Bindea. Automated exploration of gene ontology term and pathway networks withClueGO-REST.

Bioinformatics , 35(19):3864–3866, 10 2019.[22] W. Luo and C. Brouwer. Pathview: an R/Bioconductor package for pathway-based data integration and visualiza-tion.

Bioinformatics , 29(14):1830–1831, Jul 2013.[23] A. L. Tarca, S. Draghici, P. Khatri, S. S. Hassan, P. Mittal, J. S. Kim, C. J. Kim, J. P. Kusanovic, and R. Romero.A novel signaling pathway impact analysis.

Bioinformatics , 25(1):75–82, Jan 2009.[24] C. Oliveras-Ferraros, B. Corominas-Faja, S. Cuf?, A. Vazquez-Martin, B. Martin-Castillo, J. M. Iglesias, E. L?pez-Bonet, ?. G. Martin, and J. A. Menendez. Epithelial-to-mesenchymal transition (EMT) confers primary resistanceto trastuzumab (Herceptin).

Cell Cycle , 11(21):4020–4032, Nov 2012.[25] Giampaolo Tortora. Mechanisms of Resistance to HER2 Target Therapy.

JNCI Monographs , 2011(43):95–98, 102011.[26] A. Sundqvist, E. Vasilaki, O. Voytyuk, Y. Bai, M. Morikawa, A. Moustakas, K. Miyazono, C. H. Heldin,P. Ten Dijke, and H. van Dam. TGF-beta and EGF signaling orchestrates the AP-1- and p63 transcriptionalregulation of breast cancer invasiveness.

Oncogene , 39(22):4436–4449, 05 2020.[27] E. Tarasewicz and J. S. Jeruss. Phospho-speciﬁc Smad3 signaling: impact on breast oncogenesis.

Cell Cycle ,11(13):2443–2451, Jul 2012.[28] T. H. Chen, Y. C. Zhang, Y. T. Tan, X. An, C. Xue, Y. F. Deng, W. Yang, X. Yuan, and Y. X. Shi. Tumor-inﬁltrating lymphocytes predict prognosis of breast cancer patients treated with anti-Her-2 therapy.

Oncotarget ,8(3):5219–5232, Jan 2017.[29] R. Manoharan, H. A. Seong, and H. Ha. Dual Roles of Serine-Threonine Kinase Receptor-Associated Protein(STRAP) in Redox-Sensitive Signaling Pathways Related to Cancer Development.

Oxid Med Cell Longev ,2018:5241524, 2018.[30] Jacqueline S. Jeruss, Charles D. Sturgis, Alfred W. Rademaker, and Teresa K. Woodruff. Down-regulation ofactivin, activin receptors, and smads in high-grade breast cancer.

Cancer Research , 63(13):3783–3790, 2003.14

PREPRINT - D

ECEMBER

2, 2020[31] P. Owens, M. W. Pickup, S. V. Novitskiy, J. M. Giltnane, A. E. Gorska, C. R. Hopkins, C. C. Hong, and H. L.Moses. Inhibition of BMP signaling suppresses metastasis in mammary cancer.