Inferring Signaling Pathways with Probabilistic Programming
arXiv preprint · Systems biology
David Merrell and Anthony Gitter*
Department of Computer Sciences, University of Wisconsin–Madison, USA; Morgridge Institute for Research, Madison, Wisconsin, USA; Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, USA.
*To whom correspondence should be addressed.
Abstract
Motivation:
Cells regulate themselves via dizzyingly complex biochemical processes called signaling pathways. These are usually depicted as a network, where nodes represent proteins and edges indicate their influence on each other. In order to understand diseases and therapies at the cellular level, it is crucial to have an accurate understanding of the signaling pathways at work. Since signaling pathways can be modified by disease, the ability to infer signaling pathways from condition- or patient-specific data is highly valuable.
A variety of techniques exist for inferring signaling pathways. We build on past works that formulate signaling pathway inference as a Dynamic Bayesian Network structure estimation problem on phosphoproteomic time course data. We take a Bayesian approach, using Markov Chain Monte Carlo to estimate a posterior distribution over possible Dynamic Bayesian Network structures. Our primary contributions are (i) a novel proposal distribution that efficiently samples sparse graphs and (ii) the relaxation of common restrictive modeling assumptions.
Results:
We implement our method, named Sparse Signaling Pathway Sampling, in Julia using the Gen probabilistic programming language. Probabilistic programming is a powerful methodology for building statistical models. The resulting code is modular, extensible, and legible. The Gen language, in particular, allows us to customize our inference procedure for biological graphs and ensure efficient sampling.
We evaluate our algorithm on simulated data and the HPN-DREAM pathway reconstruction challenge, comparing our performance against a variety of baseline methods. Our results demonstrate the vast potential for probabilistic programming, and Gen specifically, for biological network inference.
Availability:
Find the full codebase at https://github.com/gitter-lab/ssps
Contact: [email protected]
1 Introduction

Signaling pathways enable cells to process information rapidly in response to external environmental changes or intracellular cues. One of the core signaling mechanisms is protein phosphorylation. Kinases add phosphate groups to substrate proteins and phosphatases remove them. These changes in phosphorylation state can act as switches, controlling proteins' activity and function. A protein's phosphorylation status affects its localization, stability, and interaction partners (Newman et al., 2014). Ultimately, phosphorylation changes regulate important biological processes such as transcription and cell growth, death, and differentiation (Hunter, 2009; Kholodenko et al., 2010).
Pathway databases characterize the signaling relationships among groups of proteins but are not tailored to individual biological contexts. Even for well-studied pathways such as epidermal growth factor receptor-mediated signaling, the proteins significantly phosphorylated during a biological response can differ greatly from those in the curated pathway (Köksal et al., 2018). The discrepancy can be due to context-specific signaling (Hill et al., 2017), cell type-specific protein abundances, or signaling rewiring in disease (Pawson and Warner, 2007). Therefore, there is a need to learn context-specific signaling pathway representations from observed phosphorylation changes. In the clinical setting, patient-specific signaling pathway representations may eventually be able to guide therapeutic decisions (Drake et al., 2016; Halasz et al., 2016; Eduati et al., 2020).
Diverse classes of techniques have been developed to model and infer signaling pathways (Kholodenko et al., 2012). They take approaches including Granger causality (Shojaie and Michailidis, 2010; Carlin et al., 2017), information theory (Cheong et al., 2011; Krishnaswamy et al., 2014), logic models (Eker et al., 2002; Guziolowski et al., 2013; Gjerga et al., 2020), differential equations (Schoeberl et al., 2002; Molinelli et al., 2013; Henriques et al., 2017), non-parametric statistical tests (Zhang and Song, 2013), and probabilistic graphical models (Sachs et al., 2005), among others. Some signaling pathway reconstruction algorithms take advantage of perturbations such as receptor stimulation or kinase inhibition. Although perturbing individual pathway members can causally link them to downstream phosphorylation changes, characterizing a complex pathway can require a large number of perturbation experiments. Inferring pathway structure from temporal phosphorylation data presents an attractive alternative. A single time series phosphorylation dataset can reveal important dynamics without perturbing individual pathway members. For instance, a kinase cannot phosphorylate substrates before it is activated.
An alternative approach to pathway reconstruction selects a context-specific subnetwork from a general background network. These algorithms can use phosphorylation data to assign scores to protein nodes in a protein-protein interaction network. They then select edges that connect the high-scoring nodes, generating a subnetwork that may explain how the induced phosphorylation changes arise from the source of stimulation. Extensions accommodate temporal scores on the nodes (Patil et al., 2013; Budak et al., 2015; Köksal et al., 2018; Norman and Cicek, 2019).
Our present work builds on past techniques that formulate signaling pathway inference as a Dynamic Bayesian Network (DBN) structure estimation problem.
This family of techniques relies on two core ideas: (i) we can use a DBN to model phosphorylation time series data; and (ii) the DBN's structure translates directly to a directed graph representing the signaling pathway. Rather than identifying a single DBN that best fits the data, these techniques take a Bayesian approach: they yield a posterior distribution over possible DBN structures. Some techniques use Markov Chain Monte Carlo (MCMC) to sample from the posterior (Werhli and Husmeier, 2007; Grzegorczyk, 2010). Others use exact, enumerative inference to compute posterior probabilities (Hill et al., 2012; Oates et al., 2014; Spencer et al., 2015).
We present a new Bayesian DBN-based technique, Sparse Signaling Pathway Sampling (SSPS). It improves on past MCMC methods by using a novel proposal distribution specially tailored for the large, sparse graphs prevalent in biological applications. Furthermore, SSPS makes weaker modeling assumptions than other DBN approaches. As a result, SSPS scales to larger problem sizes and yields superior predictions in comparison to other DBN techniques.
We implement SSPS using the Gen probabilistic programming language (PPL). Probabilistic programming is a powerful methodology for building statistical models. It enables the programmer to build models in a legible, modular, reusable fashion. This flexibility was important for prototyping and developing the current form of SSPS and readily supports future improvements or extensions.
2 Methods

2.1 Model

SSPS makes specific modeling assumptions. We start with the DBN model of Hill et al. (2012), relax some assumptions, and modify it in other ways to be better-suited for MCMC inference.
Preliminary definitions.
We first define some notation for clarity's sake. Let G denote a directed graph with vertices V and edges E(G). Graph G will represent a signaling pathway, with vertices V corresponding to proteins and edges E(G) indicating their influence relationships. We use pa_G(i) to denote the parents of vertex i in G.
Let X denote our time series data, consisting of |V| variables measured at T timepoints. X is a T × |V| matrix where the jth column corresponds to the jth variable and the jth graph vertex. As a convenient shorthand, let X₊ denote the latest T − 1 timepoints in X, and let X₋ denote the earliest T − 1 timepoints in X. Lastly, define B_j ≡ X₋,pa_G(j). In other words, B_j contains the values of variable j's parents at the T − 1 earliest timepoints. In general, B_j may also include columns of nonlinear interactions between the parents. We will only include linear terms, unless stated otherwise.

Model derivation.
In our setting, we aim to infer G from X. In particular, Bayesian approaches seek a posterior distribution P(G | X) over possible graphs. From Bayes's rule we know P(G | X) ∝ P(X | G) · P(G). That is, a Bayesian model is fully specified by its choice of prior distribution P(G) and likelihood function P(X | G).
We derive our model from the one used by Hill et al. (2012). They choose a prior distribution of the form

    P(G | G′, λ) ∝ exp( −λ |E(G) \ E(G′)| )        (1)

parameterized by a reference graph G′ and inverse temperature λ. This prior gives uniform probability to all subgraphs of G′ and "penalizes" edges not contained in E(G′). λ controls the "importance" given to the reference graph.
Hill et al. choose a Gaussian DBN for their likelihood function. Intuitively, they assume linear relationships between variables and their parents: X₊,j ∼ N(B_j β_j, σ_j² I) for all j ∈ {1 . . . |V|}. A suitable prior over the regression coefficients β_j and noise parameters σ_j (Figure 1) allows us to integrate them out, yielding this marginal likelihood function:

    P(X | G) ∝ ∏_{j=1}^{|V|} (T + 1)^{−|pa_G(j)|/2} ( X₊,jᵀ X₊,j − (T/(T+1)) X₊,jᵀ B_j β̂_OLS )^{−(T−1)/2}        (2)

where β̂_OLS = (B_jᵀ B_j)^{−1} B_jᵀ X₊,j is the ordinary least squares estimate of β_j. For notational simplicity, Equation 2 assumes we have a single time course of length T. In general, there may be multiple time course replicates with differing lengths. The marginal likelihood generalizes to that case in a straightforward way.
In SSPS we use the same marginal likelihood function (Equation 2), but a different prior distribution P(G). We obtain our prior distribution by decomposing Equation 1 into a product of independent Bernoulli trials over graph edges. This decomposition in turn allows us to make some useful generalizations.
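To make the notation and likelihood concrete, here is a NumPy sketch of B_j and the per-vertex term of Equation 2. This is our illustration only (SSPS itself is written in Julia with Gen), the function name is ours, and it assumes the g-prior form of the marginal likelihood with g = T and a residual term that stays positive.

```python
import numpy as np

def log_marginal_likelihood_j(X, parents, j):
    """Per-vertex term of Equation 2, up to an additive constant.

    X: T x |V| data matrix; parents: list of indices pa_G(j); j: vertex index.
    """
    T = X.shape[0]
    X_plus_j = X[1:, j]        # latest T-1 values of variable j
    B_j = X[:-1, parents]      # parents' values at the earliest T-1 timepoints
    if len(parents) > 0:
        # Ordinary least squares estimate of beta_j
        beta_ols, *_ = np.linalg.lstsq(B_j, X_plus_j, rcond=None)
        fitted = X_plus_j @ (B_j @ beta_ols)
    else:
        fitted = 0.0
    resid = X_plus_j @ X_plus_j - (T / (T + 1.0)) * fitted
    # Each additional parent costs 0.5*log(T+1): an automatic sparsity penalty.
    return -0.5 * len(parents) * np.log(T + 1.0) - 0.5 * (T - 1.0) * np.log(resid)
```

Adding a parent only pays off when it shrinks the residual term enough to beat the 0.5 log(T + 1) penalty, which is why the posterior concentrates on sparse parent sets.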
Define edge existence variables z_ij ≡ [(i, j) ∈ E(G)]. Let Z be the |V| × |V| matrix of all z_ij. Then we can rewrite Equation 1 as follows:

    P(G | G′, λ) ≡ P(Z | G′, λ) ∝ ∏_{(i,j) ∉ E(G′)} e^{−λ z_ij}
        = ∏_{(i,j) ∈ E(G′)} (1/2)^{z_ij} (1/2)^{1 − z_ij} · ∏_{(i,j) ∉ E(G′)} ( e^{−λ} / (1 + e^{−λ}) )^{z_ij} ( 1 / (1 + e^{−λ}) )^{1 − z_ij}

where the last line is a true equality: it gives a normalized probability measure. We see that the original prior is simply a product of Bernoulli variables parameterized by a shared inverse temperature, λ. See Appendix A for a more detailed derivation.
Rewriting the prior in this form opens the door to generalizations. First, we address a shortcoming in the way reference graph G′ expresses prior knowledge. The original prior assigns equal probability to every edge of G′. However, in practice we may have differing levels of prior confidence in the edges. We address this by allowing a real-valued prior confidence c_ij for each edge:

    P(Z | C, λ) = ∏_{(i,j)} ( e^{−λ} / (e^{−c_ij λ} + e^{−λ}) )^{z_ij} ( e^{−c_ij λ} / (e^{−c_ij λ} + e^{−λ}) )^{1 − z_ij}        (3)

where C is the matrix of all prior confidences c_ij, replacing G′. Notice that if every c_ij ∈ {0, 1}, then Equation 3 is equivalent to the original prior. In effect, Equation 3 interpolates the original prior, permitting a continuum of confidences on the interval [0, 1].
We make one additional change to the prior by replacing the shared λ inverse temperature variable with a collection of variables, Λ = {λ_j | j = 1, . . ., |V|}, one for each vertex of the graph. Recall that the original λ variable determined the importance of the reference graph.
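A minimal sketch (ours, for illustration) of the per-edge Bernoulli prior in Equation 3; the same expression parameterizes the vertex-specific prior with λ replaced by λ_j:

```python
import math

def edge_prior_prob(c_ij, lam):
    """P(z_ij = 1 | c_ij, lambda) = e^{-lam} / (e^{-c_ij*lam} + e^{-lam})."""
    return math.exp(-lam) / (math.exp(-c_ij * lam) + math.exp(-lam))
```

A confidence c_ij = 1 gives probability 1/2 for any λ (the original "subgraph of G′" behavior), c_ij = 0 gives e^{−λ}/(1 + e^{−λ}), which vanishes as λ grows, and intermediate confidences interpolate between the two.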
In the new formulation, each λ_j controls the importance of the prior knowledge for vertex j and its parents:

    P(Z | C, Λ) = ∏_{(i,j)} ( e^{−λ_j} / (e^{−c_ij λ_j} + e^{−λ_j}) )^{z_ij} ( e^{−c_ij λ_j} / (e^{−c_ij λ_j} + e^{−λ_j}) )^{1 − z_ij}        (4)

We introduced Λ primarily to help MCMC converge more efficiently. Experiments with the shared λ revealed a multimodal posterior that tended to trap λ in high or low values. The introduction of vertex-specific λ_j variables yielded faster convergence with weaker modeling assumptions: an improvement in both respects.
We implicitly relax the model assumptions further via our inference procedure. For the sake of tractability, the original exact method of Hill et al. (2012) imposes a hard constraint on the in-degree of each vertex. In contrast, we use an MCMC inference strategy with no in-degree constraints.
In summary, our model departs from that of Hill et al. (2012) in three important respects. It permits real-valued prior confidences C, introduces vertex-specific inverse temperature variables Λ, and places no constraints on vertices' in-degrees. See the full model in Figure 1 and Appendix A for additional details.

2.2 MCMC inference

Our method uses MCMC to infer posterior edge existence probabilities. As described in Section 2.1, our model contains two classes of unobserved random variables: (i) the edge existence variables Z and (ii) the inverse temperature variables Λ. For each step of MCMC, we loop through these variables and update them in a Metropolis-Hastings fashion.

Main loop.
At a high level, our MCMC procedure consists of a loop over the graph vertices, V. For each vertex j, we update its inverse temperature variable λ_j and then update its parent set pa_G(j). All of these updates are Metropolis-Hastings steps; the proposal distributions are described below. Each completion of this loop yields one iteration of the Markov chain.

Proposal distributions.
For the inverse temperature variables we use a symmetric Gaussian proposal: λ′_j ∼ N(λ_j, ξ). In practice the method is insensitive to ξ; we typically set ξ = 3.
The parent set proposal distribution is more complicated. There are two principles at work when we design a graph proposal distribution: (i) the proposal should efficiently traverse the space of directed graphs, and (ii) it should favor graphs with higher posterior probability. The most widely used graph proposal distribution selects a neighboring graph uniformly from the set of possible "add-edge," "remove-edge," and "reverse-edge" actions (Werhli and Husmeier, 2007; Grzegorczyk, 2010). We'll refer to this traditional proposal distribution as the uniform graph proposal.

Fig. 1. Our generative model. (top) Plate notation. DBN parameters β_j and σ_j have been marginalized out. (bottom) Full probabilistic specification:

    λ_j ∼ Uniform(λ_min, λ_max)                                        ∀ j ∈ {1 . . . |V|}
    z_ij | c_ij, λ_j ∼ Bernoulli( e^{−λ_j} / (e^{−c_ij λ_j} + e^{−λ_j}) )    ∀ i, j ∈ {1 . . . |V|}
    P(σ_j) ∝ 1/σ_j                                                     ∀ j ∈ {1 . . . |V|}
    β_j | σ_j ∼ N( 0, T σ_j² (B_jᵀ B_j)^{−1} )                          ∀ j ∈ {1 . . . |V|}
    X₊,j | B_j, β_j, σ_j ∼ N( B_j β_j, σ_j² I )                         ∀ j ∈ {1 . . . |V|}

We usually set λ_min to a small positive value and λ_max = 15. If λ_min is too small, Markov chains will occasionally be initialized with very large numbers of edges, causing computational issues. The method is insensitive to λ_max as long as it's sufficiently large. Notice the improper prior 1/σ_j. In this specification B_j denotes X₋,pa_Z(j); that is, the parents of vertex j depend on the edge existence variables Z.

In our setting, we expect sparse graphs to be much more probable than dense ones: notice how the marginal likelihood function (Equation 2) strongly penalizes |pa_G(j)|. However, the uniform graph proposal exhibits a preference toward dense graphs. It proposes "add-edge" actions too often. This motivates us to design a new proposal distribution tailored for sparse graphs, one that operates on our sparse parent set graph representation.
For a given graph vertex j ∈ V, the parent set proposal distribution updates pa_G(j) by choosing from the following actions:

• add-parent. Select one of vertex j's non-parents uniformly at random, and add it to pa_G(j).
• remove-parent. Select one of vertex j's parents uniformly at random, and remove it from pa_G(j).
• swap-parent. A simultaneous application of add-parent and remove-parent. Perhaps surprisingly, this action is not made redundant by the other two. It plays an important role by yielding updates that maintain the size of the parent set. Because the marginal likelihood (Equation 2) changes steeply with |pa_G(j)|, Metropolis-Hastings acceptance probabilities will be higher for actions that keep |pa_G(j)| constant.

These three actions are sufficient to explore the space of directed graphs, but we need another mechanism to bias the exploration toward sparse graphs.
We introduce this preference via the probability assigned to each action. Intuitively, we craft the action probabilities so that when |pa_G(j)| is too small, add-parent moves are most probable. When |pa_G(j)| is too big, remove-parent moves are most probable. When |pa_G(j)| is about right, all moves are equally probable.
We formulate the action probabilities for vertex j as follows. As a shorthand, let s_j = |pa_G(j)| and define the reference size ŝ_j = Σ_{i=1}^{|V|} c_ij. That is, ŝ_j uses the prior edge confidences C to estimate an appropriate reference size for the parent set. Then, the action probabilities are

    p(add-parent | s_j, ŝ_j)    ∝ 1 − (s_j / |V|)^{γ(ŝ_j)}
    p(remove-parent | s_j, ŝ_j) ∝ (s_j / |V|)^{γ(ŝ_j)}
    p(swap-parent | s_j, ŝ_j)   ∝ 2 (s_j / |V|)^{γ(ŝ_j)} · ( 1 − (s_j / |V|)^{γ(ŝ_j)} )

where γ(ŝ_j) = 1 / log₂(|V| / ŝ_j). We use these functional forms only because they have certain useful properties: (i) when s_j = 0, the probability of add-parent is 1; (ii) when s_j = |V|, the probability of remove-parent is 1; and (iii) when s_j = ŝ_j, all actions have equal probability (Figure 2). Beyond that, these probabilities have no particular justification. We provide additional information about the parent set proposal in Appendix B.

Fig. 2. Action probabilities as a function of parent set size. The reference size ŝ is determined from prior knowledge. It approximates the size of a "typical" parent set. When s < ŝ, add-parent is most probable; when s > ŝ, remove-parent is most probable; and when s = ŝ, all actions have equal probability.

Recall that Metropolis-Hastings requires us to compute the reverse transition probability for any proposal we make. This could pose a challenge given our relatively complicated parent set proposal distribution. However, Gen provides a helpful interface for computing reverse probabilities. The user can provide an involution function that returns the reverse of a given action. Gen then manages the reverse probabilities without further intervention. This makes it relatively easy to implement Metropolis-Hastings updates with unusual proposal distributions.
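As a concrete illustration of the parent-set proposal, the following Python sketch (ours; the implementation itself is Julia/Gen) computes the three action probabilities in a form that satisfies properties (i)-(iii); Appendix B has the exact details used by SSPS.

```python
import math

def action_probs(s, s_hat, V):
    """Return [p_add, p_remove, p_swap] for a parent set of size s,
    given a reference size 0 < s_hat < V and |V| = V vertices."""
    gamma = 1.0 / math.log2(V / s_hat)
    x = (s / V) ** gamma          # 0 at s = 0, 1/2 at s = s_hat, 1 at s = |V|
    weights = [1.0 - x, x, 2.0 * x * (1.0 - x)]
    total = sum(weights)
    return [w / total for w in weights]
```

At s = ŝ all three unnormalized weights equal 1/2, so each action has probability 1/3; at the extremes s = 0 and s = |V|, only add-parent or only remove-parent is possible.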
Termination, convergence, and inference.
We follow the basic MCMC protocols described by Gelman et al. (2014). This entails running multiple (i.e., 4) Markov chains and discarding the first half of each chain as burn-in. In all of our analyses, we terminate each Markov chain when it either (i) reaches a length of 100,000 iterations or (ii) the execution time exceeds 12 hours. These termination conditions are arbitrary but emulate a real-world setting where it may be acceptable to let the method run overnight.
Upon termination, we assess convergence with two diagnostics: Potential Scale Reduction Factor (PSRF) and effective number of samples (N_eff). PSRF identifies cases where the Markov chains fail to mix or achieve stationarity. N_eff provides a sense of "sample size" for our inferred quantities. It adjusts the number of MCMC samples by accounting for autocorrelation in each chain. For our purposes, we say a quantity has failed to converge if its PSRF is too large or its N_eff is too small. Note that satisfying these criteria does not guarantee convergence. However, failure to satisfy them is a reliable flag for non-convergence.
Assuming a quantity hasn't failed to converge, we estimate it by simply taking its sample mean from all samples remaining after burn-in. In our setting we are primarily interested in edge existence probabilities; i.e., we compute the fraction of samples containing each edge.

Table 1. A coarse comparison of some noteworthy PPLs. Gen provides expressiveness but requires the user to implement an inference program for their model. Cont's vars: continuous variables; HMC: Hamiltonian Monte Carlo.

    PPL       Host language       Primary model class                         Primary inference method
    Stan      custom language     hierarchical, cont's vars                   Black-box HMC
    Edward2   Python/TensorFlow   "deep", cont's vars                         Black-box variational
    PyMC3     Python/Theano       "deep", cont's vars                         Black-box HMC
    Pyro      Python/PyTorch      "deep", cont's vars                         Black-box variational
    Gen       Julia               discrete and cont's vars; highly flexible   Customizable MCMC
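The convergence bookkeeping described above can be sketched in a few lines of Python (ours, for illustration): the Gelman-Rubin PSRF for one scalar quantity, and the posterior edge probability as the fraction of post-burn-in samples containing the edge.

```python
def psrf(chains):
    """Potential Scale Reduction Factor for one scalar quantity.
    chains: list of equal-length lists of post-burn-in samples."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)   # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m               # within-chain
    var_hat = (n - 1) / n * W + B / n
    return (var_hat / W) ** 0.5

def edge_probability(samples, i, j):
    """Fraction of sampled adjacency matrices containing edge (i, j)."""
    return sum(z[i][j] for z in samples) / len(samples)
```

Well-mixed chains give PSRF close to 1; chains stuck around different values give a PSRF well above it.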
2.3 Implementation

We implemented SSPS using the Gen PPL. We briefly describe the probabilistic programming methodology and its advantages in our setting.
Probabilistic programming.
Probabilistic programming is a methodology for building statistical models. It's based on the idea that statistical models are generative processes: sequences of operations on random variables. In probabilistic programming, we express the generative process as a program written in a PPL. This program is then compiled to produce a log-probability function, which can be used in inference tasks. Probabilistic programming systems typically provide a set of generic inference methods for performing those tasks, e.g., MCMC or Variational Bayes.
Compare this with a more traditional approach, where the user must (i) derive and implement the log-probability function and (ii) implement an inference method that operates on the log-probability function. This process of manual derivation and implementation is error-prone and requires a high degree of expertise from the user. In contrast, probabilistic programming only requires the user to express their model in a PPL. The probabilistic programming system manages other details.
Probabilistic programming also tends to promote good software engineering principles such as abstraction, modularity, and legibility. Most PPLs organize code into functions, which can be reused by multiple statistical models.
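To make the idea concrete, here is a toy example of our own (not Gen): a single Python function that both samples from a generative process and scores a given assignment, which is exactly the dual use a PPL derives automatically from one model definition.

```python
import math, random

def bernoulli_logpdf(x, p):
    return math.log(p if x == 1 else 1.0 - p)

def edge_model(trace=None):
    """Generative process: flip each of two candidate edges with prob 0.3.
    Called with trace=None it samples; called with a dict of values it
    returns (values, log-probability of that assignment)."""
    logp, out = 0.0, {}
    for name in ("z_01", "z_10"):
        if trace is None:
            out[name] = 1 if random.random() < 0.3 else 0
        else:
            out[name] = trace[name]
            logp += bernoulli_logpdf(trace[name], 0.3)
    return out if trace is None else (out, logp)
```

A PPL generalizes this pattern: the user writes only the generative program, and the system produces the log-probability function and generic inference routines.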
Probabilistic programming languages.

Several PPLs have emerged in recent years. Examples include Stan (Carpenter et al., 2017), Edward2 (Dillon et al., 2017), Pyro (Bingham et al., 2018), PyMC3 (Salvatier et al., 2016), and Gen (Cusumano-Towner et al., 2019). PPLs differ in how they balance expressive power and ease of use. For example, Stan makes it easy to build hierarchical statistical models with continuous variables but caters poorly to other model classes. At the other extreme, Gen can readily express a large class of models (discrete and continuous variables with complex relationships) but requires the user to design a customized inference procedure.
Implementation in Gen.

We use the Gen PPL precisely for its expressive power and customizable inference. While implementing SSPS, the customizability of Gen allowed us to begin with simple prototypes and then make successive improvements. For example, our model initially used a dense adjacency matrix representation for G, but subsequent optimizations led us to use a sparse parent set representation instead. Similarly, our MCMC method started with a naïve "add or remove edge" proposal distribution; we arrived at our sparse proposal distribution (Section 2.2) after multiple refinements. Other PPLs do not allow this level of control (Table 1).

2.4 Simulation study

We use a simulation study to answer important questions about SSPS: How does its computational expense grow with problem size? Is it able to correctly identify true edges? What is its sensitivity to errors in the prior knowledge? Simulations allow us to answer these questions in a controlled setting where we have access to ground truth.

Table 2. These parameters define the grid of simulated datasets in our simulation study. There are 3 × 4 × 4 = 48 distinct grid points. For each one, we generate K = 5 replicates for a total of 240 simulated datasets. The graph corruption parameters, r and a, range from very little error (0.1) to total corruption (1.0).

    Parameter   Meaning                              Values
    |V|         Number of variables                  40, 100, 200
    T           Time course length                   8
    M           Number of time courses               4
    r           Fraction of original edges removed   0.1, 0.5, 0.75, 1.0
    a           Fraction of spurious edges added     0.1, 0.5, 0.75, 1.0
Data simulation process.
We generate each simulated dataset as follows:

1. Sample a random adjacency matrix A ∈ {0, 1}^{|V| × |V|}, where each entry is the outcome of a Bernoulli(p) trial. A specifies the structure of a DBN. We choose p = 5/|V| so that each vertex has an average of 5 parents. This approximates the sparsity we might see in signaling pathways. We denote the size of the original edge set as |E|.
2. Let the weights β for this DBN be drawn from a normal distribution N(0, 1/√|V|). We noticed empirically that the 1/√|V| scale prevented the simulated time series from diverging to infinity.
3. Use the DBN defined by A, β to simulate M time courses of length T. We imitate the real datasets in Section 2.5 by generating M = 4 time courses, each of length T = 8.
4. Corrupt the adjacency matrix A in two steps: (i) remove r · |E| of the edges from A; (ii) add a · |E| spurious edges to the adjacency matrix. This corrupted graph simulates the imperfect prior knowledge encountered in reality. The parameters r and a control the "false negatives" and "false positives" in the prior knowledge, respectively.

We use a range of values for parameters |V|, r, and a, yielding the grid of simulations summarized in Table 2. See Appendix C and Figure 6 for additional details.

Performance metrics.
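The four-step recipe above can be sketched in NumPy as follows; this is our illustration (the function name and the 0.1 noise scale are our assumptions), not the Julia evaluation pipeline.

```python
import numpy as np

def simulate_dataset(V, T, M, r, a, rng):
    A = (rng.random((V, V)) < 5.0 / V).astype(float)        # step 1: adjacency
    beta = rng.normal(0.0, 1.0 / np.sqrt(V), size=(V, V))   # step 2: weights
    W = A * beta
    courses = []
    for _ in range(M):                                      # step 3: rollout
        X = np.zeros((T, V))
        X[0] = rng.normal(size=V)
        for t in range(1, T):
            X[t] = X[t - 1] @ W + 0.1 * rng.normal(size=V)
        courses.append(X)
    edges = list(zip(*np.nonzero(A)))                       # step 4: corruption
    non_edges = list(zip(*np.nonzero(A == 0)))
    E = len(edges)
    A_prior = A.copy()
    for idx in rng.choice(len(edges), size=int(r * E), replace=False):
        A_prior[edges[idx]] = 0.0                           # false negatives
    for idx in rng.choice(len(non_edges), size=int(a * E), replace=False):
        A_prior[non_edges[idx]] = 1.0                       # false positives
    return A, A_prior, courses
```

The returned pair (A, A_prior) gives the ground-truth graph and the corrupted prior knowledge handed to each inference method.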
We are primarily interested in SSPS's ability to correctly recover the structure of the underlying signaling pathway. The simulation study allows us to measure this in a setting where we have access to ground truth. We treat this as a probabilistic binary classification task, where the method assigns an existence confidence to each possible edge. We measure classification performance using area under the precision-recall curve (AUCPR). We use average precision to estimate AUCPR, as opposed to the trapezoidal rule (which tends to be overly optimistic; see Davis and Goadrich (2006); Flach and Kull (2015)).
Our decision to use AUCPR is motivated by the sparseness of the graphs. For sparse graphs the number of edges grows linearly with |V| while the number of possible edges grows quadratically. Hence, as |V| grows, the proportion of positive instances decreases and the classification task increasingly becomes a "needle-in-a-haystack" scenario.
Performance measurements on simulated data come with many caveats. It's most instructive to think of simulated performance as a sanity check. Since our data simulator closely follows our modeling assumptions, poor performance would suggest serious shortcomings in our method.

2.5 HPN-DREAM challenge evaluation

We measure SSPS's performance on experimental data by following the evaluation outlined by the HPN-DREAM Breast Cancer Network Inference Challenge (Hill et al., 2016). Signaling pathways differ across contexts, e.g., cell type and environmental conditions. The challenge is to infer these context-specific signaling pathways from time course data.
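The average-precision estimate of AUCPR used in the simulation study above can be sketched in pure Python (our illustration; it assumes untied scores for simplicity):

```python
def average_precision(scores, labels):
    """AP = mean, over the true edges, of the precision at each edge's rank.
    scores: predicted edge confidences; labels: 1 for true edges, else 0."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    n_pos = sum(labels)
    tp, ap = 0, 0.0
    for rank, k in enumerate(order, start=1):
        if labels[k] == 1:
            tp += 1
            ap += tp / rank    # precision at this recall step
    return ap / n_pos
```

Unlike trapezoidal integration of the precision-recall curve, this estimator does not interpolate between operating points, so it avoids the optimistic bias noted above.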
Dataset.
The HPN-DREAM challenge provides phosphorylation time course data from 32 biological contexts. These contexts arise from exposing 4 cell lines (BT20, BT549, MCF7, UACC812) to 8 stimuli. For each context, there are approximately M = 4 time courses, each about T = 7 time points in length. Cell lines have differing numbers of phosphosite measurements (i.e., differing |V|), ranging from 39 (MCF7) to 46 (BT20).

Prior knowledge.
Participants in the original challenge were free to extract prior knowledge from any existing data sources. As part of their analysis, the challenge organizers combined participants' prior graphs into a set of edge probabilities. These aggregate priors summarize the participants' collective knowledge. They were not available to participants in the original challenge, but we use them in our analyses of HPN-DREAM data. We provide them to each of the baseline methods (see Section 2.6), so the resulting performance comparisons are fair. We do not compare any of our scores to those listed by Hill et al. (2016) in the original challenge results.
Performance metrics.
The HPN-DREAM challenge aims to score methods by their ability to capture causal relationships between pairs of variables. It estimates this by comparing predicted descendant sets against experimentally generated descendant sets. More specifically, the challenge organizers exposed cells to AZD8055, an mTOR inhibitor, and observed the effects on other phosphosites. From this they determined a set of phosphosites downstream of mTOR in the signaling pathway. These include direct substrates of the mTOR kinase as well as indirect targets.
Comparing predicted descendants of mTOR against experimentally generated descendants of mTOR gives us a notion of false positives and false negatives. As we vary a threshold on edge probabilities, the predicted mTOR descendants change, which allows us to make a receiver operating characteristic (ROC) curve. We calculate the resulting area under the ROC curve (AUCROC) with the trapezoidal rule to score methods' performance on the HPN-DREAM challenge. Hill et al. (2016) provide more details for this descendant set AUCROC scoring metric. AUCROC is sensible for this setting since each descendant set contains a large fraction of the variables. Sparsity is not an issue.
Because SSPS is stochastic we run it K = 5 times per context, yielding 5 AUCROC scores per context. Meanwhile the baseline methods are all deterministic, requiring only one execution per context. We use a simple terminology to compare SSPS's scores against those of other methods. In a given context, we say SSPS dominates another method if its minimum score exceeds that of the other method. Conversely, we say the other method dominates SSPS if its score exceeds SSPS's maximum score. This dominance comparison has flaws, e.g., its results depend on the sample size K. However, it errs on the side of strictness and suffices as an aid for summarizing the HPN-DREAM evaluation results.

2.6 Baseline methods

Our evaluations compare SSPS against a diverse set of baseline methods.
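The descendant-set construction at the heart of this scoring can be sketched as follows (our illustration): threshold the predicted edge probabilities, then collect every vertex reachable from the inhibited node. Sweeping the threshold traces out the ROC curve.

```python
from collections import deque

def descendants(edge_probs, source, threshold):
    """Vertices reachable from `source` via edges with probability >= threshold.
    edge_probs: dict mapping (i, j) -> predicted edge probability."""
    adj = {}
    for (i, j), p in edge_probs.items():
        if p >= threshold:
            adj.setdefault(i, []).append(j)
    seen, queue = set(), deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen
```

Comparing this predicted set with the experimentally derived mTOR descendants yields the true- and false-positive rates at each threshold.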
Exact DBN (Hill et al. , 2012).
This method was an early inspiration forSSPS and is most similar to SSPS. However, the exact DBN methodencounters unique practical issues when we run it on real or simulated data.The method’s computational expense increases rapidly with problem size | V | and becomes intractable unless the “max-indegree” parameter is setto a small value. For example, we found that the method used more than32GB of RAM on problems of size | V | =100 , unless max-indegree was Merrell and Gitter set ≤ . Furthermore, the exact DBN method only admits prior knowledgein the form of Boolean reference edges , rather than continuous-valuededge confidences. We overcame this by using a threshold to map edgeconfidences to 1 or 0. We chose a threshold of 0.25 for the HPN-DREAMchallenge evaluation because it yielded a reasonable number of prior edges.We ran Hill et al. ’s implementation using MATLAB 2018a. FunChisq (Zhang and Song, 2013).
This method is based on the notion that two variables X, Y have a causal relationship if there exists a functional dependence Y = f(X) between them. It detects these dependencies using a chi-square test against the "no functional dependence" null hypothesis. FunChisq was a strong competitor in the HPN-DREAM challenge, despite the fact that it uses no prior knowledge.

In order to use FunChisq, one must first discretize their time course data. We followed Zhang and Song's recommendation to use 1D k-means clustering for discretization. Detailed instructions are given in the HPN-DREAM challenge supplementary materials (Hill et al., 2016). We used the FunChisq (v2.4.9.1) and Ckmeans.1d.dp (v4.3.0) R packages.
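The discretize-then-test idea can be sketched with a stdlib-only toy. Here a naive 1D k-means and a plain Pearson chi-square statistic stand in for the actual Ckmeans.1d.dp and FunChisq packages, whose optimal clustering and functional chi-square statistic differ:

```python
# Stdlib-only sketch of FunChisq's pipeline: discretize each variable with 1D
# k-means, then test the resulting contingency table (illustrative only; the
# paper uses the FunChisq and Ckmeans.1d.dp R packages, and FunChisq's
# functional chi-square is not the plain Pearson statistic shown here).

def kmeans_1d(values, k, iters=50):
    """Naive 1D k-means (Lloyd's algorithm); returns a cluster label per value."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: (v - centers[c]) ** 2)
                  for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # leave empty clusters' centers unchanged
                centers[c] = sum(members) / len(members)
    return labels

def pearson_chi2(x_labels, y_labels, k):
    """Pearson chi-square statistic for a k x k contingency table."""
    n = len(x_labels)
    table = [[0] * k for _ in range(k)]
    for a, b in zip(x_labels, y_labels):
        table[a][b] += 1
    row = [sum(r) for r in table]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    stat = 0.0
    for i in range(k):
        for j in range(k):
            expected = row[i] * col[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat
```

A perfect functional dependence between the two discretized variables maximizes the statistic, while independent labelings drive it toward zero.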
LASSO.
We included a variant of LASSO regression as a simple baseline. It incorporates prior knowledge into the typical primal formulation:

$$\hat{\beta}_j = \operatorname*{argmin}_{\beta} \left\{ \left\| X_{+,j} - B_j \beta \right\|^2 + \alpha \sum_{i=1}^{|V|} e^{-c_{ij}} |\beta_i| \right\}$$

where c_ij is the prior confidence (either Boolean or real-valued) for edge (i, j). That is, the method uses penalty factors e^{-c_ij} to discourage edges with low prior confidence. The method selects the LASSO parameter α using the Bayesian Information Criterion described by Zou et al. (2007). We use GLMNet (Friedman et al., 2010) via the GLMNet.jl Julia wrapper (v0.4.2).
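A toy coordinate-descent version of this prior-weighted objective is easy to write down. This is our own minimal solver for illustration, not the GLMNet implementation used in the evaluation, and it omits the BIC-based selection of α:

```python
# Minimal coordinate-descent sketch of the prior-weighted LASSO baseline:
# coefficient i is penalized by alpha * exp(-c[i]), so high-confidence prior
# edges are shrunk less (hypothetical toy solver, not GLMNet).
import math

def weighted_lasso(X, y, c, alpha, iters=200):
    """argmin_b ||y - X b||^2 + alpha * sum_i exp(-c[i]) * |b_i|."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(iters):
        for j in range(p):
            # residual excluding feature j's contribution
            r = [y[i] - sum(X[i][k] * b[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n))
            norm = sum(X[i][j] ** 2 for i in range(n))
            if norm == 0.0:
                continue
            lam = alpha * math.exp(-c[j]) / 2.0  # soft-threshold level
            if rho > lam:
                b[j] = (rho - lam) / norm
            elif rho < -lam:
                b[j] = (rho + lam) / norm
            else:
                b[j] = 0.0  # penalty dominates: edge excluded
    return b
```

With a large prior confidence c[i], the effective penalty exp(-c[i]) is tiny, so the corresponding coefficient behaves almost like ordinary least squares, which is exactly the intended effect of the penalty factors.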
Prior knowledge baseline.
Our most straightforward baseline simply reports the prior edge probabilities, performing no inference at all. Ideally, a Bayesian method should do no worse than the prior; new time course data should only improve our knowledge of the true graph. In reality, this improvement is subject to caveats about data quality and model fit.
We provide the SSPS code, distributed under an MIT license, via GitHub (https://github.com/gitter-lab/ssps) and archive it on Zenodo (https://doi.org/10.5281/zenodo.3939287). It includes a Snakemake workflow (Koster and Rahmann, 2012) for our full evaluation pipeline, enabling the reader to reproduce our results. The code used in this manuscript corresponds to SSPS v0.1.1.
We describe evaluation results from the simulation study and HPN-DREAM network inference challenge. SSPS competes well against the baselines, with superior scalability to other DBN-based approaches.
We compare our method to the baselines listed in Section 2.6. We focus especially on the exact DBN method of Hill et al. (2012), as SSPS shares many modeling assumptions with it.
Computational expense.
Because SSPS uses MCMC, the user may allow it to run for an arbitrary amount of time. With this in mind, we summarize SSPS's timing with two numbers: (i) N/cpu-hr, the number of MCMC samples per CPU-hour; and (ii) N_eff/cpu-hr, the effective number of samples per CPU-hour. We also measure the memory footprint per Markov chain, subject to our termination conditions. We measured these numbers for each simulation in our grid (see Table 2).

|V|     N/cpu-hr    N_eff/cpu-hr    MB per chain
40      70000       400             500
100     9000        140             1200
200     3000        60              1000

Table 3. Computational expense of SSPS as a function of problem size |V|. N is the number of iterations completed by a Markov chain. N_eff accounts for burn-in and autocorrelation in the Markov chains, giving a more accurate sense of the method's progress. The last column gives the approximate memory footprint of each chain. The non-monotonic memory usage is an artifact of the chain termination conditions (N > 100,000 or runtime > 12 hours).

|V|     max indeg    "linear"    "full"
40      4            66s         210s
        5            770s        3900s
        6            6700s       TIMEOUT
        7            OOM         OOM
100     3            250s        520s
        4            OOM         OOM
200     2            53s         140s
        3            OOM         OOM
Table 4. Computational expense of the exact DBN method of Hill et al. (2012), measured in CPU-seconds, as a function of problem size |V| and various parameter settings. The method imposes an in-degree constraint on each vertex, shown in the "max indeg" column. The columns "linear" and "full" correspond to different regression modes, i.e., which interaction terms are included in the DBN's conditional probability distributions. "OOM" (Out Of Memory) indicates that the method exceeded a 32GB memory limit. "TIMEOUT" indicates that the method failed to complete within 12 hours.

Table 3 reports average values of N/cpu-hr, N_eff/cpu-hr, and memory footprint for each problem size. As we expect, N/cpu-hr and N_eff/cpu-hr both decrease approximately with the inverse of |V|. In contrast, the non-monotonic memory usage requires more explanation. It results from two causes: (i) our termination condition and (ii) the sparse data structures we use to store samples. On small problems (|V| = 40), the Markov chain terminates at a length of 100,000, well within the 12-hour limit. On larger problems (|V| = 100 or 200), the Markov chain terminates at the 12-hour timeout. This accounts for the 500MB gap between small and large problems. The decrease in memory usage between |V| = 100 and 200 results from our sparse representations for samples. Roughly speaking, the sparse format only stores changes in the variables. So the memory consumption of a Markov chain depends not only on |V|, but also on the acceptance rate of the Metropolis-Hastings proposals. The acceptance rate is smaller for |V| = 200, yielding a net decrease in memory usage.

Recall that SSPS differs from more traditional MCMC approaches by nature of its parent set proposal distribution, which is specially designed for sparse graphs (see Section 2.2). When we modify SSPS to instead use a naïve uniform graph proposal, we see a striking difference in sampling efficiency.
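For concreteness, the three parent-set moves can be sketched as follows. This is a simplified, hypothetical version of the proposal; the actual SSPS implementation is written in Julia with Gen and also tracks the proposal probabilities needed for the Metropolis-Hastings correction:

```python
# Sketch of the add-parent / remove-parent / swap-parent moves, acting directly
# on one vertex's parent set (hypothetical simplified version of the proposal).
import random

def propose(parents, vertices, vertex, rng=random):
    """Return a new parent set for `vertex` via one randomly chosen move."""
    parents = set(parents)
    non_parents = [u for u in vertices if u != vertex and u not in parents]
    moves = []
    if non_parents:
        moves.append("add")
    if parents:
        moves.append("remove")
    if parents and non_parents:
        moves.append("swap")
    move = rng.choice(moves)
    if move == "add":
        parents.add(rng.choice(non_parents))
    elif move == "remove":
        parents.remove(rng.choice(sorted(parents)))
    else:  # swap: drop one current parent, add one current non-parent
        parents.remove(rng.choice(sorted(parents)))
        parents.add(rng.choice(non_parents))
    return parents
```

Because every move changes the parent set by at most one element, the proposal stays within the sparse region of graph space that the prior favors.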
The uniform graph proposal distribution attains N_eff/cpu-hr of 100, 10, and 0.2 for |V| = 40, 100, and 200, respectively: drastically smaller than the values listed in Table 3 for the parent set proposal. It's possible that the traditional proposal could achieve higher N_eff/cpu-hr by simply running faster. However, the more important consideration is how N_eff/cpu-hr changes with |V|. Our parent set proposal distribution's N_eff/cpu-hr decays approximately like O(1/|V|). This is better than what we might expect from a simple analysis (Appendix B). Meanwhile, the traditional proposal distribution's N_eff/cpu-hr decays faster than O(1/|V|^3). This gap between O(1/|V|) and O(1/|V|^3) sampling efficiencies makes an enormous difference on large problems.

Fig. 3. Heatmap of AUCPR values from the simulation study. Both DBN-based techniques (SSPS and the exact method) score well on this, since the data is generated by a DBN. On large problems the exact DBN method needs strict in-degree constraints, leading to poor prediction quality. LASSO and FunChisq both perform relatively weakly. See Figure 7 for representative ROC and PR curves.

Table 4 summarizes the computational expense of the exact DBN method (Hill et al., 2012). The method quickly becomes impractical as the problem size grows, unless we enforce increasingly strict in-degree restrictions. In particular, the exact DBN method's memory cost grows exponentially with its "max in-degree" parameter. The growth becomes increasingly sharp with problem size. When |V| = 200, increasing the maximum in-degree from 2 to 3 makes the difference between terminating in under a minute and exceeding 32GB of memory. Such low bounds on in-degree are unrealistic, and will likely result in poor inference quality. In comparison, SSPS makes no constraints on in-degree, and its memory usage scales well with problem size.

The other baseline methods, FunChisq and LASSO, are much less computationally expensive. Both finish in seconds and require less than 100MB of memory for each simulated task. This highlights the computationally intense nature of Bayesian approaches. Not every scenario calls for Bayesian inference. However, Bayesian inference is valuable in scientific settings where we're concerned with uncertainty quantification.
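The decay rates quoted in this section can be sanity-checked by fitting a log-log slope to the reported N_eff/cpu-hr values (Table 3 for the parent set proposal, and the uniform-proposal figures above):

```python
# Log-log slope check on how N_eff per CPU-hour decays with |V|
# (values transcribed from Table 3 and the uniform-proposal numbers).
import math

def slope(xs, ys):
    """Least-squares slope of log(y) versus log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

sizes = [40, 100, 200]
parent_set = [400, 140, 60]   # Table 3, N_eff per CPU-hour
uniform = [100, 10, 0.2]      # uniform graph proposal, same metric

# slope(sizes, parent_set) is roughly -1.2, i.e. close to O(1/|V|);
# slope(sizes, uniform) is roughly -3.8, i.e. faster than O(1/|V|^3).
```

The fitted exponents line up with the claims in the text: the parent set proposal decays at roughly first order in |V|, while the uniform proposal's efficiency collapses at greater than third order.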
Fig. 4.
Heatmap of differential performance against the prior knowledge, measured by AUCPR paired t-statistics. SSPS consistently outperforms the prior knowledge across problem sizes and shows robustness to errors in the prior knowledge.

Predictive performance.
The simulation study provides a setting where we have access to "ground truth": the true simulated graph. We use AUCPR to score each method's ability to recover the true graph's edges.

Figure 3 shows the AUCPR scores for our grid of simulations. Each heat map shows AUCPR as a function of graph corruption parameters, r and a. The heat maps are arranged by method and problem size |V|. Each AUCPR value is an average over 5 replicates. SSPS maintains fairly consistent performance across problem sizes. In contrast, the other methods' scores decrease with problem size. For the exact DBN method, this is partially due to the small in-degree constraints imposed on the large problems. It is forced to trade model accuracy for tractability.

Figure 4 reveals further insights into these results. It plots differential performance with respect to the prior knowledge, in a layout analogous to Figure 3. Specifically, it plots the t-statistic of each method's AUCPR, paired with the prior baseline's AUCPR. Whenever the prior graph has some informative edges, SSPS outperforms the prior. On the other hand, SSPS's performance deteriorates whenever the prior contains no true edges (i.e., r = 1). Under those circumstances FunChisq may be a better choice. Since it doesn't rely on prior knowledge at all, it outperforms the other methods when the prior is totally corrupted. However, we expect that in most realistic settings there exists partially-accurate prior knowledge, in which case we expect FunChisq to perform worse than SSPS.

These results confirm SSPS's ability to identify the true network, given partially-accurate prior knowledge and time series data consistent with the modeling assumptions. SSPS is fairly robust with respect to the prior's quality and has consistent performance across different problem sizes.
We evaluated SSPS on experimental data from the HPN-DREAM challenge. The challenge includes time series phosphorylation data from 32 biological contexts: 8 stimuli applied to 4 breast cancer cell lines. Methods are scored on their ability to correctly identify the experimentally derived descendants of mTOR. Figure 5 shows bar charts comparing the methods' AUCROC scores in each context. Appendix D provides additional details.

SSPS performs satisfactorily on this task overall. Employing terminology from Section 2.5, SSPS dominates the exact DBN method in 18 of the 32 contexts, whereas the exact DBN method dominates SSPS in only 9 contexts. Meanwhile, SSPS dominates FunChisq in 11 contexts and is dominated by FunChisq in 15. This is not surprising because FunChisq was among the top competitors in the original challenge. LASSO, on the other hand, performs poorly. SSPS dominates LASSO in 18 contexts and is dominated in only 6.

More puzzling is the strong performance of the prior knowledge baseline. SSPS dominates the aggregate prior in only 9 contexts and is dominated in 21. This is not isolated to our method. FunChisq outperforms and is outperformed by the prior knowledge in 11 and 21 contexts, respectively. The aggregate prior's strong performance is consistent with the results from the original HPN-DREAM challenge; this prior outperformed all individual challenge submissions (Hill et al., 2016). Even though the aggregate prior gives identical predictions for each context and totally ignores the time course data, it still attains better performance than the other methods. This suggests either (i) the data is relatively uninformative or (ii) the evaluation metric based on mTOR's descendants isn't sufficiently precise to measure context-specific performance. We suspect the latter, because FunChisq uses no prior knowledge but was the top performer in the HPN-DREAM challenge's in silico tasks. An evaluation based on one node's descendants is not as discriminative as an evaluation of the directed edges. Many different directed graphs can have equivalent or similar mTOR descendants. However, it is experimentally impractical to generate the context-specific gold standard networks that would be required for a more precise edge-based evaluation.
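The dominance tallies reported above follow directly from the definition in Section 2.5; a minimal sketch (hypothetical function name and inputs):

```python
# Sketch of the dominance comparison: SSPS, with K scores per context,
# dominates a deterministic method when its minimum score exceeds the method's
# score, and is dominated when its maximum falls below it.

def dominance(ssps_scores, other_scores):
    """Count contexts where SSPS dominates, and where it is dominated."""
    wins = losses = 0
    for ctx, scores in ssps_scores.items():
        other = other_scores[ctx]
        if min(scores) > other:
            wins += 1       # SSPS's worst run still beats the other method
        elif max(scores) < other:
            losses += 1     # SSPS's best run still loses
    return wins, losses
```

Contexts where the ranges overlap count for neither side, which is why the per-method win and loss counts in the text need not sum to 32.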
We presented SSPS, a signaling pathway reconstruction technique based on DBN structure estimation. It uses MCMC to estimate the posterior probabilities of directed edges, employing a parent set proposal distribution specially designed for sparse graphs. SSPS is a Bayesian approach. It takes advantage of prior knowledge with edge-specific confidence scores and can provide uncertainty estimates on the predicted pathway relationships, which are valuable for prioritizing experimental validation.

SSPS scales to large problems more efficiently than past DBN-based techniques. On simulated data, SSPS yields superior edge predictions with robustness to flaws in the prior knowledge. Our HPN-DREAM evaluation shows SSPS performs comparably to established techniques on a community standard task. It is difficult to make stronger statements in the HPN-DREAM setting because the prior knowledge baseline performs so well and we can only evaluate the predicted mTOR descendants, not the entire pathway. However, SSPS's scalability among Bayesian methods, strong results in the simulation, and competitive performance in the HPN-DREAM challenge make it an attractive option for further investigation of real phosphorylation datasets.

There are several potential limitations of SSPS relative to alternative pathway signaling models. Prior knowledge is not available in some organisms or biological conditions, reducing one advantage of our Bayesian approach. Although SSPS is more scalable than related DBN techniques, it would struggle to scale to proteome-wide phosphoproteomic data measuring thousands of phosphosites. For large datasets, we recommend running SSPS on a pruned version that includes only the highest intensity or most variable phosphosites. SSPS, like most DBN techniques, models only observed variables. It will erroneously exclude important pathway members, such as scaffold proteins, that are not phosphorylated. Latent variable models or background network-based algorithms are better suited for including unphosphorylated proteins in the pathway. Background network methods can also impose global constraints on the predicted pathway structure, such as controlling the number of connected components or proteins' reachability from relevant receptors (Köksal et al., 2018).

There are many possible ways to improve SSPS. For example, it could be extended to jointly model related pathways in a hierarchical fashion, similar to Oates et al. (2014) and Hill et al. (2017). Alternatively, SSPS could be modified to accommodate causal assumptions via Pearl's intervention operators; see the model of Spencer et al. (2015) for a relevant example. Combining temporal and interventional data (Cardner et al., 2019) is another rich area for future work. On the algorithmic side, we could improve our MCMC procedure by adaptively tuning the parameters of its proposal distributions, as described by Gelman et al. (2014). Because SSPS is a probabilistic program, it is naturally extensible.
Acknowledgements
We thank UW-Madison's Biomedical Computing Group for generously providing compute resources; the teams developing Snakemake and HTCondor (Thain et al., 2005) for empowering us to use those resources effectively; the Gen team (Cusumano-Towner et al., 2019) for designing a uniquely powerful probabilistic programming language; and the HPN-DREAM challenge organizers for providing experimental data for our evaluation.
Funding
This work was funded by the National Institutes of Health (award T32LM012413) and National Science Foundation (award DBI 1553206).
References
Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P., Horsfall, P., and Goodman, N. D. (2018). Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research.
Budak, G., Ozsoy, O. E., Son, Y. A., Can, T., and Tuncbag, N. (2015). Reconstruction of the temporal signaling network in Salmonella-infected human cells. Frontiers in Microbiology, 730.
Cardner, M., Meyer-Schaller, N., Christofori, G., and Beerenwinkel, N. (2019). Inferring signalling dynamics by integrating interventional with observational data. Bioinformatics, (14), i577–i585.
Carlin, D. E., Paull, E. O., Graim, K., Wong, C. K., Bivol, A., Ryabinin, P., Ellrott, K., Sokolov, A., and Stuart, J. M. (2017). Prophetic granger causality to infer gene regulatory networks. PLOS ONE, (12), e0170340.
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, (1).
Cheong, R., Rhee, A., Wang, C. J., Nemenman, I., and Levchenko, A. (2011). Information Transduction Capacity of Noisy Biochemical Signaling Networks. Science, (6054), 354–358.
Fig. 5.
Methods' performances across contexts in the HPN-DREAM Challenge. MCMC is stochastic, so we run SSPS 5 times; the error bars show the range of AUCROC scores. The other methods are all deterministic and require no error bars. See Figure 8 for example predicted networks, Figure 9 for AUCPR scores, and Figure 10 for representative ROC and PR curves.
Cusumano-Towner, M. F., Saad, F. A., Lew, A. K., and Mansinghka, V. K. (2019). Gen: A general-purpose probabilistic programming system with programmable inference. In Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2019, New York, NY, USA. ACM.
Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning.
Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., and Saurous, R. A. (2017). Tensorflow distributions. arXiv, page arXiv:1711.10604.
Drake, J. M., Paull, E. O., Graham, N. A., Lee, J. K., Smith, B. A., Titz, B., Stoyanova, T., Faltermeier, C. M., Uzunangelov, V., Carlin, D. E., Fleming, D. T., Wong, C. K., Newton, Y., Sudha, S., Vashisht, A. A., Huang, J., Wohlschlegel, J. A., Graeber, T. G., Witte, O. N., and Stuart, J. M. (2016). Phosphoproteome Integration Reveals Patient-Specific Networks in Prostate Cancer. Cell, (4), 1041–1054.
Eduati, F., Jaaks, P., Wappler, J., Cramer, T., Merten, C. A., Garnett, M. J., and Saez-Rodriguez, J. (2020). Patient-specific logic models of signaling pathways from screenings on cancer biopsies to prioritize personalized combination therapies. Molecular Systems Biology, (2), e8664.
Eker, S., Knapp, M., Laderoute, K., Lincoln, P., Meseguer, J., and Sonmez, K. (2002). Pathway logic: symbolic analysis of biological signaling. Pacific Symposium on Biocomputing, pages 400–412.
Flach, P. A. and Kull, M. (2015). Precision-recall-gain curves: PR analysis done right. In Advances in Neural Information Processing Systems.
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2014). Bayesian Data Analysis. CRC Press, 3rd edition.
Gjerga, E., Trairatphisan, P., Gabor, A., Koch, H., Chevalier, C., Ceccarelli, F., Dugourd, A., Mitsos, A., and Saez-Rodriguez, J. (2020). Converting networks to predictive logic models from perturbation signalling data with CellNOpt. bioRxiv, page 2020.03.04.976852.
Gregorczyk, M. (2010). An introduction to Gaussian Bayesian networks. In Q. Yan, editor, Systems Biology in Drug Discovery and Development: Methods and Protocols, chapter 6, pages 121–147. Springer.
Guziolowski, C., Videla, S., Eduati, F., Thiele, S., Cokelaer, T., Siegel, A., and Saez-Rodriguez, J. (2013). Exhaustively characterizing feasible logic models of a signaling network using Answer Set Programming. Bioinformatics, (18), 2320–2326.
Halasz, M., Kholodenko, B. N., Kolch, W., and Santra, T. (2016). Integrating network reconstruction with mechanistic modeling to predict cancer therapies. Science Signaling, (455), ra114.
Henriques, D., Villaverde, A. F., Rocha, M., Saez-Rodriguez, J., and Banga, J. R. (2017). Data-driven reverse engineering of signaling pathways using ensembles of dynamic models. PLOS Computational Biology, (2), e1005379.
Hill, S. M., Lu, Y., Molina, J., Heiser, L. M., Spellman, P. T., Speed, T. P., Gray, J. W., Mills, G. B., and Mukherjee, S. (2012). Bayesian inference of signaling network topology in a cancer cell line. Bioinformatics, 2804–2810.
Hill, S. M., Heiser, L. M., Cokelaer, T., Unger, M., Nesser, N. K., Carlin, D. E., Zhang, Y., Sokolov, A., Paull, E. O., Wong, C. K., Graim, K., Bivol, A., Wang, H., Zhu, F., Afsari, B., Danilova, L. V., Favorov, A. V., Lee, W. S., Taylor, D., Hu, C. W., Long, B. L., Noren, D. P., Bisberg, A. J., HPN-DREAM Consortium, Gray, J. W., Kellen, M., Norman, T., Friend, S., Qutub, A. A., Fertig, E. J., Guan, Y., Song, M., Stuart, J. M., Spellman, P. T., Koeppl, H., Stovolitzky, G., Saez-Rodriguez, J., and Mukherjee, S. (2016). Inferring causal molecular networks: empirical assessment through a community-based effort. Nature Methods.
Hill, S. M., Nesser, N. K., Johnson-Camacho, K., Jeffress, M., Johnson, A., Boniface, C., Spencer, S. E., Lu, Y., Heiser, L. M., Lawrence, Y., Pande, N. T., Korkola, J. E., Gray, J. W., Mills, G. B., Mukherjee, S., and Spellman, P. T. (2017). Context specificity in causal signaling networks revealed by phosphoprotein profiling. Cell Systems, pages 73–83.
Hunter, T. (2009). Tyrosine phosphorylation: thirty years and counting. Current Opinion in Cell Biology, (2), 140–146.
Kholodenko, B., Yaffe, M. B., and Kolch, W. (2012). Computational Approaches for Analyzing Information Flow in Biological Networks. Science Signaling, (220), re1.
Kholodenko, B. N., Hancock, J. F., and Kolch, W. (2010). Signalling ballet in space and time. Nature Reviews Molecular Cell Biology, (6), 414–426.
Köksal, A. S., Beck, K., Cronin, D. R., McKenna, A., Camp, N. D., Srivastava, S., MacGilvray, M. E., Bodik, R., Wolf-Yadlin, A., Fraenkel, E., Fisher, J., and Gitter, A. (2018). Synthesizing signaling pathways from temporal phosphoproteomic data. Cell Reports, 3607–3618.
Koster, J. and Rahmann, S. (2012). Snakemake--a scalable bioinformatics workflow engine. Bioinformatics, (19), 2520–2522.
Krishnaswamy, S., Spitzer, M. H., Mingueneau, M., Bendall, S. C., Litvin, O., Stone, E., Pe'er, D., and Nolan, G. P. (2014). Conditional density-based analysis of T cell signaling in single-cell data. Science, (6213), 1250689.
Molinelli, E. J., Korkut, A., Wang, W., Miller, M. L., Gauthier, N. P., Jing, X., Kaushik, P., He, Q., Mills, G., Solit, D. B., Pratilas, C. A., Weigt, M., Braunstein, A., Pagnani, A., Zecchina, R., and Sander, C. (2013). Perturbation biology: inferring signaling networks in cellular systems. PLOS Computational Biology.
Newman, R. H., Zhang, J., and Zhu, H. (2014). Toward a systems-level view of dynamic phosphorylation networks. Frontiers in Genetics, 263.
Norman, U. and Cicek, A. E. (2019). ST-Steiner: a spatio-temporal gene discovery algorithm. Bioinformatics, (18), 3433–3440.
Oates, C. J., Korkola, J., Gray, J. W., and Mukherjee, S. (2014). Joint estimation of multiple related biological networks. The Annals of Applied Statistics, 1892–1919.
Patil, A., Kumagai, Y., Liang, K.-c., Suzuki, Y., and Nakai, K. (2013). Linking Transcriptional Changes over Time in Stimulated Dendritic Cells to Identify Gene Networks Activated during the Innate Immune Response. PLOS Computational Biology, (11), e1003323.
Pawson, T. and Warner, N. (2007). Oncogenic re-wiring of cellular signaling pathways. Oncogene, (9), 1268–1275.
Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D. A., and Nolan, G. P. (2005). Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science, (5721), 523–529.
Salvatier, J., Wiecki, T. V., and Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, e55.
Schoeberl, B., Eichler-Jonsson, C., Gilles, E. D., and Müller, G. (2002). Computational modeling of the dynamics of the MAP kinase cascade activated by surface and internalized EGF receptors. Nature Biotechnology, (4), 370–375.
Shojaie, A. and Michailidis, G. (2010). Discovering graphical Granger causality using the truncating lasso penalty. Bioinformatics, (18), i517–i523.
Spencer, S. E., Hill, S. M., and Mukherjee, S. (2015). Inferring network structure from interventional time-course experiments. The Annals of Applied Statistics, 507–524.
Thain, D., Tannenbaum, T., and Livny, M. (2005). Distributed computing in practice: the Condor experience. Concurrency - Practice and Experience, (2-4), 323–356.
Werhli, A. V. and Husmeier, D. (2007). Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Statistical Applications in Genetics and Molecular Biology.
Zhang, Y. and Song, M. (2013). Deciphering interactions in causal networks without parametric assumptions. arXiv, page arXiv:1311.2707.
Zou, H., Hastie, T., and Tibshirani, R. (2007). On the "degrees of freedom" of the lasso. The Annals of Statistics, 2173–2192.

Appendix
A Model formulation details
We provide additional information about our graph prior and marginal likelihood function. We also describe some implications of SSPS's model assumptions.
Derivation of graph prior (Equation 4).
We step through a more detailed derivation of SSPS's new graph prior. We begin with the original graph prior (Equation 1) and rewrite it in terms of the edge existence variables Z:

$$P(G \mid G', \lambda) \propto \exp\left(-\lambda \left| E(G) \setminus E(G') \right|\right) = \exp\left(-\lambda \sum_{(i,j) \notin E(G')} z_{ij}\right) \quad (5)$$

$$= \prod_{(i,j) \notin E(G')} e^{-\lambda z_{ij}} = \prod_{(i,j) \notin E(G')} \left(e^{-\lambda}\right)^{z_{ij}}$$

$$\propto \left(\frac{1}{1 + e^{-\lambda}}\right)^{|V|^2 - |E(G')|} \cdot \prod_{(i,j) \notin E(G')} \left(e^{-\lambda}\right)^{z_{ij}} = \prod_{(i,j) \notin E(G')} \left(\frac{1}{1 + e^{-\lambda}}\right) \left(e^{-\lambda}\right)^{z_{ij}}$$

$$= \prod_{(i,j) \notin E(G')} \left(\frac{1}{1 + e^{-\lambda}}\right)^{1 - z_{ij}} \left(\frac{e^{-\lambda}}{1 + e^{-\lambda}}\right)^{z_{ij}} \quad (6)$$

Equation 6 shows the original prior is in fact a product of independent Bernoulli variables: the edge existence variables z_ij. Equation 6 explicitly assigns probability to the edges not contained in E(G'). However, it also implicitly assigns uniform probability to every edge contained in E(G'). We deduce that they are Bernoulli(0.5) variables, allowing us to write the prior P(Z | G', λ) in the following form:

$$\prod_{(i,j) \in E(G')} \left(\frac{1}{2}\right)^{z_{ij}} \left(\frac{1}{2}\right)^{1 - z_{ij}} \prod_{(i,j) \notin E(G')} \left(\frac{e^{-\lambda}}{1 + e^{-\lambda}}\right)^{z_{ij}} \left(\frac{1}{1 + e^{-\lambda}}\right)^{1 - z_{ij}} \quad (7)$$

just as shown in the main text.

Now we modify the prior to use continuous-valued edge confidences c_ij instead of Boolean reference edges E(G'). Intuitively, we want to restate Equation 7 as a single product over all Z variables, rather than two separate products. Our goal is to find a function q(c_ij) such that

$$P(Z \mid C, \lambda) = \prod_{(i,j)} q(c_{ij})^{z_{ij}} \left(1 - q(c_{ij})\right)^{1 - z_{ij}}.$$

However, in order to remain consistent with the original prior, q(c_ij) ought to be monotone-increasing and satisfy these criteria: q(0) = e^{-λ}/(1 + e^{-λ}) and q(1) = 1/2. It turns out that choosing

$$q(c_{ij}) = \frac{e^{-\lambda}}{e^{-c_{ij}\lambda} + e^{-\lambda}}$$

satisfies these requirements. This brings us to Equation 3 of the main text. From there, it is straightforward to replace the single shared λ variable with a set of vertex-specific Λ variables and arrive at Equation 4.

Marginal likelihood function details.
Equation 2 is obtained by (i) using a Gaussian DBN as the likelihood function for G, (ii) assuming certain prior distributions for the DBN parameters, and (iii) integrating the DBN parameters out. Specifically, let β_j and σ_j for all j ∈ {1, ..., |V|} be the DBN's weight and noise parameters, respectively. We assume an improper prior P(σ_j) ∝ 1/σ_j for the noise and a Gaussian prior for the weights:

$$\beta_j \mid \sigma_j \sim \mathcal{N}\left(0,\; T\sigma_j^2 \left(B_j^\top B_j\right)^{-1}\right).$$

In other words, SSPS uses an improper joint prior P(β_j, σ_j) = P(β_j | σ_j) P(σ_j) with P(σ_j) ∝ 1/σ_j. This choice allows β_j and σ_j to be marginalized, yielding Equation 2.

The power -|pa_G(j)|/2 in Equation 2 is correct when the DBN only uses linear terms. Recall that B_j may in general contain columns of nonlinear interactions between parent variables. When that is true, the quantity |pa_G(j)| should be replaced by the width of B_j. We elide this detail in the main text for brevity. Our implementation uses the correct exponent.

Our implementation of the marginal likelihood function employs least-recently-used caching to reduce redundant computation. Code profiling shows that this yields a substantial improvement to efficiency. For additional in-depth discussion of Equation 2, we recommend the supplementary materials of Hill et al. (2012).

Additional insights about SSPS's model assumptions.
SSPS's model has interesting properties that could lead to method improvements. For example, when we replace the shared λ variable with vertex-specific Λ variables, the model effectively becomes a set of |V| independent models. The plate notation in Figure 1 makes this clear; X⁻ is the only shared variable, and it's fully observed. This has algorithmic implications. For example, future versions of SSPS could parallelize inference at the vertex level, allocating more resources to the parent sets that converge slowly.

In the course of deriving Equation 6, we showed that our prior is a log-linear model over edge features. Equation 5 shows this most clearly. Future versions of SSPS could use the expressiveness of log-linear densities over higher-order graph features to capture richer forms of prior knowledge.

B Parent set proposal details
A key component of SSPS is its novel parent set proposal distribution. We motivate its design and discuss its computational complexity in greater detail.
Parent sets instead of edges.
The marginal likelihood (Equation 2) is a function of the graph G. However, it depends on G only via its parent sets, which are encoded in the matrices B_j. Accordingly, SSPS represents G by storing a list of parents for each vertex.

It makes sense to use a proposal distribution that operates directly on SSPS's internal parent set representation. This motivates our choice of the add-parent, remove-parent, and swap-parent proposals listed in Section 2.2. There is a natural correspondence between (i) likelihood function, (ii) data structure, and (iii) proposal distribution.

Sampling efficiency.
We provide some intuition for the parent set proposal's superior sampling efficiency. Let z_ij be a particular edge existence variable. The estimate for z_ij converges quickly if MCMC updates z_ij frequently. Hence, as a proxy for sampling efficiency, consider the number of times z_ij gets updated per unit time. We decompose this quantity into three factors:

$$\frac{z_{ij}\ \text{updates}}{\text{unit time}} = \epsilon \cdot \tau \cdot \alpha$$

where

$$\epsilon = \frac{\text{graph proposals}}{\text{unit time}}, \qquad \tau = \frac{z_{ij}\ \text{proposals}}{\text{graph proposal}}, \qquad \alpha = z_{ij}\ \text{acceptance probability}.$$

In other words, ε is the time efficiency of the proposal distribution. The factor τ is the probability that a given proposal touches z_ij. Lastly, α is the proposal's Metropolis-Hastings acceptance probability.

For a given proposal distribution, we're interested in how these factors depend on |V|. For simplicity of analysis, assume the Markov chain is in a typical state where the graph is sparse: |E(G)| = O(|V|).

For the parent set proposal, execution time has no dependence on |V| and hence ε = O(1). Recall that the parent set proposal resides in an outer loop, which iterates through all |V| vertices. It follows that for any particular proposal there is a 1/|V| chance that it acts on vertex j. After choosing vertex j, there is on average a O(1/|V|) chance that the proposal affects z_ij. This follows from the sparsity of the graph: vertex i is typically a non-parent of j, and the probability of choosing it via an add-parent or swap-parent action is O(1/|V|). Hence, the parent set proposal has a probability τ = O(1/|V|²) of choosing z_ij. Lastly, the acceptance probability α has no dependence on |V| and therefore α = O(1). The product of these factors gives an overall sampling efficiency of O(1/|V|²) for the parent set proposal.

For the uniform graph proposal, ε's complexity depends on the particular implementation. For sake of generosity we assume an efficient implementation with ε = O(1).
The proposal chooses uniformly from O(|V|²) possible actions: add-, remove-, or reverse-edge. The probability of choosing one that affects z_ij is τ = O(1/|V|²). Recall that the marginal likelihood decreases steeply with parent set size. It follows that add-edge actions will typically have low acceptance probability. Since the graph is sparse, add-edge actions are overwhelmingly probable; the probability of not landing on one is O(1/|V|). If we assume the acceptance probability is high for remove-edge and reverse-edge actions (i.e., they are accepted whenever they are proposed), then this suggests α = O(1/|V|), averaged over many proposals. The product of these factors suggests a sampling efficiency that decays like O(1/|V|³).

This gap between O(1/|V|²) and O(1/|V|³) sampling efficiencies explains most of the difference that we saw in Section 3.1. A more detailed analysis may reveal why the parent set proposal attains sampling efficiencies closer to O(1/|V|) in practice.

C Simulation study details
We give additional details about the simulation study's methodology and results.
Simulation process.
The simulation process described in Section 2.4 differs from SSPS's modeling assumptions in several ways. Recall that the simulator constructs a DBN to generate time series data. This simulated DBN employs nonlinear interaction terms: the data at each timepoint is a cubic function of the data at the previous timestep. In contrast, all of our analyses ran SSPS with an assumption of linear dependencies. In other words, the data contained complexities that SSPS was unable to capture. SSPS's performance in the simulation study suggests that it has some robustness to mismatched modeling assumptions. We provide an illustration of the simulation process in Figure 6.

It is interesting to note that the simulated networks do not resemble directed acyclic graphs (DAGs) in any way; they have no sense of directionality. Contrast this with the biological graphs shown in Figure 8. Strictly speaking these are not DAGs, but they do have an overall direction: some vertices are source-like, and others are sink-like. Future simulations and models could be more biologically realistic if they incorporated this kind of structure.
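For concreteness, here is a minimal sketch of a simulator in this spirit, where each timepoint is a cubic function of the previous timepoint's values plus Gaussian noise. It is written in Python for illustration (the actual simulator belongs to the SSPS codebase, in Julia), and the function name, weight scheme, and noise level are all hypothetical.

```python
import numpy as np

def simulate(parents, weights, n_steps, noise_sd=0.1, seed=0):
    """parents[j]: list of parent indices for vertex j.
    weights[j]: hypothetical scalar weight on vertex j's cubic term."""
    rng = np.random.default_rng(seed)
    n = len(parents)
    data = np.zeros((n_steps, n))
    data[0] = rng.normal(size=n)  # random initial state
    for t in range(1, n_steps):
        for j in range(n):
            x = data[t - 1, parents[j]]  # parent values at the previous step
            # cubic dependence on the parents, plus Gaussian noise
            data[t, j] = weights[j] * np.sum(x) ** 3 + rng.normal(scale=noise_sd)
    return data
```

A vertex with no parents simply receives noise at each step, which keeps the sketch self-contained for arbitrary graphs.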
Simulation study results.
Figure 7 gives some representative ROC and PR curves from the simulation study. On problem sizes up to |V| = 100, SSPS and the exact DBN method yield similar curves in both ROC and
Fig. 6.
A schematic of the simulation study. We randomly generate a true network (upper left) and use it to simulate a time series dataset (upper right). We corrupt the true network by adding and removing edges (lower left); solid red edges have been added, dashed red edges have been removed, and black edges are original. This corrupted network serves as partially inaccurate prior knowledge for the inference techniques. Each technique produces a predicted network (lower right) by assigning a score to each possible edge. The predicted network is evaluated with respect to the true network.
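The corruption step in this schematic can be sketched as follows. Here r is the fraction of true edges removed and a controls the number of spurious edges added; the parameter names follow the r and a of the simulation captions, but the exact procedure shown is an illustrative assumption, not the SSPS implementation.

```python
import random

def corrupt(edges, n_vertices, r, a, seed=0):
    """Remove each true edge with probability r, then add
    round(a * |edges|) spurious directed edges (no self-loops)."""
    rng = random.Random(seed)
    kept = {e for e in edges if rng.random() >= r}
    n_add = int(round(a * len(edges)))
    added = set()
    while len(added) < n_add:
        i, j = rng.randrange(n_vertices), rng.randrange(n_vertices)
        if i != j and (i, j) not in edges:
            added.add((i, j))
    return kept | added
```

The returned edge set plays the role of the "partially inaccurate prior knowledge" handed to each inference method.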
PR space, though SSPS's curves clearly dominate. On larger problems the exact DBN method's performance quickly deteriorates. Computational tractability requires the exact method to impose highly restrictive in-degree constraints. These observations are consistent with the heatmaps of Figures 3 and 4 in the main text.
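The edge-score evaluation behind these curves reduces to a ranking problem: each method assigns a score to every possible edge, and the scores are ranked against the true edge set. A brute-force sketch of AUCROC over edge scores (illustrative only; an actual evaluation pipeline would likely use standard library routines):

```python
def auc_roc(scores, true_edges):
    """scores: dict mapping edge -> predicted score.
    Returns the probability that a randomly chosen true edge
    outscores a randomly chosen non-edge (ties count half),
    which equals the area under the ROC curve."""
    pos = [s for e, s in scores.items() if e in true_edges]
    neg = [s for e, s in scores.items() if e not in true_edges]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The O(P·N) double loop is chosen for clarity over the usual sort-based computation.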
D HPN-DREAM challenge details
We provide additional details for the methodology and results of the HPN-DREAM challenge evaluation.
Data preprocessing.
The HPN-DREAM challenge data needed to be preprocessed before it could be used by the inference methods. The choices we made during preprocessing most likely affected the inference results. Many of the time series contain duplicate measurements. We managed this by simply averaging the duplicates. We log-transformed the time series since they were strictly positive and some methods (SSPS and exact DBN) assume normality. This probably made little difference for
FunChisq, which discretizes the data as part of its own preprocessing.
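A sketch of this preprocessing, assuming replicate measurements share a timepoint label; the function and argument names are hypothetical, and the real pipeline may differ in its details.

```python
import numpy as np

def preprocess(timepoints, values):
    """timepoints: 1-D array of timepoint labels (with duplicates).
    values: (len(timepoints), n_proteins) array of strictly positive data.
    Averages duplicate measurements per timepoint, then log-transforms."""
    timepoints = np.asarray(timepoints)
    values = np.asarray(values, dtype=float)
    unique_t = np.unique(timepoints)
    averaged = np.stack([values[timepoints == t].mean(axis=0)
                         for t in unique_t])
    return unique_t, np.log(averaged)  # data are strictly positive
```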
Predicted networks.
Figure 8 visualizes networks from two biological contexts in the HPN-DREAM challenge evaluation. This gives a sense of how the different inference methods' predictions differ from each other. All of the predicted networks are fairly different, though the SSPS and exact DBN predictions are more similar to each other than they are to
FunChisq. FunChisq predicts more self-edges than the other methods. In the BT549 cell line, the experimentally detected mTOR descendants include receptor proteins that would traditionally be considered upstream of mTOR in the pathway. The experimental results are reasonable given the influence of feedback loops in signaling pathways. However, the number and positioning of the mTOR descendants highlight the differences between the coarse HPN-DREAM challenge evaluation, which is based on reachability in a directed graph, and the more precise evaluation in our simulation study, where we have the edges in the ground truth network.
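The reachability-based scoring mentioned above can be sketched as a breadth-first traversal collecting the descendant set of a chosen node (mTOR, in the HPN-DREAM evaluation); the node names below are illustrative.

```python
from collections import deque

def descendants(edges, source):
    """All nodes reachable from `source` along directed edges."""
    children = {}
    for i, j in edges:
        children.setdefault(i, []).append(j)
    seen, queue = set(), deque([source])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

A predicted network is then scored by comparing its descendant set for mTOR against the experimentally determined one, rather than by comparing individual edges.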
HPN-DREAM AUCPR.
For completeness, we complement the AUCROC results of Section 3.2 with the corresponding AUCPR results. Figure 9 shows AUCPR in bar charts, with a layout identical to Figure 5. AUCPR leads us to conclusions similar to those from AUCROC. SSPS dominates the exact DBN method in 19 contexts and is dominated in 10. SSPS and FunChisq each dominate the other in 14 contexts. However, SSPS dominates the prior knowledge in only 9 contexts and is dominated in 21. As before, we conclude that SSPS attains performance similar to established methods on this task.
ROC and PR curves.
Figure 10 shows ROC and PR curves from our HPN-DREAM evaluation. We focus on two representative contexts: cell lines BT549 and MCF7, with EGF as the stimulus.
Fig. 7.
Representative ROC curves (top) and PR curves (bottom) from the simulation study. We show curves for three different simulations: |V| = 40, …, and … (left, middle, and right, respectively). Each of these simulations used corruption parameters r = a = 0.….

Fig. 8.
Prior and predicted pathways from the HPN-DREAM challenge. We show pathways from two contexts: cell lines BT549 (top row) and MCF7 (bottom row). The stimulus is EGF for both contexts. SSPS attained the best AUCROC of all methods in the (BT549, EGF) context and the worst in the (MCF7, EGF) context. The yellow node is mTOR; red nodes are the experimentally generated ("ground truth") descendants of mTOR.
Fig. 9.
A bar chart similar to Figure 5 except that it shows AUCPR rather than AUCROC. See Figure 5 for details about the layout.
The bar charts in Figure 9 tell us that SSPS was the top performer in the (BT549, EGF) context. The ROC and PR curves are consistent with this: SSPS dominates the other methods in ROC and PR space. In contrast, SSPS was the worst performer in the (MCF7, EGF) context; the curves show SSPS performing worse than random. The LASSO ROC and PR curves are interesting. Its ROC curves show nearly random performance, and its PR curves are straight lines. Manually inspecting its predictions yields an explanation: (i)
LASSO gives nonzero probability to a very small number of edges; (ii) that small set of edges results in a very small descendant set for mTOR; (iii) that small descendant set is incorrect.