A Constraint-Based Algorithm for the Structural Learning of Continuous-Time Bayesian Networks
Alessandro Bregoli (A.BREGOLI@CAMPUS.UNIMIB.IT)
Università degli Studi di Milano-Bicocca, Milano, Italy

Marco Scutari (SCUTARI@IDSIA.CH)
Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA), Lugano, Switzerland

Fabio Stella (FABIO.STELLA@UNIMIB.IT)
Università degli Studi di Milano-Bicocca, Milano, Italy
Abstract
Dynamic Bayesian networks have been well explored in the literature as discrete-time models; however, their continuous-time extensions have seen comparatively little attention. In this paper, we propose the first constraint-based algorithm for learning the structure of continuous-time Bayesian networks. We discuss the different statistical tests and the underlying hypotheses used by our proposal to establish conditional independence. Finally, we validate its performance using synthetic data, and discuss its strengths and limitations. We find that the score-based approach is more accurate in learning networks with binary variables, while our constraint-based approach is more accurate with variables assuming more than two values. However, more experiments are needed for confirmation.
Keywords: Continuous-time Bayesian networks, structure learning, constraint-based algorithm.
1. Introduction
Multivariate time-series data are becoming increasingly common in many domains such as healthcare, finance, telecommunications, social networks, e-commerce, and homeland security. Their size and dimensionality are set to continue to increase in the future, requiring automated algorithms to discover their probabilistic structure and to predict their trajectories over time.

In this paper we focus on the problem of learning the structure of continuous-time Bayesian networks (CTBNs; Nodelman et al., 2002) from data. This type of probabilistic graphical model has been successfully used to reconstruct transcriptional regulatory networks from time-course gene expression data (Acerbi et al., 2016), to model the presence of people at their computers (Nodelman and Horvitz, 2003), and to detect network intrusion (Xu and Shelton, 2008). The literature implements CTBN structure learning using score-based algorithms to maximize the Bayesian-Dirichlet equivalent (BDe) metric, while in this paper we design the first constraint-based algorithm. The main contributions of this paper are:

• the design of the first constraint-based algorithm for the structure learning of CTBNs, which we call Continuous-Time PC (CTPC);
• the definition of different test statistics to assess conditional independence in CTBNs;
• an empirical performance comparison between score-based algorithms and our proposal.

The rest of the paper is organized as follows. Section 2 introduces CTBNs and the associated score-based structure learning algorithms. The proposed constraint-based algorithm and the associated conditional independence tests are presented in Section 3. We then compare score-based and constraint-based approaches in Section 4, and conclusions are summarized in Section 5.
2. Continuous-Time Bayesian Networks
CTBNs are a particular type of probabilistic graphical model that combines Bayesian networks (BNs; Koller and Friedman, 2009) and homogeneous Markov processes to model discrete-state, continuous-time dynamical systems (Nodelman et al., 2002). Compared to their discrete-time counterpart, dynamic Bayesian networks (DBNs), they can efficiently model domains like those mentioned above in which variables evolve at different time granularities. Exact and approximate inference in CTBNs have been shown to be NP-hard (Sturlaugson and Sheppard, 2014).
A CTBN models a stochastic process over a structured state space for a set of random variables $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$, where each $X_k \in \mathbf{X}$ takes value over a finite domain $Val(X_k)$. It encodes such a process in a compact form by factorizing its dynamics into local continuous-time Markov processes that depend on a limited set of states.

Definition 1 (Nodelman et al., 2002). A CTBN $\mathcal{N}$ over $\mathbf{X}$ is characterized by two components:

• an initial distribution $P(\mathbf{X})$, specified as a BN over $\mathbf{X}$;
• a continuous-time transition model specified as:
  – a directed (possibly cyclic) graph $\mathcal{G}$ whose nodes correspond to the $X_k \in \mathbf{X}$;
  – a conditional intensity matrix $\mathbf{Q}_{X_k \mid \mathbf{U}}$ for each $X_k$.

The conditional intensity matrix (CIM) $\mathbf{Q}_{X_k \mid \mathbf{U}}$ consists of the set of intensity matrices
$$\mathbf{Q}_{X_k \mid \mathbf{u}} = \begin{bmatrix} -q_{x_1 \mid \mathbf{u}} & q_{x_1 x_2 \mid \mathbf{u}} & \cdots & q_{x_1 x_m \mid \mathbf{u}} \\ q_{x_2 x_1 \mid \mathbf{u}} & -q_{x_2 \mid \mathbf{u}} & \cdots & q_{x_2 x_m \mid \mathbf{u}} \\ \vdots & \vdots & \ddots & \vdots \\ q_{x_m x_1 \mid \mathbf{u}} & q_{x_m x_2 \mid \mathbf{u}} & \cdots & -q_{x_m \mid \mathbf{u}} \end{bmatrix}, \qquad m = |Val(X_k)|,$$
one for each possible configuration $\mathbf{u}$ of the parents $\mathbf{U}$ of $X_k$ in $\mathcal{G}$. The diagonal elements of $\mathbf{Q}_{X_k \mid \mathbf{u}}$ are such that $q_{x_i \mid \mathbf{u}} = \sum_{x_j \neq x_i} q_{x_i x_j \mid \mathbf{u}}$, where $q_{x_i \mid \mathbf{u}}$ is the parameter of the exponential distribution associated with state $x_i$ of variable $X_k$. Therefore, $1/q_{x_i \mid \mathbf{u}}$ is the expected time that variable $X_k$ stays in state $x_i$ before transitioning to a different state $x_j$. The off-diagonal elements $q_{x_i x_j \mid \mathbf{u}}$ are proportional to the probability that $X_k$ transitions from state $x_i$ to state $x_j$ when $\mathbf{U} = \mathbf{u}$. Note that, conditional on $X_k$, $\mathbf{Q}_{X_k \mid \mathbf{u}}$ can be equivalently summarized with two independent sets of parameters:

• $\mathbf{q}_{X_k \mid \mathbf{u}} = \{ q_{x_i \mid \mathbf{u}} : \forall x_i \in Val(X_k) \}$, the set of intensities of the exponential distributions of the waits until the next transition; and
• $\boldsymbol{\theta}_{X_k \mid \mathbf{u}} = \{ \theta_{x_i x_j \mid \mathbf{u}} = q_{x_i x_j \mid \mathbf{u}} / q_{x_i \mid \mathbf{u}} : \forall x_i, x_j \in Val(X_k), x_i \neq x_j \}$, the probabilities of transitioning to specific states.
1. For simplicity of notation, and without loss of generality, we omit the $k$ subscript from $m$; this implies that each $X_k$ may have a different domain.
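As a concrete illustration of this decomposition, the sketch below (ours, assuming NumPy; the function name and the example matrix are not from the paper) recovers $\mathbf{q}_{X_k \mid \mathbf{u}}$ and $\boldsymbol{\theta}_{X_k \mid \mathbf{u}}$ from a single intensity matrix.

```python
import numpy as np

def decompose_cim(Q):
    """Split an intensity matrix Q[X_k | u] into exit rates q and
    transition probabilities theta, as in Definition 1."""
    Q = np.asarray(Q, dtype=float)
    q = -np.diag(Q)               # q_{x_i|u}: rates of the exponential waiting times
    theta = Q / q[:, None]        # theta_{x_i x_j|u} = q_{x_i x_j|u} / q_{x_i|u}
    np.fill_diagonal(theta, 0.0)  # theta is only defined for x_i != x_j
    return q, theta

# A binary variable staying 2 time units in state 0 and 0.5 in state 1 on average.
Q = [[-0.5,  0.5],
     [ 2.0, -2.0]]
q, theta = decompose_cim(Q)
print(q)      # [0.5 2.0] -> expected waiting times 1/q = [2.0, 0.5]
print(theta)  # each row sums to 1 over the off-diagonal entries
```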
Therefore, a CTBN $\mathcal{N}$ over $\mathbf{X}$ can be equivalently described by a graph $\mathcal{G}$ together with the corresponding sets of parameters $\mathbf{q} = \{ \mathbf{q}_{X_k \mid \mathbf{u}} : \forall X_k \in \mathbf{X}, x_i \in Val(X_k), \mathbf{u} \in Val(\mathbf{U}) \}$ and $\boldsymbol{\Theta} = \{ \boldsymbol{\theta}_{X_k \mid \mathbf{u}} : \forall X_k \in \mathbf{X}, x_i \in Val(X_k), \mathbf{u} \in Val(\mathbf{U}) \}$.

It is important to note that we assume that only one variable in the CTBN can change state at any specific instant; and that its transition dynamics are specified by its parents via the CIM, while being independent of all other variables given its Markov blanket.

Let $\mathcal{D} = \{\sigma_1, \ldots, \sigma_h\}$ be a sample consisting of $h$ trajectories $\sigma_j = \{\langle t_1, X_{t_1} \rangle, \ldots, \langle t_{T_j}, X_{t_{T_j}} \rangle\}$, where $T_j$ represents the length of trajectory $\sigma_j$, that is, the number of transitions. For each pair $\langle t_i, X_{t_i} \rangle$, we denote the time of the $i$th transition as $t_i$ and the variable that leaves its current state at that time as $X_{t_i}$. Learning the structure of a CTBN from $\mathcal{D}$ can be cast as an optimization problem (Nodelman et al., 2003) in which we would like to find the graph $\mathcal{G}^*$ with the highest posterior log-probability given $\mathcal{D}$:
$$\ln P(\mathcal{G} \mid \mathcal{D}) = \ln P(\mathcal{G}) + \ln P(\mathcal{D} \mid \mathcal{G}) \tag{1}$$
where $P(\mathcal{G})$ is the prior distribution over the space of graphs spanning $\mathbf{X}$ and $P(\mathcal{D} \mid \mathcal{G})$ is the marginal likelihood of the data given $\mathcal{G}$, averaged over all possible parameter sets.

The prior $P(\mathcal{G})$ is usually assumed to satisfy the structure modularity property (Friedman and Koller, 2000), so that it decomposes as
$$P(\mathcal{G}) = \prod_{X_k \in \mathbf{X}} P(Pa(X_k) = \mathbf{U}). \tag{2}$$
For simplicity, the literature often assumes a uniform prior, that is, $P(\mathcal{G}) \propto 1$.

The marginal likelihood $P(\mathcal{D} \mid \mathcal{G})$ depends on the parameter prior $P(\mathbf{q}, \boldsymbol{\Theta} \mid \mathcal{G})$, which is usually assumed to satisfy the global parameter independence, the local parameter independence and the parameter modularity properties (Heckerman et al., 1995) outlined below.

• Global parameter independence: the parameters $\mathbf{q}_{X_k \mid \mathbf{U}}$ and $\boldsymbol{\theta}_{X_k \mid \mathbf{U}}$ associated with each variable $X_k$ in a graph $\mathcal{G}$ are independent:
$$P(\mathbf{q}, \boldsymbol{\Theta} \mid \mathcal{G}) = \prod_{X_k \in \mathbf{X}} P(\mathbf{q}_{X_k \mid \mathbf{U}}, \boldsymbol{\theta}_{X_k \mid \mathbf{U}} \mid \mathcal{G}). \tag{3}$$

• Local parameter independence: for each variable $X_k$, the parameters associated with each configuration $\mathbf{u}$ of the parent set $\mathbf{U}$ are independent:
$$P(\mathbf{q}_{X_k \mid \mathbf{U}}, \boldsymbol{\theta}_{X_k \mid \mathbf{U}} \mid \mathcal{G}) = \prod_{\mathbf{u} \in Val(\mathbf{U})} \prod_{x_i \in Val(X_k)} P(q_{x_i \mid \mathbf{u}}, \boldsymbol{\theta}_{x_i \mid \mathbf{u}} \mid \mathcal{G}). \tag{4}$$

• Parameter modularity: if variable $X_k$ has the same parent set in two distinct graphs $\mathcal{G}$ and $\mathcal{G}'$, then the prior probability for the parameters associated with $X_k$ should also be the same:
$$P(\mathbf{q}_{X_k \mid \mathbf{U}}, \boldsymbol{\theta}_{X_k \mid \mathbf{U}} \mid \mathcal{G}) = P(\mathbf{q}_{X_k \mid \mathbf{U}}, \boldsymbol{\theta}_{X_k \mid \mathbf{U}} \mid \mathcal{G}'). \tag{5}$$
2. The definition of Markov blankets in CTBNs is the same as in BNs: a Markov blanket comprises the parents, the children and the spouses of the target node; and it graphically separates the target node from the rest of the network.
In the context of CTBNs, we assume that the priors over the waiting times and over the transition probabilities are independent as well:
$$P(\mathbf{q}, \boldsymbol{\Theta} \mid \mathcal{G}) = P(\mathbf{q} \mid \mathcal{G})\, P(\boldsymbol{\Theta} \mid \mathcal{G}). \tag{6}$$
Nodelman et al. (2003) suggested conjugate priors for both $\mathbf{q}$ and $\boldsymbol{\Theta}$ in the form of
$$P(q_{x_i \mid \mathbf{u}}) \sim \mathrm{Gamma}(\alpha_{x_i \mid \mathbf{u}}, \tau_{x_i \mid \mathbf{u}}), \tag{7}$$
$$P(\boldsymbol{\theta}_{x_i \mid \mathbf{u}}) \sim \mathrm{Dir}(\alpha_{x_i x_1 \mid \mathbf{u}}, \ldots, \alpha_{x_i x_m \mid \mathbf{u}}), \tag{8}$$
where $\alpha_{x_i \mid \mathbf{u}}, \tau_{x_i \mid \mathbf{u}}, \alpha_{x_i x_1 \mid \mathbf{u}}, \ldots, \alpha_{x_i x_m \mid \mathbf{u}}$ are the priors' hyperparameters. In particular, for any $X_k \mid \mathbf{U} = \mathbf{u}$, $\alpha_{x_i \mid \mathbf{u}}$ and $\alpha_{x_i x_j \mid \mathbf{u}}$ represent the pseudocounts for the number of transitions from state $x_i$ to state $x_j$; and $\tau_{x_i \mid \mathbf{u}}$ represents the imaginary amount of time spent in each state $x_i$ before any data is observed. Note that $\alpha_{x_i \mid \mathbf{u}}$ is inversely proportional to the number of joint states of the parents of $X_k$. After conditioning on the dataset $\mathcal{D}$, we obtain the following posterior distributions:
$$P(q_{x_i \mid \mathbf{u}} \mid \mathcal{D}) \sim \mathrm{Gamma}(\alpha_{x_i \mid \mathbf{u}} + M_{x_i \mid \mathbf{u}},\ \tau_{x_i \mid \mathbf{u}} + T_{x_i \mid \mathbf{u}}), \tag{9}$$
$$P(\boldsymbol{\theta}_{x_i \mid \mathbf{u}} \mid \mathcal{D}) \sim \mathrm{Dir}(\alpha_{x_i x_1 \mid \mathbf{u}} + M_{x_i x_1 \mid \mathbf{u}}, \ldots, \alpha_{x_i x_m \mid \mathbf{u}} + M_{x_i x_m \mid \mathbf{u}}), \tag{10}$$
where $T_{x_i \mid \mathbf{u}}$ and $M_{x_i x_j \mid \mathbf{u}}$ are the sufficient statistics of the CTBN. In particular, $T_{x_i \mid \mathbf{u}}$ is the amount of time spent by $X_k$ in the state $x_i$ and $M_{x_i x_j \mid \mathbf{u}}$ is the number of times that $X_k$ transitions from the state $x_i$ to the state $x_j$, given $\mathbf{U} = \mathbf{u}$. The marginal likelihood $P(\mathcal{D} \mid \mathcal{G})$ arising from these posteriors can be written as
$$P(\mathcal{D} \mid \mathcal{G}) = \prod_{X_k \in \mathbf{X}} \mathrm{ML}(\mathbf{q}_{X_k \mid \mathbf{U}} : \mathcal{D})\, \mathrm{ML}(\boldsymbol{\theta}_{X_k \mid \mathbf{U}} : \mathcal{D}) \tag{11}$$
due to (3) and (6). $\mathrm{ML}(\mathbf{q}_{X_k \mid \mathbf{U}} : \mathcal{D})$ is the marginal likelihood of $\mathbf{q}_{X_k \mid \mathbf{U}}$,
$$\mathrm{ML}(\mathbf{q}_{X_k \mid \mathbf{U}} : \mathcal{D}) = \prod_{\mathbf{u} \in Val(\mathbf{U})} \prod_{x_i \in Val(X_k)} \frac{\Gamma(\alpha_{x_i \mid \mathbf{u}} + M_{x_i \mid \mathbf{u}} + 1)\, (\tau_{x_i \mid \mathbf{u}})^{\alpha_{x_i \mid \mathbf{u}}+1}}{\Gamma(\alpha_{x_i \mid \mathbf{u}} + 1)\, (\tau_{x_i \mid \mathbf{u}} + T_{x_i \mid \mathbf{u}})^{\alpha_{x_i \mid \mathbf{u}} + M_{x_i \mid \mathbf{u}} + 1}}; \tag{12}$$
and $\mathrm{ML}(\boldsymbol{\theta}_{X_k \mid \mathbf{U}} : \mathcal{D})$ is the marginal likelihood of $\boldsymbol{\theta}_{X_k \mid \mathbf{U}}$,
$$\mathrm{ML}(\boldsymbol{\theta}_{X_k \mid \mathbf{U}} : \mathcal{D}) = \prod_{\mathbf{u} \in Val(\mathbf{U})} \prod_{x_i \in Val(X_k)} \frac{\Gamma(\alpha_{x_i \mid \mathbf{u}})}{\Gamma(\alpha_{x_i \mid \mathbf{u}} + M_{x_i \mid \mathbf{u}})} \prod_{x_j \in Val(X_k)} \frac{\Gamma(\alpha_{x_i x_j \mid \mathbf{u}} + M_{x_i x_j \mid \mathbf{u}})}{\Gamma(\alpha_{x_i x_j \mid \mathbf{u}})}. \tag{13}$$
The resulting $P(\mathcal{D} \mid \mathcal{G})$ is the Bayesian-Dirichlet equivalent (BDe) metric for CTBNs (Nodelman, 2007) based on the priors (7) and (8), which satisfies assumptions (3), (4), and (5) by construction. The posterior in (1) can then be written in closed form as
$$\ln P(\mathcal{G} \mid \mathcal{D}) = \sum_{X_k \in \mathbf{X}} \left[ \log P(Pa(X_k) = \mathbf{U}) + \log \mathrm{ML}(\mathbf{q}_{X_k \mid \mathbf{U}} : \mathcal{D}) + \log \mathrm{ML}(\boldsymbol{\theta}_{X_k \mid \mathbf{U}} : \mathcal{D}) \right] \tag{14}$$
assuming (2) is satisfied.

Since $\mathcal{G}$ does not have acyclicity constraints in a CTBN, it is possible to maximize (14) by independently scoring the possible parent sets of each $X_k$. Furthermore, if we bound the maximum number of parents we can find the optimal $P(\mathcal{G} \mid \mathcal{D})$ in polynomial time either by enumerating all possible parent sets or by using hill-climbing to add, delete or reverse arcs (Nodelman et al., 2003).
3. The number of times $X_k$ leaves the state $x_i$ when $\mathbf{U} = \mathbf{u}$ is $M_{x_i \mid \mathbf{u}} = \sum_{x_j \neq x_i} M_{x_i x_j \mid \mathbf{u}}$.
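As an illustration, the local contribution to the BDe metric in (12) and (13) can be computed directly from the sufficient statistics. The sketch below is ours (assuming NumPy/SciPy and a particular array layout for the statistics and hyperparameters); it is not the CTBN-RLE implementation.

```python
import numpy as np
from scipy.special import gammaln

def local_log_marginal_likelihood(M, T, alpha, tau):
    """log [ ML(q : D) * ML(theta : D) ] for one variable X_k,
    following equations (12) and (13).

    M[u, i, j]     : observed x_i -> x_j transition counts per parent
                     configuration u (zero on the diagonal)
    T[u, i]        : time spent in state x_i under configuration u
    alpha[u, i, j] : Dirichlet pseudocounts (zero on the diagonal)
    tau[u, i]      : imaginary-time hyperparameters of the Gamma prior"""
    M_exit = M.sum(axis=2)      # M_{x_i|u}, as in footnote 3
    a_exit = alpha.sum(axis=2)  # alpha_{x_i|u}

    # Equation (12): marginal likelihood of the exponential rates q.
    log_ml_q = (gammaln(a_exit + M_exit + 1) - gammaln(a_exit + 1)
                + (a_exit + 1) * np.log(tau)
                - (a_exit + M_exit + 1) * np.log(tau + T))

    # Equation (13): marginal likelihood of the transition probabilities
    # theta, restricted to the off-diagonal (x_i != x_j) entries.
    off = ~np.eye(M.shape[1], dtype=bool)
    total = log_ml_q.sum()
    total += (gammaln(a_exit) - gammaln(a_exit + M_exit)).sum()
    total += (gammaln(alpha + M) - gammaln(alpha))[:, off].sum()
    return total
```

The full score (14) then adds $\log P(Pa(X_k) = \mathbf{U})$ to this quantity and sums over all variables; since each parent set is scored independently, this is exactly what enumeration or hill-climbing maximizes one variable at a time.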
Algorithm 1: The PC Algorithm

1. Form the complete undirected graph $\mathcal{G}$ on the vertex set $\mathbf{X}$.
2. For each pair of variables $X_i, X_j \in \mathbf{X}$, consider all the possible separation sets from the smallest ($\mathbf{S}_{X_i X_j} = \emptyset$) to the largest ($\mathbf{S}_{X_i X_j} = \mathbf{X} \setminus \{X_i, X_j\}$). If there is a set $\mathbf{S}_{X_i X_j}$ such that $X_i \perp\!\!\!\perp X_j \mid \mathbf{S}_{X_i X_j}$, then the edge $X_i - X_j$ is removed from $\mathcal{G}$.
3. For each triple $X_i, X_j, X_k \in \mathcal{G}$ such that $X_i - X_j$, $X_j - X_k$, and $X_i, X_k$ are not connected, orient the edges into $X_i \rightarrow X_j \leftarrow X_k$ if and only if $X_j \notin \mathbf{S}_{X_i X_k}$ for every $\mathbf{S}_{X_i X_k}$ that makes $X_i$ and $X_k$ independent.
4. Identify the compelled directed arcs by iteratively applying the following two rules:
   (a) if $X_i$ is adjacent to $X_j$ and there is a strictly directed path from $X_i$ to $X_j$, then replace $X_i - X_j$ with $X_i \rightarrow X_j$ (to avoid introducing cycles);
   (b) if $X_i$ and $X_j$ are not adjacent but $X_i \rightarrow X_k$ and $X_k - X_j$, then replace the latter with $X_k \rightarrow X_j$ (to avoid introducing new v-structures).
5. Return the resulting CPDAG $\mathcal{G}$.
3. A Constraint-Based Algorithm for Structure Learning
Learning the structure of a BN is a problem that is well explored in the literature. Several approaches have been proposed, spanning score-based algorithms, constraint-based algorithms and hybrid algorithms; a recent review is available from Scutari et al. (2019). Score-based algorithms find the BN structure that maximizes a given score function, while constraint-based algorithms use statistical tests to learn conditional independence relationships (called constraints) from the data and infer the presence or absence of particular arcs. Hybrid algorithms combine aspects of both score-based and constraint-based algorithms.

On the other hand, the only structure learning algorithm proposed for CTBNs is the score-based algorithm from Nodelman et al. (2003) we described in the previous section; to the best of our knowledge no constraint-based algorithm exists in the literature. After a brief introduction to constraint-based algorithms for BNs, we propose such an algorithm for CTBNs.
Constraint-based algorithms for BN structure learning originate from the Inductive Causation (IC) algorithm from Pearl and Verma (1991) for learning causal networks. IC starts by finding pairs of nodes connected by an undirected arc, as those are not independent given any other subset of variables. The second step identifies the v-structures $X_i \rightarrow X_k \leftarrow X_j$ among all pairs $X_i$ and $X_j$ of non-adjacent nodes which share a common neighbour $X_k$. Finally, IC identifies compelled arcs and orients them to build the completed partially oriented DAG (CPDAG), which describes the equivalence class the BN falls into.
However, steps 1 and 2 of the IC algorithm are unfeasible for non-trivial problems due to the exponential number of conditional independence relationships to be tested. The PC algorithm, which is briefly illustrated in Algorithm 1, was the first proposal addressing this issue; its modern incarnation is described in Colombo and Maathuis (2014), and we will use it as the foundation for CTBN structure learning below. PC starts from a fully-connected undirected graph. Then, for each pair of variables $X_i$ and $X_j$, it proceeds by gradually increasing the cardinality of the set of conditioning nodes $\mathbf{S}_{X_i X_j}$ until $X_i$ and $X_j$ are found to be independent or $\mathbf{S}_{X_i X_j} = \mathbf{X} \setminus \{X_i, X_j\}$. The remaining steps are identical to those of IC.

Neither IC nor PC (nor other constraint-based algorithms, for that matter) requires a specific test statistic to test conditional independence, making them independent from the distributional assumptions we make on the data.

CTBNs differ from BNs in three fundamental ways: BNs do not model time, while CTBNs do; BNs are based on DAGs, while CTBNs allow cycles; and BNs model the dependence of a node on its parents using a conditional probability distribution, while CTBNs model it using a CIM. These differences make structure learning a simpler problem for CTBNs than it is for BNs.

Firstly, learning arc directions is an issue in BNs but not in CTBNs, where arcs are required to follow the arrow of time. Unlike BNs, which can be grouped into equivalence classes that are probabilistically indistinguishable, each CTBN has a unique minimal graphical representation (Nodelman et al., 2003). For instance, let a CTBN $\mathcal{N}$ have graph $\mathcal{G} = \{X \rightarrow Y\}$: unless trivially $X = Y$, $\mathcal{G}$ cannot generate the same transition probabilities as any CTBN $\mathcal{N}'$ with graph $\mathcal{G}' = \{X \leftarrow Y\}$.

Secondly, in CTBNs we can learn each parent set $Pa(X_k)$ in isolation, thus making any structure learning algorithm embarrassingly parallel. Acyclicity imposes a global constraint on $\mathcal{G}$ that makes it impossible to do the same in BNs.

Thirdly, each variable $X_k$ is modelled conditional on a given function of its parent set $Pa(X_k)$: a conditional probability table for (discrete) BNs, a CIM for CTBNs. However, a CIM $\mathbf{Q}_{X_k \mid \mathbf{U}}$ describes the temporal evolution of the state of variable $X_k$ conditionally on the state of its parent set $\mathbf{U}$. Hence we cannot test conditional independence using classic test statistics like the mutual information or Pearson's $\chi^2$ that assume observations are independent (Koller and Friedman, 2009). Instead, we need to adapt our definition of conditional independence to CTBNs in order to design a constraint-based algorithm for structure learning.
Definition 2 (Conditional Independence in a CTBN). Let $\mathcal{N}$ be a CTBN with graph $\mathcal{G}$ over a set of variables $\mathbf{X}$. We say that $X_i$ is conditionally independent from $X_j$ given $\mathbf{S}_{X_i X_j} \subseteq \mathbf{X} \setminus \{X_i, X_j\}$ if
$$\mathbf{Q}_{X_i \mid x, \mathbf{s}} = \mathbf{Q}_{X_i \mid \mathbf{s}} \quad \forall x \in Val(X_j),\ \forall \mathbf{s} \in Val(\mathbf{S}_{X_i X_j}). \tag{15}$$
If $\mathbf{S}_{X_i X_j} = \emptyset$, then $X_i$ is said to be marginally independent from $X_j$.

It is important to note that Definition 2 is not symmetric: it is perfectly possible for $X_i$ to be conditionally or marginally independent from $X_j$, while $X_j$ is not conditionally or marginally independent from $X_i$. This discrepancy is, however, not a practical or theoretical concern because arcs are already non-symmetric (they must follow the arrow of time), and therefore we only test whether $X_i$ depends on $X_j$ if $X_j$ precedes $X_i$, or the other way round.

As for the test statistics, we can test for conditional independence using $\mathbf{q}_{X_k \mid \mathbf{u}}$ (the waiting times) and, if we do not reject the null hypothesis of conditional independence, we can perform a further test using $\boldsymbol{\theta}_{X_k \mid \mathbf{u}}$ (the transitions); $\mathbf{q}_{X_k \mid \mathbf{u}}$ and $\boldsymbol{\theta}_{X_k \mid \mathbf{u}}$ have been defined to be independent in Section 2, so they can be tested separately. Note that conditional independence can be established by testing only the waiting times $\mathbf{q}_{X_k \mid \mathbf{u}}$ if the CTBN contains only binary variables. However, testing for conditional independence involves both waiting times and transitions in the general case in which variables can take more than two values. Since we consider rates to be the most important characteristic to assess in a stochastic process, we decide without loss of generality to test $\mathbf{q}_{X_k \mid \mathbf{u}}$ first, and then $\boldsymbol{\theta}_{X_k \mid \mathbf{u}}$.

For $\mathbf{q}_{X_k \mid \mathbf{u}}$, we define the null time to transition hypothesis as follows.
Definition 3 (Null Time To Transition Hypothesis). Given $X_i$, $X_j$ and the conditioning set $\mathbf{S}_{X_i X_j}$, the null time to transition hypothesis of $X_j$ over $X_i$ is
$$q_{x \mid y, \mathbf{s}} = q_{x \mid \mathbf{s}} \quad \forall x \in Val(X_i),\ \forall y \in Val(X_j),\ \forall \mathbf{s} \in Val(\mathbf{S}_{X_i X_j}). \tag{16}$$

For $\boldsymbol{\theta}_{X_k \mid \mathbf{u}}$, we define the null state to state transition hypothesis as follows.
Definition 4 (Null State To State Transition Hypothesis). Given $X_i$, $X_j$ and the conditioning set $\mathbf{S}_{X_i X_j}$, the null state to state transition hypothesis of $X_j$ over $X_i$ is
$$\boldsymbol{\theta}_{x \cdot \mid y, \mathbf{s}} = \boldsymbol{\theta}_{x \cdot \mid \mathbf{s}} \quad \forall x \in Val(X_i),\ \forall y \in Val(X_j),\ \forall \mathbf{s} \in Val(\mathbf{S}_{X_i X_j}), \tag{17}$$
where we let $\boldsymbol{\theta}_{x \cdot \mid y, \mathbf{s}}$ be the off-diagonal elements of the matrix $\mathbf{Q}_{X_i \mid y, \mathbf{s}}$ corresponding to the assignment $X_i = x$. It is worthwhile to mention that the equality $\boldsymbol{\theta}_{x \cdot \mid y, \mathbf{s}} = \boldsymbol{\theta}_{x \cdot \mid \mathbf{s}}$ has to be understood in terms of corresponding components of the vectors $\boldsymbol{\theta}_{x \cdot \mid y, \mathbf{s}}$ and $\boldsymbol{\theta}_{x \cdot \mid \mathbf{s}}$.

Definition 3 characterizes conditional independence for the times to transition for variable $X_i$ when adding (or not) $X_j$ to its parents; Definition 4 characterizes conditional independence for the transitions of $X_i$ when adding (or not) $X_j$ to its parents.

To test the null time to transition hypothesis, we use the F test to compare two exponential distributions from Lee and Wang (2003). In the case of CTBNs, the test statistic and the degrees of freedom take the form
$$F_{r_1, r_2} = \frac{q_{x \mid \mathbf{s}}}{q_{x \mid y, \mathbf{s}}}, \quad \text{with} \quad r_1 = \sum_{x' \in Val(X_i)} M_{x x' \mid y, \mathbf{s}}, \quad r_2 = \sum_{x' \in Val(X_i)} M_{x x' \mid \mathbf{s}}. \tag{18}$$
To test the null state to state transition hypothesis, we investigated the use of the two-sample chi-square and Kolmogorov-Smirnov tests (Mitchell, 1971). For CTBNs the former takes the form
$$\chi^2 = \sum_{x' \in Val(X_i)} \frac{\left(K\, M_{x x' \mid y, \mathbf{s}} - L\, M_{x x' \mid \mathbf{s}}\right)^2}{M_{x x' \mid y, \mathbf{s}} + M_{x x' \mid \mathbf{s}}}, \quad K = \sqrt{\frac{\sum_{x' \in Val(X_i)} M_{x x' \mid \mathbf{s}}}{\sum_{x' \in Val(X_i)} M_{x x' \mid y, \mathbf{s}}}}, \quad L = \frac{1}{K}, \tag{19}$$
and is asymptotically distributed as a $\chi^2_{|Val(X_i)| - 1}$. The latter is defined as
$$D_{r_1, r_2} = \sup_{x' \in Val(X_i)} \left| \Theta_{x x' \mid \mathbf{s}} - \Theta_{x x' \mid y, \mathbf{s}} \right|, \quad \Theta_{x x'} = \sum_{\substack{x'' \in Val(X_i) \\ x'' \leq x'}} \theta_{x x''}. \tag{20}$$
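To make the tests concrete, here is a minimal sketch of how the three statistics can be computed from the sufficient statistics of the conditioned and marginal CIMs. The function names, the use of SciPy, and the decision rules are our assumptions rather than the paper's released code, and zero-count corner cases are ignored.

```python
import numpy as np
from scipy.stats import chi2, f as f_dist

def ttt_test(M_cond, T_cond, M_marg, T_marg, alpha=0.05):
    """Null time to transition hypothesis (Definition 3) for one state x:
    F test on the ratio of two exponential rates, equation (18).

    M_cond: counts M_{x x'|y,s} over x'; T_cond: time T_{x|y,s};
    M_marg, T_marg: the analogous quantities given s only."""
    q_cond = M_cond.sum() / T_cond           # estimated q_{x|y,s}
    q_marg = M_marg.sum() / T_marg           # estimated q_{x|s}
    F = q_marg / q_cond
    r1, r2 = M_cond.sum(), M_marg.sum()      # degrees of freedom
    p_value = 2 * min(f_dist.cdf(F, r1, r2), f_dist.sf(F, r1, r2))
    return p_value < alpha                   # True: reject the null

def sts_chi2_test(M_cond, M_marg, alpha=0.05):
    """Null state to state transition hypothesis (Definition 4):
    two-sample chi-square statistic of equation (19)."""
    K = np.sqrt(M_marg.sum() / M_cond.sum())
    L = 1.0 / K
    stat = (((K * M_cond - L * M_marg) ** 2) / (M_cond + M_marg)).sum()
    return stat > chi2.ppf(1 - alpha, df=len(M_cond) - 1)

def sts_ks_test(M_cond, M_marg, alpha=0.05):
    """Kolmogorov-Smirnov statistic of equation (20), computed on the
    cumulative transition probabilities Theta."""
    D = np.abs(np.cumsum(M_marg / M_marg.sum())
               - np.cumsum(M_cond / M_cond.sum())).max()
    r1, r2 = M_cond.sum(), M_marg.sum()
    critical = np.sqrt(-0.5 * np.log(alpha / 2)) * np.sqrt((r1 + r2) / (r1 * r2))
    return D > critical
```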
After characterising conditional independence, we can now introduce our constraint-based algorithm for structure learning in CTBNs. The algorithm, which we call Continuous-Time PC (CTPC), is shown in Algorithm 2. The first step is the same as the corresponding step of the PC algorithm, in that it determines the same pattern of conditional independence tests. However, as discussed above, the hypotheses being tested are the null time to transition hypothesis and the null state to state transition hypothesis. The second step of CTPC differs from that in the PC algorithm. Since independence relationships are not symmetric in CTBNs, we can find the graph $\mathcal{G}$ of the CTBN without identifying and then refining a CPDAG representing an equivalence class. Therefore, steps 3 and 4 of the PC algorithm are not needed in the case of CTBNs.

CTPC starts by initializing the complete directed graph $\mathcal{G}$ without loops (step 1). Note that while loops (that is, arcs like $X_i \rightarrow X_i$) are not included, cycles of length two (that is, $X_i \rightarrow X_j$ and $X_j \rightarrow X_i$) are, as well as cycles of length three or more. Step 2 iterates over the $X_i$ to identify their parents $\mathbf{U}$. This is achieved in step 2.2.1 by first testing for unconditional independence, then by testing for conditional independence, gradually increasing the cardinality $b$ of the considered separating sets. Each time Algorithm 2 concludes that $X_i$ is independent from $X_j$ given some separating set, we remove the arc from node $X_j$ to node $X_i$ in step 2.2.2. At the same time, we also remove $X_j$ from the current parent set $\mathbf{U}$. The iteration for $X_i, X_j$ terminates either when $X_j$ is found to be independent from $X_i$ or when there are no larger separating sets left to try because $b = |\mathbf{U}|$; and the iteration over $X_i$ terminates when there are no more $X_j$ to test.

CTPC checks the null time to transition hypothesis (Definition 3) by applying the test for two exponential means (18). In contrast, the null state to state transition hypothesis (Definition 4) can be tested using two different tests, that is, the two-sample chi-square test and the two-sample Kolmogorov-Smirnov test. We call these two options CTPC$_{\chi^2}$ and CTPC$_{\mathrm{KS}}$, respectively. A minimal sketch of the resulting procedure is given after Algorithm 2.
Algorithm 2: The Continuous-Time PC Algorithm

1. Form the complete directed graph $\mathcal{G}$ on the vertex set $\mathbf{X}$.
2. For each variable $X_i \in \mathbf{X}$:
   2.1 Set $\mathbf{U} = \{X_j \in \mathbf{X} : X_j \rightarrow X_i\}$, the current parent set.
   2.2 For increasing values $b = 0, \ldots, n$, until $b = |\mathbf{U}|$:
       2.2.1 For each $X_j \in \mathbf{U}$, test $X_i \perp\!\!\!\perp X_j \mid \mathbf{S}_{X_i X_j}$ for all possible subsets $\mathbf{S}_{X_i X_j}$ of size $b$ of $\mathbf{U} \setminus X_j$.
       2.2.2 As soon as $X_i \perp\!\!\!\perp X_j \mid \mathbf{S}_{X_i X_j}$ for some $\mathbf{S}_{X_i X_j}$, remove $X_j \rightarrow X_i$ from $\mathcal{G}$ and $X_j$ from $\mathbf{U}$.
3. Return the directed graph $\mathcal{G}$.
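The following sketch mirrors Algorithm 2; it is our illustration, where `test_independence(xi, xj, sep)` is a placeholder that combines the null time to transition and null state to state transition tests above, and `variables` is any collection of hashable labels.

```python
from itertools import combinations

def ctpc(variables, test_independence):
    """Continuous-Time PC (Algorithm 2). `test_independence(xi, xj, sep)`
    must return True when X_i is judged independent of X_j given `sep`."""
    # Step 1: complete directed graph, without self-loops.
    parents = {xi: {xj for xj in variables if xj != xi} for xi in variables}

    for xi in variables:                          # step 2: one parent set at a time
        b = 0
        while b <= len(parents[xi]):              # step 2.2: growing separating sets
            for xj in list(parents[xi]):
                candidates = parents[xi] - {xj}   # step 2.2.1: subsets of U \ {X_j}
                if any(test_independence(xi, xj, set(sep))
                       for sep in combinations(candidates, b)):
                    parents[xi].discard(xj)       # step 2.2.2: drop X_j -> X_i
            b += 1
    return parents                                # step 3: the learned graph
```

Because each parent set is learned in isolation, the outer loop over `xi` is embarrassingly parallel; choosing the chi-square or the Kolmogorov-Smirnov variant of `test_independence` yields CTPC$_{\chi^2}$ or CTPC$_{\mathrm{KS}}$.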
4. Numerical Experiments
We assess the performance of CTPC against that of the score-based algorithm from Nodelman et al. (2003) using synthetic data.

In particular, we generate random CTBNs as combinations of directed graphs and the associated CIMs, and we generate random trajectories from each CTBN. We perform a simulation study using a factorial experimental design over different numbers of nodes $n = \{5, 10, 15, 20\}$, three network densities, numbers of states for the nodes $|Val(X_i)| = \{2, 3\}$ and different numbers of trajectories $h = \{100, 200, 300\}$. Each trajectory lasts on average for 100 units of time. Note that we only generate connected networks, hence the absolute density is bounded below by $1/|\mathbf{X}|$ (a connected graph has at least $|\mathbf{X}| - 1$ arcs out of $|\mathbf{X}|(|\mathbf{X}| - 1)$ possible ones). We perform 10 replicates for each simulation configuration with the smaller numbers of nodes, and 3 replicates for the configurations with the largest.

We measure the performance of the learning algorithms along two dimensions: structural accuracy and speed. In particular, structural accuracy is measured using the F1 score over the arcs, which is defined as
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}; \tag{21}$$
since there is no score equivalence in CTBNs, nor are networks constrained to be acyclic, comparing graphs is equivalent to evaluating a binary classification problem (as sketched in the code fragment below). As for speed, we evaluate the wall-clock time on a single core. While all algorithms we consider can be parallelized in some way, we prefer to avoid the confounding effect of varying degrees of parallelism overhead on speed.

The results of our simulation study are summarized in Table 1 (for the score-based algorithm in Nodelman et al., 2003), Table 2 (for CTPC$_{\chi^2}$) and Table 3 (for CTPC$_{\mathrm{KS}}$).

In the case of binary variables, the score-based algorithm performs better than the proposed constraint-based algorithms for any combination of network density, number of trajectories and number of nodes. CTPC$_{\chi^2}$ and CTPC$_{\mathrm{KS}}$ have comparable performance, which is expected since in this case the two algorithms are identical (because the tests are identical, that is, we only test waiting times). However, CTPC$_{\chi^2}$ and CTPC$_{\mathrm{KS}}$ have better F1 scores than the score-based algorithm for ternary variables.

CTPC$_{\chi^2}$ appears to perform marginally better than CTPC$_{\mathrm{KS}}$ for the smaller networks, but the two algorithms are again comparable when $n = 20$ and $h = 300$. This suggests that CTPC$_{\chi^2}$ is more sample-efficient than CTPC$_{\mathrm{KS}}$ with respect to the number of trajectories.

However, CTPC$_{\chi^2}$ and CTPC$_{\mathrm{KS}}$ scale better than the score-based algorithm from Nodelman et al. (2003), which exhausts the 24GiB of memory allocated for the experiment and fails to complete in the allotted time, as shown in the last line of Table 1.
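For reference, the arc-level F1 score of (21) can be computed as below (our sketch; arcs are directed pairs, and no equivalence-class matching is required because CTBN graphs are uniquely identifiable).

```python
def f1_score(true_arcs, learned_arcs):
    """Arc-level F1 of equation (21): arcs are treated as the positive
    class of a binary classification problem."""
    true_arcs, learned_arcs = set(true_arcs), set(learned_arcs)
    tp = len(true_arcs & learned_arcs)
    precision = tp / len(learned_arcs) if learned_arcs else 0.0
    recall = tp / len(true_arcs) if true_arcs else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```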
4. Score-based learning has been performed with the CTBN-RLE package (Shelton et al., 2010), while CTPC has been implemented in Python and made available at the following GitLab repository: https://gitlab.com/peer.review/constraint-based-learning-for-continuous-time-bayesian-networks. CTBN-RLE uses the Bayesian score, in which a Gamma prior is used for the parameters of the exponential distributions while a Dirichlet prior is used for the transition probabilities. Hyperparameters are set to their default values, that is, $\tau = 1$ for the Gamma prior and $\alpha = 1$ for the Dirichlet prior.
Binary variables
Cardinality |  100   200   300 |  100   200   300 |  100   200   300
          5 | 1.00  1.00  1.00 | 1.00  1.00  1.00 | 1.00  1.00  1.00
         10 | 1.00  1.00  1.00 | .990  .990  .990 | 1.00  1.00  1.00
         15 | 1.00  1.00  1.00 | 1.00  1.00  1.00 | .941  .993  1.00
         20 | .984  1.00  1.00 | .987  1.00  1.00 | .850  .922  .934

Ternary variables
Cardinality |  100   200   300 |  100   200   300 |  100   200   300
          5 | 1.00  1.00  1.00 | 1.00  1.00  1.00 | 1.00  1.00  1.00
         10 | 1.00  1.00  1.00 | .949  .949  1.00 | .987  .934  .962
         15 | .971  .983  1.00 | .800  .841  1.00 | .541  .605  .765
         20 |   —     —     —  |   —     —     —  |   —     —     —

Table 1: F1 score for the score-based algorithm. Columns are grouped into three network density settings; within each group, the columns correspond to $h = 100, 200, 300$ trajectories. Rows give the number of nodes (cardinality). Dashes mark configurations the algorithm failed to complete.
Binary variables
Network density
100 200 300 100 200 300 100 200 3005 .988 1.00 1.00 1.00 1.00 1.00 .992 1.0 1.0010 1.00 .988 1.00 .970 .970 .970 .966 .973 .96715 .980 .994 1.00 .949 .981 .993 .830 .903 .93320 .968 .988 .992 .935 .989 .980 .787 .871 .883
Cardinality
Ternary variables5 .972 .921 .909 .973 .973 .973 .966 .953 .97910 .938 .938 .950 .984 .992 .992 .981 .975 .97015 .967 .962 .967 .966 .984 .984 .820 .871 .88720 .944 .944 .939 .880 .904 .913 .583 .720 .761
Table 2: F -score for the CTPC χ algorithm. Cardinality
Binary variables
Network density
100 200 300 100 200 300 100 200 3005 .988 1.00 1.00 1.00 1.00 1.00 .992 1.0 1.0010 1.00 .988 1.00 .970 .970 .970 .966 .973 .96715 .980 .994 1.00 .949 .981 .993 .830 .903 .93320 .968 .988 .992 .935 .989 .980 .787 .871 .883
Cardinality
Ternary variables5 .667 .667 .667 .766 .771 .720 .871 .802 .78510 .617 .623 .650 .811 .775 .780 .890 .886 .85415 .762 .782 .775 .840 .863 .857 .775 .855 .87520 .644 .624 .602 .820 .852 .859 .602 .704 .757
Table 3: F -score for the PC KS CTBN algorithm. ONSTRAINT -B ASED L EARNING FOR C ONTINUOUS -T IME B AYESIAN N ETWORKS
5. Conclusions
In this paper we introduced the first constraint-based algorithm for structure learning in CTBNs, which we called CTPC, comprising both a suitable set of statistics for testing conditional independence and a heuristic algorithm based on PC.

Compared to the only score-based algorithm previously available in the literature (Nodelman et al., 2003), CTPC has better structural reconstruction accuracy when variables in the CTBN can assume more than two values. For binary variables, that score-based algorithm performs well, but its performance rapidly degrades when the number of states increases to three.

A major limitation of the proposed constraint-based algorithm is its computational cost, which becomes problematic in domains with more than 20 variables. Even so, CTPC scales better than the score-based algorithm from Nodelman et al. (2003), which exhausts the 24GiB of memory allocated for the experiment and fails to complete in the allotted time.

Further experiments are needed to elucidate the behaviour of CTPC when the number of states of the variables increases. It would also be important to validate the performance of CTPC on real-world data. Unfortunately, we are not aware of any suitable real-world data set where ground truth is available, and thus we were unable to pursue this line of investigation. Furthermore, we are planning additional numerical experiments to evaluate the impact of the type-I error threshold for the tests, to better understand how to calibrate constraint-based algorithms.
References
E. Acerbi, E. Viganò, M. Poidinger, A. Mortellaro, T. Zelante, and F. Stella. Continuous Time Bayesian Networks Identify Prdm1 as a Negative Regulator of TH17 Cell Differentiation in Humans. Scientific Reports, 6:23128, 2016.

D. Colombo and M. H. Maathuis. Order-Independent Constraint-Based Causal Structure Learning. Journal of Machine Learning Research, 15:3921–3962, 2014.

N. Friedman and D. Koller. Being Bayesian about Bayesian Network Structure: A Bayesian Approach to Structure Discovery in Bayesian Networks. Machine Learning, 50:95–125, 2000.

D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20(3):197–243, 1995.

D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

E. T. Lee and J. Wang. Statistical Methods for Survival Data Analysis, volume 476. John Wiley & Sons, 2003.

B. Mitchell. A Comparison of Chi-Square and Kolmogorov-Smirnov Tests. The Royal Geographical Society, 3(4):237–241, 1971.

U. Nodelman. Continuous Time Bayesian Networks. PhD thesis, Stanford University, 2007.

U. Nodelman and E. Horvitz. Continuous Time Bayesian Networks for Inferring Users' Presence and Activities with Extensions for Modeling and Evaluation. Technical Report MSR-TR-2003-97, Microsoft Research, 2003.

U. Nodelman, C. R. Shelton, and D. Koller. Continuous Time Bayesian Networks. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 378–387, 2002.

U. Nodelman, C. R. Shelton, and D. Koller. Learning Continuous Time Bayesian Networks. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, pages 451–458, 2003.

J. Pearl and T. Verma. A Theory of Inferred Causation. In Proceedings of the Second International Conference on Principles of Knowledge Representation and Reasoning, pages 441–452. Morgan Kaufmann Publishers Inc., 1991.

M. Scutari, C. E. Graafland, and J. M. Gutiérrez. Who Learns Better Bayesian Network Structures: Accuracy and Speed of Structure Learning Algorithms. International Journal of Approximate Reasoning, 115:235–253, 2019.

C. R. Shelton, Y. Fan, W. Lam, J. Lee, and J. Xu. Continuous Time Bayesian Network Reasoning and Learning Engine. Journal of Machine Learning Research, 11:1137–1140, 2010.

L. Sturlaugson and J. W. Sheppard. Inference Complexity in Continuous Time Bayesian Networks. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI 2014), pages 772–779, 2014.

J. Xu and C. R. Shelton. Continuous Time Bayesian Networks for Host Level Network Intrusion Detection. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2008), pages 613–627, 2008.