Reconstruction of Causal Networks by Set Covering
Nick Fyson, Tijl De Bie, and Nello Cristianini

Intelligent Systems Laboratory, Bristol University, Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, UK
Bristol Centre for Complexity Sciences, Bristol University, Queen's Building, University Walk, Bristol, BS8 1TR, UK
http://patterns.enm.bris.ac.uk
Abstract.
We present a method for the reconstruction of networks, based on the order of nodes visited by a stochastic branching process. Our algorithm reconstructs a network of minimal size that ensures consistency with the data. Crucially, we show that global consistency with the data can be achieved through purely local considerations, inferring the neighbourhood of each node in turn. The optimisation problem solved for each individual node can be reduced to a Set Covering Problem, which is known to be NP-hard but can be approximated well in practice. We then extend our approach to account for noisy data, based on the Minimum Description Length principle. We demonstrate our algorithms on synthetic data, generated by an SIR-like epidemiological model.
Key words: network reconstruction, set covering, temporal data mining
There has been increasing interest over recent years in the problem of reconstructing complex networks from the streams of dynamic data they produce. Such problems can be found in a highly diverse range of fields, whether determining Gene Regulatory Networks (GRNs) from expression measurements [9], or the connectivity of neuronal systems from spike train data [1]. While data in the field of GRNs is generally continuous in nature, spike train data is inherently discrete. Other fields include epidemiology, chemical engineering and manufacturing [10], but all share the similar challenge of extracting the causal structure of a complex dynamical system from streams of temporal data.

We here address the challenge of reconstructing networks from data corresponding to stochastic branching processes, occurring on directed networks and where a discrete 'infection' is propagated from node to node. The clearest analogy lies in the field of epidemiology, where instances of infection begin at particular nodes, before propagating stochastically along edges until the infection dies out. Another source of such data could be blogs, where the initial report of a story is made on a particular site, before being picked up by other blogs and 'cascading' through the blogosphere [4]. Analysis of such data could permit
the reconstruction of a network of readership. Most generally, we could consider data corresponding to 'memes', fundamental units of cultural information which propagate through all systems of communication, notably the news media system.

The main contributions of this paper are:

1. A novel approach to the reconstruction of networks from data corresponding to stochastic branching processes. We clearly define the form of the data and the optimisation problem, before reducing it to the well-known Set Covering problem.

2. A modification extending our approach for use on noisy data. We use the concept of Minimum Description Length (MDL) to define a criterion for halting greedy set covering, allowing us to reconstruct networks from data containing lost entries.

The paper is organised as follows. In Sec. 2 we fully define the nature of the data to be used, the problem we address, and the optimisation problem to be tackled. Section 3 presents a theoretical analysis of our basic algorithm, before Sec. 4 introduces an extension to address noisy data, based on the concept of Minimum Description Length (MDL). Section 5 presents an empirical analysis, before we outline our conclusions in Sec. 6.
A directed network G is defined by a set of nodes V and a set of oriented edges E ⊆ V × V between these nodes, and we denote it as G = (V, E). In this paper we consider two networks over the same set of nodes. G_T = (V, E_T) is the true underlying network, while the reconstructed network we infer from data is denoted by G_R = (V, E_R). We assume a dynamic branching process occurs on the network G_T, in which the transfer of 'markers' occurs. Markers originate at a particular node in the network, and then propagate stochastically from node to adjacent node, 'traversing' along only those edges that exist in the set E_T.

With analogy to terminology in the field of epidemiology, we refer to the process of a node becoming a carrier of a marker as infection. Similarly, all nodes that have undergone infection at any point in the past and remain in a state where they are infectious are known as infected. Finally, any node that underwent infection at any point from a particular marker is referred to as a carrier.

Each marker that is propagated through the network generates a 'marker trace', M^i. The set of all marker traces is denoted by M = {M^i}, and throughout the paper we use superscripts to index between markers. The marker trace is represented by an ordered set of the nodes that carried that marker, in the order in which they became infected. We will use subscripts to refer to individual nodes in a marker trace. We formally define the notion of a marker trace as follows.

Definition 1 (Marker Trace, M^i). A Marker Trace M^i is an ordered set of n^i distinct nodes w^i_j ∈ V, and we denote it as: M^i = (w^i_1, w^i_2, ..., w^i_{n^i})

Each marker trace defines a total order over the reporting nodes, and we use the notation v_i <_{M^i} v_j to state that the node v_i appears before node v_j in the marker trace M^i. For clarity in future definitions we also formally define a path from one node to another within a network.
Definition 2 (Path in a network G = (V, E)). A sequence U = u_1, ..., u_k of nodes u_i ∈ V is a path in G = (V, E) if ∀ 1 ≤ i < k, (u_i, u_{i+1}) ∈ E

Problem 1 (Informal Description).
Given a set M of Marker Traces, construct a network G_R approximating the true network G_T that generated M.

Intuitively, it makes sense to choose G_R such that it is capable of generating M itself as well. Given our assumptions on the mechanism of data generation, this requires that for each marker a path exists from the originator to all other carrier nodes, passing only through nodes that have been previously infected. We will refer to this as 'global consistency' and formalise the intuitive notion as follows.

Definition 3 (Globally Consistent, GC). G_R is GC with M^i ⇐⇒ ∀ w^i_j with j > 1, ∃ a path w^i_1, ..., w^i_j in G_R

Besides being intuitively satisfying, in Sec. 3 we will prove that ensuring global consistency also ensures that a large number of edges from G_T is guaranteed to be reconstructed in G_R. Trivially, it is clear that a completely connected network is consistent with all possible data, and hence we aim to reconstruct a consistent set E_R of minimal size.

Combining the above allows us to formalise our goal in terms of an optimisation problem.

Problem 2 (Formulation in terms of Global Consistency).

argmin_{E_R} |E_R|   subject to   ∀ M^i ∈ M: G_R = (V, E_R) is GC with M^i
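Global consistency as defined here can be verified directly by a path search restricted to previously infected nodes. The sketch below is illustrative only; the function and variable names are ours, not the paper's.

```python
from collections import deque

def is_globally_consistent(edges, trace):
    """For every node in the trace after the originator, check that a path
    exists from the originator, passing only through nodes infected earlier."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    for j in range(1, len(trace)):
        allowed = set(trace[:j])          # nodes infected before trace[j]
        target = trace[j]
        seen, queue = {trace[0]}, deque([trace[0]])
        found = False
        while queue:
            u = queue.popleft()
            if target in adj.get(u, ()):  # direct edge to the target suffices
                found = True
                break
            for v in adj.get(u, ()):
                if v in allowed and v not in seen:
                    seen.add(v)
                    queue.append(v)
        if not found:
            return False
    return True
```

Note that this check performs one breadth-first search per report, which is exactly the path-based cost that the local-consistency reformulation below avoids.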
For a reconstruction to make intuitive sense we require global consistency between network and data, but this involves consideration of paths and is impractical. Below, we demonstrate the equivalence of global consistency with 'local consistency', an alternative that allows us to consider the immediate neighbourhood of each node in turn.

Local consistency requires that for each node reporting a particular marker, the node must have at least one incoming edge from a node that has reported the marker at an earlier time. This concept is formalised as follows.
Definition 4 (Locally Consistent, LC). G_R is LC with M^i ⇐⇒ ∀ w^i_j with j > 1, ∃ w^i_k with k < j : (w^i_k, w^i_j) ∈ E_R

Theorem 1 (LC ⇐⇒ GC).
Demonstrating local consistency between G_R and M^i is necessary and sufficient to ensure global consistency.

Proof. We define an approach to constructing a network that ensures every node has an incoming edge from a node that reported at an earlier time (local consistency), and demonstrate that this necessarily ensures that a path exists from the originator to every other node (global consistency).

For the case k = 1, we have only the originator node, hence trivially there is a path from the originator to all other nodes. For the case k = 2, we add a node with an incoming edge from the only other node. Again trivially, there is a path from the originator to every other node. For the case k = n + 1 we take the network for k = n, and add a node with an incoming edge from one of the existing nodes. If there is a path from the originator to all nodes in the k = n network, there will be a path from the originator to the new node in the case k = n + 1. Hence if the claim is true for k = n then it is also true for k = n + 1. Therefore, by induction, LC ⇐⇒ GC. ⊓⊔

This allows us to formulate an alternative but equivalent optimisation problem, using the concept of local consistency.
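Checking local consistency requires only each node's incoming edges, with no path search at all. A minimal sketch (names are ours):

```python
def is_locally_consistent(edges, trace):
    """Every node after the originator must have an incoming edge
    from some node that appears earlier in the same trace."""
    edge_set = set(edges)
    for j in range(1, len(trace)):
        if not any((trace[k], trace[j]) in edge_set for k in range(j)):
            return False
    return True
```

By Theorem 1, this cheap check returns the same verdict as the path-based global-consistency check.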
Problem 3 (Formulation in terms of Local Consistency).

argmin_{E_R} |E_R|   subject to   ∀ M^i ∈ M: G_R = (V, E_R) is LC with M^i

Crucially, to establish local consistency, one need only consider the immediate neighbourhood of each node in turn. Hence we can break this optimisation problem into N subproblems, where N is the total number of nodes in the network. In each of these subproblems, we establish the minimal set of incoming edges required to explain all the markers reported by the particular node. From now on, unless otherwise specified, we describe approaches as applied to discovering the parents of a particular node, which would then be applied to each node in turn.

Using the concept of local consistency we are able to treat the reconstruction on a node-by-node basis, and we denote the node under consideration as v. As specified by local consistency, in considering the incoming edges for a particular node we must include at least one edge from a node that has reported each marker at an earlier time. Each edge therefore 'explains' the presence of a subset of the reported markers, and if the set of all incoming edges together explains all the reported markers, we ensure local consistency. This problem of 'explaining' marker reports may be neatly expressed as a Set Covering Problem.

Before showing how it relates to our reconstruction problem, we formally state the Set Covering optimisation problem: given a universe A and a family B of subsets of A, the task is to find the smallest subfamily C ⊆ B such that ∪C = A. This subfamily C is then the 'minimal cover' of A. Given this formal framework, we now define how these sets relate to our reconstruction problem.

Definition 5 (Universe, A_v).
The universe set of all elements is defined as the set of all markers that have been reported by the node v: A_v = {i : v ∈ M^i}

The node v can have an incoming edge from any other node, and hence the space of potential incoming edges is F_v = (V \ {v}) × {v}. As stated above, each potential incoming edge will 'explain' a subset of the markers reported by v, and therefore every edge f_vj ∈ F_v corresponds to one element B_vj in the family of subsets B_v.

Definition 6 (Family of subsets, B_v = {B_vj}). Each subset B_vj is defined by a potential incoming edge (v_j, v) = f_vj ∈ F_v, where i is in B_vj if and only if v_j appears earlier than v in the marker trace M^i: B_vj = {i : v_j <_{M^i} v}

The set covering problem then requires us to find a subfamily C_v ⊆ B_v such that ∪C_v = A_v, and this subfamily C_v directly corresponds to a set of incoming edges for the node v.

Definition 7 (Reconstructed Incoming Edges, E_R^v). The set of reconstructed edges, E_R^v, consists of the set of all elements in F_v that correspond to elements of C_v: E_R^v = {f_vj ∈ F_v : B_vj ∈ C_v}

This then allows us to make a final definition of our optimisation problem, this time in terms of the Set Covering Problem. The following problem is defined for each node v ∈ V.
Problem 4 (Formulation in terms of Set Covering).

argmin_{E_R^v} |E_R^v|   subject to   A_v = ∪C_v, where C_v = {B_vj : f_vj ∈ E_R^v} and E_R^v ⊆ F_v

Finally, repeating this optimisation for all nodes in the network, we get E_R = ∪_v E_R^v, allowing us to reconstruct the entire network through only local considerations.

The Set Covering Problem is known to be NP-hard, but in practice is easy to approximate well using a greedy approach (see Sec. 3). The greedy algorithm is well documented for set covering [2], but below we briefly outline the approach. We wish to cover the set A by selecting from the family of subsets B. We first select the subset B_j ∈ B that covers the greatest number of elements in A, i.e. such as to maximise |B_j|. The corresponding edge f_j is then added to the set of reconstructed edges E_R^v. A subset of A has now been covered, and hence these elements are removed both from A and from all subsets in the family B. This process is repeated until A = ∅.

We have formalised our problem using the intuitive notion of global consistency of a network with a set of Marker Traces. Here, we will show that this strategy ensures that the reconstructed network G_R is close to the true network G_T in a well-defined sense. In particular, we will consider the number of edges in E_R also in E_T, referred to as the True Positives (TP), as well as the number of edges in E_R not in E_T, referred to as the False Positives (FP). The number of true and false positives gives an indication both of how well the approach finds edges that really exist, and how likely it is to incorrectly identify edges as being part of the network.

The number of true positives found by the reconstruction approach is simply the number of edges found in both the true and reconstructed networks, given by

TP = |E_R ∩ E_T|    (1)

The nature of the set covering algorithm allows us to set a lower limit on TP, given a particular M.
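The per-node greedy covering described above can be sketched as follows. This is a minimal illustrative sketch (names are ours); we exclude traces in which v is the originator, since the first report of a marker needs no explaining edge.

```python
def reconstruct_parents(v, traces):
    """Greedy set cover for one node v: choose incoming edges that
    together explain every trace in which v appears as a non-originator."""
    # Universe A_v: indices of traces reporting v (excluding v as originator).
    universe = {i for i, t in enumerate(traces) if v in t and t[0] != v}
    # Candidate subsets B_vj: for each potential parent u, the traces in
    # which u reports before v.
    subsets = {}
    for i in universe:
        t = traces[i]
        for u in t[:t.index(v)]:
            subsets.setdefault(u, set()).add(i)
    parents = set()
    uncovered = set(universe)
    while uncovered:
        # Pick the parent explaining the most still-uncovered traces.
        u = max(subsets, key=lambda p: len(subsets[p] & uncovered))
        parents.add(u)
        uncovered -= subsets[u]
    return parents

def reconstruct_network(nodes, traces):
    """E_R as the union of the per-node greedy covers."""
    return {(u, v) for v in nodes for u in reconstruct_parents(v, traces)}
```

Each iteration of the loop covers at least one new trace, so the loop terminates after at most |A_v| selections.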
In achieving a complete coverage we can be certain of including all those edges that were traversed first in the propagation of any marker. In other words, all pairs of nodes that appear first and second in any marker trace are guaranteed to represent a true edge, and will also inevitably be included in the covering. There is only a single subset covering this particular marker report, and hence it must be included in the final reconstruction. We therefore only need count the number of such edges to determine the least number of true edges we will identify, and thus a lower bound TP⁻ on TP.

In assessing performance it is useful also to define the True Positive Rate (TPR), which is the fraction of true positives successfully recovered, given by

TPR = |E_R ∩ E_T| / |E_T|    (2)

Trivially, if |E_T| is known, we can use TP⁻ to directly obtain a lower bound TPR⁻ on TPR.

While we may correctly include all genuine edges, it is also important to successfully exclude all false edges from our reconstruction. This is quantified by the number of False Positives (FP):

FP = |E_R \ E_T|    (3)

We denote the highest possible number of false positives as FP⁺, and in order to specify this bound we need to make the following definitions. The set E_R is obtained from a greedy approximation to set covering, and therefore is not guaranteed to be optimal. We denote the optimal covering as E*_R, which will always be equal or smaller in cardinality than E_R. We also know that the true set of edges will always provide a valid covering, and hence provides an upper bound on the size of the optimal covering, giving |E*_R| ≤ |E_T|. Finally, the heuristic ratio is defined as the upper bound on the size of the obtained set relative to the size of the optimal set, H ≥ |E_R| / |E*_R|.

We can now specify the upper bound on false positives as follows:

FP = |E_R| − TP    (4)
FP ≤ |E_R| − TP⁻    (5)
   ≤ H · |E*_R| − TP⁻    (6)
   ≤ H · |E_T| − TP⁻    (7)
∴ FP⁺ = H · |E_T| − TP⁻    (8)

The greedy approximation to set covering is known to be as good as any polynomial-time approximation, and the literature gives us two useful values for H. The first bound is related to the maximum size of the subsets from which we construct the covering, max_{B_j ∈ B} |B_j|, and provides a limit on the quality of the covering related to the logarithm of the size of this set [2]. This first bound H_1 is given by

H_1 = 1 + ln( max_{B_j ∈ B} |B_j| )    (9)

The second bound approaches the problem from an alternative perspective, considering the maximum number of covering subsets of which any element is a member, m [7]. In other words, for each element of the ground set we need to cover, how many subsets can be selected from in order to cover the element in question. In the case of our algorithm this is related to the length of marker traces. The elements of the ground set we need to cover are reports of a marker at a node, and the number of ways of explaining the presence of this marker is equal to the number of nodes that have reported at an earlier time. The maximum membership across all elements in the ground set, m, is therefore related to the maximum length of marker traces. This second bound is then

H_2 = m    (10)
    = ( max_{M^i ∈ M} |M^i| ) − 1    (11)

H_2 appears to provide a less useful bound, since it is linear as opposed to logarithmic, but the behaviour of the two bounds as the number of markers increases is markedly different. While the maximum size of covering set continues to increase with the number of markers, the maximum length of marker trace rapidly tends to a fixed value. This limit is a property of the network and marker propagation, but at most is limited by the size of the network, not the dataset. Therefore, as the amount of data used in the reconstruction increases, the tighter bound switches from H_1 to H_2.
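Both heuristic-ratio bounds, together with TP⁻, can be computed directly from the marker traces. The sketch below uses our own function names; evaluating FP⁺ itself would additionally require |E_T|.

```python
from math import log
from collections import Counter

def tp_lower_bound(traces):
    """TP-: count distinct (first, second) pairs across all traces;
    each is a guaranteed true edge that must appear in the covering."""
    return len({(t[0], t[1]) for t in traces if len(t) >= 2})

def heuristic_ratio_bounds(traces):
    """Return (H1, H2). H1 = 1 + ln(max subset size), where the largest
    subset corresponds to the ordered pair (u, v) with u reporting before
    v in the most traces; H2 = (max trace length) - 1."""
    pair_counts = Counter()
    for t in traces:
        for j, v in enumerate(t):
            for u in t[:j]:
                pair_counts[(u, v)] += 1
    h1 = 1 + log(max(pair_counts.values()))
    h2 = max(len(t) for t in traces) - 1
    return h1, h2
```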
Hence, we can define the heuristic ratio as the minimum of these two alternatives, and therefore bound the false positives as shown in equation (12). Again, determining this bound requires knowing the set of markers and |E_T|:

FP⁺ = |E_T| · min(H_1, H_2) − TP⁻    (12)

The total number of possible false positives is given by (|V|² − |V|) − |E_T|, and hence we can also define an upper bound on the False Positive Rate (FPR⁺):

FPR⁺ = ( |E_T| · min(H_1, H_2) − TP⁻ ) / ( (|V|² − |V|) − |E_T| )    (13)

To assess the overall quality of our reconstructions, we require a measure of how well the reconstructed set of edges matches the true set. In comparing two sets over the same elements, it is appropriate to use the Jaccard Distance (JD). For identical sets this has a value of zero, and a value of one if the two sets have no elements in common at all. The JD is given by

JD = ( |E_T ∪ E_R| − |E_T ∩ E_R| ) / |E_T ∪ E_R|    (14)

A lower value of Jaccard Distance indicates a closer match between true and reconstructed networks, and hence a bound on worst-case performance is an upper limit on JD. This can be calculated from the bounds on the number of true and false positives as follows:

JD ≤ 1 − TP⁻ / ( |E_T| + FP⁺ )    (15)

This upper limit on JD is determined given a particular set of marker traces, and constitutes a worst-case scenario for our success in reconstructing the true underlying network.

Our approach and analysis has thus far assumed perfect and noise-free data from which to reconstruct networks. In reality this is an unrealistic assumption, and hence we define an adaptation of our approach to accommodate noisy data. Our basic Set Covering approach assumes that a minimal network consistent with the data will result in a perfect reconstruction (given infinite data and a perfect minimal covering set).
Every report of a marker is assumed to be due to direct infection from an earlier infected node, and hence we require that the presence of every marker at every node be explained. When noise is present these assumptions do not hold, and missing marker reports may incorrectly suggest the presence of edges that are not really present. This will lead to a large number of false positives, increasing with the quantity of data used in reconstruction.

In executing the greedy approximation to Set Covering, we first select those subsets that cover the greatest number of remaining elements, which in our case corresponds to choosing edges that explain the greatest number of marker reports. While the noise level remains low, therefore, we will first select true edges, since the incorrect edges suggested by the noise will tend to be relatively low in frequency. We can therefore expect that, in general, the noise-induced false positives will be added toward the end of the set covering process. This is demonstrated empirically in Fig. 2a, Sec. 5.4, and motivates the definition of a criterion to halt the covering early.
In selecting the optimal point to halt the set covering when reconstructing from noisy data, we appeal to the Minimum Description Length (MDL) principle [11].
This states that in model selection one should prefer models that are able to communicate the data in the lowest number of bits. This is in principle equivalent to considering Maximum Likelihood Estimation [5], but our case lends itself particularly well to the use of MDL.
Marker Trace Coding Scheme
We choose to describe the network in the most simple way, in which all edges are explicitly assigned 0 or 1, and hence the network description is of fixed length. As such, our coding scheme contains no inherent preference for sparsity, and the Description Length (DL) is entirely dependent on how efficiently the set of all markers can be expressed.

In order to describe a marker trace we need to specify in order all those nodes that are members of the set M^i. A simple ordered list requires ln N bits of information per node, where N is the number of nodes in the network. This is straightforward, but using the framework of the underlying network may allow us to describe this same information in a compressed form. Instead of simply listing the reporting nodes, we describe the progression of the marker through the network.

When the network is consistent with the data, we are able to describe all markers exactly with the following approach. We first identify the originator node, at a cost of ln N. We then describe each node of the marker by first identifying its parent (from the set of those that have already reported), and then specifying the particular child of this node. The cost of identifying the 2nd report is then (ln 1 + ln d_p), where d_p is the out-degree of the parent. The 3rd report then requires ln 2 bits to specify the parent, since there are two possibilities, and also ln d_p to specify which child. This progresses similarly for all subsequent reports in the trace. By then summing the description lengths of all marker traces we get the cost of describing the set of data completely.

To render this coding scheme useful in practice we need to be able to describe markers that are not consistent with the network, for which we need only make a simple extension, allowing for the coding of 'exceptions'. We do this by defining a 'supernode' in addition to the standard network, which is the originator of all markers and by definition a parent of every other node.
The description of the first report then becomes (ln 1 + ln d_p) = ln N, where the cost of specifying the parent is ln 1 = 0 (since all markers originate at the supernode) and the cost of specifying the child is ln N. For the second report there are now two potential parents, and hence to specify the second reporting node we require (ln 2 + ln d_p) bits. If the first reporter is a parent of the second, d_p will be equal to the out-degree of the first reporter, and otherwise d_p = N, the out-degree of the supernode. Similarly, the cost for the third report is (ln 3 + ln d_p) bits, the fourth (ln 4 + ln d_p), and so on.

A crucial characteristic of this coding scheme is that, while there is no explicit cost to defining edges, nodes of higher degree are more expensive to use as the parent of a report. Therefore, while it is expensive to describe a report as an exception, there is a trade-off between creating a network that does not require any exceptions and the increased cost of describing all of the marker reports. In general, therefore, the network that allows the shortest description of all marker traces will lie at some point between completely disconnected and completely connected.

To use MDL as a stopping criterion requires a minor change to the set covering reconstruction algorithm, in which the addition of edges is considered globally, rather than simply on a node-by-node basis. We still perform greedy set covering for each node in turn, but instead of placing selected edges directly into the reconstructed network, we make a note of each edge and the number of additional elements covered when it is selected. After doing this for all nodes we have a list of edges across the whole network, along with their explanatory power within the greedy set covering framework.
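The description length of a single trace under the supernode scheme can be sketched as follows. This is an illustrative sketch with our own names; we additionally assume the encoder picks the cheapest admissible parent (lowest out-degree among earlier reporters with an edge to the node, else the supernode), which the paper does not specify.

```python
from math import log

def trace_dl(trace, edges, out_degree, n_nodes):
    """Description length (in nats) of one trace under the supernode
    scheme: the j-th report (1-indexed) costs ln(j) to identify the parent
    among the j-1 earlier reporters plus the supernode, and ln(d_p) to
    identify the child, where the supernode has out-degree N."""
    total = 0.0
    for j, node in enumerate(trace, start=1):
        parent_cost = log(j)  # j candidate parents: earlier reporters + supernode
        # Cheapest admissible parent: an earlier reporter with an edge to
        # this node, of minimal out-degree; otherwise the supernode
        # ('exception'), whose out-degree is N.
        candidates = [out_degree[u] for u in trace[:j - 1] if (u, node) in edges]
        child_cost = log(min(candidates)) if candidates else log(n_nodes)
        total += parent_cost + child_cost
    return total
```

Summing `trace_dl` over all traces gives the data term of the description length; since the network term is fixed-length, this sum is all that varies as edges are added.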
We then rank them all by the number of elements covered, and follow this order in adding edges to the network. While the Jaccard Distance requires knowledge of the true network to calculate, we can calculate the description length using only the data and the current reconstructed network. We can therefore calculate the new DL after each edge is added, and subsequently select the network E_R that gave the lowest total description length.

The model for generation of our generalised 'markers' is based on an SIR epidemiological model. We simulate each marker separately, dropped at random into the network and subsequently propagated between outlets in a stochastic fashion. The definition of this model then falls into three sections: the network itself, the generation of markers, and the model used for noise in the data.
Network Model
The definition of the network consists of a non-symmetric binary adjacency matrix, (i, j) = 1 indicating an edge connecting from node i to node j. We use a directed Erdős–Rényi model, in which each edge exists with probability p = 2/N, where N is the total number of nodes. This results in an average of 2 outgoing and 2 incoming edges per node, resulting in a relatively sparse network that is likely to be a single weakly-connected component.

Marker Generation
Throughout the simulation, each node can be in one of three states: Susceptible (S), Infected (I) or Recovered (R). All nodes begin in state S, before the marker is initially 'seeded' at a randomly selected node, which is set to I. The state of each node in the next time step is determined stochastically from its current state and that of all its parents. The potential transitions of a node and their associated probabilities are shown in Table 1. We select the parameters p_I and p_R to generate marker paths of reasonable length and frequency.

Generation of noisy data
We consider the most basic model of noise for the type of data we are looking at, in which each marker report has a certain probability of being 'lost'. The single parameter is p_loss, giving the likelihood that a carrier node is omitted from the marker trace.

Table 1: The transition probabilities for nodes, dependent on current state. n_I is the number of incoming edges from infected nodes, p_I is the probability that infection will pass along an edge in a time step, and p_R is the probability that a node will recover from infection in a time step.

  Susceptible:  P(S) = (1 − p_I)^{n_I}    P(I) = 1 − (1 − p_I)^{n_I}
  Infected:     P(I) = 1 − p_R            P(R) = p_R
  Recovered:    P(R) = 1

We define two naive algorithms for network reconstruction, to which it will be instructive to compare our Set Covering approach.
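The generative model (directed Erdős–Rényi network, SIR-style propagation following Table 1, and report loss) can be sketched as follows. The structure follows the paper; all names are ours, and for simplicity we assume the originating report is never lost, which the paper does not specify.

```python
import random

def make_network(n, p=None):
    """Directed Erdos-Renyi network; default edge probability 2/N."""
    p = p if p is not None else 2.0 / n
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and random.random() < p}

def simulate_marker(edges, n, p_infect, p_recover):
    """Propagate one marker per Table 1; return its trace (infection order)."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    state = {v: "S" for v in range(n)}
    seed = random.randrange(n)
    state[seed] = "I"
    trace = [seed]
    while any(s == "I" for s in state.values()):
        nxt = dict(state)
        for v in range(n):
            if state[v] == "S":
                n_i = sum(state[u] == "I" for u in parents.get(v, []))
                if random.random() < 1 - (1 - p_infect) ** n_i:
                    nxt[v] = "I"
                    trace.append(v)
            elif state[v] == "I" and random.random() < p_recover:
                nxt[v] = "R"
        state = nxt
    return trace

def add_loss(trace, p_loss):
    """Noise model: each report lost independently with probability p_loss
    (the originator is kept here, an assumption on our part)."""
    return [trace[0]] + [v for v in trace[1:] if random.random() >= p_loss]
```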
Naive 1
The most immediately obvious explanation for the creation of a marker trace is that each node became infected by the node immediately preceding it in time. Indeed, assuming all network structures are equally likely, and considering a trace in isolation, this would be our best guess. We therefore simply take the union of all edges implied by a literal interpretation of each marker trace. The resultant network is capable of producing the observed data, and hence is consistent. This set of edges is given by

E_N1 = ∪_{M^i ∈ M} { (w^i_n, w^i_{n+1}) : 1 ≤ n < n^i }    (16)

Naive 2
In the second naive approach to reconstruction, we consider only those marker reports for which only one edge can provide the explanation. In other words, we take only those edges that are guaranteed true positives. This does not make full use of the available information, since it effectively throws away all reports of a marker beyond the second, but ensures no false positives are included. The set of edges given by the second naive method is given by

E_N2 = ∪_{M^i ∈ M} { (w^i_1, w^i_2) }    (17)

Figure 1 shows the results of network reconstruction using our Set Covering algorithm, along with baseline results and the worst-case bound. Probably the clearest result is that the bound on performance holds for both true and false positives, but more important is the comparison of our algorithm with the naive baseline approaches.

Figures 1b and 1c clearly show that false positives are the cause of the poor performance of the first naive approach. This is entirely expected, but illustrates that it is not sufficient to simply find any network that is consistent with the data. Both the first naive approach and our Set Covering algorithm return a network that is consistent with the data, but the results clearly show that searching for one that is maximally sparse leads to a reconstruction closer to the true network.

We also see in Fig. 1 that the Set Covering algorithm exceeds the performance of the second naive approach. The second naive method never returns any false positives but throws away everything except the first two reports of every marker trace. This loses valuable information, and hence does not perform as well as the Set Covering algorithm.
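For reference, the two baselines of equations (16) and (17) can be written in a few lines (a minimal sketch; names are ours):

```python
def naive_1(traces):
    """Eq. (16): take every consecutive pair in every trace as an edge."""
    return {(t[n], t[n + 1]) for t in traces for n in range(len(t) - 1)}

def naive_2(traces):
    """Eq. (17): take only the first pair of each trace, i.e. only the
    guaranteed true edges."""
    return {(t[0], t[1]) for t in traces if len(t) >= 2}
```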
In Sec. 4 we introduced a criterion for early stopping, arguing that this would result in improved performance on noisy data. Figure 2a gives the empirical verification for this, showing that for noisy data the closest match to the true network is obtained before the Set Covering is complete.

The circles plotted in Fig. 2 indicate the point at which the minimum description length was obtained, and hence the point at which the set covering would be halted. Figures 2c and 2d demonstrate that halting using MDL includes the majority of true positives, but limits the inclusion of false edges.
Finally, in Fig. 3 we show results of network reconstruction for various noise levels, with and without the use of MDL stopping. Figure 3c clearly shows that when we use MDL stopping the rate of false positives remains bounded as the amount of data increases, in stark contrast to results for the basic algorithm. The use of MDL does not completely compensate for the presence of noise, however, as evidenced by the lower performance shown in Fig. 3b. The order in which edges are added to the set E_R determines the proportion of true positives added before halting, and the results show that for higher noise conditions, more false edges will end up included in the final network. In the limit of large amounts of data, both TPR and FPR tend to a fixed level, determined by the exact nature of the network and markers, as well as the level of noise.

[Figure 1: panels (a) Jaccard Distance, (b) True positive rate, (c) False positive rate; curves for naive 1, naive 2, set covering, and the set covering bound.]

Fig. 1:
Performance of Set Covering Reconstruction, relative to naive approaches and theoretical bounds.
For TPR, the data for naive 2 and the set covering bound coincide. FPR for naive 2 is always zero, and hence not shown. Results are shown for networks of 100 nodes.

[Figure 2: panels (a) Jaccard Distance, (b) Description Length, (c) True positive rate, (d) False positive rate; curves for p_loss = 0.00, 0.05, 0.10.]

Fig. 2:
Plots showing variation of JD, DL, TPR and FPR with the progress of set covering.
Circles indicate the point at which the MDL criterion would have halted the covering. Each line shows reconstruction of a network of 100 nodes, using 1000 markers.

Fig. 3 [panels: (a) Jaccard Distance, (b) True positive rate, (c) False positive rate; series: p_loss = 0.00, 0.05, 0.10, each with and without MDL stopping]:
Performance of reconstruction for various levels of noise, with and without MDL stopping.
Results are shown for a network of 100 nodes.
Our work demonstrates a novel approach to the reconstruction of causal networks underlying stochastic branching processes, such as from data representing information flow or the spread of an epidemic on a network. Using the intuitive notion of consistency between a network and such data, we demonstrated that the entire network can be reconstructed node by node, using only local considerations. In this way, we were able to reformulate the problem in terms of the Set Covering Problem, which is NP-hard but can be approximated well using an efficient greedy algorithm.

We developed two versions of the algorithm for different settings. The first version attempts to achieve perfect consistency with the data, and is therefore restricted to noise-free and fully-observed settings. This version is likely to be useful in controlled settings, such as in fault propagation networks in large enterprises. The second version was designed for the more common noisy setting, e.g. where certain marker observations may not have been observed or detected. It is based on the empirical observation that reliable edges tend to be added first, such that early stopping combined with our first algorithm is sufficient to provide good results. As a stopping criterion, the MDL principle proved to be an excellent measure, as shown by our experiments.

In further work we plan to investigate direct optimisation of the MDL cost function, rather than using MDL only as a stopping criterion. Another avenue for extending the approach is the use of exact times, rather than our current approach considering only the order of reports. Finally, we intend to apply our methods to various real-life data sets, such as the propagation of memes on the media network [3,8], and fault propagation data [6].
References
1. E. Brown, R. Kass, and P. Mitra. Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuroscience, 2004.
2. V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.
3. J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-tracking and the dynamics of the news cycle. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 497–506, 2009.
4. J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large blog graphs: Patterns and a model. Society for Industrial and Applied Mathematics: Data Mining, 2007.
5. D. J. C. MacKay. Information Theory, Inference & Learning Algorithms. Cambridge University Press, 1st edition, June 2002.
6. S. Rao and N. Viswanadham. Fault diagnosis in dynamical systems: A graph theoretic approach. International Journal of Systems Science, 1987.
7. P. Slavík. Improved performance of the greedy algorithm for partial cover. Information Processing Letters, 1997.
8. T. Snowsill, F. Nicart, M. Stefani, T. De Bie, and N. Cristianini. Finding surprising patterns in textual data streams. In , Elba Island, Italy, 2010.
9. D. Sprinzak and M. Elowitz. Reconstruction of genetic circuits. Nature, 438(7067):443–448, 2005.
10. K. Unnikrishnan, N. Ramakrishnan, P. Sastry, and R. Uthurusamy. Network reconstruction from dynamic data. ACM SIGKDD Explorations Newsletter, 8(2):90–91, 2006.
11. C. Wallace and D. Boulton. An information measure for classification. Computer Journal, 11(2):185–194, 1968.